ESEN: Efficient GPU sharing of Ensemble Neural Networks

Neurocomputing (2024)

Abstract
Ensemble neural networks are widely applied in cloud-based inference services because of their strong performance, and the growing demand for low-latency services has led researchers to pay more attention to the execution efficiency of these models, especially device utilization. It is highly desirable to fully utilize GPUs by multiplexing different inference tasks on the same GPU with an advanced sharing technique such as Multi-Process Service (MPS). However, we find that MPS struggles when applied to ensemble neural networks, which consist of multiple related sub-models. The critical challenge is to allocate resources efficiently within an ensemble so as to minimize job completion time. To tackle this challenge, we first examine the interplay among the individual neural networks within an ensemble and outline a guideline for achieving the shortest job completion time. We then establish a mathematical model that formalizes the resource requirements of each neural network, and we introduce a search-based allocation algorithm designed to identify optimal solutions quickly. Finally, we introduce ESEN, which comprises the search-based resource allocation algorithm and efficient model execution mechanisms within PyTorch; the customized execution mechanisms are intended to make ESEN easy to use. Experimental results demonstrate that ESEN attains an efficiency improvement of up to 17.84% and a GPU utilization increase of 28.09% compared with the default strategy. By optimizing GPU resource allocation, ESEN significantly improves the efficiency of ensemble models and provides a low-latency, high-accuracy solution for online interactive services.
Key words
GPU sharing, Ensemble Neural Network, MPS, Inference services, Resource allocation
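
The abstract does not spell out the mathematical model or the search procedure, so the following is only a minimal sketch of what a search-based share allocation could look like, not ESEN's actual algorithm. It assumes a hypothetical latency model in which sub-model i, given a GPU share p in (0, 1], finishes in work_i / p + overhead_i, and it assumes the sub-models run concurrently so that the job completes when the slowest one finishes. The names `required_share` and `allocate` are illustrative. Under these assumptions, a bisection on the target completion time finds the smallest time for which the required shares still fit on one GPU.

```python
from typing import List, Tuple


def required_share(work: float, overhead: float, target: float) -> float:
    """Smallest GPU share that lets one sub-model finish within `target`.

    Assumed (hypothetical) latency model: latency(p) = work / p + overhead.
    """
    if target <= overhead:
        return float("inf")  # no share is large enough
    return work / (target - overhead)


def allocate(models: List[Tuple[float, float]],
             iters: int = 60) -> Tuple[float, List[float]]:
    """Bisect on the job completion time of an ensemble.

    `models` holds one (work, overhead) pair per sub-model.
    Returns the completion time found and the GPU share of each sub-model.
    """
    n = len(models)
    # Below the largest overhead no allocation can be feasible.
    lo = max(overhead for _, overhead in models)
    # An equal split (share 1/n each) is always feasible, so this bounds the answer.
    hi = max(work * n + overhead for work, overhead in models)

    for _ in range(iters):
        mid = (lo + hi) / 2.0
        total = sum(required_share(w, o, mid) for w, o in models)
        if total <= 1.0:
            hi = mid  # feasible: try a shorter completion time
        else:
            lo = mid  # infeasible: allow more time

    shares = [required_share(w, o, hi) for w, o in models]
    return hi, shares


if __name__ == "__main__":
    completion_time, shares = allocate([(2.0, 0.0), (1.0, 0.0), (1.0, 0.0)])
    print(completion_time, shares)
```

In a real MPS deployment, such fractional shares would have to be turned into per-process limits, for example through the `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable, and the latency model would need to be fitted from profiling rather than assumed.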