
Multi-Level Analysis of GPU Utilization in ML Training Workloads.

Design, Automation, and Test in Europe (2024)

Abstract
Training time has become a critical bottleneck due to the recent proliferation of large-parameter ML models. GPUs continue to be the prevailing architecture for training ML models. However, the complex execution flow of ML frameworks makes it difficult to understand GPU computing resource utilization. Our main goal is to provide a better understanding of how efficiently ML training workloads use the computing resources of modern GPUs. To this end, we first describe an ideal reference execution of a GPU-accelerated ML training loop and identify relevant metrics that can be measured using existing profiling tools. Second, we produce a coherent integration of the traces obtained from each profiling tool. Third, we leverage the metrics within our integrated trace to analyze the impact of different software optimizations (e.g., mixed-precision, various ML frameworks, and execution modes) on the throughput and the associated utilization at multiple levels of hardware abstraction (i.e., whole GPU, SM subpartitions, issue slots, and tensor cores). In our results on two modern GPUs, we present seven takeaways and show that although close to 100% utilization is generally achieved at the GPU level, average utilization of the issue slots and tensor cores always remains below 50% and 5.2%, respectively.
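To make the abstract's setup concrete, the following is a minimal sketch of a GPU-accelerated, mixed-precision training iteration wrapped in a profiler, of the kind whose utilization the paper analyzes. It uses PyTorch's torch.cuda.amp and torch.profiler; the paper's actual models, profiling stack, and metrics are not specified in this abstract, so the model, batch size, and iteration count below are illustrative assumptions only.

```python
# Illustrative sketch only (assumptions: PyTorch, torch.cuda.amp, torch.profiler;
# the paper's own profiling tools and workloads are not described in this abstract).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()          # loss scaling for mixed precision
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(256, 1024, device="cuda")     # hypothetical batch
y = torch.randint(0, 10, (256,), device="cuda")

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA]
) as prof:
    for _ in range(10):                        # a few training iterations
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():        # lower-precision compute where safe
            loss = loss_fn(model(x), y)
        scaler.scale(loss).backward()          # scaled backward pass
        scaler.step(optimizer)
        scaler.update()

# Per-kernel GPU time gives only a coarse, device-level view of utilization;
# finer levels (SM subpartitions, issue slots, tensor cores) require hardware counters.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

A framework-level trace like this corresponds to the coarsest abstraction level in the paper; the reported issue-slot and tensor-core figures come from lower-level hardware metrics that such a trace alone does not expose.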
Keywords
Train Machine Learning, Training Workload, GPU Utilization, Machine Learning Models, Training Time, Computational Resources, Training Iterations, Average Use, Machine Learning Framework, Train Machine Learning Models, Profiling Tool, Relevant Metrics, Optimal Impact, Execution Mode, Convolutional Neural Network, High Use, Batch Size, Performance Metrics, PyTorch, TensorFlow, Total Execution Time, Key Takeaway, GPU Memory, Increase In Throughput, Setup Time, Large Batch Size, Memory Capacity, Use Of Core