Want Predictable GPU Execution? Beware SMIs!

Rohan Wagle, Zelin Tong, Richard L. Sites, James H. Anderson

2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 2023

Abstract
It is common practice today to design complex safety-critical systems by repurposing hardware and software components originally designed for other contexts and using such components in a "black-box" fashion. However, if a black box’s inner workings are not fully understood, then this can be unsafe. This paper reports on an investigation pertaining to a black box that is important for autonomous systems, namely NVIDIA’s CUDA GPU framework. This investigation was motivated by certain timing glitches in CUDA kernels reported in the literature. After extensive tracing and testing efforts, the culprit causing these glitches was surprisingly found to be not CUDA-related at all, but rather delays due to system management interrupts (SMIs), a known source of timing unpredictability on x86 machines that is rarely if ever mentioned in work on real-time GPU usage. The effects of these SMIs are invisible to the operating system and can cause all cores on an x86 machine to become unavailable for over 20ms! This paper describes the methods used to uncover this timing-glitch source. It also discusses some lessons learned when trying to validate the timing behavior of black-box components.
Keywords
GPU, CUDA, SMI, real-time, safety-critical, autonomous-vehicles, Linux, NVIDIA