Soft-Error Characterization and Mitigation Strategies for Edge Tensor Processing Units in Space

IEEE Transactions on Aerospace and Electronic Systems(2024)

引用 0|浏览0
暂无评分
摘要
The Google Coral Edge Tensor Processing Unit (Edge TPU) offers low-power, high-performance capabilities ideal for enabling deep learning in space. However, as a commercial product, no reliability considerations are made in its design, and little is known about its radiation profile. As a device targeted by current and future space computing platforms, it is vital to mission success to understand the vulnerabilities and possible failure modes prior to flight. In this research, we evaluate the vulnerabilities of the Edge TPU and propose fault-mitigation techniques to improve device reliability. Several Edge TPUs were irradiated at the Los Alamos Neutron Science Center's (LANSCE) Weapons Neutron Research Facility using a wide-spectrum neutron beam. We evaluate the reliability of two machine-learning applications with common use cases within the space domain: image classification and semantic segmentation. Through experimentation, a vulnerability within the onboard memory is identified. Responsible for caching model parameters for increased performance, the onboard memory represents a critical device area. Any upsets within the cache risk compromising data integrity and model determinism. Across a variety of models tested under neutron radiation, fault accumulation and persistence are consistently observed resulting in the degradation of model accuracy and confidence. For image classification, class confidence scores experience frequent fluctuations between inferences leading to potential misclassifications. Semantic segmentation shows decreases in pixel accuracy and mean intersection-over-union (mIoU). To alleviate the impact of radiation, we propose two fault-mitigation techniques: Naive Refreshing (NR) and Golden Batch Refreshing (GBR). Naive Refreshing forces the Edge TPU to reload model parameters after every $r$ inferences. Unfortunately, reloading incurs an overhead that significantly reduces performance by increasing total inference time by up to $4.4\times$ . Finally, Golden Batch Refreshing is proposed as an alternative method to reload model parameters based on detected anomalies in the output of a pretested batch of model inputs. Periodically, the inputs from the golden batch are passed into the model and the host validates the output. If an anomaly is detected, the model parameters are reloaded prior to the next inference. GBR improves performance over NR by only incurring the overhead of the inference time to examine the model for errors rather than forcing expensive memory transactions. The golden inputs can also be interleaved between critical inferences for additional overhead reduction. By leveraging knowledge of the cache vulnerabilities and applying one or more mitigation strategies, Edge TPUs can be properly considered for integration into existing and future flight hardware.
更多
查看译文
关键词
Deep Learning,Fault-Tolerant Computing,Machine Learning,Onboard Processing,Space Computing,Spacecraft Autonomy,Tensor Processing Units
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要