Concurrent Classifier Error Detection (CCED) in Large Scale Machine Learning Systems

IEEE Transactions on Reliability (2024)

Abstract
The complexity of machine learning (ML) systems increases each year. As these systems become widely deployed, ensuring their reliable operation is becoming a design requirement. Traditional error detection mechanisms introduce circuit or time redundancy that significantly impacts system performance. An alternative is the use of concurrent error detection (CED) schemes that operate in parallel with the system and exploit its properties to detect errors. CED is attractive for large ML systems because it can potentially reduce the cost of error detection. In this article, we introduce concurrent classifier error detection (CCED), a scheme that implements CED in ML systems using a concurrent ML classifier to detect errors. CCED identifies a set of check signals in the main ML system and feeds them to the concurrent ML classifier, which is trained to detect errors. The proposed CCED scheme has been implemented and evaluated on two widely used large-scale ML models: contrastive language-image pretraining (CLIP), used for image classification, and bidirectional encoder representations from transformers (BERT), used for natural language applications. The results show that more than 95% of the errors are detected when using a simple Random Forest classifier that is orders of magnitude simpler than CLIP or BERT.
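The core idea can be illustrated with a minimal sketch: a lightweight Random Forest is trained on check signals extracted from the main model, labelled as error-free or error-injected. All feature definitions and data below are synthetic placeholders, not the check signals used in the paper.

```python
# Hypothetical sketch of the CCED idea (not the paper's implementation):
# a small Random Forest detects whether check signals from the main ML
# system look normal (label 0) or perturbed by an error (label 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic check signals, e.g., summary statistics of intermediate
# activations of the main model (placeholder values only).
n_samples, n_signals = 1000, 16
clean = rng.normal(0.0, 1.0, size=(n_samples, n_signals))

# Simulated soft errors shift the distribution of a few signals.
faulty = rng.normal(0.0, 1.0, size=(n_samples, n_signals))
faulty[:, rng.integers(0, n_signals, size=4)] += 3.0

X = np.vstack([clean, faulty])
y = np.concatenate([np.zeros(n_samples), np.ones(n_samples)])

# The concurrent detector is orders of magnitude cheaper than the
# main model it monitors.
detector = RandomForestClassifier(n_estimators=50, random_state=0)
detector.fit(X, y)
train_accuracy = detector.score(X, y)
```

At inference time, the detector would run concurrently with the main model and flag inputs whose check signals appear anomalous, without duplicating the main model's computation.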
Keywords
Computational modeling,Transformers,Neural networks,Integrated circuit modeling,Complexity theory,Zero-shot learning,Training,Bidirectional encoder representations from transformers (BERT),concurrent error detection (CED),contrastive language-image pretraining (CLIP),machine learning (ML),soft errors