Efficient MAE towards Large-Scale Vision Transformers

Han Qiu, Gongjie Zhang, Jiaxing Huang, Peng Gao, Zhang Wei, Shijian Lu

IEEE/CVF Winter Conference on Applications of Computer Vision (2024)

Abstract
Masked Autoencoder (MAE) has demonstrated superb pre-training efficiency for vision Transformers, thanks to its partial-input paradigm and high mask ratio (0.75). However, MAE often suffers a severe performance drop under higher mask ratios, which hinders its potential toward larger-scale vision Transformers. In this work, we identify that the performance drop is largely attributed to the over-dominance of difficult reconstruction targets, as higher mask ratios leave sparser visible patches and fewer visual clues for reconstruction. To mitigate this issue, we design Efficient MAE, which introduces a novel Difficulty-Flatten Loss and a decoder masking strategy, enabling a higher mask ratio for more efficient pre-training. The Difficulty-Flatten Loss provides balanced supervision over reconstruction targets of different difficulties, effectively mitigating the performance drop under higher mask ratios. Additionally, the decoder masking strategy discards the most difficult reconstruction targets, which further alleviates the optimization difficulty and noticeably accelerates pre-training. Our proposed Efficient MAE reduces pre-training runtime by 27% and 30% for the ViT-Large and ViT-Huge models respectively, provides valuable insights into MAE’s optimization, and paves the way for larger-scale vision Transformer pre-training. Code and pre-trained models will be released.
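The two ideas in the abstract can be sketched in a few lines. The snippet below is a hypothetical illustration only, assuming per-patch MSE reconstruction losses: the exact formulation of the Difficulty-Flatten Loss and the decoder masking schedule are not given in the abstract, so the drop fraction, weighting function, and `temperature` parameter here are illustrative assumptions, not the paper's method.

```python
import numpy as np


def difficulty_flatten_loss(pred, target, drop_frac=0.1, temperature=1.0):
    """Hypothetical sketch of balanced supervision over reconstruction
    targets of varying difficulty.

    pred, target: arrays of shape (num_masked_patches, patch_dim).
    drop_frac: fraction of the hardest targets to discard (a stand-in
        for the decoder masking strategy; the real schedule may differ).
    temperature: controls how strongly difficult patches are down-weighted
        (an illustrative knob, not from the paper).
    """
    # Per-patch reconstruction difficulty, measured as per-patch MSE.
    per_patch = ((pred - target) ** 2).mean(axis=1)

    # Decoder-masking-style step: discard the most difficult targets.
    k = int(len(per_patch) * drop_frac)
    if k > 0:
        keep = np.argsort(per_patch)[: len(per_patch) - k]
        per_patch = per_patch[keep]

    # Difficulty-flattening step: down-weight high-loss patches so that
    # difficult targets cannot dominate the overall objective.
    weights = np.exp(-per_patch / (per_patch.mean() * temperature + 1e-8))
    weights = weights / weights.sum()
    return float((weights * per_patch).sum())
```

Under this sketch, when all patches are equally hard the weights are uniform and the loss reduces to the plain mean; as difficulty spreads out, hard patches contribute proportionally less, which is the "flattening" effect the abstract describes.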
Keywords
Algorithms, Image recognition and understanding, Machine learning architectures, formulations, and algorithms