Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words
arxiv(2023)
摘要
Vision Transformer (ViT) has emerged as a powerful architecture in the realm
of modern computer vision. However, its application in certain imaging fields,
such as microscopy and satellite imaging, presents unique challenges. In these
domains, images often contain multiple channels, each carrying semantically
distinct and independent information. Furthermore, the model must demonstrate
robustness to sparsity in input channels, as they may not be densely available
during training or testing. In this paper, we propose a modification to the ViT
architecture that enhances reasoning across the input channels and introduce
Hierarchical Channel Sampling (HCS) as an additional regularization technique
to ensure robustness when only partial channels are presented during test time.
Our proposed model, ChannelViT, constructs patch tokens independently from each
input channel and utilizes a learnable channel embedding that is added to the
patch tokens, similar to positional embeddings. We evaluate the performance of
ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat
(satellite imaging). Our results show that ChannelViT outperforms ViT on
classification tasks and generalizes well, even when a subset of input channels
is used during testing. Across our experiments, HCS proves to be a powerful
regularizer, independent of the architecture employed, suggesting itself as a
straightforward technique for robust ViT training. Lastly, we find that
ChannelViT generalizes effectively even when there is limited access to all
channels during training, highlighting its potential for multi-channel imaging
under real-world conditions with sparse sensors. Our code is available at
https://github.com/insitro/ChannelViT.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要