HyenaPixel: Global Image Context with Convolutions
CoRR (2024)
Abstract
In vision tasks, a larger effective receptive field (ERF) is associated with
better performance. While attention natively supports global context,
convolution requires multiple stacked layers and a hierarchical structure for
large context. In this work, we extend Hyena, a convolution-based attention
replacement, from causal sequences to the non-causal two-dimensional image
space. We scale the Hyena convolution kernels beyond the feature map size up to
191×191 to maximize the ERF while maintaining sub-quadratic complexity
in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel,
and bidirectional Hyena into the MetaFormer framework. For image
categorization, HyenaPixel and bidirectional Hyena achieve a competitive
ImageNet-1k top-1 accuracy of 83.0% [...]
outperforming other large-kernel networks. Combining HyenaPixel with attention
further increases accuracy to 83.6%. [...]
the lack of spatial bias in later stages and support this finding with
bidirectional Hyena.
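
The abstract does not spell out how a kernel larger than the feature map stays sub-quadratic in the number of pixels. One standard route, shown here only as a minimal sketch and not as the authors' implementation, is to evaluate the convolution in the frequency domain with an FFT, which costs O(N log N) for N pixels instead of O(N * K) for a direct sum over K kernel taps. The function names `fft_conv2d_same` and `direct_conv2d_same` are hypothetical, the example is single-channel NumPy, and it leaves out everything else a Hyena-style operator would need (for example learned kernels and gating). The 6×6 map with an 11×11 kernel is an arbitrary toy size chosen so that the kernel exceeds the feature map.

```python
import numpy as np


def fft_conv2d_same(x, k):
    """Non-causal 2D linear convolution via FFT, cropped to the input size.

    x: (H, W) feature map; k: (Kh, Kw) kernel with odd side lengths.
    The kernel may be larger than the feature map.
    """
    H, W = x.shape
    Kh, Kw = k.shape
    # Zero-pad to the full linear-convolution size to avoid circular wrap-around.
    fh, fw = H + Kh - 1, W + Kw - 1
    X = np.fft.rfft2(x, s=(fh, fw))
    K = np.fft.rfft2(k, s=(fh, fw))
    full = np.fft.irfft2(X * K, s=(fh, fw))
    # Crop so that the kernel centre is aligned with each output pixel.
    top, left = Kh // 2, Kw // 2
    return full[top:top + H, left:left + W]


def direct_conv2d_same(x, k):
    """Reference implementation: direct zero-padded convolution (slow)."""
    H, W = x.shape
    Kh, Kw = k.shape
    ch, cw = Kh // 2, Kw // 2
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for a in range(Kh):
                for b in range(Kw):
                    p, q = i + ch - a, j + cw - b
                    if 0 <= p < H and 0 <= q < W:
                        out[i, j] += k[a, b] * x[p, q]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 6))      # toy feature map
    k = rng.standard_normal((11, 11))    # kernel larger than the map
    y_fft = fft_conv2d_same(x, k)
    y_ref = direct_conv2d_same(x, k)
    print(np.allclose(y_fft, y_ref))
```

Because the kernel spans more than the whole map, every output pixel depends on every input pixel, which is the sense in which a single such layer has a global receptive field.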