Distilling Vision-Language Models on Millions of Videos
Computer Vision and Pattern Recognition (2024)
Keywords
Vision-Language Models, Question Answering, Language Model, Language Version, Original Text, Tokenized, Video Dataset, Top-1 Accuracy, Causal Questions, Image Captioning, Causal Reasoning, Language Components, Flamingo, Vision Transformer, Text Sequence, Visual Question Answering, Visual Encoding, Video Understanding, Visual Adaptation, Video Captioning, Text Encoder, Textual Descriptions