EXAMS-V: A Multi-Discipline Multilingual Multimodal Exam Benchmark for Evaluating Vision Language Models
arXiv (2024)
Abstract
We introduce EXAMS-V, a new, challenging multi-discipline, multimodal,
multilingual exam benchmark for evaluating vision language models. It consists
of 20,932 multiple-choice questions across 20 school disciplines covering
natural science, social science, and other miscellaneous studies, e.g.,
religion, fine arts, business, etc. EXAMS-V includes a variety of multimodal
features such as text, images, tables, figures, diagrams, maps, scientific
symbols, and equations. The questions come in 11 languages from 7 language
families. Unlike existing benchmarks, EXAMS-V is uniquely curated by gathering
school exam questions from various countries, with a variety of education
systems. This distinctive approach calls for intricate reasoning across diverse
languages and relies on region-specific knowledge. Solving the problems in the
dataset requires advanced perception and joint reasoning over the text and the
visual content of the image. Our evaluation results demonstrate that this is a
challenging dataset, which is difficult even for advanced vision-text models
such as GPT-4V and Gemini; this underscores the inherent complexity of the
dataset and its significance as a future benchmark.