Generating Natural Questions About an Image

Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

Keywords:
question answering, visual question, Visual Question Generation, Visual Question Answering, Microsoft Common Objects in Context

Abstract:

There has been an explosion of work in the vision & language community during the past few years, from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often dire...

Introduction
  • The authors are witnessing a renewed interest in interdisciplinary AI research in vision & language, from descriptions of the visual input, such as image captioning (Chen et al, 2015; Fang et al, 2014; Donahue et al, 2014) and video transcription (Rohrbach et al, 2012; Venugopalan et al, 2015), to testing computer understanding of an image through question answering (Antol et al, 2015; Malinowski and Fritz, 2014).

    [Figure: example image. Natural questions: "Was anyone injured in the crash?", "Is the motorcyclist alive?", "What caused this accident?" Generated caption: "A man standing next to a motorcycle."]
  • The most established work in the vision & language community is ‘image captioning’, where the task is to produce a literal description of the image.
  • It has been shown (Devlin et al, 2015; Fang et al, 2014; Donahue et al, 2014) that reasonable language modeling paired with deep visual features, trained on large enough datasets, yields good performance on image captioning, making it a less challenging task from a language-learning perspective.
Highlights
  • We are witnessing a renewed interest in interdisciplinary AI research in vision & language, from descriptions of the visual input, such as image captioning (Chen et al, 2015; Fang et al, 2014; Donahue et al, 2014) and video transcription (Rohrbach et al, 2012; Venugopalan et al, 2015), to testing computer understanding of an image through question answering (Antol et al, 2015; Malinowski and Fritz, 2014).

    [Figure: example image. Natural questions: "Was anyone injured in the crash?", "Is the motorcyclist alive?", "What caused this accident?" Generated caption: "A man standing next to a motorcycle."]
  • To move beyond the literal description of image content, we introduce the novel task of Visual Question Generation (VQG), where given an image, the system should ‘ask a natural and engaging question’
  • The contributions of this paper can be summarized as follows: (1) in order to enable Visual Question Generation research, we carefully created three datasets with a total of 75,000 questions, which range from object-centric to event-centric images, and we show that Visual Question Generation covers a wide range of abstract terms including events and states (Section 3); (2) we collected 25,000 gold captions for our event-centric dataset and show that this dataset presents challenges to state-of-the-art image captioning systems.
  • In order to shed some light on the differences between our three datasets, we present the evaluation results separately on each dataset in Table 5.
  • The GRNN_X model, i.e. GRNN trained on the corresponding dataset (marked X in Table 5), outperforms the other models according to all three metrics on the VQG_Bing-5000 dataset (a minimal decoder sketch follows this list).
  • We introduced the novel task of ‘Visual Question Generation’, where given an image, the system is tasked with asking a natural question.
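The GRNN referenced above is a generation model; the following is a minimal PyTorch sketch of an image-conditioned gated recurrent decoder in that spirit. The 4096-dimensional VGG-style fc7 features, layer sizes, and the class name QuestionGRNN are illustrative assumptions rather than the authors' exact configuration; the sketch only shows how image features can seed a GRU that emits question tokens.

```python
# Sketch only: an image-conditioned GRU decoder in the spirit of a GRNN generation model.
import torch
import torch.nn as nn

class QuestionGRNN(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, vocab_size, feat_dim=4096, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hidden_dim)   # map CNN features to the initial GRU state
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings for question tokens
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-word logits

    def forward(self, img_feats, question_tokens):
        # img_feats: (batch, feat_dim); question_tokens: (batch, seq_len)
        h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)  # (1, batch, hidden)
        emb = self.embed(question_tokens)                        # (batch, seq_len, embed)
        hidden_states, _ = self.gru(emb, h0)
        return self.out(hidden_states)                           # (batch, seq_len, vocab)

# Training would minimize cross-entropy of the next question token, e.g. (with F = torch.nn.functional):
# logits = model(feats, tokens[:, :-1]); loss = F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
```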
Methods
Results
  • The authors present the human and automatic metric evaluation results of the models introduced earlier.
  • The most competitive is K-NN+min_bleu-all, which performs best on the VQG_COCO-5000 and VQG_Flickr-5000 datasets according to BLEU and ∆BLEU scores (a scoring sketch follows this list).
  • This further confirms the effectiveness of a retrieval methodology that incorporates min-distance and n-gram-overlap similarity measures.
  • This shows that the Bing dataset is more demanding, making it a meaningful challenge for the community.
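To make the automatic scoring above concrete, the sketch below computes sentence-level BLEU for a generated question against multiple crowd-sourced reference questions using NLTK; ∆BLEU (Galley et al, 2015) additionally weights each reference by its human rating, which is not reproduced here. The example questions come from the figure earlier in this summary; the BLEU order and smoothing choice are assumptions.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Crowd-sourced reference questions for one image (from the example figure above).
references = [
    "was anyone injured in the crash".split(),
    "is the motorcyclist alive".split(),
    "what caused this accident".split(),
]
hypothesis = "what caused the accident".split()  # a system-generated question

smooth = SmoothingFunction().method1  # smoothing helps on short questions
score = sentence_bleu(references, hypothesis, weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU-2 against the human questions: {score:.3f}")
```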
Conclusion
  • The authors introduced the novel task of ‘Visual Question Generation’, where given an image, the system is tasked with asking a natural question.
  • The authors provide three distinct datasets, each covering a variety of images.
  • The most challenging is the Bing dataset, requiring systems to generate questions with event-centric concepts such as ‘cause’, ‘event’, ‘happen’, etc., from the visual input.
  • The authors show that the Bing dataset presents challenging images to the state-of-the-art captioning systems.
  • The authors encourage the community to report their system results on the Bing test dataset, using the ∆BLEU automatic metric.
  • All the datasets will be released to the public.
Summary
  • Introduction:

    The authors are witnessing a renewed interest in interdisciplinary AI research in vision & language, from descriptions of the visual input, such as image captioning (Chen et al, 2015; Fang et al, 2014; Donahue et al, 2014) and video transcription (Rohrbach et al, 2012; Venugopalan et al, 2015), to testing computer understanding of an image through question answering (Antol et al, 2015; Malinowski and Fritz, 2014).

    [Figure: example image. Natural questions: "Was anyone injured in the crash?", "Is the motorcyclist alive?", "What caused this accident?" Generated caption: "A man standing next to a motorcycle."]
  • The most established work in the vision & language community is ‘image captioning’, where the task is to produce a literal description of the image.
  • It has been shown (Devlin et al, 2015; Fang et al, 2014; Donahue et al, 2014) that reasonable language modeling paired with deep visual features, trained on large enough datasets, yields good performance on image captioning, making it a less challenging task from a language-learning perspective.
  • Methods:

    Retrieval models use the caption of a nearest neighbor training image to label the test image (Hodosh et al, 2013; Devlin et al, 2015; Farhadi et al, 2010; Ordonez et al, 2011).
  • Basic nearest neighbor approaches to image captioning on the MS COCO dataset are shown to outperform generation models according to automatic metrics (Devlin et al, 2015).
  • The authors obtained the most competitive results by setting K dynamically, as opposed to the earlier fixed-K approach (a retrieval sketch follows).
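As a rough illustration of the retrieval approach summarized above (cf. Devlin et al, 2015), the sketch below retrieves the questions of the K nearest training images in CNN feature space and returns the pooled question that agrees most, by BLEU, with the other candidates. The Euclidean distance, BLEU-2 consensus, and fixed k are assumptions; the paper's K-NN+min_bleu-all variant and its dynamic choice of K are not reproduced here.

```python
# Sketch only: consensus-style K-NN retrieval of a question for a test image.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def knn_question(test_feat, train_feats, train_questions, k=5):
    # train_feats: (N, D) CNN feature matrix; train_questions: list of N question strings.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)      # distance to every training image
    candidates = [train_questions[i] for i in np.argsort(dists)[:k]]
    smooth = SmoothingFunction().method1

    def consensus(q):
        # Average BLEU of q against the other retrieved questions.
        others = [c.split() for c in candidates if c != q]
        if not others:
            return 0.0
        return float(np.mean([sentence_bleu([o], q.split(), weights=(0.5, 0.5),
                                            smoothing_function=smooth) for o in others]))

    return max(candidates, key=consensus)  # the most "central" question among the neighbors
```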
  • Results:

    The authors present the human and automatic metric evaluation results of the models introduced earlier.
  • The most competitive is K-NN+min_bleu-all, which performs best on the VQG_COCO-5000 and VQG_Flickr-5000 datasets according to BLEU and ∆BLEU scores.
  • This further confirms the effectiveness of a retrieval methodology that incorporates min-distance and n-gram-overlap similarity measures.
  • This shows that the Bing dataset is more demanding, making it a meaningful challenge for the community.
  • Conclusion:

    The authors introduced the novel task of ‘Visual Question Generation’, where given an image, the system is tasked with asking a natural question.
  • The authors provide three distinct datasets, each covering a variety of images.
  • The most challenging is the Bing dataset, requiring systems to generate questions with event-centric concepts such as ‘cause’, ‘event’, ‘happen’, etc., from the visual input.
  • The authors show that the Bing dataset presents challenging images to the state-of-the-art captioning systems.
  • The authors encourage the community to report their system results on the Bing test dataset, using the ∆BLEU automatic metric.
  • All the datasets will be released to the public.
Tables
  • Table1: Dataset annotations on the above image
  • Table2: Statistics of crowdsourcing task, aggregating all three datasets
  • Table3: Image captioning results
  • Table4: Sample generations by different systems on VQG_Bing-5000, in order: Human_consensus and Human_random, GRNN_bing and GRNN_all, K-NN+min_bleu-all, MSR captions. Q is the query term
  • Table5: Results of evaluating various models according to different metrics. X represents training on the corresponding dataset in the row. The human score per model is computed by averaging per-image human scores across images, where the per-image score is the median rating across the three raters (see the aggregation sketch after this list)
  • Table6: Correlations of automatic metrics against human judgments, with p-values in parentheses
  • Table7: Examples of errors in generation. The rows are Human_consensus, GRNN_all, and K-NN+min_bleu-all
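To make the aggregation described in the Table 5 caption and the metric correlations of Table 6 concrete, here is a small sketch with invented ratings: the per-image human score is the median over the three raters, the per-model score is the mean over images, and an automatic metric is correlated against the per-image human scores. Spearman correlation is used only for illustration; the paper's correlation statistic may differ.

```python
# Sketch only: Table 5 style human-score aggregation and Table 6 style metric correlation.
import numpy as np
from scipy.stats import spearmanr

ratings = np.array([   # rows = images, columns = the three crowd raters (toy data)
    [3, 2, 3],
    [1, 2, 1],
    [3, 3, 2],
])
per_image_human = np.median(ratings, axis=1)    # median rating per image
model_human_score = per_image_human.mean()      # average across images -> per-model human score

bleu_per_image = np.array([0.41, 0.12, 0.35])   # toy automatic scores for the same images
rho, p_value = spearmanr(bleu_per_image, per_image_human)  # correlation against human judgments
print(model_human_score, rho, p_value)
```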
Related work
  • For the task of image captioning, datasets have primarily focused on objects, e.g. Pascal VOC (Everingham et al, 2010) and Microsoft Common Objects in Context (MS COCO) (Lin et al, 2014). MS COCO, for example, includes complex everyday scenes with 91 basic objects in 328k images, each paired with 5 captions. Event detection is the focus in video processing and action detection, but these do not include a textual description of the event (Yao et al, 2011b; Andriluka et al, 2014; Chao et al, 2015; Xiong et al, 2015). The number of actions in each of these datasets is still relatively small, ranging from 40 (Yao et al, 2011a) to 600 (Chao et al, 2015), and all involve human-oriented activity (e.g. ‘cooking’, ‘gardening’, ‘riding a bike’). In our work, we are focused on generating questions for static images of events, such as ‘fire’, ‘explosion’ or ‘snowing’, which have not yet been investigated in any of the above datasets.

    Visual Question Answering is a relatively new task where the system provides an answer to a question about the image content. The most notable, Visual Question Answering (VQA) (Antol et al, 2015), is an open-ended (free-form) dataset, in which both the questions and the answers are crowd-sourced, with workers prompted to ask a visually verifiable question which will ‘stump a smart robot’. Gao et al (2015) used a similar methodology to create a visual question answering dataset in Chinese. COCO-QA (CQA) (Ren et al, 2015), in contrast, does not use human-authored questions, but generates questions automatically from image captions of the MS COCO dataset by applying a set of transformation rules to generate the wh-question (a toy sketch of such a transformation follows this paragraph). The expected answers in CQA are by design limited to objects, numbers, colors, or locations. A more in-depth analysis of the VQA and CQA datasets will be presented in Section 3.1.
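As a toy illustration of CQA-style automatic question generation from captions, the function below turns a declarative caption into a wh-question by substituting the object. The real pipeline of Ren et al (2015) relies on syntactic parsing and a richer rule set, so this is only a sketch of the idea and of why the resulting answers stay limited to literal content such as objects.

```python
# Sketch only: a toy caption-to-question transformation (not CQA's actual rules).
def caption_to_question(caption: str, obj: str, wh: str = "what") -> str:
    # "A man riding a horse" with obj="a horse" -> "what is the man riding?"
    body = caption.lower().replace(obj, "").strip().rstrip(".")
    body = body.removeprefix("a ").removeprefix("the ").strip()
    return f"{wh} is the {body}?"

print(caption_to_question("A man riding a horse", "a horse"))
# -> what is the man riding?
```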
Reference
  • Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2014. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Conference on Computer Vision (ICCV).
  • Lee Becker, Sumit Basu, and Lucy Vanderwende. 2012. Mind the gap: Learning to choose gaps for question generation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 742–751, Montreal, Canada, June. Association for Computational Linguistics.
  • Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. 2015. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision.
  • Jianfu Chen, Polina Kuznetsova, David Warren, and Yejin Choi. 2015. Déjà image-captions: A corpus of expressive descriptions in repetition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 504–514, Denver, Colorado, May–June. Association for Computational Linguistics.
  • Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
  • Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 100–105, Beijing, China, July. Association for Computational Linguistics.
  • Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2014. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389.
  • Mark Everingham, Luc Van Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision, 88(2):303–338, June.
  • Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. 2014. From captions to visual concepts and back. CoRR, abs/1411.4952.
  • Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10, pages 15–29, Berlin, Heidelberg. Springer-Verlag.
  • Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, and Margaret Mitchell. 2015. A survey of current datasets for vision and language research. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 207–213, Lisbon, Portugal, September. Association for Computational Linguistics.
  • Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 445–450, Beijing, China, July. Association for Computational Linguistics.
  • Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question answering. CoRR, abs/1505.05612.
  • Michael Heilman and Noah A. Smith. 2010. Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 609–617, Los Angeles, California, June. Association for Computational Linguistics.
  • Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Int. Res., 47(1):853–899, May.
  • Ting-Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross B. Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proceedings of NAACL 2016. Association for Computational Linguistics.
  • Joseph Jordania. 2006. Who Asked the First Question? The Origins of Human Choral Singing, Intelligence, Language and Speech. Logos.
  • Igor Labutov, Sumit Basu, and Lucy Vanderwende. 2015. Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  • Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
  • Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL ’04, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. 2014. Microsoft COCO: common objects in context. CoRR, abs/1405.0312.
  • David Lindberg, Fred Popowich, John Nesbit, and Phil Winne. 2013. Generating natural language questions to support learning on-line. In Proceedings of the 14th European Workshop on Natural Language Generation, pages 105–114, Sofia, Bulgaria, August. Association for Computational Linguistics.
  • Mateusz Malinowski and Mario Fritz. 2014. A multiworld approach to question answering about realworld scenes based on uncertain input. In Advances in Neural Information Processing Systems 27, pages 1682–1690.
  • Karen Mazidi and Rodney D. Nielsen. 2014. Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 321–326, Baltimore, Maryland, June. Association for Computational Linguistics.
  • Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 3111– 3119.
  • George A. Miller. 1995. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, November.
  • Ruslan Mitkov and Le An Ha. 2003. Computer-aided generation of multiple-choice tests. In Jill Burstein and Claudia Leacock, editors, Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, pages 17–22.
  • Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS).
  • Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, and M. Lazo. 2003. The TIMEBANK corpus. In Proceedings of Corpus Linguistics 2003, pages 647–656, Lancaster, March.
  • Radim Rehurek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45– 50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.
  • Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Question answering about images using visual semantic embeddings. In Deep Learning Workshop, ICML 2015.
  • Marcus Rohrbach, Sikandar Amin, Mykhaylo Andriluka, and Bernt Schiele. 2012. A database for fine grained activity detection of cooking activities. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, IEEE, June.
  • K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
  • Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205, Denver, Colorado, May–June. Association for Computational Linguistics.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 3104–3112.
  • Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, and Chris Sienkiewicz. 2016. Rich image captioning in the wild. In Proceedings of Deep Vision Workshop at CVPR 2016. IEEE, June.
  • Lucy Vanderwende, Arul Menezes, and Chris Quirk. 2015. An amr parser for english, french, german, spanish and japanese and a new amr-annotated corpus. Proceedings of NAACL 2015, June.
  • Lucy Vanderwende. 2008. The importance of being important: Question generation. In In Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA.
  • Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015. Translating videos to natural language using deep recurrent neural networks. In Proceedings the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), pages 1494–1504, Denver, Colorado, June.
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition.
  • John H Wolfe. 1976. Automatic question generation from text-an aid to independent study. In ACM SIGCUE Outlook, volume 10, pages 104–112. ACM.
  • Yuanjun Xiong, Kai Zhu, Dahua Lin, and Xiaoou Tang. 2015. Recognize complex events from static images by fusing deep channels. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.
  • Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas J. Guibas, and Li Fei-Fei. 2011a. Action recognition by learning bases of action attributes and parts. In International Conference on Computer Vision (ICCV), Barcelona, Spain, November.
  • Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas J. Guibas, and Li Fei-Fei. 2011b. Human action recognition by learning bases of action attributes and parts. In International Conference on Computer Vision (ICCV), Barcelona, Spain, November.
  • Yuke Zhu, Oliver Groth, Michael S. Bernstein, and Li Fei-Fei. 2016. Visual7w: Grounded question answering in images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, IEEE.