Know What You Don't Know: Unanswerable Questions for SQuAD

ACL, pp. 784-789, 2018.

Keywords:
Stanford Question Answering Dataset, eagle protection, unanswerable question, question answering, reading comprehension

Abstract:

Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify.

Introduction
  • Machine reading comprehension has become a central task in natural language understanding, fueled by the creation of many large-scale datasets (Hermann et al., 2015; Hewlett et al., 2016; Rajpurkar et al., 2016; Nguyen et al., 2016; Trischler et al., 2017; Joshi et al., 2017).
  • Recent work has even produced systems that surpass human-level exact match accuracy on the Stanford Question Answering Dataset (SQuAD), one of the most widely used reading comprehension benchmarks (Rajpurkar et al., 2016).
  • However, these systems are still far from true language understanding.
  • Models only need to select the span that seems most related to the question, instead of checking that the answer is entailed by the text.
Highlights
  • Machine reading comprehension has become a central task in natural language understanding, fueled by the creation of many large-scale datasets (Hermann et al., 2015; Hewlett et al., 2016; Rajpurkar et al., 2016; Nguyen et al., 2016; Trischler et al., 2017; Joshi et al., 2017).
  • One root cause of these unreliable guesses is the Stanford Question Answering Dataset's focus on questions for which a correct answer is guaranteed to exist in the context document.
  • We showed questions from the Stanford Question Answering Dataset for each paragraph; this further encouraged unanswerable questions to look similar to answerable ones.
  • SQUADRUN forces models to understand whether a paragraph entails that a certain span is the answer to a question (see the example sketch after this list).
  • The adversarial examples in SQUADRUN are difficult even for models trained on examples from the same distribution.
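For concreteness, below is a minimal sketch of what one such unanswerable example can look like. The field names follow the publicly released SQuAD 2.0 JSON format and the record contents are invented for illustration; neither is quoted from this summary.

```python
# Hypothetical unanswerable example, using the field names of the publicly
# released SQuAD 2.0 data (an assumption here, not taken from this summary).
# The gold answer list is empty and "is_impossible" is True; the
# "plausible_answers" field records the distractor span in the paragraph that
# the question was written to superficially resemble.
unanswerable_example = {
    "question": "In what year was the treaty signed?",
    "id": "example-0001",
    "answers": [],            # no correct answer span exists in the paragraph
    "is_impossible": True,
    "plausible_answers": [
        {"text": "1940", "answer_start": 123}   # plausible but incorrect span
    ],
}
```

A model that merely picks the span most related to the question would return the plausible span; doing well requires recognizing that the paragraph does not entail that span as an answer.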
Methods
  • The authors evaluated three existing model architectures: the BiDAF-No-Answer (BNA) model proposed by Levy et al. (2017), and two versions of the DocumentQA No-Answer (DocQA) model from Clark and Gardner (2017), namely versions with and without ELMo (Peters et al., 2018).
  • These models all learn to predict the probability that a question is unanswerable, in addition to a distribution over answer choices.
  • The authors predict that a question is unanswerable when this probability exceeds a threshold tuned on the development set; they find this strategy does slightly better than taking the argmax prediction, possibly due to the different proportions of negative examples at training and test time (a sketch of this decision rule follows below).
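A minimal sketch of this decision rule, under the assumption that the model exposes a no-answer probability and per-span scores (the function and variable names are illustrative, not the authors' implementation):

```python
import numpy as np

def predict_answer(span_scores, spans, p_no_answer, threshold):
    """Return the best answer span, or None to abstain ("unanswerable").

    span_scores : score for each candidate answer span
    spans       : candidate (start, end) spans aligned with span_scores
    p_no_answer : predicted probability that the question is unanswerable
    threshold   : abstention threshold tuned on the development set
    """
    if p_no_answer > threshold:
        return None                                # predict "unanswerable"
    return spans[int(np.argmax(span_scores))]      # otherwise return best span
```

Tuning the threshold on the development set, rather than simply taking the argmax over the no-answer option and the answer spans, lets the abstention rate be adjusted for the proportion of unanswerable questions seen at test time.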
Results
  • When evaluating on the test set, the authors use the threshold that maximizes F1 score on the development set.
  • Following Rajpurkar et al. (2016), the authors report average exact match (EM) and F1 scores; a minimal sketch of these metrics is given below.
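The sketch below follows the answer normalization used by the official SQuAD evaluation script and adds the threshold search described above. Helper names such as `predict_fn` are assumptions; the full evaluation also takes the maximum score over all gold answers for a question.

```python
import collections
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, and collapse whitespace,
    mirroring the normalization of the official SQuAD evaluation script."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match_score(prediction, gold):
    """1.0 if the normalized prediction equals the normalized gold answer.
    For unanswerable questions the gold answer is the empty string, so a
    system is credited only when it abstains (predicts the empty string)."""
    return float(normalize_answer(prediction) == normalize_answer(gold))

def f1_score(prediction, gold):
    """Token-level F1 between the prediction and the gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # An empty gold answer means "unanswerable": only an empty
        # (abstaining) prediction receives credit.
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def tune_threshold(dev_examples, candidate_thresholds, predict_fn):
    """Pick the abstention threshold that maximizes average F1 on the
    development set. `predict_fn(example, threshold)` is assumed to return
    the predicted answer string, with '' meaning an abstention."""
    best_threshold, best_f1 = None, -1.0
    for t in candidate_thresholds:
        avg_f1 = sum(
            f1_score(predict_fn(ex, t), ex["gold_answer"]) for ex in dev_examples
        ) / len(dev_examples)
        if avg_f1 > best_f1:
            best_threshold, best_f1 = t, avg_f1
    return best_threshold
```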
Conclusion
  • SQUADRUN forces models to understand whether a paragraph entails that a certain span is the answer to a question.
  • Relation extraction systems must understand when a possible relationship between two entities is not entailed by the text (Zhang et al., 2017).
  • Jia and Liang (2017) created adversarial examples that fool pre-trained SQuAD models at test time.
  • In contrast, the adversarial examples in SQUADRUN are difficult even for models trained on examples from the same distribution.
Tables
  • Table 1: Types of negative examples in SQUADRUN exhibiting a wide range of phenomena.
  • Table 2: Dataset statistics of SQUADRUN, compared to the original SQuAD dataset.
  • Table 3: Exact Match (EM) and F1 scores on SQUADRUN and SQuAD. The gap between humans and the best tested model is much larger on SQUADRUN, suggesting there is a great deal of room for model improvement.
  • Table 4: Exact Match (EM) and F1 scores on the SQUADRUN development set, compared with SQuAD with two types of automatically generated negative examples. SQUADRUN is more challenging for current models.
Funding
  • This work was supported by funding from Facebook.
  • R.J. is supported by an NSF Graduate Research Fellowship under Grant No. DGE-114747.
Reference
  • S. Bowman, G. Angeli, C. Potts, and C. D. Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).
  • C. Clark and M. Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
  • S. N. Gaikwad, D. Morina, R. Nistala, M. Agarwal, A. Cossette, R. Bhanu, S. Savage, V. Narwal, K. Rajpal, J. Regino, et al. 2015. Daemo: A self-governed crowdsourcing marketplace. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, pages 101–102.
  • K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).
  • D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL).
  • M. Hu, Y. Peng, and X. Qiu. 2017. Reinforced mnemonic reader for machine comprehension. arXiv preprint.
  • H. Huang, C. Zhu, Y. Shen, and W. Chen. 2018. FusionNet: Fusing via fully-aware attention with application to machine comprehension. In International Conference on Learning Representations (ICLR).
  • R. Jia and P. Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Empirical Methods in Natural Language Processing (EMNLP).
  • M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL).
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
  • O. Levy, M. Seo, E. Choi, and L. Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Computational Natural Language Learning (CoNLL).
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Language Resources and Evaluation Conference (LREC).
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop on Cognitive Computing at NIPS.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
  • M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 193–203.
  • M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv preprint.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Workshop on Representation Learning for NLP.
  • M. Wang, N. A. Smith, and T. Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Empirical Methods in Natural Language Processing (EMNLP).
  • W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Association for Computational Linguistics (ACL).
  • D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Computational Natural Language Learning (CoNLL).
  • Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2013–2018.
  • W. Yih, M. Chang, C. Meek, and A. Pastusiak. 2013. Question answering using enhanced lexical semantic models. In Association for Computational Linguistics (ACL).
  • Y. Zhang, V. Zhong, D. Chen, G. Angeli, and C. D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Empirical Methods in Natural Language Processing (EMNLP).