Get To The Point: Summarization with Pointer-Generator Networks

ACL, 2017.

Keywords:
automatic summarization, pointer generator, abstractive state of the art, abstractive text summarization, high ROUGE score

Abstract:

Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.

Introduction
  • Summarization is the task of condensing a piece of text to a shorter version that contains the main information from the original.
  • There are two broad approaches to summarization: extractive and abstractive.
  • The extractive approach is easier, because copying large chunks of text from the source document ensures baseline levels of grammaticality and accuracy.
  • Due to the difficulty of abstractive summarization, the great majority of past work has been extractive (Kupiec et al., 1995; Paice, 1990; Saggion and Poibeau, 2013).
Highlights
  • Summarization is the task of condensing a piece of text to a shorter version that contains the main information from the original
  • Extractive methods assemble summaries exclusively from passages taken directly from the source text, while abstractive methods may generate novel words and phrases not featured in the source text – as a human-written abstract usually does
  • We propose a novel variant of the coverage vector (Tu et al., 2016) from neural machine translation, which we use to track and control coverage of the source document
  • Our reasoning is that (i) calculating an explicit p_gen usefully enables us to raise or lower the probability of all generated words or all copy words at once, rather than individually, (ii) the two distributions serve such similar purposes that we find our simpler approach suffices, and (iii) we observe that the pointer mechanism often copies a word while attending to multiple occurrences of it in the source text (see the sketch after this list)
  • We evaluate our models with the standard ROUGE metric (Lin, 2004b), reporting the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L
  • We find that both our baseline models perform poorly with respect to ROUGE and METEOR, and the larger vocabulary size (150k) does not seem to help
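The two mechanisms highlighted above reduce to a small amount of arithmetic per decoder step: the final output distribution mixes the vocabulary softmax and the attention distribution via the generation probability p_gen, and the coverage vector (the running sum of past attention distributions) incurs a loss whenever the model re-attends to already-covered source positions. Below is a minimal NumPy sketch of that mixing step and the coverage loss; the function names, array shapes, and extended-vocabulary bookkeeping are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def final_distribution(p_vocab, attention, src_ids, p_gen, extended_vocab_size):
    """Mix the generator and pointer distributions for one decoder step.

    p_vocab:   (vocab_size,) softmax over the fixed output vocabulary
    attention: (src_len,)    attention distribution over source positions
    src_ids:   (src_len,)    source token ids in an *extended* vocabulary,
                             where in-article OOV words get temporary ids
    p_gen:     scalar in [0, 1], the generation probability
    """
    p_final = np.zeros(extended_vocab_size)
    p_final[: len(p_vocab)] += p_gen * p_vocab
    # Scatter-add the copy probabilities: a word attended to at several
    # source positions accumulates all of that attention mass.
    np.add.at(p_final, src_ids, (1.0 - p_gen) * attention)
    return p_final

def coverage_loss(attention, coverage):
    """covloss_t = sum_i min(a_i^t, c_i^t): penalizes re-attending to
    source positions that are already covered."""
    return np.minimum(attention, coverage).sum()
```

In training, the coverage vector would be updated as `coverage += attention` after each decoder step, and the coverage loss added to the primary negative log-likelihood term with a weight λ (the paper sets λ = 1).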
Methods
  • The authors' model has 256-dimensional hidden states and 128-dimensional word embeddings.
  • For the pointer-generator models, the authors use a vocabulary of 50k words for both source and target – note that due to the pointer network’s ability to handle OOV words, they can use a smaller vocabulary than the 150k source and 60k target vocabularies of Nallapati et al. (2016).
  • The pointer and the coverage mechanism introduce very few additional parameters to the network: for the models with vocabulary size 50k, the baseline model has 21,499,600 parameters, the pointer-generator adds 1153 extra parameters, and coverage adds 512 extra parameters (a rough accounting is sketched after this list).
  • The authors use the loss on the validation set to implement early stopping
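For intuition on those counts, one plausible accounting (our assumption, not a statement from the paper) is that p_gen is a single sigmoid unit reading the 512-dimensional context vector (bidirectional 256-d encoder), a 512-dimensional decoder LSTM state (cell and hidden concatenated), and a 128-dimensional decoder input, while coverage enters the attention through one weight vector of the attention's 512-dimensional feature size:

```python
# Hypothetical parameter accounting (our assumptions, not the authors' exact code).
context_dim   = 2 * 256   # bidirectional encoder -> 512-d context vector
dec_state_dim = 2 * 256   # LSTM cell + hidden state, 256-d each
dec_input_dim = 128       # word-embedding-sized decoder input
attn_dim      = 512       # attention feature size

p_gen_params    = context_dim + dec_state_dim + dec_input_dim + 1  # weights + bias
coverage_params = attn_dim                                          # one vector w_c in the attention

print(p_gen_params)     # 1153, matching the reported pointer-generator overhead
print(coverage_params)  # 512, matching the reported coverage overhead
```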
Results
  • 6.1 Preliminaries

    The authors' results are given in Table 1.
  • The authors evaluate the models with the standard ROUGE metric (Lin, 2004b), reporting the F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L (see the sketch after this list).
  • The authors also evaluate with the METEOR metric (Denkowski and Lavie, 2014), both in exact match mode and full mode.
  • In addition to their own models, the authors report the lead-3 baseline, and compare to the only existing abstractive (Nallapati et al., 2016) and extractive (Nallapati et al., 2017) models on the full dataset.
  • The output of the models is available online.
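The paper reports F1 from the official ROUGE script (with the confidence intervals noted in Table 1). As an illustrative, unofficial substitute, the snippet below uses the `rouge-score` Python package to compute the same three F1 metrics for a single prediction/reference pair; the package choice and the example strings are our assumptions, not part of the paper's evaluation pipeline.

```python
# pip install rouge-score  (Google's reimplementation, not the official ROUGE-1.5.5 Perl script)
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference  = "smugglers exploit desperate migrants fleeing the conflict"
prediction = "smugglers profit from desperate migrants"

scores = scorer.score(reference, prediction)
for name, result in scores.items():
    # Each result carries precision, recall and F1; the paper reports F1.
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```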
Conclusion
  • 7.1 Comparison with extractive systems
  • It is clear from Table 1 that extractive systems tend to achieve higher ROUGE scores than abstractive, and that the extractive lead-3 baseline is extremely strong.
  • "Smugglers profit from desperate migrants" is a valid alternative abstractive summary for the first example in Figure 5, but it scores 0 ROUGE with respect to the reference summary.
  • The authors' model exhibits many abstractive abilities, but attaining higher levels of abstraction remains an open research question.
Tables
  • Table 1: ROUGE F1 and METEOR scores on the test set. Models and baselines in the top half are abstractive, while those in the bottom half are extractive. Those marked with * were trained and evaluated on the anonymized dataset, and so are not strictly comparable to our results on the original text. All our ROUGE scores have a 95% confidence interval of at most ±0.25 as reported by the official ROUGE script. The METEOR improvement from the 50k baseline to the pointer-generator model, and from the pointer-generator to the pointer-generator+coverage model, were both found to be statistically significant using an approximate randomization test with p < 0.01
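The significance claim in the caption rests on an approximate randomization test. A generic sketch of that test is below: it randomly swaps the two systems' paired per-document scores and counts how often the absolute difference in means is at least as large as the observed one. The function name and the choice of 10,000 permutations are our assumptions; the paper does not specify its exact test configuration beyond p < 0.01.

```python
import random

def approx_randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided approximate randomization (permutation) test on paired
    per-document scores, e.g. METEOR for two systems on the same test set."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    at_least_as_extreme = 0
    for _ in range(trials):
        swapped_a, swapped_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Swap each pair with probability 0.5 under the null hypothesis
            # that the two systems are interchangeable.
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(sum(swapped_a) - sum(swapped_b)) / len(swapped_a)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```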
Related work
  • Neural abstractive summarization. Rush et al. (2015) were the first to apply modern neural networks to abstractive text summarization, achieving state-of-the-art performance on DUC-2004 and Gigaword, two sentence-level summarization datasets. Their approach, which is centered on the attention mechanism, has been augmented with recurrent decoders (Chopra et al., 2016), Abstract Meaning Representations (Takase et al., 2016), hierarchical networks (Nallapati et al., 2016), variational autoencoders (Miao and Blunsom, 2016), and direct optimization of the performance metric (Ranzato et al., 2016), further improving performance on those datasets.

    However, large-scale datasets for summarization of longer text are rare. Nallapati et al. (2016) adapted the DeepMind question-answering dataset (Hermann et al., 2015) for summarization, resulting in the CNN/Daily Mail dataset, and provided the first abstractive baselines. The same authors then published a neural extractive approach (Nallapati et al., 2017), which uses hierarchical RNNs to select sentences, and found that it significantly outperformed their abstractive result with respect to the ROUGE metric. To our knowledge, these are the only two published results on the full dataset.
Funding
  • Stanford University gratefully acknowledges the support of the DARPA DEFT Program AFRL contract no
Reference
  • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
  • Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In International Joint Conference on Artificial Intelligence.
  • Jackie Chi Kit Cheung and Gerald Penn. 2014. Unsupervised sentence enhancement for automatic summarization. In Empirical Methods in Natural Language Processing.
  • Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In North American Chapter of the Association for Computational Linguistics.
  • Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In EACL 2014 Workshop on Statistical Machine Translation.
  • John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
  • Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Association for Computational Linguistics.
  • Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Association for Computational Linguistics.
  • Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Neural Information Processing Systems.
  • Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Applied Natural Language Processing.
  • Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.
  • Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Chin-Yew Lin. 2004a. Looking for a few good metrics: Automatic summarization evaluation – how many samples are enough? In NACSIS/NII Test Collection for Information Retrieval (NTCIR) Workshop.
  • Chin-Yew Lin. 2004b. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: ACL Workshop.
  • Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. In NIPS 2016 Workshop on Multi-class and Multi-label Learning in Extremely Large Label Spaces.
  • Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage embedding models for neural machine translation. In Empirical Methods in Natural Language Processing.
  • Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Empirical Methods in Natural Language Processing.
  • Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Computational Natural Language Learning.
  • Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Association for the Advancement of Artificial Intelligence.
  • Chris D. Paice. 1990. Constructing literature abstracts by computer: Techniques and prospects. Information Processing & Management 26(1):171–186.
  • Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. In International Conference on Learning Representations.
  • Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Empirical Methods in Natural Language Processing.
  • Horacio Saggion and Thierry Poibeau. 2013. Automatic text summarization: Past, present and future. In Multi-source, Multilingual Information Extraction and Summarization, Springer, pages 3–21.
  • Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, and Abe Ittycheriah. 2016. Temporal attention model for neural machine translation. arXiv preprint arXiv:1608.02927.
  • Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Neural Information Processing Systems.
  • Jun Suzuki and Masaaki Nagata. 2016. RNN-based encoder-decoder approach with word frequency estimation. arXiv preprint arXiv:1701.00138.
  • Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on abstract meaning representation. In Empirical Methods in Natural Language Processing.
  • Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Association for Computational Linguistics.
  • Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Neural Information Processing Systems.
  • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.
  • Wenyuan Zeng, Wenjie Luo, Sanja Fidler, and Raquel Urtasun. 2016. Efficient summarization with read-again and copy mechanism. arXiv preprint arXiv:1611.03382.