Latent Domain Translation Models in Mix-of-Domains Haystack

COLING (2014)

Abstract
This paper addresses the problem of selecting adequate training sentence pairs from a mix-of-domains parallel corpus for a translation task represented by a small in-domain parallel corpus. We propose a novel latent domain translation model which includes domain priors, domain-dependent translation models and language models. The goal of learning is to estimate the probability that a sentence pair in the mix-domain corpus is in- or out-domain, using in-domain corpus statistics as a prior. We derive an EM training algorithm and provide solutions for estimating out-domain models (given only in- and mix-domain data). We report on experiments in data selection (intrinsic) and machine translation (extrinsic) on a large parallel corpus consisting of a mix of a rather diverse set of domains. Our results show that our latent domain approach significantly outperforms the existing baselines. We also provide an analysis of the merits of our approach relative to existing approaches.

Large parallel corpora are important for training statistical MT systems. Besides size, the relevance of a parallel training corpus to the translation task at hand can be decisive for system performance, cf. (Axelrod et al., 2011; Koehn and Haddow, 2012). In this paper we look at data selection where we have access to a large parallel data repository $C_{mix}$, representing a rather varied mix of domains, and we are given a sample of in-domain parallel data $C_{in}$, exemplifying a target translation task. Simply concatenating $C_{in}$ with $C_{mix}$ does not always deliver the best performance, because including irrelevant sentences might be more harmful than beneficial, cf. (Axelrod et al., 2011). To make the best of the available data, we must select sentences from $C_{mix}$ for their relevance to translating sentences from $C_{in}$.

Axelrod et al. (2011) and follow-up work, e.g., (Haddow and Koehn, 2012; Koehn and Haddow, 2012), select sentence pairs in $C_{mix}$ using the cross-entropy difference between in- and mix-domain language models on both the source and target sides, a modification of the Moore and Lewis method (Moore and Lewis, 2010). In the translation context, however, a source phrase often has different senses/translations in different domains, which cannot be distinguished with monolingual language models. The dependence of translation choice on domain suggests that the word alignments themselves are better conditioned on domain information. However, in the data selection setting, the corpus $C_{mix}$ often does not contain useful domain markers, and $C_{in}$ contains only a small sample of in-domain sentence pairs.

In this paper we present a latent domain translation model which weights every sentence pair $\langle f, e \rangle \in C_{mix}$ with a probability $P(D \mid f, e)$ of being in-domain ($D_1$) or out-domain ($D_0$). Our model defines $P(e, f) = \sum_{D \in \{D_1, D_0\}} P(D)\, P(e, f \mid D)$, using a latent domain variable $D \in \{D_0, D_1\}$. Using bidirectional translation models, this leads to a domain prior $P(D)$, domain-dependent translation models $P_t(\cdot \mid \cdot, D)$ and language models $P_{lm}(\cdot \mid D)$, as in Equation 1:
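The equation itself is cut off in this extract. A plausible reconstruction from the factors just named (domain prior, bidirectional translation models, language models) is the following; the geometric-mean combination of the two translation directions is an assumption, not taken from the source:

```latex
% Hypothetical reconstruction of Equation 1. The factors match the text
% above; the geometric mean over the two directions is assumed, since
% the equation is truncated in this extract.
P(e, f) \;=\; \sum_{D \in \{D_0, D_1\}} P(D)\,
  \Big[ P_t(f \mid e, D)\, P_{lm}(e \mid D)\;
        P_t(e \mid f, D)\, P_{lm}(f \mid D) \Big]^{\frac{1}{2}}
```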
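To make the EM training concrete, here is a minimal sketch of how the per-pair posteriors $P(D \mid f, e)$ could be estimated under the mixture $P(e,f) = \sum_D P(D)\, P(e,f \mid D)$. This is not the paper's model: the domain-conditioned translation and language models are replaced by toy bag-of-words unigram models with add-one smoothing, $C_{in}$ is kept as a fixed prior on the in-domain model as the text suggests, and all function and variable names are illustrative.

```python
# Minimal EM sketch for the latent domain mixture
#   P(e, f) = sum_{D in {D0, D1}} P(D) * P(e, f | D).
# NOT the paper's model: domain-conditioned translation and language
# models are replaced by toy unigram models; all names are illustrative.
import math
from collections import defaultdict

def em_domain_posteriors(cmix_pairs, cin_pairs, iters=10, prior_in=0.5):
    """Return P(D1 | f, e) for every (f, e) pair in cmix_pairs."""
    def words(pair):
        return (pair[0] + " " + pair[1]).split()

    vocab = {w for p in cmix_pairs + cin_pairs for w in words(p)}
    V = len(vocab) + 1  # +1 reserves mass for unseen words

    def soft_counts(pairs, weights):
        c = defaultdict(float)
        for g, p in zip(weights, pairs):
            for w in words(p):
                c[w] += g
        return c

    # Initialization: the in-domain model starts from C_in (used as a
    # prior, per the text); the out-domain model starts from all of C_mix.
    counts = {
        "in": soft_counts(cin_pairs, [1.0] * len(cin_pairs)),
        "out": soft_counts(cmix_pairs, [1.0] * len(cmix_pairs)),
    }
    p_d = {"in": prior_in, "out": 1.0 - prior_in}

    def logprob(pair, d):
        # Add-one smoothed unigram log P(pair | D), both sides pooled.
        total = sum(counts[d].values())
        return sum(math.log((counts[d][w] + 1.0) / (total + V))
                   for w in words(pair))

    for _ in range(iters):
        # E-step: responsibilities gamma = P(D1 | f, e) for each pair.
        gammas = []
        for pair in cmix_pairs:
            li = math.log(p_d["in"]) + logprob(pair, "in")
            lo = math.log(p_d["out"]) + logprob(pair, "out")
            m = max(li, lo)
            gammas.append(math.exp(li - m) /
                          (math.exp(li - m) + math.exp(lo - m)))
        # M-step: re-estimate the domain prior and both domain models
        # from the soft counts; C_in stays in the in-domain model.
        p_in = sum(gammas) / len(gammas)
        p_in = min(max(p_in, 1e-6), 1.0 - 1e-6)  # keep logs finite
        p_d = {"in": p_in, "out": 1.0 - p_in}
        counts["in"] = soft_counts(cin_pairs + cmix_pairs,
                                   [1.0] * len(cin_pairs) + gammas)
        counts["out"] = soft_counts(cmix_pairs, [1.0 - g for g in gammas])
    return gammas
```

The returned posteriors can serve directly as selection weights: keep the pairs in $C_{mix}$ with the highest $P(D_1 \mid f, e)$, matching the paper's goal of weighting every pair by its probability of being in-domain.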
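For comparison, the cross-entropy difference baseline (Moore and Lewis, 2010; Axelrod et al., 2011) discussed above scores each pair by $[H_{in}(f) - H_{mix}(f)] + [H_{in}(e) - H_{mix}(e)]$, lower being more in-domain. The sketch below uses a toy unigram LM in place of the n-gram language models of the original work; all names are illustrative.

```python
# Toy sketch of bilingual cross-entropy difference selection
# (Moore & Lewis 2010; Axelrod et al. 2011). A unigram LM with add-one
# smoothing stands in for the n-gram LMs of the original work.
import math
from collections import Counter

class UnigramLM:
    """Add-one smoothed unigram language model."""
    def __init__(self, sentences):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 for unseen words

    def cross_entropy(self, sentence):
        ws = sentence.split()
        logp = sum(math.log((self.counts[w] + 1) / (self.total + self.vocab))
                   for w in ws)
        return -logp / max(len(ws), 1)  # per-word cross-entropy

def ce_diff_scores(cin_src, cin_tgt, cmix_pairs):
    """Score each (f, e) pair in C_mix; lower = more in-domain."""
    h_in_f = UnigramLM(cin_src)
    h_in_e = UnigramLM(cin_tgt)
    h_mix_f = UnigramLM([f for f, _ in cmix_pairs])
    h_mix_e = UnigramLM([e for _, e in cmix_pairs])
    return [(h_in_f.cross_entropy(f) - h_mix_f.cross_entropy(f)) +
            (h_in_e.cross_entropy(e) - h_mix_e.cross_entropy(e))
            for f, e in cmix_pairs]

# Usage: keep the k pairs with the lowest scores.
# scores = ce_diff_scores(cin_src, cin_tgt, cmix_pairs)
# ranked = sorted(range(len(scores)), key=scores.__getitem__)
# selected = [cmix_pairs[i] for i in ranked[:k]]
```

Because both sides are scored with monolingual language models only, this baseline cannot separate domain-specific translation choices, which is the gap the latent domain model targets.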