Analyzing the Problems of Vocabulary in Japanese-Chinese Neural Network Machine Translation
Abstract: In recent years, neural machine translation (NMT) has made great progress as an emerging translation technology, producing output that is both more accurate and more fluent. NMT nevertheless still has many problems that need to be addressed. Taking Japanese-Chinese NMT as a case study, this article examines vocabulary-level problems and their causes, and proposes corresponding improvements to the model. Because of the model's limited vocabulary size and the domain mismatch of the corpus resources, translations contain unknown words as well as mistranslated and omitted words. To address these causes, the article proposes several improvements: using subword units, replacing low-frequency words, using external dictionaries, and training the model with domain adaptation. Subword units and external dictionaries overcome the problem of an overly small vocabulary. Replacing low-frequency words reduces their negative influence on the model. Domain adaptation improves the model's performance on text from a specific domain. Experimental results show that, compared with a general NMT model, the proposed improvements effectively reduce the number of vocabulary translation problems and thereby improve translation quality.
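To make the subword idea concrete, the following is a minimal sketch of byte-pair encoding (BPE), one common subword method. The abstract does not say which subword algorithm the article uses, so BPE is an assumption here, and the function names (`learn_bpe`, `merge_pair`, `segment`) and the toy corpus are hypothetical. The sketch also omits the end-of-word marker used in full BPE implementations.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Return a copy of `symbols` with each adjacent occurrence of `pair` fused."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair and record the rule.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus of Japanese words, for illustration only.
corpus = ["機械翻訳", "機械翻訳", "機械学習", "翻訳"]
merges = learn_bpe(corpus, num_merges=2)
print(segment("翻訳機", merges))
```

With these two merges, the unseen word 翻訳機 is split into the learned unit 翻訳 plus the single character 機 instead of being mapped to an unknown-word token, which is how subword units sidestep a fixed-size word vocabulary.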
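Cite this article: Luo, W.T. (2020) Analyzing the Problems of Vocabulary in Japanese-Chinese Neural Network Machine Translation. Computer Science and Application, 10(3), 387-397. https://doi.org/10.12677/CSA.2020.103040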
