Since the success of deep learning in natural language processing in 2014, academia and industry have worked continuously on using deep learning to improve machine translation quality. Once the attention model was proposed, neural machine translation finally achieved a major breakthrough: experimental results showed that neural machine translation outperforms statistical machine translation across many languages. Following this latest research, 2020 AI Lab started its own neural machine translation effort. The lab first built a neural machine translation system for the English–Chinese language pair, and the system has recently been upgraded to multilingual translation covering English, Chinese, German, French, Portuguese, Spanish, Japanese, Russian, and Arabic, with significantly improved translation quality.
This update of the neural machine translation system includes optimizations in both data and techniques.
The data optimization mainly addresses data sparsity and overfitting. The system applies a recurrent neural network to select appropriate data from our main corpora; the selected data currently covers eight domains: education, science & technology, medicine, sports, politics, economy, society, and spoken language. It also applies transfer learning to adapt features across languages and to overcome the lack of prior knowledge in several languages, such as Japanese and Portuguese. Thanks to the recurrent neural network and transfer learning, data selection for machine translation now achieves relatively good accuracy.
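The article does not detail the lab's actual selection model, so the following is only a minimal sketch of score-based domain data selection. It substitutes a simple cross-entropy-difference scorer with smoothed unigram models (in the style of Moore–Lewis) for the recurrent network described above; all corpora, sentences, and function names here are hypothetical illustrations.

```python
from collections import Counter
import math

def unigram_logprobs(sentences):
    """Add-one-smoothed unigram log-probabilities plus an unknown-word penalty."""
    counts = Counter(w for s in sentences for w in s.split())
    denom = sum(counts.values()) + len(counts) + 1
    table = {w: math.log((c + 1) / denom) for w, c in counts.items()}
    return table, math.log(1 / denom)

def select_in_domain(candidates, in_domain, general, top_k):
    """Rank candidates by the per-word log-probability difference between an
    in-domain model and a general model; keep the top_k most in-domain ones."""
    in_lp, in_unk = unigram_logprobs(in_domain)
    gen_lp, gen_unk = unigram_logprobs(general)

    def score(sent):
        words = sent.split()
        s_in = sum(in_lp.get(w, in_unk) for w in words) / len(words)
        s_gen = sum(gen_lp.get(w, gen_unk) for w in words) / len(words)
        return s_in - s_gen  # higher = closer to the target domain

    return sorted(candidates, key=score, reverse=True)[:top_k]

# Hypothetical toy corpora for a "medical" target domain:
in_domain = ["the patient received treatment",
             "doctor prescribed medicine to the patient"]
general = ["the weather is nice today", "stocks rose in early trading"]
candidates = ["the patient needs medicine", "the weather today is nice"]
picked = select_in_domain(candidates, in_domain, general, top_k=1)
```

A real system would replace the unigram scorer with a trained neural scorer, but the selection loop, i.e. score every candidate sentence and keep the best, stays the same.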
On the translation technology side, the system still uses the long short-term memory and attention model, but the encoding and decoding procedures have been optimized. In the encoder, word embeddings pre-trained on a very large monolingual corpus reduce training time and yield better validation results. For unknown words, the system uses byte pair encoding together with the phrase table from a statistical machine translation model. All in all, with continuous effort, the translation quality is now competitive with that of many well-known machine translation providers.
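The exact subword implementation is not published in the article, but the core of byte pair encoding is a simple merge-learning loop: starting from characters, repeatedly merge the most frequent adjacent symbol pair, so that rare or unknown words decompose into known subword units. A minimal sketch, with a hypothetical toy word-frequency dictionary:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn byte-pair-encoding merge rules from a {word: frequency} dict.
    Each word is represented as a tuple of symbols, initially characters."""
    vocab = Counter({tuple(w): f for w, f in word_freqs.items()})
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the chosen merge to every word in the vocabulary
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Hypothetical toy vocabulary with frequencies:
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 2)
```

Here the first two learned merges are ("e", "s") and then ("es", "t"), so an unseen word such as "tallest" would still decompose partly into the known subword "est" rather than being treated as a single unknown token.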
Meanwhile, 2020 AI Lab will keep improving the system in cooperation with other research institutions, with the goal of further refining it into a best-in-class machine translation system.