Cross-language information retrieval (CLIR) refers to retrieval techniques in which a query expressed in one language is used to retrieve information represented in one or more other languages. As a branch of information retrieval (IR), CLIR has its own linguistic complexity: it must deal not only with the problems faced by IR in general, but also with the mismatch between the language of the query and the language of the document set. The CLIR system of 2020 AI Labs realizes semantic mapping between the query's source language and the target language through a multi-term vector model built from parallel corpora. At the same time, it handles ambiguity resolution in cross-language retrieval and the translation of unknown terms, yielding a complete cross-language information retrieval system.
A term vector is the representation of a term as a vector, the basic method for digitizing natural language. Multi-term vectors are term vector representations in which semantic relations are established between multilingual terms, built for the purpose of solving multilingual (cross-language) tasks. Just as the semantic similarity of monolingual terms can be expressed through their vectors, the semantic similarity of terms across languages can likewise be expressed with vectors. A multi-term vector library is therefore of vital, foundational significance for many kinds of multilingual (cross-language) tasks: it can be used in multilingual text classification, multilingual text clustering, multilingual text similarity calculation, multilingual sentiment analysis, cross-language information retrieval, and machine translation, among many other fields.
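The idea that cross-language semantic similarity can be expressed with vectors can be illustrated in a few lines. The sketch below uses hand-made toy vectors in a shared embedding space (real multi-term vectors would be trained from parallel corpora, and the specific words and values here are purely illustrative): a translation pair scores high under cosine similarity, while an unrelated pair scores low.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two term vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional vectors in a shared (cross-lingual) embedding space.
embeddings = {
    ("en", "cat"): np.array([0.9, 0.1, 0.0, 0.2]),
    ("zh", "猫"):  np.array([0.88, 0.12, 0.05, 0.18]),  # "cat" in Chinese
    ("en", "car"): np.array([0.1, 0.9, 0.3, 0.0]),
}

# Cross-lingual translation pair: near 1.0; unrelated pair: much lower.
print(cosine_similarity(embeddings[("en", "cat")], embeddings[("zh", "猫")]))
print(cosine_similarity(embeddings[("en", "cat")], embeddings[("en", "car")]))
```

The same comparison works within one language or across languages, which is exactly what makes a shared multi-term vector space useful for the tasks listed above.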
We build our CLIR system on the basis of the multi-term vector model. The system has the following features:
1) Multi-term vector model based on a large-scale parallel corpus. On the basis of the large-scale parallel corpora accumulated by the YeeCloud Machine Translation System, with Chinese or English as the core bridge language, we use monolingual corpora and sentence-aligned corpora as training data to set up a feed-forward network that maps terms from other languages into Chinese or English, and we train a multilingual term vector model by introducing a loss function and a training set between each language and the bridge language. This makes it possible to extend the multilingual term vectors to new languages quickly. On the one hand, by reducing word vectors to a document vector, the semantics of a document can be expressed more accurately in a lower-dimensional space; on the other hand, through semantic-space conversion with the multilingual term vectors, a source document can be converted into a document vector in the target language's semantic space.
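A minimal sketch of the two operations this feature describes, under simplifying assumptions: the cross-language mapping is taken to be a linear transform fitted by least squares from aligned term pairs (the production system trains a feed-forward network instead), the document vector is a simple mean of term vectors, and all data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Synthetic aligned term pairs (source-language vector, bridge-language
# vector), standing in for pairs extracted from a sentence-aligned corpus.
n_pairs = 100
X_src = rng.normal(size=(n_pairs, dim))
true_map = rng.normal(size=(dim, dim))
X_bridge = X_src @ true_map + 0.01 * rng.normal(size=(n_pairs, dim))

# Fit a linear mapping W minimizing ||X_src @ W - X_bridge||_F, connecting
# the new language's term vectors to the bridge-language space.
W, *_ = np.linalg.lstsq(X_src, X_bridge, rcond=None)

def document_vector(term_vectors: np.ndarray) -> np.ndarray:
    """Collapse a document's term vectors to one document vector
    (a simple mean here; a learned reduction is also possible)."""
    return term_vectors.mean(axis=0)

# Map a source-language document vector into the bridge-language space.
doc_src = document_vector(X_src[:10])
doc_in_bridge = doc_src @ W
```

Once the mapping is fitted, extending to a further language only requires aligned pairs between that language and the bridge language, which matches the "quick extension" property described above.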
2) Cross-language retrieval model based on the multi-term vector model. Taking the source-language query vector as input, the cross-language retrieval model retrieves target-language documents whose semantic vectors are similar to it, ranked by similarity. On this basis, cross-language information retrieval is realized in combination with query analysis, query translation, query reconstruction, scoring and ranking, and a series of other cross-language retrieval tasks.
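The core ranking step can be sketched as follows, assuming the query vector has already been mapped into the target language's semantic space; the function name and the toy vectors are illustrative, not part of the system's API.

```python
import numpy as np

def rank_documents(query_vec: np.ndarray, doc_vecs: np.ndarray) -> list[int]:
    """Rank target-language document vectors by cosine similarity
    to an already-mapped source-language query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity per document
    return list(np.argsort(-scores))   # indices, most similar first

# Toy example: three target-language document vectors.
query = np.array([1.0, 0.0, 0.5])
docs = np.array([
    [0.0, 1.0, 0.0],   # doc 0: unrelated
    [0.2, 0.8, 0.1],   # doc 1: weakly related
    [0.9, 0.1, 0.4],   # doc 2: close match
])
print(rank_documents(query, docs))  # → [2, 1, 0]
```

In the full pipeline this similarity ranking would sit after query analysis and translation and feed into the final scoring and rank ordering.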
3) Support for a variety of languages. At present, the cross-language search engine supports more than 10 languages, including Chinese, English, French, German, Russian, Japanese, Korean, Arabic, Spanish, and Portuguese. One new language can be added per month.