Information Retrieval

1. What is Information Retrieval?

Information retrieval (IR) is the process of finding the information you need in a large collection of stored information. Internet search engines such as Google, which search through Web pages, are one form of IR system. As modern society is inundated with information, the need for high-performance IR systems and algorithms keeps growing.

So far, text data has usually been searched with keyword-based retrieval methods that rely on simple string matching. However, as the amount of available information increases and the demand for better search results rises, it becomes necessary to use not only the superficial, lexical description of the information but also its inherent meaning. How to handle this ``inherent meaning'' with computers has therefore become a major issue, and it is one that we, the Toyama Group, are now focusing on.
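The limitation of simple string matching described above can be illustrated with a minimal sketch. The function name `keyword_search` and the sample documents are hypothetical and chosen only for illustration; they do not describe any particular search engine.

```python
def keyword_search(query, documents):
    """Return the indices of documents that contain every query keyword as a substring."""
    keywords = query.lower().split()
    return [
        i for i, doc in enumerate(documents)
        if all(keyword in doc.lower() for keyword in keywords)
    ]


# Hypothetical sample collection.
documents = [
    "A used car for sale in Tokyo",
    "Automobile repair manual",
    "Recipes for summer vegetables",
]

# Only the first document is found, although the second one is about
# the same concept expressed with a different word.
print(keyword_search("car", documents))
```

Because the match is purely lexical, a query for "car" misses the document about "automobile". This is the kind of gap that handling the meaning of words is intended to close.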

2. Cross-lingual Information Retrieval

[Figure: Cross-lingual Information Retrieval]

In this rapidly internationalizing world, information retrieval should be able to deal with documents in multiple languages. IR that targets documents written in a language different from that of the query is called Cross-lingual Information Retrieval (CLIR). CLIR enables us to write a query (i.e., a search question) in, say, Japanese, and retrieve the relevant documents written in English (see left). We can also obtain the results in Japanese; in this case, the retrieved documents are translated automatically into Japanese (in the dashed circle).

A major difference between ordinary IR and CLIR is the language barrier, shown as a dashed line in the figure. There are several ways to achieve CLIR, which differ in how they overcome this barrier, and much research has been conducted to improve CLIR performance.

3. Multi-lingual thesaurus

[Figure: Multi-lingual thesaurus]

Resources such as dictionaries are useful for overcoming the language barrier. Multi-lingual thesauri are another useful resource for CLIR. A multi-lingual thesaurus is a set of words arranged according to their semantic relationships, as shown in the figure on the right, and it has the following features:

  • Includes words in several different languages.
  • Has a tree structure in which the upper levels correspond to more abstract concepts.
  • Places words with similar meanings near one another.

 

Because a multi-lingual thesaurus makes it possible to obtain similar words regardless of their language, it is often used in CLIR.
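The three features listed above can be sketched as a small tree data structure. This is only a toy illustration: the `Node` class, the helper functions, and the sample entries are hypothetical, and the Japanese words are romanized for readability.

```python
class Node:
    """One concept in a toy thesaurus tree."""

    def __init__(self, concept, words, children=()):
        self.concept = concept          # label of the concept
        self.words = words              # language code -> list of words
        self.children = list(children)  # more specific concepts


def find_path(node, word, path=()):
    """Return the nodes from the root down to the node listing `word`, or None."""
    path = path + (node,)
    if any(word in ws for ws in node.words.values()):
        return path
    for child in node.children:
        found = find_path(child, word, path)
        if found is not None:
            return found
    return None


def nearby_words(root, word):
    """Collect words in any language at the word's node and at its sibling nodes."""
    path = find_path(root, word)
    if path is None:
        return []
    # Siblings share the same parent; a word at the root has no siblings.
    nodes = path[-2].children if len(path) > 1 else [path[-1]]
    found = {w for n in nodes for ws in n.words.values() for w in ws}
    found.discard(word)
    return sorted(found)


# Hypothetical sample thesaurus: upper levels are more abstract concepts.
thesaurus = Node("thing", {}, [
    Node("animal", {"en": ["animal"], "ja": ["doubutsu"]}, [
        Node("dog", {"en": ["dog"], "ja": ["inu"]}),
        Node("cat", {"en": ["cat"], "ja": ["neko"]}),
    ]),
    Node("vehicle", {"en": ["vehicle"], "ja": ["norimono"]}, [
        Node("car", {"en": ["car", "automobile"], "ja": ["kuruma"]}),
    ]),
])

print(nearby_words(thesaurus, "inu"))
```

Looking up the Japanese word "inu" returns its English translation together with the words of the neighbouring concept, in both languages, which is how such a structure could support query expansion across languages.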

4. Automatic construction of multi-lingual thesaurus

Although multi-lingual thesauri are useful resources for CLIR, they are expensive to construct because it is difficult to classify words in several different languages by the same criteria. To address this issue, we are researching methods for constructing multi-lingual thesauri automatically.

Conventional approaches construct multi-lingual thesauri from parallel corpora (large collections of sentences in two languages, in which each sentence is paired with its translation). These approaches, however, have a disadvantage: although they can extract relations between words in different languages, they cannot directly extract relations among words in the same language. In the example mentioned above, we can obtain relationships between Japanese and English words, but none among the Japanese words or among the English words.

To deal with this problem, we proposed a new method for automatically constructing a Japanese-English thesaurus in which the similarity between any two words can be calculated. This method prepares ``term pairs'', each consisting of a Japanese word and an English word that are translations of each other. These pairs are used as common features to build vectors corresponding to the entries in Japanese-Japanese and English-English dictionaries. As a result, the words of both languages are located in a single vector space, and the similarity between any two words can be calculated regardless of their languages.
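The idea of using term pairs as shared vector dimensions can be sketched as follows. This is one possible reading of the description above, not the group's actual implementation. It assumes that each dictionary entry is represented by counting how often the words of each term pair occur in its definition, and that similarity is measured by the cosine of the angle between vectors. All term pairs and dictionary entries are hypothetical, and the Japanese words are romanized.

```python
import math

# Hypothetical term pairs: (Japanese word, English word).
TERM_PAIRS = [
    ("doubutsu", "animal"),
    ("hoeru", "bark"),
    ("norimono", "vehicle"),
    ("enjin", "engine"),
]

# Hypothetical monolingual dictionaries: headword -> tokens of its definition.
JA_DICT = {
    "inu": ["hoeru", "doubutsu"],
    "kuruma": ["enjin", "de", "ugoku", "norimono"],
}
EN_DICT = {
    "dog": ["an", "animal", "that", "can", "bark"],
    "car": ["a", "vehicle", "with", "an", "engine"],
}


def vectorize(definition, side):
    """Count occurrences of each term pair's word on one side (0 = Japanese, 1 = English)."""
    return [definition.count(pair[side]) for pair in TERM_PAIRS]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Words of both languages end up in the same space, one dimension per term pair.
vectors = {word: vectorize(tokens, 0) for word, tokens in JA_DICT.items()}
vectors.update({word: vectorize(tokens, 1) for word, tokens in EN_DICT.items()})

print(cosine(vectors["inu"], vectors["dog"]))     # across languages
print(cosine(vectors["inu"], vectors["kuruma"]))  # within one language
```

Because every vector uses the same term-pair dimensions, the same similarity function applies to Japanese-English, Japanese-Japanese, and English-English word pairs alike.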