A Cross-Language Information Retrieval System Based on Word Sense Disambiguation and Document Classification

 

Keywords: Information Retrieval, Machine Translation, Associate Search

 

This paper describes a Cross-Language Information Retrieval (CLIR) system that translates an English query, in a context-dependent manner, into the target languages Dutch, French, German, Italian, and Spanish, in order to retrieve a collection of appropriate documents in each target language in addition to the documents in English. To obtain correct translations, the system applies a Word Sense Disambiguation (WSD) method. Because the disambiguation relies on the context information of the query, the query has to be at least one paragraph long.

 

The system performs the following consecutive steps:

- WordNet's morphological analysis of each word form to obtain its lemma (a word that is not found in the dictionary remains unchanged and is treated as a proper noun).

- WSD for polysemous/homonymous words (for all monosemous lemmata the most frequent translation is chosen).

- Information retrieval with each translated query.

The system presents the retrieved collections to the user in separate windows, one per language.
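
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the lemma table, the lexicon layout, the language codes, and the `choose_sense` and `search` callables are all hypothetical placeholders, and the translations of each reading are assumed to be listed most frequent first.

```python
from typing import Callable, Dict, List

# Hypothetical language codes for Dutch, French, German, Italian, Spanish.
TARGET_LANGUAGES = ["nl", "fr", "de", "it", "es"]


def lemmatize(word: str, lemma_table: Dict[str, str]) -> str:
    # Step 1: look up the lemma. Unknown word forms stay unchanged
    # and are treated as proper nouns.
    return lemma_table.get(word.lower(), word)


def translate_query(lemmas: List[str], lexicon: Dict[str, list],
                    lang: str, choose_sense: Callable) -> List[str]:
    translated = []
    for lemma in lemmas:
        senses = lexicon.get(lemma)
        if not senses:
            translated.append(lemma)  # not in the lexicon: keep as is
            continue
        # Step 2: disambiguate only when there is more than one reading.
        if len(senses) == 1:
            sense = senses[0]
        else:
            sense = choose_sense(lemma, senses, lemmas)
        # Assumption: translations are ordered most frequent first.
        translated.append(sense["translations"][lang][0])
    return translated


def run_pipeline(query: str, lemma_table: Dict[str, str],
                 lexicon: Dict[str, list], choose_sense: Callable,
                 search: Callable, languages=TARGET_LANGUAGES) -> Dict[str, list]:
    lemmas = [lemmatize(w, lemma_table) for w in query.split()]
    results = {}
    for lang in languages:
        target_query = translate_query(lemmas, lexicon, lang, choose_sense)
        # Step 3: retrieval with the translated query, one result set per language.
        results[lang] = search(lang, target_query)
    return results
```

Keeping `choose_sense` and `search` as parameters leaves the disambiguation method and the retrieval engine open, since the text does not fix either of them at this point.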

 

To achieve Word Sense Disambiguation, a machine-readable multilingual lexicon is provided to the system. For each entry, the translations into the individual target languages and an abstract definition of the word are specified. The WSD of a word x with n readings works as follows: each of the n definitions is compared to the context of x in the query (by an associate search component). The reading whose definition matches the context most closely is assumed to yield the most suitable translation.
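
The comparison of definitions against the query context can be illustrated with a small sketch. The associate search component is not specified here, so the code below substitutes a plain word-overlap count between each definition and the context; the stop word list, the lexicon entry format, and the example definitions are assumptions made only for this illustration.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "that", "the", "to"}


def tokens(text: str) -> set:
    # Lowercased content words of a text.
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def disambiguate(word: str, senses: List[Dict], context: str) -> Dict:
    # Compare every definition with the context of the word and return
    # the reading with the largest overlap (stand-in for associate search).
    context_tokens = tokens(context) - {word.lower()}

    def score(sense: Dict) -> int:
        return len(tokens(sense["definition"]) & context_tokens)

    # On a tie, max() keeps the first reading in the list.
    return max(senses, key=score)


# Hypothetical lexicon entry for the word "bank" with two readings.
bank_senses = [
    {"definition": "a financial institution that accepts deposits of money",
     "translations": {"de": "Bank"}},
    {"definition": "sloping land beside a river or lake",
     "translations": {"de": "Ufer"}},
]

best = disambiguate(
    "bank", bank_senses,
    "We walked along the bank of the river and watched the water.")
```

With so little overlap to work from, short queries give such a scorer almost nothing to distinguish readings, which is consistent with the requirement that the query be at least a paragraph long.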

 

In the paper we present an evaluation of our approach in terms of precision and recall. Furthermore, we compare it to an approach that uses machine translation (SYSTRAN) to generate the target-language queries.
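
For reference, precision and recall for a single query can be computed as in the sketch below. It uses the standard set-based definitions; the function name and the document identifiers in the usage example are hypothetical and are not taken from the evaluation.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: four documents retrieved, three relevant overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```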

 

Besides translating words into the individual target languages, the system provides a domain-independent CLIR approach that allows cross-language retrieval, for instance on the Internet.