A Cross-Language Information Retrieval System Based on Word Sense Disambiguation and Document Classification
Keywords: Information Retrieval, Machine Translation, Associate Search
This paper describes a Cross-Language Information Retrieval (CLIR) system that translates an English query, in a context-dependent manner, into the target languages Dutch, French, German, Italian, and Spanish, in order to retrieve appropriate documents in each target language in addition to the documents in English. To translate correctly, the system applies a Word Sense Disambiguation (WSD) method. Because the disambiguation relies on the context information of the query, the query has to be at least a paragraph long.
The system performs the following consecutive steps:
- WordNet's morphological analysis of each word form to obtain lemmata (a word that is not found in the dictionary remains unchanged and is treated as a proper noun).
- WSD for polysemous/homonymous words (for all monosemous lemmata, the most frequent translation is chosen).
- Information retrieval with each translated document.
The system presents the retrieved collections to the user in separate windows, one for each language.
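The first two steps above can be illustrated with a minimal sketch. It is not the system's implementation: the tables LEMMAS and LEXICON, the function names, and the choose_reading callback are hypothetical stand-ins for WordNet's morphological analysis, the multilingual lexicon, and the WSD component, and the retrieval step is left out.

```python
from typing import Callable, Dict, List

# Hypothetical toy data; a real system would rely on WordNet's
# morphological analysis and a full multilingual lexicon.
LEMMAS: Dict[str, str] = {"banks": "bank", "rivers": "river"}

# lemma -> list of readings; each reading maps a language code to its
# translations, ordered from most to least frequent (an assumption).
LEXICON: Dict[str, List[Dict[str, List[str]]]] = {
    "river": [{"de": ["Fluss", "Strom"]}],
    "bank": [{"de": ["Bank"]}, {"de": ["Ufer"]}],
}


def lemmatize(word: str) -> str:
    """Return the lemma; an unknown word is returned unchanged."""
    lower = word.lower()
    if lower in LEMMAS:
        return LEMMAS[lower]
    if lower in LEXICON:
        return lower
    return word  # not in the dictionary: treated as a proper noun


def translate_query(
    words: List[str],
    lang: str,
    choose_reading: Callable[[str, List[str]], int],
) -> List[str]:
    """Translate a tokenized query word by word into one target language."""
    translated = []
    for word in words:
        lemma = lemmatize(word)
        readings = LEXICON.get(lemma)
        if readings is None:
            translated.append(lemma)  # proper noun, kept as is
        elif len(readings) == 1:
            # Monosemous lemma: take the most frequent translation.
            translated.append(readings[0][lang][0])
        else:
            # Polysemous lemma: let the WSD component pick the reading.
            index = choose_reading(lemma, words)
            translated.append(readings[index][lang][0])
    return translated
```

In this sketch the WSD component is passed in as a function, so any disambiguation strategy can be plugged in without changing the translation loop.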
To achieve Word Sense Disambiguation, the system is provided with a machine-readable multilingual lexicon. For each entry, the individual translations in the target languages and an abstract definition of the word are specified. The WSD of a word x with n readings works as follows. Each of the n definitions is compared to the context of x in the query (by an associate search component). The reading whose definition matches most closely is assumed to yield the most suitable translation.
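The comparison of definitions with the query context can be sketched as follows. The associate search component is not specified here, so this example swaps in a plain word-overlap count as the similarity measure; the stopword list, the lexicon entries for "bank", and the function names are likewise assumptions made for illustration.

```python
import re
from typing import Dict, List, Set

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "that", "for", "is"}


def content_words(text: str) -> Set[str]:
    """Lowercase the text and keep the alphabetic tokens that are not stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def disambiguate(context: str, readings: List[Dict]) -> Dict:
    """Return the reading whose definition shares the most words with the context."""
    ctx = content_words(context)
    # max() returns the first maximum, so ties fall back to the earliest reading.
    return max(readings, key=lambda r: len(ctx & content_words(r["definition"])))


# Hypothetical lexicon entry: one definition and its translations per reading.
BANK = [
    {
        "definition": "a financial institution that accepts deposits and lends money",
        "translations": {"de": "Bank", "fr": "banque"},
    },
    {
        "definition": "sloping land beside a river or lake",
        "translations": {"de": "Ufer", "fr": "rive"},
    },
]

query = "We walked along the bank of the river and watched the water of the lake."
best = disambiguate(query, BANK)
print(best["translations"]["de"])  # Ufer
```

A longer query context gives the overlap count more evidence to work with, which is in line with the requirement that the query be at least a paragraph long.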
In the paper we present an evaluation of our approach in terms of precision and recall. Furthermore, we compare it to a machine-translation-based approach (SYSTRAN) for generating the target-language queries.
Besides translating words into the individual target languages, the system provides a domain-independent CLIR approach that allows cross-language retrieval, for instance on the internet.