Repetition of Content


What is it about all these social network sites that makes me slightly impatient? Today I found the news of the leaked iPhone manual on at least 10 different sites. Google will index all of these sites without being able to tell that they all cover the same content: the leaked iPhone manual! This is what I call stupid software. It may be good for Google to have more links on the Internet, but it is definitely not a great user experience to find the same content over and over again. Smart software should be able to recognize similar content in different documents, and it should be able to do so across more than one language (and no, I do not mean translating the document into another language, I mean cross-language content recognition). Speaking of cross-language search, I would like to mention the following:

Cross-language Search: What’s it all about?

The term “cross-language search” is used in many different senses:

1. Some search engine providers claim to support multilingual or cross-language search if they can handle and index documents written in different languages. Such engines simply look for exact occurrences of the entered search terms, e.g. “war” finds English documents referring to military actions, and it also finds German documents containing “war” in the sense of “was” (i.e. a meaningless glue word).

2. Other search engines (see, e.g., http://www.google.com/intl/en/press/annc/translate_20070523.html) provide a tool that translates a query into another language of the user’s choice and then submits the translated query text. This is certainly progress and can be useful in some specific situations, e.g. if one is looking for a hairdresser in Paris.

Shortcomings:
– If one is looking for “member of the board” and “SAir Group” (Swissair) and searches for German documents, the translated query “Mitglied des Brettes” and “SAir Gruppe” won’t provide any results. If “member of the board” is replaced by “Aufsichtsrat”, some documents are found, but the query still misses those that use “Verwaltungsrat” or “Verwaltungsräte”, the terms commonly used in connection with the SAir case.
– For information research and intelligence services the above-mentioned method does not help because it is not able to compare and rank documents written in different languages.

3. A true cross-language search is possible only if the search engine is able to recognize the thematic content, i.e., if the system realizes that the English translation of a French (or a German etc.) document is equivalent to the original document. This advanced technique is implemented in http://www.infocodex.com. It simultaneously finds documents in all supported languages, without the need for a cumbersome (and arbitrary) translation into each other language. Because of the cross-language content recognition and a well-founded similarity measure, the documents can be ordered by their relevance with respect to the query.
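To make the difference between the first and the third sense a bit more tangible, here is a minimal Python sketch. It is not how infocodex or any real search engine works: the tiny `CONCEPTS` dictionary, the concept names, the sample documents and the Jaccard overlap used as the similarity measure are all assumptions made up for illustration. It only shows the idea that mapping words to language-independent concepts lets documents in different languages be compared and ranked together, while exact term matching stumbles over “war”.

```python
import re

# Hypothetical, hand-made concept dictionary (an assumption for this sketch).
# Each surface word maps to a language-independent concept ID. The German
# verb "war" ("was") is deliberately absent: it carries no topical meaning.
CONCEPTS = {
    "en": {"war": "WAR", "airline": "AIRLINE", "board": "BOARD_OF_DIRECTORS"},
    "de": {"krieg": "WAR", "fluggesellschaft": "AIRLINE",
           "verwaltungsrat": "BOARD_OF_DIRECTORS"},
}

# Made-up sample documents as (language, text) pairs.
DOCS = [
    ("en", "The war ended in 1945."),
    ("de", "Er war gestern in Paris."),
    ("de", "Der Verwaltungsrat der Fluggesellschaft trat zurück."),
    ("en", "The airline board resigned."),
]


def tokens(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def exact_match(term, text):
    """Sense 1: the term matches if it appears literally in the text."""
    return term.lower() in tokens(text)


def concepts(text, lang):
    """Map a text to the set of concept IDs found via the dictionary."""
    table = CONCEPTS[lang]
    return {table[t] for t in tokens(text) if t in table}


def similarity(a, b):
    """Jaccard overlap of two concept sets (one possible similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank(query, query_lang, docs):
    """Sense 3: score every document, whatever its language, against the query."""
    q = concepts(query, query_lang)
    scored = [(similarity(q, concepts(text, lang)), text) for lang, text in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Exact matching hits both the English and the German "war" sentence.
    print([text for _, text in DOCS if exact_match("war", text)])
    # Concept matching finds the German and the English board document together.
    print(rank("airline board", "en", DOCS))
```

With exact matching, the query “war” hits both the English sentence and the unrelated German one. With the concept lookup, the English query “airline board” retrieves the German “Verwaltungsrat” sentence alongside the English one, without translating the query at all. A dictionary lookup like this is of course far too crude for real documents; it only illustrates why content recognition, rather than term matching or query translation, is what allows a single ranking across languages.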
