Matching a Fixed Classification System with InfoCodex

1. Statement of the Problem
A company has a fixed, predefined classification system. Documents should be categorized according to this classification system, overriding the self-organizing categorization performed by the constrained Kohonen algorithm used in InfoCodex.

Example of a given classification system:

The categories at level 1 correspond to main topics, whereas the sub-categories (level 2, 3 etc.) should be represented by individual neurons in the information map of InfoCodex.

The given classification system may also contain some descriptors that describe the corresponding category (column D “Category description” in the example shown above).

In addition to the category description, a set of documents should be available whose target categories are known in advance and that could be used for the training of the map (“Learn documents”).

After the map has been created and trained according to the given classification system and the “Learn documents”, InfoCodex should be able to classify new documents into the existing classification system, i.e. to automatically assign target categories for the new documents with high accuracy.

2. Solution with InfoCodex
When setting up a new collection, the given classification system can be assigned to the collection. In this case, InfoCodex constructs the information map exactly according to the given classification system, i.e.

● the categories on level 1 become main topics (“Container/tank”, “Water heaters” etc.)
● the sub-categories (level 2, 3 etc.) are each represented by a neuron.

The neuron labels (lower right corner of the map) display the given category code, the category name and the first few category descriptors.

For the “Learn documents” used to train the map, the target categories must be supplied in an Excel table that contains at least one column with the file name and a second column with the target category. This training information can be assigned in the field “Metadata instructions” of the form for setting up the collection.
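As a rough illustration of what such a training table contains, the following sketch reads a CSV export of the Excel sheet and builds a mapping from file name to target category. The column names `file_name` and `target_category`, the sample rows and the category codes are assumptions made for this example; the layout actually expected by InfoCodex is whatever is declared in the “Metadata instructions” field.

```python
import csv
import io

# Hypothetical CSV export of the training table.
# Column names and category codes are assumed for illustration only.
SAMPLE_TABLE = """file_name,target_category
boiler_manual.pdf,2.1
tank_datasheet.doc,1.3
heater_faq.txt,2.1
"""

def load_training_table(stream):
    """Return a dict {file name: target category} read from a CSV stream."""
    mapping = {}
    for row in csv.DictReader(stream):
        name = row["file_name"].strip()
        category = row["target_category"].strip()
        if not name or not category:
            continue  # skip rows that lack a file name or a category
        mapping[name] = category
    return mapping

training = load_training_table(io.StringIO(SAMPLE_TABLE))
for name, category in training.items():
    print(f"{name} -> {category}")
```

Keeping the table as a plain file-name-to-category mapping makes it easy to check for missing or duplicated entries before the collection is set up.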

3. Matching of New Documents into the Given Classification System
With the new function, the following objectives can be achieved for any new set of documents:

● The automatic classification of the new documents according to the given classification scheme
● The automatic generation of keywords and abstracts for the new documents

Multiple matching is also supported, i.e. the assignment of a document to more than one category. The results of the matching process are presented in a list as shown below.

Matching Advertising Banners to Specific Texts

Initial Situation
A large number of advertising banners is available (e.g. 1,000 to 1 million), and each banner comes with a few keywords or a block of text (in German, English, French, Italian or Spanish).

Objective
For any given document (in German, English, French, Italian or Spanish), the advertising banners whose content best matches that document should be found.

Possible Solution with InfoCodex
In a first step, the given advertising banners are analyzed for content on the basis of their keywords or short texts and are automatically placed in a logically structured information map (“virtual bookshelf”). Banners with similar content are filed in the same compartment. InfoCodex builds this structure without human intervention, although it can be influenced if necessary.

Documents that arrive sporadically (and for which matching banners are to be found) are analyzed for content as they come in and “placed” in the information map using a well-founded similarity measure. The result is a short list of the best-matching advertising banners:
Banner 37: 95% relevance
Banner 2021: 92% relevance
Banner 195: 87% relevance
etc.
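
A ranked list of this kind can be sketched, in a heavily simplified form, with a plain bag-of-words cosine similarity. This is not the semantic, cross-language similarity measure used by InfoCodex; the function names, the banner keywords and the sample document are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Lower-cased bag-of-words term counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_banners(document, banners, top_n=3):
    """Return the top_n (banner id, similarity) pairs, best match first."""
    doc_vec = term_vector(document)
    scored = [(banner_id, cosine(doc_vec, term_vector(text)))
              for banner_id, text in banners.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical banner keywords, for illustration only.
banners = {
    37: "solar water heater energy saving",
    195: "garden furniture summer sale",
    2021: "heat pump boiler installation",
}
document = "How to choose an energy saving water heater for your home"

ranking = best_banners(document, banners)
for banner_id, score in ranking:
    print(f"Banner {banner_id}: {score:.0%} relevance")
```

Because this sketch only compares literal words, it would miss a match between, say, a German document and an English banner; handling that case is exactly what the content analysis described above is meant to provide.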

Technical Details
The InfoCodex software components are available as API modules and can also be offered as web services.

The software runs on Windows, Linux (Debian, SUSE, Red Hat) or Unix (Solaris, IBM AIX, HP-UX).

Further Documentation
The proposed solution corresponds in principle to the classification of new documents into a given classification scheme, as described in the enclosed document “Matching a Fixed Classification System with InfoCodex”.

“Our system can recognize the content of a web page”

From the Frankfurter Allgemeine Zeitung: an interview with David Crystal, conducted by Holger Schmidt.

InfoCodex the Better Tool for Competitive Intelligence

Why cross-language semantic search technology is essential for Competitive Intelligence (CI)

In today’s highly competitive marketplace, companies need to learn of their competitors’ important moves and developments without delay. The overload of information freely available on the internet (news feeds, patent registrations, press releases, etc.), spread across a variety of sources and languages, can only be managed with sophisticated automated tools that understand the meaning of documents and consolidate the gathered information into a comprehensive daily update for the intelligence officer:

• Recognizing similar content from various sources:
InfoCodex automatically recognizes the similarity of content (even across different languages) and returns a consolidated overview of the new information. Example: the launch of the new Apple iPhone was covered by thousands of news sources all over the world, although the actual content (the fact that Apple is launching a new cellular phone) was the same everywhere.

• Diffs: Sometimes the details matter most, so it is important to have a monitoring tool that also detects even the smallest changes in specific sources. Example: price changes or feature enhancements (in these cases the rest of the content of the corresponding files often remains the same).

• Automatic abstract generation: To scan new facts as efficiently as possible, InfoCodex automatically generates abstracts of the documents. By means of a user-specific filter and alert function, abstracts of particular interest can automatically be extracted or forwarded.

• Analyzing the search results of Google, Yahoo or any other search engine: InfoCodex creates a Heat Map based on the content similarity of the documents and groups similar documents into document families. Google or Yahoo, by contrast, do not recognize similar documents and simply return a longer list of results.

You can tell InfoCodex how deep it should dig into links and how many documents it should analyze.
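
The change monitoring described under “Diffs” above can be illustrated with Python's standard `difflib` module. This is only a sketch of the general idea of line-level change detection between two snapshots of a source, not a description of how InfoCodex implements it; the function name `changed_lines` and the sample product page are assumptions.

```python
import difflib

def changed_lines(old_text, new_text):
    """Return only the removed ('- ') and added ('+ ') lines between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]

# Hypothetical snapshots of the same product page on two consecutive days.
previous = "Model X200\nPrice: 499 USD\nBattery: 10 h"
current = "Model X200\nPrice: 449 USD\nBattery: 10 h"

changes = changed_lines(previous, current)
for line in changes:
    print(line)
```

Only the price line is reported, while the unchanged model name and battery line are suppressed, which is the behaviour one wants when most of a page stays the same.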

See for yourself: http://demo.infocodex.com/ic-new.html

Security Gaps in Search Engines

Theories and allegations are one thing – but it is functionality in practice that counts.

Suppose your documents have been indexed by Google Search Appliance. Run any search and note, e.g., the seventh search result. Then change the access rights for this document so that your user account no longer has read access to it. Now submit the same search again and see what happens…

Enterprise Intelligence

I just watched a piece on PBS.org, where I came across the name of Jeff Jonas and his post about Enterprise Intelligence. I believe he would be interested in the InfoCodex technology.

What Does InfoCodex Do with the Indexed Documents?

1. Language detection: German / English / French / Italian / Spanish

2. Translation into a single common language

3. Lexical and semantic analysis: recognizing content
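
Step 1 of the list above (language detection) can be illustrated with a deliberately naive stop-word count. This is a toy sketch of one well-known approach and makes no claim about how InfoCodex detects languages; the tiny word lists and the function name `guess_language` are assumptions chosen for the example.

```python
import re

# Tiny illustrative stop-word lists; a real system would use far larger resources.
STOP_WORDS = {
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "mit"},
    "en": {"the", "and", "is", "not", "of", "with", "to", "a"},
    "fr": {"le", "la", "les", "et", "est", "pas", "un", "avec"},
    "it": {"il", "lo", "gli", "e", "non", "di", "che", "con"},
    "es": {"el", "los", "y", "es", "no", "de", "que", "con"},
}

def guess_language(text):
    """Return the language code whose stop words occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(word in stops for word in words)
              for lang, stops in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

german = guess_language("Der Hund ist nicht mit dem Auto gefahren")
english = guess_language("The cat is not on the table with a hat")
print(german, english)
```

Short texts such as banner keywords contain few stop words, so a counting approach like this one quickly becomes unreliable on them.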

The competition cannot do these things. Other search engines and enterprise search software must first be trained before content can be recognized and analyzed.

Why Doesn't Google Recognize Similar Documents?

Companies such as Apple or Microsoft generate considerable hype in the online world with new product releases. A single news item may therefore appear on a hundred different sites, each time interpreted slightly differently, without giving the user any real additional information.

An example: before Apple's WWDC in early June, a rumor circulated that Steve Jobs's keynote speech had surfaced. I subscribe to several RSS feeds and was able to read this rumor on at least 10 different websites. This also generates a very large number of links on Google, all carrying the same information.

With the similarity search of InfoCodex, the user can group documents with similar content, gaining a much better overview and fewer links than from Google. The user is considerably less confused, and the time saved in researching information is considerably greater.
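
The idea of collapsing near-identical reports into one group can be sketched with a simple word-overlap (Jaccard) measure and a greedy single-pass grouping. This is an assumption-laden toy, not the InfoCodex similarity search: the threshold of 0.5, the function names and the sample headlines are all invented for illustration.

```python
import re

def word_set(text):
    """Set of lower-cased words in a text."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Share of words two sets have in common (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_into_families(documents, threshold=0.5):
    """Greedy grouping: a document joins the first family whose first member
    is similar enough; otherwise it starts a new family."""
    families = []         # list of lists of document ids
    representatives = []  # word set of each family's first member
    for doc_id, text in documents.items():
        words = word_set(text)
        for family, rep in zip(families, representatives):
            if jaccard(words, rep) >= threshold:
                family.append(doc_id)
                break
        else:
            families.append([doc_id])
            representatives.append(words)
    return families

# Hypothetical headlines from three sites, for illustration only.
documents = {
    "site_a": "Rumor: Steve Jobs keynote speech leaked before WWDC",
    "site_b": "Steve Jobs keynote speech leaked before WWDC, rumor says",
    "site_c": "Microsoft announces new release of its office suite",
}

families = group_into_families(documents)
print(families)
```

The two rumor reports end up in one family and the unrelated item in another, so the user sees two entries instead of three links.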

This is the heat map of a search collection, grouped by thematic content. Documents with similar content are grouped and displayed close to one another.