ApolloKnowledge
Intelligent Text Research with Apollo
Key Data about the Scope of the Proposal
The following list illustrates Apollo's intelligent and unique method of text search. Standard functions such as searching for numerical information, dates and times, or exact matches are not covered here, although Apollo handles these as well.
1. Fault-tolerant Search
All words or phrases in a given text that are similar to a keyed-in word or phrase are identified. This covers not only similar words but also all word forms (singular, plural) and even misspelled words. Optionally, these functions can be combined with a language-dependent grammar check that favors key words with the same word stem.
Note: all hits are always listed, but the order of the listing (the “ranking”) can be influenced by additional selection criteria.
The word stem check is optional because it does not make sense for proper names. Whether the analysis is performed depends on the search results: if a keyed-in word is identified as a proper name (proper noun, geographical location, city, …), no word stem check is done.
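The fault-tolerant matching described above can be illustrated with a minimal sketch. It is not Apollo's actual algorithm, which this document does not specify. It assumes that similarity can be approximated by the character-based ratio from Python's standard `difflib` module. The function name `fuzzy_hits` and the `cutoff` value are hypothetical choices made only for this illustration.

```python
import difflib
import re

def fuzzy_hits(query, text, cutoff=0.75):
    """Return (word, similarity) pairs for words in `text` that resemble `query`.

    Assumption: difflib's ratio stands in for the unspecified similarity measure.
    Results are ordered best match first; ties are broken alphabetically.
    """
    words = set(re.findall(r"[a-zA-Z]+", text.lower()))
    q = query.lower()
    scored = []
    for word in words:
        ratio = difflib.SequenceMatcher(None, q, word).ratio()
        if ratio >= cutoff:
            scored.append((word, ratio))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

# Usage: the exact word, its plural form, and a misspelling are all found.
text = "The contracts were signed. One contract was mispelled as contrakt."
hits = fuzzy_hits("contract", text)
for word, similarity in hits:
    print(f"{word}: {similarity:.2f}")
```

In this sketch the ranking is driven by the similarity value alone. The additional selection criteria mentioned in the note above would adjust the order of the list without removing any hit from it.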
2. Application of Syntax Information
When a phrase is entered (i.e. a number of distinct words), a syntactical check of the phrase is started automatically. This analysis can resolve ambiguities, for example whether a word is an adjective, noun, or verb, by looking at the preceding word, and it can tag additional information onto the word or phrase. Words like “not” or “doesn’t” are then used to further refine the search and the resulting ranking.
Note: If a whole sentence is entered, this information is extracted already at the time of entry because the sentence is processed as one phrase; its syntactic and semantic properties are recorded with it.
3. Content-specific Search
The first step is to enter a key word or, more specifically, a phrase, which can be as long as a complete sentence. A pre-trained associative database is then searched for phrases of similar meaning. These phrases are used together with syntax-dependent terms for the actual search. Inaccuracies in spelling are also taken into account. The relative weighting between syntax-dependent and associative search can be adjusted by setting the weighting parameter to a value between 0.00 (syntax-dependent search only) and 1.00 (predominantly associative search).
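One way to picture the weighting parameter is as a linear blend of two scores. The following is a sketch under stated assumptions, not Apollo's documented formula: it assumes that each hit carries a syntax-dependent score and an associative score, both between 0 and 1, and that they are mixed linearly. The names `combined_score` and `rank` are hypothetical. Note also that in this simplified blend a weight of 1.00 yields a purely associative ranking, whereas the text above describes 1.00 as "predominantly" associative.

```python
def combined_score(syntax_score, associative_score, weight):
    """Blend two scores; weight=0.00 uses syntax only, weight=1.00 associative only.

    Assumption: a plain linear interpolation, chosen only for illustration.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0.00 and 1.00")
    return (1.0 - weight) * syntax_score + weight * associative_score

def rank(hits, weight):
    """Sort (doc_id, syntax_score, associative_score) tuples, best first."""
    return sorted(
        hits,
        key=lambda hit: combined_score(hit[1], hit[2], weight),
        reverse=True,
    )

# Usage: the same two hits swap places as the weight moves from 0.00 to 1.00.
example_hits = [("doc_a", 0.9, 0.2), ("doc_b", 0.3, 0.8)]
print([doc for doc, _, _ in rank(example_hits, 0.0)])
print([doc for doc, _, _ in rank(example_hits, 1.0)])
```

Because every hit receives a score for any weight, changing the parameter only reorders the list and never drops a hit.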
4. Self-organized Training of Texts
In principle, any text in any language can be trained. For reasons of efficiency, language dependencies should be taken into consideration:
a) completely unknown language
This means there is no commonsense knowledge available. The necessary commonsense knowledge is obtained during the training phase. In this case it is limited to the words and phrases in their various forms of spelling, basic sentence structures, and statistics about their frequency of occurrence. This information is already sufficient for functions like the automatic recognition of “stop words” (e.g. and, a, an), the identification of rudimentary syntax information like adjective-noun correlations, and other simple text meta-information. At this level, sentences can only be identified by formatting information like punctuation marks and paragraphs.
b) known language but incomplete commonsense knowledge
Here, new words or phrases are also inserted into the commonsense knowledge, but the existing procedural and lexical knowledge can be used to tag words with additional characteristics. For example: when a language uses articles (this is procedural information), it stands to reason that a word preceded by an article and followed by a verb is a noun. By applying a morphological lexicon and using disassembling algorithms to strip prefixes and suffixes, the word stem can be isolated and extracted. When larger amounts of data are trained, the commonsense knowledge base is automatically broadened, provided the texts are meaningful. This in turn improves phrase search behavior, because the newly acquired commonsense knowledge is applied.
c) complete commonsense knowledge
In this case, in addition to word-based and syntactical information about the language’s grammar, there is also associative knowledge present. Now all the capabilities of the above-mentioned associative search can be applied to its best advantage: hitherto unknown words and phrases in new training texts can be identified, which can also be used to expand and improve the procedural knowledge.
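For case a), the idea that frequency statistics alone can reveal stop words can be sketched as follows. This is an illustration under assumptions, not Apollo's training procedure: sentences are split only on punctuation marks and blank lines, as described for an unknown language, and a word counts as a stop word candidate when it occurs in a large share of all sentences. The function names and the `min_sentence_share` threshold are hypothetical.

```python
from collections import Counter
import re

def sentences(text):
    """Split on punctuation marks and paragraph breaks; no language knowledge used."""
    return [s.strip() for s in re.split(r"[.!?]+|\n\s*\n", text) if s.strip()]

def stop_word_candidates(text, min_sentence_share=0.6):
    """Return words that occur in at least `min_sentence_share` of all sentences."""
    sents = sentences(text)
    sentence_freq = Counter()
    for sentence in sents:
        # Count each word once per sentence so long sentences do not dominate.
        sentence_freq.update(set(re.findall(r"\w+", sentence.lower())))
    return sorted(
        word
        for word, count in sentence_freq.items()
        if count / len(sents) >= min_sentence_share
    )

# Usage on a tiny sample; real training texts would be far larger.
sample = (
    "The cat sat on the mat. The dog chased the cat. "
    "A bird saw the dog. The mat was red and the bird was blue."
)
print(stop_word_candidates(sample))
```

On a corpus this small the threshold is arbitrary. With realistic amounts of training text, the gap between function words and content words becomes much clearer, and words such as "and", "a", or "an" would be expected to pass the threshold as well.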
5. Training of Associative Knowledge
To acquire this knowledge, the system is presented with texts that are, in general, grammatically correct and meaningful and that deal with a particular topic. In addition to general knowledge of the kind found in an encyclopedia, more specific information about a certain area of interest, such as jurisprudence, natural sciences, or astronomy, can be trained. This knowledge improves and broadens the associative search capabilities.
Note: Naturally, this also increases the topic-specific vocabulary which in turn leads to improved search capabilities.

