How Forum 2010 Works

Due to increasing interest in quadratic search algorithms (QSA), I felt it was necessary to provide a short overview, with special thanks to Andrej Bauer for orignally writing this document. QSA is an algorithm used by Forum 2010, developed at Carnegie Mellon University.

1. The theory behind the Quadratic Search Algorithm

QSA is an algorithm that combines ideas from computational linguistics and natural language processing, connectionist methods and category theory. It consists of two parts: a) the inference phase, which derives the meaning of a sentence, b) the executive agent, which composes a reply to the sentence. Each phase is described briefly in the following sections.

1.1. Connectionist Approach to Derivations in Lambek Calculi

Lambek Calculus is a structurally free logic which gives a good formal model for natural language understanding and generation. In a Lambek system, a sentence corresponds to a proposition, and its meaning corresponds to a proof of the proposition. Different proofs of the same proposition give alternate meanings of ambiguous sentences.

The propositions as types paradigm yields a correspondence between a propositional system and a lambda calculus -- variants of this are also known as the "Curry-Howard Isomorphism". Lambek calculus corresponds to a typed lambda calculus with subtypes. The more specific the type, the better the meaning of the sentence is preserved. For example, consider the sentence "Socrates is a man.". In the empty context, i.e., without any assumptions, the only meaningful type that can be assigned to Socrates is subject. If more information is available, e.g., if we include a story about Socrates's life in the context, then it is possible to derive a much more specific type, such as greek*human*male, where * stands for the intersection type. Thus, by including a large corpus of text in the context, the meaning of a sentence can be well parsed, assuming the text reveals any information about the sentence.

Having a large context can clog up the inference machine. In the Forum 2010 project, the context are primarily based on IRC logs and Usenet newsgroups. Conventional theorem proving techniques cannot be used in this case. At the expense of not deriving the most specific types, performance can be significantly boosted if an artificial neural network (ANN) is incorporated into the system. Connectionist systems are a well established Artificial Intelligence technique for iteratively converging on an optimal solution where closed form methods are either impossible or prohibitively expensive. Recent work has looked at how to implement symbolic reasoning in an ANN framework [3].

A sentence can be analyzed in a parallel bottom-up fashion. In the first pass single words are analyzed, then the types of phrases are derived, then clauses, and in the end the whole sentence. This gives an average time complexity O(N log N), where N is the number of words in the sentence, assuming N processors are available.

1.2. An Executive Agent Based on a SOMAD

When a sentence S is parsed into a typed lambda term s -- lambda terms correspond to proofs, types correspond to propositions -- a reply is constructed by an executive agent. This is done via a translation of lambda calculus into a category theory setting [2]. The equivalence of typed lambda calculi and Cartesian closed categories [1] turns the lambda term s into a morphism t. This section describes how the agent computes a reply.

The collection of all possible meanings forms a category Omega, with types as objects and meanings of sentences as morphisms. A given context C, such as the IRC logs, forms a reflective subcategory Sigma. Consider the monad T induced by it. Intuitively, this monad represents all possible responses to all possible sentences in the given context C. Consider an adjointness pair (A,X) that corresponds to the monad T. There is a whole spectrum of them, and each represents a state of mind. By choosing one of these pairs (A,X), we can simply compute a reply to t as A(X(t)).

We call these State Of Mind ADjointness pairs SOMAD's, also popularly known as "Forum 2010 personae". Each persona is just a SOMAD. The computation of a SOMAD is expensive, but it only has to be done once. In fact, a SOMAD can be incrementally updated. In Forum 2010, the neural nets that simulate the personae are incrementally trained on each day's Usenet feed, along with a basic regiment of recent IRC logs.

The simplest SOMAD is obtained by the Kleisli construction of adjointness (A,X) for monad T. This is known as the "Kleisli SOMAD", an unfortunate terminology since people tend to misunderstand it as "Kleisli persona" and then wonder why the Kleisli SOMAD does not behave like Kleisli. (Of course, a mathematical construction has nothing to do with the personality of its inventor!) The Kleisli SOMAD yields the statistically uncorrelated responses. It is in fact simulated by a randomly trained neural net. It is not computationally expensive, and thus usually gives very fast, and senseless, responses. On Forum 2010 the Kleisli SOMAD is known as the "System Monitor" persona, which was trained on random CMU Zephyr traffic.

On the other side of the spectrum is the Eilenberg-Moore construction, which represents a much different state of mind. It is the state of mind of an oracle that is allowed to draw conclusions from the knowledge represented by the whole monad T. In other words, in the Eilenberg-Moore adjointness, all facts represented by T are considered. Alas, this "omniscient" adjointness is non-constructible and undecidable, and consequently only of a theoretical value. One could say that it represents the thoughts of God, never to be known by man.

Other SOMAD's can be generated by various methods. See the implementation section 2.2 for discussion of Breen & Knight's work on this topic.

1.3. Natural Language Generation

The executive agent computes a reply in the form of a morphism in Omega. The morphism is translated back to Lambek calculus, via the lambda calculus, and standard natural language generation techniques are used for translating it into English (more precisely, it is the type of the reply that is translated into a sentence, not the reply iteself).

2. Implementation

Although the pieces of the theory have been known since the mid 80's, no one has ever tried to implement a QSA. Nicholas Breen and Chris Knight, both graduate students at CalTech, initially implemented a protoptype of QSA as a background application on the local network, dynamically redistributing its load onto workstations as they became idle. Initial results were promising enough to attract commercial interest; the production system now runs on a "farm" of sixty-four 700MHz Pentium IIIs provided by Intelleq, Inc. for the means of predicting stock prices.

2.1. Semantic Inference Machine

The semantic Inference Machine (SIM), is a large ANN that inferes the meaning of a given sentence in the context of Usenet groups, actions or quotes from IRC. Breen & Knight used the logs because they wanted SIM to work well on everyday spoken English. This turned out to be quite successful, because there is a high correlation among people who use Forum 2010 and those who chat as a hobby.

SIM uses techniques for rapid indexing and relevant information look-up similar to those in the Google WWW database. This significantly decreases search times and communication overload.

2.2. The Executive Agent

Breen & Knight examined several ways of SOMAD construction. They found that in practice two approaches worked best.

The first kind of SOMADs is computed from a large corpus of text of a known writer or philosopher. On Forum 2010 the following were used:

The second kind of SOMADs is the so-called multi-parametric persona SOMAD. In this case a SOMAD is determined by a number of parameters. Different settings of parameters give different personae. By using various optimization and best-fit techniques, the parameters are tuned to fit a given persona. This technique works well for personae for whom no large corpus of text is available, such as the "Bug-Eyed Earl" persona, based on the enigmatic Bug-Eyed Earl from the comic strip Red Meat.

The multi-parametric SOMADs are much more flexible and easier to generate. However, the criteria for setting the parameters are somewhat subjective, and do not always give best results. I am continuously experimenting with new techniques, and generate approximately one new persona every week, using Forum 2010 as a testing ground.

3. Results & Future Work

3.1. Results

I implemented QSA as a side project. It was only a matter of days, once it started to work, before I got the idea to put it on the web. The Internet offers researchers a unique opportunity to show the results of their work to the general public. I also added humor so that it would appeal to most people.

For results, it is best if you check it out yourself. Forum 2010 is located at

http://www.forum2010.org:2010/

3.2. Future Work

The possibilities are limitless but may be hard to attain. Future research will focus on three main points:
  1. How the choice of the database influences the performance of QSA. Would we get different replies if we used CNN news instead of Usenet, or #sexybuttcumchix from DALnet instead of #watertower, for example?
  2. Currently, Forum 2010 only replies to a sentence. It cannot lead a meaningful prolonged conversation. A group of part-time researchers works on ways to interactively adapt to the given sentences. If successful, this will make it possible to incrementally change the monad T in real-time, which is a basic requirement for a multi-way conversation.
  3. I hope that generation of multi-parametric persona SOMAD's can be successfully automated. This way they will be able to offer the Forum 2010 as a stand-alone self-adapting expert system. Such a system would have a wide range of applications -- from intelligent characters in computer games, to intelligent self-adapting computer tutors.

Acknowledgements

I would like to thank Sean Slattery for his insights on the connectionism aspects of the Semantic Inference Machine.

Michael Harkavy, with his experience in complexity theory, pointed out the correct interpretation of the Eilenberg-Moore SOMAD.

References

[1] J. Lambek, P.J. Scott: "Introduction to Higher Order Categorical Logic", Cambridge University Press, 1986

[2] S. Mac Lane: "Categories for the Working Matehmatician", Springer-Verlag, 1971

[3] Ajjanagadde, V.G.: "Unclear Distinctions lead to Unnecessary Shortcomings: Examining the rule vs fact, role vs filler, and type vs. predicate distinctions from a connectionist representation and reasoning perspective", 1994, in Proceedings of the Twelfth National Conference on Artificial Intelligence, 846-851