Calibration data

To verify that the topic detection algorithm as a whole functions correctly, a test data set was created to represent the ideal topic segmentation task for the TextTiling algorithm: one word repeated many times, followed by a different word repeated many times, and so on. A snippet of this data is shown in figure 5.1. Because the word frequencies change completely at each boundary (zero cohesion between the `topics'), the algorithm should work optimally on this data. The text was `evaluated' manually and the results, presented in graph form, are shown in figure 5.2. The red lines represent the hand-marked topic changes (in this case, the actual boundaries between one repeated word and the next) and the green lines represent the breaks located automatically by the system. Of interest is the peak between marked breaks 2 and 3, representing the third located topic. It is lower than the other peaks in this example because the topic is too short (in terms of word count) to register fully under the TextTiling algorithm; that is, the whole topic is somewhat smaller than the rolling window used in the algorithm.
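The effect of a short topic on the peak height can be illustrated with a minimal sketch. It is not the system's implementation: it assumes a simplified comparison in which two adjacent blocks of `block` tokens are compared by cosine similarity at every token gap, and it omits the smoothing and depth scoring of the full TextTiling algorithm. The function names, the block size and the topic lengths are hypothetical values chosen for illustration.

```python
from collections import Counter
from math import sqrt


def make_calibration_tokens(topics):
    """Build the synthetic text: each (word, count) pair becomes
    `count` repetitions of `word`. Lengths here are illustrative only."""
    tokens = []
    for word, count in topics:
        tokens.extend([word] * count)
    return tokens


def cosine(left, right):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(left[t] * right[t] for t in left.keys() & right.keys())
    norm = sqrt(sum(v * v for v in left.values())) * sqrt(
        sum(v * v for v in right.values())
    )
    return dot / norm if norm else 0.0


def gap_scores(tokens, block):
    """Similarity of the `block` tokens before each gap to the `block`
    tokens after it. Returns a dict mapping gap position to score.
    (Assumed simplification of the rolling-window comparison.)"""
    scores = {}
    for gap in range(block, len(tokens) - block + 1):
        before = Counter(tokens[gap - block:gap])
        after = Counter(tokens[gap:gap + block])
        scores[gap] = cosine(before, after)
    return scores


if __name__ == "__main__":
    # Hypothetical topic lengths: the third topic is shorter than
    # two adjacent blocks (2 * 10 tokens), so no gap lies fully inside it.
    topics = [("one", 40), ("two", 40), ("three", 12), ("four", 40)]
    scores = gap_scores(make_calibration_tokens(topics), block=10)
    print("inside a long topic :", scores[20])
    print("at a true boundary  :", scores[40])
    print("best in short topic :", max(scores[g] for g in range(81, 92)))
```

Under these assumptions, a gap in the middle of a long topic scores the maximum similarity and a gap at a true boundary scores zero, while every gap inside the short topic has at least one block overlapping an adjacent topic, so its best score stays below the maximum. This mirrors the lower peak described above, although the quantity plotted in figure 5.2 may be computed differently.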

Figure 5.1: Test word set for system evaluation
\begin{figure}\texttt{\ldots one one one one one one one one one one one one one \ldots
\ldots five five five five five five five five five five five five \ldots}\end{figure}

Figure 5.2: Full system graph output on the test word set
\begin{figure}\centering
\epsfig{file=graphs/testwords,width=1\textwidth}\end{figure}

James Ballantine 2005-02-19