Hearst hypothesises that the higher frequency of certain words in a particular area of a document is an indicator of the location of a topic. When a word or group of words shows a change in local frequency at some location within a document, Hearst theorises that this may indicate that the current topic has changed. This is based on the assumption that slightly different vocabulary sets are required for discussing different topics. For example, a document discussing the possibility of life on other planets (the example document Hearst discusses in [6]) may mention the word `water' several times when discussing possible criteria for life, but few times or not at all in other sections, such as in areas discussing meteoric evidence of life.
However, there is no utility in comparing the frequency of non-topic words: Closed-class words and other `stop-words' are unlikely to indicate topic in any scenario. However, some words which may be good topic indicators in one situation are poor indicators in another due to pervasive use (for example, the word `Mozart' in an article on Mozart's music).
In order to locate these few topic-indicative words and prioritise them, a cosine similarity measure is used. A cosine measure compares two vectors of the same length, returning a single value between zero and one which represents the `similarity' of the two vectors. Visualising the vectors as
-dimensional vectors in
-space, the cosine measure gives the cosine of the angle between the two vectors. Thus, if the vectors have the same angle, the angle between them will be zero. This will result in a cosine measure of 1, since
. In contrast, vectors as dissimilar as possible will at maximum have an angular difference of
, and thus a cosine measure of 0.
This cosine similarity measure is used to compare groups of word frequency across the document. Each vector has a value for every word appearing in the document, with the value at that position containing the frequency of the word in the current section of the document. Thus, small segments of the document are compared based on their word frequency, and changes in frequency result in a low cosine measure score. This has the advantage of automatically removing words which are not good topic indicators: If a word is excessively frequent, it will have values of, for example, 14 and 23 in the two vectors under comparison. However, an uncommon word indicating a topic change may have values of 1 and 6, a proportionally much greater difference which will result in a much lower similarity score.
The equation giving the cosine measure similarity score to two same-length vectors is shown in equation 2.1. The two vectors in question are represented by
and
;
is the value at the current position in each vector (
standing for weight, in reference to the fact that the cosine similarity measure compares proportionally rather than by magnitude).
In order to compare different sections of a document using this cosine similarity measure, Hearst divides the document into pseudosentences, each containing the same number of words. The pseudosentences are grouped into logical units called blocks, comprising several pseudosentences; blocks are not fixed sets of pseudosentences, but rather a rolling window containing an exact number of pseudosentences.
To perform the similarity measure across the entire document, Hearst's system serially compares two overlapping blocks, moving across the document one pseudosentence at a time. The blocks overlap by one pseudosentence, guaranteeing at least some similarity. At each step, a cosine measure is performed to determine the similarity of the vectors representing the frequencies of the words in the two blocks. Where a word does not appear in either block (one from elsewhere in the document), both the frequency counts for that word will both be zero. This does not affect the cosine measure.
Hearst set the size of each pseudosentence to be 20 words (not including stop words, which are removed from the comparison), a number intended approximately to reflect the length of a typical sentence in the target documents. She then determined experimentally that the optimal block size is 6 pseudosentences. This, Hearst estimates, corresponds to the typical size of a paragraph in the target documents.
The list of similarity values for the entire document falls into a range between zero and one. When plotted as a line graph, the troughs in the graph represent the algorithm's detected areas of low topic cohesion--the locations most likely to contain topic breaks. Hearst's system implements a trough detection algorithm--not described in detail--which is calibrated to locate the same number of topic changes as a human locates in a document of the same number of paragraphs. Thus, it finds the
-best topic change locations according to the cosine measure.
Having located the topic breaks as indicated by the cosine measure, they are then moved to the nearest real paragraph break. Hearst assumes that topic changes can only occur at paragraph boundaries (that is, that authors will always start a new conscious topic at the start of a new paragraph). Thus a degree of inaccuracy or `fuzziness' to the results of the algorithm is not critical, as errors of less than half a paragraph-length will be corrected. At least, it will be corrected in terms of the results of comparison to human annotation, as it appears that Hearst told the annotators to mark topic changes at paragraph boundaries only.
Hearst's evaluation was based on human annotation of the target documents. With a base-line precision and recall of 0.43 and 0.42, the system achieved 0.66 and 0.61, compared to the average human judge's precision and recall of 0.81 and 0.71. The base-line values were measured from a random allocation of breaks on paragraph boundaries, at a rate consistent with the average human judgement of the number of breaks per paragraph (41%)--in this way similar to the rate-decision method for the TextTiling algorithm itself.
Hearst also commented on the lack of inter-annotator agreement in human tests of the documents: While some boundaries are strongly agreed upon by her seven annotators, others are disagreed upon, and no boundary markings were unanimous. Her evaluation, therefore, took an agreement of three or more annotators as a real topic break (she argues that overgeneration is more useful than undergeneration for the goals of the TextTiling algorithm). This is in contrast to [15], which required four annotators out of seven to agree on a topic break before it was accepted.
Hearst notes that attempts to provide the TextTiling algorithm with a thesaurus implementation to augment its word frequency calculations actually worsened the results of the system. While possibly an implementation problem, this is a positive reflection on the effectiveness of simple statistical implementations.
James Ballantine 2005-02-19