Kozima argues that words receive their meaning only when placed in the context of the current topic; topic segmentation is therefore fundamental to text understanding, and hence important for tasks such as the resolution of anaphora. Kozima sees text segments as one of several constituents that make up the full structure of a text. In this respect, his idea of a text segment is not based on a human decision per se, as is assumed in [6] and [15]; rather, a text segment is an ideal logical part of the structure of any discourse. He compares text segments to scenes in movies.
While Kozima's idealised description of text segments is one of fundamental logical structure, the lexical cohesion profile (LCP) system is, like [6], based essentially on a statistical approach: it uses ``spreading activation on a semantic network'' combined with trough detection upon the similarity output.
Kozima's lexical cohesiveness metric is based on a semantic similarity between words, derived from [10], which functions as follows: A semantic network is systematically constructed from the LDOCE English dictionary. The similarity between two words is then given as a function of the activity of one of the words' nodes in the semantic network after activation of the other node. The function brings into consideration the significance of the activated word, based on a normalisation metric to prevent common words such as ``and'' from being too significant.
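The general idea of a similarity derived from spreading activation can be sketched as follows. This is only an illustrative toy, not the procedure of [10]: the four-word graph, the `decay` and `steps` parameters, the hand-picked `significance` values, and the function names are all assumptions made for the example, whereas the real network is built systematically from LDOCE.

```python
from typing import Dict, List


def spread_activation(graph: Dict[str, List[str]], source: str,
                      steps: int = 3, decay: float = 0.5) -> Dict[str, float]:
    """Activate `source` and let activity flow along the links for a few steps."""
    activity = {node: 0.0 for node in graph}
    activity[source] = 1.0
    for _ in range(steps):
        new = {node: 0.0 for node in graph}
        for node, level in activity.items():
            neighbours = graph[node]
            if not neighbours:
                continue
            # Each node passes a decayed, equal share of its activity to its neighbours.
            share = decay * level / len(neighbours)
            for neighbour in neighbours:
                new[neighbour] += share
        new[source] = 1.0  # the activated word is held at full activity
        activity = new
    return activity


def similarity(graph: Dict[str, List[str]], significance: Dict[str, float],
               activated: str, observed: str) -> float:
    """Activity of `observed` after activating `activated`, scaled by significance."""
    return significance[observed] * spread_activation(graph, activated)[observed]


# Hypothetical toy network: "and" is linked to everything but has low significance.
GRAPH = {
    "cat": ["pet", "and"],
    "pet": ["cat", "and"],
    "tax": ["and"],
    "and": ["cat", "pet", "tax"],
}
SIGNIFICANCE = {"cat": 1.0, "pet": 1.0, "tax": 1.0, "and": 0.1}
```

In this toy network, activating cat leaves pet more active than tax, and the low significance assigned to ``and'' keeps its score small even though it is directly linked to every other word.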
The lexical cohesion profile is then derived as follows: In a fixed-width window centred on the word currently under examination, take the mean of the current word's similarity to the other words within the window. This number is calculated for each word in the document, and a line-graph of the similarity metric is constructed, in much the same manner as [6].
The effect of word similarity on the profile is demonstrated well in two of Kozima's examples in table 2.1.
Because the `important' words cat, pet, and lion are strongly semantically related, the overall lexical cohesion profile of the `sentence' (here a text segment comprising three sentences) is high due to strong activation of similar words in the semantic network. In contrast, the second segment has a low overall topic cohesion. It is not necessarily an area of topic change, but neither can it be said to be a topic at all: it is simply three unrelated, non sequitur sentences.
Indeed, Kozima's graphs of similarity throughout example documents do not show high plateaus with sudden drops, as is for example expected in ideal output from the TextTiling algorithm [6]. Rather, topic cohesion is relatively low throughout most of the document, with only a few short-duration peaks reaching an LCP value above about 0.45. If the algorithm is effective, this suggests the possibility that a description of topic segmentation which must assign a definite topic to all areas of a document may be attempting to label too much, and that some areas are essentially topic-free.
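The windowed mean described above can be sketched in a few lines. The similarity function is passed in as a parameter; the `toy_similarity` table below is invented purely for illustration and does not reproduce the LDOCE-derived values, and the handling of windows at the document edges (simple truncation) is likewise an assumption.

```python
from typing import Callable, List, Sequence

# Hypothetical pairwise similarities, for illustration only.
RELATED = {
    frozenset(("cat", "pet")): 0.8,
    frozenset(("cat", "lion")): 0.6,
    frozenset(("pet", "lion")): 0.4,
}


def toy_similarity(a: str, b: str) -> float:
    """Stand-in for the semantic-network similarity; unrelated pairs score 0."""
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)


def lexical_cohesion_profile(words: Sequence[str],
                             similarity: Callable[[str, str], float],
                             half_width: int = 2) -> List[float]:
    """Mean similarity of each word to the other words in a window centred on it."""
    profile = []
    for i, word in enumerate(words):
        lo = max(0, i - half_width)                # truncate the window at the edges
        hi = min(len(words), i + half_width + 1)
        others = [words[j] for j in range(lo, hi) if j != i]
        if not others:
            profile.append(0.0)
            continue
        profile.append(sum(similarity(word, o) for o in others) / len(others))
    return profile
```

For example, `lexical_cohesion_profile(["cat", "pet", "lion"], toy_similarity, half_width=1)` gives one value per word, each being the mean similarity to its immediate neighbours.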
Kozima uses the lowest valleys of the LCP algorithm's output, smoothed using a Hanning window, to mark possible locations of topic change.
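A minimal sketch of this smoothing and valley-picking step is given below. The window width, the truncation of the window at the document edges, the definition of a valley as a simple local minimum, and the `count` parameter are all assumptions for the example; the source does not state which values or criteria Kozima used.

```python
import math
from typing import List, Sequence


def hann_weights(width: int) -> List[float]:
    """Hann (Hanning) weights, omitting the zero-valued end points."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * (k + 1) / (width + 1)))
            for k in range(width)]


def smooth(profile: Sequence[float], width: int = 5) -> List[float]:
    """Hann-weighted moving average; `width` should be odd."""
    weights = hann_weights(width)
    half = width // 2
    smoothed = []
    for i in range(len(profile)):
        total = 0.0
        norm = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < len(profile):  # ignore window positions outside the text
                total += w * profile[j]
                norm += w
        smoothed.append(total / norm)
    return smoothed


def lowest_valleys(profile: Sequence[float], width: int = 5,
                   count: int = 1) -> List[int]:
    """Positions of the `count` deepest local minima of the smoothed profile."""
    s = smooth(profile, width)
    minima = [i for i in range(1, len(s) - 1)
              if s[i] < s[i - 1] and s[i] <= s[i + 1]]
    deepest = sorted(minima, key=lambda i: s[i])[:count]
    return sorted(deepest)
```

The returned positions are candidate topic boundaries; on a real profile, floating-point noise on flat stretches can produce shallow spurious minima, which is one reason to keep only the deepest few.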
In his conclusion Kozima suggests an interesting visualisation of topics within a document:
Segment boundaries can be considered as segment switching (push and pop) in hierarchical structure of text. [10]
This suggests a model of topic flow similar to grammar-based parsing of programming languages, where topics can be nested recursively, and each topic must be closed and `returned' in order. This is perhaps most useful as a model for highly structured, well-laid-out documents, but does not seem appropriate for impromptu or ``live'' data.
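The push-and-pop reading can be made concrete with a small stack check. The event format and the function name are hypothetical, chosen only to illustrate the nesting constraint; [10] does not specify any such representation.

```python
from typing import List, Sequence, Tuple


def topics_well_nested(events: Sequence[Tuple[str, str]]) -> bool:
    """Return True if every topic is closed in the reverse order it was opened.

    `events` is a sequence of ("push", topic) and ("pop", topic) pairs.
    """
    stack: List[str] = []
    for action, topic in events:
        if action == "push":
            stack.append(topic)          # a new (sub)topic opens
        elif action == "pop":
            if not stack or stack[-1] != topic:
                return False             # closing a topic that is not innermost
            stack.pop()
        else:
            raise ValueError(f"unknown action: {action!r}")
    return not stack                     # all topics must have been closed
```

A sequence that opens topic A, opens B inside it, then closes B and A passes the check, whereas one that closes A while B is still open does not. Interleaved topics of the latter kind are exactly what impromptu or ``live'' data tends to produce.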
James Ballantine 2005-02-19