Segmentation where relatively few words are in common

In 1997, Ponte and Croft [16] approached textual topic segmentation with a goal that diverged from previous work, especially from [6]: their aim was to investigate methods for detecting relatively small topic segments, which may not share many common words. This essentially rules out a TextTiling approach. Instead, they proposed a query expansion technique which identifies common features in the segments besides word frequency.

Notably, unlike TextTiling, Ponte and Croft's work does not make use of pre-defined boundaries in the source data, such as paragraph breaks.

The research focuses on news-bite feeds, where individual articles can be as short as one or two sentences, stating a fact and moving on. This rules out a fixed-window approach such as that used by TextTiling. There is usually no repetition of topic words at all. However, some semantic relations do exist, as figure 2.1 illustrates. Here, the tokens ``cyst'' and ``cancerous'', followed by ``growth'' and ``surgical procedure'', occur only once each, but are clearly strongly related semantically.

Figure 2.1: Three topic segments within a news feed
\begin{figure}\texttt{Police in Lebanon said that two Red Cross workers abducted...
...orp., said he would hand over power to his deputy, Goh Chok Tong.}\end{figure}

This problem is approached using Local Context Analysis (LCA), which essentially performs the work of a thesaurus in this context. Given a sentence, it returns a set of generalised ``concepts'' which allow words with similar conceptual meanings to be matched. For example, the second two-sentence topic shown in figure 2.1 (``The White House...'') has no words in common between its two sentences, but contains 11 counts of ``concept'' feature similarity. In contrast, there are zero similarities between the first sentence and the sentence preceding it in the news feed, and zero between the second sentence and the sentence following it. LCA thus seems to be a strong method for locating short topic segments.
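The idea of matching sentences on shared concepts rather than shared words can be illustrated with a minimal Python sketch. It is not Ponte and Croft's implementation: real LCA derives its concepts from a corpus, whereas the `TOY_CONCEPTS` table, the `expand` and `concept_similarity` functions, and every concept label below are invented purely for illustration. Counting similarity as the size of the set intersection is also an assumption.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for LCA output. These word-to-concept entries are
# invented for illustration and are not taken from the LCA database.
TOY_CONCEPTS: Dict[str, Set[str]] = {
    "cyst": {"tumour", "surgery", "doctor"},
    "cancerous": {"tumour", "doctor"},
    "growth": {"tumour", "surgery"},
    "police": {"arrest", "crime"},
}


def expand(sentence: List[str], lookup: Dict[str, Set[str]]) -> Set[str]:
    """Return the union of the concepts associated with each token."""
    concepts: Set[str] = set()
    for token in sentence:
        # Tokens missing from the lookup contribute no concepts.
        concepts |= lookup.get(token.lower(), set())
    return concepts


def concept_similarity(
    a: List[str], b: List[str], lookup: Dict[str, Set[str]] = TOY_CONCEPTS
) -> int:
    """Count the concepts shared by two tokenised sentences."""
    return len(expand(a, lookup) & expand(b, lookup))
```

With this toy table, two sentences that share no words can still share concepts, while sentences from unrelated topics share none.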

Each possible segment is then scored using two quantities: its internal similarity, which is the sum of the pairwise similarities between adjacent sentences within the segment, and its left- and right-external similarities, which compare the segment with the sentences on either side of it. The segments are ranked according to the internal similarity minus the external similarities. This simply means, for example, that a pair of sentences which is self-similar (homogeneous) but dissimilar to its neighbouring sentences will be scored highly as a candidate segment. This process is repeated for different segment sizes, measured in sentences.
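The ranking described above can be sketched in Python as follows. This is one reading of the description, not the authors' exact formulation: the names `segment_score` and `rank_segments` are hypothetical, the external similarity is assumed to be measured only between each boundary sentence and its immediate neighbour outside the segment, and no normalisation for segment length is applied.

```python
from typing import Callable, List, Tuple

# sim(i, j) returns the similarity between sentences i and j (assumed given,
# for example a count of shared concept features).
Sim = Callable[[int, int], float]


def segment_score(start: int, end: int, n: int, sim: Sim) -> float:
    """Score the candidate segment covering sentences start .. end-1.

    n is the total number of sentences in the feed.
    """
    # Internal similarity: adjacent sentence pairs inside the segment.
    internal = sum(sim(i, i + 1) for i in range(start, end - 1))
    # External similarities: boundary sentences against their outside
    # neighbours; zero where the segment touches the edge of the feed.
    left = sim(start - 1, start) if start > 0 else 0.0
    right = sim(end - 1, end) if end < n else 0.0
    return internal - (left + right)


def rank_segments(
    n: int, sizes: List[int], sim: Sim
) -> List[Tuple[float, int, int]]:
    """Return (score, start, end) for every candidate, best first."""
    candidates = [
        (segment_score(start, start + size, n, sim), start, start + size)
        for size in sizes
        for start in range(n - size + 1)
    ]
    return sorted(candidates, key=lambda c: c[0], reverse=True)
```

Passing several values in `sizes` corresponds to repeating the process for different sentence counts; a candidate that is internally similar but unlike its neighbours rises to the top of the returned list.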

Ponte and Croft demonstrate an interesting case where the algorithm fails: three consecutive news-bites, discussing a Conservative Party conference, the Yugoslavian Premier seeking financial assistance from the USA, and violence in Namibia respectively, were detected as two segments, with the division falling in the middle of the second article. This was a difficult case, as all of the articles have a similar theme (politics) and share concepts such as ``president'', ``party'', ``economy'', ``premier'', ``political'', ``administration'', and ``leaders''. Ponte and Croft theorise that, given that the LCA database uses data from 1990-1992 and the problematic news feed was produced in 1989, a training database from a closer time period would have improved matters. For example, it would allow the use of time-specific data such as semantic relations between ``Markovic'', ``Yugoslavia'' and ``premier''.

James Ballantine 2005-02-19