In its first pass, the module handles reading and chunking of the input XML files and coordinates the internal processing of the data. It makes use of the SAX module from the PyXML package to parse the input files easily. It creates a SaxUtils default handler for the XML document, calling it to parse the document specified by the calling module (segmentation.py). SAX then handles the task of traversing the input file, passing XML entity tokens to the handler. The handler itself acts only when it receives a turn event--turns contain the utterances of the speakers in the conversation, whereas all information outside those turns is metadata not required by the algorithm.
The data received by the handler is then stored in a complex set of data structures: At parse-time, the following structures are filled in parallel (from a single parse of the input file):
During the first pass, the handler repeatedly calls the private method chunkBody(), passing in the current string of data, without XML tags, if it is inside an utterance.
chunkBody() tokenises the strings it receives into individual words, strips them of punctuation and extraneous characters, and removes words in the stop_words list. Secondly, it copies the dialogue it receives directly into a string representing the original dialogue. Pseudosentence numbers are inserted in XML tags, allowing later visualisations of the dialogue to display them for correlation against graphs during evaluation and experimentation. Lastly, the system takes note of any <break /> markings within the text: These are indicators of where humans have located topic change, and are manually inserted. The locations of these breaks are recorded by the index number of the pseudosentence they appear in, and built into a list.
After parsing the file itself, the following additional data structures are created (dependent on the specific calling options):
The second pass of the module, working on the in-memory representation of the dialogue, builds a list of all unique words within the dialogue. This structure will be used by the cosinemeasure.py module to construct a list of the frequencies of all words within the current windows.
The system then begins to construct a list of objects of type pseudoSentence. This type is defined inside the micaseparse.py module, and exists to perform a number of important tasks related to each pseudosentence: Foremost, it contains the pseudosentence itself as a list of string tokens. Its main method tallies the frequencies of the words in the pseudosentence and returns a list of those frequencies. It also performs a variety of functions upon the pseudosentence, mostly related to reporting for statistical and debugging purposes.
Once the list of pseudosentences has been populated and their frequency lists calculated, a loop iterates across the list of pseudosentences, applying the TextTiling algorithm: Starting from position
in the list, the algorithm compares overlapping
groups of pseudosentences using the cosinemeasure.py module. For each comparison, a value between zero and one is returned. These `continuous' similarity values are collected in a list.
The system then creates a smoothed copy of the similarity list using a simple smoothing algorithm: Each value in the smoothed list is given as the mean of itself and its nearest neighbours. The number of neighbours compared can be specified in the segmentation.py module, but the default is to use three values:
,
, and
.
The trough detection module peakpick.py is then used upon the smoothed similarity values list: It returns a list of integers corresponding to the pseudosentences detected as being significant troughs; that is, candidates for topic breaks.
James Ballantine 2005-02-19