Sentence Segmentation
Manipulating texts at the level of individual words often presupposes the ability to divide a text into individual sentences. As we have seen, some corpora already provide access at the sentence level. In the following example, we compute the average number of words per sentence in the Brown Corpus:
>>> import nltk
>>> len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents())
20.250994070456922
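To see what this ratio measures without loading the Brown Corpus, here is a minimal sketch on a hand-made toy corpus. The names `toy_sents` and `average_sentence_length` are hypothetical and not part of NLTK; the only assumption carried over is the shape of the data, a list of sentences where each sentence is a list of tokens and punctuation marks count as tokens.

```python
# Toy stand-in for a sentence-segmented corpus: a list of sentences,
# each of which is a list of tokens (punctuation counts as a token).
toy_sents = [
    ["The", "train", "is", "going", "right", "."],
    ["Oh", ",", "their", "wild", "rapture", "!"],
    ["Nonsense", "!"],
]

def average_sentence_length(sents):
    """Return the mean number of tokens per sentence."""
    total_words = sum(len(sent) for sent in sents)
    return total_words / len(sents)

print(average_sentence_length(toy_sents))
```

The three toy sentences hold 6, 6, and 2 tokens, so the function divides 14 by 3, which is the same total-words over total-sentences ratio computed for the Brown Corpus.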
In other cases, the text is available only as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006). Here is an example of its use in segmenting the text of a novel. (Note that if the segmenter's internal data has been updated by the time you read this, you will see different output.)
>>> import pprint
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = sent_tokenizer.tokenize(text)
>>> pprint.pprint(sents[171:181])
['"Nonsense!',
 '" said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,...',
 'It is because they know that the train is going right.',
 'It\nis because they know that whatever place they have taken a ticket\nfor that ...',
 'It is because after they have\npassed Sloane Square they know that the next stat...',
 'Oh, their wild rapture!',
 'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w...',
 '"\n\n"It is you who are unpoetical," replied the poet Syme.']
Notice that this example is really a single sentence, reporting the speech of Mr. Lucian Gregory. However, the quoted speech contains several sentences, and these have been split into individual strings. This is reasonable behavior for most applications.
Sentence segmentation is difficult because the period also marks abbreviations, and some periods simultaneously mark an abbreviation and terminate a sentence, as often happens with acronyms like U.S.A.
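A deliberately naive splitter makes the ambiguity visible. The sketch below is not how Punkt works; it simply splits wherever `.`, `!`, or `?` is followed by whitespace. The helper name `naive_sent_split` and the sample text are hypothetical, chosen only for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    Deliberately simplistic: every period is treated as a sentence end,
    so abbreviations such as 'Mr.' are split off incorrectly.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Mr. Gregory lives in the U.S.A. He is very rational."
for sent in naive_sent_split(text):
    print(repr(sent))
```

The splitter returns three strings, 'Mr.', 'Gregory lives in the U.S.A.', and 'He is very rational.'. The break after "Mr." is wrong, while the break after "U.S.A." is right, since that final period both ends the abbreviation and ends the sentence. A blanket rule such as "never split after an abbreviation" would repair the first case and break the second, which is why a segmenter needs more than punctuation alone.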
For another approach to sentence segmentation, see Section 6.2.