Plotting and Tabulating Distributions

Apart from combining two or more frequency distributions, and being easy to initialize, a ConditionalFreqDist provides some useful methods for tabulation and plotting.

The plot in Figure 2-1 was based on a conditional frequency distribution reproduced in the following code. The condition is either of the words america or citizen, and the counts being plotted are the number of times the word occurred in a particular speech. The code exploits the fact that the filename for each speech (for example, 1865-Lincoln.txt) contains the year as its first four characters. It generates the pair ('america', '1865') for every instance of a word whose lowercased form starts with america (such as Americans) in the file 1865-Lincoln.txt.

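>>> import nltk
>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target))
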
The plot in Figure 2-2 was also based on a conditional frequency distribution, reproduced in the following code. This time, the condition is the name of the language, and the counts being plotted are derived from word lengths. The code exploits the fact that the filename for each language is the language name followed by '-Latin1' (the character encoding).

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))

In the plot() and tabulate() methods, we can optionally specify which conditions to display with a conditions= parameter. When we omit it, we get all the conditions. Similarly, we can limit the samples to display with a samples= parameter. This makes it possible to load a large quantity of data into a conditional frequency distribution, and then to explore it by plotting or tabulating selected conditions and samples. It also gives us full control over the order of conditions and samples in any displays. For example, we can tabulate the cumulative frequency data just for two languages, and for words less than 10 characters long, as shown next. We interpret the last cell on the top row to mean that 1,638 words of the English text have nine or fewer letters.

>>> cfd.tabulate(conditions=['English', 'German_Deutsch'],
...              samples=range(10), cumulative=True)
                  0    1    2    3    4    5    6    7    8    9
       English    0  185  525  883  997 1166 1283 1440 1558 1638
German_Deutsch    0  171  263  614  717  894 1013 1110 1213 1275
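
To see how a cumulative row is built and read, the following sketch computes one by hand for a short, hand-typed word list (not the corpus data). It uses only the standard library and is meant to illustrate the arithmetic, not to reproduce NLTK's tabulate() internals.

```python
from collections import Counter
from itertools import accumulate

# Hand-typed sample words; the real example reads udhr.words('English-Latin1').
words = ["all", "human", "beings", "are", "born",
         "free", "and", "equal", "in", "dignity"]

# Frequency of each word length (the "samples" of the distribution).
length_counts = Counter(len(w) for w in words)

# Restrict and order the columns the way samples=range(10) does.
samples = range(10)
row = [length_counts[n] for n in samples]

# cumulative=True corresponds to a running total across the row.
cumulative_row = list(accumulate(row))

print(row)             # [0, 0, 1, 3, 2, 2, 1, 1, 0, 0]
print(cumulative_row)  # [0, 0, 1, 4, 6, 8, 9, 10, 10, 10]
```

The last cell, 10, says that ten of the sample words have nine or fewer letters, which is the same reading applied to the 1,638 in the English row.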

Your Turn: Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy, and which are most romantic. Define a variable called days containing a list of days of the week, i.e., ['Monday', ...]. Now tabulate the counts for these words using cfd.tabulate(samples=days). Now try the same thing using plot in place of tabulate. The days appear in the order given in the samples= list, so you can control the output order by ordering that list: samples=['Monday', ...].

You may have noticed that the multiline expressions we have been using with conditional frequency distributions look like list comprehensions, but without the brackets. In general, when we use a list comprehension as a parameter to a function, like set([w.lower() for w in t]), we are permitted to omit the square brackets and just write set(w.lower() for w in t). (See the discussion of "generator expressions" in Section 4.2 for more about this.)
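
As a quick check of this equivalence, the sketch below builds the same set both ways. The token list t is made up for illustration.

```python
# Made-up token list for illustration.
t = ["The", "the", "THE", "cat"]

# List comprehension: builds a full list, then hands it to set().
from_list = set([w.lower() for w in t])

# Generator expression: set() consumes the items one at a time,
# with no intermediate list.
from_gen = set(w.lower() for w in t)

print(from_list == from_gen)  # True
print(sorted(from_gen))       # ['cat', 'the']
```

The brackets can be dropped only when the generator expression is the sole argument of the call; with additional arguments it needs its own parentheses.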
