Indexing Lists Versus Dictionaries

A text, as we have seen, is treated in Python as a list of words. An important property of lists is that we can look up a particular item by giving its index, e.g., text1[100]. Notice how we specify a number and get back a word. We can think of a list as a simple kind of table, as shown in Figure 5-2.

Figure 5-2. List lookup: We access the contents of a Python list with the help of an integer index.

Contrast this situation with frequency distributions (Section 1.3), where we specify a word and...
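To make the contrast concrete, here is a minimal sketch. The toy word list and the variable names are illustrative assumptions, not data from any corpus. A list takes an integer and gives back a word; a dictionary of counts takes a word and gives back a number.

```python
from collections import Counter

# A toy "text": a list of words (illustrative data only).
text = ["the", "cat", "sat", "on", "the", "mat"]

# List lookup: integer index in, word out.
word_at_2 = text[2]

# Dictionary-style lookup: word in, number out.
# Counter is a dict subclass that tallies how often each item occurs.
counts = Counter(text)
count_of_the = counts["the"]

print(word_at_2)     # sat
print(count_of_the)  # 2
```

The two lookups run in opposite directions, which is why a frequency distribution cannot be stored in a plain list of words.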

Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

Other corpora use a variety of formats for storing part-of-speech...

The λ-Calculus

In Section 1.3, we pointed out that mathematical set notation was a helpful method of specifying properties P of words that we wanted to select from a document. We illustrated this with (31), which we glossed as "the set of all w such that w is an element of V (the vocabulary) and w has property P". It turns out to be extremely useful to add something to first-order logic that will achieve the same effect. We do this with the λ-operator (pronounced "lambda"). The λ counterpart to (31) is (32). Since we...

NLTK's Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we've been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) "verbose flag" tells Python to strip out the embedded whitespace and comments.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set...
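The verbose-flag idea can be tried with the standard re module alone. The sketch below is an assumption-laden stand-in, not the book's pattern and not nltk.regexp_tokenize(): the alternatives and their order are our own choices, and every group is written as non-capturing, (?:...), because re.findall() returns group contents rather than whole matches when capturing groups are present.

```python
import re

# A simplified, illustrative pattern (not the one elided in the text above).
pattern = r'''(?x)          # verbose flag: ignore whitespace and comments
    (?:[A-Z]\.)+            # abbreviations such as U.S.A.
  | \$?\d+(?:\.\d+)?%?      # currency and percentages
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \.\.\.                  # ellipsis
  | [.,;"'?():_`-]          # single punctuation characters
'''

text = "That U.S.A. poster-print costs $12.40..."
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

Placing the number alternative before the word alternative is a design choice: otherwise an unsigned amount such as 12.40 would be split at the decimal point, since \w+ also matches digits.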

Further Reading

Please consult http://www.nltk.org/ for further materials on this chapter and on how to install external machine learning packages, such as Weka, Mallet, TADM, and MegaM. For more examples of classification and machine learning with NLTK, please see the classification HOWTOs at http://www.nltk.org/howto. For a general introduction to machine learning, we recommend (Alpaydin, 2004). For a more mathematically intense introduction to the theory of machine learning, see (Hastie, Tibshirani & Friedman,...

Further Reading

Please consult http://www.nltk.org/ for further materials on this chapter, including HOWTOs on feature structures, feature grammars, Earley parsing, and grammar test suites. For an excellent introduction to the phenomenon of agreement, see (Corbett, 2006). The earliest use of features in theoretical linguistics was designed to capture phonological properties of phonemes. For example, a sound like /b/ might be decomposed into the structure [+labial, +voice]. An important motivation was to capture...

Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an example of each genre.

Table 2-1. Example document for each section of the Brown Corpus

Christian Science Monitor Editorials Underwood Probing the...

Writing Results to a File

We have seen how to read text from files (Section 3.1). It is often useful to write output to files as well. The following code opens a file output.txt for writing, and saves the program output to the file.

>>> output_file = open('output.txt', 'w')
>>> words = ...
>>> for word in sorted(words):
...     output_file.write(word + "\n")

Your Turn: What is the effect of appending \n to each string before we write it to the file? If you're using a Windows machine, you may want to use word + "\r\n" instead. What happens...
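The same write loop can be sketched in a self-contained form. The three-word set stands in for the real vocabulary, and the temporary directory is our own addition so that the example leaves no file behind; both are assumptions for illustration. The code targets Python 3, where open() accepts an encoding argument.

```python
import os
import tempfile

words = {"pear", "apple", "fig"}  # illustrative stand-in for a real word set

# Work in a temporary directory so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "output.txt")

    # The with-statement closes the file even if an error occurs.
    with open(path, "w", encoding="utf-8") as output_file:
        for word in sorted(words):
            output_file.write(word + "\n")  # the newline puts one word per line

    # Read the file back to check what was written.
    with open(path, encoding="utf-8") as input_file:
        lines = input_file.read().splitlines()

print(lines)  # ['apple', 'fig', 'pear']
```

Without the "\n", all the words would run together on a single line, which is the effect the Your Turn question asks about.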

The Life Cycle of a Corpus

Corpora are not born fully formed, but involve careful preparation and input from many people over an extended period. Raw data needs to be collected, cleaned up, documented, and stored in a systematic structure. Various layers of annotation might be applied, some requiring specialized knowledge of the morphology or syntax of the language. Success at this stage depends on creating an efficient workflow involving appropriate tools and format converters. Quality control procedures can be put in...

Comparative Wordlists

Another example of a tabular lexicon is the comparative wordlist. NLTK includes so-called Swadesh wordlists, lists of about 200 common words in several languages. The languages are identified using an ISO 639 two-letter code.

>>> from nltk.corpus import swadesh
>>> swadesh.fileids()
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk',
'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
>>> swadesh.words('en')
['I', 'you (singular), thou', 'he',...

Functional Decomposition

Well-structured programs usually make extensive use of functions. When a block of program code grows longer than 10-20 lines, it is a great help to readability if the code is broken up into one or more functions, each one having a clear purpose. This is analogous to the way a good essay is divided into paragraphs, each expressing one main idea. Functions provide an important kind of abstraction. They allow us to group multiple actions into a single, complex action, and associate a name with it....

Variable Scope

Function definitions create a new local scope for variables. When you assign to a new variable inside the body of a function, the name is defined only within that function. The name is not visible outside the function, or in other functions. This behavior means you can choose variable names without being concerned about collisions with names used in your other function definitions. When you refer to an existing name from within the body of a function, the Python interpreter first tries to...
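A short sketch makes the two cases visible: assigning to a name inside a function, and merely referring to a name defined outside it. The names count, local_assignment, and global_lookup are invented for this illustration.

```python
count = 10  # a name defined at module (global) level


def local_assignment():
    count = 99        # assignment creates a new local name; the global is untouched
    return count


def global_lookup():
    return count + 1  # no local 'count' here, so Python finds the global one


print(local_assignment())  # 99
print(global_lookup())     # 11
print(count)               # 10
```

The final line shows that the assignment inside local_assignment() never leaked out of the function.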

Context-Free Grammar: A Simple Grammar

Let's start off by looking at a simple context-free grammar (CFG). By convention, the lefthand side of the first production is the start-symbol of the grammar, typically S, and all well-formed trees must have this symbol as their root label. In NLTK, context-free grammars are defined in the nltk.grammar module. In Example 8-1 we define a grammar and show how to parse a simple sentence admitted by the grammar.

Example 8-1. A simple context-free grammar.

  VP -> V NP | V NP PP
  PP -> P NP
  V ->...
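To see what it means for a sentence to be admitted by a grammar, here is a small pure-Python recognizer. It is a sketch under stated assumptions: the toy productions and lexicon are our own, not those of Example 8-1, it does not use the nltk.grammar module, and it would loop forever on left-recursive rules such as VP -> VP PP.

```python
# Productions as a dict: nonterminal -> list of right-hand sides.
# This toy grammar is an illustrative assumption, not Example 8-1.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "NP": [["Det", "N"], ["Name"]],
    "V": [["saw"]],
    "P": [["in"]],
    "Det": [["the"]],
    "N": [["dog"], ["park"]],
    "Name": [["Mary"]],
}


def ends(symbol, words, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # a terminal must match the next word exactly
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for part in rhs:  # thread the reachable positions through the RHS
            positions = {e for p in positions for e in ends(part, words, p)}
        result |= positions
    return result


def recognizes(sentence):
    """True if the start symbol S can span the whole sentence."""
    words = sentence.split()
    return len(words) in ends("S", words, 0)


print(recognizes("Mary saw the dog"))              # True
print(recognizes("Mary saw the dog in the park"))  # True
print(recognizes("saw Mary the"))                  # False
```

A recognizer only answers yes or no; a parser such as those in NLTK additionally builds the trees.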

Combining Taggers

One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, but to fall back on algorithms with wider coverage when necessary. For example, we could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:

1. Try tagging the token with the bigram tagger.
2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
3. If the unigram tagger is also unable to find a tag, use a default...
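The fallback chain described in these steps can be sketched without NLTK. Everything here is a simplified assumption: the class names LookupTagger and DefaultTagger, the hand-written lookup tables, and the tag labels are ours, and real taggers would learn their tables from training data.

```python
class DefaultTagger:
    """Assigns the same tag to every token."""

    def __init__(self, tag):
        self.tag = tag

    def tag_one(self, tokens, index, history):
        return self.tag


class LookupTagger:
    """Looks up a context in a table and defers to `backoff` when it is unseen."""

    def __init__(self, table, context_fn, backoff=None):
        self.table = table
        self.context_fn = context_fn
        self.backoff = backoff

    def tag_one(self, tokens, index, history):
        context = self.context_fn(tokens, index, history)
        if context in self.table:
            return self.table[context]
        if self.backoff is not None:
            return self.backoff.tag_one(tokens, index, history)
        return None


def tag(tagger, tokens):
    """Tag left to right, so earlier tags are available as context."""
    history = []
    for i in range(len(tokens)):
        history.append(tagger.tag_one(tokens, i, history))
    return list(zip(tokens, history))


# Hand-written toy tables (assumptions, not trained models).
default = DefaultTagger("NN")
unigram = LookupTagger(
    {"the": "AT", "wind": "NN", "to": "TO"},
    lambda toks, i, hist: toks[i],
    backoff=default,
)
bigram = LookupTagger(
    {("TO", "wind"): "VB"},
    lambda toks, i, hist: (hist[i - 1] if i > 0 else None, toks[i]),
    backoff=unigram,
)

print(tag(bigram, ["to", "wind", "the", "clock"]))
# [('to', 'TO'), ('wind', 'VB'), ('the', 'AT'), ('clock', 'NN')]
```

Here wind is tagged by the bigram table, to and the by the unigram table, and clock by the default tagger, so each level of the chain is exercised once.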

Counting Words by Genre

In Section 2.1, we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist() takes a list of pairs.

>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))

Let's break this down, and look at just two genres, news and romance. For each genre, we loop over...
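The list-of-pairs input can be illustrated with the standard library. This is a sketch of the idea only, assuming a defaultdict of Counter objects as a stand-in for ConditionalFreqDist; the handful of (genre, word) pairs is invented.

```python
from collections import Counter, defaultdict

# Invented (condition, event) pairs standing in for (genre, word) from a corpus.
pairs = [
    ("news", "the"),
    ("news", "could"),
    ("romance", "could"),
    ("romance", "could"),
    ("news", "the"),
]

# One Counter per condition: condition -> (word -> count).
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(sorted(cfd))              # ['news', 'romance']
print(cfd["news"]["the"])       # 2
print(cfd["romance"]["could"])  # 2
```

Each condition gets its own independent frequency distribution, which is what distinguishes this structure from a single flat count.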

Symbols

not equal to operator, 22, 376 quotation marks, double , in strings, 87 dollar sign in regular expressions, 98, 101 percent sign in string formatting, 119 s formatting string, 107, 119 s and d conversion specifiers, 118 amp ampersand , and operator, 368 ' ' quotation marks, single in strings, 88 ' ' quotation marks, single , in strings, 87 ' apostrophe in tokenization, 110 parentheses adding extra to break lines of code, 139 enclosing expressions in Python, 2 in function names, 9 in regular...

Cross-Validation

In order to evaluate our models, we must reserve a portion of the annotated data for the test set. As we already mentioned, if the test set is too small, our evaluation may not be accurate. However, making the test set larger usually means making the training set smaller, which can have a significant impact on performance if a limited amount of annotated data is available. One solution to this problem is to perform multiple evaluations on different test sets, then to combine the scores from...
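The idea of evaluating on several different test sets and combining the scores can be sketched as follows. The function names k_fold_splits and cross_validate, the choice of contiguous folds, and the toy evaluate function are all assumptions made for illustration; a real evaluate would train a model on train and score it on test.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    fold_size = len(items) // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder.
        end = (i + 1) * fold_size if i < k - 1 else len(items)
        test = items[start:end]
        train = items[:start] + items[end:]
        yield train, test


def cross_validate(items, k, evaluate):
    """Average the score returned by evaluate(train, test) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)


data = list(range(10))
folds = [test for _, test in k_fold_splits(data, 3)]
print(folds)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]

# Toy "evaluation": the fraction of the data held out in each fold.
mean_score = cross_validate(data, 3, lambda train, test: len(test) / len(data))
```

Every item is used for testing exactly once and for training in all the other folds, so no annotated data is wasted.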

Gender Identification

In Section 2.4, we saw that male and female names have some distinctive characteristics. Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. Let's build a classifier to model these differences more precisely. The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following...
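A feature extractor that looks only at the final letter might be sketched like this. The feature name last_letter, the six hand-picked training names, and the majority-vote lookup are assumptions for illustration; the lookup is a deliberately crude baseline, not the classifier the chapter goes on to build.

```python
from collections import Counter, defaultdict


def gender_features(name):
    """Encode a name as a dictionary of features (here: just its last letter)."""
    return {"last_letter": name[-1].lower()}


# A tiny invented training set.
labeled_names = [
    ("Anna", "female"), ("Maria", "female"), ("Sophie", "female"),
    ("Mark", "male"), ("Jack", "male"), ("Leo", "male"),
]


def train(labeled):
    """For each last letter, remember the label seen most often with it."""
    counts = defaultdict(Counter)
    for name, label in labeled:
        counts[gender_features(name)["last_letter"]][label] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


model = train(labeled_names)


def classify(name, default="unknown"):
    return model.get(gender_features(name)["last_letter"], default)


print(gender_features("Shrek"))  # {'last_letter': 'k'}
print(classify("Julia"))         # female
print(classify("Erik"))          # male
```

Keeping the feature extractor separate from the learning step means new features can be added later without touching the rest of the code.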

Processing Raw Text

The most important source of texts is undoubtedly the Web. It's convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them. The goal of this chapter is to answer the following questions:

1. How can we write programs to access text from local files and from the Web, in order to get hold of an unlimited range of language material?
2. How can we split...

Tagging Unknown Words

Our approach to tagging unknown words still uses backoff to a regular expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items? A useful method to tag unknown words based on context is to limit the vocabulary of a tagger...
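One way to limit a vocabulary, sketched here under our own assumptions, is to keep only the most frequent words and map everything else to a single placeholder token. The function name limit_vocabulary, the placeholder UNK, and the cutoff are illustrative choices, not necessarily the exact procedure the passage goes on to describe.

```python
from collections import Counter


def limit_vocabulary(tokens, size, unknown="UNK"):
    """Keep the `size` most frequent words; replace all others with `unknown`."""
    vocab = {word for word, _ in Counter(tokens).most_common(size)}
    return [word if word in vocab else unknown for word in tokens]


tokens = ["the", "blog", "the", "to", "blog", "the", "to", "zebra"]
print(limit_vocabulary(tokens, 3))
# ['the', 'blog', 'the', 'to', 'blog', 'the', 'to', 'UNK']
```

After this mapping, a context-sensitive tagger can learn how the placeholder behaves after words such as the or to, which is exactly the information a default tagger cannot use.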

Shoebox and Toolbox Lexicons

Perhaps the single most popular tool used by linguists for managing data is Toolbox, previously known as Shoebox since it replaces the field linguist's traditional shoebox full of file cards. Toolbox is freely downloadable from http://www.sil.org/computing/toolbox/. A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet....

Chunking with Regular Expressions

a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input.

Example 7-2. Simple noun phrase chunker.

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

The $ symbol is a...
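To get a feel for how tag patterns relate to ordinary regular expressions, here is a rough standard-library sketch. It is not how nltk.RegexpParser is implemented; the helper name np_chunks and the encoding of the tag sequence as one string are our own assumptions. The two alternatives mirror the two rules of the grammar, with the $ backslash-escaped.

```python
import re

# Two alternatives, mirroring the two chunk rules; $ must be escaped.
NP_PATTERN = re.compile(r"(?:<DT>|<PP\$>)?(?:<JJ>)*<NN>|(?:<NNP>)+")


def np_chunks(tagged):
    """Return the word sequences whose tags match NP_PATTERN."""
    # Encode the tags as one string, e.g. "<NNP><VBD><RP>...".
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Counting "<" converts character offsets back to token positions.
        first = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        chunks.append([word for word, _ in tagged[first:first + length]])
    return chunks


sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(np_chunks(sentence))
# [['Rapunzel'], ['her', 'long', 'golden', 'hair']]
```

Because each encoded tag ends with >, the pattern <NN> cannot accidentally match the start of <NNP> or <NNS>.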

Textual Entailment

The challenge of language understanding has been brought into focus in recent years by a public "shared task" called Recognizing Textual Entailment (RTE). The basic scenario is simple. Suppose you want to find evidence to support the hypothesis: Sandra Goudie was defeated by Max Purnell, and that you have another short text that seems to be relevant, for example, Sandra Goudie was first elected to Parliament in the 2002 elections, narrowly winning the seat of Coromandel by defeating Labour...

Parsing with Context-Free Grammar

A parser processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness; it is actually just a string, not a program. A parser is a procedural interpretation of the grammar. It searches through the space of trees licensed by a grammar to find one that has the required sentence along its fringe. A parser permits a grammar to be evaluated against a collection...

Separating the Training and Testing Data

Now that we are training a tagger on some data, we must be careful not to test it on the same data, as we did in the previous example. A tagger that simply memorized its training data and made no attempt to construct a general model would get a perfect score, but would be useless for tagging new text. Instead, we should split the data, training on 90% and testing on the remaining 10%:

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>>...
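The same split can be wrapped in a small helper and checked on toy data. The function name split_data and the ten-item list are assumptions for illustration; the slicing logic follows the session above.

```python
def split_data(items, train_fraction=0.9):
    """Split a sequence into a leading training part and a trailing test part."""
    size = int(len(items) * train_fraction)
    return items[:size], items[size:]


sentences = ["sent%d" % i for i in range(10)]  # stand-in for tagged sentences
train_sents, test_sents = split_data(sentences)

print(len(train_sents), len(test_sents))  # 9 1
```

Because the two slices share the boundary index, no sentence can end up in both the training and the test portion.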

Extracting Encoded Text from Files

Let's assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet of Polish text from the Polish Wikipedia. This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us.

>>> path = nltk.data.find(...)

The Python codecs module provides functions to read encoded data into Unicode strings, and to write out Unicode strings in encoded form. The codecs.open() function takes an encoding...
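The round trip between encoded bytes and Unicode strings can be tried without the corpus file. In this sketch the sample phrase, the temporary file name, and the use of the built-in open() with an encoding argument (the Python 3 counterpart of codecs.open()) are our own assumptions.

```python
import os
import tempfile

sample = "Pruska Biblioteka Państwowa"  # contains ń, which Latin-2 can represent

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "polish-sample.txt")

    # Write the text out as Latin-2 (ISO-8859-2) bytes.
    with open(path, "wb") as f:
        f.write(sample.encode("iso-8859-2"))

    # Read it back, telling Python which encoding to decode with.
    with open(path, encoding="iso-8859-2") as f:
        decoded = f.read()

    # Read the raw bytes too, to compare lengths.
    with open(path, "rb") as f:
        raw = f.read()

print(decoded == sample)        # True
print(len(raw) == len(sample))  # True: Latin-2 uses one byte per character
```

Decoding the same bytes with the wrong encoding would silently produce different characters, which is why we need to know how a file is encoded before reading it.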

Nltk Feature Grammar Embedded Function

1. ○ What constraints are required to correctly parse word sequences like I am happy and she is happy but not you is happy or they am happy? Implement two solutions for the present tense paradigm of the verb be in English, first taking Grammar (8) as your starting point, and then taking Grammar (20) as the starting point.
2. ○ Develop a variant of grammar in Example 9-1 that uses a feature COUNT to make the distinctions shown here:
3. ○ Write a function subsumes() that holds of two feature structures...

Recognizing Textual Entailment

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis" (as already discussed in Section 1.5). To date, there have been four RTE Challenges, where shared development and test data is made available to competing teams. Here are a couple of examples of text/hypothesis pairs from the Challenge 3 development dataset. The label True indicates that the entailment holds, and False indicates that it fails to hold.

T: Parviz...

Developing and Evaluating Chunkers

Now you have a taste of what chunking does, but we haven't explained how to evaluate chunkers. As usual, this requires a suitably annotated corpus. We begin by looking at the mechanics of converting IOB format into an NLTK tree, then at how this is done on a larger scale using a chunked corpus. We will see how to score the accuracy of a chunker relative to a corpus, then look at some more data-driven ways to search for NP chunks. Our focus throughout will be on expanding the coverage of a...

Matplotlib

Python has some libraries that are useful for visualizing language data. The Matplotlib package supports sophisticated plotting functions with a MATLAB-style interface, and is available from http://matplotlib.sourceforge.net/. So far we have focused on textual presentation and the use of formatted print statements to get output lined up in columns. It is often very useful to display numerical data in graphical form, since this often makes it easier to detect patterns. For example, in Example 3-5,...

Exercises

1. ○ The IOB format categorizes tagged tokens as I, O, and B. Why are three tags necessary? What problem would be caused if we used I and O tags exclusively?
2. ○ Write a tag pattern to match noun phrases containing plural head nouns, e.g., many/JJ researchers/NNS, two/CD weeks/NNS, both/DT new/JJ positions/NNS. Try to do this by generalizing the tag pattern that handled singular noun phrases.
3. ○ Pick one of the three chunk types in the CoNLL-2000 Chunking Corpus. Inspect the data and try to...

Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt',...

Truth in Model

We have looked at the syntax of first-order logic, and in Section 10.4 we will examine the task of translating English into first-order logic. Yet as we argued in Section 10.1, this gets us further forward only if we can give a meaning to sentences of first-order logic. In other words, we need to give a truth-conditional semantics to first-order logic. From the point of view of computational semantics, there are obvious limits to how far one can push this approach. Although we want to talk...

Getting Started with NLTK

Before going further you should install NLTK, downloadable for free from http://www.nltk.org/. Follow the instructions there to download the version required for your platform. Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection as shown in Figure 1-1.

>>> import nltk
>>> nltk.download()...

Curation Versus Evolution

As large corpora are published, researchers are increasingly likely to base their investigations on balanced, focused subsets that were derived from corpora produced for entirely different reasons. For instance, the Switchboard database, originally collected for speaker identification research, has since been used as the basis for published studies in speech recognition, word pronunciation, disfluency, syntax, intonation, and discourse structure. The motivations for recycling linguistic corpora...

Case and Gender in German

Compared with English, German has a relatively rich morphology for agreement. For example, the definite article in German varies with case, gender, and number, as shown in Table 9-2.

Table 9-2. Morphological paradigm for the German definite article

Subjects in German take the nominative case, and most verbs govern their objects in the accusative case. However, there are exceptions, such as helfen, that govern the dative case 55...

OLAC: Open Language Archives Community

The Open Language Archives Community, or OLAC, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practices for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. OLAC's home on the Web is at http://www.language-archives.org/. OLAC Metadata is a standard for describing...

Special Considerations When Working with Endangered Languages

The importance of language to science and the arts is matched in significance by the cultural treasure embodied in language. Each of the world's 7,000 human languages is rich in unique respects, in its oral histories and creation legends, down to its grammatical constructions and its very words and their nuances of meaning. Threatened remnant cultures have words to distinguish plant subspecies according to therapeutic uses that are unknown to science. Languages evolve over time as they come...

Annotated Text Corpora

Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. Table 2-2 lists some of the corpora. For information about downloading them, see http://www.nltk.org/data. For more examples of how to access NLTK corpora, please...

Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

>>> text1.concordance("monstrous")
Building index
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . This came towards us ,
ON OF THE PSALMS . Touching that monstrous...
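The essence of a concordance view, every occurrence of a word with a little context on each side, can be sketched in a few lines. This is an illustrative stand-in rather than NLTK's implementation: the function name, the token-based context window, and the returned tuples are our own assumptions.

```python
def concordance(tokens, target, width=3):
    """Return (left context, match, right context) for each occurrence of target."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():  # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "one was of a most monstrous size . This came".split()
print(concordance(tokens, "monstrous", width=2))
# [('a most', 'monstrous', 'size .')]
```

NLTK's own display counts the context in characters rather than tokens, which is why its lines begin and end in the middle of words.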

Exercises

1. ○ Can you come up with grammatical sentences that probably have never been uttered before? (Take turns with a partner.) What does this tell you about human language?
2. ○ Recall Strunk and White's prohibition against using a sentence-initial however to mean "although". Do a web search for however used at the start of the sentence. How widely used is this construction?
3. ○ Consider the sentence Kim arrived or Dana left and everyone cheered. Write down the parenthesized forms to show the relative...

Choosing the Right Features

Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based...

Morphology in Part-of-Speech Tagsets

Common tagsets often capture some morphosyntactic information, that is, information about the kind of morphological markings that words receive by virtue of their syntactic role. Consider, for example, the selection of distinct grammatical forms of the word go illustrated in the following sentences:

b. He sometimes goes to the cafe.

Each of these forms (go, goes, gone, and went) is morphologically distinct from the others. Consider the form goes. This occurs in a restricted set of grammatical...

First-Order Logic

In the remainder of this chapter, we will represent the meaning of natural language expressions by translating them into first-order logic. Not all of natural language semantics can be expressed in first-order logic. But it is a good choice for computational semantics because it is expressive enough to represent many aspects of semantics, and on the other hand, there are excellent systems available off the shelf for carrying out automated inference in first-order logic. Our next step will be to...

Inaugural Address Corpus

In Section 1.1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in Figure 1-2 used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()...

Ubiquitous Ambiguity

A well-known example of ambiguity is shown in (2), from the Groucho Marx movie, Animal Crackers (1930):

(2) While hunting in Africa, I shot an elephant in my pajamas. How an elephant got into my pajamas I'll never know.

Let's take a closer look at the ambiguity in the phrase: I shot an elephant in my pajamas. First we need to define a simple grammar:

>>> groucho_grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | 'I'
... VP -> V NP | VP PP
... Det -> 'an' | 'my'
... N -> 'elephant' | 'pajamas'
... V -> 'shot'
... P -> 'in'
... """)

This grammar permits the sentence to be analyzed in two...

Pronoun Resolution

A deeper kind of language understanding is to work out "who did what to whom", i.e., to detect the subjects and objects of verbs. You learned to do this in elementary school, but it's harder than you might think. In the sentence the thieves stole the paintings, it is easy to tell who performed the stealing action. Consider three possible following sentences in (4), and try to determine what was sold, caught, and found (one case is ambiguous).

(4) a. The thieves stole the paintings. They were...

Part-of-Speech Tagging

In Chapter 5, we built a regular expression tagger that chooses a part-of-speech tag for a word by looking at the internal makeup of the word. However, this regular expression tagger had to be handcrafted. Instead, we can train a classifier to work out which suffixes are most informative. Let's begin by finding the most common suffixes:

>>> from nltk.corpus import brown
>>> suffix_fdist = nltk.FreqDist()
>>> for word in brown.words():
...     word = word.lower()
...     suffix_fdist.inc(word[-1...
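Counting suffixes does not require a corpus reader. The sketch below uses collections.Counter on an invented word list; tallying suffixes of length one, two, and three is our assumption, since the session above is cut off before it shows which lengths are counted.

```python
from collections import Counter

# Invented sample; a real run would loop over a corpus such as Brown.
words = ["The", "running", "dogs", "jumped", "quickly", "walking", "is", "ending"]

suffix_counts = Counter()
for word in words:
    word = word.lower()
    for n in (1, 2, 3):            # assumed suffix lengths
        suffix_counts[word[-n:]] += 1

print(suffix_counts["ing"])  # 3
print(suffix_counts["s"])    # 2
```

Note that a word shorter than the suffix length contributes the whole word again (for example, is is counted for both n=2 and n=3), so very short words are slightly over-represented in the tally.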

Unsimplified Tags

Let's find the most frequent nouns of each noun part-of-speech type. The program in Example 5-1 finds all tags starting with NN, and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s), and P for proper nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines, and -TL for titles (a feature of Brown tags)....