Summary of the NLTK manual (Updated)

Posted on 7 October 2008


Language and computation.

>> Explicit inferences vs. hardcoded assumption

Commercial dialogue systems use contextual assumptions and simple business logic to ensure that the different ways in which a user might express requests or provide information are handled in a way that makes sense for the particular application. Thus, whether the user says “When is …”, or “I want to know when …”, or “Can you tell me when …”, simple rules will always yield screening times.
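
As a minimal sketch of such hardcoded rules, assuming a hypothetical cinema application: the patterns below and the `SCREENING_TIMES` label are illustrative only and are not taken from any real dialogue system.

```python
import re

# Hypothetical surface patterns that all express the same request.
# Each one captures whatever follows the "when" phrase as the topic.
REQUEST_PATTERNS = [
    re.compile(r"^when is (?P<topic>.+?)\??$", re.IGNORECASE),
    re.compile(r"^i want to know when (?P<topic>.+?)\??$", re.IGNORECASE),
    re.compile(r"^can you tell me when (?P<topic>.+?)\??$", re.IGNORECASE),
]


def interpret(utterance):
    """Map an utterance to an (intent, topic) pair, or None if no rule fires."""
    text = utterance.strip()
    for pattern in REQUEST_PATTERNS:
        match = pattern.match(text)
        if match:
            # Hardcoded assumption: every "when" question asks for screening times.
            return ("SCREENING_TIMES", match.group("topic"))
    return None
```

No inference happens here: the three phrasings reach the same answer only because the application assumes in advance what a "when" question must be about.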

Dialogue systems can perform simple inferences, but such sophistication is found only in cutting-edge research prototypes.

>> The three bases for NLP

Formal language theory defined a language as a set of strings accepted by a class of automata, such as context-free languages and pushdown automata, and provided the underpinnings for computational syntax.
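
To make the pairing of a language with an automaton concrete, here is a small sketch of a stack-based recognizer for the textbook context-free language a^n b^n. The function name `accepts_anbn` is my own, and the code only imitates what a pushdown automaton does; it is not a general automaton simulator.

```python
def accepts_anbn(string):
    """Recognize the context-free language { a^n b^n : n >= 0 } using a stack."""
    stack = []
    seen_b = False
    for symbol in string:
        if symbol == "a":
            if seen_b:
                # An 'a' after a 'b' can never be part of a^n b^n.
                return False
            stack.append("a")
        elif symbol == "b":
            seen_b = True
            if not stack:
                # More b's than a's.
                return False
            stack.pop()
        else:
            # Symbol outside the alphabet {a, b}.
            return False
    # Accept only if every 'a' was matched by a 'b'.
    return not stack
```

The stack is what a finite-state machine lacks: it lets the recognizer remember an unbounded number of a's.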

Symbolic logic provided a formal method for capturing selected aspects of natural language that are relevant for expressing logical proofs. Examples are propositional logic and first-order logic.

The principle of compositionality states that the meaning of a complex expression is composed from the meaning of its parts and their mode of combination. This principle provided a useful correspondence between syntax and semantics, namely that the meaning of a complex expression could be computed recursively.
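
A minimal sketch of this recursive computation, using propositional logic as the meaning language. The nested-tuple encoding of expressions and the function name `meaning` are assumptions made for the example.

```python
def meaning(expr, valuation):
    """Compute the truth value of an expression from the values of its parts.

    An expression is either an atomic proposition (a string looked up in
    `valuation`) or a tuple (operator, part, ...) whose parts are expressions.
    """
    if isinstance(expr, str):
        return valuation[expr]
    operator, *parts = expr
    # The meaning of the whole depends only on the meanings of the parts...
    values = [meaning(part, valuation) for part in parts]
    # ...and on their mode of combination.
    if operator == "not":
        return not values[0]
    if operator == "and":
        return all(values)
    if operator == "or":
        return any(values)
    raise ValueError(f"unknown operator: {operator!r}")
```

For instance, `meaning(("and", "p", ("not", "q")), {"p": True, "q": False})` evaluates `("not", "q")` first and then combines the result with the value of `p`.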

Although this grammar-based NLP is still a significant area of research, it has been somewhat eclipsed in the last 15-20 years, largely under the influence of automatic speech recognition: systems that learned patterns from large bodies of speech data proved significantly more accurate, efficient and robust. In addition, the speech community found that progress in building better systems was hugely assisted by the construction of shared resources for quantitatively measuring performance against common test data.
Eventually, much of the NLP community embraced a data intensive orientation to language processing, coupled with a growing use of machine-learning techniques and evaluation-led methodology.

>> Generative Grammar and Modularity

One of the descendants of formal language theory was the linguistic framework known as generative grammar. Such a grammar contains a set of rules that recursively specify (generate) the set of well-formed strings in a language. While there is a wide spectrum of models that owe some allegiance to this core, Chomsky’s transformational grammar, in its various incarnations, is probably the best known.
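
The idea of rules that recursively generate strings can be sketched with a toy context-free grammar. The grammar, its vocabulary and the `generate` function are invented for illustration; the depth limit is needed only because the recursive rule VP -> thinks that S would otherwise generate infinitely many sentences.

```python
from itertools import product

# Toy grammar: each non-terminal maps to a list of alternative productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "N": [["dog"], ["cat"]],
    "VP": [["sleeps"], ["thinks", "that", "S"]],  # second rule is recursive
}


def generate(symbol, depth):
    """Yield every word list derivable from `symbol` using at most `depth`
    nested rule applications."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it stands for itself.
        yield [symbol]
        return
    if depth == 0:
        # Out of budget: this non-terminal cannot be expanded further.
        return
    for production in GRAMMAR[symbol]:
        expansions = [list(generate(part, depth - 1)) for part in production]
        # Combine one expansion of each part, in order.
        for combo in product(*expansions):
            yield [word for words in combo for word in words]
```

Calling `generate("S", 5)` already reaches sentences with one level of embedding, such as "the dog thinks that the cat sleeps".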

In this tradition, it is claimed that humans have distinct kinds of linguistic knowledge, organized into different modules: a phonological module might provide a set of phonemes together with an operation for concatenating phonemes into phonological strings. Similarly, a syntactic module might provide labeled nodes as primitives together with a mechanism for assembling them into trees.
A set of linguistic primitives, together with some operators for defining complex elements, is often called a level of representation.

>> Words, the most fundamental level for NLP

Tokenization is a prelude to pretty much everything else we might want to do in NLP, since it tells our processing software what our basic units are.

\w+               # sequences of 'word' characters
\$?\d+(\.\d+)?    # currency amounts, e.g. $12.50
([A-Z]\.)+        # abbreviations, e.g. U.S.A.
[^\w\s]+          # sequences of punctuation
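
A sketch of how the four patterns above could be combined into a single tokenizer with Python's standard `re` module. Two choices here are mine and not from the manual: the groups are made non-capturing so that `findall` returns whole tokens, and the more specific alternatives (abbreviations, currency) are tried before plain words so that "U.S.A." and "$12.50" are not split.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \$?\d+(?:\.\d+)?      # currency amounts, e.g. $12.50
  | \w+                   # sequences of 'word' characters
  | [^\w\s]+              # sequences of punctuation
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in `text`, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The U.S.A. ticket costs $12.50, OK?"))
# ['The', 'U.S.A.', 'ticket', 'costs', '$12.50', ',', 'OK', '?']
```

The order of the alternatives matters because the regex engine takes the first alternative that matches at each position, not the longest one.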

Distinct words that have the same written form are called homographs. We can distinguish homographs with the help of context; often the previous word suffices.
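
A deliberately tiny sketch of disambiguation by the previous word. The rule table `CONTEXT_RULES` and the tag names are hypothetical; a real tagger would learn such associations from a corpus rather than list them by hand.

```python
# Hypothetical rules: (previous word, homograph) -> part of speech.
CONTEXT_RULES = {
    ("the", "wind"): "noun",   # "the wind blows"
    ("to", "wind"): "verb",    # "to wind the clock"
    ("the", "lead"): "noun",   # "the lead pipe"
    ("to", "lead"): "verb",    # "to lead the team"
}


def tag_homographs(tokens):
    """Pair each token with a tag chosen from the previous word, or None."""
    tagged = []
    previous = None
    for token in tokens:
        word = token.lower()
        tagged.append((token, CONTEXT_RULES.get((previous, word))))
        previous = word
    return tagged
```

Tokens that are not covered by a rule simply receive `None`, which keeps the sketch honest about how little a hand-written table can cover.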

Lemmatization is a rather sophisticated process of mapping words to their lemmas: it uses rules for the regular word patterns, and table look-up for the irregular patterns. Lemmatization is a special case of normalization. It identifies a canonical representative for a set of related word forms. Normalization collapses distinctions.
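
The combination of rules and table look-up can be sketched as follows. The table `IRREGULAR`, the suffix rules and the function `lemmatize` are toy assumptions for English plurals and a few verb forms; this is not how NLTK's own lemmatizer is implemented, and the rules will mis-handle many real words.

```python
# Table look-up for irregular forms (hypothetical, far from complete).
IRREGULAR = {"went": "go", "mice": "mouse", "was": "be", "children": "child"}

# Suffix rules for regular patterns, tried in order: (suffix, replacement).
SUFFIX_RULES = [
    ("ies", "y"),    # ponies -> pony
    ("sses", "ss"),  # classes -> class
    ("ss", "ss"),    # glass stays glass (protects against the next rule)
    ("s", ""),       # dogs -> dog
]


def lemmatize(word):
    """Map a word form to a canonical representative (its lemma)."""
    word = word.lower()  # case folding is itself a normalization step
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters before the suffix, so "is" is kept.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word
```

Both "Ponies" and "pony" end up as "pony": the distinctions of case and number are collapsed, which is exactly what normalization means here.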


UPDATE (20 June 2011).

A user on the social network Quora asks “What do data scientists think about Python’s NLTK library?”, and the most upvoted answer at the moment is the following. I quote it here for anyone who wants to know whether NLTK is worth adopting.

Nearly everything algorithmic in nltk is done better somewhere else. The parsers are years behind, clustering works only on toy datasets, and a lot of what actually is interesting in the library are just popen calls to other packages.

Where nltk shines is for spackle and duct-tape code. The downloadable corpora (and the wordnet stuff) are useful. The sentence tokenizer works okay. The module for manipulating parse trees is surprisingly versatile, although nltk’s corresponding dependency graph class sucks.

But trying to use nltk in production has its annoyances. More recent versions of nltk don’t install correctly via […]. And then there’s this:

ubuntu@foo:~$ time python -c "import nltk"
real	0m1.188s
user	0m0.410s
sys	0m0.090s

Suddenly all of your unit tests grind to a halt, and you’re in for quite a surprise the first time you serve a web page that innocently imports something that imports something else that imports nltk.