NLP Pipelines with NLTK
In Natural Language Processing (NLP) applications it is often useful to build a pipeline that takes raw text, processes it and extracts relevant features before feeding them into a machine learning (ML) algorithm.
Normalization
From the standpoint of an ML algorithm, it may not make much sense to differentiate between different cases of a word - “Tree”, “tree” and “TREE” for example all have the same meaning. For this reason, we may want to convert all text to lower case.
Equally, depending on your NLP task, you may want to remove special characters like punctuation.
In Python we can achieve both of these objectives very simply:
import re
# replace anything that is not a letter or a digit with a space, after lower-casing
text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
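For example, on a made-up string:
text = "Hello, World!! NLP is fun."
text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
# text is now 'hello  world   nlp is fun ' - the extra spaces disappear once we tokenize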
Tokenization
The name “token” is just a fancy word for “symbol”. In the case of NLP, our symbols are usually words, so “tokenization” just means splitting our text into the individual words.
We could consider doing this in Python simply with "text.split()", but the NLTK word_tokenize function handles this task in a smarter way.
To use NLTK, first install the nltk package (e.g. with pip) and then download the data it needs:
import nltk
nltk.download()  # opens an interactive downloader; specific resources can also be fetched, e.g. nltk.download('punkt')
Now import the word_tokenize and sent_tokenize functions:
from nltk import word_tokenize, sent_tokenize
Now tokenizing the text into words can be done easily:
# Split text into words using NLTK
word_tokenize(text)
or into sentences with
# Split text into sentences
sent_tokenize(text)
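To see why word_tokenize is smarter than a plain split, here is a quick comparison on a made-up sentence (the outputs in the comments are indicative):
text = "Dr. Smith isn't here."
text.split()          # ['Dr.', 'Smith', "isn't", 'here.']
word_tokenize(text)   # ['Dr.', 'Smith', 'is', "n't", 'here', '.']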
Tweet tokenizer
NLTK also comes with a tokenizer designed for Twitter tweets:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
This tokenizer is aware of Twitter handles, hashtags and emoticons. With the strip_handles=True and reduce_len=True arguments it will remove handles and shorten repeated character sequences of length 3 or greater to length 3.
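For example, on a made-up tweet:
tknzr.tokenize("@someuser that is sooooooo cool #nlp :-)")
# ['that', 'is', 'sooo', 'cool', '#nlp', ':-)']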
Removing stop words
Stop words are uninformative words like “is”, “are”, “the” in English or “de”, “los”, “por” in Spanish. We may want to remove them to reduce the vocabulary we have to deal with and the complexity of NLP tasks.
NLTK includes stopword corpora for various languages.
from nltk.corpus import stopwords
The stop words in English can be examined with
print(stopwords.words("english"))
>>> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', ...]
They can be removed from the text with
words = [w for w in words if w not in set(stopwords.words('english'))]
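For example, building the stop word set once and filtering a made-up sentence:
stop_words = set(stopwords.words('english'))
words = word_tokenize("the tree is in the garden")
[w for w in words if w not in stop_words]
# ['tree', 'garden']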
Part of speech tagging
We may want to tag the parts of a sentence depending on whether they are a noun, verb etc. NLTK makes this pretty easy with pos_tag.
from nltk import pos_tag
pos_tag(word_tokenize("I am eating cheese"))
Outputs:
>>> [('I', 'PRP'), ('am', 'VBP'), ('eating', 'VBG'), ('cheese', 'NN')]
To see what each of these symbols means we can use the handy help feature:
nltk.help.upenn_tagset('VBP')
which in this case tells us that
>>>VBP: verb, present tense, not 3rd person singular
predominate wrap resort sue twist spill cure lengthen brush terminate
appear tend stray glisten obtain comprise detest tease attract
emphasize mold postpone sever return wag ...
Named Entity Recognition
Named entities are typically noun phrases that refer to some specific object, person or place. NLTK has the ne_chunk function to label named entities in text. Note that the text must first be tokenized and tagged.
from nltk import ne_chunk
print(ne_chunk(pos_tag(word_tokenize("Richard Feynman was a professor at Caltech"))))
Outputs:
>>>(S
(PERSON Richard/NNP)
(PERSON Feynman/NNP)
was/VBD
a/DT
professor/NN
at/IN
(ORGANIZATION Caltech/NNP))
Stemming and Lemmatization
Stemming
Stemming is the process of reducing a word to its stem or root form.
For example, “fishing”, “fished”, and “fisher” reduce to the stem “fish”.
Stemming just uses simple rules (e.g. dropping suffixes like “ing” and “ed”) to strip the last few characters and replace a set of words with a common stem. The resulting stem may not be a meaningful word in the given language. For example, the Porter algorithm reduces “argue”, “argued”, “argues”, “arguing”, and “argus” to the stem “argu”.
The purpose of stemming is to reduce complexity whilst retaining the essence of words.
Once again, NLTK makes this process easy for us:
from nltk.stem.porter import PorterStemmer
# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
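For example, applying the Porter stemmer to the “argue” family mentioned above:
[PorterStemmer().stem(w) for w in ["argue", "argued", "argues", "arguing"]]
# ['argu', 'argu', 'argu', 'argu']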
Lemmatization
Lemmatization is another technique to reduce words to a normalized form, but in this case the transformation actually uses a dictionary to map different variants of a word back to its root. For example “is”, “was”, “were” may all become “be”.
The default lemmatizer in NLTK uses the WordNet database, a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms.
Implementing lemmatization with NLTK looks like:
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
By default the lemmatizer assumes the pos (“part of speech”) of each word is a noun. For example, the verb boring would not be reduced to bore, so we may want to run the lemmatizer again with pos='v' (verbs) to ensure verbs also get lemmatized appropriately:
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
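For example (outputs shown in the comments; exact results depend on the WordNet data installed):
WordNetLemmatizer().lemmatize('was', pos='v')     # 'be'
WordNetLemmatizer().lemmatize('boring', pos='v')  # 'bore'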
Bag of words
The bag of words representation treats each document as an unordered collection or “bag” of words.
To obtain a bag of words from a piece of raw text, first apply text processing steps: cleaning, normalizing, tokenizing, stemming, lemmatization etc. Next turn each document into a vector of numbers representing how many times each word appears in a document.
First collect all unique words in the corpus to form the vocabulary, arrange the words in some order and let them form the vector components before counting the number of each word in each document. For example:
|  | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|---|---|---|---|---|---|---|---|---|
| “Little House on the Prairie” | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| “Mary had a Little Lamb” | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| “The Silence of the Lambs” | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| “Twinkle Twinkle Little Star” | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
We could now compare how similar two document vectors are using the cosine similarity:
\[\cos{(\theta)} = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}\]
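As a minimal sketch of this calculation (using NumPy, with two of the count vectors from the table above):
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two document vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

mary = np.array([1, 0, 0, 1, 1, 0, 0, 0])       # "Mary had a Little Lamb"
silence = np.array([0, 0, 0, 0, 1, 1, 0, 0])    # "The Silence of the Lambs"
cosine_similarity(mary, silence)                 # ~0.41 - only the "lamb" component is shared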
CountVectorizer
We can use the CountVectorizer from scikit-learn to implement bag of words:
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # tokenize text
    tokens = word_tokenize(text)
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens

# initialize count vectorizer object
vect = CountVectorizer(tokenizer=tokenize)
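As a hypothetical example corpus we could reuse the four titles from the table above:
corpus = ["Little House on the Prairie",
          "Mary had a Little Lamb",
          "The Silence of the Lambs",
          "Twinkle Twinkle Little Star"]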
Then we can get the counts of each word with
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)
and view the counts of each word as
# view token vocabulary and counts
vect.vocabulary_
TF-IDF
One limitation of the bag-of-words approach is that it treats every word as equally important. In reality some words occur frequently throughout a corpus; in financial documents, for example, the word “price” may appear all over the place, so many appearances of that word in a given document do little to characterize the document.
We can compensate by counting the number of documents in which each term occurs (the “document frequency”) and dividing the term frequency by the document frequency. This weights the words in a given document by how unique they are to that particular document, and therefore how well they characterize it.
For example:
|  | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
|---|---|---|---|---|---|---|---|---|
| “Little House on the Prairie” | 1/3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| “Mary had a Little Lamb” | 1/3 | 0 | 0 | 1 | 1/2 | 0 | 0 | 0 |
| “The Silence of the Lambs” | 0 | 0 | 0 | 0 | 1/2 | 1 | 0 | 0 |
| “Twinkle Twinkle Little Star” | 1/3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
The TF-IDF transform is simply the product of 2 weights:
\[\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)\]
where $\text{tf}$ is the term frequency: the number of times a term $t$ appears in the given document $d$, divided by the total number of terms in $d$. The $\text{idf}$ is the inverse document frequency: the total number of documents in the corpus $D$ divided by the number of documents in which the term $t$ appears. Commonly the logarithm of this ratio is taken to smooth the resulting values:
\[\text{idf}(t, D) = \log{\left(\frac{|D|}{|\{d \in D: t \in d \}|}\right)}\]
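For example, in the toy corpus above the stem “lamb” appears in 2 of the 4 documents, so (taking natural logarithms) its inverse document frequency would be
\[\text{idf}(\text{lamb}, D) = \log{(4/2)} \approx 0.69\]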
This can be done with the scikit-learn TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()
# compute bag of word counts and tf-idf values
X = vectorizer.fit_transform(corpus)
# convert sparse matrix to numpy array to view
X.toarray()
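To see which term each column of X corresponds to, recent versions of scikit-learn (1.0+) provide:
# map column indices back to terms
vectorizer.get_feature_names_out()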
Word Embeddings
We could just represent words with one-hot encoding, but as the size of the vocabulary grows the dimension of the embedding space would also grow extremely large.
What we would like is to represent words in a fixed-size vector space, ideally such that two words that are close in meaning are also close in the vector space (for example as measured by cosine similarity).
Word2Vec
Word2Vec is one of the most popular embeddings used in practice.
Word2Vec uses a neural-network model that is able to predict a word given its neighbouring words, or vice versa, predict the neighbouring words given the word.
the quick brown fox <jumps> over the lazy dog
The continuous bag of words (CBoW) model takes the neighbouring words and tries to predict the missing word, whereas the skip-gram model takes the middle word and tries to predict the neighbouring words.
This prediction task is only an ancillary task; what we really want is the representation of the word that can be read off from the hidden layer of the network, and it is this that forms the word embedding.
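As a minimal sketch of training such a model with the gensim library (gensim is not used elsewhere in this post, and the parameter names assume gensim 4.x):
from gensim.models import Word2Vec

# each document is a list of tokens, e.g. the output of the tokenize() function above
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
             ["the", "lazy", "dog", "sleeps"]]

# sg=0 selects the CBoW objective, sg=1 the skip-gram objective
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

model.wv["fox"]                # the 50-dimensional embedding for "fox"
model.wv.most_similar("dog")   # nearest words by cosine similarity
In practice we would train on a much larger tokenized corpus, or load pretrained vectors.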
Pipeline
We can put all this together into an sklearn Pipeline with the following code:
import nltk
# download the resources needed by word_tokenize and WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def load_data():
    """
    Load your data
    .
    .
    .
    """
    return X, y


def tokenize(text):
    """
    Clean and tokenize
    """
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens


def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])

    # train classifier
    pipeline.fit(X_train, y_train)

    # predict on test data
    y_pred = pipeline.predict(X_test)

    # display results
    accuracy = (y_pred == y_test).mean()
    print("Accuracy:", accuracy)


main()