Aquileo | Lemmatization with NLTK

Lemmatization is a text preprocessing technique in Natural Language Processing (NLP) that converts words into their base or dictionary form called a lemma. Unlike stemming, it considers the meaning and part of speech of words, making the output more accurate and meaningful.

Lemmatization Techniques

There are several techniques for performing lemmatization, each with its own advantages and use cases.

1. Rule-Based Lemmatization

In rule-based lemmatization, predefined rules are applied to a word to remove suffixes and get the root form. This approach works well for regular words but may not handle irregularities well.

For example:

Rule: For regular verbs ending in "-ed," remove the "-ed" suffix.
Example: "walked" -> "walk"

While this method is simple and interpretable, it doesn't account for irregular word forms like "better", which should be lemmatized to "good".
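The idea can be sketched in a few lines of plain Python. The suffix rules and the `rule_based_lemma` helper below are illustrative assumptions, not the rules of any particular library:

```python
# Illustrative suffix rules (hypothetical): (suffix, replacement),
# tried in order, so the more specific suffixes come first.
SUFFIX_RULES = [
    ("ies", "y"),  # "studies" -> "study"
    ("ed", ""),    # "walked"  -> "walk"
    ("s", ""),     # "cats"    -> "cat"
]


def rule_based_lemma(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching suffix rule; otherwise return the word as-is."""
    lowered = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = lowered[: len(lowered) - len(suffix)]
        # Require a minimum stem length so short words are not over-stripped.
        if lowered.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return lowered


for w in ["walked", "cats", "studies", "better", "went"]:
    print(w, "->", rule_based_lemma(w))
```

Regular forms are reduced as expected, while "better" and "went" pass through unchanged because no suffix rule describes them.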

2. Dictionary-Based Lemmatization

This approach uses a predefined dictionary or lexicon, such as WordNet, to look up the base form of a word. It is more accurate than rule-based lemmatization because the dictionary can list exceptions and irregular words explicitly.

For example:

'running' -> 'run'
'better' -> 'good'
'went' -> 'go'
Applied word by word: "I was running to become a better athlete and then I went home." -> "I be run to become a good athlete and then I go home."

By using dictionaries like WordNet, this method can handle a wide range of words effectively, especially in languages with well-established dictionaries.
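A minimal sketch of the lookup idea, assuming a tiny hand-written lexicon in place of a real resource such as WordNet (the `LEMMA_LEXICON` table and `dictionary_lemma` helper are hypothetical):

```python
# Toy lexicon standing in for a real dictionary; the entries are illustrative.
LEMMA_LEXICON = {
    "running": "run",
    "better": "good",
    "went": "go",
    "was": "be",
    "children": "child",
}


def dictionary_lemma(word: str) -> str:
    """Look the word up case-insensitively; fall back to the word itself."""
    return LEMMA_LEXICON.get(word.lower(), word)


sentence = "I was running to become a better athlete and then I went home"
print(" ".join(dictionary_lemma(token) for token in sentence.split()))
```

Words that are missing from the lexicon are returned unchanged, so the quality of this method depends directly on how complete the dictionary is.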

3. Machine Learning-Based Lemmatization

This approach uses models trained on large datasets of words paired with their lemmas to predict the base form of a word automatically. Because the model learns patterns from data, it can generalize to words that are missing from a dictionary and can handle irregular forms that fixed rules miss.

For example:

A trained model may deduce that “went” corresponds to “go” even though the suffix removal rule doesn’t apply. Similarly, for 'happier' the model deduces 'happy' as the lemma.

Machine learning-based lemmatizers are more adaptive and can generalize across different word forms, which makes them well suited to complex tasks involving diverse vocabularies.
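To make "learning from data" concrete, here is a toy sketch that counts suffix rewrite rules in a handful of (word, lemma) pairs and reuses the most frequent rule on unseen words. The training pairs and the `train` and `predict` helpers are assumptions for illustration. Real systems typically use trained classifiers or neural models rather than this simple counting scheme.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (inflected form, lemma) pairs.
TRAINING_PAIRS = [
    ("walked", "walk"), ("jumped", "jump"),
    ("happier", "happy"), ("easier", "easy"),
    ("cats", "cat"), ("dogs", "dog"),
    ("went", "go"),
]


def common_prefix_len(a, b):
    """Length of the longest prefix shared by both strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def train(pairs, max_context=2):
    """Count suffix rewrites (form suffix -> lemma suffix) seen in the pairs."""
    rules = defaultdict(Counter)
    for form, lemma in pairs:
        split = common_prefix_len(form, lemma)
        # Store each rule with 0..max_context characters of left context.
        for ctx in range(max_context + 1):
            start = split - ctx
            if start < 0:
                break
            rules[form[start:]][lemma[start:]] += 1
    return rules


def predict(word, rules):
    """Rewrite the longest known suffix with its most frequent replacement."""
    for start in range(len(word)):  # longest suffix first
        suffix = word[start:]
        if suffix in rules:
            replacement = rules[suffix].most_common(1)[0][0]
            return word[:start] + replacement
    return word  # no learned rule applies


rules = train(TRAINING_PAIRS)
for w in ["funnier", "talked", "went", "better"]:
    print(w, "->", predict(w, rules))
```

The unseen words "funnier" and "talked" are handled by rules learned from similar examples, and "went" is covered because it appeared in training. "better" stays unchanged because nothing like it was in the training pairs, which shows how much this kind of model depends on its data.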

Implementation of Lemmatization in Python

Let's see step by step how lemmatization works in Python:

Step 1: Installing NLTK and Downloading Necessary Resources

In Python, the NLTK library provides an easy way to implement lemmatization. First, we need to install NLTK and download the resources used in this article: the punkt tokenizer, WordNet, the Open Multilingual WordNet data and the English POS tagger.

Python

!pip install nltk

Now let's import the library and download the necessary resources.

Python

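import nltk

nltk.download('punkt_tab')                       # tokenizer data used by word_tokenize
nltk.download('wordnet')                         # WordNet lexicon used by WordNetLemmatizer
nltk.download('omw-1.4')                         # Open Multilingual WordNet data
nltk.download('averaged_perceptron_tagger_eng')  # English POS tagger used by pos_tag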

Step 2: Lemmatizing Text with NLTK

Now we can tokenize the text and apply lemmatization using NLTK's WordNetLemmatizer.

Python

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
text = "The cats were running faster than the dogs."
tokens = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

print(f"Original Text: {text}")
print(f"Lemmatized Words: {lemmatized_words}")

In the printed output, we can see that:

"cats" is reduced to its lemma "cat" (noun).
"running" remains "running" (since no POS tag is provided, NLTK doesn't convert it to "run").

Step 3: Improving Lemmatization with Part of Speech (POS) Tagging

To improve the accuracy of lemmatization, we need to specify the correct Part of Speech (POS) for each word. When no POS tag is provided, NLTK treats every word as a noun, so verbs and adjectives are often left unchanged.
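The effect of the POS argument can be checked on a single word before tagging a whole sentence. The small helper below is a sketch: `lemma_by_pos` is a hypothetical name, and it only assumes an object with a `lemmatize(word, pos=...)` method, such as NLTK's `WordNetLemmatizer`.

```python
def lemma_by_pos(word, lemmatizer, pos_tags=("n", "v", "a", "r")):
    """Return {pos: lemma} by calling lemmatizer.lemmatize(word, pos=pos)
    for each WordNet POS code (noun, verb, adjective, adverb)."""
    return {pos: lemmatizer.lemmatize(word, pos=pos) for pos in pos_tags}


# Usage with NLTK (needs the 'wordnet' resource from Step 1):
#   from nltk.stem import WordNetLemmatizer
#   print(lemma_by_pos("running", WordNetLemmatizer()))
#   print(lemma_by_pos("better", WordNetLemmatizer()))
```

Comparing the entries for each POS code shows which reading of the word produces which lemma.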

For example:

"running" (as a verb) should be lemmatized to "run".
"better" (as an adjective) should be lemmatized to "good".

Python

from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
sentence = "The children are running towards a better place."
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)  # list of (token, Penn Treebank tag) pairs

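def get_wordnet_pos(tag):
    """Map a Penn Treebank tag to the WordNet POS code used by lemmatize()."""
    if tag.startswith('J'):
        return 'a'  # adjective
    elif tag.startswith('V'):
        return 'v'  # verb
    elif tag.startswith('N'):
        return 'n'  # noun
    elif tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # fall back to noun, the lemmatizer's default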

# Forms of "be" used as auxiliaries are kept as-is so the sentence stays
# readable (lemmatizing them as verbs would turn them into "be").
AUXILIARIES = {'is', 'am', 'are'}

lemmatized_sentence = []
for word, tag in tagged_tokens:
    if word.lower() in AUXILIARIES:
        lemmatized_sentence.append(word)
    else:
        lemmatized_sentence.append(
            lemmatizer.lemmatize(word, get_wordnet_pos(tag)))

print("Original Sentence: ", sentence)
print("Lemmatized Sentence: ", ' '.join(lemmatized_sentence))


In this improved version:

"children" is lemmatized to "child" (noun).
"running" is lemmatized to "run" (verb).
"better" is lemmatized to "good" (adjective).

Advantages

Reduces the number of unique words in datasets
Improves memory and computational efficiency
Enhances search and information retrieval accuracy
Makes text data more consistent for NLP models
Can improve model accuracy by grouping inflected forms of the same word under one lemma
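The first point is easy to check by counting distinct tokens before and after lemmatization. The sketch below uses a hypothetical lemma table (`TOY_LEMMAS`) in place of a real lemmatizer so that it runs without any downloads:

```python
def vocabulary_size(tokens, lemmatize=None):
    """Count distinct lowercased tokens, optionally after lemmatizing them."""
    if lemmatize is None:
        return len({t.lower() for t in tokens})
    return len({lemmatize(t.lower()) for t in tokens})


# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"cats": "cat", "runs": "run", "running": "run", "ran": "run"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


tokens = "cat cats run runs running ran".split()
print("Unique words:", vocabulary_size(tokens))
print("Unique lemmas:", vocabulary_size(tokens, toy_lemmatize))
```

Six surface forms collapse into two lemmas here. On real text the reduction depends on the language and on the lemmatizer used.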

Disadvantages

Slower than stemming because of dictionary and grammar analysis
Not ideal for real-time applications requiring fast processing
Can return the wrong lemma for words with multiple meanings when the POS tag is missing or incorrect
Requires more computational resources compared to simpler techniques
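One common way to soften the speed cost is to cache lemmas, since the same words repeat often in real text. This is a sketch under assumptions: `make_cached_lemmatizer` is a hypothetical helper, and `slow_lemmatize` is a stand-in for a real lemmatizer function.

```python
from functools import lru_cache


def make_cached_lemmatizer(lemmatize, maxsize=100_000):
    """Wrap a (word, pos) -> lemma function with an in-memory LRU cache."""
    @lru_cache(maxsize=maxsize)
    def cached(word, pos="n"):
        return lemmatize(word, pos)
    return cached


calls = []


def slow_lemmatize(word, pos):
    """Hypothetical stand-in for a real lemmatizer; records each real call."""
    calls.append(word)
    return word[:-1] if word.endswith("s") else word


cached = make_cached_lemmatizer(slow_lemmatize)
for token in ["cats", "dogs", "cats", "cats"]:
    cached(token)

print("Underlying calls:", len(calls))  # one call per distinct word
```

With NLTK, the same wrapper could be applied as `make_cached_lemmatizer(WordNetLemmatizer().lemmatize)`. How much time this saves depends on how repetitive the input text is.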
