Lemmatization is a text preprocessing technique in Natural Language Processing (NLP) that converts words into their base or dictionary form called a lemma. Unlike stemming, it considers the meaning and part of speech of words, making the output more accurate and meaningful.

Lemmatization Techniques
There are different techniques to perform lemmatization each with its own advantages and use cases
1. Rule Based Lemmatization
In rule-based lemmatization, predefined rules are applied to a word to remove suffixes and get the root form. This approach works well for regular words but may not handle irregularities well.
For example:
Rule: For regular verbs ending in "-ed," remove the "-ed" suffix.
Example: "walked" -> "walk"
While this method is simple and interpretable, it doesn't account for irregular word forms like "better" which should be lemmatized to "good".
2. Dictionary-Based Lemmatization
It uses a predefined dictionary or lexicon such as WordNet to look up the base form of a word. This method is more accurate than rule-based lemmatization because it accounts for exceptions and irregular words.
For example:
- 'running' -> 'run'
- 'better' -> 'good'
- 'went' -> 'go
"I was running to become a better athlete and then I went home," -> "I was run to become a good athlete and then I go home."
By using dictionaries like WordNet this method can handle a range of words effectively, especially in languages with well-established dictionaries.
3. Machine Learning-Based Lemmatization
It uses algorithms trained on large datasets to automatically identify the base form of words. This approach is highly flexible and can handle irregular words and linguistic nuances better than the rule-based and dictionary-based methods.
For example:
A trained model may deduce that “went” corresponds to “go” even though the suffix removal rule doesn’t apply. Similarly, for 'happier' the model deduces 'happy' as the lemma.
Machine learning-based lemmatizers are more adaptive and can generalize across different word forms which makes them ideal for complex tasks involving diverse vocabularies.
Implementation of Lemmatization in Python
Lets see step by step how Lemmatization works in Python:
Step 1: Installing NLTK and Downloading Necessary Resources
In Python, the NLTK library provides an easy and efficient way to implement lemmatization. First, we need to install the NLTK library and download the necessary datasets like WordNet and the punkt tokenizer.
!pip install nltk
Now lets import the library and download the necessary datasets.
import nltk
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')
Step 2: Lemmatizing Text with NLTK
Now we can tokenize the text and apply lemmatization using NLTK's WordNetLemmatizer.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
text = "The cats were running faster than the dogs."
tokens = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
print(f"Original Text: {text}")
print(f"Lemmatized Words: {lemmatized_words}")
Output:

In this output, we can see that:
- "cats" is reduced to its lemma "cat" (noun).
- "running" remains "running" (since no POS tag is provided, NLTK doesn't convert it to "run").
Step 3: Improving Lemmatization with Part of Speech (POS) Tagging
To improve the accuracy of lemmatization, it’s important to specify the correct Part of Speech (POS) for each word. By default, NLTK assumes that words are nouns when no POS tag is provided. However, it can be more accurate if we specify the correct POS tag for each word.
For example:
- "running" (as a verb) should be lemmatized to "run".
- "better" (as an adjective) should be lemmatized to "good".
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
sentence = "The children are running towards a better place."
tokens = word_tokenize(sentence)
tagged_tokens = pos_tag(tokens)
def get_wordnet_pos(tag):
if tag.startswith('J'):
return 'a'
elif tag.startswith('V'):
return 'v'
elif tag.startswith('N'):
return 'n'
elif tag.startswith('R'):
return 'r'
else:
return 'n'
lemmatized_sentence = []
for word, tag in tagged_tokens:
if word.lower() == 'are' or word.lower() in ['is', 'am']:
lemmatized_sentence.append(word)
else:
lemmatized_sentence.append(
lemmatizer.lemmatize(word, get_wordnet_pos(tag)))
print("Original Sentence: ", sentence)
print("Lemmatized Sentence: ", ' '.join(lemmatized_sentence))
Output:

In this improved version:
- "children" is lemmatized to "child" (noun).
- "running" is lemmatized to "run" (verb).
- "better" is lemmatized to "good" (adjective).
Download code from here
Advantages
- Reduces the number of unique words in datasets
- Improves memory and computational efficiency
- Enhances search and information retrieval accuracy
- Makes text data more consistent for NLP models
- Improves prediction accuracy and context understanding
Disadvantages
- Slower than stemming because of dictionary and grammar analysis
- Not ideal for real-time applications requiring fast processing
- Can produce ambiguous results for words with multiple meanings
- Requires more computational resources compared to simpler techniques