Aquileo | Introduction to Stemming

Stemming is a text preprocessing technique in NLP that reduces words to a root or base form (the stem), typically by stripping suffixes and, in some algorithms, prefixes. It simplifies and standardizes text, which makes text analysis and processing more efficient.

Simplifies words into a common root form
Improves text processing and analysis efficiency
Commonly used in text classification and information retrieval
Helps reduce redundancy in text data
May sometimes reduce readability or produce inaccurate root words
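
To make the idea concrete, here is a deliberately naive sketch of suffix stripping. The `naive_stem` function and its three-entry suffix list are assumptions made for illustration only; real stemmers such as Porter's apply many more rules and conditions.

```python
# Toy suffix stripper for illustration only; not one of NLTK's algorithms.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule list


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["jumps", "jumped", "jumping", "running"]
stems = [naive_stem(w) for w in words]
print(stems)  # ['jump', 'jump', 'jump', 'runn']
```

The three forms of "jump" collapse into one stem, while "running" becomes "runn", which is not a dictionary word. Both effects appear in the list above.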

Types of Stemmer in NLTK

Python's NLTK (Natural Language Toolkit) provides several stemming algorithms, each suited to different scenarios and languages. Let's look at some of the most commonly used stemmers:

1. Porter's Stemmer

Porter's Stemmer is one of the most widely used stemming algorithms in NLP. It removes common suffixes from English words using a set of predefined rules to produce root forms.

Simple, fast and efficient stemming algorithm
Removes common suffixes from words
Mainly designed for the English language
Widely used in text preprocessing tasks
Stemmed words may not always be meaningful dictionary words

Example

'agreed' → 'agree'
Rule: If a word ends in EED and the part before the suffix contains at least one vowel followed by a consonant, replace EED with EE.
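
The rule above can be sketched in plain Python. This is a minimal illustration of this single rule, not NLTK's implementation; the helper names `is_consonant`, `measure` and `apply_eed_rule` are assumptions introduced here. The `measure` function counts vowel-to-consonant transitions in the part of the word before the suffix.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Return True if word[i] acts as a consonant in Porter's sense."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions in the stem."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def apply_eed_rule(word):
    """Replace EED with EE only when the preceding part has measure > 0."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word


print(apply_eed_rule("agreed"))  # agree
print(apply_eed_rule("feed"))    # feed
```

"agreed" qualifies because "agr" contains a vowel followed by a consonant, whereas "feed" is left alone because "f" contains no vowel at all.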

Advantages

Very fast and efficient.
Commonly used for tasks like information retrieval and text mining.

Limitations

Outputs may not always be real words.
Limited to English words.

Now let's implement Porter's Stemmer in Python using the NLTK library.

Python

from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

words = ["running", "jumps", "happily", "running", "happily"]

stemmed_words = [porter_stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words:", stemmed_words)

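Output:

Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']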

2. Snowball Stemmer

Snowball Stemmer, whose English variant is also known as Porter2, is a refinement of Porter’s Stemmer by the same author. It is generally described as faster and slightly more aggressive than the original, and it supports multiple languages.

Improved and faster version of Porter’s Stemmer
Removes suffixes more effectively
Supports multiple languages
Commonly used for multilingual text processing
Produces more consistent stemming results

Example

'running' → 'run'
'quickly' → 'quick'

Advantages

More efficient than Porter Stemmer.
Supports multiple languages.

Limitations

More aggressive which might lead to over-stemming.

Now let's implement Snowball Stemmer in Python using the NLTK library.

Python

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language='english')

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)

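Output:

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox']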
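
As a further sketch of the multilingual support mentioned above, the snippet below lists the languages NLTK's `SnowballStemmer` accepts and stems one German word. It also compares the English stemmer with the original Porter rules, which NLTK exposes through the same class under the name "porter". The example words are chosen here for illustration.

```python
from nltk.stem import SnowballStemmer

# Languages bundled with NLTK's Snowball implementation
print(SnowballStemmer.languages)

# The original Porter algorithm is also available under the name "porter"
porter_like = SnowballStemmer("porter")
english = SnowballStemmer("english")
german = SnowballStemmer("german")

print(porter_like.stem("generously"))  # gener
print(english.stem("generously"))      # generous
print(german.stem("Autobahnen"))       # autobahn
```

For "generously", the English Snowball stemmer keeps more of the word than the original Porter rules do, so which of the two strips more depends on the word.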

3. Lancaster Stemmer

Lancaster Stemmer is a fast and aggressive stemming algorithm that reduces words to very short root forms using iterative rule-based processing.

Faster and more aggressive than many other stemmers
Applies rules repeatedly to reduce words
Can produce very short or distorted stems
Useful for applications requiring fast text processing
May reduce accuracy due to over-stemming

Example

'running' → 'run'
'happily' → 'happy'

Advantages

Very fast.
Good for smaller datasets or quick preprocessing.

Limitations

Aggressive which can result in over-stemming.
Less efficient than Snowball in larger datasets.

Now let's implement Lancaster Stemmer in Python using the NLTK library.

Python

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)

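Output:

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']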
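
To see how aggressive Lancaster is relative to the other two stemmers, a side-by-side comparison helps. The sketch below runs the same words through all three; the word list is an assumption chosen for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["maximum", "cement", "owed", "saying"]

# Stem every word with every stemmer and print the results side by side
results = {name: [s.stem(w) for w in words] for name, s in stemmers.items()}
for name, stems in results.items():
    print(f"{name:>9}: {stems}")
```

With NLTK's default rules, Lancaster reduces these words to "maxim", "cem", "ow" and "say". Comparing that row with the other two shows how much further it cuts.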

4. Regexp Stemmer

Regexp Stemmer is a flexible stemming algorithm that uses regular expressions (regex) to remove word suffixes based on custom rules.

Allows custom stemming rules using regex
Useful for domain-specific text processing
Provides flexible and rule-based stemming
Requires manual rule creation
Can be slower for large datasets

Example

'running' → 'runn'
Custom rule: r'ing$' removes the suffix ing.

Advantages

Highly customizable using regular expressions.
Suitable for domain-specific tasks.

Limitations

Requires manual rule definition.
Can be computationally expensive for large datasets.

Now let's implement Regexp Stemmer in Python using the NLTK library.

Python

from nltk.stem import RegexpStemmer

custom_rule = r'ing$'
regexp_stemmer = RegexpStemmer(custom_rule)

word = 'running'
stemmed_word = regexp_stemmer.stem(word)

print(f'Original Word: {word}')
print(f'Stemmed Word: {stemmed_word}')

Output:

Original Word: running
Stemmed Word: runn
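
A single pattern can also combine several suffixes, and the `min` parameter of `RegexpStemmer` tells it to leave words shorter than the given length untouched. The sketch below uses a pattern and word list picked for illustration.

```python
from nltk.stem import RegexpStemmer

# Strip any of several suffixes, but skip words shorter than 4 characters
stemmer = RegexpStemmer(r"ing$|s$|e$|able$", min=4)

words = ["cars", "mass", "was", "bee", "compute", "advisable"]
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['car', 'mas', 'was', 'bee', 'comput', 'advis']
```

"was" and "bee" are protected by `min=4`, while "mass" loses its final "s". This shows why hand-written rules need care.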

5. Krovetz Stemmer

Krovetz Stemmer (also called KStem) is a linguistically aware stemming algorithm developed by Robert Krovetz. It combines suffix-removal rules with dictionary lookups so that stems stay close to real words and keep the meaning of the original.

Preserves linguistic meaning more accurately
Handles plural and tense conversions effectively
Produces more meaningful root words
Slower compared to many other stemmers
May be less efficient for very large datasets

Example

'children' → 'child'
'running' → 'run'

Advantages

More accurate, as it preserves linguistic meaning.
Works well with both singular/plural and past/present tense conversions.

Limitations

May be inefficient with large corpora.
Slower compared to other stemmers.

Note: The Krovetz Stemmer is not natively available in the NLTK library, unlike other stemmers such as Porter, Snowball or Lancaster.

Stemming vs. Lemmatization

The table below summarizes the differences between stemming and lemmatization:

Stemming	Lemmatization
Reduces words to their root form, which may not be a valid word	Reduces words to their base form (lemma), producing a meaningful word
Uses simple rule-based methods	Considers meaning and context of the word
Faster and simpler process	More accurate but computationally heavier
Does not consider part of speech or context	Uses part of speech and context
May generate non-dictionary words	Produces valid dictionary words
Example: studies → studi	Example: studies → study; better → good
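
The contrast in the table can be sketched without any external data. Both functions below are toy stand-ins: `toy_stem` strips a few suffixes blindly, while `toy_lemmatize` looks up a (word, part of speech) pair in a small hand-written lexicon. The function names and the lexicon are assumptions for illustration; a real lemmatizer relies on a full dictionary.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech)
LEXICON = {
    ("better", "adj"): "good",
    ("studies", "noun"): "study",
    ("studies", "verb"): "study",
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
}


def toy_stem(word):
    """Rule-based: strip a suffix without looking at context."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word, pos):
    """Dictionary-based: the answer depends on the part of speech."""
    return LEXICON.get((word, pos), word)


print(toy_stem("studies"), toy_lemmatize("studies", "noun"))  # studi study
print(toy_stem("meeting"), toy_lemmatize("meeting", "noun"))  # meet meeting
print(toy_stem("meeting"), toy_lemmatize("meeting", "verb"))  # meet meet
print(toy_stem("better"), toy_lemmatize("better", "adj"))     # better good
```

The stemmer gives "meeting" the same stem whether it is a noun or a verb, while the lookup returns a different lemma for each part of speech.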

Advantages

Normalizes different word forms into a common root
Reduces text dimensionality and improves efficiency
Enhances search and information retrieval performance
Simplifies processing of large text datasets
Improves machine learning and text analysis tasks

Applications

Improves search engine and information retrieval results
Reduces feature space in text classification tasks
Helps group similar documents in document clustering
Enhances sentiment analysis by handling word variations
Improves efficiency in processing large text datasets
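
The feature-space reduction mentioned above can be illustrated on a tiny corpus. This is a minimal sketch that assumes whitespace tokenization and two made-up documents; a real pipeline would use a proper tokenizer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

docs = [
    "connect connected connecting",
    "connection connections connect",
]

# Vocabulary before and after stemming (whitespace tokenization for brevity)
raw_vocab = {tok for doc in docs for tok in doc.split()}
stemmed_vocab = {stemmer.stem(tok) for tok in raw_vocab}

print(len(raw_vocab), sorted(raw_vocab))
print(len(stemmed_vocab), sorted(stemmed_vocab))
```

Five distinct surface forms collapse into the single feature "connect". That means fewer columns in a bag-of-words matrix, and a query for one form matches documents containing the others.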

Limitations

Over-stemming may reduce words too aggressively and change their meaning
Under-stemming may fail to group related words into the same root form
Ignores context and semantic meaning of words
Can affect accuracy in tasks like sentiment analysis
Different stemmers may produce different results for the same word
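
Over-stemming and under-stemming are easy to reproduce with Porter's Stemmer. The word groups in this sketch are examples chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: words with different meanings end up with one stem
over = [stemmer.stem(w) for w in ["universe", "university", "universal"]]
print(over)  # ['univers', 'univers', 'univers']

# Under-stemming: related forms keep different stems
under = [stemmer.stem(w) for w in ["alumnus", "alumni"]]
print(under)
```

The first group is merged even though the words mean different things, while the two forms of the same noun in the second group are not merged.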
