+91 88606 33966            edu_sales@siriam.in                   Job Opening : On-site Functional Trainer/Instructor | Supply Chain Management (SCM)
Stemming Techniques using NLTK Library

This article focuses on understanding and applying different stemming techniques used in Natural Language Processing (NLP) to reduce words to their root forms. Stemming is an essential preprocessing step that simplifies text data, helping in tasks like text analysis, classification, and search optimization.

Stemming is the process of reducing a word to its linguistic root. The purpose of this reduction is to find a base term, so that a search can be expanded to include all forms of the term. For example, you generally want a search for the word elections to match a document that contains the word election. As long as the two terms stem to the same form, both return in the search.

Method 1: Porter Stemmer Method

The Porter Stemmer is a rule-based stemming algorithm developed by Martin Porter in 1980. It is one of the most widely used stemming techniques in Natural Language Processing (NLP) due to its simplicity and effectiveness in reducing words to their base or root form. The algorithm applies a series of predefined rules to strip suffixes from words, such as -ing, -ly, -ed, and -ness, to produce a common stem.

Limitations

  • Over-stemming: Sometimes reduces words too aggressively, resulting in stems that may not make sense (e.g., “history” → “histori”).
  • Language-Specific: Designed primarily for English and may not perform well with other languages.
  • Context-Agnostic: Does not consider the meaning or context of words.
  • Here you list of words which include verbs, nouns, and adjectives, with different suffixes such as -ing, -s, -en, and -ly and the purpose is to see how the Porter Stemmer reduces each word to its base form.
  • Imported the PorterStemmer class from the NLTK library
  • Create an instance of the PorterStemmer class
  • The stem() method of the PorterStemmer object (stemming) is called with the current word as input and the method applies the Porter stemming algorithm and returns the stemmed form of the word.

Output

The word “history” is reduced to “histori”, which is a side effect of the algorithm’s aggressive suffix removal.

Method-2: Snowball Stemmer

The Snowball Stemmer is an improved version of the Porter Stemmer and is often considered more accurate. Unlike the Porter Stemmer, which is mostly limited to English, the Snowball Stemmer supports multiple languages, including English, French, German, Spanish, and many others. This makes it more versatile for multilingual text processing.

Although it is more accurate than the Porter Stemmer, it may still produce incorrect stems in certain edge cases.

  • Import the SnowballStemmer class from the nltk library.
  • Initialize the SnowballStemmer for the English language by specifying language=’english’. This tells the stemmer that it will be processing English words.
  • Defined a list of words (words_to_stem) that we want to stem.
  • Used a list comprehension to apply the stemmer to each word in the list words_to_stem. The stem() method of the SnowballStemmer object is called for each word, and it returns the root form of that word.

Output

Method 3: RegExp Stemmer

The RegexpStemmer in NLTK is a stemming tool that allows you to create a custom stemming process using regular expressions (regex). Unlike other stemmers like Porter or Snowball, which have predefined rules for removing suffixes, the RegexpStemmer enables you to define your own rules based on patterns in the word endings.

  • The regex pattern ‘ing$|e$|s$|able$’ matches the suffixes “ing”, “e”, “s”, and “able” at the end of words.
  • The min=4 parameter ensures that only words with 4 or more characters are stemmed, preventing over-stemming.

Output

With this you have learned some different stemming techniques

Similar Topics

Exploring Different Tokenization Techniques in NLP using NLTK Library

Stemming Techniques using NLTK Library

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top