Reduce words to their root forms
Stemming is a text normalization technique in natural language processing that reduces words to their root forms. It works primarily by removing affixes from words, so the resulting stem may not be a valid dictionary word.
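The affix-removal idea can be sketched in a few lines of Python. This is a toy illustration only: the function name `toy_stem` and its four-suffix rule set are assumptions made for this example, not a published stemming algorithm.

```python
# Toy suffix-stripping stemmer (illustrative sketch, not a published algorithm).
# The suffix list is an assumed, deliberately tiny rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connected", "connecting", "connects", "making"]:
    print(w, "->", toy_stem(w))
```

The first three words collapse to the same stem, while "making" becomes "mak", which shows how stripping affixes can leave a string that is not a dictionary word.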
Stemming is commonly used for:
- Information retrieval, where words that share a stem are treated as equivalent in order to expand search criteria
- Dimensionality reduction in machine learning applications, where stemming leaves fewer distinct words for a model to track
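Both uses can be illustrated with a small Python sketch. The stemmer here is an assumed toy rule set (not Porter's algorithm), and the two example documents and the `search` helper are hypothetical; the point is only that indexing by stem shrinks the vocabulary and lets a query match related word forms.

```python
from collections import defaultdict

# Assumed toy rule set for illustration; a real system would use a full stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical two-document corpus.
docs = ["connecting the connected servers", "the server connects quickly"]
tokens = [w for d in docs for w in d.split()]

# Dimensionality reduction: count distinct words before and after stemming.
vocab = set(tokens)
stem_vocab = {toy_stem(w) for w in tokens}
print(len(vocab), "distinct words ->", len(stem_vocab), "distinct stems")

# Information retrieval: index documents by stem so related forms match.
index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for token in text.split():
        index[toy_stem(token)].add(doc_id)

def search(query: str) -> list:
    """Return the ids of documents containing any word with the query's stem."""
    return sorted(index.get(toy_stem(query), set()))

print(search("connect"))
```

A query for "connect" matches both documents even though neither contains that exact word form.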
Porter’s Stemming Algorithm
The Porter stemming algorithm, published by Martin Porter in 1980, is one of the most popular stemming approaches for the English language. It is based on simple heuristic suffix-stripping rules, which makes it fast but not always accurate. Many other algorithms have been proposed since, but Porter’s stemming algorithm remains popular due to its speed and simplicity.
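To give a feel for what such heuristic rules look like, the sketch below implements only the first rule group of the algorithm (Step 1a in Porter's paper), which handles plural endings. The function name `porter_step_1a` is chosen for this example, and the remaining steps of the algorithm, including their conditions on stem length, are omitted, so this is not a complete Porter stemmer.

```python
def porter_step_1a(word: str) -> str:
    """Apply only Step 1a of Porter's algorithm (plural endings).

    Rules, tried in order, first match wins:
        SSES -> SS, IES -> I, SS -> SS, S -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The rules are plain string tests with no dictionary lookup, which is why this style of stemmer is fast and why it can produce non-words such as "poni".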
Stemming vs. Lemmatization
Lemmatization is a related but more sophisticated approach. Compared to stemming,
- Lemmatization uses vocabulary and morphological analysis, whereas stemming uses simple heuristic rules
- Lemmatization returns dictionary forms of the words, whereas stemming may result in invalid words
The differences between lemmatization and stemming are shown below.
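One way to see the contrast is to run both approaches on the same words. In this Python sketch, the stemmer is an assumed toy rule set and the lemmatizer is a hypothetical three-entry lookup table standing in for real vocabulary and morphological analysis; neither represents a production implementation.

```python
# Assumed toy suffix rules standing in for a heuristic stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma table standing in for a vocabulary.
LEMMAS = {"studies": "study", "feet": "foot", "making": "make"}

def toy_lemmatize(word: str) -> str:
    """Look the word up in the table; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

print(f"{'word':<10}{'stem':<10}{'lemma':<10}")
for w in ["studies", "feet", "making"]:
    print(f"{w:<10}{toy_stem(w):<10}{toy_lemmatize(w):<10}")
```

The stems "studi" and "mak" are not dictionary words, and the irregular plural "feet" is left unchanged by the suffix rules, whereas the lookup returns "study", "make", and "foot".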
In MATLAB, stemming can be done using the normalizeWords function with its default style option, ‘stem’. To learn more about stemming and building models with text data, see Text Analytics Toolbox™.