Lemmatization is a text normalization technique in natural language processing that reduces words to their dictionary forms, known as lemmas. For example, “building has floors” reduces to “build have floor” upon lemmatization.
Lemmatization is often used for:
- Information retrieval for expanding search criteria
- Reducing dimensionality of problems in text classification, sentiment analysis, or topic modeling
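As a rough illustration of the dimensionality point, the sketch below counts the distinct word features in a tiny corpus before and after mapping each word to a lemma. The `LEMMAS` table and the `lemmatize` helper are hypothetical stand-ins for a real lemmatizer, and the three sentences are made up for the example.

```python
# Hypothetical lookup table standing in for a real lemmatizer's vocabulary.
LEMMAS = {
    "towers": "tower",
    "floors": "floor",
    "has": "have",
    "had": "have",
    "having": "have",
}

def lemmatize(word):
    """Return the lemma from the toy table, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)

# Made-up corpus for illustration only.
docs = [
    "the tower has floors",
    "the towers had a floor",
    "having floors",
]

# Each distinct token becomes one feature in a bag-of-words model.
raw_vocab = {w for doc in docs for w in doc.split()}
lemma_vocab = {lemmatize(w) for doc in docs for w in doc.split()}

print(len(raw_vocab), "features before lemmatization")
print(len(lemma_vocab), "features after lemmatization")
```

In this toy corpus the vocabulary shrinks from 9 distinct words to 5 lemmas, because inflected variants such as "has", "had", and "having" collapse into a single feature.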
Lemmatization is a common text preprocessing step performed before training machine learning models on text. Lemmatization removes inflectional affixes from words by using a vocabulary and morphological analysis. As a result, the lemma of a word often depends on its part of speech and its context.
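To make the part-of-speech dependence concrete, here is a minimal sketch that looks up lemmas by (word, tag) pairs. The `LEMMA_TABLE` entries, the tag names, and the `lemmatize` function are assumptions made for illustration; a real lemmatizer would combine a full vocabulary with morphological rules and an automatic part-of-speech tagger.

```python
# Hypothetical (word, part-of-speech) table; a real lemmatizer would rely on
# a full vocabulary plus morphological rules rather than a hand-written dict.
LEMMA_TABLE = {
    ("building", "NOUN"): "building",
    ("building", "VERB"): "build",
    ("are", "AUX"): "be",
    ("has", "VERB"): "have",
    ("floors", "NOUN"): "floor",
}

def lemmatize(word, pos):
    """Look up the lemma for a word given its part-of-speech tag.

    Falls back to the lowercased word when the pair is not in the toy table.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

# The tags are supplied by hand here; in practice a tagger would assign them.
tagged = [("They", "PRON"), ("are", "AUX"), ("building", "VERB"), ("floors", "NOUN")]
lemmas = [lemmatize(word, pos) for word, pos in tagged]
print(lemmas)
```

With these hand-assigned tags, "building" maps to "build" because it is used as a verb, while `lemmatize("building", "NOUN")` leaves it as "building".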
A related approach to lemmatization is stemming. Stemming is based on simple heuristic rules, so it is easier to implement and faster than lemmatization. However, stemming often results in roots or word parts that are not actual words, whereas lemmatization is more accurate and returns valid dictionary words. For applications that need to preserve the meanings of words, lemmatization is more useful than stemming.
The differences between lemmatization and stemming are shown below.
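One way to see the contrast in code is the sketch below, which runs a deliberately crude suffix-stripping stemmer next to a dictionary lookup. Both are simplifications: `naive_stem` is not the Porter algorithm or any standard stemmer, and the `LEMMAS` table is a hypothetical stand-in for a real lemmatizer's vocabulary.

```python
# Suffixes tried in order by the toy stemmer (an illustrative choice).
SUFFIXES = ("ing", "ies", "es", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters.

    A crude heuristic for illustration, not a standard stemming algorithm.
    """
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma dictionary covering only the words used below.
LEMMAS = {"studies": "study", "caring": "care", "better": "good", "floors": "floor"}

def lemmatize(word):
    """Return the dictionary form from the toy table, or the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

for word in ["studies", "caring", "better", "floors"]:
    print(f"{word:8} stem={naive_stem(word):8} lemma={lemmatize(word)}")
```

Here the stemmer turns "studies" into "stud" and "caring" into "car", neither of which is the intended dictionary word, and it leaves "better" unchanged. The lookup returns "study", "care", and "good". For a regular plural such as "floors", both approaches agree on "floor".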
To learn more about using lemmatization and building predictive models from text data in MATLAB, see Text Analytics Toolbox™.