Preprocess Text Data
Description
The Preprocess Text Data Live Editor task helps prepare text data for analysis.
You can use the task to control these processing steps:
HTML clean up
Tokenization
Adding token details
Word normalization
Changing and removing words
The Preprocess Text Data Live Editor task generates code that performs the selected preprocessing steps, which you can use to create a preprocessing function for your workflows.
Open the Task
To add the Preprocess Text Data task to a live script in the MATLAB® Editor:
On the Live Editor tab, select Task > Preprocess Text Data.
In a code block in the live script, type a relevant keyword, such as preprocess, clean, or text. Select Preprocess Text Data from the suggested command completions.
Examples
Create Simple Preprocessing Function
This example shows how to create a function that cleans and preprocesses text data for analysis using the Preprocess Text Data Live Editor task.
First, load the factory reports data. The data contains textual descriptions of factory failure events.
tbl = readtable("factoryReports.csv")
Open the Preprocess Text Data Live Editor task. To open the task, begin typing the task name and select Preprocess Text Data from the suggested command completions. Alternatively, on the Live Editor tab, select Task > Preprocess Text Data.
Preprocess the text using these options:
Select tbl as the input data and select the table variable Description.
Tokenize the text using automatic language detection.
To improve lemmatization, add part-of-speech tags to the token details.
Normalize the words using lemmatization.
Remove words with fewer than 3 characters or more than 14 characters.
Remove stop words.
Erase punctuation.
Display the preprocessed text in a word cloud.
The Preprocess Text Data Live Editor task generates code in your live script. The generated code reflects the options that you select and includes code to generate the display. To see the generated code, click Show code at the bottom of the task parameter area. The task expands to display the generated code.
By default, the generated code uses preprocessedText as the name of the output variable returned to the MATLAB workspace. To specify a different output variable name, enter a new name in the summary line at the top of the task.
To reuse the same steps in your code, create a function that takes the text data as input and outputs the preprocessed text data. You can include the function at the end of a script or as a separate file. The preprocessTextData function listed at the end of the example uses the code generated by the Preprocess Text Data Live Editor task.
To use the function, specify the table as input to the preprocessTextData function.
documents = preprocessTextData(tbl);
Preprocessing Function
The preprocessTextData function uses the code generated by the Preprocess Text Data Live Editor task. The function takes as input the table tbl and returns the preprocessed text preprocessedText. The function performs these steps:
Extract the text data from the Description variable of the input table.
Tokenize the text using tokenizedDocument.
Add part-of-speech details using addPartOfSpeechDetails.
Lemmatize the words using normalizeWords.
Remove words with 2 or fewer characters using removeShortWords.
Remove words with 15 or more characters using removeLongWords.
Remove stop words (such as "and", "of", and "the") using removeStopWords.
Erase punctuation using erasePunctuation.
function preprocessedText = preprocessTextData(tbl)

%% Preprocess Text
preprocessedText = tbl.Description;

% Tokenize
preprocessedText = tokenizedDocument(preprocessedText);

% Add token details
preprocessedText = addPartOfSpeechDetails(preprocessedText);

% Change and remove words
preprocessedText = normalizeWords(preprocessedText,Style="lemma");
preprocessedText = removeShortWords(preprocessedText,2);
preprocessedText = removeLongWords(preprocessedText,15);
preprocessedText = removeStopWords(preprocessedText,IgnoreCase=false);
preprocessedText = erasePunctuation(preprocessedText);

end
For an example showing a more detailed workflow, see Preprocess Text Data in Live Editor. For next steps in text analytics, you can try creating a classification model or analyzing the data using topic models. For examples, see Create Simple Text Model for Classification and Analyze Text Data Using Topic Models.
Parameters
Data — Text to preprocess
workspace variable
Text to preprocess, specified as a MATLAB workspace variable. The variable must be a table, string array, or character vector to appear in the list.
If you select a table, then specify the table variable containing the text data in the second drop-down box that appears.
Extract HTML text — Extract text data from HTML tags
off (default) | on
Extract text data from HTML tags.
The generated code uses extractHTMLText.
Remove HTML tags — Remove HTML tags
off (default) | on
Remove HTML tags.
The generated code uses eraseTags.
Decode HTML entities — Convert HTML and XML entities into characters
off (default) | on
Convert HTML and XML entities into characters. For example, convert "&amp;" to "&".
The generated code uses decodeHTMLEntities.
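The three HTML clean-up options can be compared with a short sketch like the following. The HTML snippet is a hypothetical input made up for illustration, and the code assumes Text Analytics Toolbox is installed.

```matlab
% Hypothetical HTML snippet used only for illustration.
code = "<html><body><h1>Report</h1><p>Pump &amp; mixer failed.</p></body></html>";

% Option 1: parse the HTML and extract the text content.
textFromHTML = extractHTMLText(code);

% Option 2: strip the tags first, then decode the remaining entities.
noTags = eraseTags(code);
decoded = decodeHTMLEntities(noTags);

disp(textFromHTML)
disp(decoded)
```

Whether to extract text or only erase tags depends on the input: extraction suits complete HTML pages, while erasing tags and decoding entities may be enough for short fragments with stray markup.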
Language — Text language
Automatic (default) | English | German | Japanese | Korean
Text language, specified as one of these options:
Automatic — Automatic language detection
English — English language
German — German language
Japanese — Japanese language
Korean — Korean language
The generated code uses tokenizedDocument.
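As a rough illustration of what the Language option corresponds to in code, the sketch below tokenizes a made-up sentence once with automatic detection and once with the language fixed to English. The mapping of the English option to the language code "en" is an assumption based on the Language name-value argument of tokenizedDocument.

```matlab
% Hypothetical sample text used only for illustration.
str = "The scanner reel is damaged.";

% Automatic language detection (the task default).
docsAuto = tokenizedDocument(str);

% Explicitly tokenize as English ("en" is assumed to match the English option).
docsEn = tokenizedDocument(str, Language="en");

disp(docsAuto)
disp(docsEn)
```

Fixing the language can help for very short inputs, where automatic detection has little text to work with.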
Split — Text splitting mode
None (default) | Sentences | Paragraphs
Text splitting mode, specified as one of these options:
None — Do not split input.
Sentences — Split input into sentences. This option supports scalar input only. The generated code uses splitSentences.
Paragraphs — Split input into paragraphs. This option supports scalar input only. The generated code uses splitParagraphs.
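To show the effect of the Sentences option, here is a minimal sketch that splits a scalar string before tokenizing. The input string is hypothetical, and the exact code the task generates may differ.

```matlab
% Hypothetical scalar input; the Sentences option supports scalar input only.
str = "The pump failed at startup. The mixer makes a loud noise.";

% Split into one string per sentence, then tokenize each sentence
% as its own document.
sentences = splitSentences(str);
documents = tokenizedDocument(sentences);

disp(documents)
```

After splitting, each sentence becomes a separate document, so later steps such as removing empty documents operate per sentence.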
Add sentence numbers — Option to add sentence numbers
off (default) | on
Option to add sentence numbers to tokens.
The generated code uses addSentenceDetails.
Add part-of-speech tags — Option to add part-of-speech tags
on (default) | off
Option to add part-of-speech tags to tokens.
The generated code uses addPartOfSpeechDetails.
Detect named entities — Option to detect named entities
off (default) | on
Option to detect named entities in tokens.
The generated code uses addEntityDetails.
Parse dependencies — Option to parse dependencies
off (default) | on
Option to parse dependencies in tokens. This option requires the Text Analytics Toolbox™ Model for UDify Data support package.
The generated code uses addDependencyDetails.
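The token detail options add columns to the table returned by tokenDetails. The sketch below, which uses a made-up input string, adds sentence numbers and part-of-speech tags and then inspects the result. Entity detection and dependency parsing are left out here; the latter needs the support package mentioned above.

```matlab
% Hypothetical sample text used only for illustration.
documents = tokenizedDocument("The pump failed at startup. The mixer is noisy.");

% Add sentence numbers and part-of-speech tags to the token details.
documents = addSentenceDetails(documents);
documents = addPartOfSpeechDetails(documents);

% Inspect the details; the added information appears as extra table columns.
tdetails = tokenDetails(documents);
head(tdetails)
```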
Word normalization — Word normalization
Lemma (default) | Stem | None
Word normalization, specified as one of these options:
None — Do not normalize words.
Lemma — Normalize words using lemmatization. This option outputs text in lowercase.
Stem — Normalize words using stemming.
The generated code uses normalizeWords.
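The difference between the Lemma and Stem options is easiest to see side by side. The following sketch, with a hypothetical input sentence, runs both styles on the same document. Part-of-speech tags are added first because, as noted in the example above, they can improve lemmatization.

```matlab
% Hypothetical sample text used only for illustration.
documents = tokenizedDocument("The pumps were leaking and making noises");

% Part-of-speech details can improve lemmatization.
documents = addPartOfSpeechDetails(documents);

% Normalize the same document with each style for comparison.
lemmas = normalizeWords(documents, Style="lemma");
stems = normalizeWords(documents, Style="stem");

disp(joinWords(lemmas))
disp(joinWords(stems))
```

Lemmatization maps words to dictionary forms, while stemming trims suffixes and can produce tokens that are not real words.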
Case normalization — Case normalization
None (default) | Uppercase | Lowercase
Minimum word length — Minimum word length
3 (default) | positive integer | off
Minimum word length, specified as one of these options:
off — Do not remove short words.
positive integer — Remove words with fewer than the specified number of characters.
The generated code uses removeShortWords.
Maximum word length — Maximum word length
14 (default) | positive integer | off
Maximum word length, specified as one of these options:
off — Do not remove long words.
positive integer — Remove words with more than the specified number of characters.
The generated code uses removeLongWords.
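The task lengths and the function arguments are offset by one, which is why the generated code in the example above calls removeShortWords with 2 and removeLongWords with 15 for lengths of 3 and 14. The sketch below illustrates that mapping on a hypothetical sentence; the variable names minLength and maxLength are introduced here for illustration only.

```matlab
% Task settings (the defaults): keep words with 3 to 14 characters.
minLength = 3;
maxLength = 14;

% Hypothetical sample text; "extraordinarily" has 15 characters.
documents = tokenizedDocument("an ox pulled the extraordinarily heavy cart");

% removeShortWords removes words with len or fewer characters.
documents = removeShortWords(documents, minLength - 1);

% removeLongWords removes words with len or more characters.
documents = removeLongWords(documents, maxLength + 1);

% Remaining words: pulled, the, heavy, cart
disp(joinWords(documents))
```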
Remove stop words — Option to remove stop words
on (default) | off
Option to remove stop words.
The generated code uses removeStopWords.
Erase punctuation — Option to erase punctuation
on (default) | off
Option to erase punctuation.
The generated code uses erasePunctuation.
Replace words — Source and target words for replacement
pairs of source and target strings
Source and target words for replacement, specified as pairs of source and target strings. To specify multiword phrases (n-grams), use whitespace-separated words.
The generated code uses replaceWords and replaceNgrams.
Remove words — Words to remove
string
Words to remove, specified as strings. To specify multiword phrases (n-grams), use whitespace-separated words.
The generated code uses removeWords and removeNgrams.
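For single words, the Replace words and Remove words options correspond to calls like those in the sketch below. The sentence and the word choices are hypothetical, and the n-gram variants are not shown.

```matlab
% Hypothetical sample text used only for illustration.
documents = tokenizedDocument("the agitator makes a loud noise");

% Replace a source word with a target word.
documents = replaceWords(documents, "agitator", "mixer");

% Remove a custom list of words.
documents = removeWords(documents, ["a" "the"]);

% Remaining words: mixer, makes, loud, noise
disp(joinWords(documents))
```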
Remove empty documents — Option to remove empty documents
off (default) | on
Option to remove empty documents.
The generated code uses removeEmptyDocuments.
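Removing empty documents changes the number of documents, so any data stored alongside the text, such as labels, needs the same rows removed. The following sketch shows one way to keep them aligned using the second output of removeEmptyDocuments. The strings and the labels variable are hypothetical.

```matlab
% Hypothetical text and matching labels used only for illustration.
str = ["pump failed"; "a"; "mixer noisy"];
labels = ["Mechanical"; "Unknown"; "Mechanical"];

documents = tokenizedDocument(str);

% Removing short words leaves the second document empty.
documents = removeShortWords(documents, 2);

% Remove empty documents and drop the corresponding labels.
[documents, idx] = removeEmptyDocuments(documents);
labels(idx) = [];

disp(documents)
disp(labels)
```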
Ignore case — Option to ignore case
off (default) | on
Option to ignore case in word change and removal options.
Show tokenized text — Option to show tokenized text
off (default) | on
Option to show tokenized text.
Show token details — Option to show token details
off (default) | on
Option to show token details.
The generated code uses tokenDetails.
Show word cloud — Option to show word cloud
off (default) | on
Option to show word cloud.
The generated code uses wordcloud.
Tips
By default, the Preprocess Text Data task does not automatically run when you modify the task parameters. To have the task run automatically after any change, select the Autorun checkbox at the top-right of the task. If your data set is large, do not enable this option.
Version History
Introduced in R2023a