Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.

Text Analytics Toolbox supports English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. For more information, see Language Considerations.

Live Editor Tasks

Preprocess Text Data

Preprocess and clean up text data for analysis (Since R2023a)

Functions

Import and Export

`extractFileText`	Read text from PDF, Microsoft Word, HTML, and plain text files
`extractHTMLText`	Extract text from HTML
`readPDFFormData`	Read data from PDF forms
`pdfinfo`	PDF file information (Since R2023a)
`writeTextDocument`	Write documents to text file
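
As a minimal sketch of the import functions listed above, the following code reads text from a plain text file with `extractFileText` and extracts text from an HTML string with `extractHTMLText`. The file name `example_report.txt` and the sample text are made up for illustration; the file is written to the temporary folder first so that the example is self-contained.

```matlab
% Create a small plain text file to read (hypothetical name and contents).
filename = fullfile(tempdir,"example_report.txt");
fid = fopen(filename,"w");
fprintf(fid,"Pump 3 reported a pressure drop.\nOperator restarted the unit.");
fclose(fid);

% Read the file contents into a string.
str = extractFileText(filename);

% Extract the visible text from a string of HTML code (illustrative markup).
code = "<html><body><h1>Status</h1><p>All systems <b>normal</b>.</p></body></html>";
txt = extractHTMLText(code);
```

The same `extractFileText` call also accepts PDF, Microsoft Word, and HTML files, so the rest of a workflow does not need to change when the source format does.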

HTML Parsing

`htmlTree`	Parsed HTML tree
`findElement`	Find elements in HTML tree
`getAttribute`	Read HTML attribute of root node of HTML tree
`ismissing`	Find HTML trees without values
`string`	Convert parsed HTML tree to string
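
The HTML parsing functions above can be combined roughly as follows. This is a sketch under the assumption that the HTML is available as a string; the markup and the URL are illustrative only.

```matlab
% Illustrative HTML code. Doubled quotes ("") escape a quote inside a string.
code = "<html><body>" + ...
    "<p>See the <a href=""https://www.mathworks.com"">MathWorks site</a>.</p>" + ...
    "<p>No link here.</p></body></html>";

% Parse the HTML code into a tree.
tree = htmlTree(code);

% Find elements using CSS selectors.
paragraphs = findElement(tree,"p");
links = findElement(tree,"a");

% Read the href attribute and the text content of the link element.
target = getAttribute(links,"href");
linkText = extractHTMLText(links);
```

Selecting elements first and extracting text afterward keeps navigation menus and other page furniture out of the text that reaches the analysis.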

Document Preprocessing

`tokenizedDocument`	Array of tokenized documents for text analysis
`erasePunctuation`	Erase punctuation from text and documents
`eraseTags`	Erase HTML and XML tags from text
`eraseURLs`	Erase HTTP and HTTPS URLs from text
`removeStopWords`	Remove stop words from documents
`removeShortWords`	Remove short words from documents or bag-of-words model
`removeLongWords`	Remove long words from documents or bag-of-words model
`removeWords`	Remove selected words from documents or bag-of-words model
`normalizeWords`	Stem or lemmatize words
`replaceWords`	Replace words in documents
`replaceNgrams`	Replace n-grams in documents
`splitSentences`	Split text into sentences
`splitParagraphs`	Split text into paragraphs (Since R2023a)
`stopWords`	List of stop words
`decodeHTMLEntities`	Convert HTML and XML entities into characters
`lower`	Convert documents to lowercase
`upper`	Convert documents to uppercase
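
One possible way to chain the preprocessing functions above is sketched below. The sample sentences and the length thresholds (2 and 15 characters) are assumptions chosen for illustration, not recommended settings.

```matlab
% Illustrative raw text with markup, a URL, and punctuation.
textData = [
    "The pump <b>failed</b> twice! See https://example.com/log for details."
    "Operators are restarting the mixers and checking the valves."];

% Clean the raw strings before tokenizing.
textData = eraseTags(textData);
textData = eraseURLs(textData);
textData = lower(textData);

% Tokenize, then clean the tokens.
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);          % part-of-speech tags help lemmatization
documents = removeStopWords(documents);
documents = normalizeWords(documents,"Style","lemma");
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);              % remove words with 2 or fewer characters
documents = removeLongWords(documents,15);              % remove words with 15 or more characters
```

Stop words are removed before lemmatization here, and part-of-speech details are added before both, because stop word removal changes the sentence context that the tagger relies on.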

Token Details

`context`	Search documents for word or n-gram occurrences in context
`tokenDetails`	Details of tokens in tokenized document array
`addSentenceDetails`	Add sentence numbers to documents
`addPartOfSpeechDetails`	Add part-of-speech tags to documents
`addLemmaDetails`	Add lemma forms of tokens to documents
`addLanguageDetails`	Add language identifiers to documents
`addEntityDetails`	Add entity tags to documents
`addDependencyDetails`	Add grammatical dependency details to documents (Since R2022b)
`addTypeDetails`	Add token type details to documents
`splitSentences`	Split text into sentences
`splitParagraphs`	Split text into paragraphs (Since R2023a)
`corpusLanguage`	Detect language of text
`abbreviations`	Table of common abbreviations
`topLevelDomains`	List of top-level domains
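
As a brief sketch of how the detail functions above fit together, the following code adds sentence numbers and part-of-speech tags to a tokenized document and then views the result as a table. The example sentence is made up.

```matlab
% Tokenize an illustrative two-sentence string.
documents = tokenizedDocument("The quick brown fox jumps. It lands near Boston.");

% Add sentence numbers and part-of-speech tags to the token details.
documents = addSentenceDetails(documents);
documents = addPartOfSpeechDetails(documents);

% View the details as a table with one row per token.
tdetails = tokenDetails(documents);
head(tdetails)
```

Each `add...Details` function adds a variable to this table, so later steps can filter tokens by sentence, type, or part of speech with ordinary table indexing.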

Word and N-Gram Counting

`bagOfWords`	Bag-of-words model
`bagOfNgrams`	Bag-of-n-grams model
`addDocument`	Add documents to bag-of-words or bag-of-n-grams model
`removeDocument`	Remove documents from bag-of-words or bag-of-n-grams model
`removeInfrequentWords`	Remove words with low counts from bag-of-words model
`removeInfrequentNgrams`	Remove infrequently seen n-grams from bag-of-n-grams model
`removeNgrams`	Remove n-grams from bag-of-n-grams model
`removeEmptyDocuments`	Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
`topkwords`	Most important words in bag-of-words model or LDA topic
`topkngrams`	Most frequent n-grams
`encode`	Encode documents as matrix of word or n-gram counts
`tfidf`	Term Frequency–Inverse Document Frequency (tf-idf) matrix
`join`	Combine multiple bag-of-words or bag-of-n-grams models
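
The counting functions above might be used along these lines. This is a minimal sketch on a three-document toy corpus; the text and the frequency threshold are assumptions for illustration.

```matlab
% Illustrative toy corpus.
documents = tokenizedDocument([
    "the pump failed and the pump restarted"
    "the valve failed"
    "operators checked the valve"]);

% Count words, then drop words that appear at most once in total.
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,1);

% List the most frequent words and compute a tf-idf matrix.
top = topkwords(bag,3);
M = tfidf(bag);

% Count bigrams (the default n-gram length) and list the most frequent ones.
bagN = bagOfNgrams(documents);
topN = topkngrams(bagN,3);
```

The tf-idf matrix has one row per document and one column per word in the model, so it can be passed to functions that expect a numeric feature matrix.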

Spelling Correction and Edit Distance

`correctSpelling`	Correct spelling of words
`editDistance`	Find edit distance between two strings or documents
`editDistanceSearcher`	Edit distance nearest neighbor searcher
`knnsearch`	Find nearest neighbors by edit distance
`rangesearch`	Find nearest neighbors by edit distance range
`splitGraphemes`	Split string into graphemes
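
As a small sketch of the edit distance functions above, the following code computes the distance between two words and then looks up the closest match for a misspelled word in a vocabulary. The vocabulary, the misspelling, and the maximum distance of 2 are illustrative assumptions.

```matlab
% Edit distance between two strings (Levenshtein distance by default).
d = editDistance("kitten","sitting");

% Build a searcher over a small vocabulary with a maximum edit distance of 2.
vocabulary = ["MathWorks" "MATLAB" "Simulink" "analytics"];
eds = editDistanceSearcher(vocabulary,2);

% Find the index of the nearest vocabulary word to a misspelled word.
idx = knnsearch(eds,"MATLAP");
nearest = vocabulary(idx);
```

Here `d` is 3 (two substitutions and one insertion), and `nearest` is "MATLAB", which is one substitution away from the query.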

Document Manipulation and Conversion

`docfun`	Apply function to words in documents
`containsWords`	Check if word is member of documents (Since R2022b)
`containsNgrams`	Check if n-gram is member of documents (Since R2022a)
`contains`	Check if pattern is substring in documents (Since R2022b)
`plus`	Append documents
`replace`	Replace substrings in documents
`regexprep`	Replace text in words of documents using regular expression
`doclength`	Length of documents in document array
`doc2cell`	Convert documents to cell array of string vectors
`joinWords`	Convert documents to string by joining words
`string`	Convert scalar document to string vector
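
A few of the manipulation and conversion functions above are sketched here on two made-up documents.

```matlab
% Illustrative documents.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Number of tokens in each document.
n = doclength(documents);

% Apply a function to every word, here reversing the characters of each word.
reversed = docfun(@reverse,documents);

% Convert the documents back to strings by joining the words with spaces.
str = joinWords(documents);
reversedStr = joinWords(reversed);
```

`docfun` keeps the document boundaries intact, so `reversed` still contains two documents with the same token counts as the input.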

Unicode

`textanalytics.unicode.nfc`	Unicode composed normalized form (NFC) (Since R2022b)
`textanalytics.unicode.nfd`	Unicode decomposed normalized form (NFD) (Since R2021a)
`textanalytics.unicode.nfkc`	Unicode compatibility composed normalized form (NFKC) (Since R2022b)
`textanalytics.unicode.nfkd`	Unicode compatibility decomposed normalized form (NFKD) (Since R2022b)
`textanalytics.unicode.UTF32`	Unicode UTF-32 string representation (Since R2021a)
`characterCategories`	Unicode character categories (Since R2021a)
`hex`	Convert UTF-32 representation to hexadecimal values (Since R2021a)
`string`	Convert UTF-32 representation to string (Since R2021a)
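
The effect of the normalization functions above can be seen on a single accented character. This is a sketch that assumes a release with both `textanalytics.unicode.nfd` and `textanalytics.unicode.nfc` available (R2022b or later, according to the list above).

```matlab
% "é" as a single precomposed code point (U+00E9).
composed = string(char(233));

% NFD splits it into "e" followed by a combining acute accent (U+0301).
decomposed = textanalytics.unicode.nfd(composed);

% NFC recombines the pair into the single precomposed character.
recomposed = textanalytics.unicode.nfc(decomposed);

% Compare lengths: the decomposed form is one code unit longer.
lengths = [strlength(composed) strlength(decomposed) strlength(recomposed)];
```

Normalizing all text to one form before tokenizing avoids treating visually identical words as different tokens.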

Topics

Import

Extract Text Data from Files
This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
Parse HTML and Extract Text Content
This example shows how to parse HTML code and extract the text content from particular elements.
Data Sets for Text Analytics
Discover data sets for various text analytics tasks.

Preprocessing

Preprocess Text Data in Live Editor
Explore text preprocessing techniques using the Preprocess Text Data Live Editor task.
Prepare Text Data for Analysis
This example shows how to create a function that cleans and preprocesses text data for analysis.
Analyze Text Data Containing Emojis
This example shows how to analyze text data containing emojis.
Correct Spelling in Documents
This example shows how to correct spelling in documents using Hunspell.
Create Extension Dictionary for Spelling Correction
This example shows how to create a Hunspell extension dictionary for spelling correction.
Create Custom Spelling Correction Function Using Edit Distance Searchers
This example shows how to correct spelling using edit distance searchers and a vocabulary of known words.
Analyze Sentence Structure Using Grammatical Dependency Parsing
This example shows how to extract information from a sentence using grammatical dependency parsing.

Language Support

Language Considerations
Information on using Text Analytics Toolbox features with text in other languages.
Japanese Language Support
Information on Japanese support in Text Analytics Toolbox.
Analyze Japanese Text Data
This example shows how to import, prepare, and analyze Japanese text data using a topic model.
German Language Support
Information on German support in Text Analytics Toolbox.
Analyze German Text Data
This example shows how to import, prepare, and analyze German text data using a topic model.

Featured Examples

Extract Text Data from Files

Extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Prepare Text Data for Analysis

Create a function that cleans and preprocesses text data for analysis.

Analyze Text Data Containing Emojis

Analyze text data containing emojis.
