
Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.

Text Analytics Toolbox provides language-specific support for English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text in other languages. For more information, see Language Considerations.
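As a rough illustration of the workflow described above, the following sketch tokenizes two short strings, applies a few preprocessing steps, and converts the result to numeric form. The sample sentences are invented for illustration, and the choice and order of preprocessing steps is only one reasonable option, not a prescribed recipe.

```matlab
% Made-up sample text standing in for, e.g., operator reports (assumption).
textData = [
    "The pump failed after the filter clogged."
    "Operator reports the filter is clogged again."];

% Tokenize, then apply a few common cleanup steps.
documents = tokenizedDocument(textData);
documents = lower(documents);             % convert words to lowercase
documents = erasePunctuation(documents);  % drop punctuation
documents = removeStopWords(documents);   % drop words such as "the"

% Convert the documents to numeric representations.
bag = bagOfWords(documents);   % word counts per document
M = tfidf(bag);                % tf-idf matrix, one row per document
```

Here `bag.Vocabulary` lists the words that survived preprocessing, and `M` has one row per document and one column per vocabulary word.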



extractFileText - Read text from PDF, Microsoft Word, HTML, and plain text files
extractHTMLText - Extract text from HTML
readPDFFormData - Read data from PDF forms
writeTextDocument - Write documents to text file
htmlTree - Parsed HTML tree
findElement - Find elements in HTML tree
getAttribute - Read HTML attribute of root node of HTML tree
ismissing - Find HTML trees without values
string - Convert parsed HTML tree to string
tokenizedDocument - Array of tokenized documents for text analysis
erasePunctuation - Erase punctuation from text and documents
eraseTags - Erase HTML and XML tags from text
eraseURLs - Erase HTTP and HTTPS URLs from text
removeStopWords - Remove stop words from documents
removeShortWords - Remove short words from documents or bag-of-words model
removeLongWords - Remove long words from documents or bag-of-words model
removeWords - Remove selected words from documents or bag-of-words model
normalizeWords - Stem or lemmatize words
replaceWords - Replace words in documents
replaceNgrams - Replace n-grams in documents
stopWords - List of stop words
decodeHTMLEntities - Convert HTML and XML entities into characters
lower - Convert documents to lowercase
upper - Convert documents to uppercase
context - Search documents for word or n-gram occurrences in context
tokenDetails - Details of tokens in tokenized document array
addSentenceDetails - Add sentence numbers to documents
addPartOfSpeechDetails - Add part-of-speech tags to documents
addLemmaDetails - Add lemma forms of tokens to documents
addLanguageDetails - Add language identifiers to documents
addEntityDetails - Add entity tags to documents
addTypeDetails - Add token type details to documents
splitSentences - Split text into sentences
corpusLanguage - Detect language of text
abbreviations - Table of common abbreviations
topLevelDomains - List of top-level domains
bagOfWords - Bag-of-words model
bagOfNgrams - Bag-of-n-grams model
addDocument - Add documents to bag-of-words or bag-of-n-grams model
removeDocument - Remove documents from bag-of-words or bag-of-n-grams model
removeInfrequentWords - Remove words with low counts from bag-of-words model
removeInfrequentNgrams - Remove infrequently seen n-grams from bag-of-n-grams model
removeNgrams - Remove n-grams from bag-of-n-grams model
removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
topkwords - Most important words in bag-of-words model or LDA topic
topkngrams - Most frequent n-grams
encode - Encode documents as matrix of word or n-gram counts
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
join - Combine multiple bag-of-words or bag-of-n-grams models
correctSpelling - Correct spelling of words
editDistance - Find edit distance between two strings or documents
editDistanceSearcher - Edit distance nearest neighbor searcher
knnsearch - Find nearest neighbors by edit distance
rangesearch - Find nearest neighbors by edit distance range
splitGraphemes - Split string into graphemes
docfun - Apply function to words in documents
plus - Append documents
replace - Replace substrings in documents
regexprep - Replace text in words of documents using regular expression
doclength - Length of documents in document array
doc2cell - Convert documents to cell array of string vectors
joinWords - Convert documents to string by joining words
string - Convert scalar document to string vector
textanalytics.unicode.nfd - Unicode decomposed normalized form (NFD)
UTF32 - Unicode UTF-32 string representation
characterCategories - Unicode character categories
hex - Convert UTF-32 representation to hexadecimal values
string - Convert UTF-32 representation to string



Extract Text Data from Files

This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Parse HTML and Extract Text Content

This example shows how to parse HTML code and extract the text content from particular elements.
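A minimal sketch of that idea, using a small hand-written HTML string in place of a real page (the markup and the `"p"` selector are assumptions chosen for illustration):

```matlab
% Hand-written HTML used only for illustration (assumption).
code = "<html><body><h1>Title</h1>" + ...
    "<p>First paragraph.</p><p>Second paragraph.</p></body></html>";

% Parse the HTML code into a tree.
tree = htmlTree(code);

% Select only the <p> elements using a CSS selector.
paragraphs = findElement(tree,"p");

% Extract the text content of each selected element.
str = extractHTMLText(paragraphs);
```

Selecting elements before extracting text keeps headings, navigation, and other page content out of the result.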

Data Sets for Text Analytics

Discover data sets for various text analytics tasks.


Prepare Text Data for Analysis

This example shows how to create a function that cleans and preprocesses text data for analysis.

Analyze Text Data Containing Emojis

This example shows how to analyze text data containing emojis.

Correct Spelling in Documents

This example shows how to correct spelling in documents using Hunspell.

Create Extension Dictionary for Spelling Correction

This example shows how to create a Hunspell extension dictionary for spelling correction.

Create Custom Spelling Correction Function Using Edit Distance Searchers

This example shows how to correct spelling using edit distance searchers and a vocabulary of known words.
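The core of that approach can be sketched as follows: build a searcher over a list of known words, then look up the nearest known word for a misspelling. The vocabulary, the misspelled word, and the maximum distance of 2 are all invented for illustration.

```matlab
% Small made-up vocabulary of known words (assumption).
vocabulary = ["pump" "filter" "valve" "sensor"];

% Searcher that considers known words within edit distance 2.
maxDist = 2;
eds = editDistanceSearcher(vocabulary,maxDist);

% Find the nearest known word to a misspelled word.
word = "filtr";
[idx,d] = knnsearch(eds,word);

% knnsearch returns NaN when no known word is within maxDist.
if ~isnan(idx)
    corrected = vocabulary(idx);
else
    corrected = word;   % leave the word unchanged
end
```

A small maximum distance keeps the search fast and reduces the chance of replacing a valid but unknown word with an unrelated one.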

Language Support

Language Considerations

Information on using Text Analytics Toolbox features for other languages.

Japanese Language Support

Information on Japanese support in Text Analytics Toolbox.

Analyze Japanese Text Data

This example shows how to import, prepare, and analyze Japanese text data using a topic model.

German Language Support

Information on German support in Text Analytics Toolbox.

Analyze German Text Data

This example shows how to import, prepare, and analyze German text data using a topic model.
