Extract Text Data from Files
This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts the text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.
Text File
Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.
filename = "sonnets.txt";
str = extractFileText(filename);
View the first sonnet by extracting the text between the two titles "I" and "II".
start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = " From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee. "
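The newline in the start pattern ties the match to a title that sits on its own line, so " I" cannot match the beginning of " II". The following is a minimal sketch of the same technique on a small in-memory string. The sample text is made up for illustration, and the call to strip is an optional extra step that is not part of the example above.

```matlab
% Hypothetical sample text: two titled sections, one title per line.
str = " I" + newline + "First section." + newline + ...
    " II" + newline + "Second section.";

% Including newline in the start pattern restricts the match to the
% title " I" on its own line, so it cannot match the start of " II".
start = " I" + newline;
fin = " II";
section1 = extractBetween(str,start,fin);

% Remove leading and trailing whitespace, including newline characters.
section1 = strip(section1)
```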
For text files containing multiple documents separated by newline characters, use the readlines function.
filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
"From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
"When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
"Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."
Microsoft Word Document
Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.
filename = "exampleSonnets.docx";
str = extractFileText(filename);
View the second sonnet by extracting the text between the two titles "II" and "III".
start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = " When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold. "
The example Microsoft Word document uses two newline characters between each line. To replace these characters with a single newline character, use the replace function.
sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = " When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold. "
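The effect of this replacement is easier to see on a short string. The sketch below uses made-up sample text. The regexprep call is an assumption about what you might need when a document contains runs of three or more newline characters. It is not part of the example above.

```matlab
% Hypothetical sample text with doubled newline characters between lines.
txt = "line one" + newline + newline + "line two" + ...
    newline + newline + "line three";

% replace substitutes each non-overlapping pair of newline characters
% with a single newline character.
txtSingle = replace(txt,[newline newline],newline);

% Assumption: if the text can contain runs of three or more newline
% characters, a regular expression collapses any run to one newline.
txtCollapsed = regexprep(txt,"\n+","\n");
```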
PDF Files
Extract text from PDF documents and data from PDF forms.
PDF Document
Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.
filename = "exampleSonnets.pdf";
str = extractFileText(filename);
View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.
start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = " Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee. "
PDF Form
To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.
filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
event_type: "Thunderstorm Wind"
event_narrative: "Large tree down between Plantersville and Nettleton."
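Once the form data is in a struct, you can pull out individual fields or gather all of them into a string array. The sketch below builds a stand-in struct with the field values shown above so that it runs without the PDF file. It assumes that every form field holds text.

```matlab
% Stand-in for the struct returned by readPDFFormData, using the field
% values shown above, so that this sketch runs without the PDF file.
data = struct('event_type',"Thunderstorm Wind", ...
    'event_narrative',"Large tree down between Plantersville and Nettleton.");

% Access a single form field by name.
narrative = data.event_narrative;

% Collect every field value into a string array for text analysis.
% This assumes that all form fields hold text.
names = fieldnames(data);
values = strings(numel(names),1);
for i = 1:numel(names)
    values(i) = data.(names{i});
end
values
```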
HTML
Extract text from HTML files, HTML code, and the web.
HTML File
To extract text data from a saved HTML file, use extractFileText.
filename = "exampleSonnets.html";
str = extractFileText(filename);
View the fourth sonnet by extracting the text between the two titles "IV" and "V".
start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = " Unthrifty loveliness, why dost thou spend Upon thy self thy beauty's legacy? Nature's bequest gives nothing, but doth lend, And being frank she lends to those are free: Then, beauteous niggard, why dost thou abuse The bounteous largess given thee to give? Profitless usurer, why dost thou use So great a sum of sums, yet canst not live? For having traffic with thy self alone, Thou of thy self thy sweet self dost deceive: Then how when nature calls thee to be gone, What acceptable audit canst thou leave? Thy unused beauty must be tombed with thee, Which, used, lives th' executor to be. "
HTML Code
To extract text data from a string containing HTML code, use extractHTMLText.
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = "THE SONNETS by William Shakespeare"
From the Web
To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 'Text Analytics Toolbox Analyze and model text data Release Notes PDF Documentation Release Notes PDF Documentation Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling. Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models. Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data. Get Started Learn the basics of Text Analytics Toolbox Text Data Preparation Import text data into MATLAB® and preprocess it for analysis Modeling and Prediction Develop predictive models using topic models and word embeddings Display and Presentation Visualize text data and models using word clouds and text scatter plots Language Support Information on language support in Text Analytics Toolbox'
Parse HTML Code
To find particular elements of HTML code, parse the code using htmlTree and use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with element name "A".
tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);
View the first 10 subtrees and extract the text using extractHTMLText.
subtrees(1:10)
ans = 10×1 htmlTree:
    <A class="skip_link sr-only" href="#skip_link_anchor">Skip to content</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
str = extractHTMLText(subtrees);
View the extracted text of the first 10 hyperlinks.
str(1:10)
ans = 10×1 string
"Skip to content"
""
"Products"
"Solutions"
"Academia"
"Support"
"Community"
"Events"
"Get MATLAB"
""
To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.
attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
"#skip_link_anchor"
"https://www.mathworks.com?s_tid=gn_logo"
"https://www.mathworks.com/products.html?s_tid=gn_ps"
"https://www.mathworks.com/solutions.html?s_tid=gn_sol"
"https://www.mathworks.com/academia.html?s_tid=gn_acad"
"https://www.mathworks.com/support.html?s_tid=gn_supp"
"https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
"https://www.mathworks.com/company/events.html?s_tid=gn_ev"
"https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
"https://www.mathworks.com?s_tid=gn_logo"
CSV and Microsoft Excel Files
To extract text data from CSV and Microsoft Excel files, use readtable and extract the text data from the table that it returns.
Extract the table data from factoryReports.csv using the readtable function and view the first few rows of the table.
T = readtable('factoryReports.csv','TextType','string');
head(T)
                                 Description                                       Category          Urgency          Resolution         Cost
    _____________________________________________________________________    ____________________    ________    ____________________    _____
    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38
Extract the text data from the Description column and view the first few strings.
str = T.Description;
str(1:10)
ans = 10×1 string
"Items are occasionally getting stuck in the scanner spools."
"Loud rattling and banging sounds are coming from assembler pistons."
"There are cuts to the power when starting the plant."
"Fried capacitors in the assembler."
"Mixer tripped the fuses."
"Burst pipe in the constructing agent is spraying coolant."
"A fuse is blown in the mixer."
"Things continue to tumble off of the belt."
"Falling items from the conveyor belt."
"The scanner reel is split, it will soon begin to curve."
Extract Text from Multiple Files
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to match all files named this way. To use extractFileText as the read function, pass it to fileDatastore as a function handle.
location = "exampleSonnet*.txt";
fds = fileDatastore(location,'ReadFcn',@extractFileText);
Loop over the files in the datastore and read each text file.
str = [];
while hasdata(fds)
    textData = read(fds);
    str = [str; textData];
end
View the extracted text.
str
str = 4×1 string
" From fairest creatures we desire increase,↵ That thereby beauty's rose might never die,↵ But as the riper should by time decease,↵ His tender heir might bear his memory:↵ But thou, contracted to thine own bright eyes,↵ Feed'st thy light's flame with self-substantial fuel,↵ Making a famine where abundance lies,↵ Thy self thy foe, to thy sweet self too cruel:↵ Thou that art now the world's fresh ornament,↵ And only herald to the gaudy spring,↵ Within thine own bud buriest thy content,↵ And tender churl mak'st waste in niggarding:↵ Pity the world, or else this glutton be,↵ To eat the world's due, by the grave and thee."
" When forty winters shall besiege thy brow,↵ And dig deep trenches in thy beauty's field,↵ Thy youth's proud livery so gazed on now,↵ Will be a tatter'd weed of small worth held:↵ Then being asked, where all thy beauty lies,↵ Where all the treasure of thy lusty days;↵ To say, within thine own deep sunken eyes,↵ Were an all-eating shame, and thriftless praise.↵ How much more praise deserv'd thy beauty's use,↵ If thou couldst answer 'This fair child of mine↵ Shall sum my count, and make my old excuse,'↵ Proving his beauty by succession thine!↵ This were to be new made when thou art old,↵ And see thy blood warm when thou feel'st it cold."
" Look in thy glass and tell the face thou viewest↵ Now is the time that face should form another;↵ Whose fresh repair if now thou not renewest,↵ Thou dost beguile the world, unbless some mother.↵ For where is she so fair whose unear'd womb↵ Disdains the tillage of thy husbandry?↵ Or who is he so fond will be the tomb,↵ Of his self-love to stop posterity?↵ Thou art thy mother's glass and she in thee↵ Calls back the lovely April of her prime;↵ So thou through windows of thine age shalt see,↵ Despite of wrinkles this thy golden time.↵ But if thou live, remember'd not to be,↵ Die single and thine image dies with thee."
" Unthrifty loveliness, why dost thou spend↵ Upon thy self thy beauty's legacy?↵ Nature's bequest gives nothing, but doth lend,↵ And being frank she lends to those are free:↵ Then, beauteous niggard, why dost thou abuse↵ The bounteous largess given thee to give?↵ Profitless usurer, why dost thou use↵ So great a sum of sums, yet canst not live?↵ For having traffic with thy self alone,↵ Thou of thy self thy sweet self dost deceive:↵ Then how when nature calls thee to be gone,↵ What acceptable audit canst thou leave?↵ Thy unused beauty must be tombed with thee,↵ Which, used, lives th' executor to be."
See Also
pdfinfo | extractFileText | readPDFFormData | extractHTMLText | tokenizedDocument