extractFileText
Read text from PDF, Microsoft Word, HTML, and plain text files
Description
str = extractFileText(filename) reads the text from the file filename and returns it as a string.
str = extractFileText(filename,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Extract Text Data from Text File
Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.
str = extractFileText("sonnets.txt");
View the first sonnet.
i = strfind(str,"I");
ii = strfind(str,"II");
start = i(1);
fin = ii(1);
extractBetween(str,start,fin-1)
ans =
    "I
     From fairest creatures we desire increase,
     That thereby beauty's rose might never die,
     But as the riper should by time decease,
     His tender heir might bear his memory:
     But thou, contracted to thine own bright eyes,
     Feed'st thy light's flame with self-substantial fuel,
     Making a famine where abundance lies,
     Thy self thy foe, to thy sweet self too cruel:
     Thou that art now the world's fresh ornament,
     And only herald to the gaudy spring,
     Within thine own bud buriest thy content,
     And tender churl mak'st waste in niggarding:
     Pity the world, or else this glutton be,
     To eat the world's due, by the grave and thee.
     "
Extract Text Data from PDF
Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF file.
str = extractFileText("exampleSonnets.pdf");
View the second sonnet.
ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
extractBetween(str,start,fin-1)
ans =
    "II
     When forty winters shall besiege thy brow,
     And dig deep trenches in thy beauty's field,
     Thy youth's proud livery so gazed on now,
     Will be a tatter'd weed of small worth held:
     Then being asked, where all thy beauty lies,
     Where all the treasure of thy lusty days;
     To say, within thine own deep sunken eyes,
     Were an all-eating shame, and thriftless praise.
     How much more praise deserv'd thy beauty's use,
     If thou couldst answer 'This fair child of mine
     Shall sum my count, and make my old excuse,'
     Proving his beauty by succession thine!
     This were to be new made when thou art old,
     And see thy blood warm when thou feel'st it cold.
     "
Extract the text from pages 3, 5, and 7 of the PDF file.
pages = [3 5 7];
str = extractFileText("exampleSonnets.pdf", ...
    'Pages',pages);
View the text between the first occurrences of "X" and "XI". Because strfind matches substrings, the first "X" it finds is the one in the heading "IX", so the output starts with sonnet IX and continues into the extracted part of sonnet X.
x = strfind(str,"X");
xi = strfind(str,"XI");
start = x(1);
fin = xi(1);
extractBetween(str,start,fin-1)
ans =
    "X
     Is it for fear to wet a widow's eye,
     That thou consum'st thy self in single life?
     Ah! if thou issueless shalt hap to die,
     The world will wail thee like a makeless wife;
     The world will be thy widow and still weep
     That thou no form of thee hast left behind,
     When every private widow well may keep
     By children's eyes, her husband's shape in mind:
     Look! what an unthrift in the world doth spend
     Shifts but his place, for still the world enjoys it;
     But beauty's waste hath in the world an end,
     And kept unused the user so destroys it.
     No love toward others in that bosom sits
     That on himself such murd'rous shame commits.
     X
     For shame! deny that thou bear'st love to any,
     Who for thy self art so unprovident.
     Grant, if thou wilt, thou art belov'd of many,
     But that thou none lov'st is most evident:
     For thou art so possess'd with murderous hate,
     That 'gainst thy self thou stick'st not to conspire,
     Seeking that beauteous roof to ruinate
     Which to repair should be thy chief desire.
     "
Import Text from Multiple Files Using a File Datastore
If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.
Create a file datastore for the example sonnet text files. The example sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.
readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);
Create an empty bag-of-words model.
bag = bagOfWords
bag =
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.
while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end
View the updated bag-of-words model.
bag
bag =
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4
Extract Text from HTML
To extract text data directly from HTML code, use extractHTMLText and specify the HTML code as a string.
code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = "THE SONNETS by William Shakespeare"
Input Arguments
filename — Name of file
string scalar | character vector | 1-by-1 cell array containing a character vector
Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.
Data Types: string | char | cell
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Pages',[1 3 5] reads pages 1, 3, and 5 from a PDF file.
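As a minimal sketch of the two spellings described above, the following calls pass the same Pages value in both styles. It assumes exampleSonnets.pdf, the file used in the examples on this page, is on the MATLAB path; the variable names strNew and strOld are illustrative only.

```matlab
% Name=Value syntax (R2021a and later).
strNew = extractFileText("exampleSonnets.pdf",Pages=[1 3 5]);

% Comma-separated syntax with the name in quotes (works in earlier releases too).
strOld = extractFileText("exampleSonnets.pdf",'Pages',[1 3 5]);

% Both calls request the same pages, so the extracted text should match.
tf = isequal(strNew,strOld)
```

The two forms differ only in how the arguments are written, so pick whichever matches the oldest release your code must run on.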
Encoding — Character encoding
'auto' (default) | 'UTF-8' | 'ISO-8859-1' | 'windows-1251' | 'windows-1252' | ...
Character encoding to use, specified as the comma-separated pair consisting of 'Encoding' and a character vector or a string scalar. The character vector or string scalar must contain a standard character encoding scheme name, such as 'UTF-8', 'ISO-8859-1', 'windows-1251', or 'windows-1252'.
If you do not specify an encoding scheme, then the function performs heuristic auto-detection for the encoding to use. The heuristics depend on your locale. If these heuristics fail, then you must specify the encoding explicitly.
This option only applies when the input is a plain text file.
Data Types: char | string
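The sketch below illustrates specifying the encoding explicitly instead of relying on auto-detection. It writes a small temporary file in ISO-8859-1 and reads it back; the temporary file name and the sample word are assumptions made for illustration and are not part of the shipped examples.

```matlab
% Write a short ISO-8859-1 encoded text file to a temporary location.
fname = [tempname '.txt'];
fid = fopen(fname,'w','n','ISO-8859-1');
fprintf(fid,'%s',['caf' char(233)]);   % char(233) is the letter e with an acute accent
fclose(fid);

% Read the file back, naming the encoding rather than using 'auto'.
str = extractFileText(fname,'Encoding','ISO-8859-1');

% Clean up the temporary file.
delete(fname)
```

If the accented character comes back garbled when Encoding is omitted, that is a sign the locale-dependent heuristics picked a different encoding.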
ExtractionMethod — Extraction method
'tree' (default) | 'article' | 'all-text'
Extraction method, specified as the comma-separated pair consisting of 'ExtractionMethod' and one of the following:
Option | Description |
---|---|
'tree' | Analyze the DOM tree and text contents, then extract a block of paragraphs. |
'article' | Detect article text and extract a block of paragraphs. |
'all-text' | Extract all text in the HTML body, except for scripts and CSS styles. |
This option supports HTML file input only.
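To compare extraction methods on an HTML file, a sketch along these lines may help. It saves the HTML snippet from the earlier example to a temporary .html file (the file name is an assumption for illustration) and reads it with the default 'tree' method and with 'all-text'.

```matlab
% Save a small HTML document to a temporary file.
fname = [tempname '.html'];
html = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
fid = fopen(fname,'w');
fprintf(fid,'%s',html);
fclose(fid);

% Default method: analyze the DOM tree and extract a block of paragraphs.
strTree = extractFileText(fname);

% Extract all text in the HTML body, except scripts and CSS styles.
strAll = extractFileText(fname,'ExtractionMethod','all-text');

% Clean up the temporary file.
delete(fname)
```

On a page this small the results may be similar; the difference is more likely to show on real web pages that mix article text with navigation and other markup.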
Password — Password to open PDF file
character vector | string scalar
Password to open the PDF file, specified as the comma-separated pair consisting of 'Password' and a character vector or a string scalar. This option only applies if the input file is a PDF.
Example: 'Password','skroWhtaM'
Data Types: char | string
Pages — Pages to read from PDF file
vector of positive integers
Pages to read from PDF file, specified as the comma-separated pair consisting of 'Pages' and a vector of positive integers. This option only applies if the input file is a PDF file. The function, by default, reads all pages from the PDF file.
Example: 'Pages',[1 3 5]
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
Tips
To read text directly from HTML code, use extractHTMLText.
To read text separated by lines in a text file, use readlines.
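The difference between the two functions for plain text can be sketched as follows. The temporary file and its three lines are made up for illustration; readlines requires R2020b or later.

```matlab
% Create a three-line text file in a temporary location.
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'first line\nsecond line\nthird line');
fclose(fid);

% extractFileText returns the whole file as a single string scalar.
str = extractFileText(fname);

% readlines returns a string array with one element per line.
lines = readlines(fname);

% Clean up the temporary file.
delete(fname)
```

Use readlines when each line is its own record, and extractFileText when the file should be treated as one document.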
Version History
Introduced in R2017b
R2020b: extractFileText no longer supports extracting text from Microsoft Word 97–2003 binary DOC files
Support for extracting text from Microsoft® Word 97–2003 binary DOC files using the extractFileText function has been removed. Microsoft Word DOCX files will continue to be supported.
To extract text data from Microsoft Word 97–2003 binary DOC files, first save the file as a PDF, Microsoft Word DOCX, HTML, or plain text file, then use the extractFileText function.