Add sentence numbers to documents
Use addSentenceDetails to add sentence information to documents.
The function supports English, Japanese, German, and Korean text.
Use addSentenceDetails before using preprocessing functions such as removeStopWords, because addSentenceDetails uses information that these functions remove.
Add Sentence Details to Documents
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)

ans=8×6 table
      Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    __________    ______________    ______________    __________    ___________    ________
    "This"              1                 1               1         letters           en
    "is"                1                 1               1         letters           en
    "an"                1                 1               1         letters           en
    "example"           1                 1               1         letters           en
    "document"          1                 1               1         letters           en
    "."                 1                 1               1         punctuation       en
    "It"                1                 2               1         letters           en
    "has"               1                 2               1         letters           en
View the token details of the second sentence of the third document.
idx = tdetails.DocumentNumber == 3 & ...
    tdetails.SentenceNumber == 2;
tdetails(idx,:)

ans=6×6 table
       Token       DocumentNumber    SentenceNumber    LineNumber       Type        Language
    ___________    ______________    ______________    __________    ___________    ________
    "It"                 3                 2               1         letters           en
    "also"               3                 2               1         letters           en
    "has"                3                 2               1         letters           en
    "two"                3                 2               1         letters           en
    "sentences"          3                 2               1         letters           en
    "."                  3                 2               1         punctuation       en
documents — Input documents
Input documents, specified as a tokenizedDocument array.
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Abbreviations',["cm" "mm" "in"] specifies to detect sentence boundaries where these abbreviations are followed by a period and a capitalized sentence starter.
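As a minimal sketch of the name-value syntax described above, the following passes a custom abbreviation list as a string array. The input text is made up for illustration, and no particular sentence split is claimed here; inspect the SentenceNumber column to see what the function detects in your release.

```matlab
% Illustrative text containing unit abbreviations followed by periods.
str = "The bolt is 10 mm. The plate is 4 cm wide.";
documents = tokenizedDocument(str);

% Pass a custom list of abbreviations as a string array.
documents = addSentenceDetails(documents,'Abbreviations',["cm" "mm" "in"]);

% Inspect the sentence number assigned to each token.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```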
Abbreviations — List of abbreviations
string array | character vector | cell array of character vectors | table
List of abbreviations, specified as a string array, character vector, cell array of character vectors, or a table.
If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair.
To specify different behaviors when splitting sentences at abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following table describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.
|Usage||Behavior||Example Abbreviation|
|regular||If the next word is a capitalized sentence starter, then break at the trailing period. Otherwise, do not break at the trailing period.||"appt."|
|inner||Do not break after trailing period.||"Dr."|
|reference||If the next token is not a number, then break at a trailing period. If the next token is a number, then do not break at the trailing period.||"fig."|
|unit||If the previous word is a number and the following word is a capitalized sentence starter, then break at a trailing period. If the previous word is a number and the following word is not capitalized, then do not break at a trailing period. If the previous word is not a number, then break at a trailing period.||"in."|
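The table form of Abbreviations can be sketched as follows. This is an illustration under assumptions, not a definitive recipe: the Usage labels ("regular", "inner", "reference", "unit") and the use of a categorical Usage variable are assumptions to verify against the output of the abbreviations function in your release, and the input text is made up.

```matlab
% Hypothetical custom abbreviation table. The Usage labels and the
% categorical type are assumptions; compare with the table returned by
% the abbreviations function before relying on them.
Abbreviation = ["appt"; "Dr"; "fig"; "in"];
Usage = categorical(["regular"; "inner"; "reference"; "unit"]);
tbl = table(Abbreviation,Usage);

% Illustrative text that exercises each abbreviation.
str = "See fig. 3 for details. Dr. Smith measured 30 in. The appt. is on Monday.";
documents = tokenizedDocument(str);

% Detect sentences using the custom table, then inspect the result.
documents = addSentenceDetails(documents,'Abbreviations',tbl);
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```

Starting from the default table returned by abbreviations and appending rows to it may be a safer route than building a table from scratch, since it preserves the variable types the function expects.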
The default value is the output of the
abbreviations function. For Japanese and Korean text, abbreviations do not
usually impact sentence detection.
By default, the function treats single letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.
Example: ["cm" "mm" "in"]
Starters — Words that start a sentence
string array | character vector | cell array of character vectors
Words that start a sentence, specified as a string array, character vector, or a cell array of character vectors. If a sentence starter appears capitalized after a regular abbreviation, then the function detects a sentence boundary at the trailing period. The function ignores any differences in the letter case of the sentence starters.
The default value is the output of the stopWords function.
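A short sketch of how Starters might be combined with Abbreviations is shown below. The text and the starter list are illustrative assumptions; the example only shows the call syntax and does not assert which boundaries are detected.

```matlab
% Illustrative text where a regular abbreviation precedes a capitalized word.
str = "Book the appt. Then call the office.";
documents = tokenizedDocument(str);

% Treat "appt" as a regular abbreviation and supply a custom
% (hypothetical) list of sentence starters.
documents = addSentenceDetails(documents, ...
    'Abbreviations',"appt", ...
    'Starters',["Then" "We"]);

% View the sentence number assigned to each token.
tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```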
DiscardKnownValues — Option to discard previously computed details
false (default) | true
Option to discard previously computed details and recompute them, specified as true or false.
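For instance, if sentence details were already added with the default settings, a later call with different abbreviations could request recomputation. The following is a minimal sketch with made-up text:

```matlab
% Add sentence details with the default settings.
documents = tokenizedDocument("The rod is 5 in. It is made of steel.");
documents = addSentenceDetails(documents);

% Recompute the sentence details with a custom abbreviation list,
% discarding the previously computed values.
documents = addSentenceDetails(documents, ...
    'Abbreviations',["cm" "mm" "in"], ...
    'DiscardKnownValues',true);

tdetails = tokenDetails(documents);
tdetails(:,["Token" "SentenceNumber"])
```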
The addSentenceDetails function detects sentence boundaries based on punctuation characters and line number information. For English and German text, the function also uses a list of abbreviations passed to the function.
For other languages, you might need to specify your own list of abbreviations for sentence detection. To do this, use the 'Abbreviations' option of addSentenceDetails.
If emoticons or emoji characters appear after a terminating punctuation character, then the function splits the sentence after the emoticons and emoji.