addPartOfSpeechDetails
Add part-of-speech tags to documents
Syntax
updatedDocuments = addPartOfSpeechDetails(documents)
updatedDocuments = addPartOfSpeechDetails(documents,Name,Value)
Description
Use addPartOfSpeechDetails to add part-of-speech tags to documents.
The function supports English, Japanese, German, and Korean text.
updatedDocuments = addPartOfSpeechDetails(documents) detects parts of speech in documents and updates the token details. By default, the function retokenizes the text for part-of-speech tagging. For example, the function splits the word "you're" into the tokens "you" and "'re". To get the part-of-speech details from updatedDocuments, use tokenDetails.
updatedDocuments = addPartOfSpeechDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.
Tip
Use addPartOfSpeechDetails before using the lower, upper, erasePunctuation, normalizeWords, removeWords, and removeStopWords functions, because addPartOfSpeechDetails uses information that these functions remove.
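The ordering described in the tip can be sketched as follows. This is a minimal illustration that assumes Text Analytics Toolbox is installed; the sample sentence is made up for the example and is not part of the original page.

```matlab
% Hypothetical sample text (an assumption for illustration only).
str = "The quick brown fox isn't jumping over the lazy dog.";
documents = tokenizedDocument(str);

% Tag parts of speech first, while letter case and punctuation
% are still available to the tagger.
documents = addPartOfSpeechDetails(documents);

% Only afterwards apply preprocessing that removes that information.
documents = erasePunctuation(documents);
documents = lower(documents);

% Inspect the resulting token details.
tdetails = tokenDetails(documents);
head(tdetails)
```

Running the preprocessing steps in the opposite order would hand the tagger text that has already lost its case and punctuation cues.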
Examples
Add Part-of-Speech Details to Documents
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
       Token       DocumentNumber    LineNumber     Type      Language
    ___________    ______________    __________    _______    ________
    "fairest"            1               1         letters       en
    "creatures"          1               1         letters       en
    "desire"             1               1         letters       en
    "increase"           1               1         letters       en
    "thereby"            1               1         letters       en
    "beautys"            1               1         letters       en
    "rose"               1               1         letters       en
    "might"              1               1         letters       en
Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
       Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech
    ___________    ______________    ______________    __________    _______    ________    ______________
    "fairest"            1                 1               1         letters       en       adjective
    "creatures"          1                 1               1         letters       en       noun
    "desire"             1                 1               1         letters       en       noun
    "increase"           1                 1               1         letters       en       noun
    "thereby"            1                 1               1         letters       en       adverb
    "beautys"            1                 1               1         letters       en       noun
    "rose"               1                 1               1         letters       en       noun
    "might"              1                 1               1         letters       en       auxiliary-verb
Get Part of Speech Details of Japanese Text
Tokenize Japanese text using tokenizedDocument.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで 苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"
    "すもももももももものうち。"];
documents = tokenizedDocument(str);
For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.
tdetails = tokenDetails(documents);
head(tdetails)
     Token     DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma       Entity
    _______    ______________    __________    ___________    ________    ____________    _______    __________
    "恋"             1               1         letters           ja       noun            "恋"       non-entity
    "に"             1               1         letters           ja       adposition      "に"       non-entity
    "悩み"           1               1         letters           ja       verb            "悩む"     non-entity
    "、"             1               1         punctuation       ja       punctuation     "、"       non-entity
    "苦しむ"         1               1         letters           ja       verb            "苦しむ"   non-entity
    "。"             1               1         punctuation       ja       punctuation     "。"       non-entity
    "恋"             2               1         letters           ja       noun            "恋"       non-entity
    "の"             2               1         letters           ja       adposition      "の"       non-entity
Get Part of Speech Details of German Text
Tokenize German text using tokenizedDocument.
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents =
  2x1 tokenizedDocument:

    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .
To get the part-of-speech details for German text, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
To view the part-of-speech details, use the tokenDetails function.
tdetails = tokenDetails(documents);
head(tdetails)
     Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
    ________    ______________    ______________    __________    ___________    ________    ____________
    "Guten"           1                 1               1         letters           de       adjective
    "Morgen"          1                 1               1         letters           de       noun
    "."               1                 1               1         punctuation       de       punctuation
    "Wie"             1                 2               1         letters           de       adverb
    "geht"            1                 2               1         letters           de       verb
    "es"              1                 2               1         letters           de       pronoun
    "dir"             1                 2               1         letters           de       pronoun
    "?"               1                 2               1         punctuation       de       punctuation
Input Arguments
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.
RetokenizeMethod — Method to retokenize documents
'part-of-speech' (default) | 'none'
Method to retokenize documents, specified as one of the following:
- 'part-of-speech' – Transform the tokens for part-of-speech tagging. The function performs these tasks:
  - Split compound words. For example, split the compound word "wanna" into the tokens "want" and "to". This includes compound words containing apostrophes. For example, the function splits the word "don't" into the tokens "do" and "n't".
  - Merge periods that do not end sentences with preceding tokens. For example, merge the tokens "Mr" and "." into the token "Mr.".
  - For German text, merge abbreviations that span multiple tokens. For example, merge the tokens "z", ".", "B", and "." into the single token "z. B.".
  - Merge runs of periods into ellipses. For example, merge three instances of "." into the single token "...".
- 'none' – Do not retokenize the documents.
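To see the effect of the two RetokenizeMethod settings side by side, a sketch along these lines may help. It assumes Text Analytics Toolbox is installed, and the sample sentence is a made-up illustration chosen to contain an apostrophe compound and a non-final period.

```matlab
% Hypothetical sample text (an assumption for illustration only).
str = "I don't know Mr. Smith.";
documents = tokenizedDocument(str);

% Default behavior: retokenize for part-of-speech tagging.
docsDefault = addPartOfSpeechDetails(documents);

% Keep the original tokens unchanged.
docsNone = addPartOfSpeechDetails(documents,'RetokenizeMethod','none');

% Compare the token tables of the two results.
tokenDetails(docsDefault)
tokenDetails(docsNone)
```

Comparing the two tables shows which tokens, if any, the default method split or merged for this input.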
Abbreviations — List of abbreviations
string array | character vector | cell array of character vectors | table
List of abbreviations for sentence detection, specified as a string array, character vector, cell array of character vectors, or a table.
If the input documents do not contain sentence details, then the function first runs the addSentenceDetails function and specifies the abbreviation list given by 'Abbreviations'. To specify more options for sentence detection (for example, sentence starters), use the addSentenceDetails function before using addPartOfSpeechDetails.
If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. Specify the sentence starters using the Starters name-value pair of the addSentenceDetails function.
To specify different behaviors when splitting sentences at abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following table describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.
Usage | Behavior | Example Abbreviation | Example Text | Detected Sentences |
---|---|---|---|---|
regular | If the next word is a capitalized sentence starter, then break at the trailing period. | "appt." | "Book an appt. We'll meet then." | "Book an appt." and "We'll meet then." |
regular | Otherwise, do not break at the trailing period. | "appt." | "Book an appt. today." | "Book an appt. today." |
inner | Do not break after trailing period. | "Dr." | "Dr. Smith." | "Dr. Smith." |
reference | If the next token is a number, then do not break at the trailing period. | "fig." | "See fig. 3." | "See fig. 3." |
reference | If the next token is not a number, then break at a trailing period. | "fig." | "Try a fig. They are nice." | "Try a fig." and "They are nice." |
unit | If the previous word is a number and the following word is a capitalized sentence starter, then break at a trailing period. | "in." | "The height is 30 in. The width is 10 in." | "The height is 30 in." and "The width is 10 in." |
unit | If the previous word is a number and the following word is not capitalized, then do not break at a trailing period. | "in." | "The item is 10 in. wide." | "The item is 10 in. wide." |
unit | If the previous word is not a number, then break at a trailing period. | "in." | "Come in. Sit down." | "Come in." and "Sit down." |
The default value is the output of the abbreviations function. For Japanese and Korean text, abbreviations do not usually impact sentence detection.
Tip
By default, the function treats single-letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.
Data Types: char | string | table | cell
DiscardKnownValues — Option to discard previously computed details
false (default) | true
Option to discard previously computed details and recompute them, specified as true or false.
Data Types: logical
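A call that forces recomputation might look like the sketch below. It assumes Text Analytics Toolbox is installed and reuses the German strings from the example earlier on this page; when recomputation is actually worthwhile depends on your workflow.

```matlab
% German strings taken from the example above.
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);

% First pass: compute the part-of-speech details.
documents = addPartOfSpeechDetails(documents);

% Second pass: discard the details computed above and recompute them.
documents = addPartOfSpeechDetails(documents,'DiscardKnownValues',true);

tdetails = tokenDetails(documents);
head(tdetails)
```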
Output Arguments
updatedDocuments — Updated documents
tokenizedDocument array
Updated documents, returned as a tokenizedDocument array. To get the token details from updatedDocuments, use tokenDetails.
More About
Part-of-Speech Tags
The addPartOfSpeechDetails function adds part-of-speech tags to the table returned by the tokenDetails function. The function tags each token with a categorical tag with one of the following class names:
- adjective — Adjective
- adposition — Adposition
- adverb — Adverb
- auxiliary-verb — Auxiliary verb
- coord-conjunction — Coordinating conjunction
- determiner — Determiner
- interjection — Interjection
- noun — Noun
- numeral — Numeral
- particle — Particle
- pronoun — Pronoun
- proper-noun — Proper noun
- punctuation — Punctuation
- subord-conjunction — Subordinating conjunction
- symbol — Symbol
- verb — Verb
- other — Other
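Because the tags are categorical, the token table can be filtered by class name. The following is a minimal sketch, assuming Text Analytics Toolbox is installed; it reuses the German strings from the example earlier on this page, and the variable names isNoun and nouns are made up for illustration.

```matlab
% German strings taken from the example above.
str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str);
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% PartOfSpeech is categorical, so compare it against a class name.
isNoun = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(isNoun)

% Show how often each tag occurs.
summary(tdetails.PartOfSpeech)
```

The same logical-indexing pattern works for any of the class names listed above.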
Algorithms
If the input documents do not contain sentence details, then the function first runs addSentenceDetails.
Version History
Introduced in R2018b