mecabOptions
Options for MeCab tokenization
Description
A mecabOptions object specifies additional options for tokenizing Japanese and Korean text.
To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
Creation
Description
options = mecabOptions creates a MeCab tokenization option set with the default values for tokenizing Japanese.
options = mecabOptions(Name,Value) sets additional Properties using one or more name-value pair arguments.
Properties
Model — Path to trained model
string scalar | character vector
Path to trained model (MeCab dictionary), specified as a string scalar or a character vector.
The default value is a path to the internal dictionary for Japanese tokenization.
Example: "C:\myDict"
Data Types: char | string
UserModel — Files containing model extensions
"" (default) | string array | character vector | cell array of character vectors
Files containing model extensions (MeCab user dictionary .dic files), specified as a string array, a character vector, or a cell array of character vectors.
Example: "C:\myFile.dic"
Data Types: char | string | cell
LemmaExtractor — Function extracting lemma from MeCab reply
@textanalytics.ja.mecabToLemma (default) | function handle
Function extracting lemma from MeCab reply, specified as a function handle.
The function must have the form lemmata = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:
Feature – String vector of the same size as words, containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.
The output lemmata is a string array of the same size as words containing the extracted lemmata.
The default lemma extractor is the textanalytics.ja.mecabToLemma function.
Data Types: function_handle
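The contract described above can be illustrated with a minimal sketch. The extractor name surfaceAsLemma is hypothetical, and the info struct built here holds placeholder values only, not real MeCab output. The extractor returns each token unchanged as its own lemma, which is one simple way to satisfy the required form lemmata = fun(words,info).

```matlab
% Hypothetical extractor: use each token's surface form as its lemma.
% The info argument is accepted to match the required signature but is unused.
surfaceAsLemma = @(words,info) string(words);

% Check the contract on mock data (placeholder values, no MeCab call involved).
words = ["悩み" "苦しむ"];
info = struct('Feature',["f1" "f2"],'PartOfSpeech',[1 2]);
lemmata = surfaceAsLemma(words,info);
assert(isequal(size(lemmata),size(words)))

% Pass the extractor to mecabOptions, then tokenize with the resulting options.
options = mecabOptions('LemmaExtractor',surfaceAsLemma);
documents = tokenizedDocument("恋に悩み、苦しむ。",'TokenizeMethod',options);
```

A wrapper of this kind may be useful when the dictionary in use does not store base forms in the layout the default extractor expects.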
POSExtractor — Function extracting part-of-speech information from MeCab reply
@textanalytics.ja.mecabToPOS (default) | function handle
Function extracting part-of-speech information from MeCab reply, specified as a function handle.
The function must have the form posTags = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:
Feature – String vector of the same size as words, containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.
The output posTags is a categorical array of the same size as words containing the extracted part-of-speech tags from the following categories:
adjective
adposition
adverb
auxiliary-verb
coord-conjunction
determiner
interjection
noun
numeral
pronoun
proper-noun
punctuation
symbol
verb
other
The default part-of-speech information extractor is the textanalytics.ja.mecabToPOS function.
Data Types: function_handle
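As a rough sketch of a function with the form posTags = fun(words,info), the following script defines a hypothetical extractor, punctuationOnlyPOS, that tags the two Japanese punctuation marks "、" and "。" as punctuation and every other token as other. It ignores info entirely, so it is only meant to show the expected input and output shapes, not a realistic tagger. The mock info struct contains placeholder values.

```matlab
% Mock input (placeholder info values, no MeCab call involved).
words = ["恋" "に" "悩み" "、" "苦しむ" "。"];
info = struct('Feature',strings(size(words)),'PartOfSpeech',zeros(size(words)));

posTags = punctuationOnlyPOS(words,info);
assert(isequal(size(posTags),size(words)))
assert(posTags(4) == "punctuation" && posTags(1) == "other")

% Pass the extractor to mecabOptions as a function handle.
options = mecabOptions('POSExtractor',@punctuationOnlyPOS);

function posTags = punctuationOnlyPOS(words,info) %#ok<INUSD>
    % Category names taken from the list in this section.
    posCategories = ["adjective" "adposition" "adverb" "auxiliary-verb" ...
        "coord-conjunction" "determiner" "interjection" "noun" "numeral" ...
        "pronoun" "proper-noun" "punctuation" "symbol" "verb" "other"];
    labels = repmat("other",size(words));
    labels(ismember(words,["、" "。"])) = "punctuation";
    % Declare all categories so the output is restricted to the documented set.
    posTags = categorical(labels,posCategories);
end
```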
NERExtractor — Function extracting named entity information from MeCab reply
@textanalytics.ja.mecabToNER (default) | function handle
Function extracting named entity information from MeCab reply, specified as a function handle.
The function must have the form entities = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:
Feature – String vector of the same size as words, containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.
The output entities is a categorical array of the same size as words containing the extracted entities from the following categories:
non-entity
person
organization
location
other
The default named entity extractor is the textanalytics.ja.mecabToNER function.
Data Types: function_handle
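A function with the form entities = fun(words,info) could, for instance, look tokens up in a fixed list. The sketch below assumes a hypothetical extractor named gazetteerNER and a made-up two-entry list of place names; it ignores info and is intended only to show the expected categorical output. The mock info struct contains placeholder values.

```matlab
% Mock input (placeholder info values, no MeCab call involved).
words = ["東京" "に" "行く"];
info = struct('Feature',strings(size(words)),'PartOfSpeech',zeros(size(words)));

entities = gazetteerNER(words,info);
assert(entities(1) == "location" && entities(2) == "non-entity")

% Pass the extractor to mecabOptions as a function handle.
options = mecabOptions('NERExtractor',@gazetteerNER);

function entities = gazetteerNER(words,info) %#ok<INUSD>
    % Category names taken from the list in this section.
    entityCategories = ["non-entity" "person" "organization" "location" "other"];
    % Hypothetical gazetteer used only for illustration.
    knownLocations = ["東京" "京都"];
    labels = repmat("non-entity",size(words));
    labels(ismember(words,knownLocations)) = "location";
    entities = categorical(labels,entityCategories);
end
```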
Examples
Create MeCab Options Object
Create a MecabOptions object containing the default options for Japanese tokenization.
options = mecabOptions
options = 
  MecabOptions with properties:

             Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
         UserModel: ""
    LemmaExtractor: @textanalytics.ja.mecabToLemma
      POSExtractor: @textanalytics.ja.mecabToPOS
      NERExtractor: @textanalytics.ja.mecabToNER
Specify MeCab User Dictionary for Tokenization
Tokenize Japanese text using custom MeCab options.
Create a string array of Japanese text.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
Create a MecabOptions object and specify a user model as a .dic file using the 'UserModel' option.
options = mecabOptions('UserModel','myFile.dic')
options = 
  MecabOptions with properties:

             Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
         UserModel: "myFile.dic"
    LemmaExtractor: @textanalytics.ja.mecabToLemma
      POSExtractor: @textanalytics.ja.mecabToPOS
      NERExtractor: @textanalytics.ja.mecabToNER
Tokenize the text with the specified options by using the 'TokenizeMethod' option.
documents = tokenizedDocument(str,'TokenizeMethod',options)
documents = 
  4×1 tokenizedDocument:

     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。
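To see what the lemma, part-of-speech, and entity extractors produce for each token, one option is to inspect the token details of the tokenized documents. The sketch below assumes the tokenDetails function of Text Analytics Toolbox is available and uses the default options rather than a user dictionary; the exact set of columns returned may depend on the release.

```matlab
% Tokenize one sentence with the default MeCab options.
str = "恋に悩み、苦しむ。";
options = mecabOptions;
documents = tokenizedDocument(str,'TokenizeMethod',options);

% Retrieve a table with one row per token and display the first rows.
tdetails = tokenDetails(documents);
head(tdetails)
```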
Version History
Introduced in R2019b