Korean Language Support
This topic summarizes the Text Analytics Toolbox™ features that support Korean text.
Tokenization
The tokenizedDocument function automatically detects Korean input.
Alternatively, set the 'Language' option in tokenizedDocument to 'ko'. This option specifies the
language details of the tokens. To view the language details of the tokens, use
tokenDetails. These language details determine the behavior of the removeStopWords,
addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails
functions on the tokens.
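For example, the following sketch tokenizes two short Korean sentences (the sample text here is illustrative) and then views the language details of the tokens:

    str = [
        "안녕하세요. 반갑습니다."
        "한국어 텍스트 분석의 예제입니다."];
    documents = tokenizedDocument(str);
    % View the token details, including the Language variable.
    tdetails = tokenDetails(documents);
    head(tdetails)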
To specify additional MeCab options for tokenization, create a mecabOptions object. To
tokenize using the specified MeCab options, pass the object to the 'TokenizeMethod' option of tokenizedDocument.
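As a minimal sketch, the following creates a MeCab options object with default settings and passes it to tokenizedDocument; in practice, set the mecabOptions properties you need before tokenizing. The sample sentence is illustrative.

    % Create a MeCab options object with default settings.
    options = mecabOptions;
    % Tokenize using the specified MeCab options.
    documents = tokenizedDocument("한국어 텍스트를 토큰화합니다.", ...
        'TokenizeMethod',options);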
Part of Speech Details
The tokenDetails function, by default, includes part-of-speech details with
the token details.
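For example, the PartOfSpeech variable appears in the table returned by tokenDetails without first calling addPartOfSpeechDetails (the sample sentence is illustrative):

    documents = tokenizedDocument("텍스트 분석을 시작합니다.");
    tdetails = tokenDetails(documents);
    % The table includes a PartOfSpeech variable by default.
    tdetails(:,{'Token','PartOfSpeech'})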
Named Entity Recognition
The tokenDetails function, by default, includes entity details with the
token details.
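For example, the Entity variable appears in the token details by default (the sample sentence, which mentions a person and a place, is illustrative):

    documents = tokenizedDocument("마리는 서울에서 일합니다.");
    tdetails = tokenDetails(documents);
    % The table includes an Entity variable by default.
    tdetails(:,{'Token','Entity'})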
Stop Words
To remove stop words from documents according to the token language details, use
removeStopWords.
For a list of Korean stop words, set the 'Language' option in
stopWords to 'ko'.
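A short sketch, using an illustrative sentence:

    documents = tokenizedDocument("이것은 불용어 제거의 예제입니다.");
    % Remove stop words using the token language details.
    newDocuments = removeStopWords(documents);
    % View the list of Korean stop words.
    words = stopWords('Language','ko');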
Lemmatization
To lemmatize tokens according to the token language details, use normalizeWords and set the 'Style' option to
'lemma'.
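For example (the sample sentence is illustrative):

    documents = tokenizedDocument("한국어 단어를 표제어로 변환합니다.");
    % Lemmatize the tokens using the token language details.
    documents = normalizeWords(documents,'Style','lemma');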
Language-Independent Features
Word and N-Gram Counting
The bagOfWords and bagOfNgrams functions support tokenizedDocument input regardless of language. If you have a tokenizedDocument array containing your data, then you can use these functions.
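For example, the following sketch counts words and bigrams in a small illustrative collection of Korean documents:

    documents = tokenizedDocument([
        "한국어 텍스트 분석의 예제입니다."
        "단어 빈도를 계산합니다."]);
    % Count words and bigrams.
    bag = bagOfWords(documents);
    ngrams = bagOfNgrams(documents,'NgramLengths',2);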
Modeling and Prediction
The fitlda and fitlsa functions support bagOfWords and bagOfNgrams input regardless of language. If you have a bagOfWords or bagOfNgrams object containing your data, then you can use these functions.
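As a sketch, assuming bag is a bagOfWords object over a realistically sized corpus (such as one created as above, but with many more documents):

    % Fit a latent Dirichlet allocation model with four topics.
    numTopics = 4;
    mdl = fitlda(bag,numTopics);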
The trainWordEmbedding function supports tokenizedDocument or file input regardless of language. If you have a tokenizedDocument array or a file containing your data in the correct format, then you can use this function.
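A sketch; training a meaningful embedding requires a large collection of documents, so documents is assumed here to be a large tokenizedDocument array of Korean text:

    % Train a word embedding from the tokenized documents.
    emb = trainWordEmbedding(documents);
    % Look up the vector for a word in the vocabulary.
    vec = word2vec(emb,"단어");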
See Also
tokenizedDocument | removeStopWords | stopWords | addPartOfSpeechDetails | tokenDetails | normalizeWords | addLanguageDetails | addEntityDetails