Modeling and Prediction

Develop predictive models using topic models and word embeddings

To find clusters and extract features from high-dimensional text datasets, you can use machine learning techniques and models such as LSA, LDA, and word embeddings. You can combine features created with Text Analytics Toolbox™ with features from other data sources. With these features, you can build machine learning models that take advantage of textual, numeric, and other types of data.
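
As a rough illustration of combining text-derived features with another data source, the following minimal sketch fits a small LSA model and appends a numeric column to the resulting document scores. The sample strings, the `numericFeature` values, and the choice of two components are assumptions made up for this example; they are not taken from the toolbox documentation.

```matlab
% Hypothetical toy data: four short documents and one numeric
% measurement per document (for example, a sensor reading).
textData = [
    "the pump failed after the coolant leak"
    "coolant leak detected near the pump"
    "software update installed without errors"
    "update failed with software errors"];
numericFeature = [3.2; 2.9; 0.4; 0.7];

% Tokenize the text and build a bag-of-words model.
documents = tokenizedDocument(textData);
bag = bagOfWords(documents);

% Fit an LSA model with two components (an arbitrary choice here)
% and project the documents into the lower-dimensional space.
numComponents = 2;
mdl = fitlsa(bag,numComponents);
textFeatures = transform(mdl,documents);   % one row per document

% Concatenate the text features with the numeric feature.
% The combined matrix has one row per document and three columns.
features = [textFeatures numericFeature];
```

The combined `features` matrix could then be passed to a classifier or regression function from another toolbox. The signs and exact values of the LSA scores depend on the underlying decomposition, so treat them as arbitrary coordinates rather than interpretable quantities.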

Functions

bagOfWords - Bag-of-words model
bagOfNgrams - Bag-of-n-grams model
addDocument - Add documents to bag-of-words or bag-of-n-grams model
removeDocument - Remove documents from bag-of-words or bag-of-n-grams model
removeInfrequentWords - Remove words with low counts from bag-of-words model
removeInfrequentNgrams - Remove infrequently seen n-grams from bag-of-n-grams model
removeWords - Remove selected words from documents or bag-of-words model
removeNgrams - Remove n-grams from bag-of-n-grams model
removeEmptyDocuments - Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
topkwords - Most important words in bag-of-words model or LDA topic
topkngrams - Most frequent n-grams
encode - Encode documents as matrix of word or n-gram counts
tfidf - Term Frequency–Inverse Document Frequency (tf-idf) matrix
join - Combine multiple bag-of-words or bag-of-n-grams models
vaderSentimentScores - Sentiment scores with VADER algorithm (Since R2019b)
ratioSentimentScores - Sentiment scores with ratio rule (Since R2019b)
encode - Tokenize and encode text for transformer neural network (Since R2023b)
decode - Convert token codes to tokens (Since R2023b)
encodeTokens - Convert tokens to token codes (Since R2023b)
subwordTokenize - Tokenize text into subwords using BERT tokenizer (Since R2023b)
wordTokenize - Tokenize text into words using tokenizer (Since R2023b)
bert - Pretrained BERT model (Since R2023b)
bertDocumentClassifier - BERT document classifier (Since R2023b)
classify - Classify document using BERT document classifier (Since R2023b)
bertTokenizer - WordPiece BERT tokenizer (Since R2023b)
bpeTokenizer - Byte pair encoding tokenizer (Since R2024a)
trainBERTDocumentClassifier - Train BERT document classifier (Since R2023b)
documentEmbedding - Document embedding model to map documents to vectors (Since R2024a)
embed - Map document to embedding vector (Since R2024a)
fastTextWordEmbedding - Pretrained fastText word embedding
wordEncoding - Word encoding model to map words to indices and back
doc2sequence - Convert documents to sequences for deep learning
wordEmbeddingLayer - Word embedding layer for deep learning neural network
word2vec - Map word to embedding vector
word2ind - Map word to encoding index
vec2word - Map embedding vector to word
ind2word - Map encoding index to word
isVocabularyWord - Test if word is member of word embedding or encoding
readWordEmbedding - Read word embedding from file
trainWordEmbedding - Train word embedding
writeWordEmbedding - Write word embedding file
wordEmbedding - Word embedding model to map words to vectors and back
extractSummary - Extract summary from documents (Since R2020a)
rakeKeywords - Extract keywords using RAKE (Since R2020b)
textrankKeywords - Extract keywords using TextRank (Since R2020b)
bleuEvaluationScore - Evaluate translation or summarization with BLEU similarity score (Since R2020a)
rougeEvaluationScore - Evaluate translation or summarization with ROUGE similarity score (Since R2020a)
bm25Similarity - Document similarities with BM25 algorithm (Since R2020a)
cosineSimilarity - Document similarities with cosine similarity (Since R2020a)
textrankScores - Document scoring with TextRank algorithm (Since R2020a)
lexrankScores - Document scoring with LexRank algorithm (Since R2020a)
mmrScores - Document scoring with Maximal Marginal Relevance (MMR) algorithm (Since R2020a)
fitlda - Fit latent Dirichlet allocation (LDA) model
fitlsa - Fit LSA model
resume - Resume fitting LDA model
logp - Document log-probabilities and goodness of fit of LDA model
predict - Predict top LDA topics of documents
transform - Transform documents into lower-dimensional space
ldaModel - Latent Dirichlet allocation (LDA) model
lsaModel - Latent semantic analysis (LSA) model
addEntityDetails - Add entity tags to documents
trainHMMEntityModel - Train HMM-based model for named entity recognition (NER) (Since R2023a)
predict - Predict entities using named entity recognition (NER) model (Since R2023a)
hmmEntityModel - HMM-based model for named entity recognition (NER) (Since R2023a)
wordcloud - Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
textscatter - 2-D scatter plot of text
textscatter3 - 3-D scatter plot of text

Topics

Classification and Modeling

Sentiment Analysis and Keyword Extraction

Deep Learning

Language Support