You can use Text Analytics Toolbox™ to extract text data from a variety of file formats including Microsoft Word, PDF, and text files. Text data extraction can be done at the individual file level or on large collections of files.
Raw text data is often messy and requires preprocessing to extract meaningful words. You can remove content, such as URLs, HTML tags, and punctuation, using high-level filtering functions, and then automatically split text into words using a tokenization algorithm. Additionally, you can filter out very long or very short words, or words that occur infrequently or too frequently. This enables you to focus on the most meaningful words when you are distinguishing between documents.
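The cleanup and tokenization steps described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the toolbox's own functions; the `preprocess` name, the regular expressions, and the length thresholds are all assumptions chosen for the example.

```python
import re
import string

# Illustrative patterns (assumptions): real-world HTML and URLs can be messier.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text, min_len=3, max_len=15):
    """Strip HTML tags, URLs, and punctuation, then split into lowercase words."""
    text = TAG_RE.sub(" ", text)          # remove tags first so they cannot cling to a URL
    text = URL_RE.sub(" ", text)          # then remove URLs
    text = text.translate(PUNCT_TABLE)    # replace each punctuation character with a space
    tokens = text.lower().split()         # whitespace tokenization
    # Drop very short and very long tokens.
    return [t for t in tokens if min_len <= len(t) <= max_len]

raw = "<p>Visit https://example.com for the BEST text-mining tips, ok?</p>"
tokens = preprocess(raw)
print(tokens)
```

Filtering by word frequency would follow the same pattern once counts are available across the whole document collection.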
English words often carry inflectional endings that mark characteristics such as possession or tense. You can use the Porter stemming algorithm, which applies a set of rules that strips these endings and reduces words to their stem or root. For example, stemming normalizes the words “test,” “tests,” and “testing” to the common root “test.”
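To make the idea of rule-based suffix stripping concrete, here is a deliberately simplified sketch in Python. It is not the Porter algorithm, which uses several ordered rule phases and conditions on the remaining stem; the `toy_stem` function and its four suffixes are assumptions made only for illustration.

```python
def toy_stem(word):
    """Strip a few common English suffixes. This is NOT the full Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least three remaining characters so that short words
        # such as "is" or "sing" are left unchanged.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["test", "tests", "testing", "tested"]]
print(stems)
```

A toy rule set like this one breaks down quickly on irregular words, which is why a full stemming algorithm is normally used in practice.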
You can plot text data using built-in visualizations. Word clouds visually show the relative frequency of words using font size and color. Text scatter plots show text in Cartesian coordinate systems and can be used for interpreting the output from machine learning algorithms.
You can convert tokenized text into a numeric form using a bag-of-words model that counts the number of times each word occurs in each document. Converting the data in the bag-of-words model to a term frequency-inverse document frequency (TF-IDF) form gives more weight to words that appear often in a document (term frequency) and less weight to those that appear in many documents (inverse document frequency). You can choose from many common weighting functions for calculating term frequency and inverse document frequency.
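As a rough illustration of the two representations, the sketch below builds a bag-of-words count matrix and then applies one common TF-IDF weighting: raw count multiplied by log(N / document frequency). It is written in plain Python rather than with the toolbox, and the three tiny documents are made up for the example. Other weighting functions would give different numbers.

```python
import math
from collections import Counter

# Hypothetical, already tokenized documents.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc})

# Bag-of-words: counts[i][j] is how often vocab[j] occurs in docs[i].
counts = []
for doc in docs:
    c = Counter(doc)
    counts.append([c[word] for word in vocab])

# Document frequency: number of documents containing each word.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Inverse document frequency, using the plain log(N / df) variant.
idf = [math.log(n_docs / df) for df in doc_freq]

# TF-IDF: term count scaled by the word's IDF weight.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in counts]
```

In the first document, "mat" ends up with a higher weight than "cat" even though both occur once, because "cat" also appears in another document.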
Word-embedding models provide maps from individual words to corresponding vectors. The vectors attempt to preserve the relationships between words and can serve as an alternative to bag-of-words models for converting text data to numeric formats.
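One way to see how vectors can preserve relationships between words is to compare them with cosine similarity. The sketch below uses a hand-written three-dimensional embedding; the words and numbers are invented for illustration and do not come from any trained model.

```python
import math

# Hypothetical embedding: real models typically use hundreds of dimensions.
embedding = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_related = cosine(embedding["king"], embedding["queen"])
sim_unrelated = cosine(embedding["king"], embedding["apple"])
print(sim_related > sim_unrelated)
```

With these made-up vectors, "king" is much closer to "queen" than to "apple", which is the kind of structure a trained embedding aims to capture.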
You can train word-embedding models using the word2vec algorithm. Additionally, word-embedding models are used as a preprocessing step for applying deep learning techniques to text data. With Neural Network Toolbox™, you can use a word2vec model in conjunction with a deep neural network, such as an LSTM network, for tasks such as text classification.
Training a word-embedding model requires a large amount of data. Consequently, it is common to use a word-embedding model that has already been trained on a general dataset. You can import pretrained word-embedding models such as those available in word2vec, FastText, and GloVe formats. You can then use these pretrained models to map the words in your dataset to their corresponding word vectors.
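Pretrained vectors are commonly distributed as plain text, with one word per line followed by its vector components; some files also begin with a header line giving the vocabulary size and dimension. The following Python sketch parses that layout and maps words to vectors. The `load_text_vectors` function, the header heuristic, and the sample data are assumptions for illustration, not the toolbox's import routine.

```python
import io

def load_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Heuristic (assumption): a first line made of exactly two integers
        # is treated as a "vocab_size dimension" header and skipped.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Made-up sample standing in for a real pretrained file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)

# Map the words of a document to vectors, skipping out-of-vocabulary words.
doc = ["cat", "dog", "unicorn"]
mapped = [vectors[w] for w in doc if w in vectors]
```

Words that are missing from the pretrained vocabulary, such as "unicorn" here, have to be dropped or handled separately.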
Once text data has been converted to a numeric representation, the resulting feature matrix is often “wide,” meaning that it has many features (for a bag-of-words model, one per vocabulary word). You can apply machine learning algorithms, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), to work with these wide matrices. You can use LDA to identify topics across a set of documents. Additionally, you can use LDA or LSA to project the wide feature matrices into a lower-dimensional space, and then apply other machine learning algorithms for tasks such as document classification.
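LSA is based on a truncated singular value decomposition of the document-term matrix. As a minimal sketch of the dimensionality reduction idea, the Python code below finds only the leading right singular vector by power iteration and projects each document onto it, giving a one-dimensional representation. The matrix values, the function names, and the choice of a single component are assumptions for illustration; a practical implementation would use a linear algebra library and keep several components.

```python
import math

# Hypothetical document-term counts: 3 documents (rows) by 4 terms (columns).
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
]

def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_right_singular_vector(M, iters=200):
    """Power iteration on M^T M; converges to the leading right singular vector."""
    Mt = transpose(M)
    v = [1.0] * len(M[0])
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

v = top_right_singular_vector(A)

# Project each document onto the leading component (one number per document).
scores = matvec(A, v)
print(scores)
```

The first two documents share terms and receive nearly identical scores, while the third, which uses different terms, scores close to zero on this component.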