Data Sets for Text Analytics
This page lists data sets that you can use to get started with text analytics applications.
Data Set | Description | Task |
---|---|---|
Factory Reports
| The Factory Reports data set is a table containing approximately 500 reports with various attributes, including a plain text description in the variable Description and a label in the variable Category. Read the Factory Reports data from the file factoryReports.csv, then extract the text data and the labels from the Description and Category columns.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
labels = data.Category;
For an example showing how to process this data for deep learning, see Classify Text Data Using Deep Learning (Deep Learning Toolbox). |
Text classification, topic modeling |
Shakespeare's Sonnets
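| The file sonnets.txt contains Shakespeare's sonnets in a single text file. Read the Shakespeare's Sonnets data from the file using extractFileText.
filename = "sonnets.txt";
textData = extractFileText(filename);
The sonnets are indented by two whitespace characters and are separated by two newline characters. Remove the indentations using replace and split the text into separate sonnets using split. Then keep every second element starting from the fifth, which drops the leading title elements and the title that precedes each sonnet.
textData = replace(textData,"  ","");
textData = split(textData,[newline newline]);
textData = textData(5:2:end);
For an example showing how to process this data for deep learning, see Generate Text Using Deep Learning (Deep Learning Toolbox). |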
Topic modeling, text generation |
ArXiv Metadata
| The arXiv API allows you to access the metadata of scientific e-prints submitted to https://arxiv.org, including the abstract and subject areas. For more information, see https://arxiv.org/help/api. Import a set of abstracts and category labels from math papers using the arXiv API.
url = "https://export.arxiv.org/oai2?verb=ListRecords" + ...
    "&set=math" + ...
    "&metadataPrefix=arXiv";
options = weboptions('Timeout',160);
code = webread(url,options);
For an example showing how to parse the returned XML code and import more records, see Multilabel Text Classification Using Deep Learning. |
Text classification, topic modeling |
Books from Project Gutenberg
| You can download many books from Project Gutenberg. For example, download the text of Alice's Adventures in Wonderland by Lewis Carroll from https://www.gutenberg.org/files/11/11-h/11-h.htm using the webread function.
url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);
The HTML code contains the relevant text inside p (paragraph) elements. Parse the HTML code using htmlTree and then find all the elements with the element name "p" using findElement.
tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);
Extract the text data from the HTML subtrees using the extractHTMLText function and remove the empty elements.
textData = extractHTMLText(subtrees);
textData(textData == "") = [];
For an example showing how to process this data for deep learning, see Word-by-Word Text Generation Using Deep Learning. |
Topic modeling, text generation |
Weekend Updates
| The file weekendUpdates.xlsx contains the text data in the variable TextData. Read the file using readtable and extract the text data from that variable.
filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textData = tbl.TextData;
For an example showing how to process this data, see Analyze Sentiment in Text. |
Sentiment analysis |
Roman Numerals
| The CSV file romanNumerals.csv contains decimal-Roman numeral pairs. Load the pairs from the CSV file, reading both columns as strings and naming them Source and Target.
filename = fullfile("romanNumerals.csv");
options = detectImportOptions(filename, ...
    'TextType','string', ...
    'ReadVariableNames',false);
options.VariableNames = ["Source" "Target"];
options.VariableTypes = ["string" "string"];
data = readtable(filename,options);
For an example showing how to process this data for deep learning, see Sequence-to-Sequence Translation Using Attention. |
Sequence-to-sequence translation |
Finance Reports
| The Securities and Exchange Commission (SEC) allows you to access financial reports via the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) API. For more information, see https://www.sec.gov/search-filings/edgar-search-assistance/accessing-edgar-data. To download this data, use the function financeReports, which is attached to the example Generate Domain Specific Sentiment Lexicon as a supporting file.
year = 2019;
qtr = 4;
maxLength = 2e6;
textData = financeReports(year,qtr,maxLength);
For an example showing how to process this data, see Generate Domain Specific Sentiment Lexicon. |
Sentiment analysis |
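
The split-and-select step in the Shakespeare's Sonnets entry can be tried without the data file. The following is a minimal sketch on a made-up miniature string: the sample contents and the variable names parts, sample, pieces, and sonnets are hypothetical, and the layout (leading header elements, then alternating titles and indented bodies) is an assumption based on the description in the table. The replace pattern is two spaces, matching the stated indentation.

```matlab
% Hypothetical miniature text that mimics the layout described for sonnets.txt.
% The contents are invented for illustration and are not taken from the file.
parts = [
    "TITLE"
    "(header line)"
    "(header line)"
    "I"
    "  First line of sonnet one" + newline + "  Second line of sonnet one"
    "II"
    "  First line of sonnet two" + newline + "  Second line of sonnet two"
    ];

% Join the blocks with two newline characters, as in the described file layout.
sample = join(parts,[newline newline]);

% Remove the two-space indentation.
sample = replace(sample,"  ","");

% Split into blocks at each pair of newline characters.
pieces = split(sample,[newline newline]);

% Keep every second block starting from the fifth: this skips the leading
% header and title elements and the title before each later sonnet.
sonnets = pieces(5:2:end);

disp(numel(sonnets))
```

In this sketch, sonnets holds only the two sonnet bodies. If a real file has a different number of header elements, the starting index 5 would need to change accordingly.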
Related Topics
- Extract Text Data from Files
- Parse HTML and Extract Text Content
- Prepare Text Data for Analysis
- Analyze Text Data Containing Emojis
- Create Simple Text Model for Classification
- Classify Text Data Using Deep Learning
- Analyze Text Data Using Topic Models
- Analyze Sentiment in Text
- Sequence-to-Sequence Translation Using Attention
- Generate Text Using Deep Learning (Deep Learning Toolbox)