
Word-By-Word Text Generation Using Deep Learning

This example shows how to train a deep learning LSTM network to generate text word-by-word.

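To train a deep learning network for word-by-word text generation, train a sequence-to-sequence LSTM network to predict the next word in a sequence of words. To train the network to predict the next word, specify the responses to be the input sequences shifted by one time step.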
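
As a minimal illustration of the shifted-response idea, the sketch below builds one predictor/response pair from a short, made-up word sequence. The token names mirror the "startOfText" and "EndOfText" strings used later in this example; the sample sentence and the variable names X and Y are illustrative assumptions and do not reproduce the internals of the datastore.

```matlab
% Hypothetical tokenized sentence (illustration only).
words = ["Alice" "was" "beginning" "to" "get" "very" "tired"];

% Predictors start with a start-of-text token. Responses are the same
% words shifted one time step ahead and end with an end-of-text token.
X = ["startOfText" words];
Y = [words "EndOfText"];

% At every time step t, the target Y(t) is the word that follows X(t).
pairs = [X; Y]'
```

Each row of pairs holds the input word and the word the network is asked to predict at that time step.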

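This example reads text from a website. It reads and parses the HTML code to extract the relevant text, then uses a custom mini-batch datastore documentGenerationDatastore to input the documents to the network as mini-batches of sequence data. The datastore converts documents to sequences of numeric word indices. The deep learning network is an LSTM network that contains a word embedding layer.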

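A mini-batch datastore is an implementation of a datastore with support for reading data in batches. You can use a mini-batch datastore as a source of training, validation, test, and prediction data sets for deep learning applications. Use mini-batch datastores to read out-of-memory data or to perform specific preprocessing operations when reading batches of data.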

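You can adapt the custom mini-batch datastore documentGenerationDatastore.m to your data by customizing its functions. For an example showing how to create your own custom mini-batch datastore, see Develop Custom Mini-Batch Datastore.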

Load Training Data

Load the training data. Read the HTML code from Alice's Adventures in Wonderland by Lewis Carroll from Project Gutenberg.

url = "https://www.gutenberg.org/files/11/11-h/11-h.htm";
code = webread(url);

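Parse HTML Code

The HTML code contains the relevant text inside <p> (paragraph) elements. Extract the relevant text by parsing the HTML code using htmlTree and then finding all the elements with element name "p".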

tree = htmlTree(code);
selector = "p";
subtrees = findElement(tree,selector);
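
To see what the selector picks up without downloading anything, the following sketch runs the same calls on a small inline HTML string. The HTML content is made up for illustration. The block also uses extractHTMLText (Text Analytics Toolbox) to pull the text out of the matched elements.

```matlab
% Small inline HTML document (hypothetical content, no web access needed).
code = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>";

% Parse the HTML and select every <p> element.
tree = htmlTree(code);
subtrees = findElement(tree,"p");

% Extract the text of each selected element as a string array.
str = extractHTMLText(subtrees)
```

The heading text is not part of the result because only elements named "p" match the selector.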

Extract the text data from the HTML subtrees using extractHTMLText and view the first 10 paragraphs.

textData = extractHTMLText(subtrees);
textData(1:10)
ans = 10×1 string array
    ""
    ""
    ""
    ""
    ""
    ""
    "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’ "
    "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. "
    "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I shall be late!’ (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. "
    "In another moment down went Alice after it, never once considering how in the world she was to get out again. "

Remove the empty paragraphs and view the first 10 paragraphs of the updated text data.

textData(textData == "") = [];
textData(1:10)
ans = 10×1 string array
    "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’ "
    "So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her. "
    "There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I shall be late!’ (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge. "
    "In another moment down went Alice after it, never once considering how in the world she was to get out again. "
    "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well. "
    "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled ‘ORANGE MARMALADE’, but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it. "
    "‘Well!’ thought Alice to herself, ‘after such a fall as this, I shall think nothing of tumbling down stairs! How brave they’ll all think me at home! Why, I wouldn’t say anything about it, even if I fell off the top of the house!’ (Which was very likely true.) "
    "Down, down, down. Would the fall never come to an end! ‘I wonder how many miles I’ve fallen by this time?’ she said aloud. ‘I must be getting somewhere near the centre of the earth. Let me see: that would be four thousand miles down, I think-’ (for, you see, Alice had learnt several things of this sort in her lessons in the schoolroom, and though this was not a very good opportunity for showing off her knowledge, as there was no one to listen to her, still it was good practice to say it over) ‘-yes, that’s about the right distance-but then I wonder what Latitude or Longitude I’ve got to?’ (Alice had no idea what Latitude was, or Longitude either, but thought they were nice grand words to say.) "
    "Presently she began again. ‘I wonder if I shall fall right through the earth! How funny it’ll seem to come out among the people that walk with their heads downward! The Antipathies, I think-’ (she was rather glad there was no one listening, this time, as it didn’t sound at all the right word) ‘-but I shall have to ask them what the name of the country is, you know. Please, Ma’am, is this New Zealand or Australia?’ (and she tried to curtsey as she spoke-fancy curtseying as you’re falling through the air! Do you think you could manage it?) ‘And what an ignorant little girl she’ll think me for asking! No, it’ll never do to ask: perhaps I shall see it written up somewhere.’ "
    "Down, down, down. There was nothing else to do, so Alice soon began talking again. ‘Dinah’ll miss me very much to-night, I should think!’ (Dinah was the cat.) ‘I hope they’ll remember her saucer of milk at tea-time. Dinah my dear! I wish you were down here with me! There are no mice in the air, I’m afraid, but you might catch a bat, and that’s very like a mouse, you know. But do cats eat bats, I wonder?’ And here Alice began to get rather sleepy, and went on saying to herself, in a dreamy sort of way, ‘Do cats eat bats? Do cats eat bats?’ and sometimes, ‘Do bats eat cats?’ for, you see, as she couldn’t answer either question, it didn’t much matter which way she put it. She felt that she was dozing off, and had just begun to dream that she was walking hand in hand with Dinah, and saying to her very earnestly, ‘Now, Dinah, tell me the truth: did you ever eat a bat?’ when suddenly, thump! thump! down she came upon a heap of sticks and dry leaves, and the fall was over. "

Visualize the text data in a word cloud.

figure
wordcloud(textData);
title("Alice's Adventures in Wonderland")

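Prepare Data for Training

Create a datastore that contains the data for training using documentGenerationDatastore. To create the datastore, first save the custom mini-batch datastore documentGenerationDatastore.m to the path. For the predictors, this datastore converts the documents into sequences of word indices using a word encoding. The first word index for each document corresponds to a "start of text" token. The "start of text" token is given by the string "startOfText". For the responses, the datastore returns categorical sequences of the words shifted by one.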

Tokenize the text data using tokenizedDocument.

documents = tokenizedDocument(textData);

Create a document generation datastore using the tokenized documents.

ds = documentGenerationDatastore(documents);
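
The word encoding step can be tried in isolation. The sketch below uses the Text Analytics Toolbox functions wordEncoding, word2ind, ind2word, and doc2sequence on two made-up documents. It shows only the word-to-index mapping. The start-of-text token and the shifted responses described above are added by the custom datastore and are not reproduced here.

```matlab
% Two tiny documents (hypothetical text, illustration only).
documents = tokenizedDocument(["the cat sat"; "the dog sat down"]);

% Build a word encoding that maps each vocabulary word to an index.
enc = wordEncoding(documents);

% Look up a word index and map it back to the word.
idx = word2ind(enc,"cat");
word = ind2word(enc,idx);

% Convert the documents to sequences of word indices without padding.
sequences = doc2sequence(enc,documents,'PaddingDirection','none');
```

Each cell of sequences contains one row vector of word indices, one index per token in the corresponding document.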

To reduce the amount of padding added to the sequences, sort the documents in the datastore by sequence length.

ds = sort(ds);
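
A rough way to see why sorting helps: when each mini-batch is padded to the length of its longest sequence, mixing short and long sequences wastes many time steps. The sketch below counts padded time steps for a set of made-up sequence lengths, first in their original order and then sorted. The lengths and the mini-batch size of 4 are assumptions chosen for illustration.

```matlab
% Hypothetical sequence lengths and mini-batch size (illustration only).
sequenceLengths = [5 40 7 38 6 42 8 39];
miniBatchSize = 4;

% Compare the original order against the sorted order.
orders = {sequenceLengths, sort(sequenceLengths)};
padding = zeros(1,2);

for k = 1:2
    lengths = orders{k};
    for i = 1:miniBatchSize:numel(lengths)
        % Sequences in this mini-batch.
        batch = lengths(i:min(i+miniBatchSize-1,numel(lengths)));

        % Each sequence is padded up to the longest one in the batch.
        padding(k) = padding(k) + sum(max(batch) - batch);
    end
end

padding
```

For these lengths, the unsorted order needs 143 padded time steps and the sorted order needs 15.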

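Create and Train LSTM Network

Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to 1. Next, include a word embedding layer of dimension 100 and the same number of words as the word encoding. Next, include an LSTM layer with 100 hidden units, followed by a dropout layer with dropout probability 0.2. Finally, add a fully connected layer with the same size as the number of classes, a softmax layer, and a classification layer. The number of classes is the number of words in the vocabulary plus an extra class for the "end of text" class.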

inputSize = 1;
embeddingDimension = 100;
numWords = numel(ds.Encoding.Vocabulary);
numClasses = numWords + 1;

layers = [ 
    sequenceInputLayer(inputSize)
    wordEmbeddingLayer(embeddingDimension,numWords)
    lstmLayer(100)
    dropoutLayer(0.2)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
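
Conceptually, a word embedding layer behaves like a learnable lookup table: each word index selects one column of a weight matrix. The base MATLAB sketch below illustrates this with random weights standing in for learned ones. It is an illustration of the idea under that assumption, not the implementation of wordEmbeddingLayer, and the small sizes are chosen only to keep the output readable.

```matlab
% Hypothetical sizes (the network above uses dimension 100).
embeddingDimension = 3;
numWords = 5;

% Random weights standing in for learned embeddings, one column per word.
rng(0)
weights = randn(embeddingDimension,numWords);

% A sequence of word indices, as produced by a word encoding.
sequence = [1 4 2];

% Looking up the indices gives one embedding vector per time step.
embedded = weights(:,sequence);
size(embedded)
```

The result has one row per embedding dimension and one column per time step, which is the form of data the LSTM layer then processes.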

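Specify the training options. Specify the solver to be 'adam'. Train for 300 epochs with learn rate 0.01. Set the mini-batch size to 32. To keep the data sorted by sequence length, set the 'Shuffle' option to 'never'. To monitor the training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set 'Verbose' to false.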

options = trainingOptions('adam', ...
    'MaxEpochs',300, ...
    'InitialLearnRate',0.01, ...
    'MiniBatchSize',32, ...
    'Shuffle','never', ...
    'Plots','training-progress', ...
    'Verbose',false);

Train the network using trainNetwork.

net = trainNetwork(ds,layers,options);

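Generate New Text

Generate the first word of the text by sampling a word from a probability distribution according to the first words of the text in the training data. Generate the remaining words by using the trained LSTM network to predict the next time step using the current sequence of generated text. Keep generating words one-by-one until the network predicts the "end of text" word.

To make the first prediction using the network, input the index that represents the "start of text" token. Find the index by using the word2ind function with the word encoding used by the document datastore.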

enc = ds.Encoding;
wordIndex = word2ind(enc,"startOfText")
wordIndex = 1

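For the remaining predictions, sample the next word according to the prediction scores of the network. The prediction scores represent the probability distribution of the next word. Sample the words from the vocabulary given by the class names of the output layer of the network.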

vocabulary = string(net.Layers(end).Classes);

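Make predictions word by word using predictAndUpdateState. For each prediction, input the index of the previous word. Stop predicting when the network predicts the end-of-text word or when the generated text reaches 500 characters. For large collections of data, long sequences, or large networks, predictions on the GPU are usually faster to compute than predictions on the CPU. Otherwise, predictions on the CPU are usually faster to compute. For single time step predictions, use the CPU. To use the CPU for prediction, set the 'ExecutionEnvironment' option of predictAndUpdateState to 'cpu'.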

generatedText = "";
maxLength = 500;
while strlength(generatedText) < maxLength
    % Predict the next word scores.
    [net,wordScores] = predictAndUpdateState(net,wordIndex,'ExecutionEnvironment','cpu');
    
    % Sample the next word.
    newWord = datasample(vocabulary,1,'Weights',wordScores);
    
    % Stop predicting at the end of text.
    if newWord == "EndOfText"
        break
    end
    
    % Add the word to the generated text.
    generatedText = generatedText + " " + newWord;
    
    % Find the word index for the next input.
    wordIndex = word2ind(enc,newWord);
end
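
Sampling with the 'Weights' option draws each word with probability proportional to its score. The same idea can be sketched in base MATLAB by comparing a uniform random number against the cumulative sum of the scores. The vocabulary and scores below are made up for illustration, and this sketch is a stand-in for the weighted sampling step, not a description of how datasample is implemented.

```matlab
% Hypothetical vocabulary and prediction scores (illustration only).
vocabulary = ["the" "cat" "sat" "EndOfText"];
wordScores = [0.5 0.2 0.2 0.1];

rng(0)  % fix the random seed so the run is repeatable

% Cumulative distribution over the vocabulary.
cdf = cumsum(wordScores)/sum(wordScores);
cdf(end) = 1;  % guard against round-off in the last entry

% Draw many samples: pick the first word whose cumulative score
% reaches the uniform random number.
numSamples = 10000;
idx = zeros(1,numSamples);
for i = 1:numSamples
    idx(i) = find(rand <= cdf,1,'first');
end

% The empirical frequencies should be close to the scores.
frequencies = accumarray(idx',1,[numel(vocabulary) 1])'/numSamples

% A single sampled word, as used in one generation step.
newWord = vocabulary(idx(1));
```

Because the next word is sampled instead of taken as the highest-scoring class, repeated runs of the generation loop can produce different text.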

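The generation process introduces whitespace characters between each prediction, which means that some punctuation characters appear with unnecessary spaces before and after. Reconstruct the generated text by removing the spaces before and after the appropriate punctuation characters.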
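
The effect of this cleanup can be checked on a hand-written string before applying it to generated text. The input string below is made up for illustration, and the final strip call, which removes the leading space, is an optional extra step.

```matlab
% Hypothetical text with a space between every token (illustration only).
str = " ‘ Oh dear ! ’ said Alice , ( very quietly ) .";

% Remove the space before closing and sentence punctuation.
before = ["." "," "’" ")" ":" "?" "!"];
str = replace(str," " + before,before);

% Remove the space after opening punctuation.
after = ["(" "‘"];
str = replace(str,after + " ",after);

% Optionally remove leading and trailing whitespace.
str = strip(str)
```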

Remove the spaces that appear before the specified punctuation characters.

punctuationCharacters = ["." "," "’" ")" ":" "?" "!"];
generatedText = replace(generatedText," " + punctuationCharacters,punctuationCharacters);

Remove the spaces that appear after the specified punctuation characters.

punctuationCharacters = ["(" "‘"];
generatedText = replace(generatedText,punctuationCharacters + " ",punctuationCharacters)
generatedText = 
" ‘Sure, it’s a good Turtle!’ said the Queen in a low, weak voice."

To generate multiple pieces of text, reset the network state between generations using resetState.

net = resetState(net);
