Analyze German Text Data

This example shows how to import, prepare, and analyze German text data using a topic model.

German text data can be large and can contain lots of noise that negatively affects statistical analysis. For example, the text data can contain the following:

Variations in word forms. For example, „rot“, „rote“, and „roten“.
Words that add noise. For example, stop words such as „der“, „die“, and „das“.
Punctuation and special characters.

These word clouds illustrate word frequency analysis applied to some raw text data and a preprocessed version of the same text data.

This example first shows how to import and prepare German text data, and then it shows how to analyze the text data using a Latent Dirichlet Allocation (LDA) model. An LDA model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities in topics. Use these steps in preparing the text data and fitting the model:

Import the text data from a CSV file and extract the relevant data.
Prepare the text data for analysis using standard preprocessing techniques.
Fit a topic model and visualize the results.

Import Data

Load the example data "FabrikBerichte.csv". The data contains factory reports, including a text description and categorical labels for each event in German.

Read the data using the readtable function. Set the text type to string. One of the column headers includes an umlaut ("Auflösung"). To keep this name, set VariableNamingRule to "preserve", otherwise MATLAB issues a warning and removes the umlaut from the column header.

filename = "FabrikBerichte.csv";
data = readtable(filename,TextType="string",VariableNamingRule="preserve");
head(data)

                                       Beschreibung                                             Kategorie          Dringlichkeit                Auflösung                 Kosten
    __________________________________________________________________________________    _____________________    _____________    __________________________________    ______

    "Im Inneren des Mixers ist ein lautes Rasseln zu hören."                              "Mechanischer Fehler"      "Niedrig"      "Zur Beobachtungsliste hinzufügen"       72 
    "Ein paar Tropfen Flüssigkeit zeigen sich unter dem Konstruktionsaggregat."           "Leck"                     "Mittel"       "Zur Beobachtungsliste hinzufügen"       63 
    "Einige Materialien, die vom Konstruktionsaggregat erzeugt werden, sind verdreht."    "Mechanischer Fehler"      "Niedrig"      "Maschine neu einstellen"                44 
    "Eine Sicherung in der Controller-Montage ist durchgebrannt."                         "Elektrischer Fehler"      "Niedrig"      "Komponenten ersetzen"                  348 
    "Im Mischer ist eine Sicherung durchgebrannt."                                        "Elektrischer Fehler"      "Niedrig"      "Komponenten ersetzen"                  441 
    "Es ist ein schlagendes Geräusch zu hören im Zylinder Konstruktionsaggregat."         "Mechanischer Fehler"      "Mittel"       "Maschine neu einstellen"                35 
    "Der Scanner produziert seltsame elektrische Geräusche."                              "Elektrischer Fehler"      "Niedrig"      "Zur Beobachtungsliste hinzufügen"       80 
    "Abnormales Überhitzen der Sortieranlage."                                            "Leck"                     "Hoch"         "Vollständiger Ersatz"                 9000

Extract the text data from the variable Beschreibung (the description of the factory event).

textData = data.Beschreibung;

Visualize the text data in a word cloud.

figure
wordcloud(textData);

Figure contains an object of type wordcloud.

Tokenize Text Data

Create an array of tokenized documents using the tokenizedDocument function.

documents = tokenizedDocument(textData);
documents(1:10)

ans = 
  10x1 tokenizedDocument:

    11 tokens: Im Inneren des Mixers ist ein lautes Rasseln zu hören .
    10 tokens: Ein paar Tropfen Flüssigkeit zeigen sich unter dem Konstruktionsaggregat .
    12 tokens: Einige Materialien , die vom Konstruktionsaggregat erzeugt werden , sind verdreht .
     8 tokens: Eine Sicherung in der Controller-Montage ist durchgebrannt .
     7 tokens: Im Mischer ist eine Sicherung durchgebrannt .
    11 tokens: Es ist ein schlagendes Geräusch zu hören im Zylinder Konstruktionsaggregat .
     7 tokens: Der Scanner produziert seltsame elektrische Geräusche .
     5 tokens: Abnormales Überhitzen der Sortieranlage .
     6 tokens: Akustisches elektrisches Geräusch im Arbeitsagenten .
    12 tokens: Nach der Installation der Software ist keine Verbindung zum Netzwerk möglich .

Get Part-of-Speech Tags

Add the part of speech details using the addPartOfSpeechDetails function.

documents = addPartOfSpeechDetails(documents);

Get the token details and then view the details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)

      Token      DocumentNumber    SentenceNumber    LineNumber     Type      Language    PartOfSpeech
    _________    ______________    ______________    __________    _______    ________    ____________

    "In"               1                 1               1         letters       de        adposition 
    "dem"              1                 1               1         letters       de        determiner 
    "Inneren"          1                 1               1         letters       de        noun       
    "des"              1                 1               1         letters       de        determiner 
    "Mixers"           1                 1               1         letters       de        noun       
    "ist"              1                 1               1         letters       de        verb       
    "ein"              1                 1               1         letters       de        determiner 
    "lautes"           1                 1               1         letters       de        adjective

The PartOfSpeech variable in the table contains the part-of-speech tags of the tokens. Create word clouds of all the nouns and adjectives, respectively.

figure
idx = tdetails.PartOfSpeech == "noun";
tokens = tdetails.Token(idx);
subplot(1,2,1)
wordcloud(tokens);
title("Nouns")

idx = tdetails.PartOfSpeech == "adjective";
tokens = tdetails.Token(idx);
subplot(1,2,2)
wordcloud(tokens);
title("Adjectives")

Figure contains objects of type wordcloud. The chart of type wordcloud has title Nouns. The chart of type wordcloud has title Adjectives.

Prepare Text Data for Analysis

Tokenize the text using tokenizedDocument and view the first few documents.

documentsRaw = tokenizedDocument(textData);
documents = documentsRaw;
documents(1:10)

ans = 
  10x1 tokenizedDocument:

    11 tokens: Im Inneren des Mixers ist ein lautes Rasseln zu hören .
    10 tokens: Ein paar Tropfen Flüssigkeit zeigen sich unter dem Konstruktionsaggregat .
    12 tokens: Einige Materialien , die vom Konstruktionsaggregat erzeugt werden , sind verdreht .
     8 tokens: Eine Sicherung in der Controller-Montage ist durchgebrannt .
     7 tokens: Im Mischer ist eine Sicherung durchgebrannt .
    11 tokens: Es ist ein schlagendes Geräusch zu hören im Zylinder Konstruktionsaggregat .
     7 tokens: Der Scanner produziert seltsame elektrische Geräusche .
     5 tokens: Abnormales Überhitzen der Sortieranlage .
     6 tokens: Akustisches elektrisches Geräusch im Arbeitsagenten .
    12 tokens: Nach der Installation der Software ist keine Verbindung zum Netzwerk möglich .

Remove the stop words.

documents = removeStopWords(documents);
documents(1:10)

ans = 
  10x1 tokenizedDocument:

    6 tokens: Inneren Mixers lautes Rasseln hören .
    6 tokens: paar Tropfen Flüssigkeit zeigen Konstruktionsaggregat .
    8 tokens: Einige Materialien , Konstruktionsaggregat erzeugt , verdreht .
    4 tokens: Sicherung Controller-Montage durchgebrannt .
    4 tokens: Mischer Sicherung durchgebrannt .
    6 tokens: schlagendes Geräusch hören Zylinder Konstruktionsaggregat .
    6 tokens: Scanner produziert seltsame elektrische Geräusche .
    4 tokens: Abnormales Überhitzen Sortieranlage .
    5 tokens: Akustisches elektrisches Geräusch Arbeitsagenten .
    6 tokens: Installation Software Verbindung Netzwerk möglich .

Normalize the text using the normalizeWords function.

documents = normalizeWords(documents);
documents(1:10)

ans = 
  10x1 tokenizedDocument:

    6 tokens: inn mix laut rasseln hor .
    6 tokens: paar tropf flussig zeig konstruktionsaggregat .
    8 tokens: einig materiali , konstruktionsaggregat erzeugt , verdreht .
    4 tokens: sicher controller-montag durchgebrannt .
    4 tokens: misch sicher durchgebrannt .
    6 tokens: schlagend gerausch hor zylind konstruktionsaggregat .
    6 tokens: scann produziert seltsam elektr gerausch .
    4 tokens: abnormal uberhitz sortieranlag .
    5 tokens: akust elektr gerausch arbeitsagent .
    6 tokens: installation softwar verbind netzwerk moglich .

Erase the punctuation using the erasePunctuation function.

documents = erasePunctuation(documents);
documents(1:10)

ans = 
  10x1 tokenizedDocument:

    5 tokens: inn mix laut rasseln hor
    5 tokens: paar tropf flussig zeig konstruktionsaggregat
    5 tokens: einig materiali konstruktionsaggregat erzeugt verdreht
    3 tokens: sicher controllermontag durchgebrannt
    3 tokens: misch sicher durchgebrannt
    5 tokens: schlagend gerausch hor zylind konstruktionsaggregat
    5 tokens: scann produziert seltsam elektr gerausch
    3 tokens: abnormal uberhitz sortieranlag
    4 tokens: akust elektr gerausch arbeitsagent
    5 tokens: installation softwar verbind netzwerk moglich

Visualize the raw and cleaned data in word clouds.

figure
subplot(1,2,1)
wordcloud(documentsRaw);
title("Raw Data")

subplot(1,2,2)
wordcloud(documents);
title("Cleaned Data")

Figure contains objects of type wordcloud. The chart of type wordcloud has title Raw Data. The chart of type wordcloud has title Cleaned Data.

Create Preprocessing Function

Creating a function that performs preprocessing can be useful to prepare different collections of text data in the same way. For example, you can use a function to preprocess new data using the same steps as the training data.

Create a function which tokenizes and preprocesses the text data to use for analysis. The function preprocessGermanText, listed at the end of the example, performs these steps:

Tokenize the text using tokenizedDocument.
Remove a list of stop words (such as „der“, „die“, and „das“) using removeStopWords.
Normalize the words using normalizeWords.
Erase punctuation using erasePunctuation.

Remove the empty documents after preprocessing using the removeEmptyDocuments function. If you have associated data stored in a separate array, such as timestamps or labels, then also remove the corresponding elements. To get indices of the elements to remove, also return the indices of the removed documents.

In this example, use the preprocessing function preprocessGermanText, listed at the end of the example, to prepare the text data.

documents = preprocessGermanText(textData);
documents(1:5)

ans = 
  5x1 tokenizedDocument:

    5 tokens: inn mix laut rasseln hor
    5 tokens: paar tropf flussig zeig konstruktionsaggregat
    5 tokens: einig materiali konstruktionsaggregat erzeugt verdreht
    3 tokens: sicher controllermontag durchgebrannt
    3 tokens: misch sicher durchgebrannt

Remove the empty documents using the removeEmptyDocuments function.

documents = removeEmptyDocuments(documents);

Fit Topic Model

Fit a latent Dirichlet allocation (LDA) topic model to the data. An LDA model discovers underlying topics in a collection of documents and infers word probabilities in topics.

To fit an LDA model to the data, you first must create a bag-of-words model. A bag-of-words model (also known as a term-frequency counter) records the number of times that words appear in each document of a collection. Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents);

Remove the empty documents from the bag-of-words model.

bag = removeEmptyDocuments(bag);

Fit an LDA model with seven topics using fitlda. To suppress the verbose output, set 'Verbose' to 0.

numTopics = 7;
mdl = fitlda(bag,numTopics,Verbose=0);

Visualize the first four topics using word clouds.

figure
for i = 1:4
    subplot(2,2,i)
    wordcloud(mdl,i);
    title("Topic " + i)
end

Figure contains objects of type wordcloud. The chart of type wordcloud has title Topic 1. The chart of type wordcloud has title Topic 2. The chart of type wordcloud has title Topic 3. The chart of type wordcloud has title Topic 4.

Visualize multiple topic mixtures using stacked bar charts. View five input documents at random and visualize the corresponding topic mixtures.

numDocuments = numel(documents);
idx = randperm(numDocuments,5);
documents(idx)

ans = 
  5x1 tokenizedDocument:

    4 tokens: seltsam rasseln intern misch
    4 tokens: ausstoss konstruktionsaggregat sammelt flussig
    3 tokens: abnormal uberhitz sortieranlag
    4 tokens: roboterarm gibt schwarz rauch
    3 tokens: misch schaltet manchmal

topicMixtures = transform(mdl,documents(idx));

figure
barh(topicMixtures(1:5,:),"stacked")
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend("Topic " + string(1:numTopics),Location="northeastoutside")

Figure contains an axes object. The axes object with title Topic Mixtures, xlabel Topic Probability, ylabel Document contains 7 objects of type bar. These objects represent Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7.

Example Preprocessing Function

The function preprocessGermanText performs these steps:

Tokenize the text using tokenizedDocument.
Remove a list of stop words (such as „der“, „die“, and „das“) using removeStopWords.
Normalize the words using normalizeWords.
Erase punctuation using erasePunctuation.

function documents = preprocessGermanText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Normalize the words.
documents = normalizeWords(documents);

% Erase the punctuation.
documents = erasePunctuation(documents);

end