How to improve K-means clustering with TF-IDF?
Hi all,
I’m currently working on a project where I need to classify company segments based on their activity descriptions.
I’ve implemented K-means clustering using TF-IDF for feature extraction from the text data. However, the current clustering results aren’t entirely accurate, especially when it comes to grouping semantically similar segments (e.g., "cars" and "vehicles" are placed into separate clusters). Is it possible to optimise this, or should I use another approach rather than TF-IDF?
See cluster 13: more than 50% of the items were assigned to it. I also tried other distance metrics, but the results didn't improve.
Here is my code:
clear
close
% load and preprocess
d = readtable("segmentos95Translated.xlsx");
t = d.TRANSLATED;
% keep only the text before 'EXCEPT'
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'EXCEPT');
    t{i} = strtrim(splitStr{1});
end
% keep only the text before 'WITHOUT PREDOMINANCE'
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'WITHOUT PREDOMINANCE');
    t{i} = strtrim(splitStr{1});
end
% tokenization
t = lower(t);
t = tokenizedDocument(t);
t = removeStopWords(t);
t = normalizeWords(t);
customStopWords = ["manufactur","activ",",","rental","(",")","*","exempt"...
"commerci","repres","agent","trade","product","retail","sale","waiv","special","wholesal"];
t = removeWords(t,customStopWords);
% bag of words and TF-IDF
bag = bagOfWords(t);
tfidfMatrix = tfidf(bag);
X = full(tfidfMatrix);
% kmeans
rng(1)
numClusters = 25; % about 10%
[idx, C, sumd, D] = kmeans(X, numClusters);
d.clusters = idx;
% display results
for i = 1:numClusters
    fprintf('Cluster %d:\n', i);
    disp(d.TRANSLATED(idx == i));
end
sortrows(groupcounts(d,"clusters"),"Percent","descend")
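For example, one of the distance variants looks roughly like this (a sketch only; cosine distance is often suggested for sparse TF-IDF vectors, and 'Replicates' reruns kmeans from several random starts):
% Sketch of a kmeans variant with cosine distance
% Note: cosine is undefined for an all-zero TF-IDF row, so such documents
% would need to be removed or handled first.
rng(1)
[idxCos, Ccos] = kmeans(X, numClusters, ...
    'Distance', 'cosine', ...   % angle-based distance instead of squared Euclidean
    'Replicates', 10);          % keep the best of several random initialisations
d.clustersCos = idxCos;
sortrows(groupcounts(d,"clustersCos"),"Percent","descend")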
0 Comments
Accepted Answer
Sandeep Mishra
2024-10-8
Hi Geovane,
I can observe that you are trying to enhance the accuracy of your K-means clustering implementation.
The current implementation using 'TF-IDF' fails to capture the semantic relationships between words, which can lead to synonyms and related terms being treated as distinct.
To resolve this, you can use word embeddings such as 'fastText' which represent words in a continuous vector space, capturing semantic meanings.
You can leverage the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' add-on in MATLAB to implement 'fastText' word embedding.
Consider the following implementation:
% Convert tokenized documents to a cell array of strings
textData = arrayfun(@(doc) joinWords(doc), t, 'UniformOutput', false);
% Load the fastText word embedding
emb = fastTextWordEmbedding;
% Convert each document to the mean of its word vectors
X = zeros(numel(textData), emb.Dimension);
for i = 1:numel(textData)
    words = split(textData{i});
    validWords = words(isVocabularyWord(emb, words));
    if ~isempty(validWords)
        vecs = word2vec(emb, validWords);
        X(i, :) = mean(vecs, 1);
    end
    % documents with no in-vocabulary words keep an all-zero row
end
[idx, C] = kmeans(X, numClusters);
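You can then reuse your existing inspection code on the new assignments to check whether the large catch-all cluster shrinks (a sketch, assuming d and numClusters from your script are still in the workspace):
% Inspect the embedding-based clustering the same way as before
d.clustersEmb = idx;
sortrows(groupcounts(d,"clustersEmb"),"Percent","descend")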
Refer to the following MathWorks File Exchange page to learn more about the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' support package for MATLAB: https://www.mathworks.com/matlabcentral/fileexchange/66229-text-analytics-toolbox-model-for-fasttext-english-16-billion-token-word-embedding
I hope this helps.
4 Comments
Christopher Creutzig
2024-10-22, 5:57
Also worth checking out are documentEmbedding and, for a different workflow with “soft clustering,” fitlda.
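A minimal sketch of the fitlda route, using the bag-of-words already built above (numTopics = 25 here simply mirrors numClusters from the original script and is an assumption, not a recommendation):
% Soft clustering with an LDA topic model on the existing bag-of-words
numTopics = 25;
mdl = fitlda(bag, numTopics, 'Verbose', 0);   % fit the topic model
topicMix = transform(mdl, bag);               % per-document topic probabilities
[~, hardIdx] = max(topicMix, [], 2);          % hard assignment, if needed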
More Answers (0)