How do I reduce categorical array containing alternate spellings and formats?

5 次查看(过去 30 天)
I bet this problem comes up all the time:
I have a categorical array of length ~ 10,000 and typical string length ~ 10 to 20. Accounting for alternative spellings and formatting (etc.), I believe it actually contains only ~ 100 unique categories (a reduction on the order of ~100). What is the best way to achieve this reduction?
I am considering using the file exchange function EditDist with a distance of around 1 to 5 but I am open to other ideas. One issue I am considering is how to programmatically pick a "version" of each categorical with the least distance to all its other versions (the "average" version of each categorical).
What do you all recommend? Can I get it done in a single for loop or are nested forloops required? Is there a pre-existing solution? Thanks!
  5 个评论
Chris Hooper
Chris Hooper 2024-9-18
@Walter Roberson it's a good point about edit distance. Wonder if there is a better alternative, or if keeping the edit distance small is good enough in my case.
I'll have a look at the toolbox you suggested, thanks!
Walter Roberson
Walter Roberson 2024-9-18
The edit distance between pork and fork is 1. There is no smaller edit distance available.
It is true that you could measure the distance shifted in the alphabet... but that would tell you that fork is pretty similar to cork. You need to decide whether it is reasonable to replace
Eat your pork with a fork.
with
Eat your cork with a cork.

请先登录,再进行评论。

回答(2 个)

Peter Perkins
Peter Perkins 2024-9-18
Chris, something puzzles me. You say,
"I have a categorical array of length ~ 10,000 and typical string length ~ 10 to 20. Accounting for alternative spellings and formatting (etc.), I believe it actually contains only ~ 100 unique categories (a reduction on the order of ~100)."
100*100 is 10000, so are you saying you have a categorical array whose length is 10K and that has 10K categories? That can't be right. Maybe you have a string array whose length is 10K and that will have 100 categories after cleaning it up?
In any case, I can't help much with identifying which misspelled/whatever categories are equivalent to each other, but cleaning up this sort of thing is what renamecats, mergecats, etc. are for. You probably already know that.
  2 个评论
Chris Hooper
Chris Hooper 2024-9-18
编辑:Chris Hooper 2024-9-18
Hey Peter, thanks for the question hopefully I can clarify:
In the example I have an array of strings longer than 10,000 that reduces ~10,000 unique strings (or categoricals) but I believe it can be reduced to ~100 categories.
Some suggestions I've recieved are to consider text normalization (I am doing this already to an extent by changing to lower case without punctuation), fuzzy search, and “similarity search” (described to me as relating to natural language processing – beyond what I was hoping to tackle unless it’s a packaged matlab solution).
My own naive notion was to measure edit distance between each of the n=10,000 categories to yield a n x n score matrix, then look for clusters of categories sharing low mutual edit distance. @ Walter Roberson pointed out edit distance is a potentially weak approach. I was hoping a better solution might already exist in some neatly packaged form.

请先登录,再进行评论。


Rahul
Rahul 2024-9-30
Hi,
I believe that you’re trying to cluster a categorical array, based on the EditDist” function metric as distance between two categories, and further trying to get the cluster centroid for each of the clusters, which are at minimum average distance to the other strings in the same cluster.
The problem seems to resemble a k-means clustering problem, function "clusterdata" can be extended to be used for categorical elements as strings, where the strings need to be encoded to numeric values before being passed as observation vectors.
Data Generation: can be done by initially having 100 unique categorical strings of length (say 15), and then performing random 1-5 ‘editDist’ operations 100 times to create variations in the respective cluster, which can be achieved as:
% Parameters
num_unique_categories = 100;
num_variations_per_category = 100;
total_size = num_unique_categories * num_variations_per_category;
string_length = 15; % Length of the base strings
% Function to apply random edit operations (substitution, insertion, deletion)
apply_random_edit = @(str) edit_string_randomly(str);
% Function to randomly edit a string (simulate substitution, insertion, and deletion)
function new_str = edit_string_randomly(str)
operations = {'substitute', 'insert', 'delete'}; % Possible edit operations
num_operations = randi([1, 3]); % Random number of operations to apply (1-3)
for i = 1:num_operations
operation = operations{randi([1, 3])}; % Randomly pick an operation
switch operation
case 'substitute'
% Substitution: replace a random character
pos = randi([1, length(str)]); % Random position
str(pos) = char(randi([97, 122])); % Random lowercase letter
case 'insert'
% Insertion: insert a random character at a random position
pos = randi([0, length(str)]); % Random position (0 means insert at the beginning)
new_char = char(randi([97, 122])); % Random lowercase letter
str = [str(1:pos), new_char, str(pos+1:end)];
case 'delete'
% Deletion: remove a random character
if length(str) > 1 % Ensure the string has more than one character
pos = randi([1, length(str)]); % Random position
str(pos) = []; % Remove character at the chosen position
end
end
end
new_str = str;
end
% Create base categories
base_categories = cell(1, num_unique_categories);
for i = 1:num_unique_categories
base_categories{i} = generate_random_string(string_length); % Generate 5-character base strings
end
% Generate the full categorical array with variations
categorical_array = cell(1, total_size);
idx = 1;
for i = 1:num_unique_categories
base_str = base_categories{i};
for j = 1:num_variations_per_category
num_modifications = randi([0, 5]); % Modifications in the range [0, 5]
modified_str = base_str;
for k = 1:num_modifications
% Apply random edit (substitute, insert, delete)
modified_str = apply_random_edit(modified_str);
end
categorical_array{idx} = modified_str;
idx = idx + 1;
end
end
Encoding & Clustering: Convert each string into a numerical form, such as using word embeddings (like, word2vec), or creating a bag-of-words. This representation gives a spare symmetrical distance matrix with distance between each element, to be clustered using the “kmeans” or “kmedoids” function:
% Number of categories
numCategories = length(categoriesStr);
% Initialize a distance matrix
distanceMatrix = zeros(numCategories);
% Calculate pairwise edit distances
for i = 1:numCategories
for j = i+1:numCategories
distanceMatrix(i, j) = editDistance(categoriesStr{i}, categoriesStr{j});
distanceMatrix(j, i) = distanceMatrix(i, j); % Symmetric matrix
end
end
% Threshold for unique categoricals
threshold = 5;
% Cluster categories based on the threshold
clusters = clusterdata(distanceMatrix, 'Criterion', 'distance', 'Cutoff', threshold)
A threshold criterion of 5, determines the max distance within a cluster along with the cluster centroid indices.
You can also use “kmeans” or “kmedoids” functions to group categoricals, while providing a custom distance parameter to the function for measuring distance between two elements:
[I,c] = kmeans(categoriesStr, 'distance', @distFun);
function d = distFun(s1, s2)
s = size(s2);
d = zeros(s(1), 1);
for i=1:size(s2, 1)
d(i, 1) = editDistance(s1, s2(i));
end
end
For more information regarding usage of functions and parameters mentioned above, refer to the documentations below:
  1. Create Custom Distance Metric using “kmedoids”: https://in.mathworks.com/help/stats/kmedoids.html#buioqlz-X:~:text=Custom%20distance%20function%20handle.%20A%20distance%20function%20has%20the%20form
  2. Create Clusters using “clusterData”: https://in.mathworks.com/help/stats/clusterdata.html

类别

Help CenterFile Exchange 中查找有关 Characters and Strings 的更多信息

产品


版本

R2024a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by