Vectorized Levenshtein distances between arrays of text labels?

6 次查看(过去 30 天)
I have to compare "N" ID labels (several thousand) to each other in order to determine which are mistypings of each other. The labels have up to 20 characters. Preliminarily, I am considering the calculation of the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. It is being done in Python, but none of the Levenshtein distance implementations are vectorized. That is, the NxN array of distances have to be iterated through on an element-by-element basis in order to calculate their values.
I thought that there might be a vectorized Matlab version of Levenshtein distance, which I could package for deployment and invocation from Python. I found the a few shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of N(N-2)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
Annex: Matlab implementations of Levenshtein distance
  2 个评论
Stephen23
Stephen23 2024-6-11
"Am I missing something obvious?"
The lack of an easily vectorizable algorithm.
FM
FM 2024-6-11
OK, so that confirms that I haven't overlooked anything in my search. Thanks!

请先登录,再进行评论。

采纳的回答

Nipun
Nipun 2024-6-13
Hi FM,
I understand that you want to compute the Levenshtein distances between several thousand ID labels in a vectorized manner using MATLAB and then interface it with Python.
Here is a MATLAB code that utilizes parallel processing to compute the Levenshtein distances more efficiently. This approach is not fully vectorized due to the nature of the algorithm, but it leverages MATLAB's parallel computing capabilities to speed up the computation.
function distanceMatrix = computeLevenshteinDistances(labels)
% Number of labels
N = numel(labels);
% Initialize the distance matrix
distanceMatrix = zeros(N, N);
% Parallel loop to compute Levenshtein distances
parfor i = 1:N
for j = i+1:N
distanceMatrix(i, j) = levenshtein(labels{i}, labels{j});
distanceMatrix(j, i) = distanceMatrix(i, j); % Symmetric matrix
end
end
end
function d = levenshtein(s1, s2)
% Calculate the Levenshtein distance between two strings
m = length(s1);
n = length(s2);
% Initialize the distance matrix
D = zeros(m+1, n+1);
for i = 1:m+1
D(i, 1) = i-1;
end
for j = 1:n+1
D(1, j) = j-1;
end
% Compute the distances
for i = 2:m+1
for j = 2:n+1
cost = (s1(i-1) ~= s2(j-1));
D(i, j) = min([D(i-1, j) + 1, ... % Deletion
D(i, j-1) + 1, ... % Insertion
D(i-1, j-1) + cost]); % Substitution
end
end
d = D(m+1, n+1);
end
To integrate this MATLAB function with Python, you can use the MATLAB Engine API for Python. Here's an example of how to call the MATLAB function from Python:
import matlab.engine
import numpy as np
# Start MATLAB engine
eng = matlab.engine.start_matlab()
# Example labels
labels = ['label1', 'label2', 'label3', 'labelN']
# Convert Python list to MATLAB cell array
matlab_labels = matlab.cell.array(labels)
# Call the MATLAB function
distance_matrix = eng.computeLevenshteinDistances(matlab_labels)
# Convert the result to a numpy array
distance_matrix = np.array(distance_matrix)
# Stop MATLAB engine
eng.quit()
# Print the distance matrix
print(distance_matrix)
To summarize:
  1. The MATLAB function computeLevenshteinDistances computes the Levenshtein distances between all pairs of labels using parallel processing.
  2. The levenshtein function calculates the distance between two strings.
  3. The Python script uses the MATLAB Engine API to call the MATLAB function and retrieve the distance matrix.
By using MATLAB's parallel computing capabilities, you can significantly speed up the computation of the Levenshtein distances for a large number of labels. I recommend using "parfor" instead of "for" loops to leverage parallel computation in MATLAB.
Refer to the following MathWorks documentation for more information on "parfor" in MATLAB: https://www.mathworks.com/help/parallel-computing/parfor.html
Hope this helps.
Regards,
Nipun
  3 个评论
Paul
Paul 2024-6-13
editDistance uses both recursion and arrayfun. It basically calls itself in a loop and on each call is validating the input strings, which really only needs to be done once, not to mention the overhead of arrayfun (which I think has been discussed elseswhere on this forum), and all of the other checks it runs each time it's called. Maybe all of that overhead contributes to its slowdown by a factor of 70/2.2.
FM
FM 2024-6-14
Yes, it does seem that editDistance is for interactive one-offs rather than large data sets en masse. I'm new enough to this area that I'm not sure what the use case for that is, at least in a programmatic context.

请先登录,再进行评论。

更多回答(2 个)

Christopher Creutzig
For this application, I recommend using editDistanceSearcher: You probably don't want to look at clusters larger than some size anyway, and knnsearch on editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for being faster in amortized time, as long as you call it often enough on the same dataset, which it sounds is exactly what you want to do.
  7 个评论
Christopher Creutzig
There are many ways to define clusters. What exactly is the one you are looking for?
If your clusters are defined by “groups of words separated by edit distance,” i.e., you regard all those as a cluster where you have stepping stones to get form A to B without making any steps with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.
That is not data you would put into kmeans or kmedoids, of course. Such neighboring information transforms the clustering into a graph theoretical problem. You'd set up an undirected graph with the neighborhood information you have and ask for connected components (or block-cut components to avoid spurious connections from outliers).
FM
FM 2024-6-20
This is my first foray into clustering, so it may be naive. My plan, however, was to get a 2D matrix of all-pairs distances and feed that into HDBSCAN. I've only read about HDBSCAN, so as I try this, I will become more familiar with it. Of course, the O(N^2) calculation of all-pairs distances may make this infeasible, depending on the number of text labels and whether the distance calculation is vectorized or parallelized.

请先登录,再进行评论。


FM
FM 2024-6-17
编辑:FM 2024-6-18
@Stephen23 provided the technically correct answer, though @Nipun's provided an extremely helpful alternative to explore though I haven't yet decided to adopt this alternative. I feel that both responses are useful answers, but only one can be accepted. Is there any MATLAB forum guidance on what should be formally designated as the answer, i.e., whether it is decided by which tecnically answers the question (which is still valuable) vs. which contributes possibly alternative solutions?

产品


版本

R2023b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by