Vectorized Levenshtein distances between arrays of text labels?

Question

FM 2024-6-10

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2127236-vectorized-levenshtein-distances-between-arrays-of-text-labels

Commented: FM on 2024-6-20

I have to compare N ID labels (several thousand) to each other in order to determine which are mistypings of each other. The labels have up to 20 characters. As a first approach, I am considering calculating the N(N-1)/2 Levenshtein distances between them and using clustering to determine which labels correspond to the same ID. This is being done in Python, but none of the Levenshtein distance implementations are vectorized. That is, the NxN array of distances has to be iterated through element by element in order to calculate its values.

I thought that there might be a vectorized Matlab version of Levenshtein distance, which I could package for deployment and invocation from Python. I found a few, shown in the Annex below, as well as an "editDistance" function available in R2023b. None of these vectorize the calculation of the N(N-1)/2 distances. I'm surprised that a vectorized implementation doesn't exist. Am I missing something obvious?
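
For reference, the element-by-element computation described above can be sketched in plain Python as follows. This is a minimal sketch, not the poster's actual code: the function names `levenshtein` and `pairwise_distances` are hypothetical, and the distance uses the standard two-row dynamic-programming recurrence.

```python
from itertools import combinations


def levenshtein(s1: str, s2: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (c1 != c2)))  # substitution
        previous = current
    return previous[-1]


def pairwise_distances(labels):
    """Return {(i, j): distance} for all i < j, i.e. N(N-1)/2 entries."""
    return {(i, j): levenshtein(labels[i], labels[j])
            for i, j in combinations(range(len(labels)), 2)}
```

The dictionary has one entry per unordered pair, so its size grows quadratically with the number of labels, which is the cost the question is trying to reduce.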

Annex: Matlab implementations of Levenshtein distance

2 Comments

Stephen23 2024-6-11

"Am I missing something obvious?"

The lack of an easily vectorizable algorithm.

FM 2024-6-11

OK, so that confirms that I haven't overlooked anything in my search. Thanks!
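
One partial workaround, offered here as a sketch rather than as something proposed in the thread: the recurrence for a single pair of strings is inherently sequential, but the same recurrence can be evaluated for many pairs at once by adding a batch dimension in NumPy. The helper names `encode` and `batched_levenshtein` are hypothetical, and every label is assumed to fit within a fixed width (20 characters in the question).

```python
import numpy as np


def encode(labels, width):
    """Pack labels into a zero-padded (len(labels), width) array of code points."""
    codes = np.zeros((len(labels), width), dtype=np.int32)
    lengths = np.zeros(len(labels), dtype=np.int64)
    for k, s in enumerate(labels):
        cp = np.array([ord(c) for c in s], dtype=np.int32)
        codes[k, :cp.size] = cp  # assumes len(s) <= width
        lengths[k] = cp.size
    return codes, lengths


def batched_levenshtein(a, len_a, b, len_b):
    """Distances between a[p] and b[p] for all p, one DP table per pair."""
    n_pairs, width_a = a.shape
    width_b = b.shape[1]
    D = np.zeros((n_pairs, width_a + 1, width_b + 1), dtype=np.int32)
    D[:, :, 0] = np.arange(width_a + 1)
    D[:, 0, :] = np.arange(width_b + 1)
    for i in range(1, width_a + 1):
        for j in range(1, width_b + 1):
            cost = a[:, i - 1] != b[:, j - 1]
            D[:, i, j] = np.minimum(
                np.minimum(D[:, i - 1, j] + 1, D[:, i, j - 1] + 1),
                D[:, i - 1, j - 1] + cost,
            )
    # Padding only affects cells beyond each pair's true lengths, so the
    # answer for pair p is read at (len_a[p], len_b[p]).
    return D[np.arange(n_pairs), len_a, len_b]
```

The Python-level loop runs over string positions only (at most 20 x 20 steps), while each step processes the whole batch of pairs. Memory grows with the batch size, so in practice the pairs would have to be processed in chunks.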


Answer 1

Nipun 2024-6-13

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2127236-vectorized-levenshtein-distances-between-arrays-of-text-labels#answer_1471346

Hi FM,

I understand that you want to compute the Levenshtein distances between several thousand ID labels in a vectorized manner using MATLAB and then interface it with Python.

Here is a MATLAB code that utilizes parallel processing to compute the Levenshtein distances more efficiently. This approach is not fully vectorized due to the nature of the algorithm, but it leverages MATLAB's parallel computing capabilities to speed up the computation.

function distanceMatrix = computeLevenshteinDistances(labels)
    % Number of labels
    N = numel(labels);
    
    % Initialize the distance matrix
    distanceMatrix = zeros(N, N);
    
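    % Parallel loop over rows. Each iteration fills a temporary row so that
    % distanceMatrix is indexed in one consistent form, as parfor requires
    % for sliced output variables.
    parfor i = 1:N
        row = zeros(1, N);
        for j = i+1:N
            row(j) = levenshtein(labels{i}, labels{j});
        end
        distanceMatrix(i, :) = row;
    end

    % Mirror the upper triangle into the lower triangle (symmetric matrix)
    distanceMatrix = distanceMatrix + distanceMatrix.';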
end
function d = levenshtein(s1, s2)
    % Calculate the Levenshtein distance between two strings
    m = length(s1);
    n = length(s2);
    
    % Initialize the distance matrix
    D = zeros(m+1, n+1);
    for i = 1:m+1
        D(i, 1) = i-1;
    end
    for j = 1:n+1
        D(1, j) = j-1;
    end
    
    % Compute the distances
    for i = 2:m+1
        for j = 2:n+1
            cost = (s1(i-1) ~= s2(j-1));
            D(i, j) = min([D(i-1, j) + 1, ...     % Deletion
                           D(i, j-1) + 1, ...     % Insertion
                           D(i-1, j-1) + cost]);  % Substitution
        end
    end
    
    d = D(m+1, n+1);
end

To integrate this MATLAB function with Python, you can use the MATLAB Engine API for Python. Here's an example of how to call the MATLAB function from Python:

import matlab.engine
import numpy as np
# Start MATLAB engine
eng = matlab.engine.start_matlab()
# Example labels
labels = ['label1', 'label2', 'label3', 'labelN']
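# A Python list of str is converted to a MATLAB cell array automatically
# when it is passed to the engine, so no explicit conversion is needed.
# computeLevenshteinDistances.m must be on the MATLAB path.
# Call the MATLAB function
distance_matrix = eng.computeLevenshteinDistances(labels)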
# Convert the result to a numpy array
distance_matrix = np.array(distance_matrix)
# Stop MATLAB engine
eng.quit()
# Print the distance matrix
print(distance_matrix)

To summarize:

The MATLAB function computeLevenshteinDistances computes the Levenshtein distances between all pairs of labels using parallel processing.
The levenshtein function calculates the distance between two strings.
The Python script uses the MATLAB Engine API to call the MATLAB function and retrieve the distance matrix.

By using MATLAB's parallel computing capabilities, you can significantly speed up the computation of the Levenshtein distances for a large number of labels. I recommend using "parfor" instead of "for" loops to leverage parallel computation in MATLAB.

Refer to the following MathWorks documentation for more information on "parfor" in MATLAB: https://www.mathworks.com/help/parallel-computing/parfor.html

Hope this helps.

Regards,

Nipun

3 Comments

FM 2024-6-13

Edited: FM on 2024-6-14

Thanks, Nipun. This is my initial exposure to the Parallel Computing Toolbox. I think it could be useful in the future. For this particular case, it might not be the right way to go.

I was initially hoping that Matlab's JIT compilation would allow it to bust through a bottleneck of 1.5 hours in my Python implementation. My tests of an unparallelized Matlab implementation show that it would take 70 hours ( https://www.mathworks.com/matlabcentral/answers/2127991 ). For a typical machine with (spitballing) 5 cores, and assuming 2 threads/core, we can estimate an upper bound on the parallelization speedup of 2x5=10 times, i.e., about 7 hours at best. That is not enough to bust through my current bottleneck.

I use the above ballparking calculations because I can't use the "conventional" way of estimating the time it would take. The "conventional" way is just to use a variable to count the number of Levenshtein distance calculations, then print this out alongside the time stamp. Using values from the first few minutes of execution, I can project out to 70 hours for the 450+ million calculations required. Unfortunately, however, the serial counting of calculations doesn't make sense when the outer loop is parallelized. A much more complex means is needed to estimate the total wall clock time, but it doesn't seem justified because the ballparking calculation above is sufficiently telling.
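
The ballpark arithmetic above can be written down as a small sketch. It assumes a constant rate of distance calculations and an ideal speedup equal to the number of workers, both of which are simplifications; the function names are hypothetical.

```python
def total_pairs(n_labels: int) -> int:
    """Number of unordered label pairs, N(N-1)/2."""
    return n_labels * (n_labels - 1) // 2


def projected_hours(pairs_done: int, elapsed_seconds: float, n_labels: int) -> float:
    """Extrapolate total serial runtime from an early sample, assuming a constant rate."""
    rate = pairs_done / elapsed_seconds  # pairs per second so far
    return total_pairs(n_labels) / rate / 3600.0


def best_case_parallel_hours(serial_hours: float, workers: int) -> float:
    """Lower bound on runtime under an ideal speedup equal to the worker count."""
    return serial_hours / workers
```

With the figures quoted above, `best_case_parallel_hours(70, 10)` gives 7 hours, which is still well above the 1.5-hour Python baseline.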

I am test-driving your levenshtein() function to see if it is faster than editDistance and will report back in this comment. From the notation, it seems to operate on cell arrays of char instead of strings per se.

Update: Replacing editDistance() with your levenshtein() reduces the estimated runtime to 2.2 hours for an unparallelized implementation. Parallelizing might knock that down by up to an order of magnitude on a non-specialized machine. This definitely makes the solution worth considering. There are challenges with progress monitoring. There are shared utilities for parfor progress monitoring, which could take some time to figure out. If they simply show the percentage of outer loops completed, that would not be ideal because the number of inner-loop iterations varies widely. An added unknown is that I'm not yet sure how compatible the use of such 3rd party software components is with our computing restrictions. If it was just Matlab source code, that would be a non-issue.

I still need to test-drive the use of the MATLAB Engine API for Python. For now, I will let the unparallelized implementation finish, confirm that the Levenshtein distances match those from Python (afternote: the results match), and actually test-drive the clustering algorithm being fed by the distance calculations before trying to find faster ways to generate them.
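
The concern about uneven inner loops can be addressed by weighting each completed outer iteration by the number of pairs it covers. A minimal sketch, assuming 0-based row indices and that the set of completed rows is available from whatever progress mechanism is used (both are assumptions, and the function names are hypothetical):

```python
def pairs_in_row(i: int, n: int) -> int:
    """Pairs (i, j) with i < j < n handled by outer iteration i (0-based)."""
    return n - 1 - i


def progress_fraction(completed_rows, n: int) -> float:
    """Fraction of all N(N-1)/2 pairs covered by the completed outer iterations."""
    total = n * (n - 1) // 2
    done = sum(pairs_in_row(i, n) for i in completed_rows)
    return done / total
```

Because early rows contain many more pairs than late rows, this fraction tracks the remaining work more closely than a plain count of finished outer iterations.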

Paul 2024-6-13

editDistance uses both recursion and arrayfun. It basically calls itself in a loop and validates the input strings on each call, which really only needs to be done once, not to mention the overhead of arrayfun (which I think has been discussed elsewhere on this forum) and all of the other checks it runs each time it's called. Maybe all of that overhead contributes to its slowdown by a factor of 70/2.2.

FM 2024-6-14

Yes, it does seem that editDistance is for interactive one-offs rather than large data sets en masse. I'm new enough to this area that I'm not sure what the use case for that is, at least in a programmatic context.

Answer 2

Christopher Creutzig 2024-6-14

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2127236-vectorized-levenshtein-distances-between-arrays-of-text-labels#answer_1471991

For this application, I recommend using editDistanceSearcher: You probably don't want to look at clusters larger than some size anyway, and knnsearch on editDistanceSearcher will give you the neighbors of interest. It performs precomputation, trading memory for faster amortized time, as long as you call it often enough on the same dataset, which sounds like exactly what you want to do.

7 Comments (5 earlier comments hidden)

Christopher Creutzig 2024-6-19

There are many ways to define clusters. What exactly is the one you are looking for?

If your clusters are defined by “groups of words separated by edit distance,” i.e., you regard all those as a cluster where you have stepping stones to get from A to B without making any steps with a Levenshtein distance of, say, 4 or more, then knowing all the neighbors of your words is all you need.

That is not data you would put into kmeans or kmedoids, of course. Such neighboring information transforms the clustering into a graph theoretical problem. You'd set up an undirected graph with the neighborhood information you have and ask for connected components (or block-cut components to avoid spurious connections from outliers).
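
The connected-components idea can be sketched with a small union-find structure. This is an illustration under assumptions, not the method used in MATLAB: `connected_components` is a hypothetical name, labels are referred to by integer index, and `edges` is assumed to hold the index pairs whose edit distance is below the chosen threshold. Block-cut components are not covered.

```python
def connected_components(n_items, edges):
    """Group item indices 0..n_items-1 into connected components (union-find)."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())
```

For example, `connected_components(5, [(0, 1), (1, 2), (3, 4)])` groups labels 0, 1 and 2 together even though 0 and 2 are not direct neighbors, which is the "stepping stones" behavior described above.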

FM 2024-6-20

This is my first foray into clustering, so it may be naive. My plan, however, was to get a 2D matrix of all-pairs distances and feed that into HDBSCAN. I've only read about HDBSCAN, so as I try this, I will become more familiar with it. Of course, the O(N^2) calculation of all-pairs distances may make this infeasible, depending on the number of text labels and whether the distance calculation is vectorized or parallelized.

Answer 3

FM 2024-6-17

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/2127236-vectorized-levenshtein-distances-between-arrays-of-text-labels#answer_1472986

Edited: FM on 2024-6-18

@Stephen23 provided the technically correct answer, while @Nipun provided an extremely helpful alternative to explore, though I haven't yet decided whether to adopt it. I feel that both responses are useful answers, but only one can be accepted. Is there any MATLAB forum guidance on what should be formally designated as the answer, i.e., whether it should be the one that technically answers the question (which is still valuable) or the one that contributes a possible alternative solution?
