How to delete similar data from big data file?

Question

M.Shaarawy 2020-1-9

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/499680-how-to-delete-similar-data-from-big-data-file

评论： Sindar 2020-1-13

I have a big data file consists of 100000 rows and 5 columns ..and within this file there rows that are similar to each other,so I need to delete the similar ones and let one of them and delete the other similar ones ( fir example, if I have 5 rows have the same data ..I need a line code to choose the maximum value of the five rows and delete the other 4 rows and if all similar ...delete 4 and makes one only present.

6 个评论
显示 4更早的评论隐藏 4更早的评论

per isakson 2020-1-13

在 MATLAB Online 中打开

You received two good answers. However, one question: does the order of the rows in the result matter?

Your sample data is, in more readable form

%%
data = [     
        1970 30   30   4 
        1971 30.5 30.5 4.2 
        1970 31   32   4.2 
        1972 32   33   5 
        1972 32   33   5 
        1973 32.1 32.2 5 
        1974 34   35   4.5
        ];

Sindar 2020-1-13

Ah, yes, I forgot to mention that both my answers will sort by the first three columns (year, then x, then y).

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Sindar 2020-1-12

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/499680-how-to-delete-similar-data-from-big-data-file#answer_409803

在 MATLAB Online 中打开

Alright, here's my solution. Its speed depends on how often you expect duplicates (controlled in test by test_vals). I ran a test with very few duplicates (test_vals = 200, only 600/1e5 duplicates) and it took 8s. More duplicates runs faster. There are probably ways to speed up the idx search, but it seems fast enough.

% setting up some test data, skip for your real data
test_rows = 1e5;
test_vals = 5;
% make a test matrix with random integers 1-test_vals in first three columns, random number in last
data = [randi(test_vals,test_rows,3) rand(test_rows,1)];
% find duplicate data in the first three columns
% ia lists the row # in data of the first row in each duplicate
% ic lists the duplicate-group of each row in data
[~,ia,ic] = unique(data(:,1:3),'rows');
% loop over the duplicate-groups
for ind=1:length(ia)
    % skip ahead if no duplicates (comment out if you expect few unique rows)
    if nnz(ic==ind)==1
        idx(ind)=ia(ind);
    else
        % data(:,4).*(ic==ind) takes the z-column, and zeros it except for the current duplicate-group
        % then, stores the index of the max z row in idx
        [~,idx(ind)] = max(data(:,4).*(ic==ind));
    end
end
% select only the max-z rows for each duplicate-group
clean_data = data(idx,:);

3 个评论
显示 1更早的评论隐藏 1更早的评论

Sindar 2020-1-13

actually, it can't. We need the index from the whole array, not the subset that logical indexing will give you. There may be a better way than mine, but this isn't it.

per isakson 2020-1-13

编辑：per isakson 2020-1-13

在 MATLAB Online 中打开

Yes indeed, that's wrong. Isn't it even syntactically incorrect and ought to have thrown an error.

I should have written

data( ic==ind, 4 )

请先登录，再进行评论。

Answer 2

Sindar 2020-1-12

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/499680-how-to-delete-similar-data-from-big-data-file#answer_409808

编辑：per isakson 2020-1-13

在 MATLAB Online 中打开

Looking into it a bit (I have a similar problem), it seems like there is a way using the "Split-Apply-Combine Workflow". It is faster for many duplicate groups, but slower for few (for my in-memory tests), but I think it should scale much better to distributed memory and/or parallel computation.

% lists the duplicate-group of each row in data, equal to ic, above
G = findgroups(data(:,1),data(:,2),data(:,3));
% for each group, apply the maxrow function
% this takes the 4th column of data (your z's), finds the index of the max, then 
% returns that full row repeated for each duplicate-group
splitapply(@(x1) maxrow(x1(:,4),x1),data,G)
% return the row of full_data corresponding to the index of the maximum of sort_vec
function x = maxrow(sort_vec,full_data)
    [~,idx] = max(A);
    x = B(idx,:);
end

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

How to delete similar data from big data file?

6 个评论
显示 4更早的评论隐藏 4更早的评论

回答（2 个）

3 个评论
显示 1更早的评论隐藏 1更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

另请参阅

类别

标签

Community Treasure Hunt

How to delete similar data from big data file?

6 个评论 显示 4更早的评论隐藏 4更早的评论

回答（2 个）

3 个评论 显示 1更早的评论隐藏 1更早的评论

0 个评论 显示 -2更早的评论隐藏 -2更早的评论

另请参阅

类别

标签

Community Treasure Hunt

6 个评论
显示 4更早的评论隐藏 4更早的评论

3 个评论
显示 1更早的评论隐藏 1更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论