How to delete duplicate rows from a big data file?

2 views (last 30 days)
I have a big data file consisting of 100,000 rows and 5 columns. Within this file there are rows that are similar to each other, so I need to delete the duplicates and keep only one of them. For example, if I have 5 rows with the same data, I need code that chooses the row with the maximum value and deletes the other 4; and if all 5 are identical, it should delete 4 and keep just one.
6 comments
per isakson 2020-1-13
You received two good answers. However, one question: does the order of the rows in the result matter?
Your sample data, in more readable form, is:
%%
data = [
1970 30 30 4
1971 30.5 30.5 4.2
1970 31 32 4.2
1972 32 33 5
1972 32 33 5
1973 32.1 32.2 5
1974 34 35 4.5
];
Sindar 2020-1-13
Ah, yes, I forgot to mention that both my answers will sort by the first three columns (year, then x, then y).


Answers (2)

Sindar 2020-1-12
Alright, here's my solution. Its speed depends on how often you expect duplicates (controlled in the test by test_vals). I ran a test with very few duplicates (test_vals = 200; only 600 of the 1e5 rows were duplicates) and it took 8 s. More duplicates run faster. There are probably ways to speed up the idx search, but it seems fast enough.
% setting up some test data, skip for your real data
test_rows = 1e5;
test_vals = 5;
% make a test matrix with random integers 1-test_vals in the first three columns, a random number in the last
data = [randi(test_vals,test_rows,3) rand(test_rows,1)];
% find duplicate data in the first three columns
% ia lists the row # in data of the first row in each duplicate-group
% ic lists the duplicate-group of each row in data
[~,ia,ic] = unique(data(:,1:3),'rows');
% preallocate the index array
idx = zeros(length(ia),1);
% loop over the duplicate-groups
for ind = 1:length(ia)
    % skip ahead if no duplicates (comment out if you expect few unique rows)
    if nnz(ic==ind) == 1
        idx(ind) = ia(ind);
    else
        % data(:,4).*(ic==ind) takes the z-column and zeros it except for the current duplicate-group
        % then the index of the max-z row is stored in idx
        [~,idx(ind)] = max(data(:,4).*(ic==ind));
    end
end
% select only the max-z row for each duplicate-group
clean_data = data(idx,:);
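For comparison, here is a loop-free sketch (my own variant, not part of the answer above): sort the rows so the largest z in each duplicate-group comes first, then rely on the fact that unique(...,'rows') keeps the first occurrence of each key.

```matlab
% Assumption: data is an N-by-4 matrix with the group key in columns 1-3
% and the value to maximize (z) in column 4.
[~, order] = sort(data(:,4), 'descend');   % largest z first
sorted_data = data(order, :);
% unique keeps the first occurrence of each key row, which after the
% sort is the max-z row of each duplicate-group
[~, first_rows] = unique(sorted_data(:,1:3), 'rows');
clean_data = sorted_data(first_rows, :);
```

Like the loop version, this returns the result sorted by the key columns.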
3 comments
Sindar 2020-1-13
Actually, it can't. We need the index into the whole array, not into the subset that logical indexing gives you. There may be a better way than mine, but that isn't it.
per isakson 2020-1-13
Edited: per isakson 2020-1-13
Yes indeed, that's wrong. Isn't it even syntactically incorrect? It ought to have thrown an error.
I should have written
data( ic==ind, 4 )
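One way to map a subset maximum back to a whole-array index (my own sketch, not from the thread) is to go through find before taking the max:

```matlab
% inside the loop over duplicate-groups:
rows = find(ic == ind);          % whole-array row numbers of this group
[~, k] = max(data(rows, 4));     % index within the subset
idx(ind) = rows(k);              % map back to the whole array
```

This also avoids the zeroing trick `data(:,4).*(ic==ind)`, which can pick a row outside the group if all the group's z values happen to be negative (the zeros elsewhere would then be the maximum).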



Sindar 2020-1-12
Edited: per isakson 2020-1-13
Looking into it a bit (I have a similar problem), it seems there is a way using the "Split-Apply-Combine Workflow". It is faster for many duplicate-groups but slower for few (in my in-memory tests); I think it should scale much better to distributed memory and/or parallel computation.
% G lists the duplicate-group of each row in data, equal to ic above
G = findgroups(data(:,1),data(:,2),data(:,3));
% for each group, apply the maxrow function:
% it takes the 4th column of data (your z's), finds the index of the max,
% then returns that full row, one per duplicate-group
clean_data = splitapply(@(x1) maxrow(x1(:,4),x1),data,G);
% return the row of full_data corresponding to the index of the maximum of sort_vec
function x = maxrow(sort_vec,full_data)
[~,idx] = max(sort_vec);
x = full_data(idx,:);
end
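As a quick self-contained check (my own test, not from the answer), running the split-apply approach on the sample data from the comments above collapses the two identical 1972 rows into one, leaving 6 rows:

```matlab
data = [
    1970 30   30   4
    1971 30.5 30.5 4.2
    1970 31   32   4.2
    1972 32   33   5
    1972 32   33   5
    1973 32.1 32.2 5
    1974 34   35   4.5
    ];
% group rows by the first three columns
G = findgroups(data(:,1), data(:,2), data(:,3));
% keep the max-z row of each group (6 rows remain)
clean_data = splitapply(@(x1) maxrow(x1(:,4), x1), data, G);

function x = maxrow(sort_vec, full_data)
[~, idx] = max(sort_vec);
x = full_data(idx, :);
end
```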
