Remove duplicate rows in table

Question

DavidL88 2021-1-20

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/722478-remove-duplicate-rows-in-table

Commented: DavidL88 2021-1-28

Hi

I have a table with four columns and roughly 45,000 rows (example below). The first column is the name of a statistical test (there are several hundred different tests). For every statistical test, the values in the 4th column are duplicated at 0.25 and 0.5. Can anyone advise how to delete the first of these rows (the first of the 0.25 rows and the first of the 0.5 rows) for every statistical test?

'Perm t-test equal [250ms,500ms 92, 108]: Avg: 11_right   FCL'	-1.349	0.185	0.492
'Perm t-test equal [250ms,500ms 92, 108]: Avg: 11_right   FCL'	-1.457	0.155	0.496
'Perm t-test equal [250ms,500ms 92, 108]: Avg: 11_right   FCL'	-1.544	0.134	0.500
'Perm t-test equal [500ms,900ms 92, 108]: Avg: 11_right   FCL'	-1.544	0.129	0.500
'Perm t-test equal [500ms,900ms 92, 108]: Avg: 11_right   FCL'	-1.615	0.112	0.503
'Perm t-test equal [500ms,900ms 92, 108]: Avg: 11_right   FCL'	-1.665	0.100	0.507

1 comment

dpb 2021-1-20

I don't see the duplication in the sample dataset? (I'm presuming the 0.25 and 0.5 are confidence limits of the test and not values of the statistic as Adam presumed below).

To my eyes anyways, the first three rows above are all for one test and the next three are for a second test. The fourth-column values are unique, except that, apparently by happenstance, the last 250 ms row has the same values in the second and fourth columns as the first 500 ms row.

Not at all clear what is the result wanted from this dataset, to me, anyways...

Answer 1 (accepted answer)

Adam Danz 2021-1-20

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/722478-remove-duplicate-rows-in-table#answer_602570

Edited: Adam Danz 2021-1-20

Follow the demo.

T is a table
T.Test contains the test names which can be strings, character vectors, categoricals, or numeric.
T.col4 is the name of column 4.

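For each test, the demo finds the rows where column 4 equals 0.25 or 0.50 and, where a value occurs twice, removes the second of the two rows (indexing row 1 of rowNums instead of row 2 would remove the first). Values that occur only once within a test are left alone. The tests do not have to be in order.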

% Create table
rng('default') % for reproducibility
T = table(repelem({'A';'B';'C'},5,1),rand(15,1), rand(15,1), repmat([0;.25;.25;.5;.5],3,1),...
    'VariableNames',{'Test','col2','col3','col4'});
T.col4([7,14]) = .33; 
disp(T)
    Test      col2        col3      col4
    _____    _______    ________    ____

    {'A'}    0.81472     0.14189       0
    {'A'}    0.90579     0.42176    0.25
    {'A'}    0.12699     0.91574    0.25
    {'A'}    0.91338     0.79221     0.5
    {'A'}    0.63236     0.95949     0.5
    {'B'}    0.09754     0.65574       0
    {'B'}     0.2785    0.035712    0.33
    {'B'}    0.54688     0.84913    0.25
    {'B'}    0.95751     0.93399     0.5
    {'B'}    0.96489     0.67874     0.5
    {'C'}    0.15761     0.75774       0
    {'C'}    0.97059     0.74313    0.25
    {'C'}    0.95717     0.39223    0.25
    {'C'}    0.48538     0.65548    0.33
    {'C'}    0.80028     0.17119     0.5
% For each test, find up to the first two rows where col4 is 0.25 and, separately, 0.50
[testID, testNames] = findgroups(T.Test);
rowNum1 = arrayfun(@(i) {find(testID==i & T.col4==0.25, 2)}, unique(testID));
rowNum2 = arrayfun(@(i) {find(testID==i & T.col4==0.50, 2)}, unique(testID));
% Pad each match list to 2 elements with NaN (padarray requires the Image Processing Toolbox)
rowNums = cell2mat(cellfun(@(c){padarray(c,[2-numel(c),0],NaN,'post')},[rowNum1', rowNum2']));
% Row 2 holds the second match; it is NaN where the value was not duplicated
rmRows = rowNums(2, ~isnan(rowNums(2,:)));
% remove rows from table
T(rmRows, : ) = []
T = 11x4 table
    Test      col2        col3      col4
    _____    _______    ________    ____

    {'A'}    0.81472     0.14189       0
    {'A'}    0.90579     0.42176    0.25
    {'A'}    0.91338     0.79221     0.5
    {'B'}    0.09754     0.65574       0
    {'B'}     0.2785    0.035712    0.33
    {'B'}    0.54688     0.84913    0.25
    {'B'}    0.95751     0.93399     0.5
    {'C'}    0.15761     0.75774       0
    {'C'}    0.97059     0.74313    0.25
    {'C'}    0.48538     0.65548    0.33
    {'C'}    0.80028     0.17119     0.5
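
The padarray call in the demo needs the Image Processing Toolbox. For readers without it, the same removal can be sketched with a plain loop. This is not the answer's code, only an illustration on the same demo table; the names targets, hits, g and v are introduced here for the sketch.

```matlab
% Rebuild the demo table from the answer
rng('default') % for reproducibility
T = table(repelem({'A';'B';'C'},5,1), rand(15,1), rand(15,1), ...
    repmat([0;.25;.25;.5;.5],3,1), ...
    'VariableNames',{'Test','col2','col3','col4'});
T.col4([7,14]) = .33;

targets = [0.25, 0.50];        % col4 values whose duplicates should be thinned
testID  = findgroups(T.Test);  % integer group number for every row
rmRows  = [];                  % row numbers to delete
for g = 1:max(testID)
    for v = targets
        % at most the first two rows of test g where col4 equals v
        hits = find(testID == g & T.col4 == v, 2);
        if numel(hits) == 2
            % drop the second row of the pair; hits(1) would drop the first
            rmRows(end+1) = hits(2); %#ok<AGROW>
        end
    end
end
T(rmRows, :) = [];
disp(T)
```

On the demo table this removes the same four rows as the answer's code. The equality tests are exact, so the sketch shares the same sensitivity to floating-point representation.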

15 comments

DavidL88 2021-1-20

Hi Adam

The effect you demonstrate is what I'm looking for. I'm not sure why I got a different result. Every 0.25 and 0.5 value is duplicated. Below is a sample of the table, copied before running this code. The 0.25 values for this section are in rows 39 and 40 of the table T.

 FCL'	0.449377841816944	0.653086728317921	0.242187500000000
 FCL' 	0.379117217892076	0.705573606598350	0.246093750000000
 FCL'	0.411715894798510	0.683829042739315	0.250000000000000
 FCL'	0.411715894798510	0.680329917520620	0.250000000000000
 FCL'	0.564101287653156	0.573856535866034	0.253906250000000
 FCL'	0.794131830628734	0.429142714321420	0.257812500000000

This is the same section after running the code. In rowNum1 I can see both 39 and 40 listed.

 FCL'	0.449377841816944	0.653086728317921	0.242187500000000
 FCL'	0.379117217892076	0.705573606598350	0.246093750000000
 FCL'	0.564101287653156	0.573856535866034	0.253906250000000
 FCL'	0.794131830628734	0.429142714321420	0.257812500000000

This is the exact code I ran on my table T. T3 is the last column and T4 is the first column.

% For each testtype, identify the first row where col4 is .25 and .50
[testID, testNames] = findgroups(T.T4);
rowNum1 = arrayfun(@(i) {find(testID==i & T.T3==0.25, 1)}, unique(testID));
rowNum2 = arrayfun(@(i) {find(testID==i & T.T3==0.50, 1)}, unique(testID));
% remove rows from table
T([rowNum1{:}, rowNum2{:}], : ) = [];

Adam Danz 2021-1-21

Edited: Adam Danz 2021-1-21

> For rowNum1 and 2 the same vales are there

Impossible. rowNum1 values are based on T.T3==.25 and rowNum2 values are based on T.T3==.50. It would therefore be impossible to have the same values in both variables unless the result is an empty array (no matches). Or maybe you meant that they have the same values as in the previous version, which would only happen if all tests had duplicates for .25 and .50.

> both are listed as 648x1 cell

It's expected that they are cell arrays with the same size.

> It shouldn't be a floating point as those numbers represent exact time-stamps

They aren't integers so it's not debatable whether they are represented by floating point or not. The question is whether their floating point representation is causing a problem with the equality tests. It doesn't matter that T.T3(39) equals T.T3(40). What matters is if those values equal 0.25 or 0.50, exactly.

Example:

 4/3
ans = 1.3333
 4/3-1 
ans = 0.3333
(4/3- 1)*3 
ans = 1.0000
(4/3- 1)*3 == 1
ans = logical
   0

I wonder if you're using long format which would also explain the trailing 0s.

Could you attach a mat file containing the table?
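
If the equality tests do turn out to be the problem, one common workaround is to compare within a tolerance instead of testing exact equality. The sketch below uses made-up values and an assumed tolerance of 1e-9; neither comes from the thread, and a suitable tolerance depends on how the timestamps were computed.

```matlab
% Made-up values: 0.1 + 0.2 is not exactly 0.3 in double precision
x = [0.1 + 0.2; 0.25; 0.5];

exactHit = (x == 0.3);          % exact test misses the first element
tol      = 1e-9;                % assumed tolerance, choose one suited to the data
nearHit  = abs(x - 0.3) < tol;  % tolerance test matches the first element

disp([exactHit, nearHit])
```

Applied to the code in this thread, that would mean replacing a test such as T.T3==0.25 with abs(T.T3-0.25) < tol. If the values are already stored as exactly 0.25 and 0.50, the exact test works and the cause lies elsewhere.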

Adam Danz 2021-1-23

You were close...

idx below returns a logical vector the same size as testNames indicating which test-names are flagged. Then you have to identify which rows of the table have those test names.

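% For each test, list the rows where col4 is below 0.50
rowNum3 = arrayfun(@(i) {find(testID==i & T.col4<0.50)}, unique(testID));
% Flag the tests that have no such rows
idx = cellfun(@isempty,rowNum3);
% Remove every row that belongs to a flagged test
rmIdx = ismember(T.Test, testNames(idx));
T(rmIdx,:) = []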
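
The same filter (drop every test that has no row with col4 below 0.50) can also be sketched with splitapply, which avoids the cell arrays. The small table and the name hasLow are invented for this illustration and are not from the thread.

```matlab
% Hypothetical table: test B has no col4 value below 0.50
T = table({'A';'A';'B';'B';'C'}, [0.25;0.60;0.70;0.90;0.10], ...
    'VariableNames', {'Test','col4'});

testID = findgroups(T.Test);                               % group number per row
hasLow = splitapply(@(x) any(x < 0.50), T.col4, testID);   % one logical per test
rmIdx  = ~hasLow(testID);                                  % expand back to one logical per row
T(rmIdx, :) = [];
disp(T)
```

Indexing hasLow with testID maps each per-test flag back onto the rows of that test, which takes the place of the ismember lookup on the test names.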

DavidL88 2021-1-28

This worked thanks!
