Remove duplicate variables depending on a variable in a second column

Question

0 votes

Dear experts, I have a list of variables from which I need to remove duplicates based on the value in column 2. Variables with a '1' in column 2 are of better quality than variables with a '0'.

1) When a variable is duplicated, I want to keep the entry that has a 1 in the second column. If several duplicates have a 1, one of them should be kept at random. In the example data below, I want to keep the BG1028 entry whose third column is 4.2. For BG1030, I want to keep the entry with either 1.3 or 1.7 in the third column.

2) When all duplicates of a variable have a 0 in the second column, one of them should be kept at random. In the example data below, I need to keep one BG1027 entry (random choice).

I hope this is clear. I'm puzzled about how to do this. This is the code I have come up with so far, with help from Kirby Fear.

ppn = [ {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
    'BG1030';'BG1030';'BG1030';'BG1030'},... % start col 2
    {'0';'0';'0';'1';'0';'0';'1';'0';'0';'1';'0';'1'},... % start col 3
    {'1.2';'2.2';'5.2';'4.2';'0.2';'8.9';'3.4';'3.0';'0.3';'1.3';'0.3';'1.7'} ];
% Storing ppn column 2 as numerical values
% (bPpn is not used yet: the loop below still ignores column 2)
bPpn = str2double(ppn(:,2));
% Get names of duplicates (assumes rows with the same name are adjacent)
chooseNames = ppn([strcmp(ppn(1:end-1,1),ppn(2:end,1));false],1);
% Loop over chooseNames and keep one row per name at random.
for j = 1:numel(chooseNames)
    dupidx = find(strcmp(chooseNames{j},ppn(:,1)));
    dupidx(randi(numel(dupidx))) = []; % spare one randomly chosen row
    ppn(dupidx,:) = [];                % delete the other rows with this name
end


Answer 1

WAT 2015-9-24

1 vote

Give something like this a try:

ppn = [ {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
    'BG1030';'BG1030';'BG1030';'BG1030'},... % start col 2
    {'0';'0';'0';'1';'0';'0';'1';'0';'0';'1';'0';'1'},... % start col 3
    {'1.2';'2.2';'5.2';'4.2';'0.2';'8.9';'3.4';'3.0';'0.3';'1.3';'0.3';'1.7'} ];
% ia is expected to hold the index of the first row of each name, and the
% rows of each name are assumed to be adjacent and sorted by name
[uniqNames, ia, ic] = unique(ppn(:,1));
ia = [ia; 1+length(ic)]; % sentinel: one past the last row
ppn_out = {}; % initialize output
for i = 1:length(uniqNames)
    sub = ppn(ia(i):ia(i+1)-1,:); % only the entries named uniqNames{i}
    flags = cell2mat(sub(:,2)); % single-character flags '0'/'1' as a char column
    sub = sub(flags == max(flags),:); % only the entries with the maximal value in col 2
    ppn_out = [ppn_out; sub(randi(size(sub,1)),:)]; % select one entry at random, put it in ppn_out
end
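
A possible variant of the loop above, offered as a sketch rather than a drop-in replacement: it selects the rows of each name with the third output of unique() (ic) instead of index ranges, so it should not depend on the rows being sorted by name or on which occurrence unique() reports. It also converts column 2 with str2double, so flags longer than one character would work too. The names flags, keep, rows and best are introduced here for illustration only.

```matlab
ppn = [ {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
    'BG1030';'BG1030';'BG1030';'BG1030'},... % start col 2
    {'0';'0';'0';'1';'0';'0';'1';'0';'0';'1';'0';'1'},... % start col 3
    {'1.2';'2.2';'5.2';'4.2';'0.2';'8.9';'3.4';'3.0';'0.3';'1.3';'0.3';'1.7'} ];

flags = str2double(ppn(:,2));          % column 2 as numbers (0 or 1)
[uniqNames, ~, ic] = unique(ppn(:,1)); % ic(k) is the group number of row k
keep = zeros(numel(uniqNames), 1);     % row index chosen for each name
for i = 1:numel(uniqNames)
    rows = find(ic == i);                         % all rows with this name
    best = rows(flags(rows) == max(flags(rows))); % rows with the highest flag
    keep(i) = best(randi(numel(best)));           % pick one of them at random
end
ppn_out = ppn(keep, :);
```

With the example data this always keeps the 4.2 row for BG1028 and the 3.4 row for BG1029, and picks at random between 1.3 and 1.7 for BG1030 and between 2.2 and 5.2 for BG1027.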

3 comments

WAT 2015-9-24

Edited: WAT 2015-9-24

That's odd, it's also skipping BG1026 for you. It behaves fine for me, so I wonder if there's something goofy going on in the unique() command. (I've tried R2013a and R2015a and it works fine on both.)

Try removing all the semicolons so that every intermediate result is printed; the code is short enough that it should be easy to follow what it is doing.

Marty Dutch 2015-9-25

Yes, you're right. I tried running it in R2013b and in R2015a, and it works fine in both now. Thanks for the response. It did not work in R2012b.
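
For readers hitting the same problem on an older release: a likely explanation is the change to unique() in R2013a. From that release on, the second output points to the first occurrence of each value by default, whereas earlier releases returned the last occurrence, so the ia(i):ia(i+1)-1 ranges in the answer no longer line up with the groups. The snippet below only illustrates that difference using the names from the question; iaFirst and iaLast are names made up for this example.

```matlab
names = {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
    'BG1030';'BG1030';'BG1030';'BG1030'};

[~, iaFirst] = unique(names, 'first'); % index of the first row of each name
[~, iaLast]  = unique(names, 'last');  % index of the last row of each name

disp(iaFirst(:)') % 1 2 4 7 9
disp(iaLast(:)')  % 1 3 6 8 12
```

Requesting 'first' explicitly, as in [uniqNames, ia, ic] = unique(ppn(:,1), 'first'), should make the index ranges in the answer behave the same way on old and new releases.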
