Remove duplicate variables depending on a variable in a second column
Dear experts, I have a list of variables where I need to remove duplicate variables based on the variable in column 2. Variables with a '1' in column 2 are of better quality than variables with a '0'.
1) In case of duplicate variables, I want to keep the variable that has the value 1 in the second column. When several duplicates have a 1, only one of them should be kept, chosen at random. See example below: Here I want to keep the variable BG1028 where the data in the third column is 1.3. For BG1030, I want to keep the variable with 3.0 or 0.3 in the third column.
2) When all duplicates of a variable have a zero in the second column, only one of them should be kept, chosen at random. See example below: I need to keep one variable of BG1027 (random choice).
I hope it is clear. I'm puzzling over how to do this. This is the code I came up with so far, with help from Kirby Fear.
ppn = [ {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
'BG1030';'BG1030';'BG1030';'BG1030'},... % start col 2
{'0';'0';'0';'1';'0';'0';'1';'0';'0';'1';'0';'1'},... % start col 3
{'1.2';'2.2';'5.2';'4.2';'0.2';'8.9';'3.4';'3.0';'0.3';'1.3';'0.3';'1.7'} ];
% Storing ppn column 2 as numerical values
bPpn=cell2mat(cellfun(@(c)str2double(c),ppn(:,2),...
'UniformOutput',false));
% Get names of duplicates
chooseNames = ppn([strcmp(ppn(1:end-1,1),ppn(2:end,1));false],1);
% Loop over chooseNames and keep one at random.
if numel(chooseNames)>0
    for j=1:numel(chooseNames)
        dupidx=find(strcmp(chooseNames{j},ppn(:,1)));
        dupidx(randi(numel(dupidx)))=[];
        ppn(dupidx,:)=[];
    end
end
Accepted Answer
WAT
2015-9-24
Give something like this a try:
ppn = [ {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
'BG1030';'BG1030';'BG1030';'BG1030'},... % start col 2
{'0';'0';'0';'1';'0';'0';'1';'0';'0';'1';'0';'1'},... % start col 3
{'1.2';'2.2';'5.2';'4.2';'0.2';'8.9';'3.4';'3.0';'0.3';'1.3';'0.3';'1.7'} ];
[uniqNames, ia, ic] = unique(ppn(:,1)); % assumes ia holds the first row of each name and rows are grouped by name
ia = [ia; 1+length(ic)]; % append an end marker one past the last row
ppn_out = {}; % initialize output
for i = 1:length(uniqNames)
    sub = ppn(ia(i):ia(i+1)-1,:); % find only entries with uniqNames(i)
    flags = cell2mat(sub(:,2)); % col 2 as a char column ('0' or '1')
    sub = sub(flags == max(flags),:); % find only those entries with the maximal value in col 2
    ppn_out = [ppn_out; sub(randi(size(sub,1)),:)]; % select one entry at random, put it in ppn_out
end
3 Comments
WAT
2015-9-24
Edited: WAT
2015-9-24
That's odd, it's also skipping BG1026 for you. It seems to be behaving fine for me, I wonder if there's something goofy in the unique() command? (I'm on R2013a or R2015a and it works fine on both)
Try getting rid of all the semicolons, it's short enough that it should be easy to follow what the code is doing.
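One possible explanation for the different results, offered as a guess rather than a confirmed diagnosis: as far as I know, unique() returned the index of the last occurrence of each value up to R2012b and the first occurrence from R2013a on, and the answer's slicing with ia relies on the first-occurrence behavior. The sketch below avoids ia altogether and groups rows with the third output ic, so it should not depend on the release or on the rows being sorted by name. The variable names quality, rows, best and keep are my own and not from the original code.

% Same sample data as in the thread: name, quality flag, value
ppn = [ {'BG1026';'BG1027';'BG1027';'BG1028';'BG1028';'BG1028';'BG1029';'BG1029';...
'BG1030';'BG1030';'BG1030';'BG1030'},... % start col 2
{'0';'0';'0';'1';'0';'0';'1';'0';'0';'1';'0';'1'},... % start col 3
{'1.2';'2.2';'5.2';'4.2';'0.2';'8.9';'3.4';'3.0';'0.3';'1.3';'0.3';'1.7'} ];
quality = str2double(ppn(:,2)); % numeric quality flags (0 or 1)
[uniqNames, ~, ic] = unique(ppn(:,1)); % ic(k) is the group number of row k
keep = zeros(numel(uniqNames), 1); % row index kept for each name
for i = 1:numel(uniqNames)
    rows = find(ic == i); % all rows with this name
    best = rows(quality(rows) == max(quality(rows))); % best-quality rows only
    keep(i) = best(randi(numel(best))); % pick one of them at random
end
ppn_out = ppn(keep, :); % one row per name, in sorted name order

With the sample data this always keeps 4.2 for BG1028 and 3.4 for BG1029, and picks at random between 2.2 and 5.2 for BG1027 and between 1.3 and 1.7 for BG1030.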