Removing redundant rows where not every row has the same number of elements

Question

Dillon Heidenreich 2021-7-30

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/888922-removing-redundant-rows-where-not-every-row-has-the-same-number-of-elements

Edited: Dillon Heidenreich 2021-7-30

Hello,

I have data that often looks like this:

"HIST1H2BC" "K13"

"HIST1H2BC" "K13;K16"

"HIST1H2BC" "K16"

"HIST1H2BH" "K13"

"HIST1H2BH" "K13;K16"

"HIST1H2BH" "K16"

"HIST1H2BO" "K13;K16"

"HIST1H2BO" "K16"

"HIST2H2BE" "K13;K16"

"HIST2H2BE" "K16"

I have been trying to write a function that splits the second column at each ';' and then removes any row whose elements are all contained in another row, which would hopefully yield something like this:

"HIST1H2BC" "K13" "K16"

"HIST1H2BH" "K13" "K16"

"HIST1H2BO" "K13" "K16"

"HIST2H2BE" "K13" "K16"

Every solution I have tried has been overly complicated and difficult to wrap my head around.

Thank you in advance!


Answer 1

Chunru 2021-7-30

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/888922-removing-redundant-rows-where-not-every-row-has-the-same-number-of-elements#answer_756952

Edited: Chunru 2021-7-30

x = ["HIST1H2BC"	"K13"
"HIST1H2BC"	"K13;K16"
"HIST1H2BC"	"K16"
"HIST1H2BH"	"K13"
"HIST1H2BH"	"K13;K16"
"HIST1H2BH"	"K16"
"HIST1H2BO"	"K13;K16"
"HIST1H2BO"	"K16"
"HIST2H2BE"	"K13;K16"
"HIST2H2BE"	"K16"
"1H1D"	"K137"
"1H1D"	"K137|K138"
"1H1D"	"K138"
"1H1D"	"K136"
"1H1E"	"K136|K137"
"1H1E"	"K137"];
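% Split the sites of the first row on either delimiter and start the result
s = split(x(1, 2), {';', '|'});
y = {x(1, 1), s'};
for i = 2:size(x, 1)
    s = split(x(i, 2), {';', '|'});
    % Is this name already in the first column of y?
    [lix, locy] = ismember(x(i, 1), [y{:, 1}]);
    if ~lix
        % New name: append a row holding its sites
        y = [y; {x(i, 1), s'}];
    else
        % Known name: append only the sites that are not stored yet
        lis = ismember(s, y{locy, 2});
        y{locy, 2} = [y{locy, 2} s(~lis)'];
    end
end
y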
y = 6×2 cell array
    {["HIST1H2BC"]}    {["K13"    "K16"            ]}
    {["HIST1H2BH"]}    {["K13"    "K16"            ]}
    {["HIST1H2BO"]}    {["K13"    "K16"            ]}
    {["HIST2H2BE"]}    {["K13"    "K16"            ]}
    {["1H1D"     ]}    {["K137"    "K138"    "K136"]}
    {["1H1E"     ]}    {["K136"    "K137"          ]}
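
Note that the loop above merges every site seen for a name into a single row, which is slightly different from the rule stated in the question (drop a row only when all of its elements are contained in another row for the same name). If that stricter rule is what is wanted, a minimal sketch might look like the following. The variable names sites, keepRow and reduced are made up for this illustration, and the sample data is a shortened copy of the rows above.

```matlab
% Sketch only: sites, keepRow and reduced are illustrative names.
x = ["HIST1H2BC" "K13"
     "HIST1H2BC" "K13;K16"
     "HIST1H2BC" "K16"
     "1H1D"      "K137"
     "1H1D"      "K137|K138"
     "1H1D"      "K136"];

n = size(x, 1);
sites = cell(n, 1);
for i = 1:n
    % Distinct site labels of row i, as a column string vector
    sites{i} = unique(split(x(i, 2), [";", "|"]));
end

keepRow = true(n, 1);
for i = 1:n
    for j = 1:n
        % Only compare different rows that share the same name
        if i == j || x(i, 1) ~= x(j, 1)
            continue
        end
        isSubset = all(ismember(sites{i}, sites{j}));
        isSame = isSubset && numel(sites{i}) == numel(sites{j});
        % Drop row i if its sites are a proper subset of row j,
        % or if it exactly duplicates an earlier row j
        if isSubset && (~isSame || j < i)
            keepRow(i) = false;
            break
        end
    end
end
reduced = x(keepRow, :);
```

With this sample, the rows kept are "HIST1H2BC" "K13;K16", "1H1D" "K137|K138" and "1H1D" "K136", so "K136" stays on its own row instead of being merged with the other 1H1D sites. The nested loop compares every pair of rows, which should be acceptable for a few thousand rows but is not tuned for speed.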


Answer 2

KSSV 2021-7-30

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/888922-removing-redundant-rows-where-not-every-row-has-the-same-number-of-elements#answer_756957

str = ["HIST1H2BC"	"K13"
"HIST1H2BC"	"K13;K16"
"HIST1H2BC"	"K16"
"HIST1H2BH"	"K13"
"HIST1H2BH"	"K13;K16"
"HIST1H2BH"	"K16"
"HIST1H2BO"	"K13;K16"
"HIST1H2BO"	"K16"
"HIST2H2BE"	"K13;K16"
"HIST2H2BE"	"K16"];
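iwant = strings(0, 3);   % one row per kept entry: name, first site, second site
count = 0;
for i = 1:size(str, 1)
    s = strsplit(str(i, 2), ';');
    % Keep only the rows whose second column holds exactly two sites
    if numel(s) == 2
        count = count + 1;
        iwant(count, :) = [str(i, 1) s];
    end
end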
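
The loop above keeps only rows whose second column splits into exactly two parts, so it relies on every name having one combined "A;B" row. As a hedged alternative that does not rely on this, the sketch below collects the distinct sites per name. The names grp, sitesPerName and parts are invented for the example, and the three-row input is a made-up subset of the data.

```matlab
% Sketch only: grp, sitesPerName and parts are illustrative names.
str = ["HIST1H2BC" "K13"
       "HIST1H2BC" "K13;K16"
       "HIST1H2BO" "K16"];

% Group the rows by name, keeping the order of first appearance
[names, ~, grp] = unique(str(:, 1), 'stable');

sitesPerName = cell(numel(names), 1);
for k = 1:numel(names)
    parts = strings(0, 1);
    rows = find(grp == k);
    for r = rows'
        % Append the sites of this row (one or more after splitting)
        parts = [parts; split(str(r, 2), ";")];
    end
    % Distinct sites for this name, as a row vector
    sitesPerName{k} = unique(parts, 'stable')';
end
```

Because the number of sites can differ between names, the result is held in a cell array rather than a fixed three-column string array.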

8 comments (the 6 earlier comments are not shown)

Chunru 2021-7-30

See the revised answer.

Dillon Heidenreich 2021-7-30

Edited: Dillon Heidenreich 2021-7-30

I've attached a small version of the Excel file, trimmed down to only the relevant information, but unfortunately it is hard to find the infrequent combinations of data that give me trouble. The base file has around 12 thousand rows and around 60 columns, but only 9 or so of them are currently relevant to my work. The small version is titled HistExample, whereas the larger version, though still not the complete file, is titled HistTrimmed. My process starts by concatenating columns A, B and C to each other top to bottom, repeating the same for the other two groups of columns, and then concatenating them all together horizontally, effectively yielding

Seq A Common A XL A

Seq B Common B XL B

Seq C Common C XL C

which I name wholePPR

I then take the first column of wholePPR and use the unique function to find unique elements. I then run the below code:

for x = 1:size(uniquePep,1)
    tempPPR = wholePPR(uniquePep(x) == wholePPR(:,1),:);
    cellFam(x,1) = {tempPPR(:,2)};
    cellFam(x,2) = {tempPPR(:,3)};
end

where uniquePep is the result of running the unique function of the first column. I then run this code:

for x = 1:size(cellFam,1)
    tempFams = [cellFam{x,1},cellFam{x,2}];
    uFams(x) = {unique(tempFams,'Rows')};
end

to get the data I showed at the beginning of this question (uFams).
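
For reference, the two loops described above can probably be folded into one pass by using the third output of unique as a group index. This is only a sketch: the small wholePPR below is invented stand-in data in the Seq / Common / XL layout (the attached files are not reproduced here), and it assumes wholePPR is a string array, as the == comparison in the loops suggests.

```matlab
% Sketch only: hypothetical stand-in data with columns Seq, Common, XL.
wholePPR = ["PEPA" "HIST1H2BC" "K13"
            "PEPA" "HIST1H2BC" "K13"
            "PEPA" "HIST1H2BH" "K13"
            "PEPB" "HIST1H2BO" "K16"];

% grp(i) is the index into uniquePep of the sequence in row i
[uniquePep, ~, grp] = unique(wholePPR(:, 1));

uFams = cell(numel(uniquePep), 1);
for k = 1:numel(uniquePep)
    % Distinct (Common, XL) pairs for the k-th unique sequence
    uFams{k} = unique(wholePPR(grp == k, 2:3), 'rows');
end
```

Here uFams is preallocated as a column cell array, whereas the loops above grow it as a row, and the intermediate cellFam is skipped. The contents of each cell should otherwise match.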
