Count number of nucleotide bases (letters) in file with multiple entries

14 次查看(过去 30 天)
Below is an example of the type of sequences I'm dealing with. The sequences are truncated for readability. Some IDs correspond to one sequence only whereas others to more than one squence. The objective is to count the number of nucleotides (letters) per ID.
>AAEL001292::intron | Intron sequence
GTGGG
>AAEL001312::intron | Intron sequence
GTAAGAAC
>AAEL001320::intron | Intron sequence
GTAAGACCTACTCA
>AAEL001807::intron | Intron sequence
GTACT
>AAEL002085::intron | Intron sequence
GTTAGTTGTA
>AAEL002085::intron | Intron sequence
GTAAGT
>AAEL002085::intron | Intron sequence
GTT
>AAEL002085::intron | Intron sequence
GTACGA
>AAEL002633::intron | Intron sequence
GTAAGAAAAAT
I've managed to generate a table like:
AAEL001292 5
AAEL001312 8
AAEL001320 14
AAEL001807 5
AAEL002085 10
AAEL002085 6
AAEL002085 3
AAEL002085 6
AAEL002633 11
The desired output is:
AAEL001292 5
AAEL001312 8
AAEL001320 14
AAEL001807 5
AAEL002085 25 # sum of number of nuceotides under to ID AAEL002085
AAEL002633 11
Any suggestions will be most helpful.
  1 个评论
dpb
dpb 2024-8-30
Be much easier if you would attach a data file for folks to work with...as for hints, without a data file from which to see just what you're starting with, I'd venture using a table and @doc:rowfun or groupsummary even with grouping variables will give you the desired results directly.

请先登录,再进行评论。

采纳的回答

Voss
Voss 2024-8-30
% this is what the attached file contains:
type file.txt
>AAEL001292::intron | Intron sequence GTGGG >AAEL001312::intron | Intron sequence GTAAGAAC >AAEL001320::intron | Intron sequence GTAAGACCTACTCA >AAEL001807::intron | Intron sequence GTACT >AAEL002085::intron | Intron sequence GTTAGTTGTA >AAEL002085::intron | Intron sequence GTAAGT >AAEL002085::intron | Intron sequence GTT >AAEL002085::intron | Intron sequence GTACGA >AAEL002633::intron | Intron sequence GTAAGAAAAAT
data = readlines('file.txt');
ID = regexp(data(1:2:end),'>(.*?):','tokens','once');
ID = vertcat(ID{:});
introns = data(2:2:end);
[G,ID] = findgroups(ID);
F = @(x)sum(cellfun(@numel,x));
count = groupsummary(introns,G,F);
T = table(ID,count)
T = 6x2 table
ID count ____________ _____ "AAEL001292" 5 "AAEL001312" 8 "AAEL001320" 14 "AAEL001807" 5 "AAEL002085" 25 "AAEL002633" 11

更多回答(2 个)

dpb
dpb 2024-8-31
编辑:dpb 2024-9-1
I'll convert the comment to an Answer because I believe it is/will be part of the required solution and is probably easier to clean the data to match the calculation assumptions than to modify the calculations to match the varying number of rows per group ID.
ADDENDUM
Another way to approach the cleanup would be to compute the line number difference between the lines beginning with ">" -- if it is >2, then that group has one or more wrapped lines.
data = readlines('grp1_introns.txt');
data(strlength(data)==0)=[]; % remove blank lines, if any
data(1:12)
ans = 12x1 string array
">AAEL001292::intron | Intron sequence" "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG" ">AAEL001312::intron | Intron sequence" "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC" "AG" ">AAEL001320::intron | Intron sequence" "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA" "G" ">AAEL001807::intron | Intron sequence" "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA" "CAG" ">AAEL002085::intron | Intron sequence"
ix=find(startsWith(data,'>'));
dx=[diff(ix);0];
iy=find(dx>2);
whos ix dx iy
Name Size Bytes Class Attributes dx 458x1 3664 double ix 458x1 3664 double iy 290x1 2320 double
[ix(1:10) dx(1:10) iy(1:10)]
ans = 10x3
1 2 2 3 3 3 6 3 4 9 3 5 12 4 8 16 2 9 18 2 10 20 3 11 23 9 12 32 9 13
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
NOTA BENE: There are wrapped lines with more than two lines; the sequence at line 16 is three while that at line 32 is 8 since the differences are 4 and 9, respectfully.
One will need to iterate through and combine all those rows in the counting, either by catenating the lines into one long one and removing the wrapped lines so the file follows the assumption of only one line or by summing the lengths of the lines between groups.
Well, let's see if I can manage to keep the indexing straight without too much trouble...
markfordelete=[]; % accumulator for marked rows
for i=1:numel(iy) % iterate over the elements that are multiple lines
l1=ix(iy(i))+1; % the first row in the group past the header
l2=l1+dx(iy(i))-2; % the last row in group
try
data(l1)=join(data(l1:l2),''); % put those rows together, no spaces
markfordelete=cat(1,markfordelete,[l1+1:l2].'); % keep list of rows to delete
catch
e=lasterror; e.message % for debugging
[i iy(i) ix(iy(i)) dx(iy(i)) l1 l2]
data(l1-1:end)
return
end
end
data(markfordelete)=[];
At this point, you should be able to apply @Voss method...it seems to work with at least a cursory checking. You'll want to output to a file and inspect carefully to make sure, but I think it works as intended.
ix=find(startsWith(data,'>'));
height(ix)
ans = 458
any(diff(ix)~=2)
ans = logical
0
The quick sanity check that the updated file doesn't have any segments of more than one line and the number of segments is the same is agoodsign™
ADDENDUM
Interesting thought -- what's the length of the lines now that are all strung together?
[Lmax,iMax]=max(dx)
Lmax = 401
iMax = 57
L=strlength(data(2:2:end));
UL=unique(L);
numel(UL)
ans = 157
NL=histc(L,UL);
[UL NL]
ans = 157x2
13 1 14 1 50 1 51 3 52 3 53 20 54 11 55 13 56 16 57 32
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
[min(UL) max(UL)]
ans = 1x2
13 23955
<mw-icon class=""></mw-icon>
<mw-icon class=""></mw-icon>
subplot(2,1,1)
bar(UL,NL)
set(gca,'XScale','log')
xlim([UL(1) UL(end)])
subplot(2,1,2)
bar(UL,NL)
set(gca,'XScale','log')
xlim([UL(1)-1 200])
xticks([UL(1) 20:10:50 100 200])
sum(L)
ans = 444117
There are a couple of introns that have 400+ rows that adds up to string lengths of 20,000+. I dunno if this is correct or not, but opening the text file confirms that is the content of the uploaded file.
  2 个评论
dpb
dpb 2024-9-1
编辑:dpb 2024-9-1
No problem; got interesting. Is the file really right that there are strings of 25,000+?
While the forum won't let accept more than one Answer, if the above actually is useful, you could cast a vote... :)

请先登录,再进行评论。


George
George 2024-8-31
Thank you very much for the prompt reply. I tried what you suggested. I get the following error message.
count = groupsummary(introns,G,F);
Error using matlab.internal.math.parseGroupVars (line 129)
Grouping variables and data matrix must have same number of rows.
Error in groupsummary (line 180)
[groupingData,groupVars] = matlab.internal.math.parseGroupVars(groupVars,tableFlag,"groupsummary:Group",T);
I'm attaching the data file in case it helps.
Once again many thanks
  2 个评论
dpb
dpb 2024-8-31
编辑:dpb 2024-8-31
data = readlines('grp1_introns.txt');
data(1:10,:)
ans = 10x1 string array
">AAEL001292::intron | Intron sequence" "GTGGGTGATTGTGTTTTTTTGATTTTGATACAAGTTAACAACATTGTATTTTTCAG" ">AAEL001312::intron | Intron sequence" "GTAAGAACACATGCGTCGTCAATGCTTAACAAGATTTAATACAATTTGCGCATCTGTTTC" "AG" ">AAEL001320::intron | Intron sequence" "GTAAGACCTACTCATTAGATTTATAGACAGTGATGTAATACTAATTATGCATTCATTTTA" "G" ">AAEL001807::intron | Intron sequence" "GTACTTTTGGTTGACGGAAGTTTTAGAATGAGAATTCGTCTAATCGTTATATATATTTTA"
The attached file seems to have a linefeed after a fixed-length record for the longer sequences -- is that real in the file or an artifact of how it was saved/uploaded?
That breaks the assumption @Voss made that there are two lines per section and in using findgroups it is found out when the number of rows don't match.
That assumption is also in @Star Strider's code, but his will still run; just those sequences in the second/wrapped line won't be counted.
I gather it must be true in the real files, too, if yours failed; in that case it will take some additional preprocessing to apply either solution.
Be sure you haven't inadvertently introduced that line break by having looked at the files in a word processing type editor such as Notepad that automatically word wrapped the longer lines (or, maybe even the MATLAB code editor might if the line length is set, I know it is sometimes quite annoying, too, with wanting to wrap comments at ends of somewhat longer code lines...
L=max(strlength(data)) % the maximum line length
ans = 60
ix=find(strlength(data)==L); % locate those long lines
iy=find(startsWith(data,'>')); % and the sequence header lines
iy=iy(iy>ix(1)); % keep only those after the first long line
for i=1:numel(ix)
if iy(i)-ix(i)>2 % wrapped line
data(ix,:)=join(data(ix:ix+1,:)); % put them back together
end
end
data(ix+1,:)=[]; % now remove the wrapped lines

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Text Data Preparation 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by