Read text file lines and analyze

I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:
>Rosalind_4949
ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT
GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC
AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA
>Rosalind_7490
AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG
TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT
>Rosalind_8337
CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA
CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC
CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA
I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.
I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.
fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
    templine = fgetl(fid);
    a = strcmp(templine, '>');
    if a == 0
        G = length(strfind(templine,'G'));
        C = length(strfind(templine,'C'));
        z = length(templine);
        %Per = (G+C)*100/z
    end
end
Per = (G+C)*100/z

Accepted Answer

Lmm3 2017-9-9
The following code is what I used to read from the data file and determine %GC:
fid = fopen('rosalind_gc.txt');
G = 0;            % G count for the current record
C = 0;            % C count for the current record
z = 0;            % sequence length for the current record
header = '';      % header line of the current record
while ~feof(fid)
    templine = fgetl(fid);
    if ~ischar(templine)
        break                       % fgetl returns -1 at end of file
    end
    if strncmp(templine, '>', 1)
        % New header: report the record that just ended, then reset.
        if z > 0
            disp(header)
            Per = (G + C)*100/z
        end
        header = templine;
        G = 0;
        C = 0;
        z = 0;
    else
        % Sequence line: add its counts to the running totals.
        G = G + length(strfind(templine, 'G'));
        C = C + length(strfind(templine, 'C'));
        z = z + length(templine);
    end
end
fclose(fid);
% The last record has no header after it, so report it here.
disp(header)
Per = (G + C)*100/z

More Answers (2)

KSSV 2017-7-24
Edited: KSSV 2017-7-24
Let data.txt be your text file. You can count the number of G's in your file as below:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
    N = N+length(strfind(S{i}, 'G'));
end
Without a loop:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;
N = sum(cellfun(@numel,Ni)) ;
1 Comment
Lmm3 2017-7-25
KSSV, thank you for your response. Could you explain what the line S = S{1} is doing? The code returns the total number of "G" occurrences in the data file, but do you have a suggestion for how to get the "G" occurrences between each of the headers that begin with ">Rosalind"? For example, in the data set above, I would like to get 3 values: the number of G occurrences between ">Rosalind_4949" and ">Rosalind_7490", between ">Rosalind_7490" and ">Rosalind_8337", and below ">Rosalind_8337".
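On the two points raised in this comment: textscan returns a 1-by-1 cell array whose single element holds all the lines, so S = S{1} simply unwraps it into a cell array with one string per line. A rough sketch of per-record G counts follows. It assumes the first line of the file is a header, and it uses a short made-up cell array in place of the file so that it runs on its own; the names isHeader, recordId, nG and NperRecord are illustrative and not from the answer above.

```matlab
% Lines as they would look after S = S{1} (short made-up records).
S = {'>Rosalind_0001'; 'GGCA'; 'TTG'; ...
     '>Rosalind_0002'; 'ACGT'; ...
     '>Rosalind_0003'; 'GGGG'; 'CC'};

isHeader = strncmp(S, '>', 1);           % true on lines that start with '>'
recordId = cumsum(isHeader);             % record number of every line
nG = cellfun(@(s) sum(s == 'G'), S);     % number of G's on each line
nG(isHeader) = 0;                        % ignore any G inside header text
NperRecord = accumarray(recordId, nG);   % total G's per record
```

Here cumsum turns the header flags into a record index, and accumarray sums the per-line counts that share an index, so no explicit loop over records is needed.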



OCDER 2017-9-9
If you deal with a lot of FASTA files, look into fastaread (MATLAB Bioinformatics Toolbox) or readFasta (code I wrote for another project).
Also, cellfun and regexp are pretty handy tools for this.
To get GC %:
[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
48.5876
45.0000
55.1724
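If neither fastaread nor readFasta is available, a similar result can probably be had with base MATLAB functions only. The following is a minimal sketch under stated assumptions, not the readFasta implementation: it writes a tiny made-up FASTA file to a temporary location so that it is self-contained, and the variable names (records, parts, seq) are illustrative.

```matlab
% Write a tiny made-up FASTA file so the sketch runs on its own.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '>Rosalind_0001\nGGCA\nTTGA\n>Rosalind_0002\nACGT\nAATT\n');
fclose(fid);

txt = fileread(fname);                     % whole file as one char vector
records = regexp(txt, '>', 'split');       % one chunk per '>' header
records = records(2:end);                  % drop the text before the first '>'

Header = cell(numel(records), 1);
PercGC = zeros(numel(records), 1);
for k = 1:numel(records)
    parts = regexp(strtrim(records{k}), '\r?\n', 'split');
    Header{k} = parts{1};                  % first line is the header text
    seq = upper([parts{2:end}]);           % join the sequence lines
    PercGC(k) = 100 * sum(seq == 'G' | seq == 'C') / numel(seq);
end
delete(fname);                             % clean up the temporary file
```

To use it on real data, skip the file-writing lines and set fname to the data file, for example 'rosalind_gc1.txt'. A record with a header but no sequence lines would give NaN, since its length is zero.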
