Read text file lines and analyze

Question

Lmm3 2017-7-24

https://ww2.mathworks.cn/matlabcentral/answers/349955-read-text-file-lines-and-analyze

Answered: OCDER 2017-9-9

Accepted answer: Lmm3

I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:

>Rosalind_4949
ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT
GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC
AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA
>Rosalind_7490
AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG
TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT
>Rosalind_8337
CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA
CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC
CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA

I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.

I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.

fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
    templine = fgetl(fid);
    a = strcmp(templine, '>');
    if a == 0
        G = length(strfind(templine,'G'));
        C = length(strfind(templine,'C'));
        z = length(templine);
        %Per = (G+C)*100/z
    end
end
Per = (G+C)*100/z

Answer 1

Lmm3 2017-9-9

https://ww2.mathworks.cn/matlabcentral/answers/349955-read-text-file-lines-and-analyze#answer_280864

The following code is what I used to read from the data file and determine %GC:

fid = fopen('rosalind_gc.txt');
G = 0;   % running count of 'G' in the current record
C = 0;   % running count of 'C' in the current record
z = 0;   % running sequence length of the current record
while ~feof(fid)
    templine = fgetl(fid);
    if ~ischar(templine)
        break                 % fgetl returns -1 at end of file
    end
    if isempty(templine)
        continue              % skip blank lines
    end
    if templine(1) == '>'
        % New header: report the record that just ended, then reset.
        if z > 0
            Per = (G + C)*100/z
        end
        disp(templine)
        G = 0;
        C = 0;
        z = 0;
    else
        % Sequence line: add its counts to the current record.
        G = G + length(strfind(templine, 'G'));
        C = C + length(strfind(templine, 'C'));
        z = z + length(templine);
    end
end
fclose(fid);
% Report the last record, which has no header after it.
if z > 0
    Per = (G + C)*100/z
end

Answer 2

KSSV 2017-7-24

https://ww2.mathworks.cn/matlabcentral/answers/349955-read-text-file-lines-and-analyze#answer_275272

Edited: KSSV 2017-7-24

Let data.txt be your text file. You can count the number of G characters in the file as follows:

fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
    N = N+length(strfind(S{i}, 'G'));
end

Without a loop:

fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;              % cell array: positions of 'G' in each line
N = sum(cellfun(@numel,Ni)) ;      % total number of 'G' over all lines

1 comment

Lmm3 2017-7-25

KSSV, thank you for your response. Could you explain what the line S = S{1} is doing? The code returns the total number of "G" occurrences in the data file, but do you have a suggestion for how to get the "G" occurrences between each of the headers that begin with ">Rosalind"? For example, in the data set above, I would like to get 3 values: the number of G occurrences between “>Rosalind_4949” and “>Rosalind_7490”, between “>Rosalind_7490” and “>Rosalind_8337”, and below “>Rosalind_8337”.
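
On the first point: textscan returns a 1-by-1 cell array whose single element is itself a cell array holding the lines, so S = S{1} unwraps that outer cell. For per-record counts, one possible approach (a sketch, not something posted in this thread) is to label each line with the number of headers seen so far and then sum the per-line counts by label. The sample lines and the names isHeader, recordId, nG, and perRecordG are assumptions made for illustration, and the sketch assumes the first line is a header.

```matlab
% Assumed sample lines, standing in for S = S{1} after textscan.
S = {'>rec1'; 'GGAT'; 'GC'; '>rec2'; 'ATAT'; '>rec3'; 'GGGG'};

isHeader = strncmp(S, '>', 1);          % true on lines that start with '>'
recordId = cumsum(isHeader);            % record number for every line
nG = cellfun(@(s) sum(s == 'G'), S);    % 'G' count on each line
nG(isHeader) = 0;                       % ignore any 'G' in header text
perRecordG = accumarray(recordId, nG)   % one total per record
```

Replacing s == 'G' with s == 'G' | s == 'C', and summing line lengths the same way, would give the pieces needed for %GC per record.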

Answer 3

OCDER 2017-9-9

https://ww2.mathworks.cn/matlabcentral/answers/349955-read-text-file-lines-and-analyze#answer_280878

readFasta.m

If you deal with a lot of FASTA files, look into fastaread (MATLAB Bioinformatics Toolbox) or readFasta (code I wrote for another project, attached above).

cellfun and regexp are also handy tools for this kind of task.

To get GC %:

[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
   48.5876
   45.0000
   55.1724
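
The attached readFasta.m is not reproduced in this thread, so its implementation is unknown. The sketch below is a hypothetical stand-in, not OCDER's function: it splits assumed in-memory FASTA text on '>' to build Header and Seq cell arrays, then applies the same cellfun expression as above. It assumes '>' appears only at the start of header lines; the sample text and the names records and parts are made up for illustration.

```matlab
% Assumed FASTA text; in practice it could come from fileread('Seq.txt').
txt = sprintf('>rec1\nGGAT\nGC\n>rec2\nATAT\n');

records = strsplit(txt, '>');                     % one chunk per record
records = records(~cellfun(@isempty, records));   % drop the empty chunk before the first '>'

Header = cell(numel(records), 1);
Seq    = cell(numel(records), 1);
for k = 1:numel(records)
    parts = strtrim(strsplit(records{k}, sprintf('\n')));
    Header{k} = parts{1};          % first line of the chunk is the header text
    Seq{k}    = [parts{2:end}];    % remaining lines joined into one sequence
end

% Same expression as in the answer: percentage of G or C per sequence.
PercGC = cellfun(@(S) length(regexpi(S, 'G|C'))/length(S)*100, Seq)
```

In this sketch the leading '>' is not kept in Header; whether readFasta keeps it is not stated in the thread.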
