searching through RNA sequence for a string of characters
1 次查看(过去 30 天)
显示 更早的评论
Hi all,
I am creating a program that gets a user input (txt file) of DNA nucleotides (random lists of A, G, C, T), and gives an output file of the RNA complement (got this done), the percentages of RNA nucleotides (got this done), and asks the user if they would like to search through the RNA sequence for a particular string of nucleotides—if the string exists in the RNA file, it will tell the location in the sequence and repeat that given string.
example: the user inputs "UACG," and my program says that string of basepairs occurred at location 59.
How can I do this? My current program operates by searching through the file on a per character basis, and converts each DNA basepair to an RNA basepair.
Thank you
0 个评论
回答(3 个)
Joseph Cheng
2017-4-24
so, do you store the full compliment data in an array such that you can
%gen dummy data
comp = 'AGCT';
list = comp(randi(4,1,100))
position = strfind(list,'CAT')
where position will be the index in the string that has CAT in it?
2 个评论
Joseph Cheng
2017-4-24
well then i would suggest that you do, unless the array is too large. if you read in the whole text file into an array you should be able to then use strfind to find the position, otherwise its a tricky swapping in and out of a buffer as you search for consecutive matches by looking one by one
dpb
2017-4-24
编辑:dpb
2017-4-24
Again, all the reason in the world to read the whole string at once as recommended earlier.
But, the answer to the question posed, given what you did previously is simply
idx=strfind(out,ASKEDFORSTRING);
AFTER the loop is complete. Fuggettabout trying to do this on a character-at-a-time basis.
idx will be empty if the sequence doesn't exist, or a vector of 1 to N position(s) where the sequence is found.
Done.
I can't emphasize enough how you should convert this from the looping you're doing to processing the input string as a vector...you'll find things will go much faster after you do, the code will become almost trivial by comparison and much simpler to develop/debug/maintain.
6 个评论
dpb
2017-4-24
编辑:dpb
2017-4-25
"My code is the same as from the previous post"
Can't be--you had to have processed the input string instead of continuing to try to read a file character-by-character or you would have had an FEOF error on the first next read.
We need to see the changes you did make...for the NaN result, it looks like you never collected a single count so the result was 0/0 for all.
ADDENDUM
OK, the input file is OK; one hurdle cleared.
What you need to do is after the above read statement, if you're still hung up on looping solution, replace the while loop with a counted for loop
for i=1:length(base)
and remove the fread statement at the end of the loop entirely.
Oh, just dawned on me--if you left the code alone, since base now refers to the entire string not just a single character, the various if clauses will never be satisfied. You must address each character in the string in turn:
for i=1:length(base)
if base(i)=='C'
...
etc., etc., etc., ...
But, of course, the thing to do when you read the full string is to use the vectorized solution that I showed you...there's where the real simplification comes from.
dpb
2017-4-25
编辑:dpb
2017-4-25
OK, while this isn't really answering the question of the thread (I did that up above), I fixed your script to operate on the full string. This isn't fully vectorized yet but goes a fair step along the way eliminating the unnecessary big loop and switch block for the lookup table approach.
% set up lookup table inputs
DNA=['C','G','T','A']; % base characters in DNA sequence
RNA=['G','C','A','U']; % corresponding characters in RNA
% get input file, do bookkeeping for output file, etc., ..
filename=uigetfile('*.txt','Select DNA sequence file');
if filename==0, error('User Canceled'),end
[dnaFile, message] = fopen(filename, 'r');
% %if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
outputFile = [filename(1:end-4) '.rna'];
else
outputFile = [filename '.rna'];
end
[rnaFile, message] = fopen(outputFile, 'w');
% now read full string into char array
dna=fread(dnaFile,'*char').'; % orient as row vector
fclose(dnaFile); clear dnaFile % we're done with input file
% and translate DNA --> RNA
rna=RNA(arrayfun(@(c) find(c==DNA),dna));
% compute counts, frequencies
counts=zeros(size(RNA)).';
for i=1:length(RNA)
counts(i)=sum(rna==RNA(i));
end
freqs=100*counts/sum(counts);
% output results
fprintf('Base pair frequencies: \n')
for i=1:length(RNA)
fprintf('%c: %6.2f\n',RNA(i),freqs(i));
end
fwrite(rnaFile, rna);
rnaFile=fclose(rnaFile); clear rnaFile
Sample run from your dna.txt--
> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
>>
The searching would be on the variable rna as outlined above after getting the desired sequence for which to search from the user.
I attached the m-file as well...HTH.
ADDENDUM
OK, I added the lookup; using the search string you typed in the original question the nul result is the expected answer...there is no sequence like that in the output string. I ran one and picked one I knew would work and results were:
>> dna2rna
Your output file will be: dna.txt.rna
Base pair frequencies:
G: 14.63
C: 31.71
A: 26.83
U: 26.83
Enter RNA sequence of interest: aauu
Sequence AAUU locations (empty report-->not found)
Occurrence Location
1 3
2 16
3 28
>>
So, the lookup does work if there is anything to be found.
2 个评论
dpb
2017-4-25
编辑:dpb
2017-4-25
Well, syntax looks ok but again we can't debug what we can't see, sorry.
Same old song, umpteenth verse. SHOW us enough we can see what you did and what you used to do it with (both code and data)...we can't see your terminal from here.
This can be done simply from command line with the output string you operated on and the search string as a minimum...
ADDENDUM
You did notice if you used my updated code that the output variable is now a more expressive name rna, not out, I hope?
If you're still stuck on your old code, need what said above...show us the string you're searching and for what...remember strfind is case-sensitive.
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Genomics and Next Generation Sequencing 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!