How to extract only the type of the membrane protein from its FASTA file header?
Hi. I am trying to prepare some training data for, let's say, a learning machine. I have extracted some features from a FASTA file. Now I want to specify the class of each training instance, which is the "type" of the membrane protein, represented as part of the FASTA header. I just don't know how to access this part of the header, or how to code each of the five types with a number. I have access to the header through fastaread. I can see a regular pattern in the header representation; it is just that this regularity changes slightly from instance to instance. Consider the following three cases:
41BB_HUMAN Q07011 homo sapiens (human). 4-1bb ligand receptor precursor (t-cell antigen 4-1bb homolog) (t-cell antigen ila) (cd137 antigen). 11/97
A33_HUMAN Q99795 homo sapiens (human). a33 antigen precursor. 11/97
A4_DROME P14599 drosophila melanogaster (fruit fly). beta-amyloid-like protein precursor. 11/97
Can anybody help, please?
3 Comments
Luuk van Oosten
2015-1-25
Edited: Luuk van Oosten
2015-1-25
Could you change the end of the link to .html, please (instead of .ht)? This one does not work.
If I understand you correctly, you already have the list of proteins mentioned in this article, and you know which of the five classes each one belongs to. So if I say SSR3_HUMAN, you say 'multipass transmembrane protein', right?
And now you want to extract this SSR3_HUMAN part from the header in the FASTA file, correct?
Answers (1)
Luuk van Oosten
2015-1-25
Got an idea. Import your FASTA file, then put the headers in one column and the corresponding sequences in the next (or anything that works for you).
Something like:
File = 'C:\Users\Documents\your_fastafile.fasta';
your_data = fastaread(File);
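Note that both the header and the amino acid sequence are displayed in quotation marks in the generated struct, because each field is stored as a character array.
Now copy the headers out of the struct into a plain cell array of strings with something similar to the following: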
% Copy each Header field (a char array) into a column cell array
header_no_quotationmarks = cell(length(your_data), 1);
for i = 1:length(your_data)
    header_no_quotationmarks{i,1} = your_data(i).Header;
end
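The same column of headers can also be collected without a loop. This is only a minimal sketch: it assumes `your_data` is the struct array returned by fastaread, and it builds a small stand-in struct by hand (with placeholder sequences) so the snippet runs without the Bioinformatics Toolbox or a file on disk.

```matlab
% Stand-in for the output of fastaread. The headers are copied from the
% question; the sequences are short placeholders, not the real ones.
your_data = struct( ...
    'Header',   {'A33_HUMAN Q99795 homo sapiens (human). a33 antigen precursor. 11/97', ...
                 'A4_DROME P14599 drosophila melanogaster (fruit fly). beta-amyloid-like protein precursor. 11/97'}, ...
    'Sequence', {'MKT', 'MLS'});

% your_data.Header expands to a comma-separated list of all Header fields;
% the braces collect that list into a cell array, and the transpose turns
% the row into a column.
headers = {your_data.Header}';

disp(headers{1});
```

Each element of `headers` is an ordinary char array, so it can be passed straight to regexp or strtok.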
(This could be simpler, but I had part of it lying around from another project, so I'm just copy-pasting here.) Now you want to extract the part that identifies your protein from the header, so let us take an example:
AMA1_PLACH P16445 plasmodium chabaudi. apical membrane antigen 1 precursor (merozoite surface antigen)
This can be treated as a string. And as you already noted yourself, there is some regularity in these strings: they all start with what you want, the AMA1_PLACH part. If you obtain the FASTA file from other sources, the headers tend to start with something like
sp|P0C2K0|A1KB_LOXBO etc. etc.
where the sp|P0C2K0| part will screw things up. Let me assume that ALL your headers start with the info you want. Anyway, we have a lot of strings containing the full header. Now take:
str = 'AMA1_PLACH P16445 plasmodium chabaudi. apical membrane antigen 1 precursor (merozoite surface antigen)'
Now use regexp (see help regexp or the online documentation for regexp for more info).
g = regexp(str, ' ', 'split');
This generates 'g', a cell array holding the pieces of your string 'str', split (hence 'split') wherever a space occurs (the ' ' argument to regexp).
If you now request
g{1}
MATLAB looks at 'g' and returns its first element... your AMA1_PLACH!
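If only the leading field is needed, the split step can be skipped altogether. A short sketch of two alternatives, strtok and regexp with the 'match' and 'once' options, both applied to the same example string:

```matlab
str = 'AMA1_PLACH P16445 plasmodium chabaudi. apical membrane antigen 1 precursor (merozoite surface antigen)';

% strtok returns everything up to the first whitespace character.
id_a = strtok(str);

% '^\S+' matches the leading run of non-whitespace characters;
% 'once' makes regexp return a char array rather than a cell array.
id_b = regexp(str, '^\S+', 'match', 'once');

disp(id_a);   % AMA1_PLACH
disp(id_b);   % AMA1_PLACH
```

Both variants also tolerate tabs between fields, whereas splitting on ' ' only breaks at literal spaces.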
So write your script/program/function to loop over all your data: take each header as a string, split it at every space, take g{1}, and store that in a cell array (whatever works for you; you probably want it next to the amino acid sequence).
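The loop described here could look roughly like the sketch below. The `headers` cell array is a hand-built stand-in for the headers read by fastaread, and the extra split on '|' is an assumption about how to cope with headers of the sp|P0C2K0|A1KB_LOXBO kind mentioned earlier; adjust it if your headers are laid out differently.

```matlab
% Stand-in header list; in practice this would be filled from the struct
% returned by fastaread.
headers = { ...
    'A33_HUMAN Q99795 homo sapiens (human). a33 antigen precursor. 11/97'; ...
    'A4_DROME P14599 drosophila melanogaster (fruit fly). beta-amyloid-like protein precursor. 11/97'; ...
    'sp|P0C2K0|A1KB_LOXBO etc. etc.'};

entry_names = cell(size(headers));            % preallocate the result column
for i = 1:numel(headers)
    g = regexp(headers{i}, ' ', 'split');     % split the header at spaces
    first_field = g{1};                       % leading field of the header
    % If the field looks like 'sp|ACCESSION|NAME', keep only the last part;
    % a field without '|' comes back unchanged.
    parts = regexp(first_field, '\|', 'split');
    entry_names{i} = parts{end};
end

disp(entry_names);
```

Because `entry_names` has one row per record, it can sit next to the sequences or the feature matrix without any reordering.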
Maybe you can have a look here; it is where I got the idea for the regexp. You can write this in many different ways, and this may not be the most elegant, but hey, it seems to work (at least for my own mini FASTA file).
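For the second half of the question, coding each type with a number, one possible sketch follows. It assumes you already have a lookup table from entry name to type, as discussed in the comments. Only the SSR3_HUMAN pairing echoes that discussion; the other entry names and type strings are placeholders invented for illustration, not real data.

```matlab
% Hypothetical lookup table: entry name -> membrane protein type.
% 'XXX1_HUMAN', 'XXX2_HUMAN' and the 'placeholder type' strings are made up.
type_of = containers.Map( ...
    {'SSR3_HUMAN', 'XXX1_HUMAN', 'XXX2_HUMAN'}, ...
    {'multipass transmembrane protein', 'placeholder type A', 'placeholder type B'});

% Entry names as extracted from the headers (g{1} for each record).
entry_names = {'XXX2_HUMAN'; 'SSR3_HUMAN'; 'XXX1_HUMAN'; 'SSR3_HUMAN'};

% Look up the type string for every training instance.
type_labels = cellfun(@(name) type_of(name), entry_names, ...
                      'UniformOutput', false);

% unique returns the sorted distinct types and, as its third output, the
% position of each label in that sorted list, which serves as the class number.
[type_names, ~, class_id] = unique(type_labels);

disp(type_names);
disp(class_id(:)');   % 3 1 2 1
```

With five real types, `class_id` would run from 1 to 5, and `type_names` records which number stands for which type.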