Parsing a file with multiple entries consisting of strings. Each entry contains a header followed by a descriptor. The objective is to use headers to obtain a subset of entries.

Question

George 2024-9-24

Direct link to this question: https://ww2.mathworks.cn/matlabcentral/answers/2155215-parsing-a-file-multiple-entries-consisting-of-strings-each-entry-contains-a-header-followed-by-a-de

Commented: Star Strider 2024-9-24

The following is an example of a file to be parsed.

Each entry contains a header line beginning with ">", followed by sequences of letters (amino acid descriptors). Please note that the sequences shown in the example are truncated. The objective is to use a list of headers such as ">FBtr0077276", ">FBtr0080587" and fish out both the headers and the corresponding amino acid sequences in the same format as that of the submitted file. The format is widely used in bioinformatics and is known as FASTA.

Thank you for your comments/help.

Example of input file. The headers are the lines beginning with ">" (highlighted in bold in the original post).

>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1
MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL
RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA
>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2
MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI
>FBtr0079925 | Cyp4e3 Cytochrome P450 4e3
MWLAVLALLVLPLITLVYFERKASQRRQLLKEFNGPTPVPILGNANRIGKNPAEILSTFF
>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5
MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD
>FBtr0081077 | Cyp310a1 Cytochrome P450 310a1
MWLLLPILLYSAVFLSVRHIYSHWRRRGFPSEKAGITWSFLQKAYRREFRHVEAICEAYQ
SGKDRLLGIYCFFRPVLLVRNVELAQTILQQSNGHFSELKWDYISGYRRFNLLEKLAPMF
>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1
MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL
RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA
>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2
MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI


Answer 1

Star Strider 2024-9-24

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/2155215-parsing-a-file-multiple-entries-consisting-of-strings-each-entry-contains-a-header-followed-by-a-de#answer_1522005

The Bioinformatics Toolbox has a number of functions for these files. The fastaread function appears to be appropriate. (I don’t have that Toolbox, I’m simply aware of some of its functions.)
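As a rough sketch of how fastaread could be combined with a list of headers: called with a single output, fastaread returns a structure array with Header and Sequence fields, which can then be indexed with a logical mask. This needs the Bioinformatics Toolbox. The file name, the `wanted` list and the variable names are hypothetical, the ID is assumed to be the first token of each header, and `writelines` (R2022a or later) is used only to create a small sample file so the sketch runs on its own.

```matlab
% Requires the Bioinformatics Toolbox. File name and ID list are hypothetical.
inFile = fullfile(tempdir, 'cyp_fastaread_sample.fasta');
writelines([
    ">FBtr0077276 | Cyp6v1 Cytochrome P450 6v1"
    "MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL"
    ">FBtr0079061 | Cyp28d2 Cytochrome P450 28d2"
    "MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI"
    ">FBtr0080587 | Cyp28a5 Cytochrome P450 28a5"
    "MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD"
    ], inFile);

wanted = {'FBtr0077276', 'FBtr0080587'};   % IDs to keep (assumed list)

data   = fastaread(inFile);                % struct array: fields Header and Sequence
ids    = strtok({data.Header}, ' |');      % first token of each header (no leading ">")
subset = data(ismember(ids, wanted));      % entries whose ID is in the list

% Report what was selected
for k = 1:numel(subset)
    fprintf('%s: %d residues\n', subset(k).Header, length(subset(k).Sequence));
end
```

Each Sequence field holds the whole sequence as a single character vector, so the line breaks of the original file are not retained in `subset`.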

2 Comments

George 2024-9-24

Moved: Star Strider 2024-9-24

Thank you for the prompt reply.

I've used the fastaread function.

[header,sequence] = fastaread(___)

A truncated output is shown below.

head(header)

{'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1' }

{'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2' }

{'FBtr0079925 | Cyp4e3 Cytochrome P450 4e3' }

{'FBtr0080587 | Cyp28a5 Cytochrome P450 28a5' }

head(sequence)

head(a)

{'MVYSTNILLAIVTILTGVFIWSR.....................'}

{'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}

I could generate a cell array from this and parse it, but unfortunately the sequence format is lost.
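One way to keep the original line layout is to skip sequence parsing altogether and filter the file line by line. The following is only a sketch in base MATLAB, under a few assumptions: the file names and the `wanted` list are hypothetical, the ID is taken to be the first token after ">", `readlines` needs R2020b or later, and `writelines` needs R2022a or later. It first writes a small sample file (entries copied from the question) so that it runs on its own.

```matlab
% Sample input built from entries in the question (hypothetical file names).
inFile  = fullfile(tempdir, "cyp_sample.fasta");
outFile = fullfile(tempdir, "cyp_subset.fasta");
sample = [
    ">FBtr0077276 | Cyp6v1 Cytochrome P450 6v1"
    "MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL"
    "RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA"
    ">FBtr0079061 | Cyp28d2 Cytochrome P450 28d2"
    "MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI"
    ">FBtr0080587 | Cyp28a5 Cytochrome P450 28a5"
    "MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD"
    ];
writelines(sample, inFile);

wanted = ["FBtr0077276", "FBtr0080587"];    % IDs to keep (assumed list)

fileLines = readlines(inFile, "EmptyLineRule", "skip");  % one string per line
isHeader  = startsWith(fileLines, ">");     % header lines begin with ">"
entryNo   = cumsum(isHeader);               % entry number of every line

% ID = first token after ">", ending at a space or "|"
ids = strtok(extractAfter(fileLines(isHeader), 1), ' |');

keepEntry = ismember(ids, wanted);          % one flag per entry
keepLine  = false(size(fileLines));
inEntry   = entryNo > 0;                    % ignore anything before the first header
keepLine(inEntry) = keepEntry(entryNo(inEntry));

writelines(fileLines(keepLine), outFile);   % selected lines, written unchanged
```

Because the lines are copied as they are, each sequence keeps its original line breaks. Entries that occur more than once in the input are written once per occurrence.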

Star Strider 2024-9-24


I am not certain what you intend by ‘the sequence format is lost’. I also don’t have your file, so I can’t run fastaread with it to test this.

Perhaps you could create a table with:

CYP = cell2table(sequence, 'RowNames',header)

That might work.

Experimenting with something like that —

header = {'FBtr0077276  | Cyp6v1 Cytochrome P450 6v1',
    'FBtr0079061  | Cyp28d2 Cytochrome P450 28d2'};
sequence = {{'MVYSTNILLAIVTILTGVFIWSR.....................'}
{'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}};
CYP_Table = cell2table(sequence, 'RowNames',header)
CYP_Table = 2x1 table
                                                                            sequence                         
                                                   __________________________________________________________

    FBtr0077276  | Cyp6v1 Cytochrome P450 6v1      {'MVYSTNILLAIVTILTGVFIWSR.....................'          }
    FBtr0079061  | Cyp28d2 Cytochrome P450 28d2    {'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}

This required a bit of manual editing because I don’t have the actual function outputs (and I don’t have actual experience with the function). It might be possible to avoid the manual edits, perhaps using cellfun, and maybe compose as well. (I can’t tell from here.)

It should be relatively straightforward to get the information from ‘CYP_Table’ after that, although I don’t know what you want to do with the data after reading it and creating the table (if that’s what you actually want to do).
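To illustrate pulling rows out of such a table by ID, here is a small sketch that rebuilds a two-row example and keeps only the rows whose row name starts with a requested ID. The `wanted` list and the variable names `ids` and `subset` are illustrative assumptions, and the sequences are stored here as plain character vectors rather than nested cells. Table row names must be unique, so a file with repeated headers would need its duplicates removed before passing the headers as 'RowNames'.

```matlab
header = {'FBtr0077276  | Cyp6v1 Cytochrome P450 6v1';
          'FBtr0079061  | Cyp28d2 Cytochrome P450 28d2'};
sequence = {'MVYSTNILLAIVTILTGVFIWSR.....................';
            'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '};
CYP_Table = cell2table(sequence, 'RowNames', header);

wanted = {'FBtr0077276'};                               % IDs to look up (assumed)
ids    = strtok(CYP_Table.Properties.RowNames, ' |');   % first token of each row name
subset = CYP_Table(ismember(ids, wanted), :);           % matching rows only

disp(subset)
```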

.
