Best way to parse data from a large, mixed-format text file

I'm considering a new way to parse data from a large, mixed-format text data file. Currently we call a C file to parse the data, use Mex functions to store the data in Matlab-compatible structures, and then save the parsed data in a .m file. Then Matlab can just read in the .m file to access the desired data. (There is also a definition file that the C file reads in that allows customizing what data is returned in the structure.)
While this works, it is an old program, and system upgrades often cause library and Mex compatibility problems that must be resolved. I'd like to create a new data parser that can return data in a similar manner as the C parser (in fact I need to continue to support the same data output as the existing C-parser for existing scripts) but that allows some enhancements to the parsing. I'm looking for suggestions on how I might do this.
I'm considering Java (because I'm familiar with Java programming), but looking at the Matlab-Java interface, it seems to have only basic methods of transferring data from Java to Matlab.
For example, I would want to parse the data file into, say, an array of 2000 structures, and then pass that array of structures into Matlab. Fortunately each structure is relatively simple, and each field would be either a string, an array of numbers, or an array of strings (one field is an array of arrays of numbers, but that could be converted within Matlab).
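For illustration, the record layout described above could be sketched on the Java side roughly as follows. This is only a sketch under assumptions: the class names `ParsedRecord` and `RecordTable` and all field names are hypothetical and are not taken from the existing C parser. The idea is to expose the data through simple getters that return strings and primitive arrays, so the MATLAB side only has to deal with basic types.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical holder for one parsed record; field names are illustrative only.
 * In a real project each class would be public and live in its own file so that
 * it is visible outside its package; they share one listing here for brevity.
 */
class ParsedRecord {
    final String name;      // a single string field
    final double[] values;  // an array of numbers
    final String[] labels;  // an array of strings

    ParsedRecord(String name, double[] values, String[] labels) {
        this.name = name;
        this.values = values;
        this.labels = labels;
    }
}

/** Hypothetical container that exposes the records through simple getters. */
class RecordTable {
    private final List<ParsedRecord> records = new ArrayList<>();

    public void add(ParsedRecord r) {
        records.add(r);
    }

    public int size() {
        return records.size();
    }

    /** Returns the 'name' field of every record in a single call. */
    public String[] getNames() {
        String[] out = new String[records.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = records.get(i).name;
        }
        return out;
    }

    /** Returns a copy of the numeric data for record i (zero-based). */
    public double[] getValues(int i) {
        return records.get(i).values.clone();
    }

    /** Returns a copy of the string labels for record i (zero-based). */
    public String[] getLabels(int i) {
        return records.get(i).labels.clone();
    }
}
```

Returning whole columns such as `getNames()` keeps the number of cross-language calls small for a few thousand records. How each return type is converted on the MATLAB side should be checked against the documentation for the MATLAB release in use.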
  2 comments
Oleg Komarov 2011-5-12
You could post an example of the mixed-format text data file (here, or on a file-upload service with a link) and provide a brief description, if feasible, of what you want to import.
Scott 2011-5-12
The exact format of the file isn't that important. If it helps, imagine rows from several tables of a relational database stored in alternating blocks in a flat text file. I'm more curious about suggestions for a general approach rather than a specific routine to parse our particular data. I'm also curious whether anyone has written some kind of parser in Java and tried to import a large chunk of data into Matlab.
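As a rough illustration of the "alternating blocks" idea, a line-by-line Java reader could group rows by table as it streams through the file. The `#TABLE <name>` header line and the comma-separated rows below are invented for this example, since the real file format is not described here, and `BlockParser` is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlockParser {
    private static final String HEADER = "#TABLE ";

    /**
     * Assumed (hypothetical) layout: a line "#TABLE <name>" opens a block, and each
     * following non-blank line is one comma-separated row of that table. The same
     * table name may appear in several blocks; its rows are appended in file order.
     * Rows that appear before the first header are ignored.
     */
    static Map<String, List<String[]>> parse(Reader source) {
        Map<String, List<String[]>> tables = new LinkedHashMap<>();
        List<String[]> current = null;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue;
                }
                if (line.startsWith(HEADER)) {
                    String name = line.substring(HEADER.length()).trim();
                    current = tables.computeIfAbsent(name, k -> new ArrayList<>());
                } else if (current != null) {
                    // The -1 limit keeps trailing empty fields instead of dropping them.
                    current.add(line.split(",", -1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tables;
    }

    /** Convenience wrapper for small in-memory inputs, e.g. in tests. */
    static Map<String, List<String[]>> parseString(String text) {
        return parse(new StringReader(text));
    }
}
```

Because the reader handles one line at a time, the text itself is never held in memory as a whole; only the parsed rows are. For a real file, a `java.io.FileReader` (or a reader with an explicit charset) would take the place of the `StringReader`.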


Answers (1)

Jason Ross 2011-5-12
Have you considered putting the data into a database and then using database calls to get the data out?
It seems from your description that you have already implemented a database of sorts, with the mixed format text file and C program serving as the database.
If whatever was generating the text data had the ability to export to a database, it might be a very effective go-between that takes some of the maintenance and upkeep of the existing code out of your hands.
  5 comments
Oleg Komarov 2011-5-13
We sometimes had to import .txt files of up to 400 MB. I don't remember the exact time it took to process the whole batch, but it was on the order of a few minutes for 40 files totaling ~4 GB. If it suits you, you could post a reduced example of your file.
Jason Ross 2011-5-13
If you don't want to go the DB route, Perl is indeed another very good option. It also lets you move up a layer from C and will likely insulate you from the low-level churn you described, since your code is more likely to be portable across platforms.
MATLAB is also highly portable and available on multiple platforms. There are definitely benefits to an all-MATLAB solution in terms of maintenance and keeping all the requisite parts together.
It's also a question of what your colleagues and organization would like to work with. If you are a bunch of Java, C and M folks, then adding another language into the mix is likely not going to be a good match, since you're likely to spend more time on code maintenance than on designing the thing in the first place.
