Applying regexpi to csv file
5 次查看(过去 30 天)
显示 更早的评论
I'm learning about regexpi. I have a csv file:
State Cloudy Sunshine Rainy
Indiana 52 89 0
California 66 99 1
New York 78 95 1
Indiana 25 52 5
California 45 42 2
Indiana 22 26 5
Indiana 22 89 2
Using regexpi or if there is a better way: I need a data output for only Indiana and Rainy so:
Indiana_Rainy = 0,5,2
So I will be skipping over 2 data columns and 3 commas since it is a csv file.
Thanks,
Amanda
0 个评论
采纳的回答
Cedric
2013-5-2
编辑:Cedric
2013-5-2
So it is true that your CSV has not comma as delimiter?
The general approach is to use TEXTREAD, TEXTSCAN, FSCAN, etc when you have formatted data, and REGEXP/I when there is a lack of structure/format and you have to go for pattern matching.
If there is really no comma to separate columns in your CSV file, you are a bit in an intermediary situation, as the content looks really well structured, yet we cannot use the white-space as a separator/delimiter as some states names contain white-spaces.
For me, you are left with either pattern matching, or, if the structure of columns is such that there is always 12 characaters in the first column, 8 (7?) in the second, and 13 (?) in the third, you could use FGETL and split the content according to these boundaries.
I can develop both methods for you if you want an example.
Note that if you are interested in pattern matching and to spend ~ a day full time learning regular expressions (I'd say that within 3 hours full time, you would already have a good idea about regexp fundamentals), Mathworks wrote a very good section about them in "Programming Fundamentals" (p.2-26 to 2-86) available here: http://www.mathworks.com/help/pdf_doc/allpdf.html.
4 个评论
Cedric
2013-5-3
编辑:Cedric
2013-5-3
Using REGEXPI, the simplest solution is probably the following:
>> content = fileread('myData.csv') ;
>> state = 'Indiana' ;
>> pattern = sprintf('(?<=%s(\\s+\\d+){2}\\s+)\\d+', state) ;
>> match = regexpi(content, pattern, 'match') ;
>> rain = str2double(match)
rain =
0 5 5 2
Here the pattern matches from 1 to as many numeric characters as possible \d+ preceded by (positive look behind: (?<=expression)) an expression made of the state name (literal) followed by twice (expression){2} the expression/pattern defined by \s+\d+, meaning from 1 to as many white space as possible and from 1 to as many numeric characters as possible, and again from 1 to as many white spaces as possible \s+. This is a way to match values in each column and the variable amount of white spaces that follow.
The \ character being a special character in the format spec of SPRINTF, it must be escaped with an additional \, which explains the \\. Patterns outputted by SPRINTF have, however, the following structure:
'(?<=Indiana(\s+\d+){2}\s+)\d+'
or equivalently (avoiding the repetition with {2}):
'(?<=Indiana\s+\d+\s+\d+\s+)\d+'
EDIT: as mentioned earlier, structured file content is usually read using TEXTREAD/SCAN, FSCANF, etc, so you should focus on these functions first. If there is some default in the structure (e.g. no comma and unable to use space as delimiter because a column contains spaces in the values), you can generally recover the situation using REGEXP/I. Regular expressions are The most powerful tool, available in most serious languages, for performing pattern matching/replacement in text. If you think that you will have a significant amount of pattern matching to perform in your field/career, it is really worth investing a day full time to get enough mastery so you at least know what resources they can provide to you. You won't become a regexp guru within a day, but I can guarantee that you would be able to understand a practice enough with most of the material that is explained in the document mentioned earlier.
更多回答(0 个)
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Characters and Strings 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!