Applying regexpi to csv file

Question


I'm learning about regexpi. I have a csv file:

State      Cloudy Sunshine  Rainy
Indiana      52      89          0
California  66      99          1
New York    78      95          1
Indiana      25      52          5
California  45      42          2
Indiana      22      26          5
Indiana      22      89          2

Using regexpi or if there is a better way: I need a data output for only Indiana and Rainy so:

Indiana_Rainy = 0,5,2

So I will be skipping over 2 data columns and 3 commas since it is a csv file.

Thanks,

Amanda


Answer 1

Cedric 2013-5-2

Edited: Cedric 2013-5-2


So is it true that your CSV file does not use commas as delimiters?

The general approach is to use TEXTREAD, TEXTSCAN, FSCANF, etc. when you have formatted data, and REGEXP/I when there is a lack of structure/format and you have to go for pattern matching.

If there really is no comma separating the columns in your CSV file, you are in a somewhat intermediate situation: the content looks well structured, yet we cannot use white space as the separator/delimiter because some state names contain spaces.

As I see it, you are left with either pattern matching or, if the column structure is such that there are always 12 characters in the first column, 8 (7?) in the second, and 13 (?) in the third, using FGETL and splitting each line at these boundaries.

I can develop both methods for you if you want an example.
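A minimal sketch of the fixed-width idea, under loud assumptions: the column widths 12, 8 and 13 are taken from the guess above and are not confirmed for the real file, so the sketch first writes its own small sample file with exactly those widths (using the first four rows of the question's data) and then reads it back with FGETL. The file name and variable names are hypothetical.

```matlab
% Assumed column widths (hypothetical; adjust to the real file).
w = [12 8 13];
rainStart = sum(w) + 1;               % Rainy starts right after the third column

% Write a small fixed-width sample file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%-12s%-8s%-13s%s\n', 'State', 'Cloudy', 'Sunshine', 'Rainy');
rows = {'Indiana', 52, 89, 0; 'California', 66, 99, 1; ...
        'New York', 78, 95, 1; 'Indiana', 25, 52, 5};
for k = 1:size(rows, 1)
    fprintf(fid, '%-12s%-8d%-13d%d\n', rows{k, :});
end
fclose(fid);

% Read it back line by line and cut each line at the column boundaries.
fid = fopen(fname, 'r');
fgetl(fid);                           % skip the header line
state = {};
rain  = [];
tline = fgetl(fid);
while ischar(tline)                   % fgetl returns -1 at end of file
    state{end+1} = strtrim(tline(1:w(1)));              %#ok<AGROW>
    rain(end+1)  = str2double(tline(rainStart:end));    %#ok<AGROW>
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);

% Keep only the Rainy values of the rows whose state is Indiana.
Indiana_Rainy = rain(strcmpi(state, 'Indiana'));
```

Because the state name is cut out by position rather than by delimiter, a name such as New York causes no trouble, but the approach breaks as soon as one line deviates from the assumed widths.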

Note that if you are interested in pattern matching and willing to spend about a full day learning regular expressions (I'd say that after 3 hours of focused work you would already have a good idea of the regexp fundamentals), MathWorks wrote a very good section about them in "Programming Fundamentals" (p.2-26 to 2-86), available here: http://www.mathworks.com/help/pdf_doc/allpdf.html.


Cedric 2013-5-2

Edited: Cedric 2013-5-3


I missed the part where you mention the commas in your question. So you have commas in the file? If your file looks like

 State,Cloudy,Sunshine,Rainy
 Indiana,52,89,0
 California,66,99,1
 New York,78,95,1
 Indiana,25,52,5
 California,45,42,2
 Indiana,22,26,5
 Indiana,22,89,2

and not like in your question, you can just go for a solution based on a formatted read of some sort, e.g.:

 >> [state,cloud,sunshine,rain] = textread('myData.csv', ...
            '%s %d %d %d', 'delimiter', ',', 'headerlines', 1)
 state = 
    'Indiana'
    'California'
    'New York'
    'Indiana'
    'California'
    'Indiana'
    'Indiana'
 cloud =
    52
    66
    78
    25
    45
    22
    22
 sunshine =
    89
    99
    95
    52
    42
    26
    89
 rain =
     0
     1
     1
     5
     2
     5
     2

Then use STRCMPI to find relevant indices in state:

>> id = strcmpi(state, 'Indiana') ;

which you can use to get relevant rain data

 >> rain(id)
 ans =
     0
     5
     5
     2
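For reference, MathWorks documents TEXTREAD as not recommended and points to TEXTSCAN instead. The following is a sketch of the same read-and-filter steps with TEXTSCAN, not a definitive implementation. It assumes the comma-delimited layout shown above, and it writes that sample to a temporary file (a hypothetical stand-in for myData.csv) so that it can be run on its own.

```matlab
% Write the comma-delimited sample to a temporary file (stand-in for myData.csv).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    'State,Cloudy,Sunshine,Rainy', ...
    'Indiana,52,89,0', ...
    'California,66,99,1', ...
    'New York,78,95,1', ...
    'Indiana,25,52,5', ...
    'California,45,42,2', ...
    'Indiana,22,26,5', ...
    'Indiana,22,89,2');
fclose(fid);

% Read one text column and three numeric columns, skipping the header line.
% With ',' as the only delimiter, a name such as 'New York' stays in one field.
fid = fopen(fname, 'r');
C = textscan(fid, '%s %f %f %f', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
delete(fname);

state = C{1};                        % cell array of state names
rain  = C{4};                        % column vector of Rainy values

id = strcmpi(state, 'Indiana');      % case-insensitive logical index
Indiana_Rainy = rain(id).';          % row vector of the selected values
```

TEXTSCAN returns all columns in a single cell array, so the columns are picked out by position (C{1}, C{4}) rather than through separate output arguments as with TEXTREAD.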

If you don't have commas, maybe you have tab characters instead and you can achieve the same thing using

 >> [state,cloud,sunshine,rain] = textread('myData.csv', ...
            '%s %d %d %d', 'delimiter', '\t', 'headerlines', 1)

If none works because you have neither commas nor tabs, we can discuss my first proposals.

Amanda 2013-5-3

Thank you so much. I am learning a lot, and I will build upon this knowledge.

Cedric 2013-5-3

Edited: Cedric 2013-5-3


Using REGEXPI, the simplest solution is probably the following:

 >> content = fileread('myData.csv') ;
 >> state = 'Indiana' ; 
 >> pattern = sprintf('(?<=%s(\\s+\\d+){2}\\s+)\\d+', state) ; 
 >> match = regexpi(content, pattern, 'match') ; 
 >> rain = str2double(match)
 rain =
     0     5     5     2

Here the pattern matches one or more digits (\d+) that are preceded by (positive lookbehind, (?<=expression)) the state name (a literal), followed by two repetitions ((expression){2}) of the sub-pattern \s+\d+, i.e. one or more white-space characters and then one or more digits, followed by one more run of white space (\s+). Each \s+\d+ consumes one numeric column together with the variable amount of white space in front of it, so the two repetitions skip the Cloudy and Sunshine columns and the final \d+ lands on the Rainy value.

Since \ is a special character in the format spec of SPRINTF, it must be escaped with an additional \, which explains the \\. The pattern that SPRINTF actually produces is therefore:

'(?<=Indiana(\s+\d+){2}\s+)\d+'

or equivalently (avoiding the repetition with {2}):

'(?<=Indiana\s+\d+\s+\d+\s+)\d+'
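As a quick check of the points above, the sketch below builds the pattern with SPRINTF, compares it with the two literal forms, and applies them to the white-space-separated data from the question held in a character vector instead of a file. It also shows a token-based variant that avoids the lookbehind. That variant is an alternative added here for illustration and not part of the original answer, and the variable names are hypothetical.

```matlab
% The question's data as one character vector (one line per row).
content = sprintf('%s\n', ...
    'State      Cloudy Sunshine  Rainy', ...
    'Indiana      52      89          0', ...
    'California  66      99          1', ...
    'New York    78      95          1', ...
    'Indiana      25      52          5', ...
    'California  45      42          2', ...
    'Indiana      22      26          5', ...
    'Indiana      22      89          2');

state = 'Indiana';
p0 = sprintf('(?<=%s(\\s+\\d+){2}\\s+)\\d+', state);   % built with SPRINTF
p1 = '(?<=Indiana(\s+\d+){2}\s+)\d+';                  % same pattern, literal
p2 = '(?<=Indiana\s+\d+\s+\d+\s+)\d+';                 % repetition written out

sameString = strcmp(p0, p1);           % true: SPRINTF collapses each \\ to \

m1 = regexpi(content, p1, 'match');
m2 = regexpi(content, p2, 'match');
sameMatches = isequal(m1, m2);         % both forms select the same values
rain = str2double(m1);

% Alternative without lookbehind: match the whole row prefix and capture
% only the last number as a token.
tok   = regexpi(content, [state '\s+\d+\s+\d+\s+(\d+)'], 'tokens');
rain2 = cellfun(@(t) str2double(t{1}), tok);
```

The token form returns one nested cell per match, hence the extra t{1}; it can be easier to read when several columns have to be captured at once.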

EDIT: as mentioned earlier, structured file content is usually read using TEXTREAD/TEXTSCAN, FSCANF, etc., so you should focus on these functions first. If there is some defect in the structure (e.g. no comma, and no way to use the space as a delimiter because a column contains spaces in its values), you can generally recover the situation using REGEXP/I. Regular expressions are the most powerful tool, available in most serious languages, for performing pattern matching/replacement in text. If you think that you will have a significant amount of pattern matching to do in your field/career, it is really worth investing a full day to get enough mastery that you at least know what they can offer you. You won't become a regexp guru within a day, but I can guarantee that you would be able to understand, and get enough practice with, most of the material explained in the document mentioned earlier.
