Textscan with different formats

Vick 2017-10-24
Edited: Vick 2017-10-25
Hi,
I'm not familiar with reading text files or with the formats involved. Here is my problem: I have been trying to read a text file that has an unknown number of rows and columns, a '\t' delimiter, and column headers spanning more than 2 rows (the second is a units row, which I don't need; only the first is considered). I was using importdata to read the text and the data separately, and it was working fine, but yesterday I found a problem: my input text file contains '*' for missing data, which is treated as a character and as a row header during import.
Hundreds of questions have been asked about reading text files. I've found solutions such as readtable, importing as char and converting with str2double (which is slow), and readtext (File Exchange), but none of them is as fast as the importdata function.
What I want is to read only the numeric data from the text file, replacing characters with NaN during the import itself, as xlsread does. I understand this can be done with textscan, but I was unable to work out the formatspec for my files. Alternatively, a faster str2double function would help.
When I give the formatspec as ('%s %f'), is the first row taken as a string, or the first column?
Note: the text file size is 100000 rows by 600 columns. In some files the second column (units) may not be present, so the data starts from the second column itself. Also, if the delimiter changes to ',' for another file, how can I auto-detect the delimiter?
4 comments
Stephen23 2017-10-25
@surey: are the missing data always in that column, or can they occur in other columns as well?
Vick 2017-10-25
Edited: Vick 2017-10-25
Hi, there are more than 20 columns with missing data in my actual file, and they can be any column. Additionally, the missing data may be a single row in any column, rather than a whole column.


Answers (1)

Walter Roberson 2017-10-24
"When i give formatspec as ('%s %f') is the first row is taken as string or the first column?"
No, neither. textscan() loops, continuing from the current file position, which might be in the middle of a line. If your format reads only a portion of a line, the rest of the line is not discarded before the format is used again; instead, the file position is left in the middle of the line, and textscan loops and applies the format again from wherever it is.
For example, in the file
abc 123 456 789 1011
def
then a "%s%f" format would first read the 'abc' with the %s format, then read the 123 with the numeric format, temporarily leaving the textscan output as {{'abc'}, [123]}. textscan would then re-apply the format from where it stopped, reading '456' with the %s format and 789 with the numeric format, updating the textscan output to {{'abc'; '456'}, [123; 789]}. Then the %s would grab the 1011, and the %f would choke on the def of the next line, leaving you with {{'abc'; '456'; '1011'}, [123; 789]}. Notice that the numeric column is shorter than the text column, because textscan gave up reading before that column was updated.
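To see this for yourself, here is a minimal sketch that feeds the same two lines to textscan as a character vector instead of a file (textscan accepts text input directly). The variable names are only illustrative.

```matlab
% Rebuild the two-line example as text rather than opening a file.
txt = sprintf('abc 123 456 789 1011\ndef');

% Two format items applied repeatedly to five-plus-one tokens.
C = textscan(txt, '%s%f');

% Compare how many entries each output received.
fprintf('text entries: %d, numeric entries: %d\n', numel(C{1}), numel(C{2}));
disp(C{1});
disp(C{2});
```

If the behaviour is as described above, the text output should hold one more entry than the numeric output.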
Now, if you happen to have the same number of format items as you have columns, then the effect is that each format item applies to one column. But if you hit a row with a missing entry that is implied only by spacing (no explicit delimiter between fields), or you have a numeric format specifier but encounter a string instead and do not have TreatAsEmpty set, or a %s column unexpectedly has a space in it, then in any of those circumstances the nice correspondence between column and format specifier will get messed up.
One of the key things you need to know about textscan() is that, unless you have set 'WhiteSpace' to exclude the space character, blanks starting at the current position are discarded at the beginning of every format specifier, even if the format specifier is %c or %s or %[]. This makes it tricky to deal with optional fields that are replaced by blanks (unless you happen to be using a field separator such as comma or tab). The immediate thought might be to just remove space from 'WhiteSpace', but when that parameter does not include space, leading spaces are an error for numeric fields! I showed how to get around that in https://www.mathworks.com/matlabcentral/answers/361377-textscan-failing-to-read-data-in-text-file#answer_286302
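As a rough sketch of keeping format items and columns aligned when '*' marks missing values: the sample text below is made up (one header line, tab-delimited, three columns) and is not the poster's file. The format is built from the column count of the first line, and 'TreatAsEmpty' turns '*' into NaN in the numeric fields.

```matlab
% Hypothetical tab-delimited sample: one header line, '*' marks missing data.
txt = sprintf('t\tp\tq\n1\t*\t3\n4\t5\t*\n');

% Count the columns in the first line so there is one %f per column.
firstLine = strtok(txt, char(10));              % text up to the first newline
nCols = numel(strsplit(firstLine, char(9)));    % split on tab
fmt = repmat('%f', 1, nCols);

% Skip the header line, treat '*' as an empty numeric field (NaN by default),
% and collect the numeric columns into a single matrix.
C = textscan(txt, fmt, 'Delimiter', '\t', 'HeaderLines', 1, ...
    'TreatAsEmpty', '*', 'CollectOutput', true);
data = C{1};
disp(data);
```

For a real file you would pass a file identifier from fopen instead of txt, and raise 'HeaderLines' to 2 if a units row follows the names. If there is a leading text column, that column would need a %s item instead of %f.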
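For the easier case mentioned in parentheses, where an explicit field separator is present, a small sketch with made-up comma-separated text shows that an empty field between delimiters keeps its place and comes back as NaN (the default 'EmptyValue' for numeric fields):

```matlab
% Hypothetical comma-separated sample with empty fields.
txt = sprintf('1,,3\n,5,6\n');

% With an explicit delimiter, each %f still lines up with its column.
C = textscan(txt, '%f%f%f', 'Delimiter', ',', 'CollectOutput', true);
data = C{1};
disp(data);
```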
1 comment
Vick 2017-10-25
Hi Roberson,
Thanks for the detailed explanation. I'm now able to specify the formatspec for simpler problems, but I'm still struggling to specify it for my problem.
I attached the file in my reply to @Stephen Cobeldick's comment: https://in.mathworks.com/matlabcentral/answers/362921-textscan-with-different-formats#comment_496926