Using textscan to skip specific rows
7 次查看(过去 30 天)
显示 更早的评论
Right now I'm interested in using textscan to read specific rows of a .csv file which looks something like this:
type value latency
92 trigger 19593
type value latency
20 trigger 20000
type value latency
14 trigger 21000
type value latency
14 trigger 21000
type value latency
14 trigger 21000
type value latency
14 trigger 21000
type value latency
14 trigger 21000
type value latency
14 trigger 21000
type value latency
14 trigger 21000
I want to have textscan read the rows which have the values "92" and "19593", but to ignore the rows that are blank or the rows that have the words "type" "value" "latency". In addition, the documents I am scanning are of varying lengths, and I want textscan to automatically figure out which rows have the values of interest "92, 20, 14" and "19593" (latency values). How would I have textscan automatically scan the numerical values and ignore all other rows? In addition, I am not interested in scanning the column "value" at all. I only need to scan the two columns "type" and "latency".
Thank you in advance.
0 个评论
采纳的回答
Stephen23
2017-6-8
编辑:Stephen23
2017-6-8
Second attempt: this time we will first convert the awful file format into a standard CSV format with one header line, no empty rows, no quoted numbers, and no repeated headers. I used regexprep and simply defined appropriate regular expressions:
fnm = 'events file.csv';
str = fileread(fnm);
rgx = {'\s*\n\s*"type"[a-z",]+','"(\d+)\s*"'};
rep = {'','$1'};
str = regexprep(str,rgx,rep);
You could also save str if you want an improved CSV file. Now we can easily extract the data from this improved CSV string:
fmt = '%f%*q%f%f%f';
opt = {'HeaderLines',1, 'CollectOutput',true, 'Delimiter',','};
C = textscan(str,fmt,opt{:});
M = C{1}
giving:
M =
92 14287 0 1
20 16459 0 2
14 19147 0 3
14 20646 0 4
14 22047 0 5
14 24431 0 6
14 26844 0 7
14 28819 0 8
14 30071 0 9
14 31843 0 10
14 34788 0 11
14 36883 0 12
14 38439 0 13
14 39571 0 14
14 40691 0 15
14 41368 0 16
14 42443 0 17
20 42708 0 18
14 44955 0 19
14 46092 0 20
14 47771 0 21
14 48632 0 22
14 52780 0 23
14 54312 0 24
14 56616 0 25
14 58724 0 26
14 59831 0 27
14 61911 0 28
14 62859 0 29
14 64884 0 30
14 67534 0 31
14 68367 0 32
14 70052 0 33
20 70311 0 34
14 72450 0 35
14 73476 0 36
14 75143 0 37
14 76615 0 38
20 76622 0 39
Tested on this file:
更多回答(1 个)
Stephen23
2017-6-6
编辑:Stephen23
2017-6-7
One approach is to play with the EndOfLine character (a little hacky, I know):
opt = {'MultipleDelimsAsOne',true, 'EndOfLine','p',...
'CollectOutput',true, 'HeaderLines',1};
fmt = '%*s%*s%*s%f%*s%f';
fid = fopen('temp2.txt','rt');
C = textscan(fid,fmt,opt{:});
fclose(fid);
M = C{1}(1:2:end,:);
giving:
>> M
M =
92 19593
20 20000
14 21000
14 21000
14 21000
14 21000
14 21000
14 21000
14 21000
As long as the file is not huge then this will be a reasonably effective method, although it does create a larger output matrix than required.
Note that your comment that the file "looks something like this" and the misleading "a .csv file" (whereas there is not a single comma anywhere in your example data) means that you are not providing an accurate enough specification. Reading files requires a very precise specification: every newline, every separator, non-printable characters, trailing spaces, trailing delimiters, repeated space characters, etc, etc, makes a huge difference to how a file is read. Because you did not provide us with a sample file I created my own one to test my code on, available here:
另请参阅
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!