importing csv array of unknown size, characters

Question

0 个投票

I have an import script which I use to import data from a pressure/velocity flow sensor device. The data are of a general format, but the amount of header text varies, as does the rows & columns of the actual data. Each CSV file contains several "runs" or recordings of data. I need to import each run into a different array (so a cell works well). The one consistency is that each "run" is separated by the text "Run #x" where 'x' is the run number integer.

Originally, the software generating the CSV files was consistent in that all the CSV data always had 266 columns. With the help of another user here, I exploited that consistency in this code:

function data = import_KC(filename)
fid = fopen(filename);
run_num = 1;
%all runs contain n x m numbers and n is different for each run.
%So we cannot use a multidimensional array.
%We use a cell instead.
data = {};
while ~feof(fid)
    %Skip the header lines.  We don't know in advance how many there are
    %so cant use the header functionality in textscan.
    current_line = fgetl(fid);
    while ischar(current_line)
        match = strfind(current_line, 'Run #');
        if(match)
            break;
        end
        current_line = fgetl(fid);
    end
      %Load data.  We don't know dimensions in advance
      %res will initially contain all data from this run in a single column
      res=textscan(fid,'%f','Delimiter',',');
      res=res{1};
      %find number of rows.  Columns are fixed at 265
      rows = length(res)/265;
      %reshape into a matrix
      res = reshape(res,265,rows)';
      %Add this run to the set of runs
      data{run_num} = res;
      run_num=run_num+1;
  end
fclose(fid);
end

Now I'm trying to figure out a way to use the same algorythm, but with a variable number of columns, as the new firmware seems to vary the columns sometimes.

My thought was to first read the data (or a portion of it, say the first 100 rows) into an array of some sort using something like xlsread, then read the size of that array to get the number of columns, which would replace the "266" in the script above.

The problem with this is that xlsread is very slow for very large data files, and these CSV files are huge. I thought reading only the first 100 rows would solve this, but I can't figure out how to reference only some number of rows, without knowing the number of columns.

Is there another way to read in these data, such that I can extract the number of columns there should be, Or if you have other suggestions I'd be open to them. Thanks!

KC

ps - thanks to whomever it was who helped me with this script last time. I couldn't find the posts in the newsgroup, which is where I think I posted about this before.

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Follow Question

Answer 1

Geoff 2012-6-13

在 MATLAB Online 中打开

1 个投票

So the general thing is this... Keep track of the current state outside the loop:

line_num = 0;  % Current CSV line number
run_num = 0;   % Current run number
ncols = 0;     % Number of columns in current run
nrows = 0;     % Number of rows in current run
runs = [];     % Tracks actual run numbers (if they are not 1,2,3,4...)
rundata = {};  % The data for each run

I would use regexp to get the actual run number

while ~feof(fid)
    current_line = fgetl(fid);
    line_num = line_num + 1;
    % Detect a new run
    runtoks = regexp(current_line, '^Run #(\d+)', 'tokens');
    if iscell(runtoks) && numel(runtoks) == 1
      run_num = str2double(runtoks{1});
      runs(end+1) = run_num;
      rundata{end+1} = [];
      ncols = 0;
      nrows = 0;
      continue;
    end
    if run_num == 0
      fprintf('Warning: skipping line %d - no run header\n', line_num);
      continue;
    end
    % First time round for each run, detect number of columns.
    if ncols == 0
      ncols = sum(current_line == ',') + 1;
      rundata{end} = zeros(0,ncols);       % yes, a 0-by-ncols matrix
    end
    % Parse row of data from line, and check for sanity
    row = textscan( current_line, '%d', 'delimiter', ',' );
    if numel(row) ~= ncols
      fprintf('Column count mismatch on line %d (run %d)\n', line_num, run_num);
      continue;
    end
    % Append the row to the current run data
    nrows = nrows + 1;             % (*) Note: see discussion
    rundata{end}(nrows, :) = row;
end

Note (*): This could be pretty slow, because we're expanding the matrix for each run one row at a time. You can improve this in a similar fashion to how C++ vectors work: preallocate by some size (1000 rows), and grow the matrix by a constant or exponential amount when necessary:

if size( rundata{end}, 1 ) < nrows
  rundata{end}(end+1:end+1000, :) = 0;
end

In the case of doing the above, you would also need to store the size of each run in an array (like I do with the runs array), and trim them to size afterwards.

Beware, I just typed this code straight into my browser. It might not compile, and may have other problems. I have not handled headers at all cos you didn't describe your data file completely, so you need to sort that out. That aside, this is the general approach I would use to read arbitrary data sets unless I needed something really special.

6 个评论
显示 4更早的评论隐藏 4更早的评论

kc 2012-6-14

It *used* to have the same # of columns each time (266), but now it doesn't. Detecting the # of columns in the data was the crux of my question. Looking at the raw CSV data in a text editor, there doesn't seem to be anything to indicate a new row, just commas for new columns. But when I open it in Excel, Excel knows when new rows begin. Strange.

Geoff 2012-6-18

Oh, in that case you probably have UNIX -style line endings. Open the file without translation (ie omit the "t" part of the format for fopen()). What text editor did you use? Notepad I suppose. That is NOT a text editor in my opinion. Try Notepad++. Using fgetl() should work correctly if your newlines are being detected properly.

请先登录，再进行评论。

importing csv array of unknown size, characters

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

回答（1 个）

6 个评论
显示 4更早的评论隐藏 4更早的评论

类别

标签

Community Treasure Hunt

importing csv array of unknown size, characters

0 个评论 显示 -2更早的评论 隐藏 -2更早的评论

回答（1 个）

6 个评论 显示 4更早的评论 隐藏 4更早的评论

类别

标签

另请参阅

Community Treasure Hunt

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

6 个评论
显示 4更早的评论隐藏 4更早的评论