Conversion of mat to csv is time and space complex

Question

Philipp Kutschmann 2024-1-21

Answered: Shubham 2024-3-15

I have a script that converts .mat data files to CSV for further processing in Python.

The .mat file contains a 'debug' structure with ~320 timeseries.

The problem is in line 36, the call to the cell2table function. It takes ages and fills my 16 GB of memory, even though the original .mat file is at most 270 MB.

What would be a better way of doing this?

% Load the .mat file
data_struct = load('path\to\mat_file');
disp('data_struct loaded');
% Get the struct variable
data = data_struct.debug;
% Access the struct field names
field_names = fieldnames(data);
% Extract the timestamps from the first timeseries
timestamps = data.(field_names{1}).Time;
% Initialize a cell array to store data
csvData = cell(length(timestamps)+1, length(field_names)+1);
disp('csvData initialized');
% Set the header row with field names and 'time' in the first column
csvData(1, 1) = {'time'};
csvData(1, 2:end) = field_names;
% Fill in the timestamps in the first column
csvData(2:end, 1) = num2cell(timestamps);
disp('timestamps created');
% Fill in the timeseries data in the remaining columns
for i = 1:length(field_names)
    current_field_name = field_names{i};
    timeseriesData = data.(current_field_name).Data;
    csvData(2:end, i+1) = num2cell(timeseriesData);
    fprintf('Data %d/%d has been written to cell \n', i, length(field_names));
end
% Convert the cell array to a table
disp('Converting cell to table');
csvTable = cell2table(csvData);
disp('Converting done');
% Write the table to a CSV file
writetable(csvTable, 'output.csv');
disp('csv created');

5 Comments

Star Strider 2024-1-21

I haven’t used Python (yet). I’ll keep that in mind.

Philipp Kutschmann 2024-1-21

Ok, thanks for the answer. I'll try it with python but the compression might still be a problem there as well.

Answer 1

Shubham 2024-3-15

Hi Philipp,

Converting a large .mat file with many timeseries into a CSV format can indeed be memory-intensive, especially when using cell arrays and tables due to the overhead associated with these data structures. A more efficient approach might be to write directly to the CSV file without converting the entire dataset into a table first. This can significantly reduce memory usage and potentially speed up the process.
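To get a feel for the overhead mentioned above, the following minimal sketch compares the memory footprint of the same values stored as a numeric column and as a cell array produced by num2cell. The sample count and the random stand-in data are assumptions for illustration only, and the exact byte counts depend on the MATLAB version.

```matlab
% Compare the memory footprint of identical values stored as a numeric
% column and as a cell array (one cell per value, as num2cell produces).
n = 1e5;                      % hypothetical sample count, for illustration only
numericColumn = rand(n, 1);   % stand-in for the Data of one timeseries
cellColumn = num2cell(numericColumn);

% whos returns a struct whose 'bytes' field holds the size in memory
numericInfo = whos('numericColumn');
cellInfo = whos('cellColumn');

fprintf('numeric: %d bytes, cell: %d bytes, ratio: %.1f\n', ...
    numericInfo.bytes, cellInfo.bytes, cellInfo.bytes / numericInfo.bytes);
```

The cell version is larger because every cell carries its own header in addition to the 8 bytes of the double it holds, and the question's script pays that cost for every sample of all ~320 timeseries.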

Here's how you can modify your script to write directly to a CSV file:

% Load the .mat file
data_struct = load('path\to\mat_file');
disp('data_struct loaded');
% Get the struct variable
data = data_struct.debug;
% Access the struct field names
field_names = fieldnames(data);
% Extract the timestamps from the first timeseries
timestamps = data.(field_names{1}).Time;
% Open a file for writing
fid = fopen('output.csv', 'w');
% Write the header row with field names and 'time' in the first column
fprintf(fid, 'time,');
fprintf(fid, '%s,', field_names{1:end-1});
fprintf(fid, '%s\n', field_names{end});
% Write the data rows
for i = 1:length(timestamps)
    % Write the timestamp
    fprintf(fid, '%f,', timestamps(i));
    
    % Write the timeseries data for each field
    for j = 1:length(field_names)
        current_field_name = field_names{j};
        timeseriesData = data.(current_field_name).Data(i);
        if j == length(field_names)
            % For the last field, end the line
            fprintf(fid, '%f\n', timeseriesData);
        else
            fprintf(fid, '%f,', timeseriesData);
        end
    end
    
    % Optional: Display progress
    if mod(i, 1000) == 0
        fprintf('Row %d/%d has been written \n', i, length(timestamps));
    end
end
% Close the file
fclose(fid);
disp('CSV created');

This script does the following:

Instead of first accumulating all data in a cell array, it writes directly to the file.
It prints the column names ('time' followed by all field names) as the first row of the CSV.
For each timestamp, it writes the corresponding sample from each timeseries. The output is produced row by row, so apart from the loaded struct itself, only one row of values is handled at a time.
After writing all the data, it closes the file with fclose.

Advantages:

This approach avoids creating a large cell array or table in memory, instead writing data directly to a file as it is processed.
Directly writing to a file can be faster than creating a large data structure and then converting it to a table and writing it to a file.
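If the row-by-row loop turns out to be slow, one possible middle ground is to collect the values in a plain numeric matrix and hand the whole matrix to a single fprintf call. This is not what the script above does; it is a sketch of an alternative. The function name writeSeriesCsv is hypothetical, and the sketch assumes that every timeseries holds scalar samples on the same time vector. The matrix costs 8 bytes per value, which is far less than a cell array of the same size.

```matlab
function writeSeriesCsv(data, outFile)
% writeSeriesCsv  Hypothetical helper: write a struct of timeseries to CSV.
% Assumes all fields of 'data' are timeseries with scalar samples that
% share the time vector of the first field.
field_names = fieldnames(data);
timestamps = data.(field_names{1}).Time;

% One row per timestamp: a time column plus one column per field
values = zeros(numel(timestamps), numel(field_names) + 1);
values(:, 1) = timestamps(:);
for j = 1:numel(field_names)
    columnData = data.(field_names{j}).Data;
    values(:, j + 1) = double(columnData(:));
end

fid = fopen(outFile, 'w');
assert(fid ~= -1, 'Could not open %s for writing.', outFile);

% Header row: 'time' followed by the field names
fprintf(fid, '%s\n', strjoin([{'time'}; field_names(:)]', ','));

% Build a format such as '%.15g,%.15g,...,%.15g\n', one specifier per column
rowFormat = [strjoin(repmat({'%.15g'}, 1, size(values, 2)), ','), '\n'];

% fprintf consumes its array argument column by column, so pass the
% transpose to emit one matrix row per output line
fprintf(fid, rowFormat, values.');
fclose(fid);
end
```

With the variable names from the question, a call could look like writeSeriesCsv(data_struct.debug, 'output.csv'). If the series do not share a time vector, the column assignment fails with a size mismatch rather than silently misaligning the data.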

Considerations:

Ensure that the output format is appropriate for your data. The %f specifier prints six digits after the decimal point by default; adjust the fprintf format specifiers (for example to %.15g) if you need more precision.
Consider adding error handling, for example checking that fopen did not return -1 before proceeding with writing.
The optional progress reporting (every 1000 rows) can help monitor the script's progress, especially with large datasets. Adjust the frequency of these messages as needed based on your dataset size.
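The first two considerations can be sketched as follows. The sample value is made up for illustration, and %.15g is only one possible choice of format specifier, not a recommendation from the answer.

```matlab
% Sketch: check the result of fopen and compare two format specifiers.
outFile = 'output.csv';            % file name used in the script above
[fid, errMsg] = fopen(outFile, 'w');
if fid == -1
    % fopen returns -1 on failure; errMsg carries the system error message
    error('Could not open %s for writing: %s', outFile, errMsg);
end

value = 0.1234567890123;           % illustrative sample value
fprintf(fid, '%f\n', value);       % writes 0.123457 (six digits after the point)
fprintf(fid, '%.15g\n', value);    % writes 0.1234567890123 (up to 15 significant digits)

fclose(fid);
```

For timestamps with fine resolution, the default %f output can make neighbouring samples indistinguishable, which is why the specifier is worth checking against the actual data.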