Processing Big Data Files

Question

Ugur Acar 2019-10-24

Answered: Fangjun Jiang 2019-10-24

I have a 120 MB text file with about 3,600,000 rows. I need to read it using the script generated by the Import Data tool.

But when I run the script, it fails with an out-of-memory error. Is there another way to read a file this large?

My machine is an MSI laptop with an i7-7700HQ CPU @ 2.80 GHz and 8 GB of RAM. The generated script is:

%% Initialize variables.
filename = 'sicaklik.txt';
delimiter = '|';
startRow = 2;
formatSpec = '%s%s%s%s%s%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r','n','UTF-8');
%% Skip the BOM (Byte Order Mark).
fseek(fileID, 3, 'bof');
%% Read columns of data according to the format.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
%% Close the text file.
fclose(fileID);
%% Convert the contents of columns containing numeric text to numbers.
% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
%%
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
%%
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,3,4,5,6,7]
    % Converts text in the input cell array to numbers. Replaces non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            
            % Detect commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains(',')
                thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = textscan(char(strrep(numbers, ',', '')), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end
%% Split data into numeric and string columns.
rawNumericColumns = raw(:, [1,3,4,5,6,7]);
rawStringColumns = string(raw(:, 2));
%% Make sure any text containing <undefined> is properly converted to an <undefined> categorical
idx = (rawStringColumns(:, 1) == "<undefined>");
rawStringColumns(idx, 1) = "";
%% Create output variable
all_cities = table;
all_cities.Istasyon_No = cell2mat(rawNumericColumns(:, 1));
all_cities.Istasyon_Adi = categorical(rawStringColumns(:, 1));
all_cities.YIL = cell2mat(rawNumericColumns(:, 2));
all_cities.AY = cell2mat(rawNumericColumns(:, 3));
all_cities.GUN = cell2mat(rawNumericColumns(:, 4));
all_cities.SAAT = cell2mat(rawNumericColumns(:, 5));
all_cities.SICAKLIK_C = cell2mat(rawNumericColumns(:, 6));
%% Clear temporary variables
clearvars filename delimiter startRow formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp rawNumericColumns rawStringColumns idx;