Processing Big Data Files

Ugur Acar 2019-10-24
I have a 120 MB text file with around 3,600,000 rows. I need to read this data using the script generated from the Import Data menu.
But when I run the script, it gives an out-of-memory error. Is there another way to read a file this large?
My machine is an MSI laptop with an i7-7700HQ CPU @ 2.80 GHz and 8 GB of RAM. The generated script is:
%% Initialize variables.
filename = 'sicaklik.txt';
delimiter = '|';
startRow = 2;
formatSpec = '%s%s%s%s%s%s%s%[^\n\r]';
%% Open the text file.
fileID = fopen(filename,'r','n','UTF-8');
%% Skip the BOM (Byte Order Mark).
fseek(fileID, 3, 'bof');
%% Read columns of data according to the format.
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'TextType', 'string', 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
%% Close the text file.
fclose(fileID);
% Convert the contents of columns containing numeric text to numbers.
%% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
%%
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
%%
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,3,4,5,6,7]
    % Converts text in the input cell array to numbers. Replaced non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\,]*)+[\.]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\,]*)*[\.]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            % Detected commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains(',')
                thousandsRegExp = '^\d+?(\,\d{3})*\.{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = textscan(char(strrep(numbers, ',', '')), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end
%% Split data into numeric and string columns.
rawNumericColumns = raw(:, [1,3,4,5,6,7]);
rawStringColumns = string(raw(:, 2));
%% Make sure any text containing <undefined> is properly converted to an <undefined> categorical
idx = (rawStringColumns(:, 1) == "<undefined>");
rawStringColumns(idx, 1) = "";
%% Create output variable
all_cities = table;
all_cities.Istasyon_No = cell2mat(rawNumericColumns(:, 1));
all_cities.Istasyon_Adi = categorical(rawStringColumns(:, 1));
all_cities.YIL = cell2mat(rawNumericColumns(:, 2));
all_cities.AY = cell2mat(rawNumericColumns(:, 3));
all_cities.GUN = cell2mat(rawNumericColumns(:, 4));
all_cities.SAAT = cell2mat(rawNumericColumns(:, 5));
all_cities.SICAKLIK_C = cell2mat(rawNumericColumns(:, 6));
%% Clear temporary variables
clearvars filename delimiter startRow formatSpec fileID dataArray ans raw col numericData rawData row regexstr result numbers invalidThousandsSeparator thousandsRegExp rawNumericColumns rawStringColumns idx;

Answers (1)

Fangjun Jiang 2019-10-24
Split the large file into smaller files and process them with a tall array.
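
As a rough illustration of that suggestion, here is a minimal sketch that reads the data through a tabularTextDatastore, first chunk by chunk and then as a tall array, so the whole file is never converted in memory at once. It is not the answerer's code. The demo file, its header line, the station numbers, and the temperatures are all made up so the sketch runs on its own; the column names are taken from the table built in the question, and the choice of a mean temperature as the computation is just an example. For the real data, point fname at sicaklik.txt (or at a folder or wildcard pattern of the smaller split files, which a datastore also accepts).

```matlab
% Hypothetical miniature of sicaklik.txt (made-up rows) so the sketch is
% runnable by itself; replace fname with the real file or a folder of parts.
fname = fullfile(tempdir, 'sicaklik_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, 'Istasyon_No|Istasyon_Adi|YIL|AY|GUN|SAAT|SICAKLIK_C\n');
fprintf(fid, '17130|ANKARA|2018|1|1|0|-1.5\n');
fprintf(fid, '17130|ANKARA|2018|1|1|1|-2.5\n');
fprintf(fid, '17064|ISTANBUL|2018|1|1|0|7.0\n');
fclose(fid);

% Describe the file once; nothing is loaded yet.
ds = tabularTextDatastore(fname, ...
    'Delimiter', '|', ...
    'NumHeaderLines', 1, ...            % skip the header row
    'ReadVariableNames', false, ...
    'VariableNames', {'Istasyon_No','Istasyon_Adi','YIL','AY','GUN','SAAT','SICAKLIK_C'}, ...
    'TextscanFormats', {'%f','%q','%f','%f','%f','%f','%f'});
ds.ReadSize = 100000;                   % rows per chunk (an arbitrary choice)

% Option 1: explicit chunk loop, keeping only running totals in memory.
total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                   % table with at most ReadSize rows
    temps = chunk.SICAKLIK_C;
    temps = temps(~isnan(temps));       % ignore missing readings
    total = total + sum(temps);
    count = count + numel(temps);
end
meanTemp = total / count;

% Option 2: tall array; evaluation is deferred until gather is called.
reset(ds);
t = tall(ds);
meanTempTall = gather(mean(t.SICAKLIK_C, 'omitnan'));
```

Declaring the numeric columns as %f in the datastore lets the reader parse numbers directly, which may avoid the per-row regular-expression loop and the large cell arrays that the generated script builds.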
