Pre-indexing a tall datastore

1 次查看(过去 30 天)
Michael
Michael 2019-8-30
Hello,
I'd like to generate a set of indicies for my tall datastore in advance so, when I want to select rows, it will be fast.
Here is what I came up with.
I expect there is a much better way. Please help me improve it.
FYI, the datastore will be several million rows by four columns. I ran the following test on a very small file.
Thanks
Michael
Code:
% CREATE TALL DATASTORE
ds = datastore(fname);
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
ds.SelectedFormats(1:3) = {'%{MM/dd/uuuu}D','%C','%C'};
tds = tall(ds);
gDATE = unique(tds.DATE);
gTICKER = unique(tds.TICKER);
gFIELD = unique(tds.FIELD);
% FIND UNIQUE LABELS
[uniqueDates, uniqueTickers, uniqueFields] = gather(gDATE,gTICKER,gFIELD);
% FIND THE INDEX OF EACH UNIQUE LABEL
for iTicker = 1:length(uniqueTickers)
gTicker = find(tds.TICKER==uniqueTickers(iTicker));
idxTicker{iTicker} = gather(gTicker);
end
for iField = 1:length(uniqueFields)
gField = find(tds.FIELD==uniqueFields(iField));
idxField{iField} = gather(gField);
end
% TABULATE
tTickerIndex = array2table([cellstr(uniqueTickers) idxTicker.'], ...
'VariableNames',{'TICKER','INDEX'});
tTickerIndex.TICKER= categorical(tTickerIndex.TICKER)
tFieldIndex = array2table([cellstr(uniqueFields) idxField.'], ...
'VariableNames',{'FIELD','INDEX'})
tFieldIndex.FIELD = categorical(tFieldIndex.FIELD)
% DISPLAY FIRST 10 VALUES FOR EQY_DVD_YLD_IND
out = tds(tFieldIndex.INDEX{tFieldIndex.FIELD=='EQY_DVD_YLD_IND'}(1:10),:);
gather(out)
Output
tTickerIndex = 1×2 table
TICKER INDEX
_______________ ________________
DFIVX US EQUITY [34112×1 double]
tFieldIndex = 6×2 table
FIELD INDEX
________________________________ _______________
DAY_TO_DAY_TOT_RETURN_GROSS_DVDS [6423×1 double]
DAY_TO_DAY_TOT_RETURN_NET_DVDS [6423×1 double]
DIVIDEND_INDICATED_YIELD [6242×1 double]
EQY_DVD_YLD_IND [6242×1 double]
EQY_DVD_YLD_IND_NET [6242×1 double]
FUND_TOTAL_ASSETS [2540×1 double]
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 1.1 sec
- Pass 2 of 2: Completed in 0.98 sec
Evaluation completed in 2.8 sec
ans = 10×4 table
DATE TICKER FIELD VALUE
__________ _______________ _______________ ______
06/01/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1484
06/02/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1611
06/03/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1718
06/06/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1654
06/07/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.159
06/08/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1442
06/09/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1484
06/10/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.138
06/13/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1277
06/14/1994 DFIVX US EQUITY EQY_DVD_YLD_IND 2.1277

回答(2 个)

Michael
Michael 2019-9-1
Apparently, indexing is completely ineffective. My data is about 950 million rows by 4 columns
ds = datastore('tallFinal.csv');
ds.SelectedFormats={'%q','%C','%C','%s'};
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};
tds = tall(ds);
i = int16(intersect(idxTicker,idxField));
size(i)
ans = 225 1
temp = tds(i,:);
temp = gather(temp);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 3: Completed in 23 min 0 sec
- Pass 2 of 3: 33% complete

Rajani Mishra
Rajani Mishra 2020-2-13
I found below link related to indexing in tall array, find it below:
You can also consider creating a custom datastore and process data in smaller groups. For creating custom datastore please refer below:
Hope this helps!

类别

Help CenterFile Exchange 中查找有关 Tall Arrays 的更多信息

产品


版本

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by