Extracting a wide table from a very tall datastore array, also trying KDB+

1 次查看(过去 30 天)
Hello,
I am currently storing about 40GB of data in a 4 column tall array with about 1B rows. I'm using a Matlab Tall Array Datastore, but I also plan on experimenting with KDB+.
Could someone comment on the best way to extract a wide table, with about 350,000 rows and about 20 columns by selecting some of the items in the first three columns and the corresponding values,e.g DATE = {29-Jul-1983:31-Aug-1983} & STOCK = AAPL US EQUITY & FIELD= {MOV_AVG_50D & PRICE}?
Thank You,
Michael
Input
DATE,STOCK,FIELD,VALUE
29-Jul-1983 00:00:00,BHP AT EQUITY,MOV_AVG_50D,0.8979
31-Aug-1983 00:00:00,BHP AT EQUITY,PRICE,0.9029
29-Jul-1983 00:00:00,IBM US EQUITY,MOV_AVG_50D,0.9106
31-Aug-1983 00:00:00,IBM US EQUITY,PRICE_50D,0.9154
29-Jul-1983 00:00:00,AAPL US EQUITY,MOV_AVG_50D,0.9227
31-Aug-1983 00:00:00,AAPL US EQUITY,PRICE,0.9311
Output:
DATE,MOV_AVG_50D,PRICE
29-Jul-1983 00:00:00,0.9227,NaN
31-Aug-1983 00:00:00,NaN,0.9311
  1 个评论
Michael
Michael 2019-8-29
This is what I tried so far.
ds = datastore('bigtall.csv','DatetimeType','text');
ds.SelectedFormats={'%q','%C','%C','%f'};
tds = tall(ds);tm.cputime(1)=cputime;
tic;
t = tds(tds.TICKER=='UNIT US EQUITY' & tds.FIELD=='EBITDA_MARGIN',:);
T = gather(t);
tm.cputime(2)=cputime;
tm.tictoc(2)= toc;
fprintf('%d minutes CPU, %d minutes\n',[diff(tm.cputime) tm.tictoc(2)]./60);
fprintf('Table size [%dx%d]\n',size(T));
T.DATE = datetime(T.DATE);
tb = unstack(T(:,{'DATE','VALUE','FIELD'}), 'VALUE', 'FIELD');
fprintf('Table size [%dx%d]\n',size(tb));
The output took 23 minutes.
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 4).Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 20 min 44 sec
Evaluation completed in 21 min 26 sec
5 minutes CPU, 23 minutes
Table size [762x4]
Table size [762x2]

请先登录,再进行评论。

采纳的回答

Michael
Michael 2019-9-12
Hello,
Here's an update. Unfortunatly, or fortuntely depending on how you look at it, KDB+ is about 45x times faster (31 seconds vs. 23 minutes) and can be accessed through Matlab's datafeed toolbox.
Best,
Michael
  1 个评论
Michael
Michael 2019-9-12
This took 1 second instead of 23 minutes!
select from f where TICKER=`$("AAPL US EQUITY"), FIELD in (`MOV_AVG_50D,`HIGH), DATE within 1983.07.29 1983.08.31

请先登录,再进行评论。

更多回答(0 个)

类别

Help CenterFile Exchange 中查找有关 Big Data Processing 的更多信息

产品


版本

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by