multiple for loops split data
Hey guys,
Currently my function is really slow because of the amount of data and because it uses only one thread.
Since I have a multicore processor (Ryzen 5 3600, 6 cores / 12 threads), I want to make use of it by splitting my data, running the same function on each part, and putting the results back together.
I have found the spmd and parfor commands.
The rough steps I want to do:
- split the data (tables) n times
- give each worker its parts of the split data plus the raw data (which I need for the function)
- run a function that modifies the split data on each worker
- put all the split data back together
Also, I am limited to functions available in MATLAB R2015b.
How can I do that? Can you please help me?
This is what I tried:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
split1 = data((data.ID <= divider),:);
split2 = data((data.ID > divider) & (data.ID <= divider*2),:);
split3 = data((data.ID > divider*2) & (data.ID <= divider*3),:);
split4 = data((data.ID > divider*3) & (data.ID <= divider*4),:);
split5 = data((data.ID > divider*4) & (data.ID <= divider*5),:);
split6 = data((data.ID > divider*5) & (data.ID <= divider*6),:);
split7 = data((data.ID > divider*6) & (data.ID <= divider*7),:);
split8 = data((data.ID > divider*7) & (data.ID <= divider*8),:);
split9 = data((data.ID > divider*8) & (data.ID <= divider*9),:);
split10 = data((data.ID > divider*9) & (data.ID <= divider*10),:);
split11 = data((data.ID > divider*10) & (data.ID <= divider*11),:);
split12 = data((data.ID > divider*11) & (data.ID <= specs.numberOfRows),:);
dataset_array={split1, split2,split3,split4,split5,split6,split7,split8,split9,split10,split11,split12};
parfor i=1:12
newDataset_array(i) = myFunction(dataset_array(i),data);
end
for i = 1:1:12
newData = [newData;newDataset_array(i)]
end
Thanks in Advance
11 comments
Jakob B. Nielsen
2020-1-15
Edited: Jakob B. Nielsen
2020-1-15
I think parfor only runs on parallel cores/workers with the Parallel Computing Toolbox... I assume you have that? Can you give a little more info on what your issue is?
dpb
2020-1-15
" i dont quite understand how to use parfor the optimal way"
Read the introductory documentation and study the examples carefully, then.
Can't really comment on the parfor bit as I don't have the parallel toolbox. As far as I know, your parfor code probably works as you want, but it's not clear why you're passing both a portion of data (as dataset_array(i)) and the whole of data.
With regard to your code: numbered variables are always a bad idea, even temporary ones. For a start, they force you to needlessly repeat the same code several times (witness all your splitx = ... lines).
At the very least you should use a loop:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
%so much simpler than numbered variables
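dataset_array = cell(1, workers);  %one cell per worker (numel(workers) would be 1)
for idx = 1:workers
    %parentheses matter: divider*(idx-1), not divider*idx-1
    dataset_array{idx} = data((data.ID > divider*(idx-1)) & (data.ID <= divider*idx), :);
end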
Probably better:
workers = 12;
destination = discretize(data.ID, workers) ; %split ID into workers bins
dataset_array = cell(1, workers);
for idx = 1:workers
    dataset_array{idx} = data(destination == idx, :);
end
or:
workers = 12;
destination = discretize(data.ID, workers) ; %split ID into workers bins
dataset_array = splitapply(@(rows) {data(rows, :)}, (1:height(data))', destination);
15 lines of code down to 3! And if you want to change the number of workers, you just have one line to edit instead of lots of copy/paste or deletions required.
Most likely, your myFunction takes a table as input, not a 1x1 cell array of table, in which case your parfor should be:
newDataset_array = cell(size(dataset_array));
parfor i = 1:numel(dataset_array) %don't hardcode values
    newDataset_array{i} = myFunction(dataset_array{i}); %Use {} indexing to get the content of the cell
end
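The question's last step, putting the parts back together, is not shown above. Here is a minimal sketch of the whole split / process / merge cycle, assuming toy data and a placeholder function handle `processPart` standing in for `myFunction` (both are hypothetical, not from the original post). It uses only the edge-vector form of discretize, so it should also run in R2015b.

```matlab
% Toy data standing in for the real table (hypothetical columns).
data = table((1:20)', rand(20, 1), 'VariableNames', {'ID', 'Value'});
workers = 4;

% workers + 1 edges give exactly `workers` bins.
edges = linspace(min(data.ID), max(data.ID), workers + 1);
destination = discretize(data.ID, edges);

% Split into one table per bin.
dataset_array = cell(1, workers);
for idx = 1:workers
    dataset_array{idx} = data(destination == idx, :);
end

% Hypothetical stand-in for myFunction: returns its input unchanged.
processPart = @(part) part;

newDataset_array = cell(size(dataset_array));
parfor i = 1:numel(dataset_array)
    % feval avoids any ambiguity between calling the handle and indexing it
    newDataset_array{i} = feval(processPart, dataset_array{i});
end

% Merge all parts in one call instead of growing a table inside a loop.
newData = vertcat(newDataset_array{:});
```

Because the bins are taken in increasing ID order and the toy IDs are already sorted, `newData` comes back in the original row order. With unsorted IDs the rows would be grouped by bin instead, so a final `sortrows(newData, 'ID')` might be needed.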
Owner5566
2020-1-15
Owner5566
2020-1-15
Guillaume
2020-1-15
It's not in the release notes, but it appears that the number of bins option was added in R2016b.
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers))
should work for you.
Owner5566
2020-1-15
Guillaume
2020-1-15
Oh, of course, N edges == (N-1) bins. Use
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1));
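To see the edges-versus-bins point on a small case, here is a sketch with six made-up IDs and three bins (the variable `ID` is a plain vector here, not the table column from the question).

```matlab
% Six sample IDs split into 3 bins; 3 bins need 4 edges.
ID = [1; 2; 3; 4; 5; 6];
workers = 3;

edges = linspace(min(ID), max(ID), workers + 1);   % 1, 2.6667, 4.3333, 6
destination = discretize(ID, edges);

% destination is [1; 1; 2; 2; 3; 3].
% The last bin also includes its right edge, so the row with the
% largest ID is binned rather than returned as NaN.
```

Note that the bins are equally wide in ID, not equally full: if the IDs are unevenly spread, some workers would get more rows than others.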
Owner5566
2020-1-15
Guillaume
2020-1-15
Now I just need a way to make the big data available to all workers.
The way I do it now, they all get it in the function, which leads to a lot of memory use.
Can't I make it available to all?
I need it for filtering in the functions.
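One option that may help here is `parallel.pool.Constant` from the Parallel Computing Toolbox, which as far as I know was introduced in R2015b. The sketch below uses toy data and a made-up filtering step (counting matching IDs) in place of the real function. It copies the table to each worker once and reuses that copy across parfor loops. It does not share memory between workers: each worker is a separate process and still holds its own copy, so this mainly saves repeated transfers rather than total memory.

```matlab
% Requires Parallel Computing Toolbox.
% Toy data and a toy split (hypothetical, for illustration only).
data = table((1:20)', rand(20, 1), 'VariableNames', {'ID', 'Value'});
dataset_array = {data(1:10, :), data(11:20, :)};

% Copy the full table to every worker once.
bigData = parallel.pool.Constant(data);

counts = zeros(1, numel(dataset_array));
parfor i = 1:numel(dataset_array)
    allRows = bigData.Value;      % worker-local copy of the full table
    part = dataset_array{i};
    % Placeholder for the real filtering: count rows of the full
    % table whose ID also appears in this part.
    counts(i) = sum(ismember(allRows.ID, part.ID));
end
```

If memory is the real limit, passing each worker only the columns or rows of `data` that the filtering actually needs is probably the more effective change.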