Mapreduce does not seem to use all available cores

Question

Mehrdad Oveisi 2014-11-10

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/162140-mapreduce-does-not-seem-to-use-all-available-cores

Answered: Rick Amos 2014-11-24

Hello,

I am using mapreduce on a machine with 16 cores. I create a pool with 15 workers (cores), which works fine. When I run mapreduce, though, it only uses one or two workers: sometimes one for the mapper and one for the reducer. This is how I check which worker is processing the data (in addition to using a system monitor to watch CPU/core activity):

tk=getCurrentTask();
disp(tk.ID)

There are tens of files to be processed and each mapper is called with one file to process. Each time a mapper is called it loads and processes one file. I expect that during the first call to the mapper and while it is loading and processing the first file on one worker (core), there are other parallel calls to mapper to process the next files on other workers. However, this is not how it happens; it just sequentially calls the mapper on the same worker. Sometimes it uses a second worker for the reducer calls. So at most it uses two workers, while there are 15 available in the pool.

What would be a simple piece of code to check whether mapreduce is making use of all the available cores?

EDIT: Actually now I can confirm that the mapper is always run by a single worker, but the reducer may be run by a few different workers, as expected.

Your help is appreciated, Mehrdad
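One way to separate a pool problem from a mapreduce problem is to run the same getCurrentTask check inside a parfor loop. The following is a minimal sketch, assuming Parallel Computing Toolbox is installed and a pool is open (or can be created automatically); the loop size and the pause are arbitrary illustrative choices, not part of the original post.

```matlab
pool = gcp();                        % open or reuse the current parallel pool
n = 4 * pool.NumWorkers;             % a few iterations per worker (arbitrary)
ids = zeros(1, n);                   % 0 means "no task ID recorded"

parfor k = 1:n
    t = getCurrentTask();            % empty when the code runs on the client
    if ~isempty(t)
        ids(k) = t.ID;               % record which task ran this iteration
    end
    pause(0.1);                      % simulate some work per iteration
end

numDistinct = numel(unique(ids(ids > 0)));
fprintf('%d distinct task IDs ran parfor iterations (pool size %d)\n', ...
    numDistinct, pool.NumWorkers);
```

If this reports many distinct task IDs while the mapper still reports only one, the pool itself is distributing work and the restriction lies in how mapreduce partitions the input.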

10 comments

Mehrdad Oveisi 2014-11-13

Edited: Mehrdad Oveisi 2014-11-13

workers_test.m

Actually, I have now come up with a simple code example to illustrate this problem (adapted from the example in Getting Started with MapReduce). Running the following code (also attached) on my system shows that only one worker runs the mapper function. Note the single value 9 for the key 'MapperTaskID' in the output.

Output:

            Key           Value  
      _______________    ________
      'ReducerTaskID'    [     9]
      'Mean'             [702.16]
      'ReducerTaskID'    [     7]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      'MapperTaskID'     [     9]
      ...

The testing code:

function keyvalues = workers_test
    % Run a small mapreduce job and return all output key-value pairs.
    ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
    ds.SelectedVariableNames = 'Distance';
    ds.RowsPerRead = 5000; % smaller values increase the number of mapper calls
    preview(ds)
    outds = mapreduce(ds, @MeanDistMapFun, @MeanDistReduceFun);
    keyvalues = readall(outds);
end

function MeanDistMapFun(data, info, intermKVStore)
    % Record the ID of the task (worker) that runs this mapper call.
    tk = getCurrentTask();
    add(intermKVStore, 'MapperTaskID', tk.ID);
    % Emit the partial sum and count of the non-missing distances.
    distances = data.Distance(~isnan(data.Distance));
    sumLenValue = [sum(distances) length(distances)];
    add(intermKVStore, 'sumAndLength', sumLenValue);
end

function MeanDistReduceFun(intermKey, intermValIter, outKVStore)
    % Record the ID of the task (worker) that runs this reducer call.
    tk = getCurrentTask();
    add(outKVStore, 'ReducerTaskID', tk.ID);
    if strcmp(intermKey, 'MapperTaskID')
        while hasnext(intermValIter) % pass the same key/values along
            add(outKVStore, intermKey, getnext(intermValIter));
        end
        return
    end
    % Combine the partial sums and counts into the overall mean.
    sumLen = [0 0];
    while hasnext(intermValIter)
        sumLen = sumLen + getnext(intermValIter);
    end
    add(outKVStore, 'Mean', sumLen(1)/sumLen(2));
end
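Rather than scanning the output table by eye, the distinct task IDs can be counted. This is only a sketch: it builds a small stand-in table with values copied from the output listing above instead of calling workers_test, and it assumes the table returned by readall has the Key and Value columns shown there, with Value stored as a cell array.

```matlab
% Stand-in for the table returned by readall(outds); values copied from the
% output listing above (only a few of the MapperTaskID rows are repeated here).
Key   = {'ReducerTaskID'; 'Mean'; 'ReducerTaskID'; ...
         'MapperTaskID'; 'MapperTaskID'; 'MapperTaskID'};
Value = {9; 702.16; 7; 9; 9; 9};
keyvalues = table(Key, Value);

% Select the rows that carry task IDs and reduce them to distinct values.
isMapper   = strcmp(keyvalues.Key, 'MapperTaskID');
isReducer  = strcmp(keyvalues.Key, 'ReducerTaskID');
mapperIDs  = unique(cell2mat(keyvalues.Value(isMapper)));
reducerIDs = unique(cell2mat(keyvalues.Value(isReducer)));

fprintf('Mapper calls ran on %d distinct task ID(s)\n', numel(mapperIDs));
fprintf('Reducer calls ran on %d distinct task ID(s)\n', numel(reducerIDs));
```

With real output, replacing the stand-in table by keyvalues = workers_test should give the same kind of summary.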

Mehrdad Oveisi 2014-11-13

> This example hits a separate limitation that the input data currently needs to be "large" to provide meaningful parallelism.

I guess this limitation is behind the problem I am having. I have about 600 files to process. The files are about 40 MB on average (ranging from 5 MB to 130 MB). All of them are in .mat format and contain exactly four structs, which hold the data, metadata, etc. So the actual "data" table in each file is inside a struct in that file. I wasn't sure whether it is possible to create datastores directly from tables that sit inside structs in the files. So instead I pass the datastore a text file containing the 600 .mat filenames as input (and set ds.RowsPerRead=1 to go through the filenames one by one).

Then, as I mentioned in the original post, "each time a mapper is called it loads and processes one file."

Given the limitation you are mentioning, since the input to the mapper is just a filename, it will not provide parallelism.

Is there a setting that changes the assumption that a small input requires only a small amount of processing?
Or is there a way to make a datastore of tables that are inside structs in the input files?

Rick Amos 2014-11-17

Currently, the one very specific form of MAT-file that datastore can read is the output of another mapreduce call. An unofficial shortcut that creates such a MAT-file is the following code:

data.Key = {'Test'};
data.Value = {struct('a', 'Hello World!', 'b', 42)};
save('myMatFile.mat', '-struct', 'data');
ds = datastore('myMatFile.mat');
readall(ds)
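As a sketch of how this shortcut might be applied to an existing MAT-file whose table sits inside a struct, the code below rewraps one table as a single key-value pair. All file names, field names, and the key are hypothetical stand-ins, and the approach relies on the unofficial Key/Value layout described above, so treat it as an assumption rather than a supported workflow.

```matlab
% Stand-in for one original MAT-file (hypothetical names and contents).
src = fullfile(tempdir, 'example_source.mat');
payload.data = table((1:3)', 'VariableNames', {'x'});
payload.meta = struct('units', 'km');
save(src, '-struct', 'payload');

% Rewrap the table as a single key-value pair, mirroring the layout above.
s = load(src);
kv.Key   = {'example_source'};       % hypothetical key: a label for the file
kv.Value = {s.data};                 % the table taken out of the loaded struct
dst = fullfile(tempdir, 'example_kv.mat');
save(dst, '-struct', 'kv');

% Inspect what was written.
w = load(dst);
disp(w.Key);
disp(class(w.Value{1}));
```

Following the reply above, datastore(dst) would then be expected to read the rewrapped file, although that behaviour was only described for this unofficial shortcut.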

Mehrdad Oveisi 2014-11-19

Thank you Rick! I found your reply here useful. So I thought it's good to have a separate thread for this tip.

Answer 1

Rick Amos 2014-11-24

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/162140-mapreduce-does-not-seem-to-use-all-available-cores#answer_160012

In R2014b, there are some limitations with the minimum size of data that can be parallelized. To avoid this limitation, the input datastore must contain at least one of the following:

Multiple files, where each file will be handled in parallel.
Files that are larger than 32 MB, where each 32 MB will be handled in parallel.

If the input datastore contains a single small file, you will need to find a way to split that file into multiple files. For example, if the input datastore contains a single file listing many filenames (pointing to the actual data), you can split it into many files, each containing one filename or a small number of filenames, to ensure parallelism.
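The splitting step described above could look roughly like the following. It is a minimal sketch under stated assumptions: the filenames, the output folder, the FileName header, and the chunk size of three are all hypothetical, and the list is generated in code rather than read from the original text file.

```matlab
% Hypothetical list of data files; in practice this would come from the
% single text file that lists all the MAT-file names.
names = arrayfun(@(k) sprintf('data_%03d.mat', k), 1:10, ...
    'UniformOutput', false);
outDir = fullfile(tempdir, 'filelist_parts');   % hypothetical output folder
namesPerFile = 3;                               % filenames per small list file

if ~exist(outDir, 'dir')
    mkdir(outDir);
end

numParts  = ceil(numel(names) / namesPerFile);
partFiles = cell(numParts, 1);
for p = 1:numParts
    first = (p - 1) * namesPerFile + 1;
    last  = min(p * namesPerFile, numel(names));
    partFiles{p} = fullfile(outDir, sprintf('part_%03d.csv', p));
    fid = fopen(partFiles{p}, 'w');
    fprintf(fid, 'FileName\n');                 % header row (assumed name)
    fprintf(fid, '%s\n', names{first:last});    % one filename per line
    fclose(fid);
end
fprintf('Wrote %d list files to %s\n', numParts, outDir);
```

Passing partFiles (or the folder) to datastore in place of the single list file should then give mapreduce multiple files to work with, which is the first of the two conditions listed above.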
