I'm trying to implement a function in MATLAB that searches for files in parallel to speed up the process. I've successfully implemented that in the following function:
function matches = searchfordata(starting_path, search_depth, checkFunction)
    arguments
        starting_path {isfolder}
        search_depth int64
        checkFunction function_handle
    end
    tic;
    folders = struct('name', '', 'folder', starting_path);
    dataMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
    next_folders = [];
    matches = [];
    no_folders = height(folders);
    current_depth = 0;
    total_folders = 0;
    if search_depth < 0
        search_depth = 9001;
    end
    while current_depth <= search_depth && no_folders > 0
        total_folders = total_folders + no_folders;
        parfor n = 1:no_folders
            path = strcat(folders(n).folder, filesep, folders(n).name);
            [files, cfolders] = filesandfolders(path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches; check];
            next_folders = [next_folders; cfolders];
        end
        if height(matches) > 0
            dataMap(current_depth) = matches;
            matches = [];
        end
        folders = next_folders;
        no_folders = height(folders);
        next_folders = [];
        current_depth = current_depth + 1;
    end
    matches = dataMap;
    toc;
end
Relevant other functions/classes for this:
function [files, folders] = filesandfolders(path)
    directory_contents = dir(path);
    files = directory_contents(~[directory_contents.isdir]);
    folders = directory_contents([directory_contents.isdir]);
    folders = folders(~ismember({folders.name}, {'.', '..'}));
end
This is basically a dir call that splits the result into files and folders and removes the "." and ".." entries from the folder results.
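To show what the two outputs look like, here is a minimal sketch run against a throwaway folder. The names tmp_root, sub and a.txt are only illustrative, and the body of filesandfolders is inlined so that the snippet runs as a plain script:

```matlab
% Build a small throwaway folder tree (all names here are illustrative only).
tmp_root = tempname;
mkdir(tmp_root);
mkdir(fullfile(tmp_root, 'sub'));
fid = fopen(fullfile(tmp_root, 'a.txt'), 'w');
fclose(fid);

% Same logic as filesandfolders, inlined so the snippet is self-contained.
directory_contents = dir(tmp_root);
files = directory_contents(~[directory_contents.isdir]);
folders = directory_contents([directory_contents.isdir]);
folders = folders(~ismember({folders.name}, {'.', '..'}));

% Both outputs are struct arrays with the usual dir fields (name, folder, ...).
disp({files.name});
disp({folders.name});

% Clean up the temporary tree again.
rmdir(tmp_root, 's');
```

Both results keep the folder field, which is why the search functions can rebuild the full path from folder and name.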
function boolobject = checkFiles(files)
    cyclingregex = '\.txt$';
    transientregex = '\.txt$';
    matching_cycling = regexpi({files.name}, cyclingregex, 'Match');
    matching_transient = regexpi({files.name}, transientregex, 'Match');
    cycling_indices = ~cellfun(@isempty, matching_cycling);
    transient_indices = ~cellfun(@isempty, matching_transient);
    boolobject = FolderData(files(1).folder);
    boolobject.cyclingData = any(cycling_indices);
    boolobject.rthData = any(transient_indices);
    if boolobject.cyclingData || boolobject.rthData
        return
    else
        boolobject = [];
    end
end
This takes the list of files from filesandfolders as input and filters for the files I am searching for. I changed both patterns to match .txt files for better reproducibility. If anything matches, the function returns an instance of this class:
classdef FolderData < handle
    properties
        folder
        rthData
        cyclingData
    end
    methods
        function this = FolderData(path)
            this.rthData = false;
            this.cyclingData = false;
            this.folder = path;
        end
    end
end
It just records which kinds of files were found and in which folder.
The actual search function at the top takes 8-30 seconds on my drives and works. Now I thought I could maybe speed this up a little more with afterEach. The basic idea is that if the folders processed in the parfor loop differ greatly in how much they contain, a single folder holds back the whole process, because it needs to finish before the function can resume its work after the parfor loop.
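To illustrate the imbalance I mean, here is a minimal sketch with made-up durations standing in for folders of very different sizes. pause is only a stand-in for the directory scan, and the numbers are not measurements. The parfor loop as a whole cannot finish before its slowest iteration does:

```matlab
% Hypothetical per-folder processing times in seconds; the last "folder" is the slow one.
durations = [0.1 0.1 0.1 1.5];

tic;
parfor n = 1:numel(durations)
    pause(durations(n));   % stand-in for dir() plus the check function
end
elapsed = toc;

% The loop cannot return before its slowest iteration has finished,
% so the fast workers sit idle until then.
fprintf('parfor took %.2f s, slowest iteration %.2f s\n', elapsed, max(durations));
```

In the real search, the subfolders found by the fast iterations cannot be processed until that slow iteration is done, which is the waiting time I was hoping to avoid.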
For that I have created the following script:
clc;
clear all;
path = 'D:\';
if isempty(gcp('nocreate'))
    parpool(4);
end
fun = @checkFiles;
output = searchfordata(path, 100, fun);
function matches = searchparforae(starting_path, checkFunction)
    tic;
    folder_que = parallel.pool.DataQueue;
    matches = [];
    listener = afterEach(folder_que, @search_folder);
    starting_path = struct('folder', starting_path, 'name', '');
    search_folder(starting_path);
    function search_folder(input)
        parfor n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            % Use '%s' so backslashes in the path are not parsed as escape sequences.
            fprintf(1, '%s\n', folder_path);
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches; check];
            send(folder_que, folders);
        end
    end
    toc;
end
function matches = searchforae(starting_path, checkFunction)
    tic;
    folder_que = parallel.pool.DataQueue;
    matches = [];
    listener = afterEach(folder_que, @search_folder);
    starting_path = struct('folder', starting_path, 'name', '');
    search_folder(starting_path);
    function search_folder(input)
        for n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            % Use '%s' so backslashes in the path are not parsed as escape sequences.
            fprintf(1, '%s\n', folder_path);
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            matches = [matches; check];
            send(folder_que, folders);
        end
    end
    toc;
end
The two functions searchforae and searchparforae are exactly the same except for the loop. As the names suggest, searchforae uses a for loop, while searchparforae uses a parfor loop.
Now, searchforae is not working at all. The print output shows that searchforae only processes the initially given directory and the directories directly below it. Print output:
D:\
D:\$RECYCLE.BIN
D:\Downloads
D:\OneDriveTemp
D:\Programme
D:\Repositories
D:\Sonstiges
D:\Spiele
D:\System Volume Information
D:\Uni
D:\Uni2
D:\Users
D:\Zwischenablage
The searchparforae function, in contrast, works just as well as the searchfordata function at the top, but instead of 8-30 seconds it takes 5-10 minutes. Am I using afterEach wrong? Why is it taking that long? Also, why isn't the searchforae function working correctly, even though the only difference from searchparforae is a for loop instead of a parfor loop?