Sort Excel files by file content?

Hi all, my new challenge with matlab involves filtering files with very inconsistent names. For example,
s =
'HI_B2_TTT9_D452_07052016.xlsx'
'HI_H2G_TTT7_D259_070516.xlsx'
'HI_B2C_TTT9_D1482_070516.xlsx'
'HI_A1C_468_070516_TTT4.xlsx'
'HI__TTT8_862_07052016_G1C.xlsx'
'HI_KA6_TTT4_148_07052016.xlsx'
'8C_HI_279_Potato_07052016.xlsx'
'HI_8C_279_Bacon_TTT52016.xlsx'
The files that I want are the first six files, which have different styles of naming even though they are the same type of files (TTT). While the last two files are undesired and need to be filter out indicated by keywords such as "Potato" and "Bacon".
My goal is to Extract files that contains the keywords "TTT" while eliminate files that have keywords "Potato" and "Bacon", this is not ideal since there are in fact hundreds of these files in my folders outside of this simple example that are constantly updating and I will need to look through them all for other potential unwanted keywords such as "Sour Cream", etc.
My ideal goal will be to extract those TTT files by its content, since all the TTT excel files have a sheet named "cooking is fun" inside while all the other ones do not. Is this feasible and is there a best way to do so?
Thank you so much for reading my concern and any inputs will be greatly appreciated!

5 个评论

What about this name: 'HI_8C_279_Bacon_TTT52016.xlsx' ? it contains TTT and Bacon!
Good catch! That is an example of a bad naming (someone accidentally put "TTT" together with "Bacon") and there is no way it can be avoided, all these files are named by different users who may or may not follow a correct naming rule. This is why ideally it will be best to sort these files by its content. Only the real TTT files have a sheet called "cooking is fun" while the other ones no matter how messed up the names are, will not have that sheet.
Look at these two examples:
'HI_B2C_TTT9_D1482_070516.xlsx'
'HI_8C_279_Bacon_TTT52016.xlsx'
What is your criterion to filter the first one?
I am thinking that maybe I can do:
Step 1: Extract all "TTT" files
Step 2: Further eliminate any files with keyword "Bacon"
Look at edited answer

请先登录,再进行评论。

 采纳的回答

Guillaume
Guillaume 2016-7-6
编辑:Guillaume 2016-7-13
You could certainly inspect the content of the excel files to see if there's the worksheet you want. However, since it involves opening and closing every file, it's not going to be particularly fast. It's up to you whether that's important or not:
function hassheet = CheckExcelFilesForWorksheet(folder, filelist, sheetname)
%HASSHEET Check whether or not the given excel files have a sheet with the given name
%folder: folder where the excel files are located (1D char array / string)
%filelist: names of excel files (cell array of 1D char arrays / string array)
%sheetname: name of sheet to find in excel files (1D char array / string)
%hassheet: array the same size as filelist, indicating whether or not the excel file has a sheet with sheetname (logical)
hassheet = false(size(filelist)); %output
excel = actxserver('Excel.Application'); %start microsoft excel
cleanupobj = onCleanup(@() excel.Quit); %close excel when function returns or error occurs
for fileidx = 1:numel(filelist)
%open without updating link and as read only. read only ensure the file can be opened even if it's already in use:
workbook = excel.Workbooks.Open(fullfile(folder, filelist{fileidx}), false, true);
%get the list of worksheet name by iterating over the Sheets collection:
sheetnames = arrayfun(@(i) workbook.Sheets.Item(i).Name, 1:workbook.Sheets.Count, 'UniformOutput', false);
workbook.Close(false);
if ismember(sheetname, sheetnames)
hassheet(fileidx) = true;
end
end
end
edited 13/07/2016: fixed bugs: forgot to close the workbooks, typos

6 个评论

Thank you so much for the great help! I have never done anything like this before not to mention knowing what functions to use. For the last part, how do I incorporate the sheet name, "cooking is fun" in it? I am a little confused, thank you!
'cooking is fun' is the sheetname input to the function.
It's only used at the end in the ismember call. You could hard code it there but leaving it as an input is more flexible.
So, typical call to the function:
folder = 'c:\somewhere\where\the excel files\are';
hassheet = CheckExcelFilesForWorksheet(folder, s, 'cookingisfun');
%only keep files which have the sheet:
s = s(hassheet);
That makes a lot of sense, thank you!
So I tried:
function hassheet = CheckExcelFilesForWorksheet(filelist, sheetname)
hassheet = false(size(filelist)); %output
excel = actxserver('Excel.Application');
cleanupobj = onCleanup(@() excel.Quit);
for fileidx = 1:numfiles
workbook = excel.Workbooks.Open(AllExcelPath{fileidx}, false, true);
sheetnames = arrayfun(@(i) workbook.Sheets.Item(i).Name, 1:workbook.Sheets.Count, 'UniformOutput', false);
if ismember(sheetname, sheetnames)
hassheet(fileidx) = true;
end
end
end
%Note that numfiles is a pre-existing number of total excel files I have.
%AllExcelPath is a pre-existing cell array that contains all the full path of the excel files.
hassheet = CheckExcelFilesForWorksheet(s, 'cooking is fun');
WantedExcel = s(hassheet);
But get the result
function hassheet = CheckExcelFilesForWorksheet(filelist, sheetname)
|
Error: Function definitions are not permitted in this context.
I am not sure what I did wrong did here...
You put the function in a script instead of its own file.
Function definitions are not allowed in scripts in the current versions of matlab (although that is about to change).
Ok. So I saved the function on its own, it now gives me the error...
Undefined function or variable 'numfiles'.
Error in CheckExcelFilesForWorksheet (line 10)
for fileidx = 1:numfiles
I am not sure why this line is invalid since I just checked, "numfiles" is exactly the same as if I use "numel(AllExcel)", where AllExcel is a pre-existing cell array of the names of all the available excel files that I have.
This is a bug of your own. There's no numfiles in my original code. For reference, my original code for that line and the following was:
for fileidx = 1:numel(filelist)
%open without updating link and as read only. read only ensure the file can be opened even if it's already in use:
workbook = excel.Workbooks.Open(fullfile(folder, filelist{fileidx}), false, true);

请先登录,再进行评论。

更多回答(1 个)

Edited
a=regexp(s,'.+TTT.+','match','once')
b=regexprep(a,'\S+Bacon\S+','')
out=s(~cellfun(@isempty,b))

3 个评论

Thank you! And if I want to extract (TTT or XXX) and eliminate multiple keywords (Bacon or Potato), will I be able to do this?

a=regexp(s,'.+TTT.+','.+XXX.+','match','once')
b=regexprep(a,'\S+Bacon\S+','\S+Potato\S+','')
out=s(~cellfun(@isempty,b))
Certainly not! More likely, you'd get a syntax error. I'd recommend you learn the regular expression language (note that this is not a language specific to matlab).
If you want to replace Bacon or Potato, this regex would work:
regexprep(a, '\S+(?:Bacon|Potato)\S+', '')
However, this has nothing to do with the original question: "Sort Excel files by file content?"
After you mention that it will require the opening and closing of each file, I realize that it may take forever for all my files to be filtered every time I run the codes. Thus I am considering doing a quick regexp before I make the attempt of opening any files.
Thank you very much for all your help!

请先登录,再进行评论。

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by