Mass replace text, then split the replaced text into multiple columns - from multiple csv files

3 次查看(过去 30 天)
I have multiple csv files (Azure.csv, Blue.csv,...until 50th file) which has roughly this kind of data inside it:
ID, Duration, Team A, TeamAComposition, Team B, TeamBComposition
120404 3442 Azure Ae Bee Cee Dee Ee Blue Ef Gee Heij Ai Jei
302033 233 Azure Ae Bee Cee Dee Zet Cyan El Em En Ou Pee
...and so on
I would like to replace the team composition data into integers, so for example "Ae" will be replaced into "1", "Bee" will be "2", "Cee" into "3" and so on, and if possible convert those into 5 different new columns. So the new table is roughly like this
ID, Duration, Team A, AComp1, AComp2, AComp3, AComp4, AComp5, Team B, BComp1, BComp2, BComp3, BComp4, BComp5,
120404 3442 Azure 1 2 3 4 5 Blue 6 7 8 9 10
How can I do this? Thanks in advance.
  6 个评论
Guillaume
Guillaume 2018-3-2
A comma or any other separator not present in the names would remove any ambiguity indeed. It's possible to do the splitting without that separator but if there's a chance of the problem above, then an explicit separator is safer.
NeverPerfecT
NeverPerfecT 2018-3-3
编辑:NeverPerfecT 2018-3-3
How can I achieve the multiple column conversion above? (assuming I've given comma after each hero names) Also I'm using Matlab R2015b

请先登录,再进行评论。

采纳的回答

Guillaume
Guillaume 2018-3-5
Solution for when heroes are just separated by a space:
In its own m file:
function varargout = split_heros(heroes, varargin)
%split multiple scalar strings (varargin) into heroes id
varargout = cell(1, numel(varargin)*5);
for col = 1:numel(varargin)
ishero = ~cellfun(@isempty, regexp(varargin{col}, heroes.regex));
varargout(5*col-4 : 5*col) = num2cell(heroes.hero_id(ishero));
end
end
To create the table:
heronames = readtable('hero_names.csv');
heronames.regex = strcat(heronames.localized_name, '( |$)'); %regular expression to help splitting list of names
%to wrap in a loop for each of the 50 files:
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)]; %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]
Solution for when heroes are separated by a comma:
In its own m file:
function varargout = split_heros(heroes, varargin)
[~, row] = ismember(strsplit(strjoin([varargin{:}], ','), ','), heroes.localized_name);
varargout = num2cell(heroes.hero_id(row));
end
To create the table:
heronames = readtable('hero_names.csv');
%to wrap in a loop for each of the 50 files, same code as before
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)]; %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]
  6 个评论
NeverPerfecT
NeverPerfecT 2018-3-7
Sorry to bother you again Guillame, I've come to another dead end xD. When I tried to convert another file this wall of error text came up.. (csv attached)
Error using table/rowfun>dfltErrHandler (line 309)
Applying the function '@(varargin)split_heros(heronames,varargin{:})' to the 38th row of A generated the
following error:
In an assignment A(:) = B, the number of elements in A and B must be the same.
Error in table/rowfun>@(s,varargin)dfltErrHandler(grouped,funName,s,varargin{:}) (line 196)
errHandler = @(s,varargin) dfltErrHandler(grouped,funName,s,varargin{:});
Error in table/rowfun (line 214)
[b_data{i,:}] = errHandler(struct('identifier',ME.identifier, 'message',ME.message,
'index',i),inArgs{:});
Error in replacenames (line 7)
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes',
'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)];
Is this due to no commas present? As always, thanks again!
Guillaume
Guillaume 2018-3-7
If you're getting an "In an assignment A(:) = B, ..." error most likely it's because for some reason the code didn't identify exactly 5 heros in a team. It either found more or less. As said, I didn't put any error checking. You could add the line
assert(sum(ishero) == 5, '%d heros found in team %d', sum(ishero), col);
after the ishero = ... line in split_heros.m to make the error message clearer.
When I test your 'CDEC vs VG.csv' file, I get a problem on the first row of the table because Barathrum is not in your 'hero_name.csv'. I suspect a similar problem happens for row 38 (where you're getting the error).
Note that the check for hero names that I've implemented for the 1st case is fairly crude. It also assumes that the hero names do not contain any of the following characters: ()[]{}|^$.+*\?, that is any of the characters that have special meaning in regular expressions.
If you want to check which heros are identified for a particular team of a particular row, e.g. team B of row 38:
heroes = intable{38, 'TeamBHeroes'}
ishero = ~cellfun(@isempty, regexp(heroes, heronames.regex));
foundheros = heronames(ishero, :)
which in my case shows that Monkey King is not found (since not in heronames)

请先登录,再进行评论。

更多回答(0 个)

类别

Help CenterFile Exchange 中查找有关 Database Toolbox 的更多信息

标签

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by