Mass replace text, then split the replaced text into multiple columns - from multiple csv files

Question

NeverPerfecT 2018-3-2

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/385917-mass-replace-text-then-split-the-replaced-text-into-multiple-columns-from-multiple-csv-files

评论： Guillaume 2018-3-7

I have multiple csv files (Azure.csv, Blue.csv,...until 50th file) which has roughly this kind of data inside it:

ID,     Duration,   Team A,   TeamAComposition,   Team B,   TeamBComposition
120404  3442        Azure     Ae Bee Cee Dee Ee   Blue      Ef Gee Heij Ai Jei
302033  233         Azure     Ae Bee Cee Dee Zet  Cyan      El Em En Ou Pee
...and so on

I would like to replace the team composition data into integers, so for example "Ae" will be replaced into "1", "Bee" will be "2", "Cee" into "3" and so on, and if possible convert those into 5 different new columns. So the new table is roughly like this

ID,     Duration,   Team A,   AComp1, AComp2,   AComp3,   AComp4,  AComp5, Team B,   BComp1,  BComp2, BComp3, BComp4, BComp5,
120404  3442        Azure     1       2         3         4        5       Blue      6        7       8       9       10

How can I do this? Thanks in advance.

6 个评论
显示 4更早的评论隐藏 4更早的评论

Guillaume 2018-3-2

A comma or any other separator not present in the names would remove any ambiguity indeed. It's possible to do the splitting without that separator but if there's a chance of the problem above, then an explicit separator is safer.

NeverPerfecT 2018-3-3

编辑：NeverPerfecT 2018-3-3

How can I achieve the multiple column conversion above? (assuming I've given comma after each hero names) Also I'm using Matlab R2015b

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

Guillaume 2018-3-5

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/385917-mass-replace-text-then-split-the-replaced-text-into-multiple-columns-from-multiple-csv-files#answer_308457

在 MATLAB Online 中打开

Solution for when heroes are just separated by a space:

In its own m file:

function varargout = split_heros(heroes, varargin)
    %split multiple scalar strings (varargin) into heroes id
    varargout = cell(1, numel(varargin)*5);
    for col = 1:numel(varargin)
        ishero = ~cellfun(@isempty, regexp(varargin{col}, heroes.regex));
        varargout(5*col-4 : 5*col) = num2cell(heroes.hero_id(ishero));
    end
end

To create the table:

heronames = readtable('hero_names.csv');
heronames.regex = strcat(heronames.localized_name, '( |$)');  %regular expression to help splitting list of names
%to wrap in a loop for each of the 50 files:
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)];  %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
            rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]

Solution for when heroes are separated by a comma:

In its own m file:

function varargout = split_heros(heroes, varargin)
    [~, row] = ismember(strsplit(strjoin([varargin{:}], ','), ','), heroes.localized_name);
    varargout = num2cell(heroes.hero_id(row));
end

To create the table:

heronames = readtable('hero_names.csv');
%to wrap in a loop for each of the 50 files,  same code as before
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)];  %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
            rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]

6 个评论
显示 4更早的评论隐藏 4更早的评论

NeverPerfecT 2018-3-7

在 MATLAB Online 中打开

CDEC vs VG.csv

Sorry to bother you again Guillame, I've come to another dead end xD. When I tried to convert another file this wall of error text came up.. (csv attached)

Error using table/rowfun>dfltErrHandler (line 309)
Applying the function '@(varargin)split_heros(heronames,varargin{:})' to the 38th row of A generated the
following error:
In an assignment  A(:) = B, the number of elements in A and B must be the same.
Error in table/rowfun>@(s,varargin)dfltErrHandler(grouped,funName,s,varargin{:}) (line 196)
    errHandler = @(s,varargin) dfltErrHandler(grouped,funName,s,varargin{:});
Error in table/rowfun (line 214)
                [b_data{i,:}] = errHandler(struct('identifier',ME.identifier, 'message',ME.message,
                'index',i),inArgs{:});
Error in replacenames (line 7)
            rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes',
            'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)];

Is this due to no commas present? As always, thanks again!

Guillaume 2018-3-7

在 MATLAB Online 中打开

If you're getting an "In an assignment A(:) = B, ..." error most likely it's because for some reason the code didn't identify exactly 5 heros in a team. It either found more or less. As said, I didn't put any error checking. You could add the line

assert(sum(ishero) == 5, '%d heros found in team %d', sum(ishero), col);

after the ishero = ... line in split_heros.m to make the error message clearer.

When I test your 'CDEC vs VG.csv' file, I get a problem on the first row of the table because Barathrum is not in your 'hero_name.csv'. I suspect a similar problem happens for row 38 (where you're getting the error).

Note that the check for hero names that I've implemented for the 1st case is fairly crude. It also assumes that the hero names do not contain any of the following characters: ()[]{}|^$.+*\?, that is any of the characters that have special meaning in regular expressions.

If you want to check which heros are identified for a particular team of a particular row, e.g. team B of row 38:

heroes = intable{38, 'TeamBHeroes'}
ishero = ~cellfun(@isempty, regexp(heroes, heronames.regex));
foundheros = heronames(ishero, :)

which in my case shows that Monkey King is not found (since not in heronames)

请先登录，再进行评论。

Mass replace text, then split the replaced text into multiple columns - from multiple csv files

6 个评论
显示 4更早的评论隐藏 4更早的评论

采纳的回答

6 个评论
显示 4更早的评论隐藏 4更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

Mass replace text, then split the replaced text into multiple columns - from multiple csv files

6 个评论 显示 4更早的评论隐藏 4更早的评论

采纳的回答

6 个评论 显示 4更早的评论隐藏 4更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

WeChat

6 个评论
显示 4更早的评论隐藏 4更早的评论

6 个评论
显示 4更早的评论隐藏 4更早的评论