speed up renamecats/categorical multiple columns

Question

Peng Li 2020-5-12

Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/525001-speed-up-renamecats-categorical-multiple-columns

Commented: Peng Li 2020-10-9

I have a huge CSV file of about 16 GB with over 9k columns. Each column is initially filled with codes (either integers or strings), and I have a code book giving the code and its meaning for each column. What I'm trying to do is translate the table so that it ends up with readable text instead of codes.

I can use either categorical or renamecats to "translate" them, but the issue is that it takes a substantially long time to loop through these columns. I'm wondering whether there is a way to speed this up.

See the example below:

tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["123", "234", "345", "456", "567"]', ...
    'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["apple", "banana", "cat", "dog", "elephant"]', ...
    'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
    ["East", "West", "North", "South", "Middle"]', ...
    'VariableNames', {'Code', 'Meaning'});
Vars   = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
    tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
        dictionary.(Vars{iC}).Meaning);
end

Is it possible to avoid this loop, or are there any suggestions to speed this up (considering that I have over 500k rows and 9k columns)?

Thank you!


Answer 1

Campion Loong 2020-10-9

Direct link to this answer:

https://ww2.mathworks.cn/matlabcentral/answers/525001-speed-up-renamecats-categorical-multiple-columns#answer_509771


Hi Peng,

It seems you have the dictionary code book to boot, and you already know which sets of codes go with which field/name in the dictionary (i.e. you can designate "VariableNames" in the first table(...) call).

In this case, why not create the table with categorical to begin with:

tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"],      dictionary.A.Code, dictionary.A.Meaning),...
            categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
            'VariableNames', {'A', 'B'});

There is no loop, and it is faster and much more readable.


Campion Loong 2020-10-9

If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.

If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:
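As a rough illustration of that idea (not necessarily what was linked here), the sketch below imports every column as categorical through ImportOptions and then renames only the categories with renamecats, so the per-column work touches the category names rather than every row. The tiny CSV file, the shortened code book, and the variable names csvFile, vars, book and present are assumptions made up for this example; whether this is actually faster on a 9k-column file is untested.

```matlab
% Shortened code book in the same shape as in the question.
dictionary.A = table(["a1"; "b2"; "c3"], ["apple"; "banana"; "cat"], ...
    'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123"; "234"; "345"], ["East"; "West"; "North"], ...
    'VariableNames', {'Code', 'Meaning'});

% Write a tiny stand-in CSV so the sketch is self-contained (hypothetical file).
csvFile = [tempname '.csv'];
writetable(table(["a1"; "c3"; "a1"], [123; 345; 234], ...
    'VariableNames', {'A', 'B'}), csvFile);

% Ask readtable to create categorical columns directly at import time.
opts = detectImportOptions(csvFile);
opts = setvartype(opts, 'categorical');
tbl  = readtable(csvFile, opts);

% Rename categories column by column; only codes that actually occur in the
% column are passed, since renamecats expects existing category names.
vars = tbl.Properties.VariableNames;
for k = 1:numel(vars)
    book    = dictionary.(vars{k});
    present = ismember(book.Code, categories(tbl.(vars{k})));
    tbl.(vars{k}) = renamecats(tbl.(vars{k}), ...
        book.Code(present), book.Meaning(present));
end
```

The loop over columns is still there, but each iteration now works on the handful of category names instead of re-converting the full column. Codes that never occur in a column do not become categories in this sketch, unlike with the categorical(data, valueset, catnames) call in the question.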

Peng Li 2020-10-9

Hi Campion, thanks again for your attention. I've actually tried different options a while ago -- tall arrays, datastore, transforming a datastore, mapreduce, or readall on a server (over 380 GB of RAM). Reading the data is easily handled.

The issue is with the decoding part. It is simply too slow to do in a loop. And I guess ImportOptions couldn't help with decoding the actual data, as I have to load the data first and then do the decoding.

I've tried a way using a transformed datastore. Basically, I do the decoding in the transform function and then write the datastore to disk. It works, but it is slow too.

I have several workable solutions now, but none of them gives me the best speed. The single file is around 20 GB in CSV format, with over half a million rows and almost ten thousand columns. On my server this task takes over 24 hours, so I guess I just need to be a bit patient and let the server work while I'm doing something else.
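For readers who want to see what that approach might look like, here is a minimal sketch, not the exact code used above. It assumes csvFile and outDir are paths that already exist in the workspace, that dictionary is the struct of Code/Meaning tables from the question, and that the release in use provides writeall; decodeChunk is a hypothetical helper name.

```matlab
% Assumptions: csvFile (input file) and outDir (output folder) are defined
% elsewhere; dictionary has one Code/Meaning table per column name.
ds  = tabularTextDatastore(csvFile);
tds = transform(ds, @(chunk) decodeChunk(chunk, dictionary));

% Read, decode and write chunk by chunk, without loading the whole file.
writeall(tds, outDir, 'OutputFormat', 'csv');

function chunk = decodeChunk(chunk, dictionary)
    % Replace the codes in every column of one chunk with their meanings.
    vars = chunk.Properties.VariableNames;
    for k = 1:numel(vars)
        book = dictionary.(vars{k});
        % string(...) lets numeric and text codes be matched the same way.
        chunk.(vars{k}) = categorical(string(chunk.(vars{k})), ...
            book.Code, book.Meaning);
    end
end
```

This keeps memory use bounded by the chunk size, but the per-column decoding is repeated for every chunk, which is consistent with the observation that it works yet remains slow.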
