Fast cell2mat with padding, or a numerical equivalent of the pad function, that works very fast on a cell array of variable-length uint8 vectors

Hi Folks,
I have found similar questions already asked, but I need a very fast solution that specifically converts a cell array of variable-length uint8 vectors into a uint8 matrix, with 0s padded onto the end of the shorter vectors. For character data, the very fast solution is simply the pad function. For example, given the cell array c:
for i=1:256 c{i,1}=char(uint8(0:i-1)); end
I can pad the cells with whatever character (say '@') using
padded_c=pad(c,'@');
and then convert it really fast to a matrix by:
matrix_c=reshape([padded_c{:}],[],numel(c));
which is, by the way, much faster than the cell2mat I could have used. The important part here is that I did not have to know or specify the maximum length in the cell array; the pad function figures it out. What I want is a similarly performing function for cell arrays of uint8s, which occupy half the space of chars.

The cell arrays are huge, potentially holding hundreds of millions of uint8 vectors, and even finding the maximum length with cellfun(@numel,c) is very costly. I need this for DNA sequence analysis. The input is typically text whose lines consist of only a few distinct characters, arranged in sequences of up to 160 or so characters. To save space I convert the sequences from lines of char text into a cell array of uint8s, and then I need a fast cell2mat with padding. Obviously I could process them as chars using the above and convert to uint8 at the end, but that significantly limits the size of the data I can process in RAM in one go.

I was thinking that perhaps I could stay with a single vector of uint8s from the start (without placing separate lines into cells) and then somehow inject uint8(0) after the end of each shorter line, but I could not see a fast(er) way to do it other than copying across to another empty matrix of uint8s. Any ideas?
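For illustration, here is a minimal sketch of one way to zero-pad a cell array of uint8 vectors without going through pad or char. It is not a benchmarked solution: the toy input c is hypothetical, every cell is assumed to hold a row vector, and the sequence lengths still have to be computed once. The layout matches the reshape example above, with one sequence per column.

```matlab
% Hypothetical toy input: variable-length uint8 row vectors
c = {uint8([65 67 71]); uint8(84); uint8([71 71 65 84])};

len = cellfun('prodofsize', c);   % length of each sequence (n-by-1)
m = max(len);                     % longest sequence
n = numel(c);                     % number of sequences

y = zeros(m, n, 'uint8');         % preallocated, already zero-padded
mask = (1:m).' <= len.';          % m-by-n logical: true where data belongs
y(mask) = [c{:}];                 % column-major fill, one sequence per column
```

The logical mask relies on implicit expansion (R2016b or later). It is as large as the output matrix, so for very large inputs the extra memory would have to be weighed against the time saved.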
3 Comments
dymitr ruta on 2022-10-13
Hi James, cool thanks for the remarks. Here is the fastest code that I currently use to achieve what I want:
%Converts a vector of chars or uint8s into a uint8 matrix (one sequence
%per row), padding shorter sequences with 0s at the end
function y=vec2pmtx(x,delimiter)
%Use newline as the default delimiter
if nargin<2, delimiter=uint8(10); end
%Convert to a uint8 row vector if x is a char array or a column vector
x=reshape(uint8(x),1,[]);
%Terminate the last sequence if the input does not end with a delimiter
if isempty(x) || x(end)~=delimiter, x(end+1)=delimiter; end
%Locations of the delimiters (ends of sequences)
e=find(x==delimiter);
%Locations of the starts of sequences
s=[1 e(1:end-1)+1];
%Lengths of sequences; shift ends by 1 to drop the delimiters
c=e-s; e=e-1;
%Max sequence length m and the number of sequences n
m=max(c); n=numel(s);
%Preallocate the padded matrix
y=zeros(n,m,'uint8');
%Load the sequences into the padded matrix
for i=1:n
y(i,1:c(i))=x(s(i):e(i));
end
end
It is only slightly slower than:
y=pad(split(string(char(x))),"0");
but if I need to convert back to uint8 via y=uint8(char(y)), then the function above, which operates on uint8 only (without the need to expand to char or string), is the fastest. Do you see any way to parallelize it? It seems to me that breaking up, distributing, and collecting the data would eat up any benefits.
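As a possible alternative to parallelizing, the for loop in vec2pmtx could be replaced by a single masked assignment. The following is only a sketch under stated assumptions: the toy input x is hypothetical, the delimiter is fixed to newline (10), and the input is assumed to end with a delimiter. Whether it beats the loop on real data would need to be measured.

```matlab
% Hypothetical toy input: three newline-terminated sequences in one vector
x = uint8(['ACG' 10 'T' 10 'GGAT' 10]);
d = uint8(10);                    % delimiter (newline)

e = find(x == d);                 % positions of the delimiters
len = diff([0 e]) - 1;            % length of each sequence (1-by-n)
m = max(len);                     % longest sequence
n = numel(e);                     % number of sequences

yt = zeros(m, n, 'uint8');        % one sequence per column, zero-padded
mask = (1:m).' <= len;            % m-by-n logical: true where data belongs
yt(mask) = x(x ~= d);             % all payload bytes, in their original order
y = yt.';                         % transpose to one sequence per row
```

Filling column by column matches MATLAB's column-major storage. The final transpose is only there to reproduce the n-by-m output of vec2pmtx and can be dropped if one sequence per column is acceptable.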
Jan on 2022-10-13 (edited by Jan on 2022-10-13)
@dymitr ruta: Now the input is a "read vector" and can be char or uint8. In the question you mentioned a cell array. I thought of posting a C-Mex function, but as long as the type of the input is not clear, this would be a waste of time in 50% of the cases.
So please post a small example of the input data and the wanted output. It matters if you want the row or column order.
A hint: cellfun(@numel,c) is slower than cellfun('prodofsize',c).
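A small sketch of that hint, using a hypothetical toy cell array: the legacy name 'prodofsize' returns the same element counts as @numel, and timeit can be used to compare the two on real data. No timings are claimed here.

```matlab
% Hypothetical toy input: variable-length uint8 row vectors
c = {uint8([65 67 71]); uint8(84); uint8([71 71 65 84])};

len_fast = cellfun('prodofsize', c);   % legacy built-in option
len_ref  = cellfun(@numel, c);         % function-handle version

% Compare the run times on your own data
t_fast = timeit(@() cellfun('prodofsize', c));
t_ref  = timeit(@() cellfun(@numel, c));
```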


Answers (0)

Release

R2022b
