Efficient training of LSTM network with GPU

13 views (last 30 days)
Hi all,
I recently got a computer with a GPU and am currently refactoring my LSTM code to take advantage of it. However, my implementation shows no speed improvement; in fact, the CPU version runs faster than the GPU version. The code below benchmarks the basic LSTM forward pass for comparison. Could anyone advise on how to exploit the GPU's potential for an LSTM? I tried pagefun, arrayfun, and bsxfun, but none of them improved the speed.
This one is for the GPU.
function LSTM_gpu2()
    vis = 700; hid = 500;            % visible (input) and hidden layer sizes
    T = 80; epochs = 10;             % sequence length and number of epochs
    sigmoid = @(x) 1./(1+exp(-x));
    x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
    % Input, recurrent, peephole, and bias weights, created directly on the GPU
    W_z = rand(hid,vis,'gpuArray'); W_i = rand(hid,vis,'gpuArray');
    W_f = rand(hid,vis,'gpuArray'); W_o = rand(hid,vis,'gpuArray');
    R_z = rand(hid,hid,'gpuArray'); R_i = rand(hid,hid,'gpuArray');
    R_f = rand(hid,hid,'gpuArray'); R_o = rand(hid,hid,'gpuArray');
    P_i = diag(rand(hid,1,'gpuArray')); P_f = diag(rand(hid,1,'gpuArray'));
    P_o = diag(rand(hid,1,'gpuArray'));
    b_z = rand(hid,1,'gpuArray'); b_i = rand(hid,1,'gpuArray');
    b_f = rand(hid,1,'gpuArray'); b_o = rand(hid,1,'gpuArray');
    % Gate activations for all time steps
    I = zeros(hid,T,'gpuArray'); F = zeros(hid,T,'gpuArray');
    O = zeros(hid,T,'gpuArray'); G = zeros(hid,T,'gpuArray');
    x = gpuArray(x); h = gpuArray(h); c = gpuArray(c);  % move states to the GPU
    tic;
    for i = 1:epochs
        for t = 1:T
            % One LSTM step: block input, input/forget gates, cell state,
            % output gate, hidden state
            G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);
            I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);
            F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);
            c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
            O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o);
            h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
        end
        %% backprop
        %% update
    end
    toc;
end
And this one is for the CPU.
function LSTM_cpu()
    vis = 700; hid = 500;            % visible (input) and hidden layer sizes
    T = 80; epochs = 10;             % sequence length and number of epochs
    sigmoid = @(x) 1./(1+exp(-x));
    x = rand(vis,1,T); h = zeros(hid,1,T+1); c = h;
    % Input, recurrent, peephole, and bias weights on the CPU
    W_z = rand(hid,vis); W_i = rand(hid,vis);
    W_f = rand(hid,vis); W_o = rand(hid,vis);
    R_z = rand(hid,hid); R_i = rand(hid,hid);
    R_f = rand(hid,hid); R_o = rand(hid,hid);
    P_i = diag(rand(hid,1)); P_f = diag(rand(hid,1));
    P_o = diag(rand(hid,1));
    b_z = rand(hid,1); b_i = rand(hid,1);
    b_f = rand(hid,1); b_o = rand(hid,1);
    I = zeros(hid,T); F = zeros(hid,T);
    O = zeros(hid,T); G = zeros(hid,T);
    tic;
    for i = 1:epochs
        for t = 1:T
            G(:,t) = tanh(W_z*x(:,:,t) + R_z*h(:,:,t) + b_z);
            I(:,t) = sigmoid(W_i*x(:,:,t) + R_i*h(:,:,t) + P_i*c(:,:,t) + b_i);
            F(:,t) = sigmoid(W_f*x(:,:,t) + R_f*h(:,:,t) + P_f*c(:,:,t) + b_f);
            c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
            O(:,t) = sigmoid(W_o*x(:,:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1) + b_o);
            h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
        end
        %% backprop
        %% update
    end
    toc;
end
OS: Windows 10,
GPU: NVIDIA Quadro M5000,
CPU: Intel i7-5820K,
MATLAB: R2016a
Thank you,
Yuto Ozaki
1 Comment
Yuto Ozaki 2016-4-10
Edited: Yuto Ozaki 2016-4-10
Additional question:
Some papers [1][2] use an affine-transform notation to realize a more compact calculation, but they do not use peephole connections. In fact, Chainer's LSTM model does not implement peephole connections, and TensorFlow provides LSTM models both with and without them. To pursue computational efficiency, would omitting peepholes be the current best practice? If a model does not include peepholes, all of the affine transforms can be done at once, and I think that leads to more GPU-friendly code (see the sketch after the references below).
[1] Kelvin Xu, et al.: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
[2] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals: Recurrent Neural Network Regularization (2014)
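For illustration, here is a minimal, self-contained sketch of that fused, peephole-free formulation (my own example, not taken from either paper; the variable names and sizes are assumptions). Because there are no peephole terms, all four gate pre-activations come out of a single affine transform of the concatenated input and previous hidden state, i.e. one large matrix multiply per time step:
% Sketch: one peephole-free LSTM step with all four gates fused into a
% single affine transform. W is (4*hid) x (vis+hid); b is (4*hid) x 1.
vis = 700; hid = 500;
sigmoid = @(x) 1./(1+exp(-x));
W = rand(4*hid, vis+hid, 'gpuArray');
b = rand(4*hid, 1, 'gpuArray');
x_t    = rand(vis, 1, 'gpuArray');       % input at time t
h_prev = zeros(hid, 1, 'gpuArray');      % previous hidden state
c_prev = zeros(hid, 1, 'gpuArray');      % previous cell state
a   = W*[x_t; h_prev] + b;               % one GEMM for all four gates
z_t = tanh(a(1:hid));                    % block input
i_t = sigmoid(a(hid+1:2*hid));           % input gate
f_t = sigmoid(a(2*hid+1:3*hid));         % forget gate
o_t = sigmoid(a(3*hid+1:4*hid));         % output gate
c_t = z_t.*i_t + c_prev.*f_t;            % new cell state
h_t = tanh(c_t).*o_t;                    % new hidden state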


Accepted Answer

Joss Knight 2016-4-15
To get good performance out of the GPU, you need to give it a lot of data to process. Your best bet is to vectorize your code to remove the inner loop. Your sigmoid and tanh activation functions, for instance, are element-wise operators and so should vectorize trivially, while your matrix multiplies can be executed in batch using pagefun.
Alternatively, have you considered using the new Deep Learning features in the Neural Network Toolbox in MATLAB R2016a, or the free third-party deep learning solution MatConvNet?
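To make that vectorization concrete, here is a minimal sketch of one such change (my own illustration, not code from this answer, reusing the variable names from the GPU function in the question). The input-to-hidden products W_*x do not depend on the recurrence, so they can be hoisted out of the time loop and computed for all T steps in four large multiplies; only the recurrent and peephole terms must stay inside the sequential loop. bsxfun handles the bias broadcast, since R2016a predates implicit expansion:
% Sketch: replaces the inner t-loop of LSTM_gpu2 above.
% Precompute the input-to-hidden terms for all time steps at once.
X  = reshape(x, vis, T);                 % vis x T, all inputs as one matrix
Zx = bsxfun(@plus, W_z*X, b_z);          % hid x T, one large GEMM each
Ix = bsxfun(@plus, W_i*X, b_i);
Fx = bsxfun(@plus, W_f*X, b_f);
Ox = bsxfun(@plus, W_o*X, b_o);
for t = 1:T
    % Only the sequential (recurrent and peephole) terms remain per step
    G(:,t) = tanh(Zx(:,t) + R_z*h(:,:,t));
    I(:,t) = sigmoid(Ix(:,t) + R_i*h(:,:,t) + P_i*c(:,:,t));
    F(:,t) = sigmoid(Fx(:,t) + R_f*h(:,:,t) + P_f*c(:,:,t));
    c(:,:,t+1) = G(:,t).*I(:,t) + c(:,:,t).*F(:,t);
    O(:,t) = sigmoid(Ox(:,t) + R_o*h(:,:,t) + P_o*c(:,:,t+1));
    h(:,:,t+1) = tanh(c(:,:,t+1)).*O(:,t);
end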
2 Comments
Yuto Ozaki 2016-4-16
Joss,
Thank you for your reply. I just tried training with larger mini-batches, and that made the GPU version around 35% faster (a rough sketch of what I mean is below). However, I think removing the inner loop over time would be challenging: an RNN feeds each step from the previous step's state, so the sequential for-loop is essential to the algorithm.
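% Sketch of one LSTM step over a mini-batch (assumed B = 128, not my
% exact code; reuses the weights and sigmoid from the GPU function above).
% With one column per sample, every multiply is a GEMM over the whole batch.
B   = 128;
X_t = rand(vis, B, 'gpuArray');          % inputs at time t, one column per sample
H   = zeros(hid, B, 'gpuArray');         % previous hidden states
C   = zeros(hid, B, 'gpuArray');         % previous cell states
G_t = tanh(bsxfun(@plus, W_z*X_t + R_z*H, b_z));
I_t = sigmoid(bsxfun(@plus, W_i*X_t + R_i*H + P_i*C, b_i));
F_t = sigmoid(bsxfun(@plus, W_f*X_t + R_f*H + P_f*C, b_f));
C   = G_t.*I_t + C.*F_t;                 % new cell states for the whole batch
O_t = sigmoid(bsxfun(@plus, W_o*X_t + R_o*H + P_o*C, b_o));
H   = tanh(C).*O_t;                      % new hidden states for the whole batch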
I have checked the Neural Network Toolbox, but it does not seem to implement RNNs yet. My main interest is music information retrieval, so time-series models such as RNNs and their variants are my main focus.
Joss Knight 2016-4-20
Support for RNNs is considered high priority by the development team. Meanwhile, take a look at MatConvNet.


More Answers (0)
