I think the best approach here is to vectorise your code so that you neither call fft in a loop nor index the gpuArray in a loop. (Indexing gpuArray data is often relatively slow.) In this case, you can vectorise by forming a matrix whose columns are the sub-vectors you want to transform; fft then operates down the columns in a single call, like so:
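% Parameters
ll = 2^16;      % length of the input data
ww = 256;       % length of each sub-vector (window)
ol = ll-ww+1;   % number of sub-vectors, one per starting position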
% Build the input data
dataGpu = gpuArray.rand(1, ll);
% Create a ww-by-ol index matrix to use with dataGpu; column k holds
% the indices k:(k+ww-1)
idxMat = bsxfun(@plus, (1:ww)', 0:(ol-1));
% Index dataGpu to form a matrix where each column is a sub-vector
% of dataGpu
dataGpuXform = dataGpu(idxMat);
% Make a single vectorised call to fft
out = fft(dataGpuXform);
On my rather old Tesla C2070 GPU, the fft call completes in 0.09 seconds.
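
To see what the index matrix is doing, here is a minimal sketch of the same approach with toy sizes, run on the CPU so it needs no GPU. The values of ll and ww, the names data, windows, ref and maxErr, and the loop used as a cross-check are all chosen for illustration only and are not part of the code above.

```matlab
% Toy sizes (assumed for illustration) so the matrices are easy to inspect
ll = 6;          % length of the input data
ww = 3;          % length of each sub-vector
ol = ll-ww+1;    % number of sub-vectors

data = (1:ll)*10;   % [10 20 30 40 50 60], plain CPU array

% Column k of idxMat holds the indices k:(k+ww-1)
idxMat = bsxfun(@plus, (1:ww)', 0:(ol-1));
% idxMat =
%      1     2     3     4
%      2     3     4     5
%      3     4     5     6

% Indexing a vector with a matrix returns a result shaped like the matrix,
% so each column of windows is one sub-vector of data
windows = data(idxMat);

% Single vectorised call; fft operates down each column
out = fft(windows);

% Cross-check against the loop-based version
ref = zeros(ww, ol);
for k = 1:ol
    ref(:, k) = fft(data(k:k+ww-1)).';
end
maxErr = max(abs(out(:) - ref(:)));
```

If maxErr comes out at or near zero, the vectorised call is producing the same result as the loop, one column per starting position.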