Part of the overhead is the error checking. But rgb2hsv is also just more complicated than rgb2gray. rgb2gray is a simple weighted sum, A * r + B * g + C * b for fixed scalar doubles A, B, C. rgb2hsv has to work with mins and maxes and use different formulae depending on which channel is the max, and it has a bunch of edge cases to take care of.
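To make the difference concrete, here is a rough sketch of the per-pixel arithmetic, not the actual source of either function. It assumes a double RGB image with values in [0, 1], uses the approximate BT.601 weights 0.299, 0.587, 0.114 for A, B, C, and uses the textbook HSV formulas. The variable names are made up for illustration.

```matlab
rgb = rand(4, 5, 3);                      % small test image, double in [0, 1]
r = rgb(:,:,1); g = rgb(:,:,2); b = rgb(:,:,3);

% Grayscale: one weighted sum per pixel (weights are approximate).
grayManual = 0.299 * r + 0.587 * g + 0.114 * b;

% HSV: needs the per-pixel max and min, plus a branch on which channel is the max.
v = max(rgb, [], 3);                      % value
m = min(rgb, [], 3);
d = v - m;                                % chroma
s = d ./ v;
s(v == 0) = 0;                            % edge case: black pixel, avoid 0/0

h = zeros(size(v));                       % edge case: d == 0 (gray pixel) leaves hue at 0
isR = (d > 0) & (v == r);
isG = (d > 0) & ~isR & (v == g);
isB = (d > 0) & ~isR & ~isG;
h(isR) = mod((g(isR) - b(isR)) ./ d(isR), 6);
h(isG) = 2 + (b(isG) - r(isG)) ./ d(isG);
h(isB) = 4 + (r(isB) - g(isB)) ./ d(isB);
h = h / 6;                                % scale hue to [0, 1)

hsvManual = cat(3, h, s, v);
```

Even in vectorized form the HSV path needs several masks and passes over the data, whereas the gray path is a single expression.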
If you have the memory, then you can do it with no loops (this assumes vid is rows x columns x 3 x frames):
T = reshape( permute(vid, [1 2 4 3]), size(vid,1), size(vid,2) * size(vid,4), size(vid,3));
hsvVid = permute(reshape( rgb2hsv(T), size(vid,1), size(vid,2), size(vid,4), size(vid,3) ), [1 2 4 3]);
greyVid = reshape( rgb2gray(T), size(vid,1), size(vid,2), size(vid,4));
This rearranges the frames as if they were a single, much wider RGB image, does one conversion on that (so only one call's overhead instead of one per frame), and re-forms the result back into a series of frames.
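If you want to convince yourself that the reshuffling puts every pixel back where it belongs, a quick check against the plain per-frame loop could look like the following. This is only a sketch: the random test video and the names hsvLoop, greyLoop, nRows, nCols, nChan, nFrames are made up for the test, and vid is assumed to be rows x columns x 3 x frames.

```matlab
vid = rand(6, 8, 3, 5);                        % tiny random "video"
[nRows, nCols, nChan, nFrames] = size(vid);

% Batched version: lay the frames side by side as one wide RGB image.
T = reshape(permute(vid, [1 2 4 3]), nRows, nCols * nFrames, nChan);
hsvVid = permute(reshape(rgb2hsv(T), nRows, nCols, nFrames, nChan), [1 2 4 3]);
greyVid = reshape(rgb2gray(T), nRows, nCols, nFrames);

% Reference version: one call per frame.
hsvLoop = zeros(nRows, nCols, nChan, nFrames);
greyLoop = zeros(nRows, nCols, nFrames);
for k = 1:nFrames
    hsvLoop(:,:,:,k) = rgb2hsv(vid(:,:,:,k));
    greyLoop(:,:,k) = rgb2gray(vid(:,:,:,k));
end

% Both conversions are per-pixel, so the results should agree.
maxHsvDiff = max(abs(hsvVid(:) - hsvLoop(:)));
maxGreyDiff = max(abs(greyVid(:) - greyLoop(:)));
fprintf('max HSV difference: %g, max gray difference: %g\n', maxHsvDiff, maxGreyDiff);
```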
However, the memory-shuffling time (each permute copies the whole array) could easily amount to more than the overhead of all of the calls.
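The only way to know which version wins on your machine is to time both. Below is a minimal tic/toc sketch for the HSV conversion; the video size is an arbitrary assumption, and for serious measurements you would want to repeat the runs (or use timeit) rather than trust a single pass.

```matlab
vid = rand(120, 160, 3, 50);                   % assumed size, rows x columns x 3 x frames
[nRows, nCols, nChan, nFrames] = size(vid);

% Batched: two permutes plus a single rgb2hsv call.
tic;
T = reshape(permute(vid, [1 2 4 3]), nRows, nCols * nFrames, nChan);
hsvBatch = permute(reshape(rgb2hsv(T), nRows, nCols, nFrames, nChan), [1 2 4 3]);
tBatch = toc;

% Looped: one rgb2hsv call per frame, no reshuffling.
tic;
hsvLoop = zeros(nRows, nCols, nChan, nFrames);
for k = 1:nFrames
    hsvLoop(:,:,:,k) = rgb2hsv(vid(:,:,:,k));
end
tLoop = toc;

fprintf('batched: %.4f s, looped: %.4f s\n', tBatch, tLoop);
```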
