Hi,
Let us break this work into two parts, say 'Code Re-Designing' and 'Algorithm Re-Designing'.
Part 1 :: Code Re-Designing
Firstly, you would need to profile the code to find the computationally expensive areas and then work on those. Secondly, if you have access to the Parallel Computing Toolbox, you can parallelize the tasks across individual workers.
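For example, the built-in profiler can be wrapped around one representative run (a sketch only; registerImages is just a stand-in name for your own entry-point function and arguments):

    profile on                                        % start collecting timing data
    registerImages(referenceImage, deformedImage);    % placeholder call to your own code
    profile viewer                                    % open the report and see which lines dominate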
The algorithm seems to be embarrassingly parallel, with no dependency between the data sets within an iteration, so I would expect a good speed-up after parallelizing. One approach would be to break the image into n blocks if you have n workers, let each worker work on its block independently, and finally rejoin the blocks to recover the full image. Please ignore this if the algorithm cannot be parallelized.
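A rough sketch of that blocking idea (it assumes a grayscale image, an open parallel pool, and a per-block routine; myImage.png and processBlock are placeholder names, not part of your code):

    img    = im2double(imread('myImage.png'));   % placeholder file name; assumed grayscale here
    nW     = 4;                                  % e.g. the number of workers in your pool
    edges  = round(linspace(0, size(img,1), nW+1));
    blocks = cell(nW, 1);
    for k = 1:nW
        blocks{k} = img(edges(k)+1:edges(k+1), :);   % split into one horizontal strip per worker
    end
    parfor k = 1:nW
        blocks{k} = processBlock(blocks{k});     % placeholder for your per-block processing
    end
    result = vertcat(blocks{:});                 % stitch the strips back into one image

In practice the strips may need to overlap by a few pixels so that the deformation measure near a strip boundary still sees its full neighbourhood.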
Also refer to the answer provided here for more information on this:
Part 2 :: Algorithm Re-Designing
a. You can resize the images to a smaller scale and then work on them. This would certainly reduce the quality of your result, but again it depends on your objective.
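For example (imresize is in the Image Processing Toolbox; the 0.5 scale factor and the variable names are only placeholders):

    scale    = 0.5;                             % example factor; tune it to your accuracy needs
    smallRef = imresize(referenceImage, scale); % run the matching loop on the reduced images ...
    smallDef = imresize(deformedImage,  scale);
    % ... and remember to rescale any measured displacements by 1/scale
    % when mapping the results back to the original resolution.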
b. Instead of looping over all the pixels during each iteration, consider marking the pixels of interest for the next iteration. What I mean is that only a certain set of pixels actually contributes to identifying the measurable deformation between the images; that set becomes the pixels of interest for the next iteration. You would need a measure or selection criterion for deciding which pixels to carry forward.
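One possible shape for this is a logical mask that is refreshed at the end of every iteration (a sketch; the intensity-difference measure, maxIter, and the tolerance tol are placeholders for whatever selection criterion fits your actual deformation measure):

    mask   = true(size(referenceImage));          % first pass: every pixel is active
    change = zeros(size(referenceImage));
    for iter = 1:maxIter
        idx = find(mask);                         % linear indices of the active pixels
        for p = idx.'                             % loop only over the pixels of interest
            % Placeholder per-pixel measure: absolute intensity difference.
            % Replace with your actual per-pixel deformation computation.
            change(p) = abs(double(deformedImage(p)) - double(referenceImage(p)));
        end
        mask = change > tol;                      % selection criterion for the next iteration
        if ~any(mask(:)), break; end              % stop when nothing is left to refine
    end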
Hope this helps. All the best!!!