One option is to combine a CNN with an LSTM. As long as the data is saved correctly in the MAT-file, you do not need to worry about the file format. Below is a video classification demo that may be relevant to your study.
The image features were extracted with a pre-trained network, and the resulting time series of features was classified using an LSTM (Long Short-Term Memory) network. A network trained on ImageNet was used as the feature extractor; if the domain you focus on is different, you might update its parameters, as in fine-tuning.
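
To make the data flow concrete, here is a minimal sketch of the same two-stage idea in Python with NumPy (not the demo's own code). Everything in it is an assumption for illustration: `extract_frame_features` is a hypothetical stand-in for the pre-trained CNN (it only average-pools each frame and applies a fixed projection), the weights are random and untrained, and the sizes are arbitrary. It only shows how per-frame features become a sequence that an LSTM summarizes into class probabilities.

```python
import numpy as np

def extract_frame_features(video, proj):
    """Hypothetical stand-in for a pre-trained CNN feature extractor.
    video: (T, H, W, C) array of frames -> (T, D) array of per-frame features."""
    pooled = video.mean(axis=(1, 2))      # global average pool per frame: (T, C)
    return np.tanh(pooled @ proj)         # fixed projection to D features: (T, D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(features, W, U, b):
    """Single-layer LSTM over a (T, D) sequence; returns the final hidden state.
    W: (D, 4H) input weights, U: (H, 4H) recurrent weights, b: (4H,) bias."""
    hidden_size = U.shape[0]
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x_t in features:
        gates = x_t @ W + h @ U + b       # all four gates at once: (4H,)
        i, f, g, o = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                 # update cell state
        h = o * np.tanh(c)                # update hidden state
    return h

def init_params(channels=3, feat_dim=8, hidden=16, num_classes=4, seed=0):
    """Random, untrained parameters; sizes are arbitrary assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "proj": rng.normal(scale=0.5, size=(channels, feat_dim)),
        "W": rng.normal(scale=0.1, size=(feat_dim, 4 * hidden)),
        "U": rng.normal(scale=0.1, size=(hidden, 4 * hidden)),
        "b": np.zeros(4 * hidden),
        "W_out": rng.normal(scale=0.1, size=(hidden, num_classes)),
        "b_out": np.zeros(num_classes),
    }

def classify_video(video, params):
    """Frames -> per-frame features -> LSTM summary -> softmax class probabilities."""
    feats = extract_frame_features(video, params["proj"])
    h = lstm_last_hidden(feats, params["W"], params["U"], params["b"])
    logits = h @ params["W_out"] + params["b_out"]
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Example: a synthetic 10-frame RGB clip of 32x32 pixels.
params = init_params()
video = np.random.default_rng(1).random((10, 32, 32, 3))
probs = classify_video(video, params)
print(probs)
```

In a real setup, the feature extractor would be replaced by activations from an actual pre-trained network, and only the LSTM and output layer would be trained at first. Fine-tuning the extractor itself corresponds to also updating `proj` in this sketch.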