You will no longer use the speech frames as the input of the second DAE:
-first: you encode your input frames with the first DAE.
-second: you use the encoded inputs (the new representation of the inputs, i.e. the hidden-layer activations) as the input of the next DAE.
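
The two steps above can be sketched roughly as follows. This is only a minimal NumPy illustration, independent of the code mentioned in this answer. The function names (train_dae, encode, train_stacked_dae), the masking noise, the sigmoid encoder with a linear decoder, the layer sizes and the random "frames" are all assumptions made for the example, not part of the original method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(X, n_hidden, noise=0.2, lr=0.1, epochs=50, seed=0):
    """Train one denoising autoencoder and return its encoder (W, b)."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden)); b = np.zeros(n_hidden)  # encoder
    V = rng.normal(0, 0.1, (n_hidden, n_in)); c = np.zeros(n_in)      # decoder
    for _ in range(epochs):
        # Corrupt the input by zeroing a random fraction of the entries.
        X_noisy = X * (rng.random(X.shape) > noise)
        H = sigmoid(X_noisy @ W + b)      # hidden representation
        R = H @ V + c                     # reconstruction of the clean input
        err = (R - X) / len(X)            # gradient of the squared error w.r.t. R
        dH = (err @ V.T) * H * (1 - H)    # backpropagate through the sigmoid
        V -= lr * H.T @ err
        c -= lr * err.sum(axis=0)
        W -= lr * X_noisy.T @ dH
        b -= lr * dH.sum(axis=0)
    return W, b

def encode(X, W, b):
    return sigmoid(X @ W + b)

def train_stacked_dae(X, layer_sizes, **kwargs):
    """Greedy layer-wise training: each DAE is trained on the previous codes."""
    params, inputs = [], X
    for n_hidden in layer_sizes:
        W, b = train_dae(inputs, n_hidden, **kwargs)
        params.append((W, b))
        inputs = encode(inputs, W, b)     # encoded inputs feed the next DAE
    return params, inputs

# Placeholder data standing in for speech frames (200 frames, 39 features).
frames = np.random.default_rng(1).random((200, 39))
params, codes = train_stacked_dae(frames, [32, 16])
```

Only the first DAE ever sees the frames; the second one is trained on the 32-dimensional codes produced by the first.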
You can modify my code so that it can train stacked DAEs; it is very fast. If you liked it, please rate it and give us your opinion.