Unreasonably Large MAT File

AdiKaba 2020-8-21
I am applying a custom-developed, transform-domain-based compression algorithm to compress a data file. The algorithm performs as expected: an acceptable compression ratio is achieved, and the original data is reconstructed with small error. I attached plots of the original and reconstructed data. There are no issues with the algorithm's performance. However, I am having issues when I save the reconstructed data to disk as a MAT file. The input data is about 7 MB on disk, while the reconstructed data takes more than 30 MB of space on the hard disk.
The attached plots show two different input data sets along with the corresponding reconstructed data sets. To save the reconstructed data to a MAT file, I used MATLAB's "save" command.
save test1.mat reconstructedData;
save test2.mat inputData; % I did this just to verify that the MAT file has the same size as the input MAT file.
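For reference, a quick way to compare the in-memory sizes of the two variables with the sizes of the saved files (using the variable and file names above):
whos inputData reconstructedData                % bytes each variable occupies in memory
d1 = dir('test1.mat');  d2 = dir('test2.mat');  % sizes of the saved MAT files
fprintf('test1.mat (reconstructed): %d bytes\n', d1.bytes);
fprintf('test2.mat (input):         %d bytes\n', d2.bytes);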
Why is the reconstructedData much larger on disk than the input data even though the plots tell a different story?
  21 Comments
Walter Roberson 2020-8-26
" IDWT synthesizes the decomposed input wavelet coefficients into the time domain giving a reconstructed signal that is similar to the input signal."
Yes.
"This gives a more compact representation of the signal that can be represented with less number of bits compared to the original signal."
The DWT often has that property, but the IDWT does not.
"That is why I am expecting the file size to decrease (not the data type) by a factor of the compression ratio."
Is your file size 4186913 the original signal, or is it the DWT version of the signal, or is it the reconstructed version of the signal?
"That is why I am expecting the file size to decrease"
File size is determined by how much compression zlib can find for the data, which is a different matter than the "information content" (entropy) of the data.
Consider, for example, a 17 Hz sine wave with no phase delay, sampled at 5 megahertz: the "information content" is the fundamental frequency, the sample rate, and the number of samples. If you were in a situation where the only permitted fundamentals were the integers 0 to 31, the only permitted sampling rates were integer megahertz 0 to 7, and the only permitted lengths were "full cycles" 0 to 255, then the "information content" would be only 16 bits (5 bits for the fundamental, 3 for the sampling rate, 8 bits for the number of cycles).
The compression available through a dictionary technique such as the one zlib uses would be at most two copies of each y value (one for the rise, one for the fall) per full cycle -- not very good. zlib does not even attempt mathematical calculations to predict values.
A discrete Fourier transform (fft) of such a signal would, to within round-off, show a single non-zero value at 17 Hz and (for the two-sided transform) at -17 Hz, and if you used find() to locate those you could arrive at a fairly compact representation.
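A minimal sketch of that fft()/find() idea (the tolerance used to decide which bins count as non-zero is just an illustrative choice):
fs = 5e6;                                  % 5 MHz sample rate
t  = (0:fs-1)/fs;                          % one second of samples
x  = sin(2*pi*17*t);                       % 17 Hz sine, no phase delay
X  = fft(x);
idx = find(abs(X) > 1e-6*max(abs(X)));     % only the 17 Hz bin and its mirror survive
% Storing [idx, X(idx)] together with fs and numel(x) is a far more compact
% description of this signal than the 5e6 raw samples.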
Wavelet transform of the same signal... it would depend which wavelet you choose. The tests I did just now found some that could do a 2:1 compression (cd was small enough to potentially be all zero) but I did not encounter any that could do better.
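For instance, something along these lines (the 'db2' wavelet is only one illustrative choice; other wavelets behave differently):
fs = 5e6;
x  = sin(2*pi*17*(0:fs-1)/fs);             % same heavily oversampled 17 Hz sine
[cA, cD] = dwt(x, 'db2');                  % single-level DWT (Wavelet Toolbox)
fprintf('max |cD| = %g\n', max(abs(cD)));  % detail coefficients are tiny
% cD could be stored as "all zeros", leaving only cA at half the original
% length -- roughly the 2:1 compression mentioned above.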
You are confusing different representations of the data in your signal with the information content of the data.
And you are also confused in thinking that a 5:1 amplitude reduction makes a difference in the information content. There is as much information in the line segment between 1 and 2 as there is between 1/5 and 2/5 (infinite information if you are talking about real numbers). IEEE 754 floating point representation does not use fewer bits for a value that is 1/5th of the original.
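For instance (num2hex shows the raw 64-bit pattern of a double):
num2hex(1.0)    % '3ff0000000000000' -- 64 bits
num2hex(0.2)    % '3fc999999999999a' -- still 64 bits, just a different pattern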
Walter Roberson 2020-8-26
"I didn't accept your explanation because your reasoning regarding the wavelet transform is not mathematically correct."
What I wrote about idwt is,
"When you do your idwt you are spreading information across your data in a way that does not happen to align nicely with dictionary compression."
I am distinguishing between information and data.
Consider, for example, a wavelet that is square-shaped. If your data happens to be square waves with duty cycle 1/2, then the wavelet can compact the information into a small number of coefficients -- just enough to encode the width and length in a structured way. And similar to the discrete Fourier transform I described above, a lot of the coefficients might be zero, which would compress well with the dictionary-based compression scheme used by zlib (and so used by MATLAB) to store .mat files.
Then when you idwt(), the information ("square wave", amplitude, duty cycle, frequency, cycle count) gets spread out over the data that is the reconstructed square wave. And that data might not happen to compress nearly as well under the dictionary compression scheme as the wavelet transform did.
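A sketch of that contrast, assuming the Wavelet Toolbox is available (the wavelet, test signal, and threshold are illustrative choices, not your algorithm):
n = 2^20;
x = sin(2*pi*17*(0:n-1)/n);                 % smooth test signal
[cA, cD] = dwt(x, 'db4');
cD(abs(cD) < 1e-6*max(abs(x))) = 0;         % "compress": zero the negligible detail coefficients
xRec = idwt(cA, cD, 'db4');                 % any coefficient change is smeared across all of xRec
save coeffs.mat cA cD                       % the exact zeros in cD give zlib long runs to exploit
save recon.mat xRec                         % xRec is 2^20 full-precision doubles: little for zlib to reuse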
"Please note that I don't MATLAB to compress the variable while saving it."
Notice how compression of .mat files is on automatically for -v7 and -v7.3 files, unless you specifically ask for -nocompression. The 4186913 byte file size you are seeing is after MATLAB's zlib compression has been used.
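One way to see the effect (file names are just examples; -nocompression is only accepted together with the -v7.3 format):
save compressed.mat inputData                       % default: zlib compression is on
save raw.mat inputData -v7.3 -nocompression         % same variable, stored without compression
dc = dir('compressed.mat');  dr = dir('raw.mat');
fprintf('compressed: %d bytes, raw: %d bytes\n', dc.bytes, dr.bytes);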


Answers (1)

AdiKaba 2020-8-26
Edited: AdiKaba 2020-8-26
I understand you have an MVP status to protect, as there are some mathematical inaccuracies in your responses. You are referring to my comments as "confused", which I think is a bad choice of words. Again, I disagree with your comments; just because you wrote a long response doesn't mean it is correct. No confusion here. Good luck.
  1 Comment
Walter Roberson 2020-8-27
What result do you get when you save your inputSignal with -nocompression?
I firmly recommend the book Text Compression, by Bell, Cleary, and Witten (Prentice Hall, 1990), https://books.google.ca/books/about/Text_Compression.html for making clearer the difference between information content and representation.
Reducing the amplitude of a signal does not reduce its entropy (disorder, difficulty of predicting). Filtering can reduce entropy (but does not necessarily do so.)
You have 32000000 bytes of data that under LZSS+Huffman encoding (zlib) compresses to 4186913 bytes. You process the decompressed signal, and you expect the stored file to be at most 4186913 bytes and expect a 5:1 compression, so you are hoping for a file output on the order of 837383 bytes. But there is no certainty that the processing you do will happen to end up with something that compresses nicely under the LZSS+Huffman encoding scheme.
Let me give another example drawn from the Fourier transform (which, as I showed above, can in some cases be a way to get significant compression for some signals). Consider a 50% duty cycle square wave. That is potentially just bi-level: a number of zeros followed by the same number of ones, with the pattern repeated many times. The (non-discrete) Fourier transform of a square wave is an infinite series. Suppose we take the DFT, and now we process it, filtering out the 4/5 of coefficients that have the least absolute value. That would be compression under that model. Now ifft(). The result is not going to be a square wave: it is going to be a waveform with a lot of ringing on it, which does not lend itself nearly as well to LZSS+Huffman dictionary compression. The inaccuracies caused by the approximation get smeared out over all of the data when you ifft() to reconstruct.
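A rough sketch of that effect (because this discrete square wave's spectrum is already sparse, the lossy step here keeps only a handful of the strongest harmonics rather than exactly 1/5 of the bins; everything else is an illustrative choice):
x = repmat([ones(1,100) zeros(1,100)], 1, 500);     % bi-level square wave, repeated many times
X = fft(x);
[~, order] = sort(abs(X), 'descend');
keep = false(size(X));  keep(order(1:11)) = true;   % DC plus the 5 strongest harmonic pairs
X(~keep) = 0;                                       % the lossy "compression" step
xRec = ifft(X, 'symmetric');                        % reconstruction rings near every edge (Gibbs effect)
% x is long runs of identical values -- ideal for a dictionary coder.
% xRec is 100000 distinct doubles, so zlib finds far less to reuse.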
Likewise, wavelets are based upon repeated shapes at different amplitudes and frequencies. Wavelets do not describe individual samples in the signal: they calculate amplitudes at different frequencies that, when reconstructed, try to approximate the signal well, and any change in the coefficients (by zeroing them for compression) gets propagated as a subtle change across the entire signal. But "well" for reconstruction is not measured by exact reconstruction: it is measured by error in reconstruction. And although the SSE of the reconstruction may be small, and the wavelet coefficients might be an excellent representation of the "interesting" information in the signal, that does not mean that the reconstructed signal is going to happen to be a good match for the LZSS+Huffman compression scheme that MATLAB automatically applies when you save files, unless you say to use -nocompression.
The processing you do might well have reduced the information in the signal in a way that is useful for your purpose. But that does not mean that the automatic compression that MATLAB uses (unless told not to) is a good match for the processed result. What it does mean is that you have the potential to write your own compression routine that does a good job on the signal.
For example, you might want to experiment with using fwrite() on the processed data (producing a 32000000-byte file), and then running gzip -9 on the binary file.
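For instance (the file name is illustrative; MATLAB's gzip() uses a default compression level, so running gzip -9 from a shell may do slightly better):
fid = fopen('recon.bin', 'w');
fwrite(fid, reconstructedData, 'double');    % raw 8-byte samples, no MAT-file framing
fclose(fid);
gzip('recon.bin');                           % writes recon.bin.gz
d = dir('recon.bin');  dz = dir('recon.bin.gz');
fprintf('raw: %d bytes   gzipped: %d bytes\n', d.bytes, dz.bytes);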
MATLAB is not adding overhead to the saved file: you just happen to be using an output signal that does not compress especially well with its built-in compression. And you can demonstrate whether MATLAB's compression is faulty by writing out the binary 32000000 bytes and putting it through some compression tools.

