What's the Largest Recorded Data Set Ever Used with MATLAB?
Does The Mathworks have any information concerning the largest known data sets to have ever been used with MATLAB?
I'm now working with extremely large matrices and am trying to determine what kind of hardware I would need to continue my work. The trial-and-error approach is proving expensive, so it would help to have realistic expectations.
I once believed that as long as I had a computer with enough RAM to hold my matrices, I'd be set. But now I know that the achievable data flow between memory and the processors is a critical factor, along with the number of cores. Rather than dump more money into the latest GPUs and see what happens, I'd like to know what others have already done. Quite simply, if they're using better hardware than I could ever afford and are still struggling to meet similar goals, I'd like to know.
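For reference, here is the kind of back-of-the-envelope check I use to see whether a matrix even fits in RAM (the dimension below is only an illustration, not my actual problem size):

% Rough memory estimate for one dense double-precision matrix
n = 200000;                        % illustrative dimension only
bytesNeeded = n * n * 8;           % 8 bytes per double element
fprintf('%.1f GB for one %d x %d double matrix\n', bytesNeeded/2^30, n, n);
% Most operations (e.g. A*B or A\b) need workspace for two or three such copies.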
If The MathWorks, or anyone else, is aware of huge MATLAB projects and their outcomes, I'd like to hear about it. What are the sizes of their data sets, and what hardware are they using to reach their goals?
I've been unable to find hard numbers on matrix sizes, hardware being used, and corresponding run times. Any known cases, with specific numbers, would be greatly appreciated.
0 Comments
Accepted Answer
Image Analyst
2017-9-14
Just how big is your data? The biggest I've worked on personally is CT data of about 20 GB per image, though I've only worked on a slice at a time since it was unnecessary to have the whole thing in memory at one time. I've heard of people with mass spectrometry images that are about 100 GB. I think some people in geology (oil, gas, earthquakes, etc.) have 3-D images that are terabytes. NASA has dozens of probes and receives hundreds of terabytes every hour!
For really large images, you can use memmapfile() to get around having to have the whole image in memory at one time, though I personally have not used the function.
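A minimal sketch of how that could look; the file name, dimensions, and class below are placeholders, so adapt them to your data:

% Sketch only: map a raw file of uint16 slices without loading it all into RAM.
m = memmapfile('bigVolume.raw', ...
    'Format', {'uint16', [512 512 4000], 'vol'});   % rows x cols x slices (placeholders)
slice = m.Data.vol(:, :, 100);   % reads only this slice from disk
imshow(slice, []);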
3 Comments
Image Analyst
2017-9-14
Well, you could use HPE's supercomputer https://phys.org/news/2017-05-hp-enterprise-unveils-era-big.html which has 160 terabytes of RAM. That ought to do it.
I believe you can rent time on their grid computers: https://www.hpe.com/us/en/what-is/supercomputing.html
IBM https://en.wikipedia.org/wiki/Category:IBM_supercomputers also has about eight different supercomputers that you can send your job to. You run MATLAB on your own computer, which sends the job over the internet to their supercomputer in some other city; it runs the job and sends the results back. So jobs that would take weeks on your computer can finish in a couple of hours. And for all you know it could be running locally - the user doesn't see the data flying around the world, being processed on a supercomputer, and being sent back; it's all behind the scenes, as I understand it.
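If you have MATLAB Parallel Server access to such a cluster, submitting work remotely looks roughly like this (sketch only; the cluster profile name, the function, and the input are placeholders and depend entirely on how the cluster is set up):

% Assumes a cluster profile 'MyRemoteCluster' has been configured in MATLAB
% and that myBigComputation.m and inputData exist; both are placeholders here.
c = parcluster('MyRemoteCluster');
job = batch(c, @myBigComputation, 1, {inputData}, 'Pool', 31);
wait(job);                     % the work runs on the remote machines
results = fetchOutputs(job);   % bring the outputs back to your computer
delete(job);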
More Answers (2)
Jan
2017-9-14
Mentioning any matrix sizes will not be useful, because what matters is what you want to do with the matrix. Solving a huge linear system might not even be necessary if you can condense the matrix to a known fixed pattern; with some mathematics, a Core2Duo with 4 GB of RAM can beat a multi-core terabyte cluster. It matters whether the matrices are full or sparse, and whether the data can be represented as single for faster GPU access. It is important whether the processing can be parallelized and, if so, how this scales with the number of cores. Maybe the computations can be split into small pieces that fit into the processor cache. Even for a simple calculation of the matrix power M^n, the computing time depends remarkably on the implementation of the power operation.
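As a small illustration (my own sketch; timings will differ on your machine): M^n computed three ways, where the naive loop needs n-1 full matrix multiplications but repeated squaring needs only about log2(n):

% Illustration only: implementations of the matrix power M^n
M = rand(2000);                  % example size
n = 16;
tic; P = M^n; tBuiltin = toc     % MATLAB's built-in mpower
tic
Pnaive = M;
for k = 2:n
    Pnaive = Pnaive * M;         % naive: n-1 multiplications
end
tNaive = toc
tic                              % repeated squaring: ~log2(n) multiplications
R = eye(size(M)); Q = M; e = n;
while e > 0
    if mod(e, 2), R = R * Q; end
    Q = Q * Q;
    e = floor(e / 2);
end
tSquaring = toc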
Consequently, I think a list of matrix sizes and RAM/processor names cannot be useful. It depends on what you want to compute and with which kind of data.
You can always buy a computer which solves the problem in half the time. But an experienced programmer and mathematician might accelerate the code by a factor of 100 or 1000.
5 Comments
Jan
2017-9-15
To my surprise, MATLAB's logical indexing can be accelerated: https://www.mathworks.com/matlabcentral/fileexchange/49553-copymask - and this is only a single-threaded method for extracting the data, in your case C(b). Multi-threading and including the assignment to A(b) in the C-MEX function would not be hard.
My old Atari ST implemented such "BITBLIT" operations in hardware, separate from the CPU. But unfortunately its 1 MB of RAM will most likely not satisfy your needs. :-)
If A(b) = C(b) is a bottleneck in your code and the matrices have > 1e7 elements, a dedicated multi-threaded C-MEX function (or, better, some SSE/AVX code) could accelerate it by perhaps a factor of 3 (to be honest: give or take a factor of 3 - this can only be proven by implementing it) times the number of cores.
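A quick way to see whether this assignment really is the bottleneck is to time it in isolation (sketch only; the sizes are made up):

% Time the masked copy alone, away from the rest of the code
N = 1e8;                 % about 800 MB per double vector; adjust to your case
A = zeros(N, 1);
C = rand(N, 1);
b = C > 0.5;             % logical mask, roughly half true
tic
A(b) = C(b);             % the operation in question
toc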
Multi-threading can also slow the code down if it is applied in the wrong way, e.g. through cache-line collisions (write access to the same 64-byte block). For parallelization it matters whether the problem can be processed in shared or distributed memory, and, for the latter, how the machines are connected.
Before you buy a $20,000 computer, employ a programmer with experience in parallel processing. Perhaps one month of work and a pool of 10 machines for 500 € will be faster in the end.
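For the shared-memory case, a parfor over independent chunks is the simplest starting point (sketch only, and processBlock is a placeholder for your own work; whether it helps depends on whether the chunks really are independent):

% Assumes the work splits into independent blocks
pool = gcp;                          % start or reuse the local parallel pool
nBlocks = pool.NumWorkers;
results = cell(1, nBlocks);
parfor k = 1:nBlocks
    results{k} = processBlock(k);    % placeholder for the real per-block work
end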