What's the Largest Recorded Data Set Ever Used with MATLAB?

6 views (last 30 days)
Does The MathWorks have any information concerning the largest known data sets ever used with MATLAB?
I'm now working with extremely large matrices and am trying to determine what kind of hardware would be needed in order to continue my work. The trial-and-error approach is proving expensive, so it would help to understand realistic expectations.
I once believed that as long as I had a computer with enough RAM to hold my matrices, I'd be set. But now I know that the achievable data flow between memory and the processors is a critical factor, along with the number of cores. Rather than dump more money into the latest GPUs and see what happens, I'd like to know what others have already done. Quite simply, if they're using better hardware than I could ever afford and are still struggling to meet similar goals, I'd like to know.
If The MathWorks, or anyone else, is aware of huge MATLAB projects and their outcomes, I'd like to hear about it. What are the sizes of their data sets, and what hardware are they using to reach their goals?
I've been unable to find hard numbers on matrix sizes, hardware being used, and corresponding run times. Any known cases, with specific numbers, would be greatly appreciated.

Accepted Answer

Image Analyst 2017-9-14
Just how big is your data? The biggest I've worked on personally is CT data of about 20 GB per image, though I've only worked on a slice at a time since it was unnecessary to have the whole thing in memory at one time. I've heard of people with mass spectrometry images that are about 100 GB. I think some people in geology (oil, gas, earthquakes, etc.) have 3-D images that are terabytes. NASA has dozens of probes and receives hundreds of terabytes every hour!
For really large images, you can use memmapfile() to get around having to have the whole image in memory at one time, though I personally have not used the function.
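For example, a minimal memmapfile() sketch might look like the following (the file name, data type, and dimensions are placeholders, not taken from a real data set):

% Map a large binary volume on disk without loading it all into RAM.
% 'bigVolume.dat' and the [512 512 2000] int16 layout are hypothetical.
m = memmapfile('bigVolume.dat', ...
    'Format', {'int16', [512 512 2000], 'vol'});
slice = m.Data.vol(:, :, 100);   % only this slice is actually read from disk
imagesc(slice); colormap(gray); axis image;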
  3 Comments
Image Analyst 2017-9-14
Well, you could use HPE's supercomputer https://phys.org/news/2017-05-hp-enterprise-unveils-era-big.html, which has 160 terabytes of RAM. That ought to do it.
I believe you can rent time on their grid computers: https://www.hpe.com/us/en/what-is/supercomputing.html
IBM https://en.wikipedia.org/wiki/Category:IBM_supercomputers also has about eight different supercomputers that you can send your job to. You run MATLAB on your own computer, which sends the job over the internet to their supercomputer in some other city; it runs the job and sends the results back. So jobs that would take weeks on your computer can finish in a couple of hours. For all you know, it could be running locally - the user doesn't see the data flying around the world, being processed on a supercomputer, and being sent back - it's all behind the scenes, as I understand it.
Steven Evans 2017-9-16
Thanks so much for those links. Sending jobs and renting time on more powerful systems sounds like a good option. It's especially interesting to learn of the technological advances which are taking place in the computing industry.


More Answers (2)

Jan 2017-9-14
Mentioning any matrix sizes will not be useful on its own, because it matters what you want to do with the matrix. Solving a huge linear system might not even be necessary if you can condense the matrix to a known fixed pattern; with some mathematics, a Core2Duo with 4 GB of RAM can beat a multi-core cluster with terabytes of memory. It matters whether the matrices are full or sparse, and whether the data can be represented as single for faster GPU access. It is important whether the processing can be parallelized and, if so, how it scales with the number of cores. Maybe the computations can be split into small pieces that fit into the processor cache. Even for a simple calculation such as the matrix power M^n, the computing time depends remarkably on the implementation of the power operation.
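As a rough illustration of that last point, here is a small timing sketch (the matrix size and exponent are arbitrary, and the numbers will vary by machine):

M = rand(1500) / 1500;      % scaled so the powers stay well-behaved
n = 64;

tic;                         % naive: n-1 dense matrix multiplications
P1 = M;
for k = 2:n
    P1 = P1 * M;
end
tLoop = toc;

tic;                         % built-in operator, typically much faster for integer n
P2 = M^n;
tMpower = toc;

fprintf('loop: %.3f s   M^n: %.3f s   max difference: %g\n', ...
    tLoop, tMpower, max(abs(P1(:) - P2(:))));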
Consequently, I think a list of matrix sizes and RAM/processor names cannot be useful. It depends on what you want to compute, and with which kind of data.
You can always buy a computer which solves the problem in half the time. But an experienced programmer and mathematician might accelerate the code by a factor of 100 or 1000.
  5 Comments
Jan 2017-9-15
To my surprise, MATLAB's logical indexing can be accelerated: https://www.mathworks.com/matlabcentral/fileexchange/49553-copymask - and that is only a single-threaded method for extracting the data, in your case C(b). Multi-threading it and including the assignment to A(b) in the C-mex function would not be hard.
My old Atari ST implemented such "BITBLIT" operations in hardware, separate from the CPU. But unfortunately its 1 MB of RAM will most likely not satisfy your needs. :-)
If A(b) = C(b) is a bottleneck in your code and the matrices have > 1e7 elements, a specific multi-threaded C-mex (or better, some SSE/AVX code) could accelerate the code by perhaps a factor of 3 (to be honest: give or take, this can only be proven by implementing it) times the number of cores.
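For reference, a quick way to check whether this copy really is the bottleneck before writing any mex code (the array size and mask below are arbitrary test values; the run needs a few GB of free RAM):

N = 5e7;
A = zeros(N, 1);
C = rand(N, 1);
b = C > 0.5;                             % logical mask

tic;  A(b) = C(b);                       tLogical = toc;
tic;  idx = find(b);  A(idx) = C(idx);   tFind = toc;
fprintf('logical indexing: %.3f s   find + linear indexing: %.3f s\n', ...
    tLogical, tFind);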
Multi-threading can even slow the code down if it is applied the wrong way, e.g. through cache-line collisions (write access to the same 64-byte block). For parallelization it matters whether the problem can be processed in shared or distributed memory, and for the latter, how the machines are connected.
Before you buy a $20,000 computer, employ a programmer with experience in parallel processing. Perhaps one month of work and a pool of 10 machines for €500 is faster in the end.
Steven Evans 2017-9-16
Edited: Steven Evans 2017-9-16
Thanks so much, Jan, for the algorithm concerning indexing. I'm all for finding ways to simplify and be more efficient, so I'll definitely have to take a closer look at this mex procedure.
One of my earliest MATLAB posts was answered with a more elegant/efficient procedure than what I had thought of. I also learned of a few new MATLAB functions, as a result. So I definitely see what you're getting at.
Walter, your insight on graphics cards and their limitations with indexing capability is a tremendous help. You just saved me a lot of time and frustration, since I was seriously considering dumping more money into graphics cards.
If anyone's interested, I ran across this article featuring a graphics card designed to tackle the memory limitations Walter described. It's both interesting and expensive.



Edric Ellis 2017-9-15
This page has links to information about MATLAB's "big data" capabilities. In particular, tall arrays let you work with data that is too large to fit in memory by operating out-of-core on file-based data.
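A minimal sketch of the tall-array workflow looks like this (the folder, file pattern, and variable name are placeholders):

ds = datastore('hugeLogs/*.csv');        % file-backed datastore, nothing loaded yet
t  = tall(ds);                           % tall table, evaluated lazily
m  = mean(t.Temperature, 'omitnan');     % deferred, out-of-core computation
result = gather(m);                      % gather() triggers the passes over the files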
  3 Comments
Steven Evans 2017-9-16
Edited: Steven Evans 2017-9-16
Thanks, Edric. I ran across this earlier during my searches. It's great to know that MATLAB offers lots of ways to handle large data.
I thought I'd be on track to extremely fast processing once I bought enough RAM to hold my data. I figured that once all of the information was in RAM, my calculations wouldn't be hindered, so when I saw this idea of keeping portions of large data outside of memory, I steered away from it. The ironic thing is that keeping all of it in RAM still didn't solve my problem.
After reading Image Analyst's links about NASA and the huge amounts of information they go through, I'm certain they could benefit from something like this.
Thanks again for your time, insight and suggestions.
Since some of you mentioned empty matrices, I recently found it interesting that ones(0,2) isn't the same as ones(0,1).
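A quick check in the command window shows the difference (purely illustrative):

a = ones(0,1);              % 0-by-1 empty column vector
b = ones(0,2);              % 0-by-2 empty matrix
size(a)                     % ans = 0  1
size(b)                     % ans = 0  2
isempty(a) && isempty(b)    % true - both are empty, but their shapes differ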

