What's the Largest Recorded Data Set Ever Used with MATLAB?

6 views (last 30 days)
Does The MathWorks have any information concerning the largest known data sets ever used with MATLAB?
I'm now working with extremely large matrices and am trying to determine what kind of hardware would be needed in order to continue my work. The trial-and-error approach is proving expensive, so it would help to understand realistic expectations.
I once believed that as long as I had a computer with enough RAM to hold my matrices, I'd be set. But now I know that the achievable data flow between memory and the processors is a critical factor, along with the number of cores. Rather than dump more money into the latest GPUs and see what happens, I'd like to know what others have already done. Quite simply, if they're using better hardware than I could ever afford and are still struggling to meet similar goals, I'd like to know.
If The MathWorks, or anyone else, is aware of huge MATLAB projects and their outcomes, I'd like to hear about it. What are the sizes of their data sets, and what hardware are they using to reach their goals?
I've been unable to find hard numbers on matrix sizes, hardware being used, and corresponding run times. Any known cases, with specific numbers, would be greatly appreciated.

Accepted Answer

Image Analyst 2017-9-14
Just how big is your data? The biggest I've worked on personally is CT data of about 20 GB per image, though I've only worked on a slice at a time since it was unnecessary to have the whole thing in memory at one time. I've heard of people with mass spectrometry images that are about 100 GB. I think some people in geology (oil, gas, earthquakes, etc.) have 3-D images that are terabytes. NASA has dozens of probes and receives hundreds of terabytes every hour!
For really large images, you can use memmapfile() to get around having to have the whole image in memory at one time, though I personally have not used the function.
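For example, a minimal memmapfile() sketch might look like the following (the file name, data type, and dimensions are placeholders, not taken from a real data set):

% Map a large binary volume on disk without loading it all into RAM.
% 'bigVolume.dat' and the [512 512 2000] int16 layout are hypothetical.
m = memmapfile('bigVolume.dat', ...
    'Format', {'int16', [512 512 2000], 'vol'});
slice = m.Data.vol(:, :, 100);   % only this slice is actually read from disk
imagesc(slice); colormap(gray); axis image;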
  3 Comments
Image Analyst 2017-9-14
Well, you could use HPE's supercomputer https://phys.org/news/2017-05-hp-enterprise-unveils-era-big.html, which has 160 terabytes of RAM. That ought to do it.
I believe you can rent time on their grid computers: https://www.hpe.com/us/en/what-is/supercomputing.html
IBM https://en.wikipedia.org/wiki/Category:IBM_supercomputers also has about eight different supercomputers that you can send your job to. You run MATLAB on your own computer, which sends the job over the internet to their supercomputer in some other city; it runs the job and sends the results back. So jobs that would take weeks on your computer can finish in a couple of hours. For all you know, it could be running locally - the user doesn't see the data flying around the world, being processed on a supercomputer, and being sent back - it's all behind the scenes, as I understand it.
Steven Evans 2017-9-16
Thanks so much for those links. Sending jobs and renting time on more powerful systems sounds like a good option. It's especially interesting to learn of the technological advances which are taking place in the computing industry.


More Answers (2)

Jan 2017-9-14
Mentioning any matrix sizes will not be useful on its own, because it matters what you want to do with the matrix. Solving a huge linear system might not even be necessary if you can condense the matrix to a known fixed pattern; with some mathematics, a Core2Duo with 4 GB of RAM can beat a multi-core cluster with terabytes of memory. It matters whether the matrices are full or sparse, and whether the data can be represented as single for faster GPU access. It is important whether the processing can be parallelized and, if so, how it scales with the number of cores. Maybe the computations can be split into small pieces that fit into the processor cache. Even for a simple calculation such as the matrix power M^n, the computing time depends remarkably on the implementation of the power operation.
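As a rough illustration of that last point, here is a small timing sketch (the matrix size and exponent are arbitrary, and the numbers will vary by machine):

M = rand(1500) / 1500;      % scaled so the powers stay well-behaved
n = 64;

tic;                         % naive: n-1 dense matrix multiplications
P1 = M;
for k = 2:n
    P1 = P1 * M;
end
tLoop = toc;

tic;                         % built-in operator, typically much faster for integer n
P2 = M^n;
tMpower = toc;

fprintf('loop: %.3f s   M^n: %.3f s   max difference: %g\n', ...
    tLoop, tMpower, max(abs(P1(:) - P2(:))));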
Consequently, I think a list of matrix sizes and RAM/processor names cannot be useful. It depends on what you want to compute, and with which kind of data.
You can always buy a computer which solves the problem in half the time. But an experienced programmer and mathematician might accelerate the code by a factor of 100 or 1000.
  5 Comments
Jan 2017-9-15
To my surprise, MATLAB's logical indexing can be accelerated: https://www.mathworks.com/matlabcentral/fileexchange/49553-copymask - and that is only a single-threaded method for extracting the data, in your case C(b). Multi-threading it and including the assignment to A(b) in the C-mex function would not be hard.
My old Atari ST implemented such "BITBLIT" operations in hardware, separate from the CPU. But unfortunately its 1 MB of RAM will most likely not satisfy your needs. :-)
If A(b) = C(b) is a bottleneck in your code and the matrices have > 1e7 elements, a specific multi-threaded C-mex (or better, some SSE/AVX code) could accelerate the code by perhaps a factor of 3 (to be honest: give or take, this can only be proven by implementing it) times the number of cores.
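For reference, a quick way to check whether this copy really is the bottleneck before writing any mex code (the array size and mask below are arbitrary test values; the run needs a few GB of free RAM):

N = 5e7;
A = zeros(N, 1);
C = rand(N, 1);
b = C > 0.5;                             % logical mask

tic;  A(b) = C(b);                       tLogical = toc;
tic;  idx = find(b);  A(idx) = C(idx);   tFind = toc;
fprintf('logical indexing: %.3f s   find + linear indexing: %.3f s\n', ...
    tLogical, tFind);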
Multi-threading can even slow the code down if it is applied the wrong way, e.g. through cache-line collisions (write access to the same 64-byte block). For parallelization it matters whether the problem can be processed in shared or distributed memory, and for the latter, how the machines are connected.
Before you buy a $20,000 computer, employ a programmer with experience in parallel processing. Perhaps one month of work and a pool of 10 machines for €500 is faster in the end.
Steven Evans 2017-9-16
Edited: Steven Evans 2017-9-16
Thanks so much, Jan, for the algorithm concerning indexing. I'm all for finding ways to simplify and be more efficient, so I'll definitely have to take a closer look at this mex procedure.
One of my earliest MATLAB posts was answered with a more elegant/efficient procedure than what I had thought of. I also learned of a few new MATLAB functions, as a result. So I definitely see what you're getting at.
Walter, your insight on graphics cards and their limitations with indexing capability is a tremendous help. You just saved me a lot of time and frustration, since I was seriously considering dumping more money into graphics cards.
If anyone's interested, I ran across this article featuring a graphics card designed to tackle the memory limitations Walter described. It's both interesting and expensive.



Edric Ellis 2017-9-15
This page has links to information about MATLAB's "big data" capabilities. In particular, tall arrays let you work with data that is too large to fit in memory by operating out-of-core on file-based data.
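A minimal sketch of the tall-array workflow looks like this (the folder, file pattern, and variable name are placeholders):

ds = datastore('hugeLogs/*.csv');        % file-backed datastore, nothing loaded yet
t  = tall(ds);                           % tall table, evaluated lazily
m  = mean(t.Temperature, 'omitnan');     % deferred, out-of-core computation
result = gather(m);                      % gather() triggers the passes over the files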
  3 Comments
Steven Evans 2017-9-16
Edited: Steven Evans 2017-9-16
Thanks, Edric. I ran across this earlier during my searches. It's great to know that MATLAB offers lots of ways to handle large data.
I thought I'd be on track to extremely fast processing once I bought enough RAM to hold my data. I figured that once all of the information was in RAM, my calculations wouldn't be hindered, so when I saw this idea of keeping portions of large data outside of memory, I steered away from it. The ironic thing is that keeping all of it in RAM still didn't solve my problem.
After reading Image Analyst's links about NASA and the huge amounts of information they go through, I'm certain they could benefit from something like this.
Thanks again for your time, insight and suggestions.
Since some of you mentioned empty matrices, I recently found it interesting that ones(0,2) isn't the same as ones(0,1).
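A quick check in the command window shows the difference (purely illustrative):

a = ones(0,1);              % 0-by-1 empty column vector
b = ones(0,2);              % 0-by-2 empty matrix
size(a)                     % ans = 0  1
size(b)                     % ans = 0  2
isempty(a) && isempty(b)    % true - both are empty, but their shapes differ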

