How can you set up very large tall arrays without running into swap/page-file issues?
I recently tried setting up a tall array, using the following (approximate) code:
ds=fileDatastore(filename,'ReadFcn',@mydataload);
dataTall=tall(ds);
The data file type is not one that MATLAB can handle natively, so I am using my own data reading function.
- The data is stored as binary IBM floats and represents a 2D array in which the first few bytes of every column are a header and the rest is data.
- I generally set up the loading to read a few columns at a time, but I wasn't sure how to get fileDatastore to do this, so it is set to load the whole file (headers are skipped during the read).
I have set the page file for my system to be as large as possible (3x the RAM; I ran this test on Windows, though I will also be running things on Linux/CentOS). Unfortunately, although I have enough RAM to make this work, I get an "out of pagefiles" error that forces a system reboot before the tall(ds) command finishes running.
Can someone please tell me what I am doing wrong and how to fix it? I REALLY hope that TMW didn't design this "big data" inspired function so that it is limited by page-file space, since that gives only a 100-200% maximum capacity boost over having everything in RAM. A 2-3x improvement is better than nothing, but it doesn't come close to being a feasible solution for most "big data" analysis...
Thank you in advance!
Answers (2)
Hatem Helal
2018-12-6
I think this problem could be nicely solved by implementing a custom datastore. The main idea is that your datastore needs to know how to read your large binary file incrementally, one portion at a time. If you want to use tall arrays with Parallel Computing Toolbox, you'll also need to consider how to partition the reading of these files. A typical strategy is to partition based on byte offsets. This makes it easier to implement the partition method of the matlab.io.datastore.Partitionable interface, but it requires that your reader knows how to seek to the first complete record/row of your dataset.
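A minimal sketch of such a datastore might look like the following. It is not code from the original poster or from MathWorks: the class name ColumnBinaryDatastore, its constructor arguments, the big-endian byte order, the 32-bit sample width, and the local helper ibm32ToDouble are all assumptions made for illustration. It implements only the basic matlab.io.Datastore methods (hasdata, read, reset, progress), so it covers serial reading; the Partitionable interface is left out.

```matlab
classdef ColumnBinaryDatastore < matlab.io.Datastore
    % Hypothetical datastore: reads a few columns of a large binary file
    % per call. Each file column is returned as one ROW, so successive
    % reads can be stacked vertically along the tall dimension.
    properties (Access = private)
        FileName          % path to the binary file
        HeaderBytes       % bytes of header at the start of each column (assumed fixed)
        SamplesPerColumn  % number of 32-bit samples per column (assumed fixed)
        ColumnsPerRead    % how many columns one call to read() returns
        NumColumns        % total columns, derived from the file size
        NextColumn        % 1-based index of the next unread column
    end
    methods
        function ds = ColumnBinaryDatastore(fileName, headerBytes, samplesPerColumn, columnsPerRead)
            ds.FileName = fileName;
            ds.HeaderBytes = headerBytes;
            ds.SamplesPerColumn = samplesPerColumn;
            ds.ColumnsPerRead = columnsPerRead;
            fileInfo = dir(fileName);
            ds.NumColumns = floor(fileInfo.bytes / (headerBytes + 4*samplesPerColumn));
            reset(ds);
        end
        function tf = hasdata(ds)
            tf = ds.NextColumn <= ds.NumColumns;
        end
        function [data, info] = read(ds)
            if ~hasdata(ds)
                error('ColumnBinaryDatastore:noMoreData', 'No more data to read.');
            end
            nCols = min(ds.ColumnsPerRead, ds.NumColumns - ds.NextColumn + 1);
            colBytes = ds.HeaderBytes + 4*ds.SamplesPerColumn;
            fid = fopen(ds.FileName, 'r', 'ieee-be');  % assumption: big-endian file
            if fid < 0
                error('ColumnBinaryDatastore:openFailed', 'Cannot open %s.', ds.FileName);
            end
            closer = onCleanup(@() fclose(fid));
            % Seek by byte offset to the first unread column.
            fseek(fid, (ds.NextColumn - 1)*colBytes, 'bof');
            data = zeros(nCols, ds.SamplesPerColumn);
            for k = 1:nCols
                fseek(fid, ds.HeaderBytes, 'cof');     % skip this column's header
                raw = fread(fid, ds.SamplesPerColumn, 'uint32=>uint32');
                data(k, :) = ibm32ToDouble(raw).';
            end
            info.FirstColumn = ds.NextColumn;
            info.NumColumns = nCols;
            ds.NextColumn = ds.NextColumn + nCols;
        end
        function reset(ds)
            ds.NextColumn = 1;
        end
    end
    methods (Hidden = true)
        function frac = progress(ds)
            frac = (ds.NextColumn - 1) / ds.NumColumns;
        end
    end
end

function x = ibm32ToDouble(u)
% Decode IBM single-precision hex floats stored as uint32 bit patterns:
% 1 sign bit, 7-bit base-16 exponent (bias 64), 24-bit fraction.
u = double(u);
signBit = floor(u / 2^31);
expo = mod(floor(u / 2^24), 128);
frac = mod(u, 2^24);
x = (1 - 2*signBit) .* (frac / 2^24) .* 16.^(expo - 64);
end
```

Assuming the class is saved as ColumnBinaryDatastore.m on the path, a call such as ds = ColumnBinaryDatastore(filename, 240, 1500, 1000) followed by dataTall = tall(ds) would read 1000 columns per chunk instead of the whole file. The header size 240 and sample count 1500 are placeholders, not values taken from the question. Because read() seeks by byte offset, the same arithmetic could later serve as the basis for a partition method.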
Edric Ellis
2017-7-4
Firstly, tall arrays are definitely not required to fit into RAM, or swap space, or anything like that. It is perfectly possible to use tall arrays with collections of data that are hundreds or thousands of gigabytes in size. Tall arrays are processed out-of-core by reading in whole files, or portions of files, one at a time.
I suspect the problem in your case is that fileDatastore is designed to read whole files at a time (whereas the datastore instances used with e.g. tabular text files know how to read portions of files at a time). So, if you have just a single huge file that you're using with fileDatastore, then this is likely to be the cause of your difficulty. If you can partition your input data file somehow and then use that with your fileDatastore, things should work better.
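One way to partition the input, sketched here under stated assumptions, is to copy the big file into several smaller files that each hold a whole number of columns. The function name splitByColumns, the chunk file naming, and the premise that every column occupies the same number of bytes (colBytes, header included) are illustrative assumptions rather than details from the question.

```matlab
function outFiles = splitByColumns(inFile, outDir, colBytes, colsPerChunk)
% Hypothetical helper: split a column-oriented binary file into chunk
% files of at most colsPerChunk columns each. Bytes are copied verbatim,
% so headers and sample encoding are preserved unchanged.
%
% Possible use afterwards (mydataload must then read ONE chunk file):
%   files = splitByColumns(filename, chunkDir, colBytes, 1000);
%   ds = fileDatastore(files, 'ReadFcn', @mydataload);
%   dataTall = tall(ds);

fileInfo = dir(inFile);
numCols = floor(fileInfo.bytes / colBytes);
numChunks = ceil(numCols / colsPerChunk);
outFiles = cell(numChunks, 1);

if ~exist(outDir, 'dir')
    mkdir(outDir);
end

fin = fopen(inFile, 'r');
if fin < 0
    error('splitByColumns:openFailed', 'Cannot open %s.', inFile);
end
closeIn = onCleanup(@() fclose(fin));

for c = 1:numChunks
    % The last chunk may hold fewer columns than the others.
    n = min(colsPerChunk, numCols - (c - 1)*colsPerChunk);
    bytes = fread(fin, n*colBytes, '*uint8');

    outFiles{c} = fullfile(outDir, sprintf('chunk_%05d.bin', c));
    fout = fopen(outFiles{c}, 'w');
    fwrite(fout, bytes, 'uint8');
    fclose(fout);
end
end
```

Only one chunk's worth of bytes is in memory at a time during the split, and fileDatastore then reads one chunk file per call to the read function, so memory use is bounded by colsPerChunk rather than by the size of the original file.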