Please help me create a tall array from a large binary file and "fileDatastore" without running out of memory.

I have a large data file (the particular file I am working with now is ~60 GB, though a few hundred GB is typical) that I want to create a tall array from. I am hoping this will allow me to quickly perform calculations on the data without loading it into memory. The data is in a custom format, so it seems I am stuck with using the custom fileDatastore approach.
Making the datastore is not a problem, but every time I try to load it I run out of pagefile memory (and I have already made my pagefile as big as possible on Windows 10). The issue seems to be that MATLAB requires temporarily loading the full datastore into memory before the tall array can be made. It would (supposedly) free up the memory after the file was completely read, but it never gets there. This is because I cannot find any way to tell the fileDatastore to read only part of the data at once. The other datastore types have a "ReadSize" property that seems to do this, but it is missing from fileDatastore's valid options. The read function (ReadFcn) I am using is set up to partially read the data correctly (I could easily tell it to read the next X values from the current position); I just don't know how to make fileDatastore pass along a second parameter carrying this information (the first parameter is the file name).
I imagine I could manually break the data up into separate datastores and then somehow combine them into the same tall array, but this 1) would be rather tedious to do every time I want to make a fileDatastore, and 2) would, I imagine, negatively impact the deferred evaluation feature, since (I'd guess) MATLAB would try to optimize reading the data from each small sub-datastore individually rather than optimizing for the whole data file. As such, I'd much rather find a way to do this from a single fileDatastore.
PS: If any MathWorks staff see this, please suggest that the development team fix this. Granted, I am using my personal computer for this, not some cluster with a terabyte of RAM, but it is kind of ridiculous that a computer with an i7 + 16 GB of RAM and MATLAB's "latest and greatest big data solution" can't manage to deal with a ~60 GB file without crashing the computer. I can't imagine it would take someone who is familiar with the source code more than a few hours to add an option along the lines of "pass this number to your read function so it can decide how much to read at a given time" (or something similar).
1 Comment
Hatem Helal on 6 Dec 2018
Edited: Hatem Helal on 6 Dec 2018
How are your large binary files generated? It would be worth evaluating whether you can modify that tool/process to instead create a folder full of large files that together represent your large dataset. For example, a folder with 60 files of ~1 GB each can be trivially partitioned for parallel analysis. This is a widely used best practice for the storage/representation of large datasets and would let you comfortably analyze your data on your personal computer. A sketch of one way to do the split is shown below.
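For reference, here is a rough sketch of one way to split an existing monolithic file into ~1 GB pieces. It assumes, purely for illustration, that the data is a flat stream of doubles; the file and folder names are hypothetical, and the fread/fwrite calls would need to be adapted to the real custom format.

inFile       = 'bigFile.bin';     % hypothetical monolithic source file
outDir       = 'chunks';          % hypothetical output folder
valsPerChunk = 128e6;             % 128e6 doubles is roughly 1 GB per output file

if ~exist(outDir, 'dir'), mkdir(outDir); end
fidIn = fopen(inFile, 'r');
k = 0;
while ~feof(fidIn)
    block = fread(fidIn, valsPerChunk, 'double');   % stream ~1 GB at a time
    if isempty(block), break; end
    k = k + 1;
    fidOut = fopen(fullfile(outDir, sprintf('chunk_%04d.bin', k)), 'w');
    fwrite(fidOut, block, 'double');
    fclose(fidOut);
end
fclose(fidIn);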


Accepted Answer

Edric Ellis on 10 Jul 2017
In R2017a, fileDatastore is currently restricted to reading entire files at a time. This is a known limitation of the current implementation, and this is definitely something we hope to be able to address in a future release of MATLAB. For now, unfortunately the only workaround is to split your data into multiple files so that each file can be loaded without running out of memory. You can use a single fileDatastore instance with multiple data files, as shown in the first example on the fileDatastore reference page.
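A minimal sketch of that multi-file workaround, assuming the data has already been split into a folder of smaller files and that readMyFormat is a hypothetical wrapper around your existing custom read function (it reads one whole file per call and returns a numeric array):

ds = fileDatastore(fullfile('D:', 'data', 'chunks'), ...   % hypothetical folder of split files
                   'ReadFcn', @readMyFormat, ...
                   'FileExtensions', '.bin');
tt = tall(ds);                               % tall cell array, one cell per file
chunkMeans = cellfun(@(x) mean(x(:)), tt);   % example deferred per-file calculation
result = gather(chunkMeans);                 % evaluation happens only here

Because evaluation is deferred until gather, only one file's worth of data needs to be held in memory at a time.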
1 Comment
Anthony Barone on 11 Jul 2017
Edric,
I appreciate the answer, though I am admittedly disappointed by it. I do hope that this gets implemented in an upcoming release.
I had also considered splitting up the data, resaving it, and loading that, but to be honest I don't think it would be worthwhile. Part of this is the inconvenience of keeping a second copy of the datasets that is effectively useless outside of MATLAB (this isn't a huge issue for my current ~60 GB file, though this is a trial run; when in full production some of the datasets could easily be 10-20x this size). However, a larger part is my feeling that if something as fundamental as loading data can't be done without these kinds of modifications and workarounds, I can only assume this project has been put on the back burner and as such is really not ready for full production usage. I can't really see relying on the hope that there wouldn't be any more issues, and by the time I am able to experiment with it further in my free time to verify this, I imagine R2017b will already be out.
At any rate, I very much appreciate the definitive answer. I will keep an eye out in future releases to see if this feature has matured a bit more.


More Answers (1)

Hatem Helal on 6 Dec 2018
I think this problem could be nicely solved by implementing a custom datastore.
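For later readers: from R2017b onward you can subclass matlab.io.Datastore and decide yourself how much data each call to read returns. Below is a minimal sketch, assuming for illustration only that the file is a flat stream of doubles read in fixed-size chunks; the class name, properties, and chunk size are all hypothetical, and the read method would need to be adapted to the real custom format.

classdef ChunkedBinaryDatastore < matlab.io.Datastore
    % Minimal chunked datastore over a single large binary file of doubles.
    properties
        FileName      % path to the large binary file
        ChunkSize     % number of values returned by each read() call
    end
    properties (Access = private)
        Offset = 0    % current position in the file, in bytes
        FileBytes     % total size of the file, in bytes
    end

    methods
        function ds = ChunkedBinaryDatastore(fileName, chunkSize)
            ds.FileName  = fileName;
            ds.ChunkSize = chunkSize;
            s = dir(fileName);
            ds.FileBytes = s.bytes;
            reset(ds);
        end

        function tf = hasdata(ds)
            % More data remains while the offset is before end-of-file.
            tf = ds.Offset < ds.FileBytes;
        end

        function [data, info] = read(ds)
            % Read only ChunkSize values per call so memory use stays bounded.
            fid = fopen(ds.FileName, 'r');
            fseek(fid, ds.Offset, 'bof');
            data = fread(fid, ds.ChunkSize, 'double');
            ds.Offset = ftell(fid);
            fclose(fid);
            info.FileName = ds.FileName;
            info.Offset   = ds.Offset;
        end

        function reset(ds)
            % Start over from the beginning of the file.
            ds.Offset = 0;
        end
    end

    methods (Hidden = true)
        function frac = progress(ds)
            % Fraction of the file consumed so far.
            frac = ds.Offset / ds.FileBytes;
        end
    end
end

You could then build the tall array with something like t = tall(ChunkedBinaryDatastore('bigFile.bin', 1e6)); for parallel evaluation the class would additionally inherit from matlab.io.datastore.Partitionable.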
