Develop Custom Datastore
This topic shows how to implement a custom datastore for file-based data. Use this framework only when writing your own custom datastore interface. Otherwise, for standard file formats, such as images or spreadsheets, use an existing datastore from MATLAB®. For more information, see Getting Started with Datastore.
Overview
To build your custom datastore interface, use the custom datastore classes and objects.
Then, use the custom datastore to bring your data into MATLAB and leverage the MATLAB big data capabilities such as tall, mapreduce, and Hadoop®.
Designing your custom datastore involves inheriting from one or more abstract classes and implementing the required methods. The specific classes and methods you need depend on your processing needs.
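Processing Needs | Classes |
---|---|
Datastore for serial processing in MATLAB | matlab.io.Datastore |
Datastore with support for Parallel Computing Toolbox™ and MATLAB Parallel Server™ | matlab.io.Datastore and matlab.io.datastore.Partitionable |
Datastore with support for Hadoop | matlab.io.Datastore and matlab.io.datastore.HadoopLocationBased |
Datastore with support for shuffling samples in a datastore in random order | matlab.io.Datastore and matlab.io.datastore.Shuffleable |
Datastore with support for writing files via writeall | matlab.io.Datastore and matlab.io.datastore.FileWritable (optionally, inheriting from matlab.io.datastore.FoldersPropertyProvider adds support for a Folders property) |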
Start by implementing the datastore for serial processing, and then add support for parallel processing, Hadoop, shuffling, or writing as needed.
Implement Datastore for Serial Processing
To implement a custom datastore named MyDatastore, create a class definition file MyDatastore.m. The file must be on the MATLAB path and should contain code that inherits from the appropriate class and defines the required methods. The code for creating a datastore for serial processing in MATLAB must:
- Inherit from the base class matlab.io.Datastore.
- Define these methods: hasdata, read, reset, and progress.
- Define additional properties and methods based on your data processing and analysis needs.
For a sample implementation, follow these steps.
Inherit from the base class matlab.io.Datastore and declare the private properties that track the files and the read position.

    classdef MyDatastore < matlab.io.Datastore

        properties (Access = private)
            CurrentFileIndex double
            FileSet matlab.io.datastore.DsFileSet
        end

Add the AlternateFileSystemRoots property to create a datastore on one machine that works seamlessly on another machine or cluster that possibly has a different file system or operating system. Add methods to get and set this property in the methods section.

        % Property to support saving, loading, and processing of
        % datastore on different file system machines or clusters.
        % In addition, define the methods get.AlternateFileSystemRoots()
        % and set.AlternateFileSystemRoots() in the methods section.
        properties (Dependent)
            AlternateFileSystemRoots
        end

Implement the function MyDatastore, the constructor that creates the custom datastore.

        methods % begin methods section

            function myds = MyDatastore(location,altRoots)
                myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                    'FileExtensions','.bin', ...
                    'FileSplitSize',8*1024);
                myds.CurrentFileIndex = 1;

                if nargin == 2
                    myds.AlternateFileSystemRoots = altRoots;
                end

                reset(myds);
            end

Implement the hasdata method.

            function tf = hasdata(myds)
                % Return true if more data is available.
                tf = hasfile(myds.FileSet);
            end

Implement the read method. This method uses MyFileReader, a function that you must create to read your proprietary file format. See Create Function to Read Your Proprietary File Format.

            function [data,info] = read(myds)
                % Read data and information about the extracted data.
                if ~hasdata(myds)
                    error(sprintf(['No more data to read.\nUse the reset ',...
                        'method to reset the datastore to the start of ' ,...
                        'the data. \nBefore calling the read method, ',...
                        'check if data is available to read ',...
                        'by using the hasdata method.']))
                end

                fileInfoTbl = nextfile(myds.FileSet);
                data = MyFileReader(fileInfoTbl);
                info.Size = size(data);
                info.FileName = fileInfoTbl.FileName;
                info.Offset = fileInfoTbl.Offset;

                % Update CurrentFileIndex for tracking progress
                if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                        fileInfoTbl.FileSize
                    myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
                end
            end

Implement the reset method.

            function reset(myds)
                % Reset to the start of the data.
                reset(myds.FileSet);
                myds.CurrentFileIndex = 1;
            end

Define the methods to get and set the AlternateFileSystemRoots property. You must reset the datastore in the set method.

            % Before defining these methods, add the AlternateFileSystemRoots
            % property in the properties section

            % Getter for AlternateFileSystemRoots property
            function altRoots = get.AlternateFileSystemRoots(myds)
                altRoots = myds.FileSet.AlternateFileSystemRoots;
            end

            % Setter for AlternateFileSystemRoots property
            function set.AlternateFileSystemRoots(myds,altRoots)
                try
                    % The DsFileSet object manages the AlternateFileSystemRoots
                    % for your datastore
                    myds.FileSet.AlternateFileSystemRoots = altRoots;

                    % Reset the datastore
                    reset(myds);
                catch ME
                    throw(ME);
                end
            end

        end % end methods section

Implement the progress method.

        methods (Hidden = true)
            function frac = progress(myds)
                % Determine percentage of data read from datastore
                if hasdata(myds)
                    frac = (myds.CurrentFileIndex-1)/...
                        myds.FileSet.NumFiles;
                else
                    frac = 1;
                end
            end
        end

Implement the copyElement method. This method is required when you use the DsFileSet object as a property of your datastore.

        methods (Access = protected)
            % If you use the DsFileSet object as a property, then
            % you must define the copyElement method. The copyElement
            % method allows methods such as readall and preview to
            % remain stateless
            function dscopy = copyElement(ds)
                dscopy = copyElement@matlab.mixin.Copyable(ds);
                dscopy.FileSet = copy(ds.FileSet);
            end
        end

End the classdef section.

    end
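Once MyDatastore.m and the MyFileReader function described in the next section are both on the MATLAB path, the serial datastore can be exercised with a short read loop. The following is a minimal sketch under those assumptions, not a definitive test. The folder name MyDatastoreDemo, the file names, and the file sizes are hypothetical, chosen only so that each file is larger than one 8 KB split.

```matlab
% Create a few hypothetical .bin files to read (names and sizes are
% illustrative assumptions, not part of the datastore framework).
dataFolder = fullfile(tempdir,'MyDatastoreDemo');
if ~isfolder(dataFolder)
    mkdir(dataFolder);
end
for k = 1:3
    fid = fopen(fullfile(dataFolder,sprintf('sample%d.bin',k)),'w');
    fwrite(fid,uint8(randi([0 255],1,20000)),'uint8');
    fclose(fid);
end

% Construct the custom datastore. This assumes MyDatastore.m and
% MyFileReader.m are on the MATLAB path.
ds = MyDatastore(dataFolder);

% Read one split at a time until the datastore is exhausted.
totalBytes = 0;
while hasdata(ds)
    [data,info] = read(ds);
    totalBytes = totalBytes + numel(data);
    fprintf('Read %d elements from %s at offset %d\n', ...
        numel(data),char(info.FileName),info.Offset);
end

% Return to the start of the data before reading again.
reset(ds);
```

Checking hasdata before every read avoids the error that the read method raises when no data remains, and calling reset makes the datastore reusable for a second pass.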
Create Function to Read Your Proprietary File Format
The implementation of the read method of your custom datastore uses a function called MyFileReader. You must create this function to read your custom or proprietary data. Build this function using a DsFileReader object and its methods. For instance, create a function that reads binary files.

    function data = MyFileReader(fileInfoTbl)
        % create a reader object using the FileName
        reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

        % seek to the offset
        seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

        % read fileInfoTbl.SplitSize amount of data
        data = read(reader,fileInfoTbl.SplitSize);
    end
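To see what the seek and read calls in MyFileReader do in isolation, it can help to run them against a small file with known contents. The sketch below is illustrative only: the file name dsFileReaderDemo.bin, the offset of 10, and the read size of 5 are arbitrary assumptions, and the read relies on the default settings of DsFileReader, which return the requested number of bytes.

```matlab
% Write 100 known bytes (values 0 through 99) to a hypothetical file.
fileName = fullfile(tempdir,'dsFileReaderDemo.bin');
fid = fopen(fileName,'w');
fwrite(fid,uint8(0:99),'uint8');
fclose(fid);

% Create a reader, move 10 bytes past the start of the file,
% and read the next 5 bytes.
reader = matlab.io.datastore.DsFileReader(fileName);
seek(reader,10,'Origin','start-of-file');
chunk = read(reader,5);

% Show the bytes that were read and their class.
disp(class(chunk))
disp(double(chunk(:))')
```

In MyFileReader, the offset and size come from the table returned by nextfile, so each call reads exactly one split of one file.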
Add Support for Parallel Processing
To add support for parallel processing with Parallel Computing Toolbox and MATLAB Parallel Server, update your implementation code in MyDatastore.m to:
- Inherit from an additional class matlab.io.datastore.Partitionable.
- Define two additional methods: maxpartitions and partition.
For a sample implementation, follow these steps.
Update the classdef section to inherit from the Partitionable class.

    classdef MyDatastore < matlab.io.Datastore & ...
            matlab.io.datastore.Partitionable
        . . .

Add the definition for partition to the methods section.

        methods
            . . .
            function subds = partition(myds,n,ii)
                subds = copy(myds);
                subds.FileSet = partition(myds.FileSet,n,ii);
                reset(subds);
            end
        end

Add the definition for maxpartitions to the methods section.

        methods (Access = protected)
            function n = maxpartitions(myds)
                n = maxpartitions(myds.FileSet);
            end
        end

End the classdef section.

    end
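Both partition and maxpartitions in the sample delegate to the DsFileSet object, so the partitioning behavior can be explored on a DsFileSet directly, without the custom class. The sketch below assumes a hypothetical folder DsFileSetPartitionDemo with three generated .bin files. It counts how many splits each partition yields and makes no claim about the exact numbers, which depend on how DsFileSet divides the files.

```matlab
% Generate three hypothetical 20000-byte files (illustrative only).
dataFolder = fullfile(tempdir,'DsFileSetPartitionDemo');
if ~isfolder(dataFolder)
    mkdir(dataFolder);
end
for k = 1:3
    fid = fopen(fullfile(dataFolder,sprintf('sample%d.bin',k)),'w');
    fwrite(fid,zeros(1,20000,'uint8'),'uint8');
    fclose(fid);
end

% Build the same kind of file set that MyDatastore uses internally.
fs = matlab.io.datastore.DsFileSet(dataFolder, ...
    'FileExtensions','.bin', ...
    'FileSplitSize',8*1024);

% Ask for the maximum number of partitions, then walk each partition
% and count the splits it contains.
n = maxpartitions(fs);
splitsPerPartition = zeros(1,n);
for ii = 1:n
    subfs = partition(fs,n,ii);
    reset(subfs);
    while hasfile(subfs)
        splitInfo = nextfile(subfs);
        splitsPerPartition(ii) = splitsPerPartition(ii) + 1;
    end
end
disp(splitsPerPartition)
```

Because each partition is an independent file set, the custom partition method can copy the datastore, swap in the partitioned file set, and reset the copy, which is what the sample implementation does.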
Add Support for Hadoop
To add support for Hadoop, update your implementation code in MyDatastore.m to:
- Inherit from an additional class matlab.io.datastore.HadoopLocationBased.
- Define two additional methods: getLocation and initializeDatastore.
For a sample implementation, follow these steps.
Update the classdef section to inherit from the HadoopLocationBased class.

    classdef MyDatastore < matlab.io.Datastore & ...
            matlab.io.datastore.HadoopLocationBased
        . . .

Add the definitions for initializeDatastore, getLocation, and, optionally, isfullfile to the methods section.

        methods (Hidden = true)
            . . .
            function initializeDatastore(myds,hadoopInfo)
                import matlab.io.datastore.DsFileSet;
                myds.FileSet = DsFileSet(hadoopInfo,...
                    'FileSplitSize',myds.FileSet.FileSplitSize);
                reset(myds);
            end

            function loc = getLocation(myds)
                loc = myds.FileSet;
            end

            % isfullfile method is optional
            function tf = isfullfile(myds)
                tf = isequal(myds.FileSet.FileSplitSize,'file');
            end
        end

End the classdef section.

    end
Add Support for Shuffling
To add support for shuffling, update your implementation code in MyDatastore.m to:
- Inherit from an additional class matlab.io.datastore.Shuffleable.
- Define the additional method shuffle.
For a sample implementation, follow these steps.
Update the classdef section to inherit from the Shuffleable class.

    classdef MyDatastore < matlab.io.Datastore & ...
            matlab.io.datastore.Shuffleable
        . . .

Add the definition for shuffle to the existing methods section.

        methods
            % previously defined methods
            . . .
            function dsNew = shuffle(ds)
                % dsNew = shuffle(ds) shuffles the files and the
                % corresponding labels in the datastore.
                %
                % Note: this sample refers to the properties Datastore,
                % Labels, and NumObservations, which the MyDatastore
                % class defined earlier in this topic does not declare.
                % Adapt the property names to your own datastore.

                % Create a copy of datastore
                dsNew = copy(ds);
                dsNew.Datastore = copy(ds.Datastore);
                fds = dsNew.Datastore;

                % Shuffle files and corresponding labels
                numObservations = dsNew.NumObservations;
                idx = randperm(numObservations);
                fds.Files = fds.Files(idx);
                dsNew.Labels = dsNew.Labels(idx);
            end
        end

End the classdef section.

    end
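The key idea in the shuffle sample is that one permutation from randperm is applied to both the files and the labels, so each file keeps its label. The following self-contained sketch shows that idea with plain arrays. The file names and labels are hypothetical placeholders and no datastore is involved.

```matlab
% Hypothetical files and their labels (illustrative values only).
files  = ["a.bin"; "b.bin"; "c.bin"; "d.bin"];
labels = ["cat"; "dog"; "cat"; "bird"];

% Draw a single random permutation and apply it to both arrays.
idx = randperm(numel(files));
shuffledFiles  = files(idx);
shuffledLabels = labels(idx);

% Undo the permutation to confirm that each file kept its label.
[~,inverseIdx] = sort(idx);
restoredFiles  = shuffledFiles(inverseIdx);
restoredLabels = shuffledLabels(inverseIdx);

disp(isequal(restoredFiles,files) && isequal(restoredLabels,labels))
```

Using two separate calls to randperm, one per array, would break the pairing between files and labels, which is why the sample computes idx once and reuses it.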
Add Support for Writing Data
To add support for writing data, update your implementation code in MyDatastore.m to follow these requirements:
- Inherit from an additional class matlab.io.datastore.FileWritable.
- Initialize the properties SupportedOutputFormats and DefaultOutputFormat.
- Implement a write method if the datastore writes data to a custom format.
- Implement a getFiles method if the datastore does not have a Files property.
- Implement a getFolders method if the datastore does not have a Folders property.
- The output location is validated as a string. If your datastore requires further validation, you must implement a validateOutputLocation method.
- If the datastore is meant for files that require multiple reads per file, then you must implement the methods getCurrentFilename and currentFileIndexComparator.
- Optionally, inherit from another class matlab.io.datastore.FoldersPropertyProvider to add support for a Folders property (and thus the FolderLayout name-value pair of writeall). If you do this, then you can use the populateFoldersFromLocation method in the datastore constructor to populate the Folders property.
- To add support for the 'UseParallel' option of writeall, you must subclass from both matlab.io.datastore.FileWritable and matlab.io.datastore.Partitionable and implement a partition method in the subclass that supports the syntax partition(ds,'Files',index).
For a sample implementation that inherits from matlab.io.datastore.FileWritable, follow these steps.
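Update the classdef section to inherit from the FileWritable class.

    classdef MyDatastore < matlab.io.Datastore & ...
            matlab.io.datastore.FileWritable
        . . .

Initialize the properties SupportedOutputFormats and DefaultOutputFormat. In this example, the datastore supports all the output formats of ImageDatastore plus the custom format "dcm", which is also the default output format.

        properties (Constant)
            SupportedOutputFormats = ...
                [matlab.io.datastore.ImageDatastore.SupportedOutputFormats, "dcm"];
            DefaultOutputFormat = "dcm";
        end

Add definitions for getFiles and getFolders. These methods are required when the datastore does not have Files or Folders properties.

        methods (Access = {?matlab.io.datastore.FileWritable, ...
                ?matlab.bigdata.internal.executor.FullfileDatastorePartitionStrategy})
            function files = getFiles(ds)
                % Placeholder list; the ellipsis stands for the remaining files
                files = {'data/folder/file1', 'data/folder/file2',...};
            end
        end

        methods (Access = protected)
            function folders = getFolders(ds)
                % Placeholder list; the ellipsis stands for the remaining folders
                folders = {'data/folder1/', 'data/folder2/',...};
            end
        end

Add a write method, because the datastore writes data to the custom format "dcm". This example uses a custom write function for "dcm" and falls back to the built-in write function for known formats.

        methods (Access = protected)
            function tf = write(myds, data, writeInfo, outFmt, varargin)
                if outFmt == "dcm"
                    % use custom write fcn for dcm format
                    dicomwrite(data, writeInfo.SuggestedOutputName, varargin{:});
                else
                    % callback into built-in for known formats
                    write@matlab.io.datastore.FileWritable(myds, data, ...
                        writeInfo, outFmt, varargin{:});
                end
                tf = true;
            end
        end

End the classdef section.

    end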
For a longer example class that inherits from both matlab.io.datastore.FileWritable and matlab.io.datastore.FoldersPropertyProvider, see Develop Custom Datastore for DICOM Data.
Validate Custom Datastore
After following the instructions presented here, the implementation step of your custom datastore is complete. Before using this custom datastore, qualify it using the guidelines presented in Testing Guidelines for Custom Datastores.
See Also
matlab.io.Datastore | matlab.io.datastore.Partitionable | matlab.io.datastore.HadoopLocationBased | matlab.io.datastore.Shuffleable | matlab.io.datastore.DsFileSet | matlab.io.datastore.DsFileReader | matlab.io.datastore.FoldersPropertyProvider | matlab.io.datastore.FileWritable