Set Up Datastore for Processing on Different Machines or Clusters
You can create and save a datastore on one platform that loads and works seamlessly on a different platform by setting the 'AlternateFileSystemRoots' property of the datastore. Use this property when:
You create a datastore on a local machine, and need to access and process the data on another machine (possibly running a different operating system).
You process your datastore with parallel and distributed computing involving different platforms, cloud or cluster machines.
This example demonstrates the use of the 'AlternateFileSystemRoots' property for TabularTextDatastore. However, you can use the same syntax for any of these datastores: SpreadsheetDatastore, ImageDatastore, ParquetDatastore, FileDatastore, KeyValueDatastore, and TallDatastore. To use the 'AlternateFileSystemRoots' functionality for custom datastores, see matlab.io.datastore.DsFileSet and Develop Custom Datastore.
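As a sketch of that shared syntax, the same name-value pair might be passed to imageDatastore. The Z:\Images and /nfs-bldg001/Images roots are hypothetical and stand in for wherever your image files live on each platform.

```matlab
% Hypothetical root paths: replace with the locations of your own image files.
altRoots = ["Z:\Images","/nfs-bldg001/Images"];

% Same name-value syntax as for a tabularTextDatastore.
imds = imageDatastore('Z:\Images','AlternateFileSystemRoots',altRoots);
```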
Save Datastore and Load on Different File System Platform
Create a datastore on one file system that loads and works seamlessly on a different machine (possibly running a different operating system). For example, create a datastore on a Windows® machine, save it, and then load it on a Linux® machine.
First, before you create and save the datastore, identify the root paths for your data on the different platforms. The root paths will differ based on the machine or file system. For instance, if you have data on your local machine and a copy of the data on a cluster, then get the root paths for accessing the data:
"Z:\DataSet" for your local Windows machine.
"/nfs-bldg001/DataSet" for your Linux cluster.
Then, associate these root paths by using the 'AlternateFileSystemRoots' parameter of the datastore.
altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = tabularTextDatastore('Z:\DataSet','AlternateFileSystemRoots',altRoots);
Examine the Files property of the datastore. In this instance, the Files property contains the location of your data as accessed by your Windows machine.
ds.Files
ans = 5×1 cell array
    {'Z:\DataSet\datafile01.csv'}
    {'Z:\DataSet\datafile02.csv'}
    {'Z:\DataSet\datafile03.csv'}
    {'Z:\DataSet\datafile04.csv'}
    {'Z:\DataSet\datafile05.csv'}
Save the datastore.
save ds_saved_on_Windows.mat ds
Load the datastore on the Linux machine and examine the Files property. Because the root path 'Z:\DataSet' is not accessible on the Linux cluster, at load time the datastore function automatically updates the root paths based on the values specified in the 'AlternateFileSystemRoots' parameter. The Files property of the datastore now contains the updated root paths for your data on the Linux cluster.
load ds_saved_on_Windows.mat
ds.Files
ans = 5×1 cell array
    {'/nfs-bldg001/DataSet/datafile01.csv'}
    {'/nfs-bldg001/DataSet/datafile02.csv'}
    {'/nfs-bldg001/DataSet/datafile03.csv'}
    {'/nfs-bldg001/DataSet/datafile04.csv'}
    {'/nfs-bldg001/DataSet/datafile05.csv'}
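To make the effect of the root substitution concrete, the following sketch reproduces the before and after file lists with plain string operations. It only illustrates the mapping between the two roots and is not how the datastore implements it internally. The variable names are illustrative.

```matlab
% Root paths taken from the example above.
winRoot   = "Z:\DataSet";
linuxRoot = "/nfs-bldg001/DataSet";

% Build the five Windows file paths (5-by-1 string array).
winFiles = winRoot + "\datafile0" + (1:5)' + ".csv";

% Swap the Windows root for the Linux root, then convert the
% remaining backslash separators to forward slashes.
linuxFiles = replace(winFiles, winRoot, linuxRoot);
linuxFiles = replace(linuxFiles, "\", "/");

disp(linuxFiles)
```

The resulting strings match the paths that the Files property reports after the datastore is loaded on the Linux cluster.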
Process Datastore Using Parallel and Distributed Computing
To process your datastore with parallel and distributed computing that involves different platforms, cloud or cluster machines, you must predefine the 'AlternateFileSystemRoots' parameter. This example demonstrates how to create a datastore on your local machine, analyze a small portion of the data, and then use Parallel Computing Toolbox™ and MATLAB® Parallel Server™ to scale up the analysis to the entire dataset.
Create a datastore and assign a value to the 'AlternateFileSystemRoots' property. To set the value for the 'AlternateFileSystemRoots' property, identify the root paths for your data on the different platforms. The root paths differ based on the machine or file system. For example, identify the root paths for data access from your machine and your cluster:
"Z:\DataSet" from your local Windows machine.
"/nfs-bldg001/DataSet" from the MATLAB Parallel Server Linux cluster.
Then, associate these root paths using the 'AlternateFileSystemRoots' property.
altRoots = ["Z:\DataSet","/nfs-bldg001/DataSet"];
ds = tabularTextDatastore('Z:\DataSet','AlternateFileSystemRoots',altRoots);
Analyze a small portion of the data on your local machine. For instance, get a partitioned subset of the data, clean the data by removing any missing entries, and examine a plot of the variables.
tt = tall(partition(ds,100,1));
summary(tt);
% analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
Scale up your analysis to the entire dataset by using a MATLAB Parallel Server cluster (Linux cluster). For instance, start a worker pool using the cluster profile, and then perform analysis on the entire dataset by using parallel and distributed computing capabilities.
parpool('MyMjsProfile')
tt = tall(ds);
summary(tt);
% analyze your data
tt = rmmissing(tt);
plot(tt.MyVar1,tt.MyVar2)
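Operations on tall arrays are deferred until a result is requested, so a numeric summary computed on the cluster is typically collected with gather. The following sketch assumes that tt is the tall table created from ds and that MyVar1, the placeholder variable name used in this example, is numeric.

```matlab
% Assumes tt = tall(ds) and tt = rmmissing(tt) have already run,
% and that MyVar1 is a numeric variable in the data.
avgVar1 = mean(tt.MyVar1);   % deferred: no data has been read yet
avgVar1 = gather(avgVar1);   % triggers evaluation on the worker pool
```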
See Also
datastore | TabularTextDatastore | SpreadsheetDatastore | ImageDatastore | FileDatastore | KeyValueDatastore | TallDatastore