Transfer Data to or from a Cloud Cluster

Transfer Data To Amazon S3 Buckets

To work with data in the cloud, you can upload your data to Amazon S3 and then use datastores to access the data in S3 from the workers in your cluster.

  1. For efficient file transfers to and from Amazon S3, download and install the AWS Command Line Interface tool from https://aws.amazon.com/cli/.

  2. Specify your AWS Access Key ID, Secret Access Key, and Region of the bucket as system environment variables.

    • For example, on Linux, macOS, or Unix:

      export AWS_ACCESS_KEY_ID="YOUR_AWS_ACCESS_KEY_ID"
      export AWS_SECRET_ACCESS_KEY="YOUR_AWS_SECRET_ACCESS_KEY" 
      export AWS_REGION="us-east-1"
      

    • On Windows:

      set AWS_ACCESS_KEY_ID=YOUR_AWS_ACCESS_KEY_ID
      set AWS_SECRET_ACCESS_KEY=YOUR_AWS_SECRET_ACCESS_KEY
      set AWS_REGION=us-east-1
      

      To set these environment variables permanently, define them in your user or system environment settings.

  3. Create a bucket for your data. Either use the AWS S3 web page or a command like the following:

    aws s3 mb s3://mynewbucket

  4. Upload your data using a command like the following:

    aws s3 cp mylocaldatapath s3://mynewbucket --recursive

    For example:

    aws s3 cp path/to/cifar10/in/the/local/machine s3://MyExampleCloudData/cifar10/ --recursive

  5. After creating a cloud cluster, to copy your AWS credentials to your cluster workers, in MATLAB, select Parallel > Manage Cluster Profiles. In the Cluster Profile Manager, select your cloud cluster profile. Scroll to the EnvironmentVariables property and add AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION.
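
    As a programmatic alternative to the Cluster Profile Manager, you can list these variables in the EnvironmentVariables property of the cluster object. The following is a minimal sketch; it assumes the EnvironmentVariables property shown in the profile is also settable on the cluster object, and 'MyAWSCluster' is a placeholder for your cloud cluster profile name.

    % Copy the AWS credential variables from the client environment to the workers.
    c = parcluster('MyAWSCluster');                 % placeholder profile name
    c.EnvironmentVariables = {'AWS_ACCESS_KEY_ID', ...
        'AWS_SECRET_ACCESS_KEY', 'AWS_REGION'};
    saveProfile(c)                                  % save the change to the cluster profile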

After you store your data in Amazon S3, you can use datastores to access the data from your cluster workers. Simply create a datastore pointing to the URL of the S3 bucket. For example, the following sample code shows how to use an imageDatastore to access an S3 bucket. Replace 's3://MyExampleCloudData/cifar10' with the URL of your S3 bucket.

imds = imageDatastore('s3://MyExampleCloudData/cifar10', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

You can use an imageDatastore to read data from the cloud in MATLAB on your desktop client or in code running on your cluster workers, without changing your code. For details, see Work with Remote Data (MATLAB).
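
For example, the same datastore calls can run on the client or on the workers. The following is a minimal sketch, assuming the imds created above and an open parallel pool on your cloud cluster:

% Read one image on the client, then read images on the workers in a parfor
% loop; the datastore code is identical in both places.
img = readimage(imds,1);
imgs = cell(1,8);
parfor k = 1:8
    imgs{k} = readimage(imds,k);   % each worker reads directly from Amazon S3
end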

For a step-by-step example showing deep learning using data stored in Amazon S3, see the white paper Deep Learning with MATLAB and Multiple GPUs.

Copy Data from Amazon S3 Account to Your Cluster

Tip

If you have data stored in an Amazon S3 bucket, then you can use datastores in MATLAB to directly access the data without needing any storage on the cluster. For details, see Transfer Data To Amazon S3 Buckets. You can also select files to add when creating your cluster.

To transfer individual files from an Amazon S3 bucket to the cluster machines, on the Create Cluster screen, next to Amazon S3 Data, click Add Files. Specify which files you want to make available to your cluster machines. You can specify S3 files only when creating your cluster and starting it for the first time.

When the cluster starts up, before the mjs process starts, the specified files are copied to /shared/imported on the cluster’s shared file system. See Cluster File System and Storage. If any of the files are in gzip, tar, or zip format, they are automatically expanded in /shared/imported.

Note

Transferring a large amount of data from your Amazon S3 account can cause the cluster to time out during startup. If your data size exceeds approximately 5 GB, start your cluster without the S3 data transfer, then upload the necessary data to the cluster’s /shared/persisted folder from a local drive, as described in either Transfer Data with Standard Utilities or Transfer Data with the remotecopy Utility.

Transfer Data with Job Methods and Properties

To transfer data to the cloud cluster, you can use the AttachedFiles or JobData properties, in the same way you use them for other clusters. For example:

  1. Place all required executable and data files in the same folder.

  2. Specify that folder in the AttachedFiles property of the job.

    When you submit your job, the files are transferred to the cloud and made available to the workers running on the cloud cluster.

Data stored in job and task properties is available to the client, so your task or batch function results are accessible through the fetchOutputs function of the finished job or the OutputArguments property of its tasks. For batch jobs that run on the cloud, you can access the job’s workspace variables with the load function in your client session.
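
For example, the following is a minimal sketch of a batch job that attaches a local folder and retrieves results in the client session. The profile name 'MyAWSCluster', the script myAnalysisScript, and the folder path are placeholders:

c = parcluster('MyAWSCluster');
% Attach the folder that holds the script's data files; its contents are
% copied to the cloud workers when the job is submitted.
job = batch(c, 'myAnalysisScript', ...
    'AttachedFiles', {'/home/cloudtmp/myjobfiles'});
wait(job)
load(job)   % copy the script's workspace variables into the client workspace
% For function-based jobs, use fetchOutputs(job) to retrieve output arguments.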

Download SSH Key Identity File

Cloud Center lists an SSH key for each cluster, for non-root user access. Follow these steps to download a cluster’s SSH key identity file:

  1. In Cloud Center, click My Clusters.

  2. In the list of your clusters, click the cluster whose key you want to download.

  3. In the Cluster Summary display, click More Details to expand the display.

  4. The SSH Keys information field contains a hyperlink labeled User Access. Click this link to download and save the key (.pem) file to your local client machine.

You can use your saved .pem file for SSH or other access to the cloud machines for transferring data, as described in Transfer Data with Standard Utilities and Transfer Data with the remotecopy Utility.

Note that the only key available here is for user access (user name clouduser), not for root access. A root access key (user name ubuntu) is provided only when you create a new cluster. If you require root access to a cluster but do not have its root private key, you can create a new cluster using another key that you do have access to, or create a new key according to the SSH key name description in Create a Cloud Cluster.

Transfer Data with Standard Utilities

In these examples, suppose you want to transfer the file /home/cloudtmp/emem.mat to the folder /shared/persisted on the head node of your cloud cluster. Instead of providing passwords, you use an SSH key identity file, which is the private key file you download for a Cloud Center cluster as described in Download SSH Key Identity File.

This section highlights only a few of the many file transfer utilities that are available.

SFTP

The sftp utility is an interactive command-line interface, similar to ftp, that lets you connect to a remote host, navigate its file system, and transfer files. The following example shows how to use sftp at a UNIX command prompt:

cd /home/cloudtmp
sftp -i /home/.ssh/your-key.pem \
  clouduser@ec2-67-202-5-207.compute-1.amazonaws.com:/shared/persisted
sftp> put emem.mat
sftp> ls
emem.mat
sftp> exit

For more information about the sftp utility, use the following commands:

sftp -help
man sftp

SCP

The scp utility lets you access the remote host and transfer the file in a single command. This example shows the UNIX version of the command:

scp -i /home/.ssh/your-key.pem emem.mat \
  clouduser@ec2-67-202-5-207.compute-1.amazonaws.com:/shared/persisted

For more information about the scp utility, use the following commands:

scp -help
man scp

FileZilla

FileZilla is a GUI utility that lets you connect to the cloud cluster head node and transfer files with an easy drag-and-drop technique. This example shows how to transfer the local file C:\cloudtmp\emem.mat to the folder /shared/persisted on your cloud cluster.

  1. Start FileZilla, and set its Local site to the folder you want to transfer your local file from (or to).

  2. To connect FileZilla to your cloud cluster file system, specify the host that is the head node of your cloud cluster. The user name is always clouduser. Use port 22 for SFTP connections.

  3. Do not provide a password, but instead provide your SSH key identity file under Edit > Settings. In the Select pane of the Settings dialog box, choose SFTP. In the Public Key Authentication pane, click Add keyfile. Navigate to the key file that you downloaded from the Cloud Center for this cluster. (Note: On Windows, the .pem format key file you download from Cloud Center is not directly compatible with FileZilla, but when you select that key file, FileZilla can automatically convert the format for you.) When the key file appears in the list, click OK to dismiss the Settings dialog box.

  4. When FileZilla is configured with the proper key file, click Quickconnect.

  5. After connecting, set the Remote site path to /shared/persisted.

  6. Now drag the file emem.mat from the local column to the remote column. That completes the transfer.

Transfer Data with the remotecopy Utility

You can transfer data between your client file system and your cloud cluster with the remotecopy utility provided with Parallel Computing Toolbox™, located at:

matlabroot/toolbox/distcomp/bin/remotecopy

The remotecopy utility uses an identity file instead of passwords. This is the private SSH key file you download for a cluster from Cloud Center as described in Download SSH Key Identity File.

Transfer Data to the Cloud

This example shows how to copy the file /home/cloudtmp/emem.mat from a local UNIX machine to a cloud cluster machine:

  1. Navigate to the location of the remotecopy utility, and run the command as shown.

    cd /matlabinstall/toolbox/distcomp/bin
    ./remotecopy -local /home/cloudtmp/emem.mat \
        -to -remote /shared/persisted/emem.mat \
        -remotehost ec2-107-21-71-51.compute-1.amazonaws.com \
        -protocol scp -username clouduser -identityfile /home/.ssh/your-key.pem \
        -passphrase ""

    (For Windows, use appropriate slashes, path names, and ^ to indicate continuation of the command on multiple lines. For other options or information about mixed platforms, see remotecopy -help.)

    The -remotehost name is available in Cloud Center under the details for the head node of a running cluster.

  2. With the data files in place on the cloud cluster machines, you can specify their location in the job’s AdditionalPaths property to provide access to them for the MATLAB workers.
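
    For example, the following is a minimal sketch; the profile name 'MyAWSCluster' and the function myAnalysis are placeholders:

    c = parcluster('MyAWSCluster');
    % Add /shared/persisted to the workers' MATLAB path so they can find the
    % transferred files, for example with load('emem.mat').
    job = batch(c, @myAnalysis, 1, {}, ...
        'AdditionalPaths', {'/shared/persisted'});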

Retrieve Data from the Cloud

This example shows how to copy the file /shared/persisted/emem.mat from a cloud cluster machine to a local UNIX machine as /home/cloudtmp/return_emem.mat.

cd /matlabinstall/toolbox/distcomp/bin
./remotecopy -local /home/cloudtmp/return_emem.mat \
    -from -remote /shared/persisted/emem.mat \
    -remotehost ec2-107-21-71-51.compute-1.amazonaws.com \
    -protocol scp -username clouduser -identityfile /home/.ssh/your-key.pem \
    -passphrase ""

(For Windows, use appropriate slashes, path names, and ^ to indicate continuation of the command on multiple lines. For other options or information about mixed platforms, see remotecopy -help.)

The -remotehost name is available in Cloud Center under the details for the head node of a running cluster.

Retrieve Data from Persisted Storage Without Starting a Cluster

This procedure describes how to retrieve your persisted data from Amazon EC2®, without starting a cluster to access /shared/persisted. The major steps are described in the following subtopics:

Find Persisted Storage Resources in AWS

  1. Log in to the AWS® Management Console and access your Amazon EC2 Dashboard.

  2. On the right side of the tool bar at the top of the page, select the Region that your cluster is located in.

  3. In the left-side navigation pane, select Elastic Block Store > Snapshots.

  4. Search for your snapshot:

    • In the Filter list, select Owned By Me.

    • In the Search Snapshots field, enter your cluster name from the Cloud Center.

    • Sort the Started column in descending order.

  5. In the lower half of the page, review the Tags for the top result in the list, and verify that the ClusterInfo value has the correct cluster name. For example, the result when your cluster name is MyR12b might look like this:

    MyR12b / first.last__AT__company.com / 4006224 
  6. Select the snapshot with the correct ClusterInfo value and the most recent Started value. In the Description tab, copy the Snapshot ID (for example, snap-20cd6642) and note its Capacity value.

Launch Instance or Attach Volume to Existing Instance

Select one of these two options:

Option 1: Launch Ubuntu Instance

  1. On the EC2 Dashboard, click Launch Instance.

    For the next several steps, navigate using the numbered tabs at the top of the page.

  2. On the Choose an Amazon Machine Image (AMI) tab, choose an Ubuntu AMI.

  3. On the Choose an Instance Type tab, select the hardware configuration and size of the instance to launch. Larger instance types have more CPU and memory. To minimize cost, select the t2.micro instance type if you are using VPC.

  4. On the Add Storage tab:

    • Click Add New Volume.

    • In the Type list, select EBS for Amazon Elastic Block Store.

    • In the Device list, select one of /dev/sd[f-p].

      For Linux®/UNIX® instances, recommended device names are /dev/sdf through /dev/sdp.

    • In the Snapshot field, enter the snapshot ID you copied earlier; for example, snap-20cd6642.

    • In the Size field, enter a value equal to the size of the snapshot; for example, 100 GiB.

  5. (optional) On the Tag Instance tab, give the instance a Name value so you can more easily find the instance in the Amazon Management Console.

  6. On the Configure Security Group tab, use a security group to define firewall rules for your instance. These rules specify which incoming network traffic is delivered to your instance. All other traffic is ignored.

    • In the Type list, select SSH.

    • In the Source list, select My IP.

  7. On the Review Instance Launch tab, check the details of your instance, and make any necessary changes by clicking the appropriate Edit link. When all settings are correct, click Launch.

  8. In the Select an existing key pair or create a new key pair dialog box, make your selection. For example, select Choose an existing key pair, and then in the Select the key pair list, choose a key pair that you have access to. This is the key pair you will use later to connect to the instance for mounting the volume and transferring data.

  9. After you launch the instance, wait for the instance state to become Running. You can view this information in the EC2 Dashboard by navigating to Instances > Instances.

For more information on Instance Types, see
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html.

For more information on AWS Block Device Mapping, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

Option 2: Attach Volume to Existing Ubuntu Instance

The Amazon Elastic Block Store (EBS) volume and the instance must be located in the same Availability Zone.

  1. In the left-side navigation pane of the EC2 Dashboard, select Elastic Block Store > Snapshots.

  2. Select your snapshot.

  3. Create a volume from your snapshot:

    1. Click Actions > Create Volume.

    2. Set the Availability Zone to match that of your instance. You can accept the defaults for the other settings.

    3. A confirmation indicates that the volume was successfully created. Note the volume ID (for example, vol-8a9d6642).

    Wait until the state of your volume is Available.

  4. In the left-side navigation pane of the EC2 Dashboard, select Elastic Block Store > Volumes.

  5. Select the volume you created in step 3.

  6. Click Actions > Attach Volume.

  7. In the Attach Volume dialog box:

    • In the Instance field, enter the ID of the instance to attach the volume to.

    • In the Device field, enter something in the range of /dev/sd[f-p]. For Linux/UNIX instances, recommended device names are /dev/sdf through /dev/sdp.

    • Click Attach to attach the volume to the instance.

For more information on AWS Block Device Mapping, see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html

For more information on EBS volumes, see
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-restoring-volume.html.

Mount Volume for Access

Make the volume available for use from the instance.

  1. Connect to your instance using SSH.

  2. Use the lsblk command to view your available disk devices and their mount points, which helps you determine the correct device name to use (most likely /dev/xvdf). Note: Do not create a new file system on the volume; it already contains your data.

  3. Create a mount point directory for the volume. The mount point is where the volume is located in the file system tree and where you read and write files after you mount the volume. Substitute a location for mount_point, such as /data.

    $ sudo mkdir mount_point
  4. Use the following command to mount the volume at the location you just created.

    $ sudo mount device_name mount_point
    

    For example,

    $ sudo mount /dev/xvdf /data 
    

For more information on using EBS volumes, see
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html.

Transfer Data

You can now transfer data between the mounted volume and your local drive, as described in either Transfer Data with Standard Utilities or Transfer Data with the remotecopy Utility.