Skip the Cloud Data Transfers with Local Storage

trainML training jobs can now run on data directly from your local computer and upload their results back without using any cloud intermediary. If you already have the data set on your local computer and want to avoid the repetitive cycle of uploading and downloading from cloud storage, this storage type is for you.

How It Works

The Local option is now available from both the Input Data Type and Output Type fields. If this option is selected, specify the storage path as the location on your local computer you want the data to be copied from or to. The path must be specified as an absolute path (starting with /, like /home/username/data), a home directory relative path (starting with ~/, like ~/data), or an environment variable based path (starting with $, like $HOME/data, where the HOME environment variable on your local computer is set to /home/username). Relative paths (starting with ./, like ./data) are not supported.
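The path rules above can be checked locally before submitting a job. The following is a minimal sketch, not trainML's own resolution logic: `resolve_storage_path` is a hypothetical helper, it assumes a POSIX-style system, and it only handles the plain `$NAME` form of environment variables.

```python
import os
import re


def resolve_storage_path(path, env=None):
    """Expand a storage path per the rules above, or raise ValueError.

    Hypothetical local pre-check; the platform's own handling may differ.
    """
    env = os.environ if env is None else env

    if path.startswith("/"):
        return path  # already an absolute path

    if path.startswith("~/"):
        home = env.get("HOME")
        if not home:
            raise ValueError("HOME is not set, cannot expand '~/'")
        return home.rstrip("/") + path[1:]

    if path.startswith("$"):
        # Split "$NAME/rest" into the variable name and the remainder.
        match = re.match(r"\$([A-Za-z_][A-Za-z0-9_]*)(.*)", path, re.DOTALL)
        if match is None or match.group(1) not in env:
            raise ValueError(f"cannot expand environment variable in {path!r}")
        expanded = env[match.group(1)] + match.group(2)
        if not expanded.startswith("/"):
            raise ValueError(f"{path!r} does not expand to an absolute path")
        return expanded

    # Anything else (./data, data, ../data) is a relative path.
    raise ValueError(f"relative path {path!r} is not supported")
```

Passing an explicit `env` dictionary keeps the check reproducible; omitting it uses the environment of the shell in which the script runs, which should be the same environment in which you run the connection utility.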

Running the connection utility is mandatory for local storage types. Jobs will wait indefinitely for you to connect before downloading data or uploading their results.

Warning

Jobs will wait indefinitely for you to connect, and you will continue to be billed while they wait.

Additionally, the storage path specified must exist in the same environment in which you are running the connection utility. For example, if you run the connection utility inside a Linux virtual machine (VM), the storage path specified must be the path of the data inside the VM, not on the host computer.

When using the Local storage type for input data, the contents of the specified directory will be recursively copied to the trainML data path (accessible with the TRAINML_DATA_PATH environment variable) for the workers to access. No automatic extraction of archives will occur, so ensure that the data is already unarchived on your local computer. Additionally, we recommend that you use an isolated path for the input data, which contains only the data you need for this training job and nothing else. Copying more data than necessary will needlessly delay the workers from starting. The speed of this process will be primarily limited by your internet connection's upstream bandwidth; however, the data download duration only costs your patience, not your credits.
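To act on the advice above, it can help to inspect the input directory before starting a job. The sketch below is illustrative only: `summarize_input_dir` and `estimate_transfer_seconds` are hypothetical helpers, archives are detected by filename suffix alone, and the time estimate ignores protocol overhead and assumes the upstream link is fully available.

```python
import os

# Assumption: a filename-suffix check is enough to spot forgotten archives.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".7z")


def summarize_input_dir(root):
    """Walk root recursively and report size, file count, and leftover archives."""
    total_bytes = 0
    file_count = 0
    archives = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if os.path.islink(full_path):
                continue  # symbolic links are skipped in this sketch
            total_bytes += os.path.getsize(full_path)
            file_count += 1
            if name.lower().endswith(ARCHIVE_SUFFIXES):
                archives.append(os.path.relpath(full_path, root))
    return {
        "total_bytes": total_bytes,
        "file_count": file_count,
        "archives": sorted(archives),
    }


def estimate_transfer_seconds(total_bytes, upstream_mbps):
    """Rough lower bound on transfer time for a given upstream speed in Mbit/s."""
    return total_bytes * 8 / (upstream_mbps * 1_000_000)
```

A non-empty `archives` list suggests data that still needs to be extracted, and a large `total_bytes` relative to what the job needs suggests the directory is not isolated enough.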

When using the Local storage type for output artifacts, the contents of the trainML output path (accessible with the TRAINML_OUTPUT_PATH environment variable) will be zipped with a naming convention of <job_name>.zip (or <job_name>_<worker_number>.zip for a multi-worker training job) and uploaded to the specified local directory.
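Inside the training code, the two environment variables mentioned above are the only link to these directories. As a minimal sketch, assuming a script that wants to save a small metrics file alongside its other artifacts: `job_dirs` and `save_metrics` are hypothetical helpers, and the `data` and `output` fallbacks are an assumption made so the script also runs outside the platform.

```python
import json
import os


def job_dirs(env=None, default_data="data", default_output="output"):
    """Return (data_dir, output_dir) from the trainML environment variables.

    The defaults are assumptions for running the same script locally.
    """
    env = os.environ if env is None else env
    data_dir = env.get("TRAINML_DATA_PATH", default_data)
    output_dir = env.get("TRAINML_OUTPUT_PATH", default_output)
    return data_dir, output_dir


def save_metrics(metrics, output_dir, filename="metrics.json"):
    """Write metrics as JSON inside the output directory and return the path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle, indent=2)
    return path
```

Anything written under the output directory this way would be part of the zipped contents described above.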

Warning

Similar to downloading, workers will wait indefinitely for you to connect in order to upload their results. You continue to be billed while workers are uploading. We recommend that you stay connected for the entire duration of any job that uses the local storage option.

Example Training Job

As an example of how to use this storage type, the following adapts the TensorFlow CIFAR-10 Image Classification Tutorial to use the local storage type.

Data Preparation

The data preparation step is significantly simplified, as the only thing that is required is to download the data to your local computer. In order to run this example using the root of your home directory, run the following commands:

cd ~
mkdir trainml-examples
cd trainml-examples
mkdir data
mkdir output

cd data
curl -O https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xzf cifar-10-python.tar.gz
rm cifar-10-python.tar.gz
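
Since archives are not extracted automatically, it is worth confirming that the extraction succeeded before creating the job. The sketch below assumes the standard layout of the CIFAR-10 Python archive (a cifar-10-batches-py folder holding five training batches, a test batch, and a metadata file); `missing_cifar_files` is a hypothetical helper.

```python
import os

# Assumed layout of the extracted CIFAR-10 Python archive.
EXTRACTED_DIR = "cifar-10-batches-py"
EXPECTED_FILES = (
    ["batches.meta"]
    + [f"data_batch_{i}" for i in range(1, 6)]
    + ["test_batch"]
)


def missing_cifar_files(data_dir):
    """Return the expected CIFAR-10 files that are absent under data_dir."""
    base = os.path.join(os.path.expanduser(data_dir), EXTRACTED_DIR)
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(base, name))
    ]
```

Calling `missing_cifar_files("~/trainml-examples/data")` should return an empty list once the commands above have completed.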

Model Training

At this point, you should have a data folder inside ~/trainml-examples that contains the data set, and an output folder that is currently empty. Log in to the trainML platform and click the Create Training Job+/Training Job+ button to open a new job form. Populate the fields with the following values:

  • Job Name: Local Storage Example
  • GPU Type: gtx1060
  • GPU Count: 1
  • Job Environment: Deep Learning - Python 3.9
  • Model Code Location: https://github.com/trainML/examples.git
  • Input Data Type: Local
  • Input data storage path: ~/trainml-examples/data
  • Output Type: Local
  • Artifact output storage path: ~/trainml-examples/output
  • Number of Workers: 1
  • Command for Worker 1: PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024

Click Next and then Create on the job form review page to start the job. When the job is in the running state, it will wait for you to connect before it tries to download the data set. Connect to the job by following the instructions in the getting started guide. Once you connect, the log viewer will show the data downloading to the workers. Once that's complete, you will see the output from the worker itself during training, and finally the zipping and uploading of the results.

After the job completes, navigate to the ~/trainml-examples/output directory on your local computer. You will see a file named in the format <job_name>.zip. Unzip it, and you can continue with the "analyzing the output" step of the tutorial.
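
If you prefer to script the unzip step, the following sketch extracts every zip file found in the output directory into a folder named after the archive. `extract_artifacts` is a hypothetical helper; it makes no assumption about the exact job name in the file name and simply processes whatever .zip files are present.

```python
import os
import zipfile


def extract_artifacts(output_dir):
    """Extract each .zip in output_dir into a sibling folder; return the folders."""
    output_dir = os.path.expanduser(output_dir)
    extracted = []
    for name in sorted(os.listdir(output_dir)):
        archive = os.path.join(output_dir, name)
        if not (name.lower().endswith(".zip") and zipfile.is_zipfile(archive)):
            continue  # skip non-archives and incomplete uploads
        target = os.path.join(output_dir, name[:-4])  # drop the ".zip" suffix
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

For the example job, `extract_artifacts("~/trainml-examples/output")` would unpack the uploaded archive in place. For a multi-worker job it would unpack one folder per worker archive.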