trainML training jobs can now run on data directly from your local computer and upload their results back without using any cloud intermediary. If you already have the data set on your local computer and want to avoid the repetitive cycle of uploading and downloading from cloud storage, this storage type is for you.
How It Works
The Local option is now available in both the Input Data Type and Output Type fields. If this option is selected, specify the storage path as the location on your local computer you want the data to be copied from or to. The path must be specified as an absolute path (for example, /home/username/data), a home directory relative path (for example, ~/data), or an environment variable based path (for example, $HOME/data, where the HOME environment variable on your local computer is set to /home/username). Relative paths (those starting with ./) are not supported.
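The path rules above can be checked locally before you submit a job. The following is a minimal sketch, not trainML's own validation: the function name resolve_storage_path is hypothetical, the accepted prefixes (/, ~/, and $) are an interpretation of the three supported forms, and a POSIX-style file system is assumed.

```python
import os


def resolve_storage_path(path: str) -> str:
    """Expand a storage path locally and reject unsupported relative forms.

    Hypothetical helper: it mirrors the three supported forms described in
    the text (absolute, home-relative, environment-variable based).
    """
    if not path.startswith(("/", "~/", "$")):
        raise ValueError(f"unsupported storage path form: {path!r}")
    # Expand "~" first, then any $VARIABLE references.
    expanded = os.path.expandvars(os.path.expanduser(path))
    # An unset variable is left unexpanded, so the result is not absolute.
    if not os.path.isabs(expanded):
        raise ValueError(f"path did not expand to an absolute path: {path!r}")
    return os.path.normpath(expanded)
```

Calling resolve_storage_path("~/data") and resolve_storage_path("$HOME/data") should both return the same absolute path on a machine where HOME is set, while "./data" raises a ValueError.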
Running the connection utility is mandatory for local storage types. Jobs will wait indefinitely for you to connect before downloading data or uploading their results, and you will continue to be billed while they wait.
Additionally, the storage path specified must exist in the same environment in which you are running the connection utility. For example, if you run the connection utility inside a Linux virtual machine (VM), the storage path must be the path of the data inside the VM, not on the host computer.
When using the Local storage type for input data, the contents of the specified directory will be recursively copied to the trainML data path (accessible with the TRAINML_DATA_PATH environment variable) for the workers to access. No automatic extraction of archives will occur, so ensure that the data is already unarchived on your local computer. Additionally, we recommend that you use an isolated path for the input data, which contains only the data you need for this training job and nothing else. Copying more data than necessary will needlessly delay the workers from starting. The speed of this process will be primarily limited by your internet connection's upstream bandwidth; however, the data download duration only costs your patience, not your credits.
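Before starting a job, it can help to check how much data the input directory holds and whether any archives were left unextracted. The sketch below is an illustration only: the function names, the list of archive suffixes, and the bandwidth figure you pass in are all assumptions, and the time estimate ignores protocol overhead.

```python
import os

# Assumed list of suffixes that usually indicate an unextracted archive.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".7z")


def inspect_input_dir(root: str) -> tuple[int, list[str]]:
    """Return (total size in bytes, relative paths of files that look like archives)."""
    total_bytes = 0
    archives = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks so broken links do not raise
            total_bytes += os.path.getsize(full)
            if name.lower().endswith(ARCHIVE_SUFFIXES):
                archives.append(os.path.relpath(full, root))
    return total_bytes, sorted(archives)


def estimate_upload_seconds(total_bytes: int, upstream_mbps: float) -> float:
    """Rough lower bound on transfer time at a given upstream rate in megabits/s."""
    return total_bytes * 8 / (upstream_mbps * 1_000_000)
```

If inspect_input_dir reports archives, extract them first; if the size is much larger than expected, consider moving the data for this job into its own directory.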
When using the Local storage type for output artifacts, the contents of the trainML output path (accessible with the TRAINML_OUTPUT_PATH environment variable) will be zipped with a naming convention of <job_name>.zip (or <job_name>_<worker_number>.zip for a multi-worker training job) and uploaded to the specified local directory.
Similar to downloading, workers will wait indefinitely for you to connect in order to upload their results. You continue to be billed while workers are uploading. We recommend that you stay connected for the entire duration of any job that uses the local storage option.
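Inside the training code, the two locations described above are read from the TRAINML_DATA_PATH and TRAINML_OUTPUT_PATH environment variables. The following is a minimal sketch of how a script might resolve them; the helper names, the ./data and ./output fallbacks for running outside a job, and the metrics.json file name are hypothetical and not part of the platform.

```python
import json
import os
from pathlib import Path


def job_paths(default_data: str = "./data", default_output: str = "./output"):
    """Resolve the input and output directories for the current run.

    On a worker, the two environment variables are expected to be set;
    the defaults are assumed fallbacks for local testing only.
    """
    data_dir = Path(os.environ.get("TRAINML_DATA_PATH", default_data))
    output_dir = Path(os.environ.get("TRAINML_OUTPUT_PATH", default_output))
    output_dir.mkdir(parents=True, exist_ok=True)
    return data_dir, output_dir


def save_metrics(output_dir, metrics: dict, filename: str = "metrics.json") -> Path:
    """Write results under the output directory so they are included in the zip."""
    target = Path(output_dir) / filename
    target.write_text(json.dumps(metrics, indent=2))
    return target
```

Anything the script writes under the output directory is what ends up in the zipped artifact, so keeping all results there avoids losing files when the worker finishes.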
Example Training Job
As an example of how to use this storage type, the following adapts the TensorFlow CIFAR-10 Image Classification Tutorial to use the local storage type.
The data preparation step is significantly simplified, because the only requirement is to download the data to your local computer. To run this example from the root of your home directory, run the following commands:
curl -O https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xzf cifar-10-python.tar.gz
At this point, you should have a data folder inside the root of the code repository that contains the data set, and an output folder that is currently empty. Log in to the trainML platform and click the Create Training Job+/Training Job+ button to open a new job form. Populate the fields with the following values:
Job Name: Local Storage Example
GPU Type: gtx1060
GPU Count: 1
Job Environment: Deep Learning - Python 3.9
Model Code Location: https://github.com/trainML/examples.git
Input Data Type: Local
Input data storage path: ~/trainml-examples/data
Output Type: Local
Artifact output storage path: ~/trainml-examples/output
Number of Workers: 1
Command for Worker 1: PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024
Click Next and then Create on the job form review page to start the job. When the job is in the running state, it will wait for you to connect before it tries to download the data set. Connect to the job by following the instructions in the getting started guide. Once you connect, the log viewer will show the data downloading to the workers. Once that's complete, you will see the output from the worker itself during the training, and finally the zipping and uploading.
After the job completes, navigate to the ~/trainml-examples/output directory on your local computer. You will see a file named in the format <job_name>.zip. Unzip it, and you can continue with the "analyzing the output" step of the tutorial.
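If you prefer to unpack the results from a script, the sketch below extracts every zip file found in the output directory into a folder named after the archive. The function name extract_artifacts is hypothetical, and the exact file name the platform produces for a given job name is not assumed here; the code simply matches *.zip.

```python
import zipfile
from pathlib import Path


def extract_artifacts(output_dir: str) -> list[Path]:
    """Extract each .zip in output_dir into a sibling folder named after it."""
    out = Path(output_dir).expanduser()
    extracted = []
    for archive in sorted(out.glob("*.zip")):
        target = out / archive.stem  # e.g. "results.zip" -> "results/"
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

For the example job, this would be called as extract_artifacts("~/trainml-examples/output").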