trainML training jobs can now run on data directly from your local computer and upload their results back without using any cloud intermediary. If you already have the data set on your local computer and want to avoid the repetitive cycle of uploading and downloading from cloud storage, this storage type is for you.
How It Works
The Local option is now available from both the Input Data Type and Output Type fields. If this option is selected, specify the storage path as the location on your local computer you want the data to be copied from or to. The path must be specified as an absolute path (starting with /, like /home/username/data), a home directory relative path (starting with ~/, like ~/data), or an environment variable based path (starting with $, like $HOME/data, where the HOME environment variable on your local computer is set to /home/username). Relative paths (./) are not supported.
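The path rules above can be checked locally before you submit a job. The following is a minimal sketch, not part of the trainML tooling: the function names classify_local_path and resolve_local_path are hypothetical, and how the platform itself validates or expands the path is not described here.

```python
import os


def classify_local_path(path: str) -> str:
    """Return which supported form a storage path uses, or raise ValueError."""
    if path.startswith("/"):
        return "absolute"
    if path.startswith("~/"):
        return "home-relative"
    if path.startswith("$"):
        return "environment-variable"
    # Anything else (for example ./data or data) is a relative path.
    raise ValueError(f"unsupported storage path: {path!r}")


def resolve_local_path(path: str) -> str:
    """Expand ~ and $VARS locally so you can see where the path points."""
    classify_local_path(path)  # reject unsupported forms first
    return os.path.expandvars(os.path.expanduser(path))
```

Calling resolve_local_path("$HOME/data") in the same shell environment where you plan to run the connection utility shows the directory the path would refer to on that machine.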
Running the connection utility is mandatory for local storage types. Jobs will wait indefinitely for you to connect before downloading data or uploading their results, and you will continue to be billed while they wait.
Additionally, the storage path specified must exist in the same environment in which you run the connection utility. For example, if you run the connection utility inside a Linux virtual machine (VM), the storage path must be the path of the data inside the VM, not on the host computer.
When using the Local storage type for input data, the contents of the specified directory will be recursively copied to the trainML data path (accessible with the TRAINML_DATA_PATH environment variable) for the workers to access. No automatic extraction of archives will occur, so ensure that the data is already unarchived on your local computer. Additionally, we recommend that you use an isolated path for the input data, which contains only the data you need for this training job and nothing else. Copying more data than necessary will needlessly delay the workers from starting. The speed of this process will be primarily limited by your internet connection's upstream bandwidth; however, the data download duration only costs your patience, not your credits.
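Because the whole directory is copied recursively and archives are not extracted, it can help to inspect the input directory before starting a job. The sketch below is one way to do that locally; summarize_input_dir and the ARCHIVE_SUFFIXES list are hypothetical helpers for illustration, not part of trainML.

```python
import os

# Illustrative, non-exhaustive list of suffixes that usually indicate an archive.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz")


def summarize_input_dir(root: str):
    """Walk root recursively and return (file_count, total_bytes, archive_paths)."""
    file_count = 0
    total_bytes = 0
    archives = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and other non-regular entries
            file_count += 1
            total_bytes += os.path.getsize(full)
            if name.lower().endswith(ARCHIVE_SUFFIXES):
                archives.append(full)
    return file_count, total_bytes, archives
```

A large byte total suggests a long transfer over your upstream connection, and any path in the returned archive list is a file the workers would receive still compressed.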
When using the Local storage type for output artifacts, the contents of the trainML output path (accessible with the TRAINML_OUTPUT_PATH environment variable) will be zipped with a naming convention of <job_name>.zip (or <job_name>_<worker_number>.zip for a multi-worker training job) and uploaded to the specified local directory.
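The naming convention above can be expressed as a small helper, for example to check which files to expect in the output directory. This is a sketch under assumptions: expected_archive_names is a hypothetical function, worker numbers are assumed to start at 1, and the job name is assumed to appear in the file name unchanged; none of these details are confirmed by the text above.

```python
from typing import List


def expected_archive_names(job_name: str, worker_count: int = 1) -> List[str]:
    """List the archive file names the naming convention implies for a job."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if worker_count == 1:
        return [f"{job_name}.zip"]
    # Assumption: workers are numbered 1..worker_count.
    return [f"{job_name}_{n}.zip" for n in range(1, worker_count + 1)]
```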
As with downloading, workers will wait indefinitely for you to connect in order to upload their results, and you continue to be billed while workers are uploading. We recommend that you stay connected for the entire duration of any job that uses the local storage option.
Example Training Job
As an example of how to use this storage type, the following adapts the TensorFlow CIFAR-10 Image Classification Tutorial to use the local storage type.
Data Preparation
The data preparation step is significantly simplified, as the only thing that is required is to download the data to your local computer. In order to run this example using the root of your home directory, run the following commands:
cd ~
mkdir trainml-examples
cd trainml-examples
mkdir data
mkdir output
cd data
curl -O https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xzf cifar-10-python.tar.gz
rm cifar-10-python.tar.gz
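After running the commands above, you may want to confirm that the extraction succeeded before creating the job. The CIFAR-10 Python archive unpacks into a cifar-10-batches-py folder; the sketch below checks for its batch files. The function name missing_cifar10_files is hypothetical and the check is only a local convenience.

```python
import os

# Files contained in the extracted CIFAR-10 Python archive.
EXPECTED_FILES = ["batches.meta", "test_batch"] + [
    f"data_batch_{i}" for i in range(1, 6)
]


def missing_cifar10_files(data_dir: str) -> list:
    """Return the expected CIFAR-10 files that are absent under data_dir."""
    batch_dir = os.path.join(os.path.expanduser(data_dir), "cifar-10-batches-py")
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(batch_dir, name))
    ]
```

An empty list from missing_cifar10_files("~/trainml-examples/data") means every expected file is in place.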
Model Training
At this point, you should have a data folder inside ~/trainml-examples that contains the data set, and an output folder that is currently empty. Log in to the trainML platform and click the Create Training Job+/Training Job+ button to open a new job form. Populate the fields with the following values:
Job Name: Local Storage Example
GPU Type: gtx1060
GPU Count: 1
Job Environment: Deep Learning - Python 3.9
Model Code Location: https://github.com/trainML/examples.git
Input Data Type: Local
Input data storage path: ~/trainml-examples/data
Output Type: Local
Artifact output storage path: ~/trainml-examples/output
Number of Workers: 1
Command for Worker 1: PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024
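The worker command above passes the data and output locations to the tutorial script as flags. If you write your own training script, it could instead read the same environment variables directly. The following is a minimal sketch of that approach; job_paths is a hypothetical helper, and only the TRAINML_DATA_PATH and TRAINML_OUTPUT_PATH variable names come from this page.

```python
import os
from pathlib import Path


def job_paths(environ=os.environ):
    """Return (data_dir, output_dir) from the trainML environment variables."""
    data_dir = Path(environ["TRAINML_DATA_PATH"])      # where input data is copied
    output_dir = Path(environ["TRAINML_OUTPUT_PATH"])  # contents are zipped and uploaded
    return data_dir, output_dir
```

Anything the script writes under the returned output directory would then be part of the archive uploaded to your local computer.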
Click Next and then Create on the job form review page to start the job. When the job is in the running state, it will wait for you to connect before it tries to download the data set. Connect to the job by following the instructions in the getting started guide. Once you connect, the log viewer will show the data downloading to the workers. Once that's complete, you will see the output from the worker itself during training, and finally the zipping and uploading of the results.
After the job completes, navigate to the ~/trainml-examples/output directory on your local computer. You will see a file named in the format <job_name>.zip. Unzip it, and you can continue with the "analyzing the output" step of the tutorial.
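If you prefer to unzip the results from a script rather than by hand, the standard library zipfile module is enough. This is a minimal sketch, assuming the output directory holds only job archives; extract_job_archives is a hypothetical helper and extracts each archive into a folder named after it.

```python
import zipfile
from pathlib import Path


def extract_job_archives(output_dir: str) -> list:
    """Extract every .zip in output_dir into a sibling folder of the same name."""
    extracted = []
    for archive in sorted(Path(output_dir).expanduser().glob("*.zip")):
        target = archive.with_suffix("")  # my_job.zip -> my_job/
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

For this example, extract_job_archives("~/trainml-examples/output") would unpack the job's archive next to the zip file.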