Training jobs are ideal for parallel hyperparameter tuning, model architecture searches, and long-duration training runs on models designed with a scriptable interface.
Starting a Job
Click the Create a Training Job link on the Home screen or the
Create button on the Training Jobs page to open a new job form. Enter a job name that will uniquely identify this job for you. Select the type of GPU you would like to use by clicking an available GPU card. Select how many GPUs you want attached to each worker in the
GPU Count field. A maximum of 4 GPUs per worker is allowed. If any options in these fields are disabled, there are not enough GPUs of that type available to satisfy your request. Specify the amount of disk space you want allocated for this job's working directory in the
Disk Size field. Be sure to allocate enough space to complete this job, as this allocation cannot be changed once the job is created.
Select
Public Dataset from the
Dataset Type field, then select
CIFAR-10 from the
Dataset field. This will automatically load the CIFAR-10 dataset into the
/opt/trainml/input directory of each job worker. Since this is a trainML supplied dataset, you will incur no additional storage cost for using this data.
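The Python version of CIFAR-10 ships as pickled batch files, so a worker's training script can read them with the standard library. The sketch below is illustrative only: the exact file layout under /opt/trainml/input depends on how the dataset is packaged, and the TRAINML_INPUT_PATH override exists only so the sketch can run outside a worker.

```python
import os
import pickle

# /opt/trainml/input is where trainML mounts the selected dataset inside each
# worker. The environment-variable override here is an assumption for local
# experimentation, not a documented trainML variable.
INPUT_DIR = os.environ.get("TRAINML_INPUT_PATH", "/opt/trainml/input")

def load_batch(path):
    """Load one CIFAR-10-style pickled batch file into a dict."""
    with open(path, "rb") as f:
        # The CIFAR-10 batches were pickled under Python 2, so "latin1"
        # decoding is the conventional way to load them under Python 3.
        return pickle.load(f, encoding="latin1")
```

A training script would then call something like `load_batch(os.path.join(INPUT_DIR, "data_batch_1"))` for each batch file present.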
To create a trainML model from the job worker's training results to be used as the basis for subsequent jobs, select
trainML from the
Output Type dropdown. Alternatively, to automatically upload model results to an external source, select a configured storage provider. Based on what provider you choose, enter a valid path into the
Output storage path field. If this is configured, when a worker exits, it will automatically zip the contents of the
/opt/trainml/output directory and push it to the specified storage path. It may take several minutes after training is complete for the final artifact upload to finish, depending on the size of the contents of the output directory.
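The archiving half of that step can be approximated with the standard library; this is a hedged sketch of the zip-and-push behavior described above (the push to the storage provider is handled by the platform, and the paths are placeholders):

```python
import shutil

def archive_output(output_dir, archive_base):
    """Zip the contents of output_dir, as trainML does for
    /opt/trainml/output when a worker exits, and return the archive path."""
    # shutil.make_archive appends the .zip extension itself
    return shutil.make_archive(archive_base, "zip", root_dir=output_dir)
```

For example, `archive_output("/opt/trainml/output", "/tmp/final-artifact")` would produce `/tmp/final-artifact.zip`.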
trainML can use git to load the model code onto
the job workers. Specify a git clone url in the
Model Code Location field. For
example, to pull the default branch of the TensorFlow Model Garden, enter
https://github.com/tensorflow/models.git
The repository will be loaded into the
/opt/trainml/models directory of the worker environment, which will be the current working directory of the worker when it starts.
You can specify up to 20 workers to run in parallel, each with their own dedicated GPUs. You can specify a unique command for each worker, or the same command for all workers. If you are using an external solution like Weights & Biases to control the experiments each worker is running, you may want to specify a single command for all workers. If you are using command line arguments to your training script to make different workers try different hyperparameters or architectures, specify a unique command for each worker.
The command specified will be run at the root of the code repository that was loaded based on the models section. For example, to run a shell script called train.sh located in the root of the code repository with arguments --lr=1e-8 --alpha=0.5, input:
./train.sh --lr=1e-8 --alpha=0.5
To run a python script called train.py located in the root of the code repository with the same arguments, input:
python train.py --lr=1e-8 --alpha=0.5
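A train.py that accepts the hyperparameter flags above could begin with a standard argparse stanza. This is a generic sketch, not trainML-specific code, and the flag defaults are illustrative:

```python
import argparse

def parse_args(argv=None):
    """Parse hyperparameter flags of the form used in the worker commands."""
    parser = argparse.ArgumentParser(description="Example training entry point")
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="learning rate, e.g. --lr=1e-8")
    parser.add_argument("--alpha", type=float, default=1.0,
                        help="example model hyperparameter")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"training with lr={args.lr} alpha={args.alpha}")
```

Giving each worker a different command line (for example, varying `--lr` per worker) is how the unique-command option maps distinct hyperparameters onto parallel workers.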
You can optionally add environment variables that will be accessible by the workers to further control their execution. For example, passing a BUCKET_NAME to automatically write checkpoints to during the model run, or setting WANDB_PROJECT and WANDB_API_KEY when using Weights & Biases to track your experiments. Like the model code, environment variables are shared across all job workers.
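Inside a worker, these are ordinary process environment variables. The sketch below shows one way a training script might collect them; WANDB_PROJECT and WANDB_API_KEY are the real variable names Weights & Biases reads, while BUCKET_NAME is just the illustrative variable from the paragraph above:

```python
import os

def read_job_config():
    """Collect optional worker configuration from environment variables.

    Missing variables fall back to None so the training script can decide
    whether the corresponding integration is enabled.
    """
    return {
        "bucket_name": os.environ.get("BUCKET_NAME"),
        "wandb_project": os.environ.get("WANDB_PROJECT"),
    }
```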
If you want your job workers to utilize third-party cloud services, you can also attach their keys to the workers. This will set the relevant environment variables in the worker containers to the configured key values.
As an example, if you want to fully configure the aws cli within the worker container:
- Configure an AWS third-party key with an IAM user that has the appropriate policy attached.
- Add an environment variable with a key of
AWS_DEFAULT_REGION and a value of the AWS region the data or services reside in.
- Select AWS from the
Add Third-Party Keys to Workers dropdown.
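Once the key and region are attached, the aws CLI and SDKs inside the worker pick up credentials from the standard environment variables. A small preflight check at the top of a training script might look like this (a sketch; the variable names are the standard AWS ones, not trainML-specific):

```python
import os

# Standard variables the AWS CLI and SDKs read from the environment
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
                     "AWS_DEFAULT_REGION")

def missing_aws_vars(environ=os.environ):
    """Return the names of any required AWS variables that are not set."""
    return [name for name in REQUIRED_AWS_VARS if not environ.get(name)]
```

Failing fast with a clear message when `missing_aws_vars()` is non-empty is friendlier than letting a worker run for hours before a credential error surfaces.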
Once you click Next on the job form, you are given the opportunity to review your training job configuration for errors. Review these settings carefully. They cannot be changed once a job is started.
If the number of workers multiplied by the GPUs per worker exceeds the currently available GPUs of that type, you will receive a message on the review form stating that workers will queue until GPUs become available. When this occurs, not all workers will run in parallel; queued workers may not start until other workers complete. You are not billed for waiting workers.
Monitoring the Job
Once a job successfully starts, the dashboard should indicate that the job is in the
running state. Click the
View button to access the job worker logs. Log messages are sorted in descending order (most recent on top) and new log messages appear automatically as they are generated. If you have a job with multiple workers, you can filter the log view for a single worker using the dropdown at the top of the logs page. If there are many log messages, you can scroll down on the page to see older logs. If you have multiple workers, you must select a specific worker to see the older log messages.
To view the current status of each job worker, click on the job name from the Training Job dashboard. This detail view shows the current status of each worker as well as the command each worker is executing.
Stopping and Terminating a Job
When each worker finishes executing, it automatically stops, and billing for that worker also stops. When all workers complete, the job is considered finished. You can also interrupt a running job or job worker by clicking
Stop on either the job or the job worker.
Finished jobs may be automatically purged after 24 hours.
When a job is finished, you can review its details and download an extract of the worker logs. When you no longer need to see the details of the job, click the