
Running a Training Job

Training jobs are ideal for parallel hyperparameter tuning, model architecture searches, and long-duration training runs on models that have been designed with a scriptable interface.

Starting a Job

Click the Create a Training Job link on the Home screen or the Create button from the Training Jobs page to open a new job form, then configure the job:

  1. Enter a job name that will uniquely identify this job for you.
  2. Select the type of GPU you would like to use by clicking an available GPU card.
  3. Select how many GPUs you want attached to each worker in the GPU Count field. A maximum of 4 GPUs per worker is allowed. If any options in these fields are disabled, there are not enough GPUs of that type available to satisfy your request.
  4. Specify the amount of disk space you want allocated for this job's working directory in the Disk Size field. Be sure to allocate enough space to complete this job, as this allocation cannot be changed once the job is created.

Data

Select Public Dataset from the Dataset Type field, then select CIFAR-10 from the Dataset field. This will automatically load the CIFAR-10 dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML-supplied dataset, you will incur no additional storage cost for using this data.
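If you want to confirm exactly what data was delivered to a worker, your training script can simply list the input directory. The snippet below is a minimal sketch; the /opt/trainml/input path comes from above, but the file layout it prints depends on the dataset you selected.

import os

# Path where trainML mounts the selected dataset on each worker
INPUT_DIR = "/opt/trainml/input"

# Print every delivered file so the worker logs show exactly what is available.
for root, dirs, files in os.walk(INPUT_DIR):
    for name in files:
        print(os.path.join(root, name))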

To create a trainML model from the job worker's training results to be used as the basis for subsequent jobs, select trainML from the Output Type dropdown. Alternatively, to automatically upload model results to an external source, select a configured storage provider. Based on the provider you choose, enter a valid path into the Output storage path field. If this is configured, when a worker exits, it will automatically zip the contents of the /opt/trainml/output directory and push it to the specified storage path. It may take several minutes after training is complete for the final artifact upload to finish, depending on the size of the contents of the output directory.
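In practice, a training script only needs to write its artifacts into /opt/trainml/output; zipping and uploading are handled when the worker exits. A minimal sketch (the artifact names and values below are hypothetical):

import json
import os

# Anything written to this directory is zipped and pushed to the
# configured output location when the worker exits.
OUTPUT_DIR = "/opt/trainml/output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Hypothetical final metrics written alongside the saved model.
metrics = {"val_accuracy": 0.91}
with open(os.path.join(OUTPUT_DIR, "metrics.json"), "w") as f:
    json.dump(metrics, f)

# model.save(os.path.join(OUTPUT_DIR, "model"))  # e.g. save a trained model here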

Model

trainML can use git to load the model code onto the job workers. Specify a git clone URL in the Model Code Location field. For example, to pull the default branch of the Tensorflow Model Garden, enter https://github.com/tensorflow/models.git.

The repository will be loaded into the /opt/trainml/models directory of the worker environment, which will be the current working directory of the worker when it starts.
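Because the repository root is the worker's working directory, relative paths in the worker command and in your code resolve against the cloned repository. A trivial sketch (the config file name is hypothetical):

import os

# The worker starts in /opt/trainml/models, the root of the cloned repository,
# so relative paths resolve against the repository contents.
print(os.getcwd())  # expected: /opt/trainml/models
config_path = os.path.join("configs", "default.yaml")  # hypothetical file in the repo
print(os.path.abspath(config_path))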

Workers

You can specify up to 20 workers to run in parallel, each with their own dedicated GPUs. You can specify a unique command for each worker, or the same command for all workers. If you are using an external solution like Weights & Biases to control the experiments each worker is running, you may want to specify a single command for all workers. If you are using command line arguments to your training script to make different workers try different hyperparameters or architectures, specify a unique command for each worker.

The command specified will be run at the root of the code repository that was loaded in the Model section. For example, to run a shell script called train.sh located in the root of the code repository with the arguments --lr=1e-8 --alpha=0.5, input:

./train.sh --lr=1e-8 --alpha=0.5

To run a python script called train.py located in the root of the code repository with the same arguments, input:

python train.py --lr=1e-8 --alpha=0.5
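Each worker receives its command verbatim, so a scriptable entry point that reads its hyperparameters from the command line is all that is required. Below is a minimal, hypothetical train.py sketch using argparse; only the --lr and --alpha flag names come from the example above, and the training loop itself is omitted.

import argparse

def main():
    parser = argparse.ArgumentParser(description="Example training entry point")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--alpha", type=float, default=0.5, help="example hyperparameter")
    args = parser.parse_args()

    print(f"Training with lr={args.lr}, alpha={args.alpha}")
    # ... build the model and run the training loop here ...

if __name__ == "__main__":
    main()

With this pattern, giving each worker a slightly different command (for example, a different --lr value per worker) is all it takes to express a simple hyperparameter sweep.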

Environment (Optional)

You can optionally add environment variables that will be accessible by the workers to further control their execution. For example, you might pass a BUCKET_NAME that your script writes checkpoints to during the training run, or set WANDB_PROJECT and WANDB_API_KEY when using Weights & Biases to track your experiments. Like the model code, environment variables are shared across all job workers.
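Inside a worker, these are ordinary process environment variables. A minimal sketch of reading them from Python, using the variable names from the example above:

import os

# Environment variables configured on the job are visible to every worker.
bucket_name = os.environ.get("BUCKET_NAME")
wandb_project = os.environ.get("WANDB_PROJECT")

if bucket_name:
    print(f"Checkpoints will be written to bucket: {bucket_name}")
if wandb_project:
    print(f"Logging experiments to W&B project: {wandb_project}")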

If you want your job workers to utilize third-party cloud services, you can also attach their keys to the workers. This will set the relevant environment variables in the worker containers to the configured key values.

As an example, if you want to fully configure the aws cli within the worker container:

  1. Configure an AWS third-party key with an IAM user that has the appropriate policy attached.
  2. Add an environment variable with a key of AWS_DEFAULT_REGION and a value of the AWS region the data or services reside in.
  3. Select AWS from the Add Third-Party Keys to Workers dropdown.
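With the key attached and the region set, tools that follow the standard AWS credential chain (the aws CLI, or SDKs such as boto3) should work without further configuration. As an illustration, a worker could push a checkpoint to S3 during the run; this sketch assumes the attached key is exposed through the standard AWS environment variables, and the bucket and file names are hypothetical.

import boto3

# boto3 reads the credentials and AWS_DEFAULT_REGION from the environment
# configured by the third-party key and environment variable steps above.
s3 = boto3.client("s3")
s3.upload_file(
    "/opt/trainml/output/checkpoint.ckpt",  # hypothetical local checkpoint
    "my-training-bucket",                    # hypothetical bucket name
    "runs/example-job/checkpoint.ckpt",      # hypothetical object key
)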

Review

Once you click Next on the job form, you are given the opportunity to review your training job configuration for errors. Review these settings carefully. They cannot be changed once a job is started.

info

If the number of workers and GPUs per worker exceeds the currently available GPUs of that type, you will receive a message on the review form stating that workers will queue until GPUs become available. When this occurs, not all workers will run in parallel; some may not start until other workers complete. You are not billed for waiting workers.

Monitoring the Job

Once a job successfully starts, the dashboard should indicate that the job is in the running state. Click the View button to access the job worker logs. Log messages are sorted in descending order (most recent on top) and new log messages appear automatically as they are generated. If you have a job with multiple workers, you can filter the log view for a single worker using the dropdown at the top of the logs page. If there are many log messages, you can scroll down on the page to see older logs. If you have multiple workers, you must select a specific worker to see the older log messages.

To view the current status of each job worker, click on the job name from the Training Job dashboard. This detail view shows the status of each worker as well as the command each worker is executing.

Stopping and Terminating a Job

When each worker finishes executing, it automatically stops, and billing for that worker also stops. When all workers complete, the job is considered finished. You can also interrupt a running job or an individual worker at any time by clicking its Stop button.

info

Finished jobs may be automatically purged after 24 hours.

When a job is finished, you can review its details and download an extract of the worker logs. When you no longer need to see the details of the job, click the Terminate button.