Simple Image Classification with Training Jobs

This tutorial should cost less than 0.1 credits ($0.10) if you use the GTX 1060 instance type and the same training settings as the guide.

Some parts of this tutorial use advanced features that require the connection capability. Ensure that your environment meets the requirements and that you have installed all the prerequisites.

Model Overview

This tutorial demonstrates how to train a ResNet50 model from Keras Applications on the CIFAR-10 dataset. The example training code can be found in the trainML examples repository in the training/image-classification folder.

Training File

The entire training code is contained in the resnet_cifar.py file. The file is a regular Python script and not a Jupyter notebook. Although it is possible to run notebooks as scripts using nbconvert, converting a notebook to a script yourself provides the opportunity to add capabilities that make training, experimentation, and production deployment much easier. The key step in converting a notebook into a regular Python script is to move the top-level commands into a main() function and call it from an if __name__ == "__main__": guard.

def main():
    # <top-level commands go here>
    ...

if __name__ == "__main__":
    main()

You can find more information about main functions here. Performing this step makes implementing the next improvement possible.

Argparse

The built-in argparse library makes it easy to pass script variables like hyperparameters through the command line. The top section of the resnet_cifar.py file demonstrates how to define a parser, add arguments, and define some simple data validation for those values. Once the parser is defined, the rest of your code can access those values from the object returned by the parse_args() function, which is called on the first line of the main() function. In many notebook versions, hyperparameters like batch size, epochs, or optimizer selection are hardcoded in the notebook and have to be manually modified to try a new value. With argparse, the code of the script does not need to change in order to modify those hyperparameters; simply pass in a new value at runtime. For example, to run this code with 2 epochs and the rmsprop optimizer, use a command like this:

python resnet_cifar.py --epochs 2 --optimizer rmsprop

This method can be used to dynamically control any aspect of your training execution, from how many checkpoints to save, what metrics to calculate, to which model architecture to train.
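The pattern described above can be sketched as follows. This is a minimal illustration, not the contents of resnet_cifar.py: the flag names mirror the commands shown in this tutorial, while the positive_int validator, the default values, and the list of optimizer choices are assumptions made for the example.

```python
import argparse


def positive_int(value):
    """Parse a command-line string as an integer and reject values below 1."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError(f"expected a positive integer, got {value!r}")
    return number


def build_parser():
    # Flag names follow the tutorial's commands; defaults and choices are assumptions.
    parser = argparse.ArgumentParser(description="Train ResNet50 on CIFAR-10")
    parser.add_argument("--epochs", type=positive_int, default=10)
    parser.add_argument("--batch-size", type=positive_int, default=128)
    parser.add_argument("--optimizer", choices=["adam", "rmsprop", "sgd"], default="adam")
    return parser


def main(argv=None):
    # With argv=None argparse reads sys.argv[1:]; passing a list makes testing easy.
    args = build_parser().parse_args(argv)
    print(f"epochs={args.epochs} batch_size={args.batch_size} optimizer={args.optimizer}")
    return args


# Equivalent to running: python resnet_cifar.py --epochs 2 --optimizer rmsprop
args = main(["--epochs", "2", "--optimizer", "rmsprop"])
```

Putting the validation in the type callable means an invalid value such as --epochs 0 is rejected with a usage message before any training code runs.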

Environment Variables

The last thing to note about the training code is the use of the trainML environment variables. Using these variables is the recommended way of accessing the input data, model code, and output data locations within a trainML job. The load_data function uses the TRAINML_DATA_PATH environment variable to load the CIFAR-10 data from the location where the dataset is attached when the job starts. Additionally, the build_callbacks function saves both the TensorBoard logs and the model checkpoints to the TRAINML_OUTPUT_PATH. This ensures that all those files are uploaded at the end of training. Any data outside the TRAINML_OUTPUT_PATH is assumed to be in-process material and is discarded.

Although not implemented in this script, using these environment variables can make it easy to run your code both on your local computer and as a trainML job. For example, you can set the input data path like the following:

import os
data_path = os.environ.get('TRAINML_DATA_PATH') or '/path/to/local/dataset'

When this code runs locally, TRAINML_DATA_PATH is not defined, so the code falls back to the default local path. When it runs inside a trainML job, it automatically uses the correct location for the attached input data.
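The same fallback idea can be extended to the output location. The sketch below is one possible arrangement rather than the code in resnet_cifar.py: the resolve_dir helper, the local fallback paths, and the checkpoints and logs subfolder names are all hypothetical.

```python
import os
from pathlib import Path


def resolve_dir(env_var, local_default):
    """Return the directory named by env_var, or local_default when it is unset or empty."""
    return Path(os.environ.get(env_var) or local_default)


# Local fallback paths are placeholders; replace them with real directories.
data_dir = resolve_dir("TRAINML_DATA_PATH", "/path/to/local/dataset")
output_dir = resolve_dir("TRAINML_OUTPUT_PATH", "./output")

# Subfolder names are assumptions; anything under output_dir is kept with the job output.
checkpoint_dir = output_dir / "checkpoints"
log_dir = output_dir / "logs"
```

Deriving every output path from a single output_dir keeps checkpoints and logs inside the location that is uploaded when the job finishes.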

Model Training

Creating the Training Job

Log in to the trainML platform and click the Create a Training Job link on the Home screen or the Create button on the Training Jobs page to open a new job form. Enter a memorable name for the Job Name, like CIFAR-10 ResNet Example. Select the cheapest available GPU Type and leave the GPUs Per Worker as 1.

In the Data section, select Public Dataset from the Dataset Type field, then select CIFAR-10 from the Dataset field. This will automatically load the CIFAR-10 dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML-supplied dataset, you will incur no additional storage cost for using this data. For Output Type, select Local and specify an existing directory on your local computer, like ~/training-example/output, in the Artifact output storage path field.

In the Model section, specify the URL of the example code repository: https://github.com/trainML/examples.git

Leave the number of workers at 1 and enter the following as the Command for Worker 1:

python training/image-classification/resnet_cifar.py --epochs 10 --optimizer adam --batch-size 128

Click Next to review your settings. Click Create to start the job.

Monitoring the Job Worker

To monitor the progress of the job, connect to the job by following the instructions in the getting started guide. Once the worker has initialized, you will begin to see output like the following:

Epoch 1/10
390/390 - 57s - loss: 2.9473 - accuracy: 0.2323 - val_loss: 3.9611 - val_accuracy: 0.1629 - 57s/epoch - 146ms/step
BenchmarkMetric: {'global step': 300, 'time_taken': 9.212851, 'examples_per_second': 1389.363611}

Connect to the job using the connection utility so that the job worker can upload the results to your local computer.

When training completes, the worker outputs the final accuracy and loss metrics. If you return to the trainML dashboard, you will see the job has stopped automatically.
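If you save the worker output to a file, the BenchmarkMetric lines shown in the sample output can be turned into Python dictionaries for later comparison between runs. The sketch below assumes the line format is exactly as it appears in that sample; parse_benchmark_line is a hypothetical helper and not part of trainML or the example repository.

```python
import ast

PREFIX = "BenchmarkMetric: "


def parse_benchmark_line(line):
    """Return the metric dict from a BenchmarkMetric log line, or None for other lines."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    # The payload in the sample output is a Python dict literal, so literal_eval can read it.
    return ast.literal_eval(line[len(PREFIX):])


sample = (
    "BenchmarkMetric: {'global step': 300, 'time_taken': 9.212851, "
    "'examples_per_second': 1389.363611}"
)
metrics = parse_benchmark_line(sample)
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for text read from a log.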

Analyzing the Output

Depending on the size of the model artifact directory, it could take several minutes for the training output file to appear in the artifact storage location.

Analyzing Results

Once the job terminates, check the output path you specified when creating the job. There should be a file with a naming convention of <job_name>.zip, for example CIFAR-10_ResNet_Example.zip. Extract this file into a folder, and make sure TensorBoard is installed in the Python environment you will use. In a terminal window, navigate to that folder and run:

tensorboard --logdir .

This should start TensorBoard on your local computer. You will have to manually navigate to the URL it prints to see the results. From here, you can analyze the training run using all the features of TensorBoard. See the TensorBoard guide for more details.
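The extraction step can also be scripted with the standard library's zipfile module. This is a minimal sketch; extract_artifact is a hypothetical helper name, and the paths you pass in depend on the output directory and job name you chose.

```python
import zipfile
from pathlib import Path


def extract_artifact(zip_path, dest_dir):
    """Extract a job output archive into dest_dir and return the archived file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, calling extract_artifact with the path to CIFAR-10_ResNet_Example.zip and a destination folder such as results leaves the extracted files in that folder, which you can then pass to tensorboard --logdir.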

Cleaning Up

Once you are done with a job, click the Terminate button to remove it from the trainML platform.
