Simple Image Classification with Training Jobs
This tutorial should cost less than 0.1 credits ($0.10) if you use the GTX 1060 instance type and the same training settings as the guide.
Some parts of this tutorial use advanced features that require the connection capability. Ensure that your environment meets the requirements and that you have installed all the prerequisites.
Model Overview
This tutorial demonstrates how to train a ResNet50 model from Keras Applications on the CIFAR-10 dataset. The example training code can be found in the trainML examples repository in the training/image-classification folder.
Training File
The entire training code is contained in the resnet_cifar.py file. The file is a regular Python script and not a Jupyter notebook. Although it is possible to run notebooks as scripts using nbconvert, converting a notebook to a script yourself provides the opportunity to add capabilities that make training, experimentation, and production deployment much easier. The key step in converting a notebook into a regular Python script is to wrap the top-level commands in a main() function that is called from an if __name__ == "__main__" block.
def main():
    # <top-level commands go here>
    ...


if __name__ == "__main__":
    main()
You can find more information about main functions here. Performing this step makes implementing the next improvement possible.
Argparse
The built-in argparse library makes it easy to pass script variables like hyperparameters through the command line. The top section of the resnet_cifar.py file demonstrates how to define a parser, add arguments, and define some simple data validation for those values. Once the parser is defined, the rest of your code can access those values from the object returned by the parse_args() function. The call to parse_args() is the first line of the main() function. In many notebook versions, hyperparameters like batch size, epochs, or optimizer selection are hardcoded in the notebook and have to be manually modified to try a new parameter. With argparse, the code of the script does not need to change in order to modify those hyperparameters. Simply pass in a new value at runtime. For example, to run this code with 2 epochs and the rmsprop optimizer, use a command like this:
python resnet_cifar.py --epochs 2 --optimizer rmsprop
This method can be used to dynamically control any aspect of your training execution, from how many checkpoints to save, what metrics to calculate, to which model architecture to train.
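As a rough illustration of the pattern described above, the following minimal sketch defines a parser with simple validation. The flag names --epochs, --optimizer, and --batch-size come from the commands in this tutorial; the defaults, the list of optimizer choices, and the helper names positive_int and build_parser are assumptions made for this example and may not match what resnet_cifar.py actually does.

```python
import argparse


def positive_int(value):
    """argparse type that accepts only integers greater than zero."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError(f"{value} is not a positive integer")
    return number


def build_parser():
    """Build the command-line parser (defaults and choices are illustrative)."""
    parser = argparse.ArgumentParser(description="Train ResNet50 on CIFAR-10")
    parser.add_argument("--epochs", type=positive_int, default=10)
    parser.add_argument("--batch-size", type=positive_int, default=128)
    parser.add_argument(
        "--optimizer", choices=["adam", "rmsprop", "sgd"], default="adam"
    )
    return parser


# Equivalent to: python script.py --epochs 2 --optimizer rmsprop
args = build_parser().parse_args(["--epochs", "2", "--optimizer", "rmsprop"])

# argparse exposes "--batch-size" as the attribute args.batch_size.
print(f"epochs={args.epochs} batch_size={args.batch_size} optimizer={args.optimizer}")
```

In a real script, calling parse_args() with no argument list reads the values from the command line instead of the hardcoded list used here. An invalid value such as --epochs 0 makes argparse print an error message and exit before any training starts.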
Environment Variables
The last thing to note about the training code is the use of the trainML environment variables. Using these variables is the recommended way of accessing the input data, model code, and output data locations within a trainML job. The load_data function uses the TRAINML_DATA_PATH environment variable to load the CIFAR-10 data from the location where the dataset is attached when the job starts. Additionally, the build_callbacks function saves both the TensorBoard logs and the model checkpoints to the TRAINML_OUTPUT_PATH. This ensures that all those files are uploaded at the end of training. Any data outside the TRAINML_OUTPUT_PATH is assumed to be in-process material and is discarded.
Although not implemented in this script, using these environment variables can make it easy to run your code both on your local computer and as a trainML job. For example, you can set the input data path like the following:
data_path = os.environ.get('TRAINML_DATA_PATH') if os.environ.get('TRAINML_DATA_PATH') else '/path/to/local/dataset'
When this code runs locally, TRAINML_DATA_PATH is not defined, so the code uses the default local path. When it runs inside a trainML job, it automatically uses the correct location for the attached input data.
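The same fallback idea can be applied to the output location as well. The sketch below is one possible way to do it, not the approach used in resnet_cifar.py: the helper name resolve_path, the local fallback directories, and the logs and checkpoints subdirectory names are all hypothetical. Only the TRAINML_DATA_PATH and TRAINML_OUTPUT_PATH variable names come from this tutorial.

```python
import os


def resolve_path(env_var, local_default):
    """Return the directory named by env_var, or local_default if it is unset or empty."""
    return os.environ.get(env_var) or local_default


# The local fallback directories are placeholders for this example.
data_path = resolve_path("TRAINML_DATA_PATH", "/path/to/local/dataset")
output_path = resolve_path("TRAINML_OUTPUT_PATH", "./output")

# Keeping logs and checkpoints under the output path means they are part of
# what gets uploaded when the job finishes (subdirectory names are assumptions).
log_dir = os.path.join(output_path, "logs")
checkpoint_dir = os.path.join(output_path, "checkpoints")
```

Centralizing the lookup in one helper keeps the rest of the training code identical whether it runs locally or inside a job.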
Model Training
Creating the Training Job
Log in to the trainML platform and click the Create a Training Job link on the Home screen or the Create button on the Training Jobs page to open a new job form. Enter a memorable name in the Job Name field, like CIFAR-10 ResNet Example. Select the cheapest available GPU Type and leave the GPUs Per Worker as 1.
In the Data section, select Public Dataset from the Dataset Type field, then select CIFAR-10 from the Dataset field. This will automatically load the CIFAR-10 dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML supplied dataset, you will incur no additional storage cost for using this data. For Output Type, select Local and specify an existing directory on your local computer, like ~/training-example/output, in the Artifact output storage path field.
In the Model section, specify the URL of the example code repository: https://github.com/trainML/examples.git
Leave the number of workers at 1 and enter the following as the Command for Worker 1:
python training/image-classification/resnet_cifar.py --epochs 10 --optimizer adam --batch-size 128
Click Next to review your settings. Click Create to start the job.
Monitoring the Job Worker
To monitor the progress of the job, connect to the job by following the instructions in the getting started guide. Once the worker has initialized, you will begin to see output like the following:
Epoch 1/10
390/390 - 57s - loss: 2.9473 - accuracy: 0.2323 - val_loss: 3.9611 - val_accuracy: 0.1629 - 57s/epoch - 146ms/step
BenchmarkMetric: {'global step': 300, 'time_taken': 9.212851, 'examples_per_second': 1389.363611}
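If you want to track these numbers outside the log viewer, lines like the ones above can be parsed with the standard library. This is only a sketch that assumes the log lines look exactly like the sample output shown here; the function names are hypothetical, and the format may differ between Keras versions or training scripts.

```python
import ast
import re

# Sample lines copied from the output shown above.
EPOCH_LINE = (
    "390/390 - 57s - loss: 2.9473 - accuracy: 0.2323 - "
    "val_loss: 3.9611 - val_accuracy: 0.1629 - 57s/epoch - 146ms/step"
)
BENCHMARK_LINE = (
    "BenchmarkMetric: {'global step': 300, 'time_taken': 9.212851, "
    "'examples_per_second': 1389.363611}"
)


def parse_epoch_metrics(line):
    """Collect the 'name: value' pairs (loss, accuracy, ...) from an epoch summary line."""
    return {name: float(value) for name, value in re.findall(r"(\w+): ([0-9.]+)", line)}


def parse_benchmark(line):
    """Return the dict printed after 'BenchmarkMetric: ', or None for other lines."""
    prefix = "BenchmarkMetric: "
    if not line.startswith(prefix):
        return None
    # The payload is a Python dict literal, so literal_eval can read it safely.
    return ast.literal_eval(line[len(prefix):])


epoch_metrics = parse_epoch_metrics(EPOCH_LINE)
benchmark = parse_benchmark(BENCHMARK_LINE)
```

Here epoch_metrics holds the four loss and accuracy values from the epoch line, and benchmark holds the three fields from the BenchmarkMetric line.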
Connect to the job using the connection utility so that the job worker can upload the results to your local computer.
When training completes, the worker outputs the final accuracy and loss metrics. If you return to the trainML dashboard, you will see the job has stopped automatically.
Analyzing the Output
Depending on the size of the model artifact directory, it could take several minutes for the training output file to appear in the artifact storage location.
Analyzing Results
Once the job terminates, check the output path you specified when creating the job. There should be a file with a naming convention of <job_name>.zip, for example CIFAR-10_ResNet_Example.zip. Extract this file into a folder, and make sure the Python environment you use there has TensorBoard installed. In a terminal window, navigate to that directory and run:
tensorboard --logdir .
This should start TensorBoard on your local computer. You will have to manually navigate to the URL it prints to see the results. From here, you can analyze the training run using all the features of TensorBoard. See the TensorBoard guide for more details.
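The extraction step described above can also be scripted with the standard zipfile module, which may be convenient when comparing several runs. This is a minimal sketch; the function name extract_artifact and the destination folder are hypothetical, and the layout of files inside the archive depends on what the training script wrote to the output path.

```python
import zipfile
from pathlib import Path


def extract_artifact(zip_path, dest_dir):
    """Unpack a job's output archive into dest_dir and return the archived file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For example, calling extract_artifact with the path to CIFAR-10_ResNet_Example.zip and a folder name such as results would unpack the archive into that folder, after which you can run the tensorboard command from inside it.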
Cleaning Up
Once you are done with a job, click the Terminate button to remove it from the trainML platform.