Running an Inference Job

Inference jobs are designed to run trained models on new data as part of a model inference pipeline and deliver the predictions back to an external location.

Starting a Job

Click the Run an Inference Job link on the Home screen or the Create button from the Inference Jobs Dashboard to open a new job form. Enter a job name that will uniquely identify this job for you.

Select the type of GPU you want to use by clicking an available GPU card, and select how many GPUs to attach to each worker in the GPU Count field. A maximum of 4 GPUs per inference job is allowed. If any options in these fields are disabled, there are not enough GPUs of that type available to satisfy your request.

Specify the amount of disk space to allocate for this job's working directory in the Disk Size field. Be sure to allocate enough space to complete the job, as this allocation cannot be changed once the job is created.

Data

Inference jobs will automatically download the source data and upload the predictions as part of their execution. To specify the location of the source data, select the required storage provider in the Input Type field and enter the path to the data in the Input Storage Path field. When the job starts, the data from this location will be placed into the /opt/trainml/input directory. To specify where to send the output predictions, select the required storage provider in the Output Type field and enter the path to send the results to in the Output Storage Path field.

tip

For the automatic output upload to work, you must save the results in the /opt/trainml/output folder. The recommended way to configure this in your code is to use the TRAINML_OUTPUT_PATH environment variable.
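
For example, a minimal sketch of how inference code might resolve the output location is shown below; the file name predictions.csv is only illustrative.

import os

# Resolve the output directory from the environment, falling back to the
# documented default location if the variable is not set.
output_dir = os.environ.get("TRAINML_OUTPUT_PATH", "/opt/trainml/output")

# Write predictions inside the output directory so they are uploaded
# automatically when the worker finishes.
with open(os.path.join(output_dir, "predictions.csv"), "w") as f:
    f.write("id,prediction\n")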

Model

If you created a trainML model in the previous step, select trainML as the Model Type and select it from the list. Otherwise, select Git and specify the git clone URL of the repository.

Workers

Specify the command to use to perform the inference operation with the selected model on the data that will be loaded. The command will be run at the root of the model that was loaded based on the Model section. For example, if your inference code is called predict.py and takes parameters for the data location (--data-path) and where to save the predictions (--output-path), the command would be the following:

python predict.py --data-path=$TRAINML_DATA_PATH --output-path=$TRAINML_OUTPUT_PATH

This command takes advantage of the trainML environment variables to ensure the code is utilizing the correct directory structure.
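As a rough illustration, a predict.py script like the one above might be structured as follows. This is a sketch, not part of the trainML platform: the load_model call and the predictions.csv file name are placeholders for your own model-loading and inference logic.

import argparse
import os

def main():
    parser = argparse.ArgumentParser(description="Run inference on the downloaded input data")
    parser.add_argument("--data-path", required=True, help="Directory containing the input data")
    parser.add_argument("--output-path", required=True, help="Directory to write predictions to")
    args = parser.parse_args()

    # Placeholder: load your trained model from the current working directory,
    # which is the root of the model selected in the Model section.
    # model = load_model("model.pt")

    # Iterate over the downloaded input files and write a single predictions file
    # to the output directory so it is uploaded automatically.
    with open(os.path.join(args.output_path, "predictions.csv"), "w") as out:
        out.write("file,prediction\n")
        for name in sorted(os.listdir(args.data_path)):
            # Placeholder prediction; replace with real inference on each file.
            out.write(f"{name},0\n")

if __name__ == "__main__":
    main()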

Review

Once you click Next on the job form, you are given the opportunity to review your inference job configuration for errors. Review these settings carefully. They cannot be changed once a job is started.

info

If the number of GPUs requested exceeds the currently available GPUs of that type, the review form will display a message stating that the job will queue until GPUs of the type you selected become available. You are not billed for waiting jobs.

Monitoring the Job

Once a job successfully starts, the dashboard should indicate that the job is in the running state. Click the View button to access the job logs. Log messages are sorted in descending order (most recent on top) and new log messages appear automatically as they are generated. If there are many log messages, you can scroll down on the page to see older logs.

To view detailed information about the job, click the job name on the Inference Jobs Dashboard.

Stopping and Terminating a Job

When each worker finishes executing, it automatically stops, and billing for that worker also stops. When all workers complete, the job is considered finished. You can also interrupt a running job or an individual worker by clicking Stop on the job or on that worker.

info

Finished jobs may be automatically purged after 24 hours.

When a job is finished, you can review its details and download an extract of the worker logs. When you no longer need to see the details of the job, click the Terminate button.