Build Full Machine Learning Pipelines with Inference Jobs

March 21, 2021 · 2 min read

The trainML platform has been extended to support batch inference jobs, enabling customers to use trainML for all stages of the machine learning pipeline that require GPU acceleration.

How It Works

A new section in the sidebar navigation provides access to the Inference Jobs Dashboard. To create a new inference job, navigate to the Inference Jobs Dashboard and click Create. The new job form is similar to the forms for existing job types, with two differences:

Inference jobs cannot use datasets. Instead, you can download external data as part of the job creation process using the fields in the Input Data section. Data downloaded in this manner is purged from the system when the job finishes.
Inference jobs only support a single worker.
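
These two constraints can be checked before a job is submitted. The following is a minimal sketch and not part of the trainML SDK: `build_inference_job_spec` is a hypothetical helper that assembles keyword arguments in the same shape as the `trainml.jobs.create` calls in the example pipeline, and rejects a spec that has more than one worker or that references datasets. The worker command and model ID in the usage lines are placeholders.

```python
def build_inference_job_spec(
    name, gpu_type, workers, model_uuid, input_uri, output_uri, datasets=None
):
    """Hypothetical helper: build keyword arguments for an inference job.

    The keys mirror the ones used in this post's example pipeline. The
    validation is an illustration, not behavior provided by the SDK.
    """
    if len(workers) != 1:
        raise ValueError("inference jobs only support a single worker")
    if datasets:
        raise ValueError("inference jobs cannot use datasets; use input_uri")
    return dict(
        name=name,
        type="inference",
        gpu_type=gpu_type,
        gpu_count=1,
        disk_size=10,
        workers=list(workers),
        data=dict(
            input_type="aws",
            input_uri=input_uri,
            output_type="aws",
            output_uri=output_uri,
        ),
        model=dict(model_uuid=model_uuid),
    )


# Placeholder values for illustration only.
spec = build_inference_job_spec(
    name="Example Inference Job",
    gpu_type="GTX 1060",
    workers=["python predict.py"],
    model_uuid="example-model-id",
    input_uri="s3://example-bucket/data/new_data",
    output_uri="s3://example-bucket/output/model_predictions",
)
print(spec["type"], len(spec["workers"]))
```

Under these assumptions, the resulting dictionary could be unpacked into the job creation call as `trainml.jobs.create(**spec)`.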

Example Pipeline

The following is an example of using the trainML SDK to walk through the full machine learning pipeline using datasets, models, training jobs, and inference jobs.

Initialize the SDK.

from trainml import TrainML
import asyncio

trainml = TrainML()

Create the training dataset.

dataset = asyncio.run(
    trainml.datasets.create(
        name="Example Dataset",
        source_type="aws",
        source_uri="s3://example-bucket/data/cifar10",
    )
)

asyncio.run(dataset.attach())

Create the training job to run on the newly created dataset and export its results to a trainML model.

training_job = asyncio.run(
    trainml.jobs.create(
        name="Example Training Job",
        type="training",
        gpu_type="GTX 1060",
        gpu_count=1,
        disk_size=10,
        workers=[
            "PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024",
        ],
        data=dict(
            datasets=[dict(id=dataset.id, type="existing")],
            output_type="trainml",
        ),
        model=dict(git_uri="git@github.com:my-account/test-private.git"),
    )
)
asyncio.run(training_job.attach())

Get the model's ID from the training job worker and wait for the model creation to finish.

training_job = asyncio.run(training_job.refresh())

model = asyncio.run(
    trainml.models.get(training_job.workers[0].get("output_model_uuid"))
)

asyncio.run(model.wait_for("ready"))
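
The lookup in this step returns `None` if the worker has not yet reported an output model, and that `None` would then be passed on to `trainml.models.get`. As a small sketch, assuming workers are dictionaries (as the `.get("output_model_uuid")` call suggests), a hypothetical helper named `output_model_id` can fail early with a clearer message. The worker data below is made up for illustration.

```python
def output_model_id(workers, index=0):
    """Hypothetical helper: return the output model ID reported by one worker."""
    model_id = workers[index].get("output_model_uuid")
    if not model_id:
        raise LookupError(
            f"worker {index} has no output_model_uuid yet; refresh the job and retry"
        )
    return model_id


# Stand-in data shaped like training_job.workers; the ID is a placeholder.
workers = [{"output_model_uuid": "example-model-id"}]
print(output_model_id(workers))
```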

Run an inference job using the created model on new data and save the results to an external source.

inference_job = asyncio.run(
    trainml.jobs.create(
        name="Example Inference Job",
        type="inference",
        gpu_type="GTX 1060",
        gpu_count=1,
        disk_size=10,
        workers=[
            "PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024",
        ],
        data=dict(
            input_type="aws",
            input_uri="s3://example-bucket/data/new_data",
            output_type="aws",
            output_uri="s3://example-bucket/output/model_predictions",
        ),
        model=dict(model_uuid=model.id),
    )
)
asyncio.run(inference_job.attach())

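(Optional) Clean up resources.

# asyncio.gather must be awaited inside a running event loop,
# so wrap it in a coroutine and run that with asyncio.run.
async def cleanup():
    await asyncio.gather(
        training_job.remove(),
        inference_job.remove(),
        model.remove(),
        dataset.remove(),
    )

asyncio.run(cleanup())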
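
If one removal fails, it can still be useful to attempt the others and report what went wrong. The sketch below uses stand-in objects rather than real trainML resources (the `FakeResource` class and `remove_all` function are hypothetical) to show `asyncio.gather` with `return_exceptions=True`, awaited inside a coroutine that is started with `asyncio.run`.

```python
import asyncio


class FakeResource:
    """Hypothetical stand-in for a job, model, or dataset with an async remove()."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    async def remove(self):
        if self.fail:
            raise RuntimeError(f"could not remove {self.name}")
        return self.name


async def remove_all(resources):
    # return_exceptions=True returns failures as values in the result list
    # instead of propagating the first exception to the caller.
    results = await asyncio.gather(
        *(resource.remove() for resource in resources),
        return_exceptions=True,
    )
    return [result for result in results if isinstance(result, Exception)]


resources = [
    FakeResource("training_job"),
    FakeResource("model", fail=True),
    FakeResource("dataset"),
]
errors = asyncio.run(remove_all(resources))
for error in errors:
    print(error)
```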
