
Customer Provided Job Environments


Customers with prebuilt Docker images can now use them as the job environment for any job type.

How It Works

trainML jobs can now be created by specifying a custom image instead of one of our prebuilt environments. This allows customers to use any combination of library versions and code, and to know that their training or inference is running with exactly those versions. trainML currently supports pulling images from three Docker registries: Docker Hub (both public and private repositories), AWS Elastic Container Registry, and Google Container Registry.
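For reference, the image name follows each registry's standard format. A few illustrative examples are below; the repository names, AWS account ID, region, and GCP project are placeholders, not values tied to your account:

# Example image references by registry (placeholder repository names,
# AWS account ID, region, and GCP project -- substitute your own).

# Docker Hub (public or private repository)
dockerhub_image = "tensorflow/tensorflow:2.4.3-gpu"

# AWS Elastic Container Registry
ecr_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:latest"

# Google Container Registry
gcr_image = "gcr.io/my-project/my-model:latest"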

In order to use an image in one of these registries, you must first configure third-party access keys for the provider you intend to use. This is not strictly required for public Docker Hub images, but it is highly recommended; if you pull public images anonymously, your job may fail due to Docker Hub rate limiting.

Customers are responsible for building their images with their desired version of CUDA and any associated acceleration libraries. Additionally, notebooks and endpoints require the following libraries to function properly. You can add them to the image during the build process, or include them as additional pip packages during job creation (see the sketch below the caution):

  • Notebook: jupyterlab
  • Endpoint: fastapi, pydantic, uvicorn[standard], python-multipart
caution

Jobs using images that do not meet these requirements may fail or immediately stop once they reach the running state.
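For example, a notebook image that does not already include JupyterLab can have it added at creation time instead of being rebuilt. The following is a minimal sketch of the environment dict for such a job; the packages field name is an assumption based on the SDK's environment options, so verify it against the trainML SDK documentation:

# Minimal sketch (assumed field names): a notebook job's environment dict that
# installs jupyterlab at job creation rather than baking it into the image.
# The "packages" field name is an assumption -- check the trainML SDK docs.
environment = dict(
    type="CUSTOM",
    custom_image="tensorflow/tensorflow:2.4.3-gpu",
    packages=dict(pip=["jupyterlab"]),
)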

When a job using a custom image starts, it will pull the specified image version at that time. If the job is restartable (e.g. a notebook or endpoint), it will continue to use that image version as long as the job exists, even if newer versions of the image are pushed to the same tag.

Unlike trainML's built-in environments, the image size counts toward the disk size quota you specify when creating the job. For example, if you request 20 GB of disk space and use a custom image that is 15 GB, you will have 5 GB of space available as the job's working directory.

caution

Ensure you are reserving enough disk space to accommodate the image size. CUDA layers can be 3+ GB alone. If the image size is greater than the requested disk space, the job will fail.
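As a quick sizing sketch, work backwards from the image size and the working space you need; the result is what you should request as the job's disk size when creating the job (the exact parameter or field name depends on the creation method, so confirm it against the SDK or CLI documentation):

# The image counts against the job's disk quota, so request at least the
# image size plus the working space needed for model code, data, and outputs.
image_size_gb = 15                                     # approximate size of the custom image
working_space_gb = 5                                   # free working space the job needs
requested_disk_gb = image_size_gb + working_space_gb   # request at least 20 GB of disk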

Using the Web Platform

Navigate to the job dashboard for the job type you wish to create and click the Create button. Specify the resources, model, and data as normal, then expand the Environment section. In the Base Environment section, there is now a Customer Provided checkbox. Check this box and specify the Docker image name in the Image field. If you do not have keys configured for the registry type in which the image is stored, you will receive an error message.

Using the SDK

A Docker image can be specified when creating a job using the trainML SDK by setting the type field in the environment dict to CUSTOM and specifying the Docker image name as the custom_image value. An example specification is the following:

job = await trainml.jobs.create(
    name="Custom Environment Training Job",
    type="training",
    ...
    environment=dict(
        type="CUSTOM",
        custom_image="tensorflow/tensorflow:2.4.3-gpu",
    ),
    worker_commands=[
        "python train.py",
    ],
)

Using the CLI

A Docker image can be specified when creating a job using the trainML CLI by using the --custom-image flag. This overrides the --environment flag if both are specified. An example is the following:

trainml job create training \
  --model-dir ~/model-dir --data-dir ~/data-dir --output-dir ~/output-dir \
  --custom-image tensorflow/tensorflow:2.4.3-gpu \
  "Custom Environment Training Job" \
  "python train.py"