Customers with prebuilt Docker images can now use them as the job environment for any job type.
How It Works
trainML jobs can now be created by specifying a custom image instead of one of our prebuilt environments. This allows customers to use any combination of library versions and code and know that their training or inference is running with exactly those versions. trainML currently supports pulling images stored in three Docker registries: Docker Hub (both public and private repositories), AWS Elastic Container Registry, and Google Container Registry.
In order to use an image in one of these registries, you must first configure third-party access keys for the provider you intend to use. This is not strictly required for public Docker Hub images, but it is highly recommended: if you pull public images anonymously, your job may fail due to Docker Hub rate limiting.
Customers are responsible for building their images with their desired version of CUDA and any associated acceleration libraries. Additionally, notebooks and endpoints require certain libraries to be installed to function properly. You can add these to the image during the image build process, or include them as additional pip packages during job creation.
- Notebook: jupyterlab
- Endpoint: fastapi, pydantic, uvicorn[standard], python-multipart
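As a sketch of the build-time approach, a Dockerfile that installs these libraries on top of a CUDA-enabled base could look like the following (the base image and tag are only examples; use whichever CUDA-enabled base your workload needs):

```dockerfile
# Example base image only -- any image with your desired CUDA version works
FROM tensorflow/tensorflow:2.4.3-gpu

# Libraries required for trainML notebooks and endpoints
RUN pip install --no-cache-dir \
    jupyterlab \
    fastapi \
    pydantic \
    "uvicorn[standard]" \
    python-multipart
```

Alternatively, supply the same package names as additional pip packages when creating the job.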
Jobs using images that do not meet these requirements may fail or immediately stop once they reach the running state.
When a job using a custom image starts, it will pull the specified image version at that time. If the job is restartable (e.g. a notebook or endpoint), it will continue to use that image version as long as the job exists, even if newer versions of the image are pushed to the same tag.
Unlike trainML built-in environments, the image size counts toward the disk size quota you specify when creating the job. For example, if you request 20 GB of disk space and use a custom image that is 15 GB, you will have 5 GB of space available as the job's working directory.
Ensure you reserve enough disk space to accommodate the image size; CUDA layers alone can add 3+ GB. If the image size is greater than the requested disk space, the job will fail.
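The disk accounting above can be expressed as a quick pre-flight check. This helper is purely illustrative and not part of the trainML SDK:

```python
def available_working_space_gb(requested_disk_gb, image_size_gb):
    """Return the space left for the job's working directory.

    Per the disk quota rules, the custom image counts toward the
    requested disk size; if the image alone exceeds the request,
    the job will fail to start.
    """
    if image_size_gb > requested_disk_gb:
        raise ValueError(
            "image larger than requested disk space: job will fail"
        )
    return requested_disk_gb - image_size_gb

# A 20 GB request with a 15 GB image leaves 5 GB of working space
print(available_working_space_gb(20, 15))  # → 5
```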
Using the Web Platform
Navigate to the job dashboard of the job type you wish to create and click the Create button. Specify the resources, model, and data as normal, then expand the Environment section. In the Base Environment section, there is now a Customer Provided checkbox. Check this box and specify the Docker image name in the Image field. If you do not have keys configured for the registry type where the image is stored, you will receive an error message.
Using the SDK
A Docker image can be specified when creating a job using the trainML SDK by setting the type field in the environment dict to CUSTOM and specifying the Docker image name as the custom_image value. An example specification is the following:
```python
job = await trainml.jobs.create(
    name="Custom Environment Training Job",
    type="training",
    # ...
    environment=dict(
        type="CUSTOM",
        custom_image="tensorflow/tensorflow:2.4.3-gpu",
    ),
    worker_commands=[
        "python train.py",
    ],
)
```
Using the CLI
A Docker image can be specified when creating a job using the trainML CLI by using the --custom-image flag. This will override the --environment flag if specified. An example is the following:
```shell
trainml job create training \
    --model-dir ~/model-dir --data-dir ~/data-dir --output-dir ~/output-dir \
    --custom-image tensorflow/tensorflow:2.4.3-gpu \
    "Custom Environment Training Job" \
    "python train.py"
```