

Customer Provided Job Environments

August 12, 2021

trainML

Customers with prebuilt Docker images can now use them as the job environment for any job type.

How It Works

trainML jobs can now be created by specifying a custom image instead of one of our prebuilt environments. This allows customers to use any combination of library versions and code and know that their training or inference is running with exactly those versions. trainML currently supports pulling images from three Docker registries: Docker Hub (both public and private repositories), AWS Elastic Container Registry, and Google Container Registry.
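For reference, the image name you provide follows each registry's usual naming convention. The account ID, project, and repository names below are illustrative placeholders, not values from this post:

tensorflow/tensorflow:2.4.3-gpu                                # Docker Hub
123456789012.dkr.ecr.us-east-1.amazonaws.com/my-repo:latest    # AWS Elastic Container Registry
gcr.io/my-project/my-image:latest                              # Google Container Registry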

In order to use an image in one of these registries, you must first configure third-party access keys for the provider you intend to use. This is not strictly required for public Docker Hub images, but it is highly recommended; if you use public images anonymously, your job may fail due to Docker rate limiting.

Customers are responsible for building their images with their desired version of CUDA and any associated acceleration libraries. Additionally, notebooks and endpoints require certain libraries to be installed to function properly. You can add these to the image during the image build process, or include them as additional pip packages during job creation.

  • Notebook: jupyterlab
  • Endpoint: fastapi, pydantic, uvicorn[standard], python-multipart

Jobs using images that do not meet these requirements may fail or immediately stop once they reach the running state.
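As a sketch of the image-build route, the snippet below bakes the notebook and endpoint libraries into a custom image and pushes it to a registry trainML can pull from. The base image, repository name, and tag are placeholders, not requirements from this post:

# write a minimal Dockerfile on top of a CUDA-enabled base image
cat > Dockerfile <<'EOF'
FROM tensorflow/tensorflow:2.4.3-gpu
RUN pip install jupyterlab fastapi pydantic "uvicorn[standard]" python-multipart
EOF

# build and push to your own repository (replace the name with yours)
docker build -t my-dockerhub-user/custom-env:latest .
docker push my-dockerhub-user/custom-env:latest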

When a job using a custom image starts, it will pull the specified image version at that time. If the job is restartable (e.g., a notebook or endpoint), it will continue to use that image version as long as the job exists, even if newer versions of the image are pushed to the same tag.

Unlike the trainML built-in environments, the image size counts towards the disk size quota you specify when creating the job. For example, if you request 20 GB of disk space and use a custom image that is 15 GB, you will have 5 GB of space available as the job's working directory.

Ensure you are reserving enough disk space to accommodate the image size. CUDA layers can be 3+ GB alone. If the image size is greater than the requested disk space, the job will fail.

Using the Web Platform

Navigate to the job dashboard of the job type you wish to create and click the Create button. Specify the resources, model, and data as normal, then expand the Environment section. In the Base Environment section, there is now a Customer Provided checkbox. Check this box and specify the Docker image name in the Image field. If you do not have keys configured for the registry type the image is stored in, you will receive an error message.

Using the SDK

A Docker image can be specified when creating a job with the trainML SDK by setting the type field in the environment dict to CUSTOM and specifying the Docker image name as the custom_image value. An example specification is the following:

job = await trainml.jobs.create(
    name="Custom Environment Training Job",
    type="training",
    ...  # resource, model, and data options omitted for brevity
    environment=dict(
        type="CUSTOM",  # selects a customer-provided image instead of a prebuilt environment
        custom_image="tensorflow/tensorflow:2.4.3-gpu",  # image name as it appears in the registry
    ),
    worker_commands=[
        "python train.py",
    ],
)
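For context, the call above is an async coroutine, so a script would typically wrap it with an event loop and a client instance. A minimal sketch follows; the import path and client construction are taken from the trainML Python SDK README and are assumptions here, as are the omitted resource, model, and data options (elided exactly as in the example above):

import asyncio
from trainml.trainml import TrainML  # assumed import path per the SDK README

async def main():
    trainml = TrainML()  # assumes API keys are already configured for the SDK
    job = await trainml.jobs.create(
        name="Custom Environment Training Job",
        type="training",
        # ...resource, model, and data options as in the example above...
        environment=dict(
            type="CUSTOM",
            custom_image="tensorflow/tensorflow:2.4.3-gpu",
        ),
        worker_commands=["python train.py"],
    )
    print(job)

asyncio.run(main())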

Using the CLI

A Docker image can be specified when creating a job with the trainML CLI by using the --custom-image flag. This overrides the --environment flag if both are specified. An example is the following:

trainml job create training \
 --model-dir ~/model-dir --data-dir ~/data-dir --output-dir ~/output-dir \
 --custom-image tensorflow/tensorflow:2.4.3-gpu \
 "Custom Environment Training Job" \
 "python train.py"