Job Configuration
Notebook and training job configuration is organized into five sections: Resources, Data, Model, Workers, and Environment.
Resources
This section is mandatory for all job types.
Job Name
: A friendly name for the job. The job name is used to identify both the job's connection utility and the job's output data (if configured). It is available as an environment variable in the job environment.
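For example, a training script can read the job name from its environment. The variable name used below (TRAINML_JOB_NAME) is assumed for illustration only; check the job environment for the actual name.

```python
import os

# TRAINML_JOB_NAME is a hypothetical variable name used for illustration;
# inspect the job environment (e.g. `env`) for the actual variable.
job_name = os.environ.get("TRAINML_JOB_NAME", "unnamed-job")
print(f"Running job: {job_name}")
```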
GPU Type
: The types of GPU that will satisfy the job request. GPU types differ by both their memory size and their operations per second (OPS). A GPU's OPS rate varies based on the data type being used. Single-precision (FP32) is the most common data type used in deep learning.
GTX 1060
: 6 GB of GPU memory, 4.5 TFLOPS of single-precision (FP32), 0.9 TFLOPS of half-precision (FP16) - GTX 1060 instances should not be used for half-precision training
RTX 2060 Super
: 8 GB of GPU memory, 7.1 TFLOPS of single-precision (FP32), 15 TFLOPS of half-precision (FP16)
RTX 2070 Super
: 8 GB of GPU memory, 9 TFLOPS of single-precision (FP32), 18.5 TFLOPS of half-precision (FP16)
RTX 2080 Ti
: 11 GB of GPU memory, 13.5 TFLOPS of single-precision (FP32), 25 TFLOPS of half-precision (FP16)
RTX 3090
: 24 GB of GPU memory, 35.5 TFLOPS of single-precision (FP32), 39 TFLOPS of half-precision (FP16)
RTX A4000
: 16 GB of GPU memory, 9.7 TFLOPS of single-precision (FP32), 19 TFLOPS of half-precision (FP16)
RTX A6000
: 48 GB of GPU memory, 36 TFLOPS of single-precision (FP32), 40 TFLOPS of half-precision (FP16)
T4
: 16 GB of GPU memory, 7.5 TFLOPS of single-precision (FP32), 15.5 TFLOPS of half-precision (FP16)
P100
: 16 GB of GPU memory, 8 TFLOPS of single-precision (FP32), 16 TFLOPS of half-precision (FP16)
V100
: 16 GB of GPU memory, 15 TFLOPS of single-precision (FP32), 25.5 TFLOPS of half-precision (FP16)
A100 (40GB)
: 40 GB of GPU memory, 39 TFLOPS of single-precision (FP32), 62.5 TFLOPS of half-precision (FP16)
A100 (80GB)
: 80 GB of GPU memory, 39 TFLOPS of single-precision (FP32), 62.5 TFLOPS of half-precision (FP16)
None (CPU-Only)
: Do not attach a GPU.
All GPU instance types include at least 2 CPU cores (4 vCPU) per GPU for PCIe 3.0 GPUs, or 4 cores (8 vCPU) per GPU for PCIe 4.0 GPUs, along with twice the GPU memory in CPU memory. CPU-to-GPU bandwidth is a minimum of 8 PCIe lanes per GPU.
GPU Count
: This is the number of GPUs per instance. If you use multiple job workers, this is the number of GPUs each worker will use.
CPU Count
(CPU Only Jobs): This is the number of vCPU per instance. Must be a multiple of 4.
Max Price
: The maximum credits per hour per GPU that will satisfy the job request. If the only GPU types available exceed the max price setting, the job will remain in the Waiting for GPUs
state until the request can be fulfilled.
Disk Size
: The amount of working directory storage in GB to allocate for this job. The minimum is 10 GB and the maximum is 1 TB. If you use multiple job workers, each worker will have this much space. Any data added in the Data
section has a separate allocation and will not impact this.
GPU Pricing
The credits per hour each GPU type costs can vary based on the supply and demand of that GPU type. Once you create a job, your credits per hour rate is locked in as long as the job is running. To see the most up-to-date prices, log in to the trainML platform and click Start a Notebook
or Create a Training Job
on the Home page.
Data
This section allows you to configure what data is populated for a job and (if applicable) allows you to specify where to upload the job's output after it completes.
Datasets
Notebook and Training Jobs allow you to attach persistent datasets as input data. To add a dataset, click the Add Dataset
button.
Dataset Type
: If you wish to attach a dataset to a job, select from one of the dataset options in this list.
Public Dataset
: Choose from a list of public domain datasets pre-loaded by trainML.
My Dataset
: Choose from a list of the private datasets you have previously created.
Dataset
: Select the desired dataset to attach from this list.
If you add a single dataset, the dataset will be mounted to /opt/trainml/input
inside the job workers. To add multiple datasets to a job, you can continue to click the Add Dataset
button until you have selected all the datasets you need for the job. Each dataset will be mounted into its own directory inside the /opt/trainml/input
directory. The directory name will be the name of the dataset with spaces converted to underscores. For example, a dataset named PASCAL VOC
will be mounted to /opt/trainml/input/PASCAL_VOC
if it is one of multiple datasets selected.
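A minimal sketch of how a training script might reference the mounted dataset paths described above; the PASCAL_VOC directory name follows the example dataset used here, and the actual file layout depends on your dataset.

```python
import os

# Single dataset: mounted directly at /opt/trainml/input
# Multiple datasets: each mounted at /opt/trainml/input/<Dataset_Name>
input_root = "/opt/trainml/input"
pascal_voc_dir = os.path.join(input_root, "PASCAL_VOC")

# List the files available to the worker before training starts
for root, _dirs, files in os.walk(pascal_voc_dir):
    for name in files:
        print(os.path.join(root, name))
```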
To remove a selected dataset from the job, click the x
button to the right of the dataset name.
Input Data
Inference jobs do not use persistent datasets and instead allow you to perform a one-time download of the data to run inference on from an external source. Data downloaded in this manner is purged from the system when the job finishes.
Unlike datasets, inference job input data consumes the job worker's disk space allocation. If the source data is large, ensure you have configured sufficient disk space in the Resources section to account for this.
Input Type
: To automatically download the data to perform the inference operation on, select from the available options in the Input Type
dropdown.
AWS
: Select this option if the data resides on Amazon S3.
Azure
: Select this option if the data resides on Azure Blob Storage.
GCP
: Select this option if the data resides on Google Cloud Storage.
Local
: Select this option if the data resides on your local computer. You will be required to connect to the job for this option to work. Jobs will wait indefinitely for you to connect, and you will continue to be billed until the job stops.
Regional Datastore
: Select this option to mount the data directly from a Regional Datastore in an existing CloudBender region.
Wasabi
: Select this option if the data resides on Wasabi Storage.
Web
: Select this option if the data resides on a publicly accessible HTTP or FTP server.
Input Storage Path
: The path of the data within the storage type. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be downloaded and automatically extracted prior to any worker starting. If you specify a directory path (ending in /
), it will run a sync starting from the path provided, downloading all files and subdirectories from the provided path. Valid paths for each Input Type
are the following:
AWS
: Must begin with s3://.
Azure
: Must begin with https://.
GCP
: Must begin with gs://.
Local
: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
Wasabi
: Must begin with s3://.
Web
: Must begin with http://, https://, ftp://, or ftps://.
Source-Specific Fields
Endpoint
(Wasabi Only): The service URL of the Wasabi bucket you are using.
Path
(Regional Datastore Only): The subdirectory inside the regional datastore to load the data from. Use /
to load the entire datastore.
Output Data
Both training and inference jobs support the ability to upload their results (data saved to the path specified by the TRAINML_OUTPUT_PATH environment variable) to an external source when the job finishes.
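For example, a training script can write anything it wants included in the uploaded results to that path. A minimal sketch (the metrics file and its contents are illustrative only):

```python
import os
import json

# Anything written under TRAINML_OUTPUT_PATH is included in the
# results uploaded when the worker finishes.
output_dir = os.environ["TRAINML_OUTPUT_PATH"]

metrics = {"val_accuracy": 0.93}  # example values only
with open(os.path.join(output_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```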
Output Type
: To automatically upload model results after each worker completes, select from the available options in the Output Type
dropdown. The worker remains active and counts towards a job's billed GPU time while uploading its results.
AWS
: Select this option to upload results to Amazon S3.
Azure
: Select this option to upload results to Azure Blob Storage.
GCP
: Select this option to upload results to Google Cloud Storage.
Local
: Select this option if you want the workers to upload their results directly to your local computer. You must be connected to the job when the worker attempts to upload its results to receive them. Workers will wait indefinitely for you to connect, and you will continue to be billed until the upload completes.
Regional Datastore
: Select this option to mount the output folder directly to a Regional Datastore location in an existing CloudBender region.
trainML
: Select this option to create a trainML model, checkpoint, or dataset with the results.
Wasabi
: Select this option to upload results to Wasabi Storage.
Output Storage Path
: When a worker finishes, it will compress the contents of the /opt/trainml/output
directory and push it to the specified storage path with a naming convention of <job_name>.zip
(or <job_name>_<worker_number>.zip
for multi-worker jobs). Valid paths for each Output Type
are the following:
AWS
: Must begin with s3://.
Azure
: Must begin with https://.
GCP
: Must begin with gs://.
Local
: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
Regional Datastore
: Must begin with / (absolute path).
trainML
: Must be model, dataset, or checkpoint.
Wasabi
: Must begin with s3://.
Source-Specific Fields
Endpoint
(Wasabi Only): The service URL of the Wasabi bucket you are using.
Path
(Regional Datastore Only): The subdirectory inside the regional datastore to save the data to. Use /
to save the data to the datastore root.
Model
trainML jobs can prepopulate the model code from external sources on job creation. To utilize this capability, configure the following two fields.
Model Type
: The source from which to populate the model. The following options are supported:
AWS
: Select this option if the model code resides on Amazon S3.
Azure
: Select this option if the model code resides on Azure Blob Storage.
GCP
: Select this option if the model code resides on Google Cloud Storage.
Git
: Select this option if the model code resides in a git repository.
Local
: Select this option if the model code resides on your local computer. You will be required to connect to the job for this option to work. Jobs will wait indefinitely for you to connect, and you will continue to be billed until the job stops.
trainML
: Select this option to use one of your previously saved trainML models.
Wasabi
: Select this option if the model code resides on Wasabi Storage.
Web
: Select this option if the model code resides on a publicly accessible HTTP or FTP server.
Model Code Location
: The path of the model code within the source type. The resulting code will be loaded into the job at /opt/trainml/models
. Valid paths for each Model Type
are the following:
AWS
: Must begin with s3://.
Azure
: Must begin with https://.
GCP
: Must begin with gs://.
Git
: The HTTP(S) or SSH git clone url. If you are using GitHub, this is the url shown in the Clone or Download button for the repository. To access private repositories, you must configure a Git SSH key in your account settings and use an ssh git url here.
Local
: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
trainML
: Select the desired model from the list of your stored trainML models.
Wasabi
: Must begin with s3://.
Web
: Must begin with http://, https://, ftp://, or ftps://.
Source-Specific Fields
Endpoint
(Wasabi Only): The service URL of the Wasabi bucket you are using.
Checkpoints
All jobs allow you to attach checkpoints for use during job processing. To add a checkpoint, click the Add Checkpoint
button.
Checkpoint
: Select the desired checkpoint to attach from this list.
Public
: Check this box to select a public checkpoint.
If you add a single checkpoint, the checkpoint will be mounted to /opt/trainml/checkpoint
inside each job worker. To add multiple checkpoints to a job, you can continue to click the Add Checkpoint
button until you have selected all the checkpoints you need for the job. Each checkpoint will be mounted into its own directory inside the /opt/trainml/checkpoint
directory. The directory name will be the name of the checkpoint with spaces converted to underscores. For example, a checkpoint named My Checkpoint
will be mounted to /opt/trainml/checkpoint/My_Checkpoint
if it is one of multiple checkpoints selected.
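As with datasets, a script can load weights directly from the mounted checkpoint directory. A minimal sketch, assuming a PyTorch checkpoint containing a file named model.pt (both the checkpoint name and file name are illustrative):

```python
import os
import torch

# Single checkpoint: mounted at /opt/trainml/checkpoint
# Multiple checkpoints: /opt/trainml/checkpoint/<Checkpoint_Name>
checkpoint_dir = "/opt/trainml/checkpoint/My_Checkpoint"

# "model.pt" is a hypothetical file name; use whatever files your
# checkpoint actually contains.
state_dict = torch.load(os.path.join(checkpoint_dir, "model.pt"), map_location="cpu")
```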
To remove a selected checkpoint from the job, click the x
button to the right of the checkpoint name.
Workers
This section is only visible for training and inference jobs.
Number of Workers
(Training Jobs Only): The number of workers to use for this job. Each worker will be assigned the number of dedicated GPUs specified in the GPU Count
field. Workers run independently and in parallel as long as sufficient GPUs of the selected type are available. You do not pay for workers awaiting GPUs.
Worker Commands
: You can specify a unique command for each worker, or the same command for all workers. If you are using an external solution like hyperopt or Weights & Biases to control the experiments each worker is running, you may want to specify a single command for all workers. If you are using command line arguments to your training script to make different workers try different hyperparameters or architectures, specify a unique command for each worker.
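For example, each worker command can pass a different hyperparameter value to the same training script via a command line argument. A minimal sketch (the script name and flag are hypothetical):

```python
import argparse

# train.py -- hypothetical script invoked by each worker, e.g.:
#   worker 1: python train.py --learning-rate 0.01
#   worker 2: python train.py --learning-rate 0.001
parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=0.01)
args = parser.parse_args()

print(f"Training with learning rate {args.learning_rate}")
```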
Endpoint
This section is only visible for endpoints.
Manually Specify Server Command
: If you wish to run your own web server for this endpoint instead of configuring the trainML built-in server, check this box.
Start Command
(Manual Server Command Only): The command required to start the server that will listen for incoming requests. The server must listen on port 80.
If your endpoint stops shortly after starting, check the logs for any execution errors. If none are found, ensure that the start command you use starts the web server in the foreground, not as a background/daemon process. For example, if you are using NGINX, ensure it is configured with the daemon off setting.
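A minimal sketch of a custom server that satisfies these requirements, assuming a FastAPI app in a hypothetical file named server.py. Started with a command such as python server.py, it listens on port 80 and stays in the foreground:

```python
# server.py -- hypothetical file name for illustration
from fastapi import FastAPI
import uvicorn

app = FastAPI()

@app.post("/predict")
def predict(text: str):
    # Replace with your model's inference logic
    return {"length": len(text)}

if __name__ == "__main__":
    # Runs in the foreground and listens on port 80 as required
    uvicorn.run(app, host="0.0.0.0", port=80)
```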
Regional Port Reservation
: Deploy the endpoint to a specific region on the port and hostname defined in the regional reservation.
Attaching an endpoint to a regional port reservation will disable external endpoint connectivity. Only systems on the local LAN of the region the reservation is in will be able to access the endpoint.
Routes
Endpoints are defined using routes. Routes configure what function within the model's code will be executed when making an HTTP request to the endpoint. They are uniquely identified by the HTTP verb and the URL path that the endpoint will respond to. To add a route, click the Add Route
button.
HTTP Verb
: The HTTP verb (or request method) to use for this route. Currently, only POST
requests are supported.
Path
: The URL path to use for this route.
File Name
: The file that contains the code that will be executed when a request is made to this route. If the file is not in the root directory of the model code, specify a relative file path for the file (e.g. subdir1/subdir2/file.py
for the file.py
file in the subdir2
directory of the subdir1
directory of the model code). Only Python files are supported.
Function Name
: The Python function within the file that will be called when a request is made to this route.
Function Uses Positional Arguments
: Indicates whether the specified function receives arguments as positional (checked) or keyword (unchecked) arguments. If checked, the order of the request body parameters is the order in which they are passed to the function.
You cannot have two routes with the same verb and path.
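A sketch of how these fields map to code, assuming a route with Path /predict, File Name predict.py, and Function Name run (all hypothetical), with Function Uses Positional Arguments unchecked:

```python
# predict.py -- referenced by the route's File Name field (hypothetical)
def run(image_url, threshold=0.5):
    """Called when a POST request is made to the route's Path.

    Request body attributes are passed as keyword arguments, so the
    parameter names must match the route's Request Body Template parameters.
    """
    # Replace with your model's inference logic
    return {"image_url": image_url, "threshold": threshold}
```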
Request Body Template Parameters
The request body template defines the allowed and required attributes that must be specified in the request body when the client makes a request to this route. The request body must exactly match the allowed arguments of the function serving this route. Click the Add Parameter
button to add a parameter definition for the request body. If Function Uses Positional Arguments
is checked, parameters can be reordered using the up and down arrow buttons.
Name
: The name of the attribute in the request body. This must also be the keyword argument name if using keyword arguments instead of positional in the function.
Data Type
: The expected data type of the attribute value. String, Integer, Float, Boolean, Object, and List are supported.
Optional
: Indicates that the attribute is not required to be present in the request body, but is allowed. Optional attributes require a default value.
Default Value
: The default value to use for unspecified optional attributes. The default value must be valid Python (e.g. None
should be used rather than null
or undefined
). See FastAPI's request body documentation for more details.
You cannot have two parameters of the same name.
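Continuing the hypothetical route above, a client request whose template defines a required String image_url and an optional Float threshold (default 0.5) might look like this; the endpoint address is a placeholder:

```python
import requests

# "<endpoint-address>" is a placeholder for your endpoint's address
response = requests.post(
    "https://<endpoint-address>/predict",
    json={"image_url": "https://example.com/cat.jpg"},  # threshold omitted, default used
)
print(response.json())
```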
Environment
This section is optional for all job types.
Base Environment
: Job environments determine the software that is preinstalled in the operating environment for the job workers. You can set the default job environment for each job type in the account settings page. The environment is determined by selecting the Python Version
, the Framework
, and the Framework Version
(if applicable). Selecting the correct environment will save you time setting up your environment and minimize the amount of space required for each worker. The space required by the base environment is free and does not count towards your storage quota, but modifications to the environment will. Downgrading PyTorch or Tensorflow can consume a large amount of space (10GB+).
Customer Provided
: Check this box to use a customer provided Docker image as the base environment instead of a trainML environment. Unlike trainML built-in environments, the image size does count towards the disk size quota you specify when creating the job.
Ensure you are reserving enough disk space to accommodate the image size. CUDA layers can be 3+ GB alone. If the image size is greater than the requested disk space, the job will fail.
Python Version
: The Python version of the conda environment that forms the base of the environment. All base environments contain a wide variety of popular data science, machine learning, and GPU-acceleration libraries. Only Python 3.8 and 3.9 environments are currently available.
Framework
: The primary deep learning framework to be used. If you do not have specific version requirements for your model, select Deep Learning
. Otherwise, select the major framework you intend to use to see the available versions.
Deep Learning
: All supported frameworks are installed using their latest version compatible with the Python version selected: Tensorflow, PyTorch, MXNet.
PyTorch
: Select this option if your model code requires a specific version of PyTorch.
Tensorflow
: Select this option if your model code requires a specific version of Tensorflow.
MXNet
: Select this option if your model code requires a specific version of MXNet.
Framework Version
: Select the version of the major framework you need.
Image
(Customer Provided Only): The full image name of the customer provided Docker image to use as the base environment. trainML currently supports pulling images stored in five Docker registries: Docker Hub (both public and private repositories), AWS Elastic Container Registry, Azure Container Registry, Google Artifact Registry, and NVIDIA NGC. In order to use an image in one of these registries, you must first configure third-party access keys for the provider you intend to use. This is not strictly required for public Docker Hub images, but is highly recommended. If you use public images anonymously, your job may fail due to Docker rate limiting.
Customers are responsible for building their images with their desired version of CUDA and any associated acceleration libraries. Additionally, notebooks and endpoints require certain libraries to be installed to function properly. You can add these to the image during the image build process, or include them as additional pip
packages during job creation.
- Notebook: jupyterlab
- Endpoint: fastapi, pydantic, uvicorn[standard], python-multipart
Jobs using images that do not meet these requirements may fail or immediately stop once they reach the running
state.
Package Dependencies
: Specify the lists of apt, pip, and/or conda packages to be installed in the job environment prior to starting the job. Each package should be added on its own line. Package dependencies will be installed in the following order:
- apt
- conda
- requirements.txt file, if found
- pip
You should NOT use the Package Dependencies
section to update a major framework version (Tensorflow/PyTorch/MXNet). Instead, select the correct major version as the Base Environment
above.
pip
: PyPi packages. Use package==version to pin a package version.
apt
: Ubuntu packages. Use package=version to pin a package version.
conda
: Conda packages. Use "package=version" to pin a package version.
Environment Variables
: To add environment variables to the job environment, click the plus button. These can be used to control the execution of the workers or provide training scripts with additional data. For example, if the training script automatically uploads checkpoints to an S3 bucket defined by the environment variable BUCKET_NAME, set that variable here. If you are using Weights & Biases to track your experiments, the WANDB_PROJECT and WANDB_API_KEY variables can be set here. Like the model code, environment variables are shared across all job workers.
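For example, a training script might read these variables at runtime. A minimal sketch using the BUCKET_NAME example above; the boto3 usage assumes the package is installed and AWS third-party access keys are attached to the job:

```python
import os
import boto3

# BUCKET_NAME is set in the job's Environment Variables section
bucket = os.environ["BUCKET_NAME"]

# Works only if AWS third-party access keys are attached to the job
s3 = boto3.client("s3")
s3.upload_file("checkpoint.pt", bucket, "checkpoints/checkpoint.pt")
```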
Third-Party Access Keys
: If you want your job workers to utilize third-party cloud services, you can also attach their keys to the workers. This will set the relevant environment variables or load credential files in the worker containers for the configured key values.