
Job Configuration

Notebook and training job configuration is organized into five sections: Resources, Data, Model, Workers, and Environment.

Resources

This section is mandatory for all job types.

Job Name: A friendly name for the job. The job name is used to identify both the job's connection utility and the job's output data (if configured). It is available as an environment variable in the job environment.
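A minimal sketch of reading the job name from inside a worker. The exact environment variable name is not documented here, so the name used below is a hypothetical placeholder; inspect the job environment (for example, with env) to confirm the actual name.

```python
import os

# Hypothetical variable name used for illustration only; the name exposed
# by the platform may differ -- check the job environment to confirm it.
job_name = os.environ.get("TRAINML_JOB_NAME", "unknown-job")
print(f"Running as part of job: {job_name}")
```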

GPU Type: The types of GPU that will satisfy the job request. GPU types differ by both their memory size and their operations per second (OPS). A GPU's OPS rate varies based on the data type being used. Single-precision (FP32) is the most common data type used in deep learning.

  • GTX 1060: 6 GB of GPU memory, 4.5 TFLOPS of single-precision (FP32), 0.9 TFLOPS of half-precision (FP16) - GTX 1060 instances should not be used for half-precision training
  • RTX 2060 Super: 8 GB of GPU memory, 7.1 TFLOPS of single-precision (FP32), 15 TFLOPS of half-precision (FP16)
  • RTX 2070 Super: 8 GB of GPU memory, 9 TFLOPS of single-precision (FP32), 18.5 TFLOPS of half-precision (FP16)
  • RTX 2080 Ti: 11 GB of GPU memory, 13.5 TFLOPS of single-precision (FP32), 25 TFLOPS of half-precision (FP16)
  • RTX 3090: 24 GB of GPU memory, 35.5 TFLOPS of single-precision (FP32), 39 TFLOPS of half-precision (FP16)
  • RTX A4000: 16 GB of GPU memory, 9.7 TFLOPS of single-precision (FP32), 19 TFLOPS of half-precision (FP16)
  • RTX A6000: 48 GB of GPU memory, 36 TFLOPS of single-precision (FP32), 40 TFLOPS of half-precision (FP16)
  • T4: 16 GB of GPU memory, 7.5 TFLOPS of single-precision (FP32), 15.5 TFLOPS of half-precision (FP16)
  • P100: 16 GB of GPU memory, 8 TFLOPS of single-precision (FP32), 16 TFLOPS of half-precision (FP16)
  • V100: 16 GB of GPU memory, 15 TFLOPS of single-precision (FP32), 25.5 TFLOPS of half-precision (FP16)
  • A100 (40GB): 40 GB of GPU memory, 39 TFLOPS of single-precision (FP32), 62.5 TFLOPS of half-precision (FP16)
  • A100 (80GB): 80 GB of GPU memory, 39 TFLOPS of single-precision (FP32), 62.5 TFLOPS of half-precision (FP16)
  • None (CPU-Only): Do not attach a GPU.

All GPU instance types have at least 2 cores (4 vCPU) per GPU for PCIe 3.0 GPUs, or 4 cores (8 vCPU) per GPU for PCIe 4.0 GPUs, with twice the GPU memory in CPU memory. CPU-to-GPU bandwidth is a minimum of 8 PCIe lanes per GPU.
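Because half-precision throughput varies widely across these GPU types (and the GTX 1060 should not be used for half-precision training at all), mixed precision is typically enabled conditionally in the training script. A minimal PyTorch sketch under that assumption; the toy model, data, and the use_amp flag are placeholders for your own logic:

```python
import torch
from torch import nn

# Enable half-precision only when a GPU with usable FP16 throughput is present
# (per the table above, avoid this on the GTX 1060).
use_amp = torch.cuda.is_available()
device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(128, 10).to(device)                 # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(32, 128, device=device)
targets = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):        # FP16 where safe, FP32 elsewhere
    loss = nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```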

GPU Count: This is the number of GPUs per instance. If you use multiple job workers, this is the number of GPUs each worker will use.

CPU Count (CPU Only Jobs): This is the number of vCPUs per instance. Must be a multiple of 4.

Max Price: The maximum credits per hour per GPU that will satisfy the job request. If the only GPU types available exceed the max price setting, the job will remain in the Waiting for GPUs state until the request can be fulfilled.

Disk Size: The amount of working directory storage in GB to allocate for this job. The minimum is 10 GB and the maximum is 1 TB. If you use multiple job workers, each worker will have this much space. Any data added in the Data section has a separate allocation and will not impact this.

GPU Pricing

The credits per hour each GPU type costs can vary based on the supply and demand of that GPU type. Once you create a job, your credits per hour rate is locked in as long as the job is running. To see the most up-to-date prices, log in to the trainML platform and click Start a Notebook or Create a Training Job on the Home page.

Data

This section allows you to configure what data is populated for a job and (if applicable) where to upload the job's output after it completes.

Datasets

Notebook and Training Jobs allow you to attach persistent datasets as input data. To add a dataset, click the Add Dataset button.

Dataset Type: If you wish to attach a dataset to a job, select from one of the dataset options in this list.

  • Public Dataset: Choose from a list of public domain datasets pre-loaded by trainML.
  • My Dataset: Choose from a list of the private datasets you have previously created.

Dataset: Select the desired dataset to attach from this list.

If you add a single dataset, the dataset will be mounted to /opt/trainml/input inside the job workers. To add multiple datasets to a job, you can continue to click the Add Dataset button until you have selected all the datasets you need for the job. Each dataset will be mounted into its own directory inside the /opt/trainml/input directory. The directory name will be the name of the dataset with spaces converted to underscores. For example, a dataset named PASCAL VOC will be mounted to /opt/trainml/input/PASCAL_VOC if it is one of multiple datasets selected.
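A short sketch, assuming the mount layout described above, for discovering attached datasets from inside a worker (the dataset names are examples only):

```python
import os

input_root = "/opt/trainml/input"

# With a single dataset attached, its files sit directly under input_root.
# With multiple datasets, each dataset gets its own subdirectory
# (spaces in the dataset name replaced with underscores), e.g. a dataset
# named "PASCAL VOC" appears at /opt/trainml/input/PASCAL_VOC.
for entry in sorted(os.listdir(input_root)):
    print(os.path.join(input_root, entry))
```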

To remove a selected dataset from the job, click the x button to the right of the dataset name.

Input Data

Inference jobs do not use persistent datasets and instead allow you to perform a one-time download of the data to run inference on from an external source. Data downloaded in this manner is purged from the system when the job finishes.

caution

Unlike datasets, inference job input data consumes the job worker's space allocation. If the source data is large, ensure you have configured sufficient disk space in the Resources section to account for this.

Input Type: To automatically download the data to perform the inference operation on, select from the available options in the Input Type dropdown.

  • AWS: Select this option if the data resides on Amazon S3.
  • Azure: Select this option if the data resides on Azure Blob Storage.
  • GCP: Select this option if the data resides on Google Cloud Storage.
  • Local: Select this option if the data resides on your local computer. You will be required to connect to the job for this option to work.
Warning

Jobs will wait indefinitely for you to connect, and you will continue to be billed until the job stops.

  • Regional Datastore: Select this option to mount the data directly from a Regional Datastore in an existing CloudBender region.
  • Wasabi: Select this option if the data resides on Wasabi Storage.
  • Web: Select this option if the data resides on a publicly accessible HTTP or FTP server.

Input Storage Path: The path of the data within the storage type. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be downloaded and automatically extracted prior to any worker starting. If you specify a directory path (ending in /), it will run a sync starting from the path provided, downloading all files and subdirectories from the provided path. Valid paths for each Input Type are the following:

  • AWS: Must begin with s3://.
  • Azure: Must begin with https://.
  • GCP: Must begin with gs://.
  • Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
  • Wasabi: Must begin with s3://.
  • Web: Must begin with http://, https://, ftp://, or ftps://.
Source Specific fields

Endpoint (Wasabi Only): The service URL of the Wasabi bucket you are using.

Path (Regional Datastore Only): The subdirectory inside the regional datastore to load the data from. Use / to load the entire datastore.

Output Data

Both training and inference jobs support the ability to upload their results (data stored in the path specified by the TRAINML_OUTPUT_PATH environment variable) to an external source when the job finishes.
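A minimal sketch of writing results to the output location from a training script, using the TRAINML_OUTPUT_PATH variable mentioned above; the file name and contents are placeholders:

```python
import json
import os

# TRAINML_OUTPUT_PATH points at the directory whose contents are uploaded
# when the worker finishes (see Output Storage Path below).
output_dir = os.environ["TRAINML_OUTPUT_PATH"]

metrics = {"loss": 0.123, "accuracy": 0.94}      # placeholder results
with open(os.path.join(output_dir, "metrics.json"), "w") as f:
    json.dump(metrics, f)
```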

Output Type: To automatically upload model results after each worker completes, select from the available options in the Output Type dropdown. The worker remains active and counts towards a job's billed GPU time while uploading its results.

  • AWS: Select this option to upload results to Amazon S3.
  • Azure: Select this option to upload results to Azure Blob Storage.
  • GCP: Select this option to upload results to Google Cloud Storage.
  • Local: Select this option if you want the workers to upload their results directly to your local computer. You must be connected to the job when the worker attempts to upload its results to receive them.
Warning

Workers will wait indefinitely for you to connect, and you will continue to be billed until the upload completes.

  • Regional Datastore: Select this option to mount the output folder directly to a Regional Datastore location in an existing CloudBender region.
  • trainML: Select this option to create a trainML model, checkpoint, or dataset with the results.
  • Wasabi: Select this option to upload results to Wasabi Storage.

Output Storage Path: The path within the specified storage type to upload the outputs to. If this is configured, when a worker exits, it will automatically zip the contents of the /opt/trainml/output directory and push it to the specified storage path with a naming convention of <job_name>.zip (or <job_name>_<worker_number>.zip for multi-worker jobs). Valid paths for each Output Type are the following:

  • AWS: Must begin with s3://.
  • Azure: Must begin with https://.
  • GCP: Must begin with gs://.
  • Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
  • Regional Datastore: Must begin with / (absolute path).
  • trainML: Must be model, dataset, or checkpoint.
  • Wasabi: Must begin with s3://.
Source Specific fields

Endpoint (Wasabi Only): The service URL of the Wasabi bucket you are using.

Path (Regional Datastore Only): The subdirectory inside the regional datastore to save the data to. Use / to save the data to the datastore root.

Model

trainML jobs can prepopulate the model code from external sources on job creation. To utilize this capability, configure the following two fields.

Model Type: The source from which to populate the model. The following options are supported:

  • AWS: Select this option if the model code resides on Amazon S3.
  • Azure: Select this option if the model code resides on Azure Blob Storage.
  • GCP: Select this option if the model code resides on Google Cloud Storage.
  • Git: Select this option if the model code resides in a git repository.
  • Local: Select this option if the model code resides on your local computer. You will be required to connect to the job for this option to work.
Warning

Jobs will wait indefinitely for you to connect, and you will continue to be billed until the job stops.

  • Wasabi: Select this option if the model code resides on Wasabi Storage.
  • Web: Select this option if the model code resides on a publicly accessible HTTP or FTP server.

Model Code Location: The path of the model code within the source type. The resulting code will be loaded into the job at /opt/trainml/models. Valid paths for each Model Type are the following:

  • AWS: Must begin with s3://.
  • Azure: Must begin with https://.
  • GCP: Must begin with gs://.
  • Git: The HTTP(S) or SSH git clone URL. If you are using GitHub, this is the URL shown in the Clone or Download button for the repository. To access private repositories, you must configure a Git SSH key in your account settings and use an SSH git URL here.
  • Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
  • trainML: Select the desired model from the list of your stored trainML models.
  • Wasabi: Must begin with s3://.
  • Web: Must begin with http://, https://, ftp://, or ftps://.
Source Specific fields

Endpoint (Wasabi Only): The service URL of the Wasabi bucket you are using.

Checkpoints

All jobs allow you to attach checkpoints for use during job processing. To add a checkpoint, click the Add Checkpoint button.

Checkpoint: Select the desired checkpoint to attach from this list.

Public: Check this box to select a public checkpoint.

If you add a single checkpoint, the checkpoint will be mounted to /opt/trainml/checkpoint inside each job worker. To add multiple checkpoints to a job, you can continue to click the Add Checkpoint button until you have selected all the checkpoints you need for the job. Each checkpoint will be mounted into its own directory inside the /opt/trainml/checkpoint directory. The directory name will be the name of the checkpoint with spaces converted to underscores. For example, a checkpoint named My Checkpoint will be mounted to /opt/trainml/checkpoint/My_Checkpoint if it is one of multiple checkpoints selected.
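A brief sketch of loading an attached checkpoint from the mount points described above; the checkpoint directory and file names are examples, and torch.load is used only as one possible loader:

```python
import os
import torch

checkpoint_root = "/opt/trainml/checkpoint"

# Single checkpoint attached: files sit directly under checkpoint_root.
# Multiple checkpoints: each is in its own subdirectory, e.g.
# "My Checkpoint" -> /opt/trainml/checkpoint/My_Checkpoint
checkpoint_dir = os.path.join(checkpoint_root, "My_Checkpoint")
state = torch.load(os.path.join(checkpoint_dir, "model.pt"),  # example file name
                   map_location="cpu")
```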

To remove a selected checkpoint from the job, click the x button to the right of the checkpoint name.

Workers

This section is only visible for training and inference jobs.

Number of Workers (Training Jobs Only): The number of workers to use for this job. Each worker will be assigned dedicated GPUs of the amount specified in the GPU Count field. Workers run independently and in parallel as long as sufficient GPUs of the selected type are available. You do not pay for workers awaiting GPUs.

Worker Commands: You can specify a unique command for each worker, or the same command for all workers. If you are using an external solution like hyperopt or Weights & Biases to control the experiments each worker is running, you may want to specify a single command for all workers. If you are using command line arguments to your training script to make different workers try different hyperparameters or architectures, specify a unique command for each worker.
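If you drive hyperparameter or architecture variations through command line arguments, each worker command can pass different values to the same script. A hedged sketch of such a script; the script name and argument names are illustrative:

```python
# train.py -- illustrative name; worker commands might be, for example:
#   worker 1: python train.py --lr 0.001 --batch-size 32
#   worker 2: python train.py --lr 0.01  --batch-size 64
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--batch-size", type=int, default=32)
args = parser.parse_args()

print(f"training with lr={args.lr}, batch_size={args.batch_size}")
# ... training loop using args.lr and args.batch_size ...
```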

Endpoint

This section is only visible for endpoints.

Manually Specify Server Command: If you wish to run your own web server for this endpoint instead of configuring the trainML built-in server, check this box.

Start Command (Manual Server Command Only): The command required to start the server that will listen for incoming requests. The server must listen on port 80.

tip

If your endpoint stops shortly after starting, check the logs for any execution errors. If none are found, ensure that the start command you use starts the web server in the foreground, not as a background/daemon process. For example, if you are using NGINX, ensure it is configured with the daemon off setting.
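As one example of a server that satisfies these requirements, a minimal FastAPI application started with uvicorn runs in the foreground and binds to port 80. This is a sketch, not a required layout; the file, route, and model names are illustrative:

```python
# server.py -- start with a command such as: python server.py
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest):
    # placeholder inference logic
    return {"length": len(req.text)}

if __name__ == "__main__":
    # uvicorn.run() blocks, so the server stays in the foreground as required.
    uvicorn.run(app, host="0.0.0.0", port=80)
```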

Regional Port Reservation: Deploy the endpoint to a specific region on the port and hostname defined in the regional reservation.

caution

Attaching an endpoint to a regional port reservation disables external endpoint connectivity. Only systems on the local LAN of the region the reservation is in will be able to access the endpoint.

Routes

Endpoints are defined using routes. Routes configure which function within the model's code will be executed when an HTTP request is made to the endpoint. Routes are uniquely identified by the HTTP verb and the URL path that the endpoint will respond to. To add a route, click the Add Route button.

HTTP Verb: The HTTP verb (or request method) to use for this route. Currently, only POST requests are supported.

Path: The URL path to use for this route.

File Name: The file that contains the code that will be executed when a request is made to this route. If the file is not in the root directory of the model code, specify a relative file path for the file (e.g. subdir1/subdir2/file.py for the file.py file in the subdir2 directory of the subdir1 directory of the model code). Only Python files are supported.

Function Name: The python function within the file that will be called when a request is made to this route.

Function Uses Positional Arguments: Indicates if the function specified is configured to receive arguments as positional (checked) or keyword (unchecked). If this is true, the order of the request body parameters will be the order in which they are passed to the function.

You cannot have two routes with the same verb and path.

Request Body Template Parameters

The request body template defines the allowed and required attributes that must be specified in the request body when the client makes a request to this route. The request body must exactly match the allowed arguments of the function serving this route. Click the Add Parameter button to add a parameter definition for the request body. If Function Uses Positional Arguments is checked, parameters can be reordered using the up and down arrow buttons.

Name: The name of the attribute in the request body. This must also be the keyword argument name if using keyword arguments instead of positional in the function.

Data Type: The expected data type of the attribute value. String, Integer, Float, Boolean, Object, and List are supported.

Optional: Indicates that the attribute is not required to be present in the request body, but is allowed. Optional attributes require a default value.

Default Value: The default value to use for unspecified optional attributes. The default value must be valid Python (e.g. None should be used rather than null or undefined). See FastAPI's request body documentation for more details.

You cannot have two parameters of the same name.
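Putting the route fields together, a hedged sketch of a handler whose keyword arguments mirror a request body template; the file, function, and parameter names are examples only:

```python
# predict.py -- example value for the route's File Name field
def predict(text, top_k=None):
    # "text" would be a required String parameter in the request body template;
    # "top_k" an Optional Integer whose Default Value is None (valid Python).
    results = {"input": text}
    if top_k is not None:
        results["top_k"] = top_k
    return results
```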

Environment

This section is optional for all job types.

Base Environment: Job environments determine the software that is preinstalled in the operating environment for the job workers. You can set the default job environment for each job type on the account settings page. The environment is determined by selecting the Python Version, the Framework, and the Framework Version (if applicable). Selecting the correct environment will save you time setting up your environment and minimize the amount of space required for each worker. The space required by the base environment is free and does not count towards your storage quota, but modifications to the environment will. Downgrading PyTorch or Tensorflow can consume a large amount of space (10 GB+).

Customer Provided: Check this box to use a customer provided docker image as the base environment instead of a trainML environment. Unlike trainML built-in environments, the image size does count towards the disk size quota you specify when creating the job.

caution

Ensure you are reserving enough disk space to accommodate the image size. CUDA layers can be 3+ GB alone. If the image size is greater than the requested disk space, the job will fail.

Python Version: The Python version of the conda environment that forms the base of the environment. All base environments contain a wide variety of popular data science, machine learning, and GPU-acceleration libraries. Only Python 3.8 and 3.9 environments are currently available.

Framework: The primary deep learning framework to be used. If you do not have specific version requirements for your model, select Deep Learning. Otherwise, select the major framework you intend to use to see the available versions.

  • Deep Learning: All supported frameworks are installed using their latest version compatible with the Python version selected: Tensorflow, PyTorch, MXNet.
  • PyTorch: Select this option if your model code requires a specific version of PyTorch
  • Tensorflow: Select this option if your model code requires a specific version of Tensorflow
  • MXNet: Select this option if your model code requires a specific version of MXNet

Framework Version: Select the version of the major framework you need.

Image (Customer Provided Only): The full image name of the customer provided docker image to use as the base environment. trainML currently supports pulling images stored in five Docker registries: Docker Hub (both public and private repositories), AWS Elastic Container Registry, Azure Container Registry, Google Artifact Registry, and NVIDIA NGC. In order to use an image in one of these registries, you must first configure third party access keys for the provider you intend to use. This is not strictly required for public Docker Hub images, but is highly recommended. If you use public images anonymously, your job may fail due to Docker rate limiting.

Customers are responsible for building their images with their desired version of CUDA and any associated acceleration libraries. Additionally, notebooks and endpoints require certain libraries to be installed to function properly. You can add these to the image during the image build process, or include them as additional pip packages during job creation.

  • Notebook: jupyterlab
  • Endpoint: fastapi, pydantic, uvicorn[standard], python-multipart
Warning

Jobs using images that do not meet these requirements may fail or immediately stop once they reach the running state.

Package Dependencies: Specify the lists of apt, pip, and/or conda packages to be installed in the job environment prior to starting the job. Each package should be added on its own line. Package dependencies will be installed in the following order:

  1. apt
  2. conda
  3. requirements.txt file, if found
  4. pip
caution

You should NOT use the Package Dependencies section to update a major framework version (Tensorflow/PyTorch/MXNet). Instead, select the correct major version as the Base Environment above.

  • pip: PyPi packages. Use package==version to pin a package version.
  • apt: Ubuntu packages. Use package=version to pin a package version.
  • conda: Conda packages. Use "package=version" to pin a package version.

Environment Variables: To add environment variables to the job environment, click the plus button. These can be used to control the execution of the workers or provide training scripts with additional data. For example, if the training script automatically uploads checkpoints to an S3 bucket defined by the environment variable BUCKET_NAME, set that variable here. If you are using Weights & Biases to track your experiments, the WANDB_PROJECT and WANDB_API_KEY variables can be set here. Like the model code, environment variables are shared across all job workers.
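A short sketch of a training script consuming such variables. BUCKET_NAME, WANDB_PROJECT, and WANDB_API_KEY come from the examples above; the boto3 upload and the checkpoint file name are illustrative assumptions:

```python
import os

import boto3
import wandb

# WANDB_PROJECT and WANDB_API_KEY set as job environment variables are
# picked up automatically by wandb.init().
wandb.init()

bucket = os.environ["BUCKET_NAME"]          # set in the job configuration
checkpoint_file = "checkpoint.pt"           # illustrative file name

# ... training produces checkpoint_file ...

s3 = boto3.client("s3")
s3.upload_file(checkpoint_file, bucket, checkpoint_file)
```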

Third-Party Access Keys: If you want your job workers to utilize third-party cloud services, you can also attach their keys to the workers. This will set the relevant environment variables or load credential files in the worker containers for the configured key values.