
CloudBender

CloudBender™ lets you connect your on-prem and cloud GPUs to the trainML platform and seamlessly run jobs on any CloudBender-enabled system. When you start a notebook or submit a job, CloudBender automatically selects the lowest-cost available resource that meets your hardware, cost, data, and security specifications.

Providers

CloudBender currently supports two infrastructure providers: Physical and Google Cloud (GCP).

Physical

The Physical provider is used for any on-premise equipment, whether that is a large DGX server or a GPU-enabled AI workstation. The Physical provider has the lowest cost per GPU hour for on-demand jobs. Physical systems must meet specific hardware requirements and must run the trainML operating system.

To enable the Physical provider, click Enable Provider in the actions menu of the Providers list, or select Physical (On-Prem) from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and click Submit.

Google Cloud

The Google Cloud provider allows you to run trainML workloads in your GCP account without the learning curve of becoming a Google Cloud Architect. When combined with using the physical provider for managing your on-prem systems, CloudBender enables a truly seamless "own the base, rent the burst" architecture. Even if you don't have physical resources now, using CloudBender to orchestrate your cloud resources makes transitioning to physical resources later effortless.

CloudBender will create the majority of the resources in your GCP account, but you must create the GCP account, the project, and the service account, and grant the required IAM permissions yourself.

Account Requirements

Only paid Cloud Billing accounts are permitted to attach GPUs to instances. If your account is new, it starts in the free tier by default. You must upgrade your GCP account to use CloudBender. If your account is managed by your organization, no change is required.

Project Configuration

trainML recommends that you create a project that only trainML will use to avoid issues of quota management, resource conflicts, or security permissions. Once the project is created, enable the following APIs in the project:

Once these APIs are enabled, additional IAM permissions must be granted to some of the project's service accounts. On the IAM page, configure the following permissions.

  1. Add "Service Account User" to the Compute Engine default service account. The account name will be of the format <project_id>-compute@developer.gserviceaccount.com.
  2. If your project was created prior to April 8, 2021, add "Service Account Token Creator" to the Cloud Pub/Sub Service Agent account. The account name will be of the format service-<project_number>@gcp-sa-pubsub.iam.gserviceaccount.com. You may have to check the Include Google-provided role grants checkbox to see this account. See the Google Cloud documentation for more details.
  3. Add "Pub/Sub Publisher" to the global cloud logs service account cloud-logs@system.gserviceaccount.com. Because this account will not show up in the principals list, click the Add button near the top of the page and paste the account into the New principals box, then configure the permissions as normal.

If you do not see these accounts in your project, ensure that you have first enabled the required APIs above.
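These grants can also be made from the command line. Below is a minimal sketch that drives the gcloud CLI from Python; it assumes gcloud is installed and authenticated against your project, and the project ID and number shown are placeholders to replace with your own values.

import subprocess

PROJECT_ID = "my-trainml-project"   # placeholder: your GCP project ID
PROJECT_NUMBER = "123456789012"     # placeholder: your GCP project number

def grant(member: str, role: str) -> None:
    # Add a project-level IAM binding for the given principal and role.
    subprocess.run(
        ["gcloud", "projects", "add-iam-policy-binding", PROJECT_ID,
         f"--member=serviceAccount:{member}", f"--role={role}"],
        check=True,
    )

# 1. Service Account User for the Compute Engine default service account
grant(f"{PROJECT_NUMBER}-compute@developer.gserviceaccount.com",
      "roles/iam.serviceAccountUser")

# 2. Only needed for projects created before April 8, 2021
grant(f"service-{PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com",
      "roles/iam.serviceAccountTokenCreator")

# 3. Pub/Sub Publisher for the global cloud logs service account
grant("cloud-logs@system.gserviceaccount.com", "roles/pubsub.publisher")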

Service Account

Create a service account in the new project for CloudBender to use. Add the following roles to the service account during the creation process:

  • Compute Admin
  • Pub/Sub Admin
  • Logs Configuration Writer
  • Service Account User
  • Billing Account Viewer (optional)

Once the account is created, create new service account keys. Select JSON as the Key Type, and the credentials JSON file should download automatically. This file will be used in the next step.
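If you prefer the command line, the steps above can be scripted. The following is a sketch only, again driving the gcloud CLI from Python; the service account name and key file name are placeholders, and it assumes gcloud is authenticated with permission to administer the project.

import subprocess

PROJECT_ID = "my-trainml-project"   # placeholder: your GCP project ID
SA_NAME = "trainml-cloudbender"     # placeholder service account name
SA_EMAIL = f"{SA_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"

# Create the service account
subprocess.run(["gcloud", "iam", "service-accounts", "create", SA_NAME,
                "--project", PROJECT_ID], check=True)

# Grant the roles listed above. Billing Account Viewer (optional) is typically
# granted on the billing account rather than the project, so it is not included here.
for role in ["roles/compute.admin", "roles/pubsub.admin",
             "roles/logging.configWriter", "roles/iam.serviceAccountUser"]:
    subprocess.run(["gcloud", "projects", "add-iam-policy-binding", PROJECT_ID,
                    f"--member=serviceAccount:{SA_EMAIL}", f"--role={role}"],
                   check=True)

# Create and download a JSON key for the provider configuration form
subprocess.run(["gcloud", "iam", "service-accounts", "keys", "create",
                "cloudbender-key.json", "--iam-account", SA_EMAIL], check=True)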

Enable Provider

To enable the Google Cloud provider, click Enable Provider in the actions menu of the Providers list, or select Google Cloud from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and upload the JSON file from the previous step. If preferred, you can disable Upload JSON File and paste the contents of the key file from the previous step into the text area.

If the file is in the correct format, the system will automatically extract the correct project ID and service account email address from the file and populate the fields. Once the form is complete, click Submit to enable the provider.
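For reference, a GCP service account key file is a JSON document whose project_id and client_email fields hold the values the form extracts; a quick way to confirm them yourself (the file name is a placeholder):

import json

with open("cloudbender-key.json") as f:  # the key file downloaded earlier
    key = json.load(f)

print(key["project_id"])    # GCP project ID the provider will be bound to
print(key["client_email"])  # service account email address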

CloudBender will begin creating some project level resources necessary for it to manage resources in your project. Once this process is complete, the Create Region button will appear.

Regions

All providers are organized into regions, which represent the physical location of the resources. For cloud providers, you can create one trainML region per cloud region. For the Physical provider, you should have at least one region per physical site, but you are not limited to one per site.

Physical Regions

Regions of the Physical provider are designed to organize resources based on physical proximity. If you have multiple data centers, even if they are connected over a high-speed WAN link, you should create a region for each data center. All nodes within the same region must be able to communicate with each other using a private (RFC 1918) address. You can also create multiple regions for the same data center if you wish to apply different access control policies to different nodes. For example, if you have some nodes you want dedicated to your organization and others that you want to earn credits by selling spare compute cycles to other trainML customers, you can create two different regions to control that.

To create a new region, select the Create Region button from the actions menu of the Physical provider.

  • Name: A unique name for this region. trainML recommends that region names only contain characters allowed in DNS hostnames.
  • Storage Mode: Select Local if the region will only contain stand-alone compute nodes. Select Central if you have purchased a central storage controller from trainML. Central is required to use Device nodes.
  • Allow Public Jobs: If enabled, CloudBender will use the spare capacity of your nodes in this region to service other customers' jobs. When your nodes run jobs for other customers, your account will accumulate credits based on the job credit calculation in exchange for the use of your resources. If disabled, only projects owned by your account will send jobs to the nodes in this region. Regardless of whether this setting is enabled, jobs in projects owned by your account will always prefer to run on the nodes in your regions.

GCP Regions

Currently, only ten Google Cloud regions have most of the common GPU types, so these are the only regions you can configure with CloudBender. To add a new region to CloudBender, select the Create Region button from the actions menu of the Google Cloud provider. The create region form will query the region quotas in your GCP project. If this process fails, ensure the service account you used when creating the provider has the correct permissions.

Select a region from the list under Region Name. When a region is selected, you will see the quotas relevant to setting up GPU nodes. The quotas show both the limit and the amount of the limit currently used. If the region you selected does not have sufficient quota, request a quota increase from the GCP console and try again after the increase has been approved.

info

Currently, only On-Demand instances are supported, but Preemptible and Committed are coming soon.

The number of runnable jobs in a region is the minimum of:

  1. the available GPU quota for that type
  2. the CPU quota divided by 6 for P100, 8 for V100/T4, or 12 for A100
  3. the local SSD quota (in GB) divided by 375

For example, if you have a V100 quota of 8, but a Local SSD or CPU quota of 0, you cannot run any V100 jobs. If you have a very high Local SSD and CPU quota, but a V100 quota of 1, you can only run 1 job with a single V100. If your region runs out of quota, your job will either fail or run in another CloudBender location, depending on the project configuration.
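A rough sketch of that calculation, using the ratios listed above (the quota values passed in are illustrative placeholders):

# CPUs required per GPU, by GPU type
CPUS_PER_GPU = {"P100": 6, "V100": 8, "T4": 8, "A100": 12}

def max_runnable_jobs(gpu_type, gpu_quota, cpu_quota, local_ssd_gb_quota):
    # The number of runnable jobs is the minimum of the three quota-derived limits.
    return min(
        gpu_quota,                            # available GPUs of that type
        cpu_quota // CPUS_PER_GPU[gpu_type],  # CPU quota divided by the per-GPU ratio
        local_ssd_gb_quota // 375,            # local SSD quota (GB) divided by 375
    )

# High CPU and local SSD quotas but a V100 quota of 1 -> only 1 single-V100 job
print(max_runnable_jobs("V100", 1, 96, 3000))   # 1
# V100 quota of 8 but no local SSD quota -> no V100 jobs can run
print(max_runnable_jobs("V100", 8, 96, 0))      # 0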

All GCP regions require a storage node to cache the jobs, models, checkpoints, and datasets locally in the region. Select the amount of storage you expect to require in this region from the Storage Class field. Since the storage node will run continuously no matter how many jobs are running, trainML recommends that you purchase committed use discounts for the instance and storage once you are confident in your deployment requirements.

When you submit the new region form, CloudBender will automatically provision the storage node of the selected size. Once the storage node is ready, you can begin creating jobs in the region.

Nodes

The actual systems that run CloudBender workloads are considered Nodes. Nodes are categorized by the function they serve. Compute Nodes are systems with attached GPUs and actually run the jobs. Storage Nodes cache the job, model, checkpoint, and dataset data from the CloudBender persistence layer into their region to accelerate job provisioning.

Compute

Compute nodes are the GPU resources managed by CloudBender. In the case of the Physical provider, each GPU-enabled server you connect to CloudBender is a compute node. For cloud providers, jobs can create ephemeral compute nodes (using On-Demand or Preemptible GPU capacity) or can run on permanent compute nodes that use committed/reserved instances to save money. Once a node is connected to CloudBender, it can no longer be accessed outside of the trainML interface. CloudBender secures the system down to the boot process to ensure no unauthorized data access is possible.

Cloud compute nodes require no additional configuration. However, physical systems must meet certain requirements based on the desired regional configuration and you must perform some manual steps as part of the node provisioning process.

Hardware Requirements

The following requirements must be satisfied for all systems managed by CloudBender:

  • All GPUs in the system must have the same model type (e.g. A100, RTX 3090).
  • 2+ NVMe drives with a minimum total size of 1 TB per 2 GPUs (e.g. 1 TB for a 1 GPU system, 1 TB for a 2 GPU system, 2 TB for a 3 GPU system, etc.)
  • 2.0 compatible TPM installed (or 2.0 compatible fTPM enabled)

The following additional requirements must be satisfied for systems to sell their spare capacity to the network:

  • The total system memory must be at least 2x the total GPU VRAM. For example, if the system has four 40 GB A100s, the system must have at least 320 GB of RAM.
  • At least 8x PCIe lanes for every GPU (not applicable for SXM connected GPUs)
  • At least 2 cores per GPU for PCIe 3.0 GPUs or 4 cores per GPU for PCIe 4.0 GPUs.

If your region is using the local storage mode (the default for physical regions), the system must also have at least three 2 TB SATA drives (SSD or 7200 RPM HDD). If your region is using central storage mode, the system must have at least a 10 Gb Ethernet connection; 100 Gb is preferred.

Physical Server Provisioning

To add a physical server as a CloudBender compute node, contact us to request a new boot drive and allow us to validate your request. Once your request is approved, we will ship a USB drive to your office. When you receive the drive, prepare the system by performing the following activities in your server's BIOS configuration.

  • Ensure the TPM is enabled and clear it
  • Disable CSM (Compatibility support mode)
  • Disable Secure Boot (NVIDIA systems only)

Refer to your motherboard manual on how to perform the above tasks.

Plug the USB drive into the system and reboot. Once the system boots, it will detect that no trainML Operating System is installed and prompt you to install. Type 'yes' to continue. After a few minutes, installation will complete and you will be prompted to restart the system to continue configuration. The system will restart at least one time during the subsequent configuration process.

warning

Never remove the USB drive from the system once the operating system is installed. The system will not be able to boot without it.

The system is ready to be added once you see only the trainML banner with a print out of the trainML minion_id and some other diagnostic information. Write down the minion_id and go to the CloudBender Resources Dashboard.

tip

If the region is configured to use the Central storage mode, you must add the storage node before you can add compute nodes.

Select View Region from the action menu of the new node's physical region. Once on the region dashboard, click the Add button on the Compute Nodes grid. Enter a name for the new node and the minion id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled. If you want the system to mine cryptocurrency while the GPUs are idle, set mining to Enabled. Cryptocurrency mining requires that you configure valid wallet addresses in your account. Click Submit to add the new node.

You will be navigated to the region's node dashboard. The new node will finalize its provisioning process, and may restart again. When the node is ready, it will be in maintenance mode until you are ready to activate it. To activate the new node, select it from the list and click Toggle Maintenance. The status will change to Active and the node is now ready to service jobs.

Devices

Devices are edge inference nodes that contain a system-on-chip (SOC) accelerator and permanently run a single, always-on inference model. Currently supported devices include:

  • NVIDIA Jetson AGX Xavier
  • NVIDIA Jetson Xavier NX
  • NVIDIA Jetson AGX Orin
  • NVIDIA Jetson Orin NX
  • NVIDIA Jetson Orin Nano

CloudBender compatible devices can be purchased through trainML. Contact us for a quote.

caution

Devices are only supported in Physical regions configured with a centralized storage node. If you already have compute nodes in a region configured with local storage, you must create a new region to add devices.

Obtain the device minion_id from the sticker on the device or by attaching a display to the device. Select View Region from the action menu of the new device's physical region. Once on the region dashboard, click the Add button on the Devices grid. Enter a name for the new device and the minion id, then click Submit.

You will be navigated back to the region's dashboard. The new device will finalize its provisioning process, and may restart again. When the process is complete, the device will automatically enter the Active state. Once it is active, you must set the desired device configuration to run the inference model. Once you have created a Device Configuration, select Set Device Config from the action menu. Select the desired device configuration and click Select. Once the configuration is set, deploy the inference model to the device by selecting it on the grid and clicking Deploy Config on the toolbar, or select Deploy Latest Config from the device action menu.

When the deployment is complete, the Inference Status will show running, and the configuration status will indicate the date it was last deployed. While running, the inference job will have access to the SoC accelerator and any video or media devices plugged into the device.

Device Configurations

Device Configurations allow you to share the inference model configuration across many devices in a region. Device Configurations are region specific because they allow you to integrate the device with regional resources like datastores. Multiple devices can use the same configuration, but a single device can only run one configuration at a time.

To add a new Device Configuration, click the Manage Device Configs button on the Devices toolbar. Click Add New to create a new configuration.

Configuration Name (required): A unique name for this device configuration. Names must be unique within regions but can be reused across regions.

Image Name (required): The docker image to run as the inference container. The same repositories are supported as when using a Customer Provided Image for a job environment.

Model (required): Select the trainML Model containing the code that will run the inference task.

Checkpoints: Add any trainML Checkpoints that the model's inference code requires.

Start Command (required): The command to run in the model code's root directory to start running the inference task.

tip

Device job commands must be designed to run continuously in the foreground (see the sketch after the configuration fields below). The inference task will be restarted if the device is rebooted, but will not automatically restart if the command itself exits.

Datastore: Select a regional datastore to mount to the TRAINML_OUTPUT_PATH location of the running container to read or save private data.

Datastore Path: The subdirectory inside the regional datastore to mount.

Environment Variables: Configure any environment variables to set in the inference container.

Attached Keys: Add any Third-Party Key credentials to the inference container required for successful code execution.
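As an illustration of the requirement in the tip above, the start command could point at a script like the following hypothetical sketch, which loops forever in the foreground rather than exiting after a single pass (TRAINML_OUTPUT_PATH is the datastore mount described above; the inference step itself is a placeholder):

import os
import time

OUTPUT_PATH = os.environ.get("TRAINML_OUTPUT_PATH", "/output")  # mounted datastore, if configured

def run_inference_once():
    # Placeholder: read from an attached camera or media device, run the model,
    # and write results under OUTPUT_PATH.
    pass

if __name__ == "__main__":
    while True:          # keep the process in the foreground indefinitely
        run_inference_once()
        time.sleep(1)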

Storage

Storage nodes are mandatory for all cloud provider regions. Contact us if you would like to use a centralized storage node in a physical region to free up the compute node's local disks.

Configuration of the storage node is automatic in cloud provider regions. To add a storage node to a Physical region, ensure the region is configured with Storage Mode set to Central. Attach a display to the storage node and connect it to the network. Once it has started, you should see the trainML banner on the screen, which displays the trainML minion_id and some other diagnostic data. Write down the minion_id and go to the CloudBender Resources Dashboard.

Select View Region from the action menu of the new node's physical region. Once on the region dashboard, click the Add button on the Storage Nodes grid. Enter a name for the new node and the minion id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled.

info

Only one storage node per region is currently supported. If you already have a storage node configured in a region, it must be removed before a new one can be added. Alternatively, you can replace the existing storage node with the new one by selecting Replace Node from the actions menu.

You will be navigated back to the region's dashboard. The new node will finalize its provisioning process, and may restart again. When the process is complete, the node will automatically enter the Active state, and you can begin to add compute nodes or devices to the region.

Regional Resources

info

Regional resources require a Team or higher feature plan.

Regional resources allow trainML jobs to utilize data and provide services in a specific region. If you attach a job to any regional resource, the resource reservation system will ensure that the job will only start in that specific region.

tip

Regional resources are only accessible from compute nodes or devices. Ensure that at least one is configured before adding regional resources.

caution

Since using a regional resource will constrain the available GPUs to only those that exist in that region, be sure the job specification requests a GPU type that exists in the region or the job will never start.

Datastores

Datastores allow jobs to connect to data storage infrastructure that is local to a region. This avoids having to upload the data as a trainML Dataset, Checkpoint, or as input data to an Inference Job. Datastores can also be used as the output data location for training/inference jobs or mounted to Notebook and Endpoint jobs to provide additional scratch space or access to additional data at runtime. Datastores are ideal for data that is too large or too sensitive to be uploaded to the trainML platform.

To add a datastore, select View Region from the action menu of the new datastore's region. Once on the region dashboard, click the Add button on the Datastores toolbar.

Name: A unique name for this datastore.

Type: The type of datastore. Different datastore types require different configuration options.

Address/URI: The hostname, network name, or IP address of the datastore server.

Root Path: The directory on the server to act as the root for this datastore. Use "/" to enable access to the entire datastore. If you specify a subdirectory as the datastore root path, subsequent jobs, datasets, and checkpoints will not be able to access data above that subdirectory and their path specification will be relative to this root path. For example, if the datastore exposes the following directories:

- dir1
  - subdir1
- dir2
  - subdir2
  - subdir3

Setting /dir2 as the Root Path will prevent access to dir1 or any of its subdirectories. When configuring a job to mount subdir2 as the input data, specify /subdir2 as the input path, NOT /dir2/subdir2. The datastore system automatically concatenates the datastore root path with the requested subdirectory path.

Username (SMB Only): The username to connect to the datastore with.

Password (SMB Only): The password for the user.

NFS Share Requirements

The default user that containers run as is root. Since NFS passes through the user ID for all operations, all file access on an NFS datastore will appear to come from the NFS client's root user. Because allowing the client's root user to pass through to the server is a significant security risk, trainML recommends that all NFS exports used as trainML datastores enable root_squash to convert client IDs to the anonymous ID. When this is enabled, if the NFS share is used for output data, you must set the anonuid and anongid settings on the export to a user/group that has write access to the exported directory. Because of this, trainML recommends not reusing existing exports for trainML job output locations, but creating a dedicated export with these settings.

Reservations

Reservations allow jobs to host local applications or services that are expected to be available at a specific hostname and port. This enables multiple local applications to be deployed on the same compute rig by fixing each application to a dedicated port, as well as allows the same application to transparently migrate between multiple compute rigs by dynamically updating the dedicated local hostname.

To add a reservation, select View Region from the action menu of the new reservation's region. Once on the region dashboard, click the Add button on the Reservations grid.

Name: A unique name for this reservation.

Type: Only port reservations are currently supported.

Hostname: The hostname to publish with mDNS.

Port: The port the endpoint should listen on.

tip

Dynamic hostname publishing is currently only supported with mDNS, which is a link-local protocol. If the systems accessing the endpoint are on a different IP subnet, you will need an mDNS Gateway for the hostname to resolve properly.
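For example, the job attached to a port reservation simply needs to listen on the reserved port; a minimal, generic sketch follows (the port value is a placeholder, and nothing here is trainML-specific):

from http.server import BaseHTTPRequestHandler, HTTPServer

RESERVED_PORT = 8080  # placeholder: the port configured in the reservation

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

# Listen on all interfaces; clients reach the service through the mDNS
# hostname published for this reservation.
HTTPServer(("0.0.0.0", RESERVED_PORT), Handler).serve_forever()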

Billing and Credits

Usage Fee

Running jobs on your own resources incurs an hourly usage fee that varies based on the type of GPU. The usage fees are as follows:

  • Development Only GPUs (GeForce, Radeon) - 0.05 credits/hr
  • Professional GPUs (Quadro/RTX A6000, Radeon Pro) - 0.10 credits/hr
  • Datacenter GPUs (Tesla, Radeon Instinct) - 0.25 credits/hr

Federated Training/Inference Fee

Cross-project Federated jobs incur an additional hourly fee that is the same regardless of the GPU type or the number of GPUs per worker.

  • Federated Inference (Inference Jobs, Endpoints) - 1.85 credits/hr
  • Federated Training (Training Jobs) - 2.65 credits/hr

Unlike the usage fee, which is charged to the owner of the project, this fee is charged to the job creator.

Device Fee

Each device configured in CloudBender incurs a fixed fee of 15 credits per month. The fee calculation begins as soon as the device is added to CloudBender and continues until it is removed; offline devices still incur the monthly fee. The fee is calculated and charged daily based on the total number of devices configured that day.

For example, if you have 10 devices configured in CloudBender, you will be charged 5 (15 * 10 / 30) credits per day in a month with 30 days. If you start a day with 10 devices and add 5 more during the day, you will be charged 7.5 (15 * 15 / 30) credits, no matter what time of day you add the devices. If you start the day with 10 devices and remove 5 devices during the day, you will be charged 5 (15 * 10 / 30) credits on that day, and 2.5 (15 * 5 / 30) credits on subsequent days. If you start a day with 10 devices and add and remove a device 5 times, you will be charged 7.5 (15 * 15 / 30) credits that day, since each addition represents a unique device, and 5 credits on subsequent days (since the added devices were all removed by the end of the day).
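In other words, the daily charge is 15 credits divided by the number of days in the month, multiplied by the number of unique devices configured at any point that day; a short sketch of the examples above:

DEVICE_FEE_PER_MONTH = 15  # credits per device per month

def daily_device_charge(devices_configured_today, days_in_month=30):
    # Charged each day for every device configured at any point during that day.
    return DEVICE_FEE_PER_MONTH * devices_configured_today / days_in_month

print(daily_device_charge(10))  # 5.0 credits: 10 devices all day
print(daily_device_charge(15))  # 7.5 credits: 10 existing devices plus 5 added (or added and removed) that day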

Earning Credits

If you configure your CloudBender region to allow other customers to consume your spare resources, you can earn credits that can be used to offset your own trainML resource costs or paid out in cash. When you service another customer's job, you receive a credit at the end of each hour the job has been running. When a job stops, you receive the credit for the partial time the job ran. The customer pays the full hourly price for the GPU. From this, trainML deducts the standard usage fee based on the GPU type, and splits the remaining credits evenly with you.

For example, if you add a system with A100 GPUs, other trainML customers will pay 2.78 credits per hour per GPU. At the end of each hour a job runs, you will receive 1.265 credits ((2.78 - 0.25) / 2) per GPU.
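The per-GPU payout can be expressed directly from that description; a small sketch using the example rates above:

def hourly_payout_per_gpu(customer_rate, usage_fee):
    # trainML deducts the standard usage fee, then splits the remainder evenly.
    return (customer_rate - usage_fee) / 2

# A100 example: customers pay 2.78 credits/hr per GPU, datacenter usage fee is 0.25
print(round(hourly_payout_per_gpu(2.78, 0.25), 3))  # 1.265 credits per GPU per hour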

Payment Modes

Credits - All accumulated credits will be refunded to your billing account every 24 hours. There is no fee associated with this transfer.

Stripe (coming soon) - Receive cash payouts of your credit balance through Stripe Connect.