CloudBender™ lets you connect your on-prem and cloud GPUs to the trainML platform and seamlessly run jobs on any CloudBender enabled system. When you start a notebook or submit a job, CloudBender will automatically select the lowest cost available resource that meets your hardware, cost, data, and security specifications.
CloudBender currently supports two different infrastructure providers, Physical and Google Cloud (GCP).
The Physical provider is used for any on-premise equipment, whether that be a large DGX server or a GPU-enabled AI workstation. The Physical provider has the lowest cost per GPU hour for on-demand jobs. Physical systems must meet specific hardware requirements and must run the trainML operating system.
To enable the Physical provider, click Enable Provider in the actions menu of the Providers list, or select Physical (On-Prem) from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and click Submit.
The Google Cloud provider allows you to run trainML workloads in your GCP account without the learning curve of becoming a Google Cloud Architect. When combined with using the physical provider for managing your on-prem systems, CloudBender enables a truly seamless "own the base, rent the burst" architecture. Even if you don't have physical resources now, using CloudBender to orchestrate your cloud resources makes transitioning to physical resources later effortless.
CloudBender will create the majority of the resources in your GCP account, but you must create the GCP account, the project, the service account, and the service account IAM permissions.
Only paid Cloud Billing accounts are permitted to attach GPUs to instances. If your account is new, it starts in the free tier by default. You must upgrade your GCP account to use CloudBender. If your account is managed by your organization, no change is required.
trainML recommends that you create a project that only trainML will use to avoid issues of quota management, resource conflicts, or security permissions. Once the project is created, enable the following APIs in the project:
Once these APIs are enabled, additional IAM permissions must be granted to some of the service accounts. On the IAM page, configure the following additional permissions.
- Add "Service Account User" to the Compute Engine default service account. The account name will be of the format
- If your project was created prior to April 8, 2021, add "Service Account Token Creator" to the Cloud Pub/Sub Service Agent account. The account name will be of the format service-<project_id>@gcp-sa-pubsub.iam.gserviceaccount.com. You may have to check the Include Google-provided role grants checkbox to see this account. See more details here.
- Add "Pub/Sub Publisher" to the global cloud logs service account email@example.com. Because this account will not show up in the principals list, click the Add button near the top of the page and paste the account into the New principals box, then configure the permissions as normal.
If you do not see these accounts in your project, ensure that you have first enabled the required APIs above.
Next, create a service account in the project and grant it the following roles:
- Compute Admin
- Pub/Sub Admin
- Logs Configuration Writer
- Service Account User
- Billing Account Viewer (optional)
Once the account is created, create new service account keys. Select JSON as the Key Type, and the credentials JSON file should download automatically. This file will be used in the next step.
To enable the Google Cloud provider, click Enable Provider in the actions menu of the Providers list, or select Google Cloud from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and upload the JSON file from the previous step. If preferred, you can disable Upload JSON File and paste the contents of the key file from the previous step into the text area.
If the file is in the correct format, the system will automatically extract the correct project ID and service account email address from the file and populate the fields. Once the form is complete, click Submit to enable the provider.
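As a rough illustration of the extraction step described above, the sketch below reads a service account key and pulls out the project ID and service account email. The field names (`type`, `project_id`, `client_email`, `private_key`) are the ones found in JSON keys downloaded from Google Cloud; the function name and the validation rules are assumptions made for this example and are not trainML's actual implementation.

```python
import json

# Fields expected in a Google Cloud service account JSON key.
REQUIRED_FIELDS = ("type", "project_id", "client_email", "private_key")


def read_key_summary(key_text):
    """Return (project_id, client_email) from the text of a JSON key file.

    Hypothetical helper: raises ValueError if the text does not look like
    a service account key.
    """
    key = json.loads(key_text)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {', '.join(missing)}")
    if key["type"] != "service_account":
        raise ValueError("key file is not a service account key")
    return key["project_id"], key["client_email"]
```

Running a check like this locally before uploading can help catch a wrong file (for example, an OAuth client secret instead of a service account key).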
CloudBender will begin creating some project-level resources necessary for it to manage resources in your project. Once this process is complete, the Create Region button will appear.
All providers are organized by regions, which represent the physical location of the resources. For cloud providers, you can create one trainML region per cloud region. For the physical provider, you should have at least one region per physical site, but are not limited to one.
Regions of the physical provider are designed to organize resources based on physical proximity. If you have multiple data centers, even if they are connected over a high-speed WAN link, you should create a region for each data center. All nodes within the same region must be able to communicate with each other using a private (RFC 1918) address. You can create multiple regions for the same data center if you wish to apply different access control policies to different nodes. For example, if you have some nodes you want dedicated to your organization and others that you want to earn credits by selling spare compute cycles to other trainML customers, you can create two different regions to control that.
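To check whether a node's address falls in one of the private RFC 1918 ranges mentioned above, a small sketch using Python's standard `ipaddress` module is shown below. The function name is hypothetical. The three networks are the ones defined by RFC 1918; the check is explicit rather than relying on `is_private`, which also matches loopback and link-local ranges.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address string is inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)
```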
To create a new region, select Create Region from the actions menu of the Physical provider.
Name: A unique name for this region. trainML recommends that region names only contain characters allowed in DNS hostnames.
Storage Mode: Select Local if the region will only contain stand-alone compute nodes. Select Central if you have purchased a central storage controller from trainML. Central is required to use Device nodes.
Allow Public Jobs: If enabled, CloudBender will use the spare capacity of your nodes in this region to service other customers' jobs. When your node runs jobs for other customers, your account will accumulate credits based on the job credit calculation in exchange for the use of your resources. If disabled, only projects that are owned by your account will send jobs to the nodes in this region. Regardless of whether this setting is enabled, jobs in projects owned by your account will always prefer to run on the nodes in your regions.
Currently, only ten Google Cloud regions have most of the common GPU types, so these are the only regions you can configure with CloudBender. To add a new region to CloudBender, select Create Region from the actions menu of the Google Cloud provider. The create region form will query the region quotas in your GCP project. If this process fails, ensure the service account you used when creating the provider has the correct permissions.
Select a region from the list under Region Name. When a region is selected, you will see the quotas relevant to setting up GPU nodes. The quotas show both the limit and the amount of the limit currently used. If the region you selected does not have sufficient quota, request a quota increase from the GCP console and try again after the increase has been approved.
On-Demand instances are supported; Committed instances are coming soon.
The number of runnable jobs in a region is the minimum of:
- the available GPUs of that type
- CPUs divided by 6 for P100, 8 for V100/T4, or 12 for A100
- local SSD / 375
For example, if you have a V100 quota of 8, but a Local SSD or CPU quota of 0, you cannot run any V100 jobs. If you have a very high Local SSD and CPU quota, but a V100 quota of 1, you can only run 1 job with a single V100. If your region runs out of quota, your job will either fail or run in another CloudBender location, depending on the project configuration.
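The quota rule above can be sketched as a small function. This is an illustration only: the function and constant names are hypothetical, Local SSD quota is assumed to be expressed in GB, and fractional results are assumed to round down, since a partial allocation cannot run a job.

```python
# CPUs required per GPU, taken from the list above.
CPUS_PER_GPU = {"P100": 6, "V100": 8, "T4": 8, "A100": 12}

# Local SSD required per job (assumed to be in GB).
LOCAL_SSD_GB_PER_JOB = 375


def runnable_jobs(gpu_type, gpu_quota, cpu_quota, local_ssd_gb):
    """Return how many single-GPU jobs the region's quotas allow."""
    return min(
        gpu_quota,
        cpu_quota // CPUS_PER_GPU[gpu_type],
        local_ssd_gb // LOCAL_SSD_GB_PER_JOB,
    )
```

For instance, `runnable_jobs("V100", 8, 0, 0)` gives 0, matching the first case in the example above, because the CPU and Local SSD quotas are the limiting factors.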
All GCP regions require a storage node to cache the jobs, models, checkpoints, and datasets locally in the region. Select the amount of storage you expect to require in this region from the Storage Class field. Since the storage node will run continuously no matter how many jobs are running, trainML recommends that you purchase committed use discounts for the instance and storage once you are confident in your deployment requirements.
When you submit the new region form, CloudBender will automatically provision the storage node of the selected size. Once the storage node is ready, you can begin creating jobs in the region.
The actual systems that run CloudBender workloads are considered Nodes. Nodes are categorized by the function they serve. Compute Nodes are systems with attached GPUs and actually run the jobs. Storage Nodes cache the job, model, checkpoint, and dataset data from the CloudBender persistence layer into their region to accelerate job provisioning.
Compute nodes are the GPU resources managed by CloudBender. In the case of the Physical provider, each GPU-enabled server you connect to CloudBender is a compute node. For cloud providers, jobs can create ephemeral compute nodes (using On-Demand or Preemptible GPU capacity) or can run on permanent compute nodes that use committed/reserved instances to save money. Once a node is connected to CloudBender, it can no longer be accessed outside of the trainML interface. CloudBender secures the system down to the boot process to ensure no unauthorized data access is possible.
Cloud compute nodes require no additional configuration. However, physical systems must meet certain requirements based on the desired regional configuration and you must perform some manual steps as part of the node provisioning process.
The following requirements must be satisfied for all systems managed by CloudBender:
- All GPUs in the system must have the same model type (e.g. A100, RTX 3090).
- 2+ NVMe drives with a minimum total size of 1 TB per 2 GPUs (e.g. 1 TB for a 1 GPU system, 1 TB for a 2 GPU system, 2 TB for a 3 GPU system, etc.)
- 2.0 compatible TPM installed (or 2.0 compatible fTPM enabled)
The following additional requirements must be satisfied for systems to sell their spare capacity to the network:
- The total system memory must be at least 2x the total GPU VRAM. For example, if the system has four 40 GB A100s, the system must have at least 320 GB of RAM.
- At least 8x PCIe lanes for every GPU (not applicable for SXM connected GPUs)
- At least 2 cores per GPU for PCIe 3.0 GPUs or 4 cores per GPU for PCIe 4.0 GPUs.
If your region is using the local storage mode (the default for physical regions), the system must also have at least three 2 TB SATA drives (SSD or 7200 RPM HDD). If your region is using central storage mode, the system must use at least a 10 Gb Ethernet connection (100 Gb preferred).
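The sizing rules above reduce to simple arithmetic. The sketch below is one way to compute the minimums for a planned system; the function names are hypothetical, and the NVMe rule is assumed to round up to the next whole terabyte, as the examples in the list suggest.

```python
import math

# Minimum CPU cores per GPU, keyed by PCIe generation (from the list above).
CORES_PER_GPU = {3: 2, 4: 4}


def min_nvme_tb(gpu_count):
    """Minimum total NVMe capacity in TB: 1 TB per 2 GPUs, rounded up."""
    return math.ceil(gpu_count / 2)


def min_ram_gb(gpu_count, vram_gb_per_gpu):
    """Minimum system RAM in GB to sell spare capacity: 2x total VRAM."""
    return 2 * gpu_count * vram_gb_per_gpu


def min_cpu_cores(gpu_count, pcie_generation):
    """Minimum CPU cores to sell spare capacity."""
    return gpu_count * CORES_PER_GPU[pcie_generation]
```

With these helpers, a system with four 40 GB A100s needs `min_ram_gb(4, 40)`, or 320 GB of RAM, which agrees with the example in the requirements list.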
Physical Server Provisioning
To add a physical server as a CloudBender compute node, contact us to request a new boot drive and allow us to validate your request. Once your request is approved, we will ship a USB drive to your office. When you receive the drive, prepare the system by performing the following activities in your server's BIOS configuration.
- Ensure the TPM is enabled and clear it
- Disable CSM (Compatibility Support Module)
- Disable Secure Boot (NVIDIA systems only)
Refer to your motherboard manual on how to perform the above tasks.
Plug the USB drive into the system and reboot. Once the system boots, it will detect that no trainML Operating System is installed and prompt you to install. Type 'yes' to continue. After a few minutes, installation will complete and you will be prompted to restart the system to continue configuration. The system will restart at least one time during the subsequent configuration process.
Never remove the USB drive from the system once the operating system is installed. The system will not be able to boot without it.
The system is ready to be added once you see only the trainML banner with a print out of the trainML minion_id and some other diagnostic information. Write down the minion_id and go to the CloudBender Resources Dashboard.
If the region is configured to use the Central storage mode, you must add the storage node before you can add compute nodes.
Select View Region from the action menu of the new node's physical region. Once on the region dashboard, click the Add button on the Compute Nodes grid. Enter a name for the new node and the minion id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled. If you want the system to mine cryptocurrency while the GPUs are idle, set mining to Enabled. Cryptocurrency mining requires that you configure valid wallet addresses in your account. Click Submit to add the new node.
You will be navigated to the region's node dashboard. The new node will finalize its provisioning process, and may restart again. When the node is ready, it will be in maintenance mode until you are ready to activate it. To activate the new node, select it from the list and click Toggle Maintenance. The status will change to Active and the node is now ready to service jobs.
Devices are edge inference nodes that contain a system-on-chip (SOC) accelerator and permanently run a single, always-on inference model. Currently supported devices include:
- NVIDIA Jetson AGX Xavier
- NVIDIA Jetson Xavier NX
- NVIDIA Jetson AGX Orin
- NVIDIA Jetson Orin NX
- NVIDIA Jetson Orin Nano
CloudBender compatible devices can be purchased through trainML. Contact us for a quote.
Devices are only supported in Physical regions configured with a centralized storage node. If you already have compute nodes in a region configured with local storage, you must create a new region to add devices.
Obtain the device minion_id from the sticker on the device or by attaching a display to the device. Select View Region from the action menu of the new device's physical region. Once on the region dashboard, click the Add button on the Devices grid. Enter a name for the new device and the minion id, and click Submit.
You will be navigated back to the region's dashboard. The new device will finalize its provisioning process, and may restart again. When the process is complete, the device will automatically enter the Active state. Once it is active, you must set the desired device configuration to run the inference model. Once you have created a Device Configuration, select Set Device Config from the action menu. Select the desired device configuration and click Select. Once the configuration is set, deploy the inference model to the device by selecting it on the grid and clicking Deploy Config on the toolbar, or select Deploy Latest Config from the device action menu.
When the deployment is complete, the Inference Status will show running, and the configuration status will indicate the last date it was deployed. While running, the inference job will have access to the SoC accelerator and any media devices plugged into the device.
Device Configurations allow you to share the inference model configuration across many devices in a region. Device Configurations are region specific because they allow you to integrate the device with regional resources like datastores. Multiple devices can use the same configuration, but a single device can only run one configuration at a time.
To add a new Device Configuration, click the Manage Device Configs button on the Devices toolbar. Click Add New to create a new configuration.
Configuration Name (required): A unique name for this device configuration. Names must be unique within regions but can be reused across regions.
Image Name (required): The docker image to run as the inference container. The same repositories are supported as when using a Customer Provided Image for a job environment.
Model (required): Select the trainML Model containing the code that will run the inference task.
Checkpoints: Add any trainML Checkpoints that the model's inference code requires.
Start Command (required): The command to run in the model code's root directory to start running the inference task.
Device job commands must be designed to run continuously in the foreground. The inference task will be restarted if the device is rebooted, but will not automatically restart if the command itself exits.
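Because the task is not restarted when the command exits, one common pattern is a foreground loop that logs per-iteration failures instead of letting them end the process. The sketch below is a generic illustration, not a trainML API: `handle_next` and `should_stop` are hypothetical callables you would supply, and a real start command would typically loop until the container is stopped.

```python
import time


def run_in_foreground(handle_next, should_stop, idle_seconds=1.0):
    """Call handle_next() repeatedly until should_stop() returns True.

    Exceptions from a single iteration are logged and counted so that
    one bad input does not cause the whole command to exit.
    """
    failures = 0
    while not should_stop():
        try:
            handle_next()
        except Exception as exc:
            failures += 1
            print(f"inference step failed: {exc!r}")
        time.sleep(idle_seconds)
    return failures
```

A script that ends with something like `run_in_foreground(process_frame, lambda: False)` (where `process_frame` is your own function) stays in the foreground indefinitely.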
Datastore: Select a regional datastore to mount to the TRAINML_OUTPUT_PATH location of the running container to read or save private data.
Datastore Path: The subdirectory inside the regional datastore to mount.
Environment Variable: Configure any environment variables to set in the inference container.
Attached Keys: Add any Third-Party Key credentials to the inference container required for successful code execution.
Storage nodes are mandatory for all cloud provider regions. Contact us if you would like to use a centralized storage node in a physical region to free up the compute node's local disks.
Configuration of the storage node is automatic in cloud provider regions. To add a storage node to a Physical region, ensure the region is configured with Storage Mode set to Central. Attach a display to the storage node and connect it to the network. Once it has started, you should see the trainML banner on the screen, which displays the trainML minion_id and some other diagnostic data. Write down the minion_id and go to the CloudBender Resources Dashboard.
Select View Region from the action menu of the new node's physical region. Once on the region dashboard, click the Add button on the Storage Nodes grid. Enter a name for the new node and the minion id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled.
Only one storage node per region is currently supported. If you already have a storage node configured in a region, it must be removed before a new one can be added. Alternatively, you can replace the existing storage node with the new one by selecting Replace Node from the actions menu.
You will be navigated back to the region's dashboard. The new node will finalize its provisioning process, and may restart again. When the process is complete, the node will automatically enter the Active state, and you can begin to add compute nodes or devices to the region.
Regional resources require a Team or higher feature plan.
Regional resources allow trainML jobs to utilize data and provide services in a specific region. If you attach a job to any regional resource, the resource reservation system will ensure that the job will only start in that specific region.
Since using a regional resource will constrain the available GPUs to only those that exist in that region, be sure the job specification requests a GPU type that exists in the region or the job will never start.
Datastores allow jobs to connect to data storage infrastructure that is local to a region. This avoids having to upload the data as a trainML Dataset, Checkpoint, or as input data to an Inference Job. Datastores can also be used as the output data location for training/inference jobs or mounted to Notebook and Endpoint jobs to provide additional scratch space or access to additional data at runtime. Datastores are ideal for data that is too large or too sensitive to be uploaded to the trainML platform.
To add a datastore, select View Region from the action menu of the new datastore's region. Once on the region dashboard, click the Add button on the Datastores toolbar.
Name: A unique name for this datastore.
Type: The type of datastore. Different datastore types require different configuration options.
NFS: Select if adding an NFS server. See additional requirements for the required server export configuration.
SMB/CIFS: Select if adding a Windows or Samba file share server.
Address/URI: The hostname, network name, or IP address of the datastore server.
Root Path: The directory on the server to act as the root for this datastore. Use "/" to enable access to the entire datastore. If you specify a subdirectory as the datastore root path, subsequent jobs, datasets, and checkpoints will not be able to access data above that subdirectory and their path specification will be relative to this root path. For example, if the datastore exposes the top-level directories dir1 and dir2, and dir2 contains subdir2, setting /dir2 as the Root Path will prevent access to dir1 or any of its subdirectories. When configuring a job to mount subdir2 as the input data, specify /subdir2 as the input path, NOT /dir2/subdir2. The datastore system automatically concatenates the datastore root path with the requested subdirectory path.
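The root path behavior described above can be sketched as follows. This illustrates the concatenation rule only and is not trainML's implementation: the function name is hypothetical, and rejecting paths that climb above the root with ".." is an assumption about how such a rule would be enforced.

```python
import posixpath


def resolve_datastore_path(root_path, requested_path):
    """Join a job's requested path onto the datastore root path.

    The requested path is always treated as relative to the root, even
    if it starts with "/". Paths that would escape the root are rejected.
    """
    root = posixpath.normpath("/" + root_path.strip("/"))
    full = posixpath.normpath(posixpath.join(root, requested_path.lstrip("/")))
    if full != root and not full.startswith(root.rstrip("/") + "/"):
        raise ValueError(f"{requested_path!r} is outside the datastore root")
    return full
```

With a Root Path of `/dir2`, requesting `/subdir2` resolves to `/dir2/subdir2` on the server, whereas requesting `/dir2/subdir2` would resolve to `/dir2/dir2/subdir2`, which is why the shorter form is the one to use.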
Username (SMB Only): The username to connect to the datastore with.
Password (SMB Only): The password for the user.
NFS Share Requirements
The default user that containers run as is root. Since NFS passes through the user ID for all operations, all file access actions on an NFS datastore will show up as the root user of the NFS client. Since allowing the client's root user to pass through to the server is a significant security risk, trainML recommends that all NFS exports used as trainML datastores enable root_squash to map the client's root ID to the anonymous ID. When this is enabled, if the NFS share is used for output data, you must set the anonuid and anongid settings on the export to a user/group that has write access to the exported directory. Because of this, trainML recommends not reusing existing exports for trainML job output locations, and instead creating a dedicated export with these settings.
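For a Linux NFS server, the recommendation above corresponds to an /etc/exports entry that combines the standard `root_squash`, `anonuid`, and `anongid` export options. The helper below simply renders such a line. The function name, the example path, client network, and IDs are placeholders, and including `rw` and `sync` is an assumption about a typical writable export.

```python
def nfs_export_line(path, client, anon_uid, anon_gid):
    """Render an /etc/exports entry that squashes root to a chosen user/group."""
    options = [
        "rw",                    # allow writes, needed for job output
        "sync",                  # commit writes before replying
        "root_squash",           # map client root to the anonymous ID
        f"anonuid={anon_uid}",   # anonymous user with write access to path
        f"anongid={anon_gid}",   # anonymous group with write access to path
    ]
    return f"{path} {client}({','.join(options)})"
```

Calling `nfs_export_line("/srv/trainml-output", "10.0.0.0/24", 1001, 1001)` produces a line that can be reviewed and added to the server's export configuration.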
Reservations allow jobs to host local applications or services that are expected to be available at a specific hostname and port. This enables multiple local applications to be deployed on the same compute rig by fixing each application to a dedicated port, and allows the same application to transparently migrate between multiple compute rigs by dynamically updating the dedicated local hostname.
To add a reservation, select View Region from the action menu of the new reservation's region. Once on the region dashboard, click the Add button on the Reservations grid.
Name: A unique name for this reservation.
Type: Only port reservations are currently supported.
Hostname: The hostname to publish with mDNS.
Port: The port the endpoint should listen on.
Dynamic hostname publishing is currently only supported with mDNS, which is a link-local protocol. If the systems accessing the endpoint are on a different IP subnet, you will need an mDNS Gateway for the hostname to resolve properly.
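A quick way to estimate whether a client will need an mDNS gateway is to check whether it shares an IP subnet with the compute node. The sketch below uses Python's standard `ipaddress` module; the function name is hypothetical, and sharing a subnet is used here only as a rough proxy for being on the same link.

```python
import ipaddress


def same_subnet(host_ip, client_ip, prefix_length):
    """Return True if both addresses fall in the host's subnet."""
    network = ipaddress.ip_network(f"{host_ip}/{prefix_length}", strict=False)
    return ipaddress.ip_address(client_ip) in network
```

If `same_subnet` returns False for a client, the published hostname is unlikely to resolve for that client without a gateway.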
Billing and Credits
Running jobs on your own resources incurs an hourly usage fee that varies based on the type of GPU. The usage fees are as follows:
- Development Only GPUs (GeForce, Radeon) - 0.05 credits/hr
- Professional GPUs (Quadro/RTX A6000, Radeon Pro) - 0.10 credits/hr
- Datacenter GPUs (Tesla, Radeon Instinct) - 0.25 credits/hr
Federated Training/Inference Fee
Cross-project Federated jobs incur an additional hourly fee that is irrespective of the GPU type or number of GPUs per worker.
- Federated Inference (Inference Jobs, Endpoints) - 1.85 credits/hr
- Federated Training (Training Jobs) - 2.65 credits/hr
Unlike the usage fee, which is charged to the owner of the project, this fee is charged to the job creator.
Devices incur a fixed fee of 15 credits per device per month, based on how many devices are configured in CloudBender. The fee calculation begins as soon as the device is added to CloudBender and continues until it is removed. Offline devices still incur the monthly fee. The fee is calculated and charged daily based on the total number of devices configured that day.
For example, if you have 10 devices configured in CloudBender, you will be charged 5 (15 * 10 / 30) credits per day in a month with 30 days. If you start a day with 10 devices and add 5 more during the day, you will be charged 7.5 (15 * 15 / 30) credits, no matter what time of day you add the devices. If you start the day with 10 devices and remove 5 devices during the day, you will be charged 5 (15 * 10 / 30) credits on that day, and 2.5 (15 * 5 / 30) credits on subsequent days. If you start a day with 10 devices and add and remove a device 5 times, you will be charged 7.5 (15 * 15 / 30) credits, since each addition represents a unique device, and 5 credits on subsequent days (since the added devices were all removed by the end of the day).
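The daily charge in these examples follows one formula: the monthly fee times the number of distinct devices configured that day, divided by the number of days in the month. A minimal sketch is shown below; the function name and parameters are hypothetical.

```python
import calendar

MONTHLY_FEE_PER_DEVICE = 15  # credits per device per month


def daily_device_fee(devices_at_start, devices_added, year, month):
    """Return the credits charged for one day.

    Every device configured at any point during the day counts, so
    devices removed during the day are still included in the total.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    devices_today = devices_at_start + devices_added
    return MONTHLY_FEE_PER_DEVICE * devices_today / days_in_month
```

For a 30-day month such as April 2023, `daily_device_fee(10, 5, 2023, 4)` gives 7.5, matching the second case above.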
If you configure your CloudBender region to allow other customers to consume your spare resources, you can earn credits that can be used to offset your own trainML resource costs or paid out in cash. When you service another customer's job, you receive a credit at the end of each hour the job has been running. When a job stops, you receive the credit for the partial time the job ran. The customer pays the full hourly price for the GPU. From this, trainML deducts the standard usage fee based on the GPU type, and splits the remaining credits evenly with you.
For example, if you add a system with A100 GPUs, other trainML customers will pay 2.78 credits per hour per GPU. At the end of each hour a job runs, you will receive 1.265 credits ((2.78 - 0.25) / 2) per GPU.
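The credit split can be written out as a short calculation. The sketch below uses `Decimal` to avoid floating-point rounding in credit amounts; the function name and the GPU class keys are hypothetical, while the fee values come from the usage fee list above.

```python
from decimal import Decimal

# Hourly usage fee by GPU class, from the usage fee list above.
USAGE_FEE_PER_HOUR = {
    "development": Decimal("0.05"),
    "professional": Decimal("0.10"),
    "datacenter": Decimal("0.25"),
}


def earned_credits(customer_price, gpu_class, gpu_count=1, hours="1"):
    """Return credits earned for servicing another customer's job.

    customer_price is the per-GPU hourly price the customer pays.
    The usage fee is deducted first and the remainder is split evenly.
    """
    net = Decimal(str(customer_price)) - USAGE_FEE_PER_HOUR[gpu_class]
    return net / 2 * gpu_count * Decimal(str(hours))
```

For the A100 example, `earned_credits("2.78", "datacenter")` returns `Decimal("1.265")` per GPU per hour; passing a fractional `hours` value covers the partial credit paid when a job stops.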
Credits - All accumulated credits will be refunded to your billing account every 24 hours. There is no fee associated with this transfer.
Stripe (coming soon) - Receive cash payouts of your credit balance through Stripe Connect