CloudBender™ lets you connect your on-prem and cloud GPUs to the trainML platform and seamlessly run jobs on any CloudBender enabled system. When you start a notebook or submit a job, CloudBender will automatically select the lowest cost available resource that meets your hardware, cost, data, and security specifications.
CloudBender currently supports two infrastructure providers: Physical and Google Cloud (GCP).
The Physical provider is used for any on-premises equipment, whether that is a large DGX server or a GPU-enabled AI workstation. The Physical provider has the lowest cost per GPU hour for on-demand jobs. Physical systems must meet specific hardware requirements and must run the trainML operating system.
To enable the Physical provider, click Enable Provider in the actions menu of the Providers list, or select Physical (On-Prem) from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and click Submit to enable the provider.
The Google Cloud provider allows you to run trainML workloads in your GCP account without the learning curve of becoming a Google Cloud Architect. When combined with using the physical provider for managing your on-prem systems, CloudBender enables a truly seamless "own the base, rent the burst" architecture. Even if you don't have physical resources now, using CloudBender to orchestrate your cloud resources makes transitioning to physical resources later effortless.
CloudBender will create the majority of the resources in your GCP account, but you must create the GCP account, the project, and the service account yourself, and grant the service account the required IAM permissions.
Only paid Cloud Billing accounts are permitted to attach GPUs to instances. If your account is new, it starts in the free tier by default. You must upgrade your GCP account to use CloudBender. If your account is managed by your organization, no change is required.
trainML recommends that you create a project that only trainML will use to avoid issues of quota management, resource conflicts, or security permissions. Once the project is created, enable the following APIs in the project:
Once these APIs are enabled, additional IAM permissions must be granted to some of the service accounts. On the IAM page, configure the following additional permissions.
- Add "Service Account User" to the Compute Engine default service account. The account name will be of the format
- If your project was created prior to April 8, 2021, add "Service Account Token Creator" to the Cloud Pub/Sub Service Agent account. The account name will be of the format service-<project_id>@gcp-sa-pubsub.iam.gserviceaccount.com. You may have to check the Include Google-provided role grants checkbox to see this account.
- Add "Pub/Sub Publisher" to the global cloud logs service account
firstname.lastname@example.org. Because this account will not show up in the principals list, click the Add button near the top of the page and paste the account into the
New principalsbox, then configure the permissions as normal.
If you do not see these accounts in your project, ensure that you have first enabled the required APIs above.
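If you prefer to script these grants rather than perform them in the console, the sketch below drives the equivalent gcloud commands from Python. The project ID and the two project-specific service account emails are placeholders you must replace with the values shown on your IAM page; the role IDs are the standard GCP identifiers for the role names listed above.

```python
import subprocess

# Placeholders - replace with the values from your own GCP project's IAM page.
PROJECT_ID = "my-trainml-project"
COMPUTE_DEFAULT_SA = "REPLACE_ME-compute@developer.gserviceaccount.com"            # Compute Engine default service account
PUBSUB_SERVICE_AGENT = "service-REPLACE_ME@gcp-sa-pubsub.iam.gserviceaccount.com"  # Cloud Pub/Sub Service Agent
CLOUD_LOGS_SA = "cloud-logs@system.gserviceaccount.com"                            # global cloud logs service account

def grant(member: str, role: str) -> None:
    """Bind an IAM role to a service account at the project level."""
    subprocess.run(
        [
            "gcloud", "projects", "add-iam-policy-binding", PROJECT_ID,
            f"--member=serviceAccount:{member}",
            f"--role={role}",
        ],
        check=True,
    )

grant(COMPUTE_DEFAULT_SA, "roles/iam.serviceAccountUser")            # Service Account User
grant(PUBSUB_SERVICE_AGENT, "roles/iam.serviceAccountTokenCreator")  # only for projects created before April 8, 2021
grant(CLOUD_LOGS_SA, "roles/pubsub.publisher")                       # Pub/Sub Publisher
```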
Next, create a new service account in the project for CloudBender to use, and grant it the following roles:
- Compute Admin
- Pub/Sub Admin
- Logs Configuration Writer
- Service Account User
- Billing Account Viewer (optional)
Once the account is created, create a new service account key. Select JSON as the Key Type, and the credentials JSON file should download automatically. This file will be used in the next step.
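For reference, a scripted equivalent of this service account setup is sketched below, again driving the gcloud CLI from Python. The project ID and the service account name (cloudbender) are assumptions you can change; the role IDs are the standard identifiers for the roles listed above.

```python
import subprocess

PROJECT_ID = "my-trainml-project"  # placeholder - your dedicated CloudBender project
SA_NAME = "cloudbender"            # placeholder - any service account name you prefer
SA_EMAIL = f"{SA_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"

ROLES = [
    "roles/compute.admin",           # Compute Admin
    "roles/pubsub.admin",            # Pub/Sub Admin
    "roles/logging.configWriter",    # Logs Configuration Writer
    "roles/iam.serviceAccountUser",  # Service Account User
    # "roles/billing.viewer",        # Billing Account Viewer (optional; usually granted on the billing account, not the project)
]

def run(*args: str) -> None:
    subprocess.run(["gcloud", *args], check=True)

# Create the service account CloudBender will use.
run("iam", "service-accounts", "create", SA_NAME, f"--project={PROJECT_ID}")

# Grant each required role at the project level.
for role in ROLES:
    run("projects", "add-iam-policy-binding", PROJECT_ID,
        f"--member=serviceAccount:{SA_EMAIL}", f"--role={role}")

# Create a JSON key (JSON is the default key type); upload this file when enabling the provider.
run("iam", "service-accounts", "keys", "create", "cloudbender-key.json",
    f"--iam-account={SA_EMAIL}")
```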
To enable the Google Cloud provider, click Enable Provider in the actions menu of the Providers list, or select Google Cloud from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and upload the JSON key file from the previous step. If preferred, you can disable Upload JSON File and paste the contents of the key file into the text area instead.
If the file is in the correct format, the system will automatically extract the project ID and service account email address from the file and populate the fields. Once the form is complete, click Submit to enable the provider.
CloudBender will begin creating some project-level resources necessary for it to manage resources in your project. Once this process is complete, the Create Region button will appear.
All providers are organized by regions, which represent the physical location of the resources. For cloud providers, you can create one trainML region per cloud region. For the physical provider, you should have at least one region per physical site, but are not limited to one.
Regions of the physical provider are designed to organize resources based on physical proximity. If you have multiple data centers, even if they are connected over a high-speed WAN link, you should create a region for each data center. All nodes within the same region must be able to communicate with each other using a private (RFC 1918) address. You can create multiple regions for the same data center if you wish to apply different access control policies to different nodes. For example, if you have some nodes you want dedicated to your organization and others that you want to earn credits by selling spare compute cycles to other trainML customers, you can create two different regions to control that.
To create a new region, select the Create Region button from the actions menu of the Physical provider. Specify a name for the new region and configure whether to allow public jobs on the nodes in this region. If Allow Public Jobs is enabled, CloudBender will use the spare capacity of your nodes in this region to service other customers' jobs. When your node runs jobs for other customers, your account will accumulate credits based on the job credit calculation (see Billing and Credits below) in exchange for the use of your resources. If Allow Public Jobs is disabled, only projects that are owned by your account will send jobs to the nodes in this region. Regardless of whether this setting is enabled, jobs in projects owned by your account will always prefer to run on the nodes in your regions.
Currently, only six Google Cloud regions have most of the common GPU types, so these are the only regions you can configure with CloudBender. To add a new region to CloudBender, select the Create Region button from the actions menu of the Google Cloud provider. The create region form will query the region quotas in your GCP project. If this process fails, ensure the service account you used when creating the provider has the correct permissions.
Select a region from the list under Region Name. When a region is selected, you will see the quotas relevant to setting up GPU nodes. The quotas show both the limit and the amount of the limit currently used. If the region you selected does not have sufficient quota, request a quota increase from the GCP console and try again after the increase has been approved.
On-Demand instances are currently supported; Committed instances are coming soon.
The number of runnable jobs in a region is the minimum of:
- the available GPU quota of that type
- the CPU quota divided by 6 for P100, 8 for V100/T4, or 12 for A100
- the Local SSD quota (in GB) divided by 375
For example, if you have a V100 quota of 8, but a Local SSD or CPU quota of 0, you cannot run any V100 jobs. If you have a very high Local SSD and CPU quota, but a V100 quota of 1, you can only run 1 job with a single V100. If your region runs out of quota, your job will either fail or run in another CloudBender location, depending on the project configuration.
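As a concrete illustration of this calculation, the sketch below reads a region's quotas with the gcloud CLI and applies the minimum rule above for V100s. The region name is an example, and the quota metric names are the usual Compute Engine identifiers; verify them against the output for your own project.

```python
import json
import subprocess

REGION = "us-central1"  # example region

# CPUs required per GPU, from the rule above.
CPUS_PER_GPU = {
    "NVIDIA_P100_GPUS": 6,
    "NVIDIA_V100_GPUS": 8,
    "NVIDIA_T4_GPUS": 8,
    "NVIDIA_A100_GPUS": 12,
}

# Fetch the regional quotas (each entry has a metric, a limit, and the current usage).
raw = subprocess.run(
    ["gcloud", "compute", "regions", "describe", REGION, "--format=json"],
    check=True, capture_output=True, text=True,
).stdout
available = {q["metric"]: q["limit"] - q["usage"] for q in json.loads(raw)["quotas"]}

gpu_metric = "NVIDIA_V100_GPUS"
runnable = min(
    available.get(gpu_metric, 0),                              # available GPUs of that type
    available.get("CPUS", 0) // CPUS_PER_GPU[gpu_metric],      # CPU quota / CPUs per GPU
    available.get("LOCAL_SSD_TOTAL_GB", 0) // 375,             # Local SSD quota / 375 GB per GPU
)
print(f"Runnable single-V100 jobs in {REGION}: {int(runnable)}")
```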
All GCP regions require a storage node to cache the jobs and datasets locally in the region. Select the amount of storage you expect to require in this region from the Storage Class field. Since the storage node will run continuously no matter how many jobs are running, trainML recommends that you purchase committed use discounts for the instance and storage once you are confident in your deployment requirements.
When you submit the new region form, CloudBender will automatically provision the storage node of the selected size. Once the storage node is ready, you can begin creating jobs in the region.
The actual systems that run CloudBender workloads are considered Nodes. Nodes are categorized by the function they serve. Compute Nodes are systems with attached GPUs and actually run the jobs. Storage Nodes cache the job and dataset data from the CloudBender persistence layer into their region to accelerate job provisioning.
Compute nodes are the GPU resources managed by CloudBender. In the case of the Physical provider, each GPU-enabled server you connect to CloudBender is a compute node. For cloud providers, jobs can create ephemeral compute nodes (using On-Demand or Preemptible GPU capacity) or can run on permanent compute nodes that use committed/reserved instances to save money. Once a node is connected to CloudBender, it can no longer be accessed outside of the trainML interface. CloudBender secures the system down to the boot process to ensure no unauthorized data access is possible.
Cloud compute nodes require no additional configuration. However, physical systems must meet certain requirements based on the desired regional configuration and you must perform some manual steps as part of the node provisioning process.
The following requirements must be satisfied for all systems managed by CloudBender:
- All GPUs in the system must have the same model type (e.g. A100, RTX 3090).
- 2+ NVMe drives with a total size of 1 TB per 2 GPUs (e.g. 1 TB for a 1 GPU system, 1 TB for a 2 GPU system, 2 TB for a 3 GPU system, etc.)
- 2.0 compatible TPM installed (or 2.0 compatible fTPM enabled)
The following additional requirements must be satisfied for systems to sell their spare capacity to the network:
- The total system memory must be at least 2x the total GPU VRAM. For example, if the system has four 40 GB A100s, the system must have at least 320 GB of RAM.
- At least 8x PCIe lanes for every GPU (not applicable for SXM connected GPUs)
- At least 2 cores per GPU for PCIe 3.0 GPUs or 4 cores per GPU for PCIe 4.0 GPUs.
If your region is using the local storage mode (the default for physical regions), the system must also have at least three 2 TB 7200 RPM SATA hard drives. If your region is using central storage mode, the system must use a minimum 10 Gb Ethernet connection, with 100 Gb preferred.
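A quick way to sanity-check a machine against these ratios before requesting a boot drive is sketched below. The input values are examples you would replace with your own system's specifications; requirements that cannot be expressed numerically (matching GPU models, TPM 2.0) still need to be verified by hand.

```python
import math

# Example system - replace with your own hardware specifications.
gpu_count = 4
gpu_vram_gb = 40          # VRAM per GPU (e.g. a 40 GB A100)
pcie_version = 4.0        # PCIe generation of the GPU slots (ignore for SXM GPUs)
pcie_lanes_per_gpu = 8
cpu_cores = 32
system_ram_gb = 512
nvme_drives = 2
nvme_total_tb = 2.0

checks = {
    # Requirements for all CloudBender-managed systems.
    "2+ NVMe drives": nvme_drives >= 2,
    "NVMe capacity (1 TB per 2 GPUs)": nvme_total_tb >= math.ceil(gpu_count / 2),
    # Additional requirements for selling spare capacity to the network.
    "RAM >= 2x total GPU VRAM": system_ram_gb >= 2 * gpu_count * gpu_vram_gb,
    "8+ PCIe lanes per GPU": pcie_lanes_per_gpu >= 8,
    "CPU cores per GPU (2 for PCIe 3.0, 4 for PCIe 4.0)":
        cpu_cores >= gpu_count * (4 if pcie_version >= 4.0 else 2),
}

for name, ok in checks.items():
    print(f"{'PASS' if ok else 'FAIL'}: {name}")
```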
Physical Server Provisioning
To add a physical server as a CloudBender compute node, contact us to request a new boot drive and allow us to validate your request. Once your request is approved, we will ship a USB drive to your office. When you receive the drive, prepare the system by performing the following activities in your server's BIOS configuration.
- Ensure the TPM is enabled and clear it
- Disable CSM (Compatibility support mode)
- Disable Secure Boot (NVIDIA systems only)
Refer to your motherboard manual on how to perform the above tasks.
Plug the USB drive into the system and reboot. Once the system boots, it will detect that no trainML Operating System is installed and prompt you to install. Type 'yes' to continue. After a few minutes, installation will complete and you will be prompted to restart the system to continue configuration. The system will restart at least one time during the subsequent configuration process.
Never remove the USB drive from the system once the operating system is installed. The system will not be able to boot without it.
The system is ready to be added once you see only the trainML banner, a printout of the trainML minion_id, and no other log messages on the console. Write down the minion_id and go to the CloudBender Resources Dashboard. Select Add Node from the action menu of the new node's physical region. Enter a name for the new node and the minion_id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled. If you want the system to mine cryptocurrency while the GPUs are idle, set mining to Enabled. Cryptocurrency mining requires that you configure valid wallet addresses in your account. Click Submit to add the new node.
You will be navigated to the region's node dashboard. The new node will finalize its provisioning process and may restart again. When the node is ready, it will be in maintenance mode until you are ready to activate it. To activate the new node, select it from the list and click Toggle Maintenance. The status will change to Active, and the node is now ready to service jobs.
Storage nodes are mandatory for all cloud provider regions. Contact us if you would like to use a centralized storage node in a physical region to free up the compute node's local disks.
Billing and Credits
Running jobs on your own resources incurs an hourly usage fee that varies based on the type of GPU. The usage fees are as follows:
- Development Only GPUs (GeForce, Radeon) - 0.05 credits/hr
- Professional GPUs (Quadro/RTX A6000, Radeon Pro) - 0.10 credits/hr
- Datacenter GPUs (Tesla, Radeon Instinct) - 0.25 credits/hr
If you currently have credits at one of the supported cloud providers, contact us to find out how to waive the usage fee for your cloud provider as long as you have credits remaining.
If you configure your CloudBender region to allow other customers to consume your spare resources, you can earn credits that can be used to offset your own trainML resource costs or paid out in cash. When you service another customer's job, you receive a credit at the end of each hour the job has been running. When a job stops, you receive the credit for the partial time the job ran. The customer pays the full hourly price for the GPU. From this, trainML deducts the standard usage fee based on the GPU type, and splits the remaining credits evenly with you.
For example, if you add a system with A100 GPUs, other trainML customers will pay 2.78 credits per hour per GPU. At the end of each hour a job runs, you will receive 1.265 credits ((2.78 - 0.25) / 2) per GPU.
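The payout arithmetic can be expressed directly. In the sketch below, the 2.78 credit/hr A100 rate and the per-class usage fees come from the figures listed in this section.

```python
# Usage fees per GPU class, in credits per hour (from the list above).
USAGE_FEE = {
    "development": 0.05,   # GeForce, Radeon
    "professional": 0.10,  # Quadro/RTX A6000, Radeon Pro
    "datacenter": 0.25,    # Tesla, Radeon Instinct
}

def host_payout_per_gpu_hour(customer_rate: float, gpu_class: str) -> float:
    """Credits the host earns per GPU-hour: trainML deducts the usage fee
    for the GPU class, then splits the remaining credits evenly with the host."""
    return (customer_rate - USAGE_FEE[gpu_class]) / 2

# Example from the text: an A100 billed to the customer at 2.78 credits/hr.
print(host_payout_per_gpu_hour(2.78, "datacenter"))  # -> 1.265
```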
You can redeem accumulated credits in either of two ways:
- Credits - All accumulated credits will be applied to your billing account every 24 hours. There is no fee associated with this transfer.
- Stripe (coming soon) - Receive cash payouts of your credit balance through Stripe Connect.