CloudBender

CloudBender™ lets you connect your on-prem and cloud GPUs to the trainML platform and seamlessly run jobs on any CloudBender-enabled system. When you start a notebook or submit a job, CloudBender automatically selects the lowest-cost available resource that meets your hardware, cost, data, and security specifications.

Providers

CloudBender currently supports three different infrastructure providers: Physical, Google Cloud (GCP), and AWS.

Physical

The Physical provider is used for any on-premise equipment, whether that be a large DGX server or a GPU-enabled AI workstation. The Physical provider has the lowest cost per GPU hour for on-demand jobs. Physical systems must meet specific hardware requirements and must run the trainML operating system.

To enable the Physical provider, click Enable Provider in the actions menu of the Providers list, or select Physical (On-Prem) from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and click Submit.

Google Cloud

The Google Cloud provider allows you to run trainML workloads in your GCP account without the learning curve of becoming a Google Cloud Architect. When combined with using the physical provider for managing your on-prem systems, CloudBender enables a truly seamless "own the base, rent the burst" architecture. Even if you don't have physical resources now, using CloudBender to orchestrate your cloud resources makes transitioning to physical resources later effortless.

CloudBender will create the majority of the resources in your GCP account, but you must create the GCP account, the project, the service account, and the service account IAM permissions.

Account Requirements

Only paid Cloud Billing accounts are permitted to attach GPUs to instances. If your account is new, it starts in the free tier by default. You must upgrade your GCP account to use CloudBender. If your account is managed by your organization, no change is required.

Project Configuration

trainML recommends that you create a project that only trainML will use to avoid issues of quota management, resource conflicts, or security permissions. Once the project is created, enable the following APIs in the project:

Once these APIs are enabled, additional IAM permissions must be granted to some of the service accounts. On the IAM page, configure the following additional permissions.

Add "Service Account User" to the Compute Engine default service account. The account name will be of the format <project_number>-compute@developer.gserviceaccount.com.
If your project was created prior to April 8, 2021, add "Service Account Token Creator" to the Cloud Pub/Sub Service Agent account. The account name will be of the format service-<project_number>@gcp-sa-pubsub.iam.gserviceaccount.com. You may have to check the Include Google-provided role grants checkbox to see this account. See more details here.
Add "Pub/Sub Publisher" to the global cloud logs service account cloud-logs@system.gserviceaccount.com. Because this account will not show up in the principals list, click the Add button near the top of the page and paste the account into the New principals box, then configure the permissions as normal.

If you do not see these accounts in your project, ensure that you have first enabled the required APIs above.

Service Account

Create a service account in the new project for CloudBender to use. Add the following roles to the service account during the creation process:

Compute Admin
Pub/Sub Admin
Logs Configuration Writer
Service Account User
Create Service Accounts
Delete Service Accounts
Billing Account Viewer (optional)

Once the account is created, create new service account keys. Select JSON as the Key Type, and the credentials JSON file should download automatically. This file will be used in the next step.

Enable GCP Provider

To enable the Google Cloud provider, click Enable Provider in the actions menu of the Providers list, or select Google Cloud from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, select the desired payment mode and upload the JSON file from the previous step. If preferred, you can disable Upload JSON File and paste the contents of the key file from the previous step into the text area.

If the file is in the correct format, the system will automatically extract the correct project ID and service account email address from the file and populate the fields. Once the form is complete, click Submit to enable the provider.
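Before uploading, it can help to confirm that the key file contains the values the form will extract. The following is a minimal sketch, assuming the standard Google service account key layout (`type`, `project_id`, `private_key`, `client_email`). The function name and the sample values are hypothetical, and the checks CloudBender itself performs are not documented here.

```python
import json

# Fields present in a standard Google service account key file. CloudBender's own
# validation rules are not documented here, so this check is only an approximation.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def summarize_key_file(raw_json: str) -> dict:
    """Return the project ID and service account email found in a key file's contents."""
    key = json.loads(raw_json)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {', '.join(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {key['type']!r}")
    return {"project_id": key["project_id"], "service_account": key["client_email"]}


# Placeholder values for illustration only; a real key file holds a private key.
sample = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key": "<redacted>",
    "client_email": "cloudbender@example-project.iam.gserviceaccount.com",
})
print(summarize_key_file(sample))
```

If the function raises, the file is probably not a service account key (for example, an OAuth client secret downloaded by mistake).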

CloudBender will begin creating some project level resources necessary for it to manage resources in your project. Once this process is complete, the Create Region button will appear.

AWS

The AWS provider allows you to run trainML workloads in your AWS account without the learning curve of becoming an AWS Architect. When combined with using the physical provider for managing your on-prem systems, CloudBender enables a truly seamless "own the base, rent the burst" architecture. Even if you don't have physical resources now, using CloudBender to orchestrate your cloud resources makes transitioning to physical resources later effortless.

CloudBender will create the majority of the resources in your AWS account, but you must create the AWS account, the IAM role, and the service account.

tip

If you would rather create the non-clean room resources (e.g. VPC, instance roles, etc.) yourself, contact us for more detailed instructions.

Account Configuration

trainML recommends that you create an account that only trainML will use to avoid issues of quota management, resource conflicts, or security permissions. Most new accounts do not have sufficient quota to run GPU-enabled instances by default. trainML can create resources using the following quotas:

Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances - CPU Only instances
Running On-Demand P instances - V100, A100, H100 Instances
Running On-Demand G and VT instances - L4, T4 Instances

The quota required is based on the instance vCPU (not the number of instances). For example, to run a single g6.12xlarge (a medium 4-GPU L4 instance), the required quota for Running On-Demand G and VT instances is 48.
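Because the quota is counted in vCPUs, the value to request is the per-instance vCPU count multiplied by the number of instances you expect to run at once. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the only instance size used is the g6.12xlarge figure from the example above.

```python
def required_vcpu_quota(vcpus_per_instance: int, instance_count: int = 1) -> int:
    """Quota is measured in vCPUs, so multiply per-instance vCPUs by the instance count."""
    if vcpus_per_instance <= 0 or instance_count < 0:
        raise ValueError("vCPU count must be positive and instance count non-negative")
    return vcpus_per_instance * instance_count


G6_12XLARGE_VCPUS = 48  # from the g6.12xlarge example in the text

print(required_vcpu_quota(G6_12XLARGE_VCPUS))     # one instance
print(required_vcpu_quota(G6_12XLARGE_VCPUS, 3))  # three concurrent instances
```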

IAM Service Account

To enable trainML to create clean rooms in your account, you must create an IAM user that trainML authenticates with and an IAM role, with the correct policy, that trainML assumes to manage its EC2 resources. Start with the IAM user; the role is created in the next section. From the IAM console, click on Users and click Create User. Name the user something memorable (e.g. proximl-user) and ensure Provide user access to the AWS Management Console is unchecked. Click Next and do not set any permissions on this user. Click Next, add any tags according to your organization's policies, and click Create User.

Once the user is created, click on the username to view the user details. Copy the user's ARN as this will be required in subsequent steps. Click the Security credentials tab and navigate to the Access keys section. Click Create access key. Click Other under Access key best practices & alternatives and click Next. Add a description if desired and click Create access key. Obtain the Access Key and Secret Access Key and store them in a secure location. Instructions for providing them to trainML are in the Enable Provider section.

IAM role

Next, you must create the policy and role that trainML will use to create and manage its resources in your account. For a fully-managed deployment, trainML needs write access to the following services:

EC2
VPC
SNS
KMS
EventBridge
Launch Templates (IAM Roles and Instance Profiles)

tip

This scope can be reduced by creating the non-clean room resources (e.g. VPC, instance roles, etc.) yourself, but will increase the amount of manual setup work performed by you. Contact us for more detailed instructions on this option.

From the IAM Console, click on the Policies tab and click Create Policy. In the Policy Editor toolbar, click the JSON option. Paste the following policy into Policy Editor and click Next.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CopyImage",
                "ec2:CreateSecurityGroup",
                "ec2:CreateNetworkInterface",
                "ec2:Describe*",
                "ec2:ModifyHosts",
                "ec2:ModifyInstanceAttribute",
                "servicequotas:GetServiceQuota",
                "kms:CreateKey"
            ],
            "Resource": "*"
        },
        {
            "Action": [
                "ec2:AssociateIamInstanceProfile",
                "ec2:AttachVolume",
                "ec2:AttachInternetGateway",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CopySnapshot",
                "ec2:Create*",
                "ec2:Delete*",
                "ec2:Describe*",
                "ec2:DetachVolume",
                "ec2:DetachInternetGateway",
                "ec2:DisassociateIamInstanceProfile",
                "ec2:GetInstanceTpmEkPub",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:ModifyVpcAttribute",
                "ec2:ModifySubnetAttribute",
                "ec2:ModifyVolumeAttribute",
                "ec2:ModifySecurityGroupRules",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:RunInstances",
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances",
                "servicequotas:GetServiceQuota",
                "sns:CreateTopic",
                "sns:GetTopicAttributes",
                "sns:Subscribe",
                "sns:ConfirmSubscription",
                "sns:Unsubscribe",
                "sns:TagResource",
                "sns:DeleteTopic",
                "events:PutRule",
                "events:PutTargets",
                "events:DeleteRule",
                "events:RemoveTargets",
                "events:TagResource",
                "events:ListTargetsByRule",
                "kms:TagResource",
                "kms:EnableKey",
                "kms:DisableKey",
                "kms:ScheduleKeyDeletion",
                "kms:CreateAlias",
                "kms:DeleteAlias"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:vpc/*",
                "arn:aws:ec2:*:*:subnet/*",
                "arn:aws:ec2:*:*:instance/*",
                "arn:aws:ec2:*:*:launch-template/*",
                "arn:aws:ec2:*:*:key-pair/*",
                "arn:aws:ec2:*::snapshot/*",
                "arn:aws:ec2:*:*:volume/*",
                "arn:aws:ec2:*:*:security-group/*",
                "arn:aws:ec2:*:*:network-interface/*",
                "arn:aws:ec2:*:*:internet-gateway/*",
                "arn:aws:ec2:*:*:route-table/*",
                "arn:aws:ec2:*::image/*",
                "arn:aws:sns:*:*:proximl-*",
                "arn:aws:events:*:*:rule/proximl-*",
                "arn:aws:servicequotas:*:*:ec2/*",
                "arn:aws:servicequotas:*:*:vpc/*",
                "arn:aws:servicequotas:*:*:ebs/*",
                "arn:aws:kms:*:*:key/*",
                "arn:aws:kms:*:*:alias/proximl/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "iam:GetRole",
                "iam:CreateRole",
                "iam:DeleteRole",
                "iam:PassRole",
                "iam:TagRole",
                "iam:GetPolicy",
                "iam:GetPolicyVersion",
                "iam:ListPolicyVersions",
                "iam:ListPolicyTags",
                "iam:ListRoleTags",
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:AddRoleToInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:TagInstanceProfile",
                "iam:TagPolicy"
            ],
            "Resource": [
                "arn:aws:iam::*:role/proximl-*",
                "arn:aws:iam::*:policy/proximl-*",
                "arn:aws:iam::*:instance-profile/proximl-*"
            ],
            "Effect": "Allow"
        }
    ]
}

Name the policy something memorable (e.g. proximl-cloudbender-policy) and add any tags that are required by your organization's policies. Click Create policy to continue.

info

trainML recommends that you do not add any "data access" permissions to this policy. This policy/role should only be used for the creation and management of the clean room system itself. Data access policies/roles will be defined separately as region datastores or third-party credentials. Isolating these roles improves auditability and management of application access.

Next, click on the Roles tab and click Create Role. Select An AWS Account as the Trusted Entity Type and leave the Account setting as This account. Click Next and search for the policy created in the previous step on the Add permissions page. Select that policy and click Next. In the Select Trusted Entities section, replace the default root account in the Principal/AWS section with the ARN of the IAM Service Account user you created earlier. Enter a memorable name for the role (e.g. proximl-cloudbender-role), and add any tags that are required by your organization's policies. Click Create Role to continue.

tip

If your IAM configuration does not allow you to edit the Trusted Entities policy, select Custom Trust Policy instead of An AWS Account on the first step and use a policy like the following:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "Statement1",
			"Effect": "Allow",
			"Principal": {
			    "AWS": "<full ARN of user created previously>"
			},
			"Action": "sts:AssumeRole"
		}
	]
}
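If you script your IAM setup, the custom trust policy can be generated from the user ARN rather than edited by hand. This is a minimal sketch, not an official tool: the function name is hypothetical, and the ARN check only covers the standard `aws` partition.

```python
import json


def build_trust_policy(user_arn: str) -> dict:
    """Build a trust policy that lets a single IAM user assume the role."""
    # Shape check only: arn:aws:iam::<account id>:user/<name>.
    # Other partitions (e.g. GovCloud) are not handled in this sketch.
    parts = user_arn.split(":")
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "iam"] or not parts[5].startswith("user/"):
        raise ValueError(f"not an IAM user ARN: {user_arn!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Principal": {"AWS": user_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# The account ID and user name below are placeholders.
print(json.dumps(build_trust_policy("arn:aws:iam::123456789012:user/proximl-user"), indent=4))
```

The printed JSON can be pasted into the Custom Trust Policy editor.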

Once the role is created, click on its name to view the details. Copy the role ARN, as this will be required in the next step.

Enable AWS Provider

To enable the AWS provider, click Enable Provider in the actions menu of the Providers list, or select AWS from the Enabled Provider menu if you have no current providers configured. On the provider configuration form, enter the 12-digit Account ID of the account where you will deploy trainML in the Account Number field and the IAM role ARN from the previous step in the IAM Role field. You also need to enter the Access Key and Secret Access Key of the service account credentials created earlier. If enabling through a third party, or for extra security, you can PGP encrypt the key and secret prior to entering them in this form. The trainML PGP public key can be downloaded here. To encrypt a field, you can use a command like the following:

curl -s -o pgp.asc https://app.proximl.ai/.well-known/pgp.asc
echo -n "<secret field value here>" | gpg --encrypt --armor --recipient-file pgp.asc

You must run the second command for each field you wish to encrypt.
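If you have several fields to encrypt, the second command can be wrapped in a small script. The sketch below assumes `gpg` is installed and that `pgp.asc` has already been downloaded with the first command. The function names and the field names in the usage note are hypothetical.

```python
import subprocess
from typing import Dict, List


def gpg_encrypt_command(public_key_path: str) -> List[str]:
    """Same flags as the shell example: ASCII-armored output for a recipient key file."""
    return ["gpg", "--encrypt", "--armor", "--recipient-file", public_key_path]


def encrypt_field(value: str, public_key_path: str = "pgp.asc") -> str:
    """Encrypt one field value; the value is passed on stdin without a trailing newline."""
    result = subprocess.run(
        gpg_encrypt_command(public_key_path),
        input=value.encode("utf-8"),
        capture_output=True,
        check=True,  # raise if gpg exits with an error
    )
    return result.stdout.decode("ascii")


def encrypt_fields(fields: Dict[str, str], public_key_path: str = "pgp.asc") -> Dict[str, str]:
    """Run the encryption once per field, as the text requires."""
    return {name: encrypt_field(value, public_key_path) for name, value in fields.items()}
```

For example, `encrypt_fields({"access_key": "...", "secret_key": "..."})` returns one armored PGP message per field, ready to paste into the form.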

Once the form is complete, click Submit to enable the provider. Once the account information is validated, the Create Region button will appear.

Regions

All providers are organized by regions, which represent the physical location of the resources. For cloud providers, you can create one trainML region per cloud region. For the physical provider, you should have at least one region per physical site, but are not limited to one.

Physical Regions

Regions of the physical provider are designed to organize resources based on physical proximity. If you have multiple data centers, even if they are connected over a high-speed WAN link, you should create a region for each data center. All nodes within the same region must be able to communicate with each other using a private (RFC 1918) address. You can create multiple regions for the same data center if you wish to apply different access control policies to different nodes. For example, if you have some nodes you want dedicated to your organization and others that you want to earn credits by selling spare compute cycles to other trainML customers, you can create two different regions to control that.
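To check whether a node's address falls in one of the three RFC 1918 private ranges, a short script like the following can be used. It is only a sketch with a hypothetical function name. It tests the address range, not whether the nodes can actually reach each other.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address is an IPv4 address inside an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in network for network in RFC1918_NETWORKS)


for candidate in ("10.1.2.3", "172.31.255.255", "172.32.0.1", "8.8.8.8"):
    print(candidate, is_rfc1918(candidate))
```

The explicit network list is used instead of `is_private` because that property also accepts loopback and link-local addresses, which are not RFC 1918 ranges.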

To create a new region, select the Create Region button from the actions menu of the Physical provider.

Name: A unique name for this region. trainML recommends that region names only contain characters allowed in DNS hostnames.
Storage Mode: Select Local if the region will only contain stand-alone compute nodes. Select Central if you have purchased a central storage controller from trainML. Central is required to use Device nodes.
Allow Public Jobs: If enabled, CloudBender will use the spare capacity of your nodes in this region to service other customers' jobs. When your node runs jobs for other customers, your account will accumulate credits based on the job credit calculation in exchange for the use of your resources. If disabled, only projects that are owned by your account will send jobs to the nodes in this region. Regardless of whether this setting is enabled, jobs in projects owned by your account will always prefer to run on the nodes in your regions.
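The naming recommendation for the Name field can be checked before you submit the form. The sketch below assumes the recommendation means a single DNS hostname label: letters, digits, and hyphens, not starting or ending with a hyphen, and at most 63 characters. The function name is hypothetical, and the form's own validation may differ.

```python
import re

# One DNS hostname label: alphanumeric at both ends, hyphens allowed inside, 1-63 characters.
_HOSTNAME_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_hostname_safe(region_name: str) -> bool:
    """Return True if the name uses only characters allowed in a DNS hostname label."""
    return _HOSTNAME_LABEL.fullmatch(region_name) is not None


for name in ("dc-east-1", "Lab 2", "-edge", "rack_7"):
    print(repr(name), is_hostname_safe(name))
```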

GCP Regions

Currently, only ten Google Cloud regions have most of the common GPU types, so these are the only regions you can configure with CloudBender. To add a new region to CloudBender, select the Create Region button from the actions menu of the Google Cloud provider. The create region form will query the region quotas in your GCP project. If this process fails, ensure the service account you used when creating the provider has the correct permissions.

Select a region from the list under Region Name. When a region is selected, you will see the quotas relevant to setting up GPU nodes. The quotas show both the limit and the amount of the limit currently used. If the region you selected does not have sufficient quota, request a quota increase from the GCP console and try again after the increase has been approved.

info

Currently, only On-Demand instances are supported, but Preemptible and Committed are coming soon.

The number of runnable jobs in a region is the minimum of:

the available GPUs of that type
CPUs divided by 6 for P100, 8 for V100/T4, or 12 for A100
local SSD / 375

For example, if you have a V100 quota of 8, but a Local SSD or CPU quota of 0, you cannot run any V100 jobs. If you have a very high Local SSD and CPU quota, but a V100 quota of 1, you can only run 1 job with a single V100. If your region runs out of quota, your job will either fail or run in another CloudBender location, depending on the project configuration.
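The minimum rule above can be worked through in a few lines of code. This is a sketch under stated assumptions: the quota values are the remaining (unused) amounts, Local SSD quota is expressed in GB, and fractional results are rounded down. The function name is hypothetical.

```python
# CPUs consumed per GPU, from the list above.
CPUS_PER_GPU = {"P100": 6, "V100": 8, "T4": 8, "A100": 12}
LOCAL_SSD_GB_PER_GPU = 375  # assumes the Local SSD quota is measured in GB


def max_single_gpu_jobs(gpu_type: str, gpu_quota: int, cpu_quota: int, local_ssd_gb_quota: int) -> int:
    """Number of single-GPU jobs the remaining quotas allow: the minimum of the three limits."""
    by_gpu = gpu_quota
    by_cpu = cpu_quota // CPUS_PER_GPU[gpu_type]
    by_ssd = local_ssd_gb_quota // LOCAL_SSD_GB_PER_GPU
    return min(by_gpu, by_cpu, by_ssd)


# The two cases from the text, plus one where CPU quota is the bottleneck.
print(max_single_gpu_jobs("V100", gpu_quota=8, cpu_quota=0, local_ssd_gb_quota=0))
print(max_single_gpu_jobs("V100", gpu_quota=1, cpu_quota=1000, local_ssd_gb_quota=100000))
print(max_single_gpu_jobs("A100", gpu_quota=8, cpu_quota=48, local_ssd_gb_quota=3000))
```

In the last case the A100 quota would allow 8 jobs and the Local SSD quota 8, but 48 CPUs at 12 per A100 only cover 4, so CPU quota is the one to raise.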

All GCP regions require a storage node to cache the jobs, models, checkpoints, and datasets locally in the region. Select the amount of storage you expect to require in this region from the Storage Class field. Since the storage node will run continuously no matter how many jobs are running, trainML recommends that you purchase committed use discounts for the instance and storage once you are confident in your deployment requirements.

When you submit the new region form, CloudBender will automatically provision the storage node of the selected size. Once the storage node is ready, you can begin creating jobs in the region.

AWS Regions

Currently, trainML maintains a limited set of regions containing the CloudBender AMIs. If you need a region that is not available in the list, contact us. To add a new region to CloudBender, select the Create Region button from the actions menu of the AWS provider. The create region form will query the region quotas in your AWS account. If this process fails, ensure the service account you used when creating the provider has the correct permissions.

Select a region from the list under Region Name. The Allow Public Jobs field indicates if you want other proxiML users to pay for the use of any spare capacity you provision to earn additional credits. The Storage Mode field indicates if you will use a dedicated server to cache data like model weights for faster performance. Local is appropriate for small, single server deployments. Central creates a dedicated storage node automatically during region provisioning.

Click Submit to create the region.

Nodes

The actual systems that run CloudBender workloads are considered Nodes. Nodes are categorized by the function they serve. Compute Nodes are systems with attached GPUs and actually run the jobs. Storage Nodes cache the job, model, checkpoint, and dataset data from the CloudBender persistence layer into their region to accelerate job provisioning.

Compute

Compute nodes are the GPU resources managed by CloudBender. In the case of the Physical provider, each GPU-enabled server you connect to CloudBender is a compute node. For cloud providers, jobs can create ephemeral compute nodes (using On-Demand or Preemptible GPU capacity) or can run on permanent compute nodes that use committed/reserved instances to save money. Once a node is connected to CloudBender, it can no longer be accessed outside of the trainML interface. CloudBender secures the system down to the boot process to ensure no unauthorized data access is possible.

Cloud compute nodes require no additional configuration. However, physical systems must meet certain requirements based on the desired regional configuration and you must perform some manual steps as part of the node provisioning process.

Hardware Requirements

The following requirements must be satisfied for all systems managed by CloudBender:

All GPUs in the system must have the same model type (e.g. A100, RTX 3090).
2+ NVMe drives with a minimum total size of 1 TB per 2 GPUs (e.g. 1 TB for a 1 GPU system, 1 TB for a 2 GPU system, 2 TB for a 3 GPU system, etc.)
2.0 compatible TPM installed (or 2.0 compatible fTPM enabled)
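The NVMe sizing rule rounds up to 1 TB for every started pair of GPUs. A minimal sketch of that calculation follows, with hypothetical function names, treating "2+ NVMe drives" as a separate check on the drive count.

```python
import math
from typing import List


def min_nvme_tb(gpu_count: int) -> int:
    """Minimum total NVMe capacity in TB: 1 TB per 2 GPUs, rounded up."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return math.ceil(gpu_count / 2)


def meets_nvme_requirement(drive_sizes_tb: List[float], gpu_count: int) -> bool:
    """At least two NVMe drives whose combined size covers the minimum."""
    return len(drive_sizes_tb) >= 2 and sum(drive_sizes_tb) >= min_nvme_tb(gpu_count)


for gpus in (1, 2, 3, 8):
    print(gpus, "GPUs ->", min_nvme_tb(gpus), "TB")
```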

The following additional requirements must be satisfied for systems to sell their spare capacity to the network:

The total system memory must be at least 2x the total GPU VRAM. For example, if the system has four 40 GB A100s, the system must have at least 320 GB of RAM.
At least 8x PCIe lanes for every GPU (not applicable for SXM connected GPUs)
At least 2 cores per GPU for PCIe 3.0 GPUs or 4 cores per GPU for PCIe 4.0 GPUs.
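The three requirements above can be combined into a single pre-check before you request a boot drive. The sketch below uses hypothetical names and only encodes the rules stated in this list. It does not replace trainML's own validation.

```python
from typing import List, Optional

# Minimum CPU cores per GPU, keyed by PCIe generation (from the list above).
CORES_PER_GPU = {3: 2, 4: 4}


def public_capacity_issues(
    gpu_count: int,
    vram_gb_per_gpu: int,
    system_ram_gb: int,
    cpu_cores: int,
    pcie_generation: int,
    pcie_lanes_per_gpu: Optional[int] = None,  # pass None for SXM-connected GPUs
) -> List[str]:
    """Return a list of unmet requirements; an empty list means the system qualifies."""
    if pcie_generation not in CORES_PER_GPU:
        raise ValueError("only PCIe 3.0 and 4.0 are covered by the stated requirements")
    issues = []
    required_ram = 2 * gpu_count * vram_gb_per_gpu
    if system_ram_gb < required_ram:
        issues.append(f"need at least {required_ram} GB of RAM, found {system_ram_gb} GB")
    if pcie_lanes_per_gpu is not None and pcie_lanes_per_gpu < 8:
        issues.append(f"need at least 8 PCIe lanes per GPU, found {pcie_lanes_per_gpu}")
    required_cores = CORES_PER_GPU[pcie_generation] * gpu_count
    if cpu_cores < required_cores:
        issues.append(f"need at least {required_cores} CPU cores, found {cpu_cores}")
    return issues


# Four 40 GB A100s on PCIe 4.0 with 256 GB of RAM fall short of the 320 GB minimum.
print(public_capacity_issues(4, 40, 256, 16, 4, 16))
```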

If your region is using the local storage mode (the default for physical regions), the system must also have at least three 2 TB SATA drives (SSD or 7200 RPM HDD). If your region is using central storage mode, the system must have at least a 10 Gb Ethernet connection (100 Gb preferred).

Physical Server Provisioning

To add a physical server as a CloudBender compute node, contact us to request a new boot drive and allow us to validate your request. Once your request is approved, we will ship a USB drive to your office. When you receive the drive, prepare the system by performing the following activities in your server's BIOS configuration.

Ensure the TPM is enabled and clear it
Disable CSM (Compatibility Support Module)
Disable Secure Boot (NVIDIA systems only)

Refer to your motherboard manual on how to perform the above tasks.

Plug the USB drive into the system and reboot. Once the system boots, it will detect that no trainML Operating System is installed and prompt you to install. Type 'yes' to continue. After a few minutes, installation will complete and you will be prompted to restart the system to continue configuration. The system will restart at least one time during the subsequent configuration process.

warning

Never remove the USB drive from the system once the operating system is installed. The system will not be able to boot without it.

The system is ready to be added once you see only the trainML banner with a print out of the trainML minion_id and some other diagnostic information. Write down the minion_id and go to the CloudBender Resources Dashboard.

tip

If the region is configured to use the Central storage mode, you must add the storage node before you can add compute nodes.

Select View Region from the action menu of the new node's physical region. Once on the region dashboard, click the Add button on the Compute Nodes grid. Enter a name for the new node and the minion id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled. If you want the system to mine cryptocurrency while the GPUs are idle, set mining to Enabled. Cryptocurrency mining requires that you configure valid wallet addresses in your account. Click Submit to add the new node.

You will be navigated to the region's node dashboard. The new node will finalize its provisioning process, and may restart again. When the node is ready, it will be in maintenance mode until you are ready to activate it. To activate the new node, select it from the list and click Toggle Maintenance. The status will change to Active and the node is now ready to service jobs.

Devices

Devices are edge inference nodes that contain a system-on-chip (SoC) accelerator and permanently run a single, always-on inference model. Currently supported devices include:

NVIDIA Jetson AGX Xavier
NVIDIA Jetson Xavier NX
NVIDIA Jetson AGX Orin
NVIDIA Jetson Orin NX
NVIDIA Jetson Orin Nano

CloudBender compatible devices can be purchased through trainML. Contact us for a quote.

caution

Devices are only supported in Physical regions configured with a centralized storage node. If you already have compute nodes in a region configured with local storage, you must create a new region to add devices.

Obtain the device minion_id from the sticker on the device or by attaching a display to the device. Select View Region from the action menu of the new device's physical region. Once on the region dashboard, click the Add button on the Devices grid. Enter a name for the new device and the minion id, then click Submit.

You will be navigated back to the region's dashboard. The new device will finalize its provisioning process, and may restart again. When the process is complete, the device will automatically enter the Active state. Once it is active, you must set the desired device configuration to run the inference model. Once you have created a Device Configuration, select Set Device Config from the action menu. Select the desired device configuration and click Select. Once the configuration is set, deploy the inference model to the device by selecting it on the grid and clicking Deploy Config on the toolbar, or select Deploy Latest Config from the device action menu.

When the deployment is complete, the Inference Status will show running, and the configuration status will indicate the last date it was deployed. While running, the inference job will have access to the SoC accelerator and any video or media devices plugged into the device.

Device Configurations

Device Configurations allow you to share the inference model configuration across many devices in a region. Device Configurations are region specific because they allow you to integrate the device with regional resources like datastores. Multiple devices can use the same configuration, but a single device can only run one configuration at a time.

To add a new Device Configuration, click the Manage Device Configs button on the Devices toolbar. Click Add New to create a new configuration.

Configuration Name (required): A unique name for this device configuration. Names must be unique within regions but can be reused across regions.

Image Name (required): The docker image to run as the inference container. The same repositories are supported as when using a Customer Provided Image for a job environment.

Model (required): Select the trainML Model containing the code that will run the inference task.

Checkpoints: Add any trainML Checkpoints that the model's inference code requires.

Start Command (required): The command to run in the model code's root directory to start running the inference task.

tip

Device job commands must be designed to run continuously in the foreground. The inference task will be restarted if the device is rebooted, but will not automatically restart if the command itself exits.

Datastore: Select a regional datastore to mount to the TRAINML_OUTPUT_PATH location of the running container to read or save private data.

Datastore Path: The subdirectory inside the regional datastore to mount.

Environment Variable: Configure any environment variables to set in the inference container.

Attached Keys: Add any Third-Party Key credentials to the inference container required for successful code execution.
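Because the Start Command must keep running in the foreground, the entry point of the inference code is typically a loop that never returns and that survives errors in individual iterations. The following is a minimal sketch of that pattern, not trainML's reference implementation. All function names are hypothetical, the heartbeat file stands in for real inference work, and the `/tmp` fallback for `TRAINML_OUTPUT_PATH` is an assumption.

```python
import os
import time
from typing import Callable, Optional


def run_in_foreground(
    step: Callable[[], None],
    interval_seconds: float = 1.0,
    max_iterations: Optional[int] = None,
) -> int:
    """Call step() repeatedly; max_iterations=None loops until the process is stopped."""
    completed = 0
    while max_iterations is None or completed < max_iterations:
        try:
            step()
        except Exception as error:
            # Log and continue so one bad input does not end the command.
            print(f"inference step failed: {error}", flush=True)
        completed += 1
        time.sleep(interval_seconds)
    return completed


def main() -> None:
    # TRAINML_OUTPUT_PATH is where the datastore is mounted; the fallback is an assumption.
    output_dir = os.environ.get("TRAINML_OUTPUT_PATH", "/tmp")

    def step() -> None:
        # Placeholder for real work: read an input, run the model, write the result.
        with open(os.path.join(output_dir, "heartbeat.txt"), "w") as handle:
            handle.write(str(time.time()))

    run_in_foreground(step)  # never returns, which keeps the command in the foreground
```

A script that calls `main()` at the bottom and is launched with a Start Command such as `python run_inference.py` (a hypothetical file name) would then stay running until the device stops it. The `max_iterations` parameter exists only so the loop can be exercised in tests.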

Storage

Storage nodes are recommended for all cloud provider regions. Contact us if you would like to use a centralized storage node in a physical region to free up the compute node's local disks.

Configuration of the storage node is automatic in cloud provider regions. To add a storage node to a Physical region, ensure the region is configured with Storage Mode set to Central. Attach a display to the storage node and connect it to the network. Once it has started, you should see the trainML banner on the screen, which displays the trainML minion_id and some other diagnostic data. Write down the minion_id and go to the CloudBender Resources Dashboard.

Select View Region from the action menu of the new node's physical region. Once on the region dashboard, click the Add button on the Storage Nodes grid. Enter a name for the new node and the minion id from the console. If the system is in a region behind a commercial corporate firewall, UPnP should be disabled. If it is behind a residential internet connection or other firewall managed by your ISP, leave UPnP enabled.

info

Only one storage node per region is currently supported. If you already have a storage node configured in a region, it must be removed before a new one can be added. Alternatively, you can replace the existing storage node with the new one by selecting Replace Node from the actions menu.

You will be navigated back to the region's dashboard. The new node will finalize its provisioning process, and may restart again. When the process is complete, the node will automatically enter the Active state, and you can begin to add compute nodes or devices to the region.

Regional Resources

info

Regional resources require a Team or higher feature plan.

Regional resources allow trainML jobs to utilize data and provide services in a specific region. If you attach a job to any regional resource, the resource reservation system will ensure that the job will only start in that specific region.

tip

Regional resources are only accessible from compute nodes or devices. Ensure that at least one is configured before adding regional resources.

caution

Since using a regional resource will constrain the available GPUs to only those that exist in that region, be sure the job specification requests a GPU type that exists in the region or the job will never start.

Datastores

Datastores allow jobs to connect to data storage infrastructure that is local to a region. This avoids having to upload the data as a trainML Dataset, Checkpoint, or as input data to an Inference Job. Datastores can also be used as the output data location for training/inference jobs or mounted to Notebook and Endpoint jobs to provide additional scratch space or access to additional data at runtime. Datastores are ideal for data that is too large or too sensitive to be uploaded to the trainML platform.

To add a datastore, select View Region from the action menu of the new datastore's region. Once on the region dashboard, click the Add button on the Datastores toolbar.

Name: A unique name for this datastore.

Type: The type of datastore. Different datastore types require different configuration options.

NFS: Select if adding an NFS server. See additional requirements for the required server export configuration.
SMB/CIFS: Select if adding a Windows or Samba file share server.
Elastic Block Store (AWS Only): Select to create EBS backed storage resources. (Only create one of these per region).
S3 (AWS Only): Select to add an S3 bucket as the datastore.
Google Persistent Disk (GCP Only): Select to create Persistent Disk backed storage resources. (Only create one of these per region).
Google Cloud Storage (GCP Only): Select to add a Cloud Storage bucket as the datastore.

Address/URI: The hostname, network name, bucket URL, or IP address of the datastore server.

Root Path: The directory on the server to act as the root for this datastore. Use "/" to enable access to the entire datastore. If you specify a subdirectory as the datastore root path, subsequent jobs, datasets, and checkpoints will not be able to access data above that subdirectory and their path specification will be relative to this root path. For example, if the datastore exposes the following directories:

- dir1
  - subdir1
- dir2
  - subdir2
  - subdir3

Setting /dir2 as the Root Path will prevent access to dir1 or any of its subdirectories. When configuring a job to mount subdir2 as the input data, specify /subdir2 as the input path, NOT /dir2/subdir2. The datastore system automatically concatenates the datastore root path with the requested subdirectory path.

Username (SMB Only): The username to connect to the datastore with.

Password (SMB Only): The password for the user.

Role (AWS Only): The role ARN with permission to access the resource (see below for role configuration details)

Credentials (GCP Only): The JSON key file of a service account with permission to access the resource (see below for service account configuration details).
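The Root Path behavior described above can be sketched in a few lines of Python. This is only an illustration of the concatenation rule: the function name and the check that rejects paths climbing above the root are assumptions, not the platform's actual implementation.

```python
import posixpath


def resolve_datastore_path(root_path: str, requested_path: str) -> str:
    """Join a datastore root path with the path requested by a job.

    Hypothetical helper for illustration only.
    """
    # Treat the requested path as relative to the root, even if it starts with "/".
    relative = requested_path.lstrip("/")
    resolved = posixpath.normpath(posixpath.join(root_path, relative))

    # Reject requests that climb above the root (for example "/../dir1").
    root = posixpath.normpath(root_path)
    prefix = root if root.endswith("/") else root + "/"
    if resolved != root and not resolved.startswith(prefix):
        raise ValueError(f"{requested_path!r} is outside the datastore root")
    return resolved


# With /dir2 as the Root Path, a job asks for /subdir2, not /dir2/subdir2.
print(resolve_datastore_path("/dir2", "/subdir2"))  # prints /dir2/subdir2
```

With "/" as the Root Path, the same call leaves the requested path unchanged, which matches the "access to the entire datastore" case.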

NFS Share Requirements

The default user that containers run as is root. Because NFS passes the user ID through for all operations, all file access on an NFS datastore appears as the root user of the NFS client. Allowing the client's root user to pass through to the server is a significant security risk, so trainML recommends that all NFS exports used as trainML datastores enable root_squash, which maps the client's root ID to the anonymous ID. When this is enabled and the NFS share is used for output data, you must set the anonuid and anongid settings on the export to a user/group that has write access to the exported directory. For this reason, trainML recommends creating a dedicated export with these settings for trainML job output locations rather than reusing existing exports.
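As a rough aid for reviewing an export before using it as an output location, the sketch below checks the option string of a Linux /etc/exports entry. The function name is hypothetical, the example path, subnet, and IDs in the comment are placeholders, and treating "rw" as the sign that the export will receive job output is an assumption.

```python
def check_export_options(options: str) -> list:
    """Return a list of problems found in an NFS export option string.

    `options` is the comma-separated text inside the parentheses of an
    /etc/exports entry. A placeholder entry might look like:
        /srv/job-output 10.0.0.0/24(rw,root_squash,anonuid=1001,anongid=1001)
    """
    opts = [o.strip() for o in options.split(",") if o.strip()]
    names = {o.split("=", 1)[0] for o in opts}

    problems = []
    # root_squash is the default on Linux NFS servers, so only an explicit
    # no_root_squash is flagged.
    if "no_root_squash" in names:
        problems.append("no_root_squash lets the client's root user act as root on the server")
    # Assumption: a writable export is intended for job output.
    if "rw" in names:
        for required in ("anonuid", "anongid"):
            if required not in names:
                problems.append(f"{required} is not set; squashed writes may be denied")
    return problems


print(check_export_options("rw,root_squash,anonuid=1001,anongid=1001"))  # prints []
```

An empty list only means the options look consistent with the recommendations above; the anonymous user and group still need write permission on the exported directory itself.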

AWS Datastore Requirements

info

Resources accessed using the AWS datastores are not required to be in the same AWS account. However, they must be in the same AWS Region as the CloudBender nodes.

All AWS-specific datastores require a role with access to the configured resources. When the datastore is used, the region node assumes this role using the node's instance profile credentials. The process to set up the role is similar to configuring AWS project credentials; however, no IAM user or access keys are required.

tip

The role creation must be performed after the region deployment is complete. Otherwise, there will be no instance role to specify in the Trusted Entities configuration.

As a first step, obtain the region's instance role from the proxiML Region Dashboard, or from the IAM console. If proxiML is managing the region setup, the role will be named proximl-instance-role and the ARN will be arn:aws:iam::<your account ID>:role/proximl-instance-role.

Next, create a policy with the required resource access, like the example in AWS project credentials. Then create a new role and attach the policy to it. When creating the role, select An AWS Account as the Trusted Entity Type and leave the Account setting as This account. Click Next and search for the policy created in the previous step on the Add permissions page. Select that policy and click Next. In the Select Trusted Entities section, replace the default root account in the Principal/AWS section with the proxiML instance role above.

tip

If your IAM configuration does not allow you to edit the Trusted Entities policy, select Custom Trust Policy instead of An AWS Account on the first step and use a policy like the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<full ARN of instance role>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Once the role is created, click on its name to view the details. Use this role ARN when asked for the role in the datastore form.
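If you prefer to generate the custom trust policy rather than edit it by hand, a minimal sketch is shown below. It only reproduces the policy document above with the instance role ARN filled in. The function name is hypothetical and the account ID in the example is a placeholder.

```python
import json


def build_trust_policy(instance_role_arn: str) -> str:
    """Return the custom trust policy as a JSON string (hypothetical helper)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                # The region's instance role is the only principal allowed
                # to assume the datastore role.
                "Principal": {"AWS": instance_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(policy, indent=2)


# 123456789012 is a placeholder account ID; substitute your own.
print(build_trust_policy("arn:aws:iam::123456789012:role/proximl-instance-role"))
```

The printed JSON can be pasted into the Custom Trust Policy editor.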

GCP Datastore Requirements

info

Resources accessed using the GCP datastores are not required to be in the same GCP project. However, they must be in the same GCP Region as the CloudBender nodes.

All GCP-specific datastores require a service account with access to the configured resources. There are two options for configuring this access.

1. Add the roles to the proxiML instance service account (by default proximl-instance-sa@<project_id>.iam.gserviceaccount.com).
2. Create a service account with access to the required resources and provide the JSON key file as part of the datastore configuration.

Option 1 should only be used for small, single instance deployments, as all proxiML instances share the same instance service account. It also only allows access to resources in the same GCP project.

The process for configuring an Option 2 service account is the same as configuring GCP project credentials. Input the resulting JSON key file in the datastore form when creating the datastore.

Services

Services allow jobs to host local applications or services that are expected to be available at a specific hostname and port. This enables multiple local applications to be deployed on the same compute rig by fixing each application to a dedicated port. It also allows the same application to migrate transparently between compute rigs by dynamically updating the dedicated local hostname.

To add a reservation, select View Region from the action menu of the new reservation's region. Once on the region dashboard, click the Add button on the Services grid.

Name: A unique name for this reservation.

Type: Only port reservations are currently supported.

Hostname: The hostname to publish with mDNS.

Port: The port the endpoint should listen on.

tip

Dynamic hostname publishing is currently only supported with mDNS, which is a link-local protocol. If the systems accessing the endpoint are on a different IP subnet, you will need an mDNS Gateway for the hostname to resolve properly.
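A quick way to estimate whether a client will need an mDNS Gateway is to check whether it shares an IP subnet with the system hosting the service. The sketch below uses Python's standard ipaddress module. The function name and the example addresses are hypothetical, and a shared subnet is only an approximation of sharing a network link.

```python
import ipaddress


def needs_mdns_gateway(client_ip: str, service_network: str) -> bool:
    """Return True if the client is outside the service's IP subnet.

    `service_network` is the subnet of the system hosting the service in
    CIDR notation, e.g. "192.168.1.0/24".
    """
    # strict=False also accepts a host address with a prefix, e.g. "192.168.1.10/24".
    network = ipaddress.ip_network(service_network, strict=False)
    return ipaddress.ip_address(client_ip) not in network


print(needs_mdns_gateway("192.168.1.50", "192.168.1.0/24"))  # prints False
print(needs_mdns_gateway("192.168.2.50", "192.168.1.0/24"))  # prints True
```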

Billing and Credits

Usage Fee

Running jobs on your own resources incurs an hourly usage fee that varies based on the type of GPU. The usage fees are as follows:

Development Only GPUs (GeForce, Radeon) - 0.05 credits/hr
Professional GPUs (Quadro/RTX A6000, Radeon Pro) - 0.10 credits/hr
Datacenter GPUs (Tesla, Radeon Instinct) - 0.25 credits/hr

Federated Training/Inference Fee

Cross-project Federated jobs incur an additional hourly fee that applies regardless of the GPU type or the number of GPUs per worker.

Federated Inference (Inference Jobs, Endpoints) - 1.85 credits/hr
Federated Training (Training Jobs) - 2.65 credits/hr

Unlike the usage fee, which is charged to the owner of the project, this fee is charged to the job creator.
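The two fees can be combined into a rough estimate, sketched below using the rates listed above. The sketch rests on two assumptions that this page does not state outright: the usage fee is applied per GPU per hour, and the federated fee is applied per worker per hour. The function and parameter names are hypothetical.

```python
from decimal import Decimal

# Hourly usage fee in credits, keyed by GPU class (rates from the list above).
USAGE_FEE = {
    "development": Decimal("0.05"),
    "professional": Decimal("0.10"),
    "datacenter": Decimal("0.25"),
}

# Hourly federated fee in credits, keyed by job kind.
FEDERATED_FEE = {
    "inference": Decimal("1.85"),
    "training": Decimal("2.65"),
}


def job_fees(gpu_class, gpu_count, hours, federated=None, workers=1):
    """Return (project_owner_fee, job_creator_fee) in credits.

    Assumes the usage fee scales per GPU and the federated fee per worker.
    """
    duration = Decimal(str(hours))
    owner_fee = USAGE_FEE[gpu_class] * gpu_count * duration
    creator_fee = Decimal("0")
    if federated is not None:
        creator_fee = FEDERATED_FEE[federated] * workers * duration
    return owner_fee, creator_fee


# A 2-hour federated training job on 4 datacenter GPUs with one worker.
owner, creator = job_fees("datacenter", gpu_count=4, hours=2, federated="training")
print(owner, creator)
```

Decimal is used instead of float so that credit amounts such as 0.10 are represented exactly.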

Device Fee

Each device configured in CloudBender incurs a fixed fee of 15 credits per month. The fee calculation begins as soon as the device is added to CloudBender and continues until it is removed. Offline devices still incur the monthly fee. The fee is calculated and charged daily based on the total number of devices configured that day.

For example, if you have 10 devices configured in CloudBender, you will be charged 5 (15 * 10 / 30) credits per day in a month with 30 days. If you start a day with 10 devices and add 5 more during the day, you will be charged 7.5 (15 * 15 / 30) credits, no matter what time of day you add the devices. If you start the day with 10 devices and remove 5 devices during the day, you will be charged 5 (15 * 10 / 30) credits on that day and 2.5 (15 * 5 / 30) credits on subsequent days. If you start a day with 10 devices, then add and remove a device 5 times, you will be charged 7.5 (15 * 15 / 30) credits, since each addition represents a unique device, and 5 credits on subsequent days (since the added devices were all removed by the end of the day).
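The daily charge in these examples follows a single formula: 15 credits, multiplied by every device that was configured at any point during the day, divided by the number of days in the month. A minimal sketch of that arithmetic follows. The function and parameter names are hypothetical.

```python
import calendar


def daily_device_fee(devices_at_start: int, devices_added: int, days_in_month: int) -> float:
    """Return the device fee in credits for one day.

    Devices removed during the day still count for that day, so only the
    starting count and the number of additions matter.
    """
    monthly_fee_per_device = 15
    devices_billed = devices_at_start + devices_added
    return monthly_fee_per_device * devices_billed / days_in_month


# calendar.monthrange returns (weekday of first day, number of days).
days = calendar.monthrange(2024, 6)[1]  # June has 30 days
print(daily_device_fee(10, 0, days))  # prints 5.0
print(daily_device_fee(10, 5, days))  # prints 7.5
```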

Earning Credits

If you configure your CloudBender region to allow other customers to consume your spare resources, you can earn credits that can be used to offset your own trainML resource costs or paid out in cash. When you service another customer's job, you receive a credit at the end of each hour the job has been running. When a job stops, you receive the credit for the partial time the job ran. The customer pays the full hourly price for the GPU. From this, trainML deducts the standard usage fee based on the GPU type, and splits the remaining credits evenly with you.

For example, if you add a system with A100 GPUs, other trainML customers will pay 2.78 credits per hour per GPU. At the end of each hour a job runs, you will receive 1.265 credits ((2.78 - 0.25) / 2) per GPU.
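The same split can be written as a small calculation. The sketch below reproduces the A100 example. The function name is hypothetical, and the linear proration of partial hours is an assumption about how "the partial time the job ran" is credited.

```python
from decimal import Decimal


def earned_credits(customer_price, usage_fee, gpus=1, hours="1"):
    """Return the credits earned for servicing another customer's job.

    The usage fee is deducted from the customer's hourly price and the
    remainder is split evenly. Partial hours are prorated linearly
    (an assumption).
    """
    per_gpu_hour = (Decimal(customer_price) - Decimal(usage_fee)) / 2
    return per_gpu_hour * gpus * Decimal(hours)


# One A100 GPU for one hour: (2.78 - 0.25) / 2
print(earned_credits("2.78", "0.25"))  # prints 1.265
```

Prices are passed as strings so that Decimal represents them exactly; with floats, 2.78 - 0.25 would pick up a small rounding error.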

Payment Modes

Credits - All accumulated credits will be refunded to your billing account every 24 hours. There is no fee associated with this transfer.

Stripe (coming soon) - Receive cash payouts of your credit balance through Stripe Connect.
