Secure Data Processing With Regional Datastores


CloudBender™ now allows you to connect directly to regional datastores without having to copy that data into or out of trainML persistent storage.

Motivation

trainML Datasets make it easy to load and reuse training data across trainML's scalable compute platform. However, sometimes data is either too large or too sensitive to upload as a dataset. CloudBender customers can now directly mount local datasets to jobs they run in their own CloudBender regions by configuring Datastores. This allows customers to reuse their existing data storage infrastructure and ensures that all data access occurs over the local LAN, providing both additional convenience and security.

How It Works

info

Regional datastores require a Team or higher feature plan.

To configure and use regional datastores, you must first have a CloudBender region with at least one compute node. From that region's dashboard, click the Add button on the Datastores toolbar and fill out the form with the connection details for the datastore in your region. Currently, only NFS and SMB/CIFS datastore types are supported.

Once the datastore has been added, you can select Regional Datastore as the source type when creating datasets or as the input or output data location for jobs. Additionally, unlike other data output types, regional datastores can be attached to both Notebook and Endpoint jobs.
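As a rough sketch, creating a dataset from a regional datastore with the trainML SDK could look like the following. The source_type and source_uri arguments mirror the job example in the Using the SDK section below; the source_options argument, the placeholder datastore ID, and the wait_for call are assumptions, so check the SDK documentation for the exact signature.

dataset = await trainml.datasets.create(
    name="local-training-data",             # hypothetical dataset name
    source_type="regional",                 # load from a regional datastore
    source_uri="<datastore-id>",            # placeholder: the datastore's ID
    source_options=dict(path="/training"),  # assumed: subdirectory within the datastore
)
await dataset.wait_for("ready")             # assumed: block until loading completes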

Regional datastores can also be used to run inference tasks as new files are created in a specific directory. For an example of how to use trainML Endpoints with regional datastores, see our example repository.
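Attaching a regional datastore to an Endpoint job follows the same pattern as the SDK example below. This sketch elides the endpoint-specific fields (model code, routes) with an ellipsis, and the datastore ID and watch directory are hypothetical placeholders:

endpoint = await trainml.jobs.create(
    name="Directory Inference Endpoint",
    type="endpoint",                           # endpoints can mount regional datastores
    ...
    data=dict(
        input_type="regional",
        input_uri="<datastore-id>",            # placeholder: the datastore's ID
        input_options=dict(path="/incoming"),  # assumed: directory where new files arrive
    ),
)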

Using the SDK

Once a datastore has been defined in one of your CloudBender regions, jobs and datasets in projects you own can be configured to access it by specifying regional as the data source type. The source URI field must be set to the ID of the datastore you wish to use. To obtain the datastore ID, open the web UI and unhide the ID column using the Columns button on the datastore table. The source options field must be set to a dictionary whose path key specifies the subdirectory within the datastore to use.

job = await trainml.jobs.create(
    name="Regional Datastores Inference",
    type="inference",
    ...
    data=dict(
        input_type="regional",
        # The input and output URIs are datastore IDs from the region dashboard
        input_uri="5d29dcae-b1aa-4629-be67-548db5a141a1",
        input_options=dict(path="/input"),    # subdirectory to read input data from
        output_type="regional",
        output_uri="09eec266-c9ce-47ab-ab55-d463f905e6f3",
        output_options=dict(path="/output"),  # subdirectory to write results to
    ),
)