Google Cloud Storage Integration Released


trainML training jobs can now run on data stored in Google Cloud Storage and upload their results to the same or another bucket. GCP access credentials can also be attached to notebooks and training job workers to provide them with easy access to other GCP services.

How It Works

To use the GCP storage option, you must first create a GCP service account for trainML to use and upload the credentials to the platform.

Add GCP Service Account Credentials

warning

Never provide trainML (or anyone, for that matter) credentials for a Google service account with admin privileges.

Create a new service account in the GCP project that contains the data or services you want the trainML platform to interact with. When creating the account, scope its permissions as narrowly as possible, allowing access only to the specific data or services the model training process needs. For example, to download data from the /data path of the input-data-bucket bucket, assign the Storage Object Viewer role with a condition of type Name, operator Starts With, and value projects/_/buckets/input-data-bucket/objects/data/. To upload data to the /results path of the artifacts-bucket bucket, assign the Storage Object Viewer role with a condition of type Name, operator Starts With, and value projects/_/buckets/artifacts-bucket, as well as the Storage Object Creator role with a condition of type Name, operator Starts With, and value projects/_/buckets/artifacts-bucket/objects/results/. Full read access is required on the output bucket because gsutil needs bucket-level read access to copy objects. For more details about conditions and resource names on buckets and objects, review the GCP documentation.
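
If you prefer to script the role binding rather than use the console, the same condition can be attached with the google-cloud-storage Python client. The following is a minimal sketch, not the documented trainML flow; the project name my-project and service account trainml-reader are hypothetical placeholders for your own values:

from google.cloud import storage

# Assumes you are running locally, authenticated as a user who can
# administer IAM on the bucket. Names below are hypothetical examples.
client = storage.Client(project="my-project")
bucket = client.bucket("input-data-bucket")

# IAM conditions require policy version 3.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.version = 3
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"serviceAccount:trainml-reader@my-project.iam.gserviceaccount.com"},
        "condition": {
            "title": "data-prefix-only",
            "description": "Read access limited to the /data path",
            "expression": 'resource.name.startsWith("projects/_/buckets/input-data-bucket/objects/data/")',
        },
    }
)
bucket.set_iam_policy(policy)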

Once the service account is created, create and download a service account key JSON file. Then go to the trainML Account Profile page and select GCP from the Add menu under Third-Party Keys. Click the Upload Json File button, select the JSON file you downloaded, and click the check button.

Using GCP Credentials in a Training Job

Once GCP credentials are configured in your account, create a new training job, specifying the required resources, model code, and worker command. For Input Data Type, select GCP from the dropdown and enter the full bucket URL of the data you want to download, e.g. gs://input-data-bucket/data. For the Output Type, select GCP and specify the full bucket URL of the location where you want to store the model artifacts, e.g. gs://artifacts-bucket/results. The trainML platform will automatically download the data you requested and upload the workers' results to the bucket path you specified when training is complete.

Additionally, if the worker script uses other GCP services, you can select GCP from the Add Third-Party Keys to Workers menu. To use the credentials, you must first activate them in the job environment by including the following command in your script before accessing GCP services:

gcloud auth activate-service-account --key-file ${GOOGLE_APPLICATION_CREDENTIALS}

Alternatively, if you're using the Python SDK directly, you can load the service account credentials with the from_service_account_json method, passing it the location of the key file from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
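
For example, a worker could write its artifacts with the google-cloud-storage client. This is a minimal sketch that reuses the hypothetical artifacts-bucket from the example above and assumes the trained model was saved locally as model.pt:

import os
from google.cloud import storage

# The trainML platform sets GOOGLE_APPLICATION_CREDENTIALS to the path of
# the service account key file attached to the worker.
client = storage.Client.from_service_account_json(
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
)

# Upload a local artifact to the /results path of the hypothetical artifacts-bucket.
bucket = client.bucket("artifacts-bucket")
bucket.blob("results/model.pt").upload_from_filename("model.pt")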