trainML training jobs can now run on data stored in Google Cloud Storage and upload their results to the same or another bucket. GCP access credentials can also be attached to notebooks and training job workers to provide them with easy access to other GCP services.
How It Works
To use the GCP storage option, you must first create a GCP service account for trainML to use and upload the credentials to the platform.
Add GCP Service Account Credentials
Never provide trainML (or anyone, for that matter) credentials for a Google service account with admin privileges.
Create a new service account in the GCP project that contains the data or services you want the trainML platform to interact with. When creating the account, configure its permissions as narrowly as possible, allowing access only to the specific data or services needed for the model training process.
For example, if you want to download data from the /data path of the input-data-bucket bucket, assign the Storage Object Viewer role with a condition of type Name, operator Starts With, and value projects/_/buckets/input-data-bucket/objects/data/. If you want to upload data to the /results path of the artifacts-bucket bucket, assign the Storage Object Viewer role with a condition of type Name, operator Starts With, and value projects/_/buckets/artifacts-bucket, as well as the Storage Object Creator role with a condition of type Name, operator Starts With, and value projects/_/buckets/artifacts-bucket/objects/results/. Full read access is required on the output bucket because gsutil requires bucket-level read access in order to copy objects. For more details about conditions and resource names for buckets and objects, review the GCP documentation.
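Before uploading the key to trainML, you can sanity-check that the conditions are scoped as intended. Below is a minimal sketch using the google-cloud-storage Python client; key.json and the object names are hypothetical placeholders, while the bucket and prefix follow the example above:

# Minimal sketch: confirm the key can only read the intended prefix.
# "key.json" and the object names are hypothetical placeholders.
from google.cloud import storage
from google.api_core.exceptions import Forbidden

client = storage.Client.from_service_account_json("key.json")
bucket = client.bucket("input-data-bucket")

# Allowed: the object name starts with data/, satisfying the IAM condition.
print(bucket.blob("data/sample.csv").download_as_bytes()[:80])

# Denied: objects outside the conditioned prefix raise Forbidden.
try:
    bucket.blob("other/file.csv").download_as_bytes()
except Forbidden:
    print("access outside data/ correctly denied")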
Once the service account is created, create and download a service account key as a JSON file. Go to the trainML Account Profile page and select GCP from the Add menu under Third-Party Keys. Click the Upload Json File button, select the JSON key file you downloaded, and click the check button.
Using GCP Credentials in a Training Job
Once GCP credentials are configured in your account, create a new training job, specifying the required resources, model code, and worker commands. For Input Data Type, select GCP from the dropdown and enter the full bucket URL of the data you want to download, e.g. gs://input-data-bucket/data. For the Output Type, select GCP and specify the full bucket URL of the location where the model artifacts should be stored, e.g. gs://artifacts-bucket/results. The trainML platform will automatically download the requested data before training begins and upload each worker's results to the bucket path you specified when training is complete.
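Because the platform handles these transfers, a worker script does not need GCP credentials just to read the input data or save its results. The following minimal sketch assumes the job environment exposes the TRAINML_DATA_PATH and TRAINML_OUTPUT_PATH environment variables for the input and output directories; verify the exact variable names against the trainML environment documentation:

# Minimal worker sketch. TRAINML_DATA_PATH and TRAINML_OUTPUT_PATH are
# assumed here to point at the job's input and output directories.
import os
from pathlib import Path

data_dir = Path(os.environ["TRAINML_DATA_PATH"])
output_dir = Path(os.environ["TRAINML_OUTPUT_PATH"])

# Files downloaded from gs://input-data-bucket/data appear here.
for f in data_dir.iterdir():
    print("input file:", f.name)

# Anything written here is uploaded to gs://artifacts-bucket/results
# when training is complete.
(output_dir / "model-summary.txt").write_text("training complete\n")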
Additionally, if the worker script itself uses GCP services, you can select GCP from the Add Third-Party Keys to Workers menu. To use the credentials, you must first activate them in the job environment by including the following command in your script prior to accessing GCP services:
gcloud auth activate-service-account --key-file ${GOOGLE_APPLICATION_CREDENTIALS}
Alternatively, if you're using the Python SDK directly, you can activate the service account credentials with the from_service_account_json function, passing the location of the key file from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
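For example, with the google-cloud-storage client, a worker could upload an artifact directly. This is a minimal sketch; metrics.json is a hypothetical file name, and the bucket path follows the example above:

# Minimal sketch: build a client from the key file trainML provides via
# GOOGLE_APPLICATION_CREDENTIALS and upload a (hypothetical) artifact.
import os
from google.cloud import storage

client = storage.Client.from_service_account_json(
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
)
bucket = client.bucket("artifacts-bucket")
bucket.blob("results/metrics.json").upload_from_filename("metrics.json")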