trainML Jobs on Google Cloud Platform Instances
The trainML platform now supports creating jobs and datasets on GPUs hosted by Google Cloud Platform (GCP).
How It Works
GCP instances are an opt-in feature that must be enabled for your account. Contact trainML support to enable it.
GCP instances can be accessed by selecting GCP from the Provider dropdown in the main side panel navigation bar. If you do not see this dropdown, contact trainML support to enable it. Click the Create button from the notebook dashboard. In the GPU Type field, you will see the available GCP instance types. You can create a job as normal and, once the job is running, interact with it exactly as you would with a trainML provider instance.
When a given provider is selected, you will only see the jobs and datasets that exist in that provider in the job and dataset dashboards. This is because the jobs and datasets for each provider are separate and cannot interact with each other. GCP jobs can only use datasets that have been created in the GCP provider, and vice versa. trainML jobs cannot be transferred or copied to GCP instances. If you have a dataset you need to use for jobs in both environments, you will need to create it in both providers and will incur storage charges for each.
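Because jobs and datasets are scoped to a single provider, any automation you build around these resources must track which provider each one belongs to. The sketch below is purely illustrative (the classes and the validation helper are hypothetical, not part of the trainML SDK); it models the rule that a job can only attach datasets from its own provider:

```python
from dataclasses import dataclass

# Hypothetical resource records -- illustrative only, not the trainML SDK.
@dataclass(frozen=True)
class Dataset:
    name: str
    provider: str  # e.g. "trainml" or "gcp"

@dataclass(frozen=True)
class Job:
    name: str
    provider: str
    datasets: tuple

def validate_job(job: Job) -> None:
    """Raise if the job attaches a dataset from a different provider.

    Mirrors the platform rule: GCP jobs can only use GCP datasets,
    and trainML jobs can only use trainML datasets.
    """
    for ds in job.datasets:
        if ds.provider != job.provider:
            raise ValueError(
                f"dataset {ds.name!r} lives in provider {ds.provider!r}, "
                f"but job {job.name!r} runs in {job.provider!r}"
            )

imagenet_gcp = Dataset("imagenet", "gcp")
ok_job = Job("train-resnet", "gcp", (imagenet_gcp,))
validate_job(ok_job)  # same provider: passes silently

bad_job = Job("train-resnet", "trainml", (imagenet_gcp,))
# validate_job(bad_job)  # cross-provider attachment: would raise ValueError
```

A check like this fails fast at submission time rather than after a job has been provisioned against a dataset it cannot reach.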
GCP Instance Types
The following GPU types are available from the GCP provider:
- T4: 16 GB of GPU memory, 7.5 TFLOPS of single-precision (FP32), 0.2 TFLOPS of half-precision (FP16)
- P100: 16 GB of GPU memory, 8 TFLOPS of single-precision (FP32), 16 TFLOPS of half-precision (FP16)
- V100: 16 GB of GPU memory, 15 TFLOPS of single-precision (FP32), 25.5 TFLOPS of half-precision (FP16)
- A100: 40 GB of GPU memory, 39 TFLOPS of single-precision (FP32), 62.5 TFLOPS of half-precision (FP16)
GCP instances automatically select the necessary number of CPUs and amount of memory based on the number of GPUs per worker to ensure optimal GPU utilization without overprovisioning. All instance storage uses provisioned SSD. The minimum disk space per job worker is 50 GB, most of which is consumed by the operating system and deep learning frameworks.
Notable Differences
- Create, start, and stop times are significantly longer (sometimes several minutes) with the GCP provider. This is a constraint from GCP itself, as GPU accelerated instances take significant time to provision in their environment.
- When creating datasets, you are required to specify the disk size to allocate for the dataset. This size is used to calculate storage billing both for the dataset itself and for any jobs you attach it to. If the data exceeds the specified size, the dataset creation task will terminate when the storage is full and the dataset will be incomplete.
- Datasets added to jobs increase the allocated job storage by the size of the dataset. For example, if you select 60 GB for the worker disk size, and attach a 100 GB dataset, the total storage used by each worker is 160 GB.
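The per-worker storage arithmetic above can be expressed as a small helper (illustrative only, not part of the trainML tooling), which also enforces the 50 GB per-worker minimum noted earlier:

```python
def total_worker_storage_gb(worker_disk_gb, dataset_sizes_gb):
    """Total storage allocated per worker: the worker disk plus
    the allocated size of every attached dataset."""
    if worker_disk_gb < 50:  # platform minimum disk per job worker
        raise ValueError("worker disk size must be at least 50 GB")
    return worker_disk_gb + sum(dataset_sizes_gb)

# Example from the text: 60 GB worker disk + 100 GB dataset = 160 GB.
print(total_worker_storage_gb(60, [100]))  # -> 160
```

Since this total is allocated for every worker, a multi-worker job multiplies it accordingly when estimating storage charges.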
- The performance of the provisioned SSD storage depends on the size of the storage. If you need higher IOP performance, try increasing the disk size of the workers.
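Google Cloud documents SSD persistent disk performance as scaling at roughly 30 IOPS per provisioned GB, subject to per-instance caps. Treating that published figure as an assumption, a rough upper-bound estimate of the IOPS gained by resizing a worker disk looks like:

```python
# Rough estimate based on GCP's published ~30 IOPS per provisioned GB
# for SSD persistent disks. Actual throughput is also capped per
# instance, so treat these numbers as an upper-bound sketch.
IOPS_PER_GB = 30

def estimated_iops(disk_gb):
    """Approximate baseline IOPS for an SSD persistent disk of disk_gb."""
    return disk_gb * IOPS_PER_GB

for size in (50, 200, 500):
    print(f"{size} GB -> ~{estimated_iops(size)} IOPS")
```

For example, growing a worker disk from the 50 GB minimum to 500 GB raises the estimated baseline from roughly 1,500 to roughly 15,000 IOPS, which is why increasing disk size is the suggested remedy for I/O-bound workloads.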