trainML Documentation
trainML Jobs on Google Cloud Platform Instances

January 22, 2021

trainML

The trainML platform now supports creating jobs and datasets on GPUs hosted by Google Cloud Platform (GCP).

How It Works

GCP instances are an opt-in feature that must be enabled for your account. Contact trainML support to enable it.

GCP instances can be accessed by selecting GCP from the Provider dropdown in the main side panel navigation. Click the Create button on the notebook dashboard; the GPU Type field will list the available GCP instance types. Create the job as usual, and once it is running, interact with it exactly as you would with a trainML provider instance.

When a given provider is selected, the job and dataset dashboards show only the jobs and datasets that exist in that provider. Jobs and datasets are isolated per provider and cannot interact with each other: GCP jobs can only use datasets created in the GCP provider, and vice versa, and trainML jobs cannot be transferred or copied to GCP instances. If you need the same dataset for jobs in both environments, you must create it in each provider and will incur storage charges for each copy.

GCP Instance Types

The following GPU types are available from the GCP provider:

  • T4: 16 GB of GPU memory, 7.5 TFLOPS of single-precision (FP32), 65 TFLOPS of half-precision (FP16)
  • P100: 16 GB of GPU memory, 8 TFLOPS of single-precision (FP32), 16 TFLOPS of half-precision (FP16)
  • V100: 16 GB of GPU memory, 15 TFLOPS of single-precision (FP32), 25.5 TFLOPS of half-precision (FP16)
  • A100: 40 GB of GPU memory, 39 TFLOPS of single-precision (FP32), 62.5 TFLOPS of half-precision (FP16)

GCP instances automatically select the necessary number of CPUs and amount of memory based on the number of GPUs per worker, ensuring optimal GPU utilization without overprovisioning. All instance storage uses provisioned SSD. The minimum disk space per job worker is 50 GB, most of which is consumed by the operating system and deep learning frameworks.
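trainML does not publish its exact allocation formula, but the idea of scaling CPUs and memory with the per-worker GPU count can be sketched as follows. The `cpus_per_gpu` and `mem_gb_per_gpu` ratios below are illustrative assumptions, not trainML's actual values; only the 50 GB minimum disk comes from this post.

```python
# Illustrative sketch only: trainML does not publish its allocation ratios.
# The per-GPU CPU and memory values below are hypothetical assumptions.
MIN_DISK_GB = 50  # documented minimum disk space per job worker


def worker_resources(gpus_per_worker: int,
                     cpus_per_gpu: int = 8,      # assumed ratio
                     mem_gb_per_gpu: int = 30):  # assumed ratio
    """Scale CPUs and memory linearly with each worker's GPU count."""
    return {
        "cpus": gpus_per_worker * cpus_per_gpu,
        "memory_gb": gpus_per_worker * mem_gb_per_gpu,
        "min_disk_gb": MIN_DISK_GB,
    }
```

Under these assumed ratios, a two-GPU worker would receive 16 CPUs and 60 GB of memory; the real platform picks the sizes for you, so there is nothing to configure.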

Notable Differences

  • Create, start, and stop times are significantly longer (sometimes several minutes) with the GCP provider. This is a constraint of GCP itself: GPU-accelerated instances take significant time to provision in its environment.
  • When creating datasets, you are required to specify the disk size to allocate for the dataset. This size is the amount used to calculate storage billing, both for the dataset itself and for any jobs you attach it to. If the data exceeds the specified size, the dataset creation task will terminate when the storage is full and the dataset will be incomplete.
  • Datasets added to jobs increase the allocated job storage by the size of the dataset. For example, if you select 60 GB for the worker disk size, and attach a 100 GB dataset, the total storage used by each worker is 160 GB.
  • The performance of the provisioned SSD storage depends on its size. If you need higher IOPS, try increasing the disk size of the workers.
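The per-worker storage arithmetic described above can be sketched as a one-line calculation: each attached dataset adds its full allocated size on top of the worker's own disk.

```python
def total_worker_storage_gb(worker_disk_gb, dataset_sizes_gb):
    """Total provisioned SSD per worker: the worker disk plus the
    allocated size of every attached dataset."""
    return worker_disk_gb + sum(dataset_sizes_gb)


# Example from the post: a 60 GB worker disk with a 100 GB dataset attached
total_worker_storage_gb(60, [100])  # returns 160
```

Since the provisioned SSD's IOPS scale with its size, this total also determines the storage performance each worker sees.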