Kaggle Datasets and API Integration

Customers who use trainML to compete in Kaggle competitions or to analyze public Kaggle datasets can now populate trainML datasets directly from Kaggle competitions or datasets. They can also automatically load their Kaggle account credentials into notebook and training jobs for competition or kernel submissions.

How It Works

To enable Kaggle integration in the trainML platform, you must first generate a Kaggle API token. Instructions to generate a new token can be found here. If you are already using the Kaggle CLI tool on your local computer, the API token is usually located at $HOME/.kaggle/kaggle.json.
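Before uploading the token, it can help to confirm that the file is intact. The following is a minimal sketch, not part of the trainML platform: it assumes the token is a JSON object with `username` and `key` fields, which is the layout the Kaggle CLI writes, and the helper name `check_kaggle_token` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_token(path):
    """Return a list of problems found in a kaggle.json file (empty if none).

    Hypothetical helper for illustration; it only inspects the local file
    and never contacts Kaggle or trainML.
    """
    token_path = Path(path).expanduser()
    if not token_path.is_file():
        return [f"{token_path} does not exist"]
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"{token_path} is not valid JSON: {exc}"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]
    # Assumption: a usable token carries non-empty "username" and "key" fields.
    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing or empty field: {field}")
    return problems


if __name__ == "__main__":
    issues = check_kaggle_token("~/.kaggle/kaggle.json")
    print("token looks usable" if not issues else "\n".join(issues))
```

An empty result only means the file is well formed; it does not prove that Kaggle still accepts the key.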

Adding Kaggle API Credentials

Once you have the kaggle.json file for your account, you can upload it to the trainML platform in the Third-Party Keys section of the Settings page. To reach this page, click the Settings menu option on the sidebar or click your account name in the upper right of the toolbar. Navigate to the Third-Party Keys section and click the Add button. Select Kaggle from the list. Click the Upload Json File button that appears next to the trophy icon and select the kaggle.json file from your local computer. Click the blue checkmark button to upload the file. If the upload succeeds, you should see Credentials File: kaggle.json next to the trophy icon. For security reasons, you cannot download the file again once it is uploaded; you can only remove it or upload a new file.

Populating a Dataset from Kaggle

Click the Datasets option on the sidebar and click the Create button to open the new dataset form. Enter a name for the new dataset in the Name field and select Kaggle from the Source Type dropdown. From the Type field, select Competition if you are downloading the data for a competition you have entered or Dataset if you are downloading other public or personal datasets.

caution

You can only download competition datasets if you have already read and accepted the competition rules on the Kaggle website.

For the Path field, you must specify the short name Kaggle uses for the competition or dataset. The two easiest ways to find this short name are:

  1. The URL path of the competition or dataset you wish to download. For example, if you are viewing this dataset on the 2020 US Election in your web browser, the URL in your address bar is https://www.kaggle.com/unanimad/us-election-2020. To download this dataset into trainML, enter the part of the URL after www.kaggle.com/ in the Path field: unanimad/us-election-2020. If you are viewing the Mechanisms of Action competition in your web browser, the URL in your address bar is https://www.kaggle.com/c/lish-moa. To download this competition's data into trainML, enter the part of the URL after www.kaggle.com/c/ in the Path field: lish-moa.
  2. Viewing the API command from the Kaggle web interface. For datasets, click the triple dot button on the far right side of the Kaggle Dataset menu bar, next to the New Notebook button, to find the Copy API command button. Clicking it for this dataset on the 2020 US Election copies kaggle datasets download -d unanimad/us-election-2020 to your clipboard. To download this dataset into trainML, enter the part of the command after download -d in the Path field: unanimad/us-election-2020. For a competition, click the Data tab on the Kaggle Competition menu bar; the API command to download the data is listed right above the Data Explorer. If you are viewing the Mechanisms of Action competition, you will see kaggle competitions download -c lish-moa. To download this competition's data into trainML, enter the part of the command after download -c in the Path field: lish-moa.
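
The two lookup methods above can be sketched as a small helper that takes either a Kaggle URL or a copied API command and returns the value for the Path field. This is an illustrative sketch only: the function name `kaggle_path` is hypothetical, and it handles just the URL and command forms shown in this post.

```python
from urllib.parse import urlparse


def kaggle_path(text):
    """Derive the Path field value from a Kaggle URL or a copied API command.

    Hypothetical helper; it covers only the forms described in this post.
    """
    text = text.strip()
    if text.startswith("kaggle "):
        # API command form: the short name follows the -d or -c flag.
        tokens = text.split()
        for flag in ("-d", "-c"):
            if flag in tokens and tokens.index(flag) + 1 < len(tokens):
                return tokens[tokens.index(flag) + 1]
        raise ValueError(f"no -d or -c argument found in: {text}")
    # URL form: inspect the path components after the host name.
    parts = [p for p in urlparse(text).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "c":
        return parts[1]  # competition: the component after /c/
    if len(parts) >= 2:
        return "/".join(parts[:2])  # dataset: owner/dataset-name
    raise ValueError(f"cannot derive a Kaggle path from: {text}")


if __name__ == "__main__":
    print(kaggle_path("https://www.kaggle.com/unanimad/us-election-2020"))
    print(kaggle_path("kaggle competitions download -c lish-moa"))
```

Inputs in any other layout raise a ValueError rather than guessing at a short name.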

Click Create to submit the form. The trainML platform will automatically download the dataset and extract it for use on subsequent jobs. Once the dataset is Ready, you can attach it to any number of notebooks or training jobs concurrently.

Adding Kaggle Credentials to a Notebook

To give a notebook or training job access to your Kaggle account, select the Kaggle option from the Third-Party Keys field in the Environment section of the job form. Once the job is started, the kaggle CLI will be automatically configured to utilize these credentials. Additional instructions for interacting with Kaggle competitions using the kaggle CLI can be found here.
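
As a rough illustration of using the configured CLI from inside a job, the sketch below wraps the `kaggle competitions submit` command in Python. The helper names are hypothetical, the competition name, file name, and message are placeholders, and whether a given competition accepts file submissions through the CLI depends on that competition's rules.

```python
import shutil
import subprocess


def build_submit_command(competition, file_name, message):
    """Build the argument list for `kaggle competitions submit`."""
    return [
        "kaggle", "competitions", "submit",
        "-c", competition,  # competition short name, e.g. the Path value
        "-f", file_name,    # submission file to upload
        "-m", message,      # description shown alongside the submission
    ]


def submit(competition, file_name, message):
    """Run the submission; assumes the kaggle CLI is installed and configured."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found on PATH")
    # check=True raises CalledProcessError if the CLI exits with an error.
    return subprocess.run(
        build_submit_command(competition, file_name, message), check=True
    )


if __name__ == "__main__":
    # Placeholder values for illustration only; nothing is submitted here.
    print(" ".join(build_submit_command("lish-moa", "submission.csv", "first attempt")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the message contains spaces.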