Use Google Cloud Storage in Python to Backup and Load your AI/ML Data

Amazon S3 object storage is the industry standard for inexpensive storage of large files. All major cloud vendors offer comparable object storage services with a similar access model and largely comparable features and pricing.

S3-style object storage forms the backbone of modern cloud storage, providing scalable, high-performance storage for a wide range of use cases such as backups, disaster recovery, big data analytics, and archiving.

This blog post shows how to access Google Cloud's S3-equivalent service, Google Cloud Storage, with a simple Python script that stores and retrieves a file.

The idea here is to use Google Cloud Storage to export data from an observability platform for offline AI/ML training.

Install the Google Cloud CLI

As a first step, install the Google Cloud CLI (gcloud). We will use it to create a credentials file that our Python script needs in order to access Google Cloud Storage.

Read about how to download and install the Google Cloud CLI in the official documentation.
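The installation steps depend on your operating system; on Linux or macOS, for example, you can use the interactive installer and then initialize the CLI (the commands below are one possible path, so check the installation guide for your platform):

# Example only: install the Google Cloud CLI and initialize it
curl https://sdk.cloud.google.com | bash
gcloud init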

The Python import/export script for Google Cloud Storage

Now we will implement a simple Python script for uploading data to and downloading data from Google Cloud Storage.

Refer to the Google Cloud Storage API help page for the full details of the available API.
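The script uses the official Google Cloud Storage client library for Python, which is typically installed with pip:

pip install google-cloud-storage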

The Python script below loads the ADC credentials JSON file and offers two functions: one for uploading given data into a Google Cloud Storage bucket and another for loading the data back from that bucket:

# Imports the Google Cloud client library
from google.cloud import storage
import os  
import sys

# Locate your Google Cloud ADC credential JSON file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=".../gcp/application_default_credentials.json"

def upload(project, bucket_name, blobname, data):
    print("upload")
    # Initialize the client by specifying your Google Cloud project id
    storage_client = storage.Client(project=project)

    # Get a reference to the existing bucket
    bucket = storage_client.bucket(bucket_name)

    # Create a handle for the target blob (object) inside the bucket
    blob = bucket.blob(blobname)

    # Write the given data as text into the blob
    with blob.open("w") as f:
        f.write(data)

    print(f"Data stored in bucket: {bucket.name}.")

def load(project, bucket_name, blobname):
    print("load")
    # Initialize the client by specifying your Google Cloud project id
    storage_client = storage.Client(project=project)

    # Get a reference to the existing bucket and the stored blob
    bucket = storage_client.bucket(bucket_name)

    blob = bucket.blob(blobname)

    # Read the blob contents back as text and print them
    with blob.open("r") as f:
        print(f.read())

upload("myplayground-3", "train-data-22342343242", "train.csv", "test, test, test")

load("myplayground-3", "train-data-22342343242", "train.csv")

You can also download the Python file from GitHub.
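The script above writes and reads the blob contents as text through blob.open(). If your training data already lives in local files, or is too large to keep in memory, the client library can also transfer whole files directly. Here is a minimal sketch, assuming the same project and bucket names as in the example above:

from google.cloud import storage

def upload_file(project, bucket_name, blobname, local_path):
    # Upload a local file into the bucket as a single blob
    client = storage.Client(project=project)
    blob = client.bucket(bucket_name).blob(blobname)
    blob.upload_from_filename(local_path)

def download_file(project, bucket_name, blobname, local_path):
    # Download a blob from the bucket into a local file
    client = storage.Client(project=project)
    blob = client.bucket(bucket_name).blob(blobname)
    blob.download_to_filename(local_path)

# Hypothetical usage with the names from the example above:
# upload_file("myplayground-3", "train-data-22342343242", "train.csv", "./train.csv")
# download_file("myplayground-3", "train-data-22342343242", "train.csv", "./train_copy.csv")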

If we run the script without credentials, we will get the following error, informing us that the necessary authorization file is missing:

google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application.


You can easily check whether a credentials file is already configured by reading the GOOGLE_APPLICATION_CREDENTIALS environment variable:

print(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))  # prints None if the variable is not set

If it is not set yet, you can point the environment variable to the GCP credentials file from within your Python program:

import os 

os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/path/to/file.json"

Create the ADC credential JSON file

To run the script, we first need to generate the ADC credentials JSON file using the authorization mechanism of the Google Cloud CLI.

Create the credentials file

Create your own GCP ADC credentials file for your local development environment by using credentials associated with your Google Account.

When you run the command, your browser opens and shows the Google Cloud login screen. After a successful login, the credentials are created and stored in a local JSON file.

See the gcloud command line process below: 

./gcloud auth application-default login

Your browser has been opened to visit:

https://accounts.google.com/oauth2/auth?………&code_challenge_method=S256

Credentials saved to file: [/…/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project “playground” was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.
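Once the credentials file has been created (and GOOGLE_APPLICATION_CREDENTIALS points to it, if it is not in the default location), a quick way to verify the setup is to list the buckets in your project. A minimal sketch, assuming the project id from the example above:

from google.cloud import storage

# List all buckets in the project to confirm that ADC authentication works
client = storage.Client(project="myplayground-3")
for bucket in client.list_buckets():
    print(bucket.name)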

Summary

Google Cloud Storage offers a convenient way to persist and share large amounts of data at a moderate price. It is well suited for keeping your AI/ML training data or for storing a trained AI/ML model.

By using the dedicated client libraries, such as the Python client shown here, it is straightforward to upload and download your training sets and to use Google Cloud Storage as the data backend.
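For example, a trained model can be serialized and written to a bucket in much the same way as the CSV data above. Below is a minimal sketch using pickle and the same hypothetical project and bucket names:

import pickle

from google.cloud import storage

def save_model(project, bucket_name, blobname, model):
    # Serialize the model object and store it as a binary blob
    client = storage.Client(project=project)
    blob = client.bucket(bucket_name).blob(blobname)
    with blob.open("wb") as f:
        pickle.dump(model, f)

def load_model(project, bucket_name, blobname):
    # Read the binary blob back and deserialize the model object
    client = storage.Client(project=project)
    blob = client.bucket(bucket_name).blob(blobname)
    with blob.open("rb") as f:
        return pickle.load(f)

# Hypothetical usage:
# save_model("myplayground-3", "train-data-22342343242", "model.pkl", trained_model)
# model = load_model("myplayground-3", "train-data-22342343242", "model.pkl")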