Using Google Cloud TPU for Image Classification

There’s a common thread that connects Google services such as Google Search, Street View, Google Photos and Google Translate: they all use Google’s Tensor Processing Unit, or TPU, to accelerate their neural network computations behind the scenes. Why TPUs? In short, Google found that TPUs delivered 15–30X higher performance and 30–80X higher performance per watt than contemporary CPUs and GPUs: a big performance improvement with lower energy consumption.

TPUs were announced in 2016 and are now available in beta. Since the initial announcement, Google has continued to improve the hardware; the latest, third-generation TPUs were announced in May 2018. Google says these chips are twice as powerful as the second-generation TPUs and will be deployed in pods with four times as many chips as the preceding generation, resulting in an eight-fold increase in performance per pod compared to the second-generation TPU deployment.

If your ML models are large and train on large amounts of data, using TPUs can mean faster and more efficient processing. Think of models that train for weeks or months: TPUs can dramatically shorten those training times. We’ve trained custom CNN models on customer datasets containing millions of images and have seen a marked improvement in performance. More details and metrics around the performance improvement will be provided in an upcoming blog post.

A bit about our experience implementing models on TPUs: training on CPUs or GPUs via Cloud ML was straightforward, but running the same model on a TPU requires some minor modifications. Writing the base model with the Estimator class makes those changes much easier. We highly recommend the Estimator class because it allows distributed training of a model within Cloud ML, something that’s critical for data-intensive model training. It also lets you switch between CPU and GPU with a simple configuration change.
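As a minimal sketch of that Estimator pattern (written against the TensorFlow 1.x API current at the time of writing; the layer sizes, parameter values, and checkpoint settings are illustrative, not from our production models):

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    """A toy CNN model_fn; sizes are illustrative only."""
    net = tf.layers.conv2d(features, filters=32, kernel_size=3,
                           activation=tf.nn.relu)
    net = tf.layers.flatten(net)
    logits = tf.layers.dense(net, units=params["num_classes"])

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode, predictions={"class": tf.argmax(logits, axis=1)})

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.AdamOptimizer(params["learning_rate"])
    train_op = optimizer.minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# The same model_fn runs on CPU or GPU unchanged; device placement is
# handled by TensorFlow, and distribution on Cloud ML is driven by RunConfig.
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=tf.estimator.RunConfig(save_checkpoints_steps=500),
    params={"num_classes": 10, "learning_rate": 1e-4})
```

Training is then the usual `estimator.train(input_fn=...)`; moving from a local CPU/GPU run to Cloud ML needs no change to this code, only packaging and job submission.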

If you have built your model with the Estimator class, the steps to convert it to use a TPU are described below:

  1. Follow your standard data processing steps to create TFRecords.
  2. Set up a VM with TPU support – more details here.
    • Log in to your Google Cloud Console, open a cloud shell, and run the following command to verify the configuration:
      ctpu print-config
    • Run this command to create a VM with Cloud TPU services:
      ctpu up
    • This also opens an SSH connection to the VM, and subsequent commands can be executed from the cloud shell.
  3. Make the changes outlined below to your Estimator-based code.  This document provides more details.
    • Import the TPU-specific libraries.
    • Wrap your optimizer in a CrossShardOptimizer.
    • Define the model_fn and return a TPUEstimatorSpec.
    • To run the model on a Cloud TPU, you need the TPU’s gRPC address, which you can get using
      tf.contrib.cluster_resolver.TPUClusterResolver

      Then define an Estimator-compatible configuration.
    • Create the TPUEstimator object using the configuration and the model_fn.
  4. Copy modified code to tpu_model.py on the VM.
  5. Execute the script:
    python tpu_model.py
  6. Once the model has finished executing, you can tear down the VM. Run the two commands listed below:
    • Exit the VM by typing:
      exit
    • Then execute:
      ctpu delete
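
For step 1, here is a minimal sketch of writing image/label pairs as TFRecords (the feature keys "image" and "label" and the file name are our own conventions, not required names; in practice the bytes would be JPEG-encoded image data):

```python
import tensorflow as tf

def make_example(image_bytes, label):
    """Serialize one image/label pair as a tf.train.Example."""
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))

# Write one dummy record; a real pipeline would loop over the dataset
# and typically shard the output across many TFRecord files.
with tf.python_io.TFRecordWriter("train.tfrecord") as writer:
    writer.write(make_example(b"\x00" * 16, 3).SerializeToString())
```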
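
Putting the step 3 changes together, here is a sketch of the TPU-side code (TensorFlow 1.x contrib API; the TPU name, zone, project, bucket path, and batch size are placeholders you would replace with your own values, and the model is a stand-in for your real network):

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(tf.layers.flatten(features), units=10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
    # Wrap the optimizer so gradients are aggregated across TPU shards.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(
        loss, global_step=tf.train.get_global_step())
    # Return a TPUEstimatorSpec rather than a plain EstimatorSpec.
    return tf.contrib.tpu.TPUEstimatorSpec(
        mode=mode, loss=loss, train_op=train_op)

# Resolve the TPU's gRPC address ("my-tpu" etc. are placeholders).
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu="my-tpu", zone="us-central1-b", project="my-project")

# Estimator-compatible configuration pointing at the TPU cluster.
run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir="gs://my-bucket/tpu-model",
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)  # must be divisible by the number of TPU shards
```

Training is then the usual `estimator.train(input_fn=...)`; setting `use_tpu=False` lets the same code fall back to CPU or GPU for local debugging.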


Here is a good starting point to help you troubleshoot any issues you may run into when migrating your model code to TPU.  If you are a CloudML user then there’s good news: TPUs are supported on CloudML and you can use either a default basic tier or set up your own custom machines. Using CloudML has the added benefit of distributed model training.

If you are looking to leverage the power of TPUs, here are a few things to think about:

  1. Will your model benefit from using TPUs? Does it involve many matrix computations? Is it a large model with a very large effective batch size?
  2. How is your current model code written? Is it TensorFlow? Does it use the Estimator class, or Keras? What code changes are needed to make it TPU-ready?
  3. Where does the model execute currently: on custom VMs or on Cloud ML? With TPUs, is it better to continue on custom VMs or move to Cloud ML?


At SpringML, we can help walk you through these questions, and if using TPUs makes sense for your workload, we have the expertise to guide you through that process.