How to Benchmark and Increase the Performance of BigQuery Using the BQ Storage API

Why BQ Storage API?

Google BigQuery (BQ) is a serverless, highly scalable, and cost-effective cloud data warehouse. While BQ is very powerful for operations on data that resides within BQ, there is significant overhead in getting data out of BQ and moving it to external analytical systems such as SAS, TensorFlow, and scikit-learn.

Although these analytical systems can access BQ, for example through the SAS/ACCESS Interface to Google BigQuery or through ODBC drivers, there are limitations that hinder performance. The underlying BQ REST APIs on which these integrations are built do not provide enough throughput to move the data volumes most analytics jobs need at speed. This pain point is now being addressed by the beta release of the new BQ Storage API. The BQ Storage API gives the client access to BQ's underlying storage, enabling data throughput that is significantly higher than the basic BQ REST APIs.

BQ Storage API

The BQ Storage API provides interfaces to access data in memory as Arrow, Avro, or pandas DataFrames. While many applications can work directly with these in-memory representations, some do not support these formats yet, and in some cases, because of the ephemeral nature of compute containers, it may be prudent to persist a snapshot of the data on the file system. Which approach fits best varies with the compute environment, customer requirements, size of data, and cost considerations.
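As an illustration of the in-memory path, here is a minimal sketch that creates a read session with the Python client library, requests Arrow-encoded data, and materializes the stream as a pandas DataFrame. It assumes the current google-cloud-bigquery-storage v1 client and uses placeholder project, dataset, and table names; the beta client available at the time of writing used slightly different module and method names.

```python
import pandas as pd
from google.cloud import bigquery_storage
from google.cloud.bigquery_storage import types

PROJECT_ID = "my-project"  # placeholder project
TABLE_PATH = f"projects/{PROJECT_ID}/datasets/my_dataset/tables/my_table"  # placeholder table

client = bigquery_storage.BigQueryReadClient()

# Ask the server for Arrow-encoded data; AVRO is the other supported format.
requested_session = types.ReadSession(
    table=TABLE_PATH,
    data_format=types.DataFormat.ARROW,
)

session = client.create_read_session(
    parent=f"projects/{PROJECT_ID}",
    read_session=requested_session,
    max_stream_count=1,  # a single stream keeps the example simple
)

# Read the stream and convert each page of Arrow record batches to pandas.
reader = client.read_rows(session.streams[0].name)
frames = [page.to_dataframe() for page in reader.rows(session).pages]
df = pd.concat(frames, ignore_index=True)
```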

Options to Get Data from BQ

The following are the three options to get data from BQ to the application:

  • Option 1 – BQ REST API: The application makes a BQ REST API call, fetches the data, and converts it to a pandas DataFrame
  • Option 2 – BQ to Disk: A tool orchestrates the data from BQ to GCS and from GCS to persistent disk using gsutil
    • This option only lands the data on disk; the application must then read it into memory from the local file system
    • The time taken is the sum of two components: BQ to GCS + GCS to persistent disk
  • Option 3 – BQ Storage API: The application calls the BQ Storage API to read data into memory; the returned streams are then converted to a pandas DataFrame (a minimal sketch contrasting Options 1 and 3 follows this list)
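To make the difference between Option 1 and Option 3 concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The first call pages rows back through the REST API; the second passes a BigQuery Storage client so rows are streamed through the Storage API. Project and table names are placeholders, and the parameter names reflect current versions of the client library.

```python
from google.cloud import bigquery
from google.cloud import bigquery_storage

bq_client = bigquery.Client(project="my-project")          # placeholder project
bqstorage_client = bigquery_storage.BigQueryReadClient()

query = "SELECT * FROM `my-project.my_dataset.my_table`"   # placeholder table

# Option 1 – rows are paged back through the BQ REST API (tabledata.list);
# create_bqstorage_client=False keeps newer client versions from silently
# switching to the Storage API here.
df_rest = bq_client.query(query).result().to_dataframe(create_bqstorage_client=False)

# Option 3 – rows are streamed back through the BQ Storage API.
df_storage = bq_client.query(query).result().to_dataframe(
    bqstorage_client=bqstorage_client
)
```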

[Figure: BigQuery Storage API]

Benchmarking Performance on the BQ Storage API

Core Compete did a benchmark to evaluate which of the above options would be optimal. The following are the results of the benchmark:

Compute Environments

For 1 GB, 2 GB, and 4 GB: Machine type – n1-standard-8 (8 vCPUs, 30 GB memory, 40 GB persistent disk)

For 20 GB: Machine type – n1-standard-32 (32 vCPUs, 120 GB memory, 60 GB persistent disk)

Benchmark for reading into memory:

Option            Size of Data in BQ    RAM Usage    Time to pandas DataFrame (in-memory representation)
BQ REST API       1 GB                  3.6 GB       1703 seconds
BQ Storage API    1 GB                  3.7 GB       19 seconds
BQ Storage API    2 GB                  6 GB         32 seconds
BQ Storage API    4 GB                  10 GB        60 seconds
BQ Storage API    20 GB                 43 GB        255 seconds

Observations:

BQ Storage API is significantly faster than the BQ REST API. The BQ REST API (Option 1) was not executed for 2, 4, and 20 GB because it was taking too long. For the 1 GB data set, the BQ REST API is roughly 90 times slower than the BQ Storage API (1703 seconds vs. 19 seconds). BQ to Disk (Option 2) is not applicable here, as it only lands the data on disk; the application would then have to read the data from disk into memory.

Memory consumption is high for both BQ Storage API and BQ REST API – almost twice the size of the data.
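For readers who want to reproduce this kind of measurement, the following is a hypothetical timing harness, not the scripts behind the numbers above. It times a whole-table load through the Storage API and reports peak resident memory; the table name is a placeholder, and the resource module behaves as shown on Linux only.

```python
import resource
import time

from google.cloud import bigquery
from google.cloud import bigquery_storage


def load_with_storage_api(table_id: str):
    """Time a whole-table load into a pandas DataFrame via the Storage API."""
    bq_client = bigquery.Client()
    bqstorage_client = bigquery_storage.BigQueryReadClient()

    table = bq_client.get_table(table_id)  # fetch schema/metadata first

    start = time.perf_counter()
    df = bq_client.list_rows(table).to_dataframe(bqstorage_client=bqstorage_client)
    elapsed = time.perf_counter() - start

    # Peak resident set size of this process; ru_maxrss is reported in KB on Linux.
    peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024 / 1024
    print(f"rows={len(df)} time={elapsed:.1f}s peak_rss={peak_gb:.1f} GB")
    return df


if __name__ == "__main__":
    load_with_storage_api("my-project.my_dataset.my_table")  # placeholder table
```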

Benchmark for reading data from BQ and writing to persistent disk in CSV format:

Option            Size of Data in BQ    RAM Usage    Total Time to Write to Disk (CSV format)
BQ REST API       1 GB                  3.6 GB       1858 seconds
BQ to Disk        1 GB                  NA           39 seconds
BQ Storage API    1 GB                  3.7 GB       175 seconds
BQ to Disk        2 GB                  NA           102 seconds
BQ Storage API    2 GB                  6 GB         351 seconds
BQ to Disk        4 GB                  NA           103 seconds
BQ Storage API    4 GB                  10 GB        707 seconds
BQ to Disk        20 GB                 NA           117 seconds
BQ Storage API    20 GB                 43 GB        3505 seconds

Observations:

BQ to Disk is the fastest option for writing data to disk. With the BQ Storage API, the write to disk goes through pandas, which could be made more efficient. BQ to Disk even writes to disk faster than the BQ Storage API reads the same data into memory. Memory consumption is higher for the BQ Storage API – almost twice the size of the data.

It could be argued that, for larger data sets, depending on the application reading the data, it would be better for compute environments to write the data to disk and then load it into memory, as sketched below. Applications such as SAS Viya (persistent disk) and Dataproc (GCS) can read from persistent storage into memory much faster than the Storage API delivers for larger data sets. Persisted data also provides a fallback for ephemeral compute, where data might have to be reloaded into memory.
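A rough sketch of that disk-first pattern (Option 2) follows: an extract job exports the table to GCS as CSV shards, gsutil copies them to the persistent disk, and pandas loads them into memory. Bucket, table, and path names are placeholders, and gsutil is assumed to be installed and authenticated.

```python
import glob
import pathlib
import subprocess

import pandas as pd
from google.cloud import bigquery

bq_client = bigquery.Client()

# 1. BQ -> GCS: export the table as (possibly sharded) CSV files.
table_id = "my-project.my_dataset.my_table"                  # placeholder
gcs_pattern = "gs://my-bucket/exports/my_table/part-*.csv"   # placeholder
bq_client.extract_table(table_id, gcs_pattern).result()      # wait for the extract job

# 2. GCS -> persistent disk: parallel copy with gsutil.
local_dir = pathlib.Path("/mnt/disks/data/my_table")         # placeholder
local_dir.mkdir(parents=True, exist_ok=True)
subprocess.run(
    ["gsutil", "-m", "cp", gcs_pattern, str(local_dir)],
    check=True,
)

# 3. Disk -> memory: load the CSV shards into a single DataFrame.
df = pd.concat(
    (pd.read_csv(path) for path in sorted(glob.glob(str(local_dir / "part-*.csv")))),
    ignore_index=True,
)
```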

Summary

  • The BQ Storage API provides a way forward for reading data when working with large data sets, with significantly faster performance than the BQ REST API.
  • The BQ to Disk option provides a faster alternative for getting BQ data onto a persistent file system. This is ideal for applications that have yet to integrate with the BQ Storage API. For larger volumes of data, upwards of 50 GB, using BQ to Disk and then loading the data from disk into memory could be the faster option.
  • Memory consumption of the BQ Storage API is high – almost twice the size of the data. We also observed that the performance of the BQ Storage API scales linearly with data size.
  • Data governance is a key component. Governance patterns can be enforced for the BQ Storage API, BQ REST API, and BQ to Disk options.


About the Authors

Sumanth Yamala is a Principal Architect at Core Compete.

James Miles is a Data Engineer at Core Compete.