How to Benchmark and Increase the Performance of BigQuery Using the BQ Storage API

Why BQ Storage API?

Google BigQuery (BQ) is a serverless, highly scalable, cost-effective cloud data warehouse. While BQ is very powerful for running operations on data that resides within BQ, there is significant overhead in getting data out of BQ and moving it to external analytical systems such as SAS, TensorFlow, and scikit-learn.

Although these analytical systems can access BQ, for example through the SAS/ACCESS Interface to Google BigQuery or through ODBC drivers, there are limitations that hinder performance. The underlying BQ REST APIs on which these integrations are built do not provide adequate throughput for the significant volumes of data most analytics jobs need to move at speed. This pain point is now being addressed with the beta release of the new BQ Storage API, which gives the client access to BQ's underlying storage, enabling significantly higher data throughput than the basic BQ REST APIs.

BQ Storage API

The BQ Storage API provides interfaces to access data in memory as Arrow record batches, Avro rows, or pandas DataFrames. While many applications can leverage the in-memory data representation, some applications may not support these formats yet, and in some cases, because of the ephemeral nature of compute containers, it might be prudent for those applications to have a snapshot of the data available on the file system. These scenarios vary with compute environment, customer requirements, size of data, and cost considerations.
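In Python, reading a table straight from BQ managed storage looks roughly as follows. This is a sketch against the google-cloud-bigquery-storage client (call shapes vary slightly between library versions), and the project, dataset, and table names are placeholders:

```python
def storage_api_to_dataframe(project, dataset, table, max_streams=1):
    """Read a BQ table into a pandas DataFrame via the Storage API.

    Sketch only: requires google-cloud-bigquery-storage and pandas, and
    valid Google Cloud credentials; all identifiers are illustrative.
    """
    from google.cloud import bigquery_storage  # third-party dependency
    import pandas as pd

    client = bigquery_storage.BigQueryReadClient()
    session = client.create_read_session(
        parent=f"projects/{project}",
        read_session=bigquery_storage.types.ReadSession(
            table=f"projects/{project}/datasets/{dataset}/tables/{table}",
            # Arrow is one of the supported wire formats alongside Avro.
            data_format=bigquery_storage.types.DataFormat.ARROW,
        ),
        max_stream_count=max_streams,
    )
    # Each stream can be read and converted independently (and in
    # parallel, with more streams); here we read them sequentially.
    frames = [
        client.read_rows(stream.name).to_dataframe()
        for stream in session.streams
    ]
    return pd.concat(frames, ignore_index=True)
```

Requesting more than one stream lets a multi-core client parallelize the download, which is where much of the Storage API's throughput advantage comes from.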

Options to Get Data from BQ

There are three options for getting data from BQ to the application:

  • Option 1 – BQ REST API: The application fetches rows through a BQ REST API call and converts them to a pandas DataFrame
  • Option 2 – BQ to Disk: A tool orchestrates data from BQ to GCS and from GCS to persistent disk using gsutil
    • This option only lands the data on disk; the application would then have to read it into memory from the local file system
    • Time taken is the sum of two components: BQ to GCS + GCS to persistent disk
  • Option 3 – BQ Storage API: The application reads data into memory through the BQ Storage API; the streams of data are then converted to a pandas DataFrame
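The difference between Options 1 and 3 is visible even through the standard google-cloud-bigquery client, which can use either transport. A minimal sketch (the table ID is a placeholder; note that `create_bqstorage_client=True` silently falls back to the REST path when the Storage API package is not installed):

```python
def table_to_dataframe(table_id, use_storage_api=True):
    """Fetch a whole BQ table as a pandas DataFrame.

    Sketch only: requires google-cloud-bigquery and credentials.
    With use_storage_api=False, rows page back over the BQ REST API
    (Option 1); with True, the client downloads in parallel over the
    BQ Storage API (Option 3) when that package is available.
    """
    from google.cloud import bigquery  # third-party dependency

    client = bigquery.Client()
    # table_id is illustrative, e.g. "my-project.my_dataset.my_table".
    rows = client.list_rows(table_id)
    return rows.to_dataframe(create_bqstorage_client=use_storage_api)
```

Because the switch is a single keyword argument, the two transports can be benchmarked against the same table with no other code changes.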

[Figure: BigQuery Storage API]

Benchmarking Performance on the BQ Storage API

Core Compete ran a benchmark to evaluate which of the above options is optimal. The results follow.

Compute Environments

For 1 GB, 2 GB and 4 GB: Machine type – n1-standard-8 (8 vCPUs, 30 GB memory, 40 GB persistent disk)

For 20 GB: Machine type – n1-standard-32 (32 vCPUs, 120 GB memory, 60 GB persistent disk)

Benchmark for reading into memory:

Option         | Size of Data in BQ | RAM Usage | Time to pandas DataFrame (in-memory representation)
BQ REST API    | 1 GB               | 3.6 GB    | 1703 seconds
BQ Storage API | 1 GB               | 3.7 GB    | 19 seconds
BQ Storage API | 2 GB               | 6 GB      | 32 seconds
BQ Storage API | 4 GB               | 10 GB     | 60 seconds
BQ Storage API | 20 GB              | 43 GB     | 255 seconds


The BQ Storage API is significantly faster than the BQ REST API. The BQ REST API (Option 1) was not executed for 2, 4, and 20 GB because it was taking too long. For the 1 GB table, the BQ REST API is nearly 90 times slower than the BQ Storage API (1703 vs. 19 seconds). BQ to Disk (Option 2) is not applicable here, as it only lands the data on disk; the application would then have to read the data from disk into memory.

Memory consumption is high for both BQ Storage API and BQ REST API – almost twice the size of the data.
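Timings and memory figures like these can be gathered with a small harness. A minimal stdlib-only sketch (note that tracemalloc sees only Python-heap allocations, so buffers that Arrow or pandas allocate in C will understate the OS-level RAM usage reported in these tables):

```python
import time
import tracemalloc


def benchmark(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_bytes).

    peak_bytes is the peak Python-heap allocation during the call;
    OS-level RSS (as reported in the tables above) will be higher
    for libraries that allocate outside the Python heap.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak
```

Wrapping each of the three options in a call to `benchmark` makes the comparisons repeatable on any machine type.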

Benchmark for reading data from BQ and writing to persistent disk in CSV format:

Option         | Size of Data in BQ | RAM Usage | Total Time Taken to Write to Disk (CSV format)
BQ REST API    | 1 GB               | 3.6 GB    | 1858 seconds
BQ to Disk     | 1 GB               | NA        | 39 seconds
BQ Storage API | 1 GB               | 3.7 GB    | 175 seconds
BQ to Disk     | 2 GB               | NA        | 102 seconds
BQ Storage API | 2 GB               | 6 GB      | 351 seconds
BQ to Disk     | 4 GB               | NA        | 103 seconds
BQ Storage API | 4 GB               | 10 GB     | 707 seconds
BQ to Disk     | 20 GB              | NA        | 117 seconds
BQ Storage API | 20 GB              | 43 GB     | 3505 seconds


BQ to Disk is the fastest when it comes to writing data to disk; the Storage API path, which writes to disk from pandas, leaves room for optimization. Notably, BQ to Disk lands data on disk faster than the BQ Storage API takes just to read the same data into memory. Memory consumption remains high for the BQ Storage API – almost twice the size of the data.
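The BQ to Disk path can equally be orchestrated from Python instead of gsutil. A sketch using the google-cloud-bigquery and google-cloud-storage clients (the bucket and directory names are placeholders; the wildcard URI is required because BQ shards exports larger than 1 GB into multiple files):

```python
def export_table_to_disk(project, dataset, table, bucket, local_dir):
    """Option 2: BQ -> GCS (extract job) -> persistent disk (download).

    Sketch only: requires google-cloud-bigquery, google-cloud-storage,
    and credentials; bucket and path names are illustrative.
    """
    import os
    from google.cloud import bigquery, storage  # third-party deps

    bq = bigquery.Client(project=project)
    # Wildcard URI lets BQ shard exports larger than 1 GB.
    destination_uri = f"gs://{bucket}/{table}-*.csv"
    extract_job = bq.extract_table(
        f"{project}.{dataset}.{table}", destination_uri
    )
    extract_job.result()  # wait for the BQ -> GCS extract to finish

    gcs = storage.Client(project=project)
    local_paths = []
    for blob in gcs.list_blobs(bucket, prefix=f"{table}-"):
        local_path = os.path.join(local_dir, os.path.basename(blob.name))
        blob.download_to_filename(local_path)  # GCS -> persistent disk
        local_paths.append(local_path)
    return local_paths
```

As in the benchmark, the total time is the sum of the two legs: the extract job (BQ to GCS) plus the download (GCS to persistent disk).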

It could be debated that, depending on the application reading the data, for larger data sets it would be ideal for compute environments to read data to disk first and then load it into memory. Applications like SAS Viya (persistent disk) and Dataproc (GCS) can read from persistent storage into memory much faster than the Storage API offers for larger data sets. Persistent data also provides a fallback in cases of ephemeral compute, where data might have to be reloaded into memory.


Key Takeaways

  • The BQ Storage API provides a way forward for reading large data sets, with significantly faster performance than the BQ REST API.
  • The BQ to Disk option provides a faster alternative for landing BQ data on a persistent system. This is ideal for applications that have yet to integrate with the BQ Storage API. For larger volumes of data, upwards of 50 GB, BQ to Disk followed by loading the data from disk into memory could be the faster option.
  • Memory consumption of the BQ Storage API is high, running to almost twice the size of the data. We also observe that the performance of the BQ Storage API scales linearly with data size.
  • Data governance is a key component; governance patterns can be enforced across the BQ Storage API, BQ REST API, and BQ to Disk options.



About the Authors

Sumanth Yamala is a Principal Architect at Core Compete.

James Miles is a Data Engineer at Core Compete.