How to Benchmark and Increase the Performance of BigQuery Using the BQ Storage API
Why BQ Storage API?
Google BigQuery (BQ) is a serverless, highly scalable, and cost-effective cloud data warehouse. While BQ is very powerful for running operations on data that resides within BQ, there is significant overhead in getting data out of BQ and moving it to external analytical systems such as SAS, TensorFlow, and scikit-learn.
Although these analytical systems can access BQ, for example through the SAS/ACCESS Interface to Google BigQuery or through ODBC drivers, there are limitations that hinder performance. The underlying BQ REST APIs on which these integrations are built do not provide the throughput needed to move the large data volumes most analytics jobs require at speed. This pain point is now being addressed by the beta release of the new BQ Storage API. The BQ Storage API gives clients access to BQ's underlying storage, enabling significantly higher data throughput than the standard BQ REST APIs.
BQ Storage API
The BQ Storage API provides interfaces to access data in memory as Arrow, Avro, or pandas DataFrames. While many applications can leverage these in-memory representations, some applications do not support them yet, and in some cases, because of the ephemeral nature of compute containers, it may be prudent to make a snapshot of the data available on the file system. These scenarios vary with compute environments, customer requirements, data size, and cost considerations.
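As an illustration, here is a minimal sketch of the lower-level read-session interface using the beta Python client (google-cloud-bigquery-storage, v1beta1). The project, dataset, and table names are placeholders, pyarrow is assumed to be installed for the Arrow format, and class and parameter names may differ in later releases of the client.

```python
from google.cloud import bigquery_storage_v1beta1

# Client for the BQ Storage API (beta Python client).
client = bigquery_storage_v1beta1.BigQueryStorageClient()

# Placeholder table reference -- substitute your own project/dataset/table.
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "my-project"
table_ref.dataset_id = "my_dataset"
table_ref.table_id = "my_table"

# Open a read session; data can be served in Arrow or Avro format.
session = client.create_read_session(
    table_ref,
    "projects/my-project",  # billing project
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    requested_streams=1,    # a single stream keeps the example simple
)

# Read the single stream and materialize it as a pandas DataFrame.
position = bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
reader = client.read_rows(position)
df = reader.to_dataframe(session)
print(df.shape)
```

For larger tables, requesting multiple streams and reading them in parallel is what gives the Storage API its throughput advantage.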
Options to Get Data from BQ
There are three options for getting data from BQ to the application (each is sketched in code after the list):
- Option 1 – BQ REST API: the application makes a BQ REST API call, retrieves the data, and converts it to a pandas DataFrame
- Option 2 – BQ to Disk: a tool orchestrates moving data from BQ to GCS and from GCS to persistent disk using gsutil
  - This option only lands the data on disk; the application then has to read it into memory from the local file system
  - Time taken is the sum of two components: BQ to GCS + GCS to persistent disk
- Option 3 – BQ Storage API: the application reads data into memory through the BQ Storage API, and the streams of data are then converted to a pandas DataFrame
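Each option can be outlined with the Python client libraries. The sketch below assumes google-cloud-bigquery, google-cloud-bigquery-storage (beta), and google-cloud-storage, with placeholder project, dataset, table, bucket, and path names; it substitutes the google-cloud-storage client for the gsutil copy step in Option 2 so the whole flow stays in Python, and it is an outline of each path rather than the exact benchmark code.

```python
from google.cloud import bigquery, bigquery_storage_v1beta1, storage

PROJECT = "my-project"                    # placeholder
TABLE = "my-project.my_dataset.my_table"  # placeholder
BUCKET = "my-bucket"                      # placeholder

bq_client = bigquery.Client(project=PROJECT)
query = f"SELECT * FROM `{TABLE}`"

# Option 1 -- BQ REST API: without a Storage API client, to_dataframe()
# pages through the query results using the REST API.
df_rest = bq_client.query(query).to_dataframe()

# Option 2 -- BQ to Disk: export the table to GCS, then copy the shards to
# persistent disk (the benchmark used gsutil for this second hop).
extract_job = bq_client.extract_table(
    TABLE, f"gs://{BUCKET}/export/my_table-*.csv"  # wildcard => sharded CSV export
)
extract_job.result()  # wait for BQ -> GCS to finish
gcs_client = storage.Client(project=PROJECT)
for blob in gcs_client.list_blobs(BUCKET, prefix="export/"):
    blob.download_to_filename("/mnt/disks/data/" + blob.name.split("/")[-1])

# Option 3 -- BQ Storage API: pass a Storage API client so the result is read
# over parallel streams instead of the REST API.
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()
df_storage = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)
```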
Benchmarking Performance on the BQ Storage API
Core Compete did a benchmark to evaluate which of the above options would be optimal. The following are the results of the benchmark:
Compute Environments
- For 1 GB, 2 GB, and 4 GB: machine type n1-standard-8 (8 vCPUs, 30 GB memory, 40 GB persistent disk)
- For 20 GB: machine type n1-standard-32 (32 vCPUs, 120 GB memory, 60 GB persistent disk)
Benchmark for reading into Memory:
| Option | Size of data in BQ | RAM usage | Time to pandas DataFrame (in-memory representation) |
| --- | --- | --- | --- |
| BQ REST API | 1 GB | 3.6 GB | 1703 seconds |
| BQ Storage API | 1 GB | 3.7 GB | 19 seconds |
| BQ Storage API | 2 GB | 6 GB | 32 seconds |
| BQ Storage API | 4 GB | 10 GB | 60 seconds |
| BQ Storage API | 20 GB | 43 GB | 255 seconds |
Observations:
The BQ Storage API is significantly faster than the BQ REST API. The BQ REST API (Option 1) was not run for 2, 4, and 20 GB because it was taking too long. For the 1 GB table, the BQ REST API is 88 times slower than the BQ Storage API. BQ to Disk (Option 2) is not applicable here, as it only lands the data on disk; the application would then have to read the data from disk into memory.
Memory consumption is high for both the BQ Storage API and the BQ REST API – almost twice the size of the data.
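For readers who want to reproduce the in-memory numbers, a simple harness along the following lines is sufficient; the table name is a placeholder, psutil is an assumed dependency for measuring process memory, and this is not the exact script used for the benchmark.

```python
import time

import psutil  # assumed available for process-memory measurement
from google.cloud import bigquery, bigquery_storage_v1beta1

query = "SELECT * FROM `my-project.my_dataset.my_table`"  # placeholder
bq_client = bigquery.Client()
bqstorage_client = bigquery_storage_v1beta1.BigQueryStorageClient()

process = psutil.Process()
rss_before = process.memory_info().rss
start = time.perf_counter()

# Time the BQ Storage API path; drop bqstorage_client to time the REST API path instead.
df = bq_client.query(query).to_dataframe(bqstorage_client=bqstorage_client)

elapsed = time.perf_counter() - start
rss_growth = process.memory_info().rss - rss_before
print(f"rows={len(df)} seconds={elapsed:.0f} ram_growth_gb={rss_growth / 1e9:.1f}")
```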
Benchmark for reading data from BQ and writing to Persistent Disk in CSV format:
| Option | Size of data in BQ | RAM usage | Total time to write to disk (CSV format) |
| --- | --- | --- | --- |
| BQ REST API | 1 GB | 3.6 GB | 1858 seconds |
| BQ to Disk | 1 GB | NA | 39 seconds |
| BQ Storage API | 1 GB | 3.7 GB | 175 seconds |
| BQ to Disk | 2 GB | NA | 102 seconds |
| BQ Storage API | 2 GB | 6 GB | 351 seconds |
| BQ to Disk | 4 GB | NA | 103 seconds |
| BQ Storage API | 4 GB | 10 GB | 707 seconds |
| BQ to Disk | 20 GB | NA | 117 seconds |
| BQ Storage API | 20 GB | 43 GB | 3505 seconds |
Observations:
BQ to Disk is the fastest option for writing data to disk. The BQ Storage API path writes to disk by serializing a pandas DataFrame to CSV, and that step could be made more efficient. BQ to Disk even beats the time the BQ Storage API takes just to read the data into memory. Memory consumption is again higher for the BQ Storage API – almost twice the size of the data.
It could be argued that, depending on the application reading the data, for larger data sets it is preferable to land the data on disk first and then load it into memory. Applications like SAS Viya (persistent disk) and Dataproc (GCS) can read from persistent storage into memory much faster than the Storage API delivers for larger data sets. Persistent data also provides a fallback for ephemeral compute, where data may have to be reloaded into memory.
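As a rough sketch of that land-then-load pattern, the snippet below reads the sharded CSV files produced by the BQ to Disk path (see the export sketch earlier) from persistent disk into a single pandas DataFrame; the directory and file pattern are placeholders.

```python
import glob

import pandas as pd

# Placeholder location where the BQ to Disk shards were copied from GCS.
shard_paths = sorted(glob.glob("/mnt/disks/data/my_table-*.csv"))

# Read each CSV shard and concatenate into one in-memory DataFrame.
df = pd.concat((pd.read_csv(path) for path in shard_paths), ignore_index=True)
print(df.shape)
```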
Summary
- The BQ Storage API provides a way forward for reading data when working with large data sets, with significantly faster performance than the BQ REST API.
- The BQ to Disk option provides a faster alternative for getting BQ data onto a persistent system. It is ideal for applications that have yet to integrate with the BQ Storage API. For data volumes upwards of 50 GB, BQ to Disk followed by loading the data from disk into memory could be the faster option.
- Memory consumption with the BQ Storage API is high, running to almost twice the size of the data. We also observe that BQ Storage API read times scale roughly linearly with data size.
- Data governance is a key consideration; governance patterns can be enforced across the BQ Storage API, BQ REST API, and BQ to Disk options.
About the Authors
Sumanth Yamala is a Principal Architect at Core Compete.
James Miles is a Data Engineer at Core Compete.