A Fast Approach to Building Pivot Table / Transpose Functionality into BigQuery
Transposing or pivoting a table is a common operation in data analysis, however, BigQuery does not have this as a built-in capability. The most popular approach on StackOverflow is sub-optimal and doesn’t scale well on big data. To remove this inefficient, time consuming obstacle, this article explains a faster approach – “flash pivot” – to building pivot table functionality into BQ.
Pivoting a table means to convert all unique rows of a column into separate columns of their own. This is also referred as transposing a column or converting from long format to wide format. In some use cases like time series analysis, this process facilitates selection of only interesting columns and correlating them. Many data processing libraries provide built-in functionality for pivoting or transposing: Pandas has pivot_table, and Spark has pivot.
Optimizing The Approach
Consider the following example of pivoting a table. Below, a Source: “long_table” is being transformed into Target: “wide_table” on column “class”. We’ll build upon this example to demonstrate the pivot solutions on BigQuery.
The generally accepted approach on the internet is to group by the id column and aggregate over the value column passing a condition check for the “class” column. For each column in the wide table, the query runs an “if condition” on the pivot column of long_table and then aggregates the value column grouped by the id. The sample query uses the ANY_VALUE aggregation function, and MAX, AVG, Sum, etc. can be used instead if required.
The problem is that this approach gets very slow when dealing with large amounts of data and can take hours to execute on datasets of Terabyte scale.
Fast Approach: The Flash Pivot!
Since the major reason for the poor performance with the first is the “if condition”, we need to look for a different approach that is more efficient. In this approach, we create two intermediate tables, a wide_ranked table and a long_array_aggregated table, as shown below.
First, to create the wide_ranked table, we select all distinct values from the class column of long_table and rank them. Once ranked, we create the wide format like the query shown in Approach 1. Since the amount of data here is way smaller, this part gets done fast. We use a dummy column `groupby_only_col` to group the ranked classes. Note that the dummy column is not in the final select clause.
Next, to create the long_array_aggregated table, we first create a cross join between the distinct id and distinct classes to create all possible combinations of id and classes and then rank the classes. Next, we do a right join with the long_table on the id to identify all the missing class for every id. And finally, we group the results by id, and aggregate the values into array ordered by the ranked classes.
Now, let’s look at how this has helped. By creating these two intermediate tables, we have one table with arrays of value grouped by id, and other table in wide format that can be used to access specific indices of the value array. The final step is to do a join between these two tables by accessing the correct index of values array for every column.
The wide_ranked subquery, and the final select statement can be dynamically generated using either a programming language, or a legacy sql function group_concat_unquoted
By breaking up the workflow in BigQuery into these pivot tables, “Flash pivot” performs 10 times faster than other approach on a dataset of size ~1TB (2 Billion rows) while pivoting a column into 1500 new columns. Try it out on your datasets and unleash the “flash” pivot!
About the Authors
Ankur Saxena is a GCP certified Professional Data Engineer, and Professional Cloud Architect who specializes in building and orchestrating big data ETL pipelines on cloud. He is passionate about optimizing data processing using Pandas, Spark and SQL. He is an anime enthusiast and likes playing Tennis.