Simplifying the Complex: BigQuery Recursive CTEs for Data Transformation
Photo by Glenn Carstens-Peters on Unsplash
Business Intelligence Analysts juggle a diverse range of datasets. Some reside at the granular unit level, while others are pre-aggregated at the item or order level. Aggregating data upwards (unit -> item -> order) is a breeze. But when it comes to disaggregating data downwards (order -> item -> unit), things can get tricky.
This article dives into BigQuery's powerful tool for tackling this challenge: Recursive Common Table Expressions (CTEs). We'll explore how to leverage them to disaggregate data downwards, specifically from the item level to the unit level.
Imagine you have an order items dataset containing details like order_id
, sku
, items_ordered
, order_total
, line_total
. Here's a sample of how this data might look for a single order:
order_id | sku | items_ordered | order_total | line_total |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 12345-white-xs | 1 | 129 | 14.5 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 45678-black-l | 3 | 129 | 48.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 90123-yellow-m | 1 | 129 | 22.5 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 13764-blue-s | 4 | 129 | 44.0 |
As you can see, items_ordered
represents the total quantity of each item ordered (aggregated data), and order_total
reflects the sum of line_total
for the entire order (also aggregated). Our goal is to disaggregate this data and reach the level of individual units ordered (assuming each item represents a single unit).
Disaggregation Approach:
The disaggregation process involves two key techniques:
Recursive CTE: This CTE iterates until a counter (
id
) reachesitems_ordered
value, effectively creating one row for each unit within an order item.Window Function: We use the
SUM(1) OVER (PARTITION BY order_id, sku)
window function to calculate the unit price by dividing theline_total
by the total number ofitems_ordered
for a specificsku
within the order.
WITH
RECURSIVE
loops AS (
SELECT order_id
,sku
,items_ordered
,order_total
,line_total
,1 AS id
FROM sample
UNION ALL
SELECT order_id
,sku
,items_ordered
,order_total
,line_total
,id + 1
FROM loops
WHERE id < items_ordered),
sample AS (
SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7' AS order_id, '12345-white-xs' AS sku, 1 AS items_ordered, 129 AS order_total, 14.5 AS line_total
UNION ALL SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7', '45678-black-l', 3, 129 AS order_total, 48
UNION ALL SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7', '90123-yellow-m', 1, 129 AS order_total, 22.5
UNION ALL SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7', '13764-blue-s', 4, 129 AS order_total, 44),
final AS (
SELECT * EXCEPT(id, items_ordered)
,line_total / SUM(1) OVER (PARTITION BY order_id, sku) AS unit_total
FROM loops
ORDER BY 1,2)
SELECT *
FROM final;
Code Breakdown:
The code demonstrates these techniques through three CTEs:
sample
CTE: This CTE defines the sample order items data.loops
CTE (Recursive): This CTE iterates, creating a row for each unit within an order item. It stops iterating whenid
reachesitems_ordered
.final
CTE: This CTE selects desired columns fromloops
and calculates theunit_total
using the window function. It then sorts the data byorder_id
andsku
.
Output and Conclusion:
The final output shows the disaggregated data at the unit level, including order_id
, sku
, order_total
, line_total
, and the calculated unit_total
for each unit. This allows for a more granular analysis of individual items within an order.
order_id | sku_id | order_total | line_total | unit_total |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 12345-white-xs | 129 | 14.5 | 14.5 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 13764-blue-s | 129 | 44.0 | 11.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 13764-blue-s | 129 | 44.0 | 11.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 13764-blue-s | 129 | 44.0 | 11.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 13764-blue-s | 129 | 44.0 | 11.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 45678-black-l | 129 | 48.0 | 16.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 45678-black-l | 129 | 48.0 | 16.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 45678-black-l | 129 | 48.0 | 16.0 |
C852F724-25A4-41A3-AA7A-BA2E4809EBE7 | 90123-yellow-m | 129 | 22.5 | 22.5 |