Simplifying the Complex: BigQuery Recursive CTEs for Data Transformation

·

3 min read

Business Intelligence Analysts juggle a diverse range of datasets. Some reside at the granular unit level, while others are pre-aggregated at the item or order level. Aggregating data upwards (unit -> item -> order) is a breeze. But when it comes to disaggregating data downwards (order -> item -> unit), things can get tricky.

This article dives into BigQuery's powerful tool for tackling this challenge: Recursive Common Table Expressions (CTEs). We'll explore how to leverage them to disaggregate data downwards, specifically from the item level to the unit level.

Imagine you have an order items dataset containing details like order_id, sku, items_ordered, order_total, line_total. Here's a sample of how this data might look for a single order:

order_idskuitems_orderedorder_totalline_total
C852F724-25A4-41A3-AA7A-BA2E4809EBE712345-white-xs112914.5
C852F724-25A4-41A3-AA7A-BA2E4809EBE745678-black-l312948.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE790123-yellow-m112922.5
C852F724-25A4-41A3-AA7A-BA2E4809EBE713764-blue-s412944.0

As you can see, items_ordered represents the total quantity of each item ordered (aggregated data), and order_total reflects the sum of line_total for the entire order (also aggregated). Our goal is to disaggregate this data and reach the level of individual units ordered (assuming each item represents a single unit).

Disaggregation Approach:

The disaggregation process involves two key techniques:

  1. Recursive CTE: This CTE iterates until a counter (id) reaches items_ordered value, effectively creating one row for each unit within an order item.

  2. Window Function: We use the SUM(1) OVER (PARTITION BY order_id, sku) window function to calculate the unit price by dividing the line_total by the total number of items_ordered for a specific sku within the order.

WITH
  RECURSIVE 
    loops AS (
      SELECT order_id
            ,sku
            ,items_ordered
            ,order_total
            ,line_total
            ,1 AS id
      FROM sample
      UNION ALL
      SELECT order_id
            ,sku
            ,items_ordered
            ,order_total
            ,line_total
            ,id + 1
      FROM loops
      WHERE id < items_ordered),

    sample AS (
        SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7' AS order_id, '12345-white-xs' AS sku, 1 AS items_ordered, 129 AS order_total, 14.5 AS line_total
        UNION ALL SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7', '45678-black-l', 3, 129 AS order_total, 48
        UNION ALL SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7', '90123-yellow-m', 1, 129 AS order_total, 22.5
        UNION ALL SELECT 'C852F724-25A4-41A3-AA7A-BA2E4809EBE7', '13764-blue-s', 4, 129 AS order_total, 44),

    final AS (
      SELECT * EXCEPT(id, items_ordered)
            ,line_total / SUM(1) OVER (PARTITION BY order_id, sku) AS unit_total
        FROM loops
      ORDER BY 1,2)

  SELECT *
  FROM final;

Code Breakdown:

The code demonstrates these techniques through three CTEs:

  • sample CTE: This CTE defines the sample order items data.

  • loops CTE (Recursive): This CTE iterates, creating a row for each unit within an order item. It stops iterating when id reaches items_ordered.

  • final CTE: This CTE selects desired columns from loops and calculates the unit_total using the window function. It then sorts the data by order_id and sku.

Output and Conclusion:

The final output shows the disaggregated data at the unit level, including order_id, sku, order_total, line_total, and the calculated unit_total for each unit. This allows for a more granular analysis of individual items within an order.

order_idsku_idorder_totalline_totalunit_total
C852F724-25A4-41A3-AA7A-BA2E4809EBE712345-white-xs12914.514.5
C852F724-25A4-41A3-AA7A-BA2E4809EBE713764-blue-s12944.011.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE713764-blue-s12944.011.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE713764-blue-s12944.011.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE713764-blue-s12944.011.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE745678-black-l12948.016.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE745678-black-l12948.016.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE745678-black-l12948.016.0
C852F724-25A4-41A3-AA7A-BA2E4809EBE790123-yellow-m12922.522.5