BigQuery Recursive CTEs for Data Transformation

·

3 min read

Business Intelligence Analysts juggle a diverse range of datasets. Some reside at the granular unit level, while others are pre-aggregated at the item or order level. Aggregating data upwards (unit -> item -> order) is a breeze. But when it comes to disaggregating data downwards (order -> item -> unit), things can get tricky.

This article dives into BigQuery's powerful tool for tackling this challenge: Recursive Common Table Expressions (CTEs). We'll explore how to leverage them to disaggregate data downwards, specifically from the item level to the unit level.

Imagine you have an order_items dataset containing details like order_id, sku, items_ordered, and line_total. Here's a sample of how this data might look for a single order:

order_idskuitems_orderedline_total
C852F72412345-white-xs114.5
C852F72445678-black-l348.0
C852F72490123-yellow-m122.5
C852F72413764-blue-s444.0

As you can see, items_ordered represents the total quantity of each item ordered (aggregated data), and line_total reflects the sum of the entire order line(also aggregated). Our goal is to disaggregate this data and reach the level of individual units ordered (assuming each item represents a single unit).

Disaggregation Approach:

The disaggregation process involves two key techniques:

  1. Recursive CTE: This CTE iterates until a counter (id) reaches items_ordered value, effectively creating one row for each unit within an order item.

  2. Window Function: We use the SUM(1) OVER (PARTITION BY order_id, sku) window function to calculate the unit price by dividing the line_total by the total number of items_ordered for a specific sku within the order.

WITH
  RECURSIVE 
    loops AS (
      SELECT order_id
            ,sku
            ,items_ordered
            ,line_total
            ,1 AS id
      FROM sample
      UNION ALL
      SELECT order_id
            ,sku
            ,items_ordered
            ,line_total
            ,id + 1
      FROM loops
      WHERE id < items_ordered),

    sample AS (
        SELECT 'C852F724' AS order_id, '12345-white-xs' AS sku, 1 AS items_ordered, 14.5 AS line_total
        UNION ALL SELECT 'C852F724', '45678-black-l', 3, 48
        UNION ALL SELECT 'C852F724', '90123-yellow-m', 1, 22.5
        UNION ALL SELECT 'C852F724', '13764-blue-s', 4, 44),

    final AS (
      SELECT * EXCEPT(id, items_ordered, line_total)
            ,line_total / SUM(1) OVER (PARTITION BY order_id, sku) AS unit_total
        FROM loops
      ORDER BY 1,2)

  SELECT *
  FROM final;

The code demonstrates the disaggregation process through three CTEs:

  • sample CTE: This CTE defines the sample order items data.

  • loops CTE (Recursive): This CTE iterates, creating a row for each unit within an order item. It stops iterating when id reaches items_ordered.

  • final CTE: This CTE selects desired columns from loops and calculates the unit_total using the window function. It then sorts the data by order_id and sku.

Output and Conclusion:

The final output shows the disaggregated data at the unit level. This allows for a more granular analysis of individual items within an order.

order_idskuunit_total
C852F72412345-white-xs14.5
C852F72413764-blue-s11.0
C852F72413764-blue-s11.0
C852F72413764-blue-s11.0
C852F72413764-blue-s11.0
C852F72445678-black-l16.0
C852F72445678-black-l16.0
C852F72445678-black-l16.0
C852F72490123-yellow-m22.5

Subscribe to our newsletter

Read articles from Decoding Data for Business directly inside your inbox. Subscribe to the newsletter, and don't miss out.