Flattening categories

Input

When dealing with categories it is common to encounter data that may be under such format :

idtoppings

1

salad;tomatoes;onions

2

cheese;bacon

3

null

Output

Ideally, the structure we would like to have to enable data analysis on such data format would be the following :

idtoppings

1

salad

1

tomatoes

1

onions

2

cheese

2

bacon

Example query

Let's see an example in order to achieve this result :

-- this is our fake database
WITH
  database AS (
    SELECT
      1 AS id,
      "salad;tomatoes;onions" AS toppings
    UNION ALL
    SELECT
      2 AS id,
      "cheese;bacon" AS toppings
    UNION ALL
    SELECT
      3 AS id,
      null AS toppings
  )
  
-- this is the output of the model
-- 1. we start by selecting our two columns "id" and "toppings"

-- 2. then we split the toppings using the ; delimiter character
--    this will generatate an array
--    https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#split

-- 3. Then we unnest the array, and flatten it using unnest and cross join
--    this will give us the expected output format
--    https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays
SELECT
  id,
  toppings
FROM
  database
  CROSS JOIN UNNEST(SPLIT(toppings, ";")) as toppings

Last updated