Flattening categories

Input

When dealing with categories it is common to encounter data that may be under such format :

toppings

salad;tomatoes;onions

cheese;bacon

null

Output

Ideally, the structure we would like to have to enable data analysis on such data format would be the following :

toppings

salad

tomatoes

onions

cheese

bacon

Example query

Let's see an example in order to achieve this result :

-- this is our fake database
WITH
  database AS (
    SELECT
      1 AS id,
      "salad;tomatoes;onions" AS toppings
    UNION ALL
    SELECT
      2 AS id,
      "cheese;bacon" AS toppings
    UNION ALL
    SELECT
      3 AS id,
      null AS toppings
  )
  
-- this is the output of the model
-- 1. we start by selecting our two columns "id" and "toppings"

-- 2. then we split the toppings using the ; delimiter character
--    this will generatate an array
--    https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#split

-- 3. Then we unnest the array, and flatten it using unnest and cross join
--    this will give us the expected output format
--    https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays
SELECT
  id,
  toppings
FROM
  database
  CROSS JOIN UNNEST(SPLIT(toppings, ";")) as toppings

with
  database as (
    select
      1 as id,
      'salad;tomatoes;onions' as toppings
    union all
    select
      2 as id,
      'cheese;bacon' as toppings
    union all
    select
      3 as id,
      null as toppings
  )
select d.id, f.value::string as toppings
from database d,
lateral flatten
(input=>split(d.toppings, ';')) f

PreviousUseful SQL operations

Last updated 2 years ago