When dealing with categories it is common to encounter data that may be under such format :
id
toppings
1
salad;tomatoes;onions
2
cheese;bacon
3
null
Output
Ideally, the structure we would like to have to enable data analysis on such data format would be the following :
id
toppings
1
salad
1
tomatoes
1
onions
2
cheese
2
bacon
Example query
Let's see an example in order to achieve this result :
-- this is our fake database
WITH
database AS (
SELECT
1 AS id,
"salad;tomatoes;onions" AS toppings
UNION ALL
SELECT
2 AS id,
"cheese;bacon" AS toppings
UNION ALL
SELECT
3 AS id,
null AS toppings
)
-- this is the output of the model
-- 1. we start by selecting our two columns "id" and "toppings"
-- 2. then we split the toppings using the ; delimiter character
-- this will generatate an array
-- https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#split
-- 3. Then we unnest the array, and flatten it using unnest and cross join
-- this will give us the expected output format
-- https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#flattening_arrays
SELECT
id,
toppings
FROM
database
CROSS JOIN UNNEST(SPLIT(toppings, ";")) as toppings
with
database as (
select
1 as id,
'salad;tomatoes;onions' as toppings
union all
select
2 as id,
'cheese;bacon' as toppings
union all
select
3 as id,
null as toppings
)
select d.id, f.value::string as toppings
from database d,
lateral flatten
(input=>split(d.toppings, ';')) f