🧐SQL Fanout

If you're a SQL user, some of the first SQL concepts you probably learned about were joins and aggregate functions (such as COUNT and SUM). One thing that is not always taught is how these two concepts can interact and sometimes produce incorrect results. In this article we'll discuss what to look out for, the concept of a "fanout," and why it matters to SQL writers.

This is one of the reason why there is an old wisdom that "JOINs are tricky to get right", even for experience SQL writer.

Starting with a simple join

Let's start off with a simple example, where we'll join together a couple of tables. Our first table will show our customers' names and the number of visits each customer has made to our e-commerce website:

customer

Our second table will include all the orders that those customers have placed. You can see that each order is linked to the customer who placed it by the customer's ID.

order

Joining these tables together in SQL would be pretty simple:

SELECT
    *
FROM
    customer
LEFT JOIN order
USING (customer_id)

The result of that query would be this table:

Aggregate functions gone bad

Now that we have a joined table, we need to be careful about how we use aggregate functions like COUNT and SUM.

Aggregate functions on a single table

Let's consider the customer table all by itself again. If we want to know the total number of customers, we can execute a simple query like this:

SELECT
    COUNT(*) as count
FROM   customer

SQL will count up the rows in the table as follows:

customer

We'll get a count of 3, which is correct 👍

Or, if we want to know the total number of customer visits, we can execute another straightforward query like this:

SELECT
    SUM(visits) as sum
FROM
    customer

SQL will add up the number of visits in the table as follows:

customer

We'll get a result of 8, which is also correct 👍

Aggregate functions on the joined table

So far, so good. However, if we try to use the same aggregate functions on either of our joined tables, we'll start to see incorrect results.

Running a basic count on the joined table, we will no longer get the correct number of customers:

SELECT
    COUNT(*) as count
FROM
    customer
LEFT JOIN order
USING (customer_id)

SQL will count up the rows in the table as follows:

We'll get a result of 4, even though there are really only 3 customers. You can see that Jean is counted twice.

Similarly, if we try to sum the number of visits, we will no longer get the correct result:

SELECT
    SUM(visits) as sum
FROM
    customer
LEFT JOIN order
USING (customer_id)

SQL will add up the number of visits in the table as follows:

We'll get a result of 10, even though there are only 8 visits. Jean's 2 visits are added twice.

A fanout happened while you weren't looking

In the example we've been looking at, the primary table (customer) had only three rows. The "primary table" is the table that is in the FROM clause of our SQL queries. After the join, we now have 4 rows. Since the joined table has more rows than the primary table, we say that a fanout has occurred.

To avoid a fanout, we can write the join in the opposite order. So, instead of this:

SELECT *
FROM customer
LEFT JOIN order
USING (customer_id)

We would do this:

SELECT *
FROM order
LEFT JOIN customer
USING (customer_id)

Now our joined table looks like this:

The eagle-eyed observer will recognize that this is exactly like our other joined table, except for the order of the columns. That is true, and it's why most writers of SQL think of our two different joins as exactly the same.

However, when we do the join in this order, we do not have a fanout. Our original table (order) had four rows, and our joined table also has four rows, so there has been no fanout.

Herein lies a key point: To help avoid fanouts, begin your joins with the most granular table.

Fanouts, schmanouts, who cares?

In both the fanout and the non-fanout cases, we still need to worry about the accuracy of aggregate functions. However, there is a subtle difference in the type of problems we're going to see:

  • No fanout 😨 — You can trust aggregate functions on your primary table but not necessarily on your joined tables.

  • Fanout 😱 — You cannot necessarily trust aggregate functions on either your primary table or your joined tables.

To drive home this point, let's take a look at our previous examples.

Fanout example

If you'll recall from the previous example, the following join resulted in a fanout, because while the customer table only had three rows, the joined table had four rows.

SELECT *
FROM customer
LEFT JOIN order
USING (customer_id)

Since we are in a fanout situation, we cannot trust that aggregate functions will work on the primary table (customer). As we saw previously, SUM(visits) will give us a value of 10, even though only 8 visits have actually occurred.

No fanout example

When we reversed the join, we did not get a fanout, because the order table had four rows, and the joined table also had four rows.

SELECT *
FROM order
LEFT JOIN customer
USING (customer_id)

Without any fanout, we can trust that aggregate functions will work on the primary table (order).

For example, SUM(amount) will give us a value of 250.00, which is the correct amount of money collected 👍

However, we can't trust any aggregation done on the secondary table (customer), such as SUM(visits) = 10 (❌)

Two friendly joins, one frenemy join

If we want to avoid fanouts, it's important that we understand the three different types of joins.

One-to-one (friendly)

If one row of your primary table only ever matches up with one and only one row of your joined table, you have a one-to-one join. This type of join will not result in a fanout, and aggregate functions will be accurate no matter where you use them.

Example: Suppose you have a person table and a DNA table. Since only one person can be matched with one DNA record and one DNA record with one person, this is a one-to-one join.

Many-to-one / Belongs to (friendly)

If many rows of your primary table match up with the same row in your joined table, you have a many-to-one join. This type of join will also not result in a fanout, and aggregate functions will at least be accurate on the primary table.

Example: Suppose you have a order table and a customer table. Since many order can be made from a single customer, you have a many-to-one relationship.

One-to-many (frenemy)

If one row of your primary table can match up with multiple rows in your joined table, you have a one-to-many join. This type of join can result in a fanout, and aggregate functions are not necessarily accurate anywhere.

Example: Suppose you have a customer table and an order table. Since one customer can have more than one order, this is a one-to-many join.

A fanout witch hunt

Understand your join type

The first, and preferred, method to check for a fanout is to understand the type of join that is occurring. One-to-one and many-to-one joins won't ever result in a fanout. However, if you know you are in a one-to-many situation, then there will always be the risk of a fanout. Even if a fanout has not already occurred, it will be a risk in the future if new rows are added.

Count rows before and after the join

The second method you can use to check for a fanout is to query a COUNT before and after the join. The queries would look like this:

SELECT COUNT(*)
FROM   my_primary_table

SELECT    COUNT(*)
FROM      my_primary_table
LEFT JOIN my_joined_table
USING (my_join_column)

If the count increases between the two queries, we know that a fanout has occurred. Since we're looking for an increase, it's important that the second query use a LEFT JOIN. We don't want to artificially decrease the number of rows being reported just because a row in the primary table doesn't have a corresponding row in the joined table.

Unfortunately, this method cannot tell you if there is a risk of a future fanout. To know that, you need to understand the type of join that is occurring. This is lies in how defined is your business models and what is the functional relationships between your tables.

Whaly to the rescue

If you're one of the lucky folks who use Whaly, we got you covered.

Thanks to:

  • the type of the relationships (Has Many / Belongs To / One to One) that you defined when creating your models

  • to the Primary Keys that can help deduplicate the records in your tables

Whenever Whaly needs to generate a SQL query that will contains a fan-out, our SQL engine generator will avoid it by generating dynamics sub-queries that are using the DISTINCT keywords to avoid getting false results for your aggregates calculations.

This way, any JOINs that are generated through Explorations and Related Data configured inside it will be safe 🤗

Quick summary

To summarize everything we've just covered:

  • Aggregate functions like SUM and COUNT can misbehave if used against joined tables.

  • If a join has a one-to-one relationship, aggregate functions will work just fine.

  • If a join has a many-to-one relationship, aggregate functions will work on the primary table but might not work on the joined tables.

  • If a join has a one-to-many relationship, aggregate functions may not work anywhere.

  • If you're using Whaly and joining tables in your Explore, you don't need to think about any of that again 🤗 🐳

Last updated