🧐SQL Fanout
Last updated
Last updated
If you're a SQL user, some of the first SQL concepts you probably learned about were joins and aggregate functions (such as COUNT
and SUM
). One thing that is not always taught is how these two concepts can interact and sometimes produce incorrect results. In this article we'll discuss what to look out for, the concept of a "fanout," and why it matters to SQL writers.
This is one of the reason why there is an old wisdom that "JOINs are tricky to get right", even for experience SQL writer.
Let's start off with a simple example, where we'll join together a couple of tables. Our first table will show our customers' names and the number of visits each customer has made to our e-commerce website:
customer
customer_id | first_name | last_name | visits |
---|---|---|---|
Our second table will include all the orders that those customers have placed. You can see that each order is linked to the customer who placed it by the customer's ID.
order
order_id | amount | customer_id |
---|---|---|
Joining these tables together in SQL would be pretty simple:
The result of that query would be this table:
Now that we have a joined table, we need to be careful about how we use aggregate functions like COUNT
and SUM
.
Let's consider the customer table all by itself again. If we want to know the total number of customers, we can execute a simple query like this:
SQL will count up the rows in the table as follows:
customer
We'll get a count of 3, which is correct 👍
Or, if we want to know the total number of customer visits, we can execute another straightforward query like this:
SQL will add up the number of visits in the table as follows:
customer
We'll get a result of 8, which is also correct 👍
So far, so good. However, if we try to use the same aggregate functions on either of our joined tables, we'll start to see incorrect results.
Running a basic count on the joined table, we will no longer get the correct number of customers:
SQL will count up the rows in the table as follows:
We'll get a result of 4, even though there are really only 3 customers. You can see that Jean is counted twice.
Similarly, if we try to sum the number of visits, we will no longer get the correct result:
SQL will add up the number of visits in the table as follows:
We'll get a result of 10, even though there are only 8 visits. Jean's 2 visits are added twice.
In the example we've been looking at, the primary table (customer) had only three rows. The "primary table" is the table that is in the FROM
clause of our SQL queries. After the join, we now have 4 rows. Since the joined table has more rows than the primary table, we say that a fanout has occurred.
To avoid a fanout, we can write the join in the opposite order. So, instead of this:
We would do this:
Now our joined table looks like this:
The eagle-eyed observer will recognize that this is exactly like our other joined table, except for the order of the columns. That is true, and it's why most writers of SQL think of our two different joins as exactly the same.
However, when we do the join in this order, we do not have a fanout. Our original table (order) had four rows, and our joined table also has four rows, so there has been no fanout.
Herein lies a key point: To help avoid fanouts, begin your joins with the most granular table.
In both the fanout and the non-fanout cases, we still need to worry about the accuracy of aggregate functions. However, there is a subtle difference in the type of problems we're going to see:
No fanout 😨 — You can trust aggregate functions on your primary table but not necessarily on your joined tables.
Fanout 😱 — You cannot necessarily trust aggregate functions on either your primary table or your joined tables.
To drive home this point, let's take a look at our previous examples.
If you'll recall from the previous example, the following join resulted in a fanout, because while the customer table only had three rows, the joined table had four rows.
Since we are in a fanout situation, we cannot trust that aggregate functions will work on the primary table (customer). As we saw previously, SUM(visits)
will give us a value of 10, even though only 8 visits have actually occurred.
When we reversed the join, we did not get a fanout, because the order table had four rows, and the joined table also had four rows.
Without any fanout, we can trust that aggregate functions will work on the primary table (order).
For example, SUM(amount)
will give us a value of 250.00, which is the correct amount of money collected 👍
However, we can't trust any aggregation done on the secondary table (customer), such as SUM(visits)
= 10 (❌)
If we want to avoid fanouts, it's important that we understand the three different types of joins.
If one row of your primary table only ever matches up with one and only one row of your joined table, you have a one-to-one join. This type of join will not result in a fanout, and aggregate functions will be accurate no matter where you use them.
Example: Suppose you have a person table and a DNA table. Since only one person can be matched with one DNA record and one DNA record with one person, this is a one-to-one join.
If many rows of your primary table match up with the same row in your joined table, you have a many-to-one join. This type of join will also not result in a fanout, and aggregate functions will at least be accurate on the primary table.
Example: Suppose you have a order table and a customer table. Since many order can be made from a single customer, you have a many-to-one relationship.
If one row of your primary table can match up with multiple rows in your joined table, you have a one-to-many join. This type of join can result in a fanout, and aggregate functions are not necessarily accurate anywhere.
Example: Suppose you have a customer table and an order table. Since one customer can have more than one order, this is a one-to-many join.
The first, and preferred, method to check for a fanout is to understand the type of join that is occurring. One-to-one and many-to-one joins won't ever result in a fanout. However, if you know you are in a one-to-many situation, then there will always be the risk of a fanout. Even if a fanout has not already occurred, it will be a risk in the future if new rows are added.
The second method you can use to check for a fanout is to query a COUNT
before and after the join. The queries would look like this:
If the count increases between the two queries, we know that a fanout has occurred. Since we're looking for an increase, it's important that the second query use a LEFT JOIN
. We don't want to artificially decrease the number of rows being reported just because a row in the primary table doesn't have a corresponding row in the joined table.
Unfortunately, this method cannot tell you if there is a risk of a future fanout. To know that, you need to understand the type of join that is occurring. This is lies in how defined is your business models and what is the functional relationships between your tables.
If you're one of the lucky folks who use Whaly, we got you covered.
Thanks to:
the type of the relationships (Has Many / Belongs To / One to One) that you defined when creating your models
to the Primary Keys that can help deduplicate the records in your tables
Whenever Whaly needs to generate a SQL query that will contains a fan-out, our SQL engine generator will avoid it by generating dynamics sub-queries that are using the DISTINCT keywords to avoid getting false results for your aggregates calculations.
This way, any JOINs that are generated through Explorations and Related Data configured inside it will be safe 🤗
To summarize everything we've just covered:
Aggregate functions like SUM
and COUNT
can misbehave if used against joined tables.
If a join has a one-to-one relationship, aggregate functions will work just fine.
If a join has a many-to-one relationship, aggregate functions will work on the primary table but might not work on the joined tables.
If a join has a one-to-many relationship, aggregate functions may not work anywhere.
If you're using Whaly and joining tables in your Explore, you don't need to think about any of that again 🤗 🐳
customer_id | first_name | last_name | visits | order_id | amount |
---|---|---|---|---|---|
customer_id | visits | sum | first_name | last_name |
---|---|---|---|---|
customer_id | visits | sum | first_name | last_name | order_id | amount | customer_id |
---|---|---|---|---|---|---|---|
order_id | amount | customer_id | first_name | last_name | visits |
---|---|---|---|---|---|
customer_id | first_name | last_name | visits | order_id | amount |
---|---|---|---|---|---|
order_id | amount | customer_id | first_name | last_name | visits |
---|---|---|---|---|---|
1
Jean
de la Fontaine
2
2
Émile
Zola
2
3
René
Descartes
4
1
25.00
1
2
50.00
1
3
75.00
2
4
100.00
3
1
Jean
de la Fontaine
2
1
25.00
1
Jean
de la Fontaine
2
2
50.00
2
Émile
Zola
2
3
75.00
3
René
Descartes
4
4
100.00
count
customer_id
first_name
last_name
visits
+ 1
1
Jean
de la Fontaine
2
+ 1
2
Émile
Zola
2
+ 1
3
René
Descartes
4
Results ⤵️
count
3 (✅)
1
2
+ 2
Jean
de la Fontaine
2
2
+ 2
Émile
Zola
3
4
+ 4
René
Descartes
Results ⤵️
sum
8 (✅)
customer_id
first_name
last_name
visits
order_id
amount
customer_id
+ 1
1
Jean
de la Fontaine
2
1
25.00
1
+ 1
1
Jean
de la Fontaine
2
2
50.00
1
+ 1
2
Émile
Zola
2
3
75.00
2
+ 1
3
René
Descartes
4
4
100.00
3
Results ⤵️
count
4 (❌)
1
2
+ 2
Jean
de la Fontaine
1
25.00
1
1
2
+ 2
Jean
de la Fontaine
2
50.00
1
2
2
+ 2
Émile
Zola
3
75.00
2
3
4
+ 4
René
Descartes
4
100.00
3
Results ⤵️
SUM
10 (❌)
1
25.00
1
Jean
de la Fontaine
2
2
50.00
1
Jean
de la Fontaine
2
3
75.00
2
Émile
Zola
2
4
100.00
3
René
Descartes
4
1
Jean
de la Fontaine
2
1
25.00
1
Jean
de la Fontaine
2
2
50.00
2
Émile
Zola
2
3
75.00
3
René
Descartes
4
4
100.00
1
25.00
1
Jean
de la Fontaine
2
2
50.00
1
Jean
de la Fontaine
2
3
75.00
2
Émile
Zola
2
4
100.00
3
René
Descartes
4