9 Joining Tables

So far, every query you’ve written has drawn from a single table. That’s fine for questions like “what products are low on stock?” But most real business questions can’t be answered from one table alone. “Which suppliers deliver the products we’re running low on?” requires information from both the products table and the suppliers table. “What are our best-selling product categories?” needs order_details, products, and categories. The data you need is spread across multiple tables, and joining them is how you bring it together.

This chapter explains why data is split across tables in the first place, introduces the concept of keys that link tables together, and teaches you the SQL JOIN syntax to combine tables in your queries. By the end, you’ll be writing multi-table queries that traverse the Northwind supply chain from supplier to customer.

9.1 Why Data Lives in Multiple Tables

It might seem simpler to put everything in one giant table: every order, with the customer’s full address, the product name and description, the supplier’s phone number, the shipper’s details, all in a single row. Some spreadsheet-based workflows do exactly this. But this approach creates serious problems as data grows.

Imagine an order that includes three products. In a single-table design, you’d repeat the customer’s name, address, and phone number three times, once per line item. If that customer places 200 orders over the year, their information is duplicated hundreds of times. Now multiply that by thousands of customers and you’ve wasted enormous storage on redundant data. Worse, if the customer moves and you need to update their address, you have to find and update every row that mentions them. Miss one, and now your database disagrees with itself about where that customer lives.

Relational databases solve this by splitting data into separate tables organized around distinct entities. Customers live in the customers table, one row per customer. Orders live in the orders table, one row per order. Products, suppliers, shippers, each gets its own table. Instead of duplicating customer information in every order row, the orders table stores only a customer_id that references the full customer record in the customers table. This reference system eliminates redundancy: the customer’s address exists in exactly one place, and updating it there updates it everywhere.

The trade-off is that answering multi-entity questions requires combining tables, which is precisely what joins do.

9.2 Keys: The Links Between Tables

Two concepts make the reference system work.

A primary key is a column (or combination of columns) that uniquely identifies each row in a table. In the products table, product_id is the primary key: every product has a different ID, and that ID never changes. Primary keys guarantee that you can always point to exactly one row without ambiguity.

A foreign key is a column in one table that references the primary key of another table. In the products table, supplier_id is a foreign key that references supplier_id in the suppliers table. It says “this product is supplied by the company in row X of the suppliers table.” Foreign keys are the links that connect your tables into a web of relationships.

Here’s a concrete example. The products table might contain:

product_id	product_name	supplier_id	category_id	unit_price
1	Chai	1	1	18.00
2	Chang	1	1	19.00
3	Aniseed Syrup	1	2	10.00

And the suppliers table:

supplier_id	company_name	country
1	Exotic Liquids	UK
2	New Orleans Cajun	USA

The supplier_id column in products connects to the supplier_id column in suppliers. Products 1, 2, and 3 all have supplier_id = 1, meaning they’re all supplied by “Exotic Liquids” in the UK. Notice how the supplier’s name and country appear once in the suppliers table rather than being repeated in every product row.

9.2.1 Relationship Types

The connections between tables follow a few common patterns that are worth naming, because recognizing them helps you predict how joins will behave.

A one-to-many relationship is the most common. One supplier has many products. One customer has many orders. One order has many line items. The “one” side has the primary key, and the “many” side has the foreign key. When you join these tables, each row on the “one” side can match multiple rows on the “many” side.

A one-to-one relationship means each row in one table matches exactly one row in another. These are less common, but you see them when a table is split for organizational reasons (like separating sensitive employee data from general employee information).

A many-to-many relationship means entities on both sides can relate to multiple entities on the other. Consider students and courses: each student takes many courses, and each course has many students. You can’t represent this with a single foreign key on either side. Instead, you need a junction table (also called a bridge table or associative table) that sits between them. The Northwind order_details table is exactly this: it connects orders to products, because one order can contain many products, and one product can appear in many orders. Each row in order_details represents one specific product within one specific order, with foreign keys to both tables.

output

orders ──┐
         ├── order_details ──┐
products ┘                   │
                             └── Each row: one product in one order

Understanding these patterns helps you predict how many rows a join will produce, which, as you’ll see later in this chapter, is one of the most important things to get right.

9.2.2 Exercises

Look at the Northwind tables: categories, products, suppliers, orders, order_details, customers. For each pair listed below, identify which is the “one” side and which is the “many” side, and name the foreign key column that connects them. (a) categories and products. (b) customers and orders. (c) orders and order_details.
Why can’t the relationship between orders and products be represented with a single foreign key on either table? What would go wrong if you added a product_id column to the orders table?
Imagine Northwind adds an employee_territories table where each row links an employee to a territory they cover. An employee can cover many territories, and a territory can be covered by many employees. What kind of relationship is this, and what would the junction table’s columns look like?

Solutions

1. (a) categories=one, products=many, FK: category_id in products. (b) customers=one, orders=many, FK: customer_id in orders. (c) orders=one, order_details=many, FK: order_id in order_details.

2. Many-to-many: one order has many products, one product appears in many orders. Adding product_id to orders would limit each order to one product. The junction table order_details resolves this.

3. Many-to-many relationship. Junction table employee_territories would have: employee_id (FK to employees), territory_id (FK to territories), and optionally a composite primary key of both.

9.3 INNER JOIN: Matching Rows

The INNER JOIN is the most common join type and the one you’ll use most often. It combines rows from two tables where the join condition is met. Rows that don’t have a match in both tables are excluded from the result.

Let’s answer the question: “What are the names and suppliers of our products?”

duckdb

SELECT
    p.product_name,
    p.unit_price,
    s.company_name AS supplier,
    s.country AS supplier_country
FROM products AS p
INNER JOIN suppliers AS s
    ON p.supplier_id = s.supplier_id;

Several things are happening here that deserve attention.

The FROM products AS p establishes products as the first table and gives it the alias p. The INNER JOIN suppliers AS s brings in the second table with alias s. Table aliases keep your queries readable when working with multiple tables, because writing p.product_name is much cleaner than products.product_name, especially as queries grow.

The ON p.supplier_id = s.supplier_id is the join condition: it tells the database which rows to match. For each product, the database finds the supplier row where the supplier IDs match and combines them into a single result row.

The SELECT clause uses the table aliases to specify which table each column comes from. This is necessary when two tables have columns with the same name (both tables have supplier_id), but it’s good practice even when names are unique, because it documents where each piece of data originates.

9.3.1 How INNER JOIN Works

Think of an INNER JOIN as a handshake between two tables. Every row in the left table extends its hand (the join column value), and every row in the right table extends its hand. Only when two hands match (the values are equal) do those rows combine into a result row.

If a product has supplier_id = 99 but no supplier with that ID exists, that product is excluded from the results. Similarly, if a supplier exists but no products reference it, that supplier doesn’t appear. An INNER JOIN only returns the intersection: rows that match on both sides.

output

Products Table          Suppliers Table
supplier_id: 1  ───────► supplier_id: 1  ✓ Match! Combined in result
supplier_id: 1  ───────► supplier_id: 1  ✓ Match! Combined in result
supplier_id: 2  ───────► supplier_id: 2  ✓ Match! Combined in result
supplier_id: 99 ───────► (no match)     ✗ Product excluded
                        supplier_id: 5  ✗ Supplier excluded (no products)

9.3.2 Exercises

Write a query that joins products to categories and returns the product name, unit price, and category name. Sort by category name, then by price descending within each category.
The products table has 250 rows. If you INNER JOIN products to suppliers, would you expect the result to have exactly 250 rows, fewer than 250, or more than 250? Explain your reasoning, then run the query and check. (Hint: think about whether every product has a valid supplier and whether any supplier maps to multiple products.)
Write a query that answers: “Which products does each supplier provide, and how much are they worth in total inventory?” Join products to suppliers, compute unit_price * units_in_stock as inventory_value, and sort by supplier name.

Solutions

SELECT p.product_name, p.unit_price, c.category_name
FROM products AS p
INNER JOIN categories AS c ON p.category_id = c.category_id
ORDER BY c.category_name, p.unit_price DESC;

2. Exactly 250 rows,every product has a valid supplier_id, so all match. Each product matches exactly one supplier (many-to-one), so no row multiplication.

SELECT s.company_name AS supplier, p.product_name, p.unit_price, p.units_in_stock,
    p.unit_price * p.units_in_stock AS inventory_value
FROM products AS p
INNER JOIN suppliers AS s ON p.supplier_id = s.supplier_id
ORDER BY s.company_name, p.product_name;

9.4 LEFT JOIN: Keeping Everything from One Side

Sometimes you want to keep all rows from one table even if there’s no match in the other. The LEFT JOIN (also written LEFT OUTER JOIN) returns all rows from the left table and matching rows from the right table. Where there’s no match, the right table’s columns are filled with NULL.

This is essential for questions like “show me all customers and their orders, including customers who haven’t placed any orders yet”:

duckdb

SELECT
    c.company_name,
    c.country,
    o.order_id,
    o.order_date
FROM customers AS c
LEFT JOIN orders AS o
    ON c.customer_id = o.customer_id
ORDER BY o.order_date;

Customers with orders appear with their order information populated. Customers without orders still appear in the results, but order_id and order_date are NULL for those rows.

This distinction between INNER JOIN and LEFT JOIN matters more than it might initially seem. Using INNER JOIN when you should have used LEFT JOIN silently drops rows from your results. If you’re counting total customers and some have no orders, an INNER JOIN through the orders table will give you the wrong count. Building the habit of asking “do I want to keep unmatched rows?” before choosing your join type will prevent subtle data loss bugs.

When the Data Doesn’t Show the Difference

In the Northwind dataset, every customer has placed at least one order, so INNER JOIN and LEFT JOIN between customers and orders return the same rows. Run both and compare the counts to verify this for yourself. This is actually common in well-maintained production databases, but it doesn’t mean LEFT JOIN is unnecessary. Real-world data is messier: new customers exist before their first order, data migrations leave gaps, and upstream systems don’t always enforce referential integrity. Always choose your join type based on what should happen logically, not on what the current data happens to contain. The exercises below ask you to use the LEFT JOIN plus IS NULL pattern anyway, because the pattern matters even when the current result is empty.

9.4.1 Finding Unmatched Rows

A practical pattern combines LEFT JOIN with a NULL check to find rows that don’t have a match:

duckdb

-- Customers who have never placed an order
SELECT
    c.company_name,
    c.country
FROM customers AS c
LEFT JOIN orders AS o
    ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

This works because unmatched rows have NULL in all columns from the right table. Filtering for NULL in the right table’s primary key gives you exactly the “orphaned” rows from the left table.

9.5 Other Join Types

While INNER JOIN and LEFT JOIN cover the vast majority of analytical work, three other join types exist and are worth understanding.

RIGHT JOIN is the mirror of LEFT JOIN: it keeps all rows from the right table and matches from the left. In practice, you can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order, and most analysts prefer LEFT JOIN for consistency.

FULL OUTER JOIN keeps all rows from both tables, filling NULL where there’s no match on either side. This is useful for reconciliation queries (“show me everything from both systems, highlighting what’s only in one”), but it’s rare in everyday analytical work.

CROSS JOIN produces every possible combination of rows from both tables. If one table has 10 rows and the other has 50, the result has 500 rows. This is occasionally useful (generating all possible product-store combinations, for example) but dangerous if done accidentally on large tables.

duckdb

-- CROSS JOIN example: every category paired with every shipper
-- Useful for building a "coverage matrix"
SELECT
    c.category_name,
    s.company_name AS shipper
FROM categories AS c
CROSS JOIN shippers AS s;

Joins Are Not Set Operations

You’ll see joins explained with Venn diagrams everywhere online. While they provide a quick visual, they’re technically misleading, because joins are row-matching operations that create new combined rows, not set intersections. A single customer row can match many order rows, producing many combined rows, something a Venn diagram can’t represent. A cross join between 4 rows and 4 rows produces 16 rows, which is clearly not a set overlap.

If you want to think in terms of Venn diagrams, save that mental model for actual set operations (UNION, INTERSECT, EXCEPT) from the previous chapter, where both sides represent complete rows of the same structure and the operations truly combine, intersect, or subtract those rows. For joins, think “row matching” instead.

9.5.1 Exercises

Write a query that finds all customers who have never placed an order. Return the company name and country. How many are there?

This query is supposed to count how many orders each customer has placed, including customers with zero orders. Find the bug:

duckdb

SELECT
    c.company_name,
    COUNT(*) AS order_count
FROM customers AS c
LEFT JOIN orders AS o
    ON c.customer_id = o.customer_id
GROUP BY c.company_name
ORDER BY order_count;

(Hint: what does COUNT(*) count for customers with no orders? What should you count instead?)

What would a CROSS JOIN between categories (15 rows) and shippers (8 rows) produce? How many result rows? When would this “every combination” behavior be useful in a business context?
Explain in your own words why INNER JOIN and LEFT JOIN are not set operations, even though online tutorials often draw them as Venn diagrams. Use the one-to-many relationship between customers and orders as your example.

Solutions

SELECT c.company_name, c.country
FROM customers AS c
LEFT JOIN orders AS o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

Returns 0 rows in Northwind (all customers have orders), but the pattern is correct.

2. Bug: COUNT(*) counts all rows including the LEFT JOIN row with NULLs. A customer with 0 orders would show count=1. Fix: use COUNT(o.order_id) which skips NULLs.

3. CROSS JOIN produces 15 × 8 = 120 rows. Useful for building coverage matrices (e.g., which category-shipper combinations have been used).

4. Joins are row-matching operations that CREATE new combined rows. If customer “ABC” has 15 orders, INNER JOIN produces 15 result rows,a Venn diagram can’t show this multiplication. Set operations combine complete rows without multiplication.

9.6 Multi-Table Queries

Real analytical questions often require more than two tables. The Northwind data model connects suppliers to products, products to order details, order details to orders, and orders to customers. Traversing this chain lets you answer questions that span the entire supply chain.

“What products did each customer order, and who supplies them?”

duckdb

SELECT
    c.company_name AS customer,
    p.product_name AS product,
    s.company_name AS supplier,
    od.quantity,
    od.unit_price * od.quantity AS line_total
FROM customers AS c
INNER JOIN orders AS o
    ON c.customer_id = o.customer_id
INNER JOIN order_details AS od
    ON o.order_id = od.order_id
INNER JOIN products AS p
    ON od.product_id = p.product_id
INNER JOIN suppliers AS s
    ON p.supplier_id = s.supplier_id
ORDER BY c.company_name, p.product_name;

This query chains four joins to connect five tables. Read it from the FROM clause down: start with customers, join to their orders, join to the line items in those orders, join to the product details, join to the supplier. Each join adds one more piece of context to the result.

Notice that order_details is the junction table connecting orders to products. Without it, there would be no way to know which products are in which orders. The join path orders → order_details → products traverses a many-to-many relationship by going through the junction table one step at a time.

When writing multi-table queries, building incrementally is the key. Don’t write all four joins at once. Start with two tables and verify the results make sense, then add the third, then the fourth. This approach catches mistakes early and builds your confidence in the data.

Building Queries Incrementally

Write multi-table queries like this:

Start with FROM table1 and check the row count.
Add one join: INNER JOIN table2 ON .... Check the row count. Did it increase? Decrease? Is that expected?
Add the next join. Check again.
Only add WHERE, ORDER BY, and the final SELECT columns after you’ve verified the joins produce the right foundation.

If your row count unexpectedly explodes, you likely have a many-to-many join (one row matching multiple rows on the other side). If it drops, you might need a LEFT JOIN instead of INNER JOIN.

9.7 Row Multiplication: A Common Trap

Understanding how joins affect row counts is one of the most important, and most overlooked, aspects of SQL.

An INNER JOIN between two tables with a one-to-many relationship (one customer, many orders) produces one result row for each match. If customer “ABC” has 15 orders, that customer appears in 15 result rows. This is correct and expected: each row represents a different order placed by the same customer.

But if you then aggregate without accounting for this multiplication, you get wrong answers. Consider this tempting but incorrect query:

duckdb

-- WRONG: customer info is repeated per order, inflating the count
SELECT
    c.company_name,
    c.country,
    COUNT(*) AS order_count
FROM customers AS c
INNER JOIN orders AS o
    ON c.customer_id = o.customer_id
INNER JOIN order_details AS od
    ON o.order_id = od.order_id
GROUP BY c.company_name, c.country;

This counts order detail lines, not orders, because each order can have many line items. If you wanted to count orders, you’d need COUNT(DISTINCT o.order_id) instead of COUNT(*).

The lesson: whenever you join tables, think about the grain of your result set. Each join can change the level of detail each row represents. A join from orders to order details changes your grain from one-row-per-order to one-row-per-line-item.

Check Your Row Count After Every Join

In professional work, the habit of checking your row count before and after every join will save you more debugging time than any other practice. If you started with 200,000 orders and after joining to order_details you have 631,000 rows, that’s expected,you’ve moved from order grain to line-item grain. But if you expected 200,000 and got 200,000, make sure you didn’t accidentally drop rows with an INNER JOIN where a LEFT JOIN was needed. Running SELECT COUNT(*) at each stage of a multi-table query takes seconds and prevents hours of debugging downstream.

9.8 Self-Joins

A table can be joined to itself. This sounds odd, but it’s useful when rows within the same table have relationships to each other. The employees table in Northwind includes a reports_to column that references another employee’s ID, representing the management hierarchy.

duckdb

-- Who reports to whom?
SELECT
    e.employee_name AS employee,
    e.title AS employee_title,
    m.employee_name AS manager,
    m.title AS manager_title
FROM employees AS e
LEFT JOIN employees AS m
    ON e.reports_to = m.employee_id;

The trick is giving the same table two different aliases (e for employees, m for managers) so the database can distinguish between the two roles. The LEFT JOIN ensures the top-level manager (who reports to nobody) still appears, with NULL in the manager columns.

9.8.1 Exercises

Using the self-join on employees, find the top-level manager (the person whose reports_to is NULL). Then write a query that counts how many direct reports each manager has.
Why does the self-join example use LEFT JOIN instead of INNER JOIN? What would change in the results if you switched to INNER JOIN?

Solutions

-- Top-level manager
SELECT e.employee_name, e.title FROM employees AS e WHERE e.reports_to IS NULL;
-- Direct reports per manager
SELECT m.employee_name AS manager, COUNT(*) AS direct_reports
FROM employees AS e
INNER JOIN employees AS m ON e.reports_to = m.employee_id
GROUP BY m.employee_name
ORDER BY direct_reports DESC;

2. LEFT JOIN ensures the top-level manager (reports_to IS NULL) appears in results with NULL in manager columns. INNER JOIN would exclude them because there’s no matching manager row.

9.9 Chapter Exercises

These exercises require you to combine join concepts to answer business questions that span the Northwind data model. Build each query incrementally, verifying row counts at each step.

Supplier performance. Write a query that shows each supplier’s company name, country, the number of distinct products they supply, and the average unit price of those products. Sort by product count descending. Which supplier offers the most products?
The full supply chain. Write a query that traces the complete path from supplier to customer for orders placed in July 2024. Include the supplier name, product name, customer name, order date, quantity, and line total (unit_price * quantity from order_details). This requires joining five tables. Build it one join at a time.
Row count prediction. Before running each query, predict how many rows the result will have, then verify. (a) INNER JOIN between customers (5,000 rows) and orders (200,000 rows). (b) LEFT JOIN between customers (5,000 rows) and orders (200,000 rows). (c) CROSS JOIN between categories (15 rows) and suppliers (60 rows). Explain any differences between (a) and (b).
Finding orphans. Write a query that identifies any products whose supplier_id doesn’t match any row in the suppliers table. Then write a second query that finds any suppliers who don’t have any products listed. What do these “orphan” checks tell you about data quality?

Grain awareness. A colleague writes this query to compute total revenue per customer:

duckdb

SELECT
    c.company_name,
    SUM(od.unit_price * od.quantity) AS total_revenue
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY c.company_name
ORDER BY total_revenue DESC;

Is this query correct? Explain why or why not by tracing the grain at each join step. Then modify it to also show the number of distinct orders each customer placed (not the number of line items).

Category revenue report. Write a query that answers: “What is the total revenue for each product category?” You’ll need to join categories, products, and order_details. Then extend it to show the percentage of total company revenue that each category represents. (Hint: you can compute the overall total using a subquery.)
Debug this join. A colleague wrote this query to find “customers who have never placed an order,” but it returns zero rows. They conclude that every customer has ordered. Their conclusion happens to be correct for Northwind, but their SQL has two bugs that would cause it to return zero rows regardless of the data, even if thousands of customers had never ordered. Find both bugs and write a corrected version.
```
duckdb
```
```
SELECT c.company_name
FROM customers AS c
LEFT JOIN orders AS o ON c.customer_id = o.customer_id
WHERE o.order_id != NULL;
```

Join type prediction. Without running them, predict the row count for each query below. Then run them and explain any discrepancies between your prediction and the actual result.

duckdb

-- Query A
SELECT COUNT(*)
FROM orders AS o
INNER JOIN employees AS e ON o.employee_id = e.employee_id;

-- Query B
SELECT COUNT(*)
FROM employees AS e
LEFT JOIN orders AS o ON e.employee_id = o.employee_id;

-- Query C
SELECT COUNT(*)
FROM employees AS e
LEFT JOIN orders AS o ON e.employee_id = o.employee_id
WHERE o.order_id IS NULL;

Explain what business question Query C answers and why it’s a useful pattern.

Reconciliation with FULL OUTER JOIN. Imagine you’re comparing two lists: products that were ordered in June 2024 versus products that were shipped in June 2024. These lists won’t perfectly overlap, since some orders placed in June might not ship until July, and some shipments in June might fulfill orders placed in May. Build this analysis in three steps: (a) Write a subquery that finds all distinct product_id values from order_details joined to orders where order_date falls in June 2024. (b) Write a second subquery that finds all distinct product_id values from order_details joined to orders where shipped_date falls in June 2024. (c) FULL OUTER JOIN these two subqueries on product_id and use CASE to classify each product as 'Ordered Only' (ordered but not shipped that month), 'Shipped Only' (shipped but not ordered that month), or 'Both'. How many products fall into each category? Explain why INNER JOIN wouldn’t work for this analysis.

Solutions

SELECT s.company_name, s.country, COUNT(DISTINCT p.product_id) AS product_count,
    ROUND(AVG(p.unit_price), 2) AS avg_price
FROM suppliers AS s
INNER JOIN products AS p ON s.supplier_id = p.supplier_id
GROUP BY s.company_name, s.country
ORDER BY product_count DESC;

SELECT s.company_name AS supplier, p.product_name, c.company_name AS customer,
    o.order_date, od.quantity, od.unit_price * od.quantity AS line_total
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
INNER JOIN products AS p ON od.product_id = p.product_id
INNER JOIN suppliers AS s ON p.supplier_id = s.supplier_id
WHERE o.order_date >= '2024-07-01' AND o.order_date < '2024-08-01'
ORDER BY c.company_name, p.product_name;

3. (a) ~200,000 (all orders match a customer). (b) ~200,000 (same,all customers have orders, so LEFT JOIN adds no extra rows). (c) 900 (15 × 60). (a) and (b) are identical because there are no orphaned customers.

-- Products without valid supplier
SELECT p.product_name, p.supplier_id FROM products AS p
LEFT JOIN suppliers AS s ON p.supplier_id = s.supplier_id WHERE s.supplier_id IS NULL;
-- Suppliers without products
SELECT s.company_name FROM suppliers AS s
LEFT JOIN products AS p ON s.supplier_id = p.supplier_id WHERE p.product_id IS NULL;

Both return 0 rows in Northwind,data is clean.

5. The query is correct. Grain trace: customers (1 row/customer) → orders (1 row/order) → order_details (1 row/line item). SUM aggregates line item totals correctly. To add distinct order count: COUNT(DISTINCT o.order_id) AS order_count.

SELECT cat.category_name, ROUND(SUM(od.unit_price * od.quantity), 2) AS total_revenue,
    ROUND(100.0 * SUM(od.unit_price * od.quantity) / SUM(SUM(od.unit_price * od.quantity)) OVER (), 1) AS pct_of_total
FROM categories AS cat
INNER JOIN products AS p ON cat.category_id = p.category_id
INNER JOIN order_details AS od ON p.product_id = od.product_id
GROUP BY cat.category_name
ORDER BY total_revenue DESC;

7. Two bugs: (a) != NULL is invalid,NULL != NULL evaluates to NULL, not TRUE, so WHERE filters out every row regardless of the data. NULL comparisons must use IS NULL / IS NOT NULL. (b) Even if you fix the syntax to IS NOT NULL, the logic is inverted,IS NOT NULL finds customers with orders, not without. To find customers who have never ordered, use WHERE o.order_id IS NULL. Correct version: SELECT c.company_name FROM customers AS c LEFT JOIN orders AS o ON c.customer_id = o.customer_id WHERE o.order_id IS NULL; (Returns 0 rows in Northwind because every customer has placed at least one order, but the pattern is correct.)

8. Query A: ~200,000 (all orders have valid employee_id values). Query B: ~200,000 (same, plus any orderless employees get 1 row each). Query C: employees with zero orders. This pattern (“LEFT JOIN + IS NULL”) finds unmatched rows on the left side.

WITH ordered AS (
    SELECT DISTINCT od.product_id
    FROM order_details AS od
    INNER JOIN orders AS o ON od.order_id = o.order_id
    WHERE o.order_date >= '2024-06-01' AND o.order_date < '2024-07-01'
),
shipped AS (
    SELECT DISTINCT od.product_id
    FROM order_details AS od
    INNER JOIN orders AS o ON od.order_id = o.order_id
    WHERE o.shipped_date >= '2024-06-01' AND o.shipped_date < '2024-07-01'
)
SELECT
    COALESCE(ord.product_id, shp.product_id) AS product_id,
    CASE
        WHEN ord.product_id IS NOT NULL AND shp.product_id IS NOT NULL THEN 'Both'
        WHEN ord.product_id IS NOT NULL THEN 'Ordered Only'
        ELSE 'Shipped Only'
    END AS status
FROM ordered AS ord
FULL OUTER JOIN shipped AS shp ON ord.product_id = shp.product_id
ORDER BY status, product_id;

INNER JOIN would only return “Both” products, silently dropping any product that appears in only one list,which is exactly the information this reconciliation is designed to surface.

9.10 Summary

Data is split across tables to eliminate redundancy and ensure consistency: each entity (customers, products, suppliers) lives in exactly one table, connected to related entities through key references. Primary keys uniquely identify rows within a table, and foreign keys reference primary keys in other tables, creating the links that joins traverse. Relationships between tables follow patterns: one-to-many (one supplier, many products), one-to-one, and many-to-many (orders to products, connected through the order_details junction table).

INNER JOIN combines rows where the join condition matches, excluding unmatched rows from both sides, while LEFT JOIN keeps all rows from the left table and fills NULL for unmatched right-side columns. This distinction is critical for preserving complete counts and detecting missing relationships. Multi-table queries chain joins to traverse relationships across the data model, and building these queries incrementally, one join at a time with row-count verification, prevents subtle errors.

Joins are row-matching operations, not set operations, and Venn diagrams are misleading for understanding them. Row multiplication occurs naturally when joining one-to-many relationships, and understanding the resulting grain of your data, the level of detail each row represents, is essential for correct aggregations.

9.11 Glossary

CROSS JOIN: A join that produces the Cartesian product of two tables, combining every row from the left table with every row from the right table. No join condition is used.
Foreign Key: A column in one table that references the primary key of another table, establishing a link between the two entities and enabling joins.
FULL OUTER JOIN: A join that returns all rows from both tables, filling NULL where there is no match on either side. Used for reconciliation and completeness checks.
Grain: The level of detail that each row in a query result represents. A table at “order grain” has one row per order; joining to order details changes the grain to one row per line item.
INNER JOIN: A join that returns only the rows where the join condition is satisfied in both tables. Unmatched rows from either side are excluded.
Join Condition: The expression in the ON clause that specifies how rows from two tables should be matched, typically an equality comparison between a foreign key and a primary key.
Junction Table: A table that connects two other tables in a many-to-many relationship by holding foreign keys to both. Also called a bridge table or associative table. The order_details table is a junction table connecting orders and products.
LEFT JOIN: A join that returns all rows from the left table and matching rows from the right table. Where no match exists, right-side columns are filled with NULL.
Many-to-Many Relationship: A relationship where entities on both sides can relate to multiple entities on the other side. Requires a junction table to implement in a relational database.
One-to-Many Relationship: A relationship where one entity on one side corresponds to multiple entities on the other side. The “many” side holds the foreign key referencing the “one” side’s primary key.
Primary Key: A column (or combination of columns) that uniquely identifies each row in a table. No two rows can have the same primary key value, and the value cannot be NULL.
Self-Join: A join where a table is joined to itself, using two different aliases. Useful for hierarchical data or comparing rows within the same table.
Table Alias: A short name assigned to a table in a query using the AS keyword, making multi-table queries more readable by replacing long table names with brief identifiers.