8 Querying Data
Your manager at Northwind Traders has a simple request: “Tell me about our product catalog. What do we sell, what does it cost, and what’s running low in the warehouse?” This is the kind of question that drives real analytical work, and answering it requires you to be precise about what data you want, how to filter it, and how to organize the results.
This chapter teaches the core SQL statements that make that precision possible. You’ll learn to select specific columns, filter rows by condition, sort results, handle missing values, compute summaries like totals and averages, and combine results from multiple queries using set operations. By the end, you’ll be able to take a business question, translate it into SQL, and deliver a clean answer, all working within a single table.
8.1 The Structure of a SQL Query
SQL is a declarative language. Unlike Python or Java, where you write step-by-step instructions telling the computer how to do something, SQL describes what you want and lets the database figure out how to get it. You say “give me all products under $20, sorted by price,” and the database decides whether to use an index, scan the table, or employ some other optimization. This is a fundamentally different way of thinking about computation, and it takes some adjustment if you’re coming from procedural programming.
The most common SQL statement is SELECT, which retrieves data from one or more tables. A basic query has this structure:
DuckDB
SELECT columns
FROM table
WHERE conditions
ORDER BY columns
LIMIT count;Each of these clauses plays a specific role, and you’ll learn them one at a time. But here’s something crucial to understand from the start: the order you write SQL is not the order the database executes it.
You write SELECT first, but the database processes FROM first (to know which table to look at), then WHERE (to filter rows), then SELECT (to choose columns), then ORDER BY (to sort), then LIMIT (to cap the output). This execution order explains many otherwise-confusing SQL behaviors, and we’ll return to it throughout this chapter.
execution-order
Written Order: SELECT → FROM → WHERE → ORDER BY → LIMIT
Execution Order: FROM → WHERE → SELECT → ORDER BY → LIMIT
Keep this mental model handy. When a query doesn’t behave the way you expect, the execution order is often the explanation.
8.2 SELECT and FROM: Choosing What to See
The simplest useful query asks for specific columns from a table:
DuckDB
SELECT
product_name,
unit_price,
units_in_stock
FROM products;This returns three columns from every row in the products table. Notice that we’re choosing which information to see, not loading the entire table. In a column-oriented database like DuckDB, this distinction matters for performance: the engine only reads the columns you asked for.
To see everything, you can use SELECT *:
DuckDB
SELECT * FROM products;The * means “all columns.” This is useful for exploration, but in production queries and scripts, you should always name your columns explicitly. When someone reads your SQL six months from now (including future you), SELECT * doesn’t communicate what data actually matters to the analysis.
DuckDB allows you to write FROM without SELECT, which is convenient during exploration:
DuckDB
-- [DUCKDB EXTENSION] FROM-first syntax
FROM products;This is equivalent to SELECT * FROM products. It’s useful when you’re poking around a new table and want to see what’s there. Throughout this book, we’ll use the full SELECT ... FROM syntax for clarity in teaching, and the shorthand when we’re just exploring. This is a DuckDB convenience, not standard SQL, so it won’t work in PostgreSQL or MySQL.
8.2.1 Column Aliases
Sometimes column names are cryptic or you want to label computed results. The AS keyword creates an alias:
DuckDB
SELECT
product_name AS product,
unit_price AS price_usd,
units_in_stock AS inventory
FROM products;The alias changes the column name in the output without changing anything in the underlying table. This is purely cosmetic, but it makes results much easier to read, especially when you start computing new columns.
8.2.2 Computing New Columns
SQL lets you create new columns from expressions:
DuckDB
SELECT
product_name,
unit_price,
units_in_stock,
unit_price * units_in_stock AS inventory_value
FROM products;The inventory_value column doesn’t exist in the table. It’s computed on the fly for each row by multiplying price and stock. This is one of SQL’s strengths: you can derive new information without modifying the original data.
8.2.3 DISTINCT: Removing Duplicates
When you want to see the unique values in a column, DISTINCT eliminates duplicate rows:
DuckDB
SELECT DISTINCT country
FROM suppliers;This returns each country exactly once, regardless of how many suppliers are in each country. DISTINCT applies to the entire row, so SELECT DISTINCT country, city FROM suppliers returns unique country-city combinations.
8.2.4 Exercises
Write a query that returns the
company_name,contact_name, andcountrycolumns from thecustomerstable.Write a query that computes an
inventory_valuefor each product (unit price times units in stock) and also aninventory_weightcolumn (weight per unit times units in stock). Use appropriate aliases for all computed columns.How many distinct countries do Northwind’s customers come from? Write the query, then adjust it to find how many distinct city-country combinations there are. Why is the second number larger?
Predict the output of this query without running it. Then run it to check.
DuckDB
SELECT DISTINCT category_id FROM products ORDER BY category_id;
1.
SELECT company_name, contact_name, country FROM customers;Returns 5,000 rows.
2.
SELECT product_name, unit_price, units_in_stock, weight_kg, unit_price * units_in_stock AS inventory_value, weight_kg * units_in_stock AS inventory_weight FROM products;3.
SELECT COUNT(DISTINCT country) FROM customers;Returns ~21 countries. A separate query for SELECT COUNT(*) FROM (SELECT DISTINCT city, country FROM customers); returns more rows because multiple cities exist per country.
4. Prediction: the 15 distinct category_id values in ascending order, one per row.
8.3 WHERE: Filtering Rows
SELECT chooses columns. WHERE chooses rows. It evaluates a condition for every row and only keeps the rows where the condition is true.
DuckDB
SELECT
product_name,
unit_price
FROM products
WHERE unit_price < 20;This returns only the products that cost less than $20. The comparison operators work as you’d expect: = (equal), != or <> (not equal), <, >, <=, >=.
SQL uses = for equality comparisons, not ==. If you’re coming from Python, this will trip you up at first. In SQL, WHERE country = 'USA' is correct. WHERE country == 'USA' is a syntax error in most databases (though DuckDB is forgiving and accepts both).
8.3.1 Combining Conditions
Use AND and OR to combine multiple conditions:
DuckDB
-- Products under $20 with low stock
SELECT
product_name,
unit_price,
units_in_stock
FROM products
WHERE unit_price < 20
AND units_in_stock < 30;DuckDB
-- Products from supplier 1 or supplier 2
SELECT
product_name,
supplier_id
FROM products
WHERE supplier_id = 1
OR supplier_id = 2;When mixing AND and OR, use parentheses to be explicit about grouping. SQL evaluates AND before OR (like multiplication before addition in math), which can produce unexpected results:
DuckDB
-- Without parentheses: AND binds tighter, which may surprise you
SELECT product_name, unit_price, supplier_id
FROM products
WHERE supplier_id = 1 OR supplier_id = 2 AND unit_price < 20;
-- With parentheses: your intent is clear
SELECT product_name, unit_price, supplier_id
FROM products
WHERE (supplier_id = 1 OR supplier_id = 2) AND unit_price < 20;Always use parentheses when combining AND and OR. Future readers of your code (and the database engine) will thank you.
8.3.2 Useful Filtering Patterns
IN checks membership in a list, replacing multiple OR conditions:
DuckDB
-- Instead of: WHERE country = 'USA' OR country = 'UK' OR country = 'Germany'
SELECT company_name, country
FROM suppliers
WHERE country IN ('USA', 'UK', 'Germany');BETWEEN checks a range (inclusive on both ends):
DuckDB
SELECT product_name, unit_price
FROM products
WHERE unit_price BETWEEN 10 AND 30;LIKE matches text patterns, where % represents any sequence of characters and _ represents a single character:
DuckDB
-- Products whose names start with "Ch"
SELECT product_name
FROM products
WHERE product_name LIKE 'Ch%';IS NULL and IS NOT NULL check for missing values, which deserve their own section.
8.4 NULL: The Absence of Data
NULL is one of SQL’s most important and most misunderstood concepts. A NULL value means “unknown” or “missing”, not zero, not empty string, not false. It means the data simply doesn’t exist.
In the Northwind database, the shipped_date column in the orders table contains NULL for orders that haven’t been shipped yet. This isn’t the same as the shipped date being blank or zero. It means the shipment hasn’t happened, so no date exists to record.
The critical rule is: NULL is not equal to anything, including itself. This means ordinary comparisons don’t work:
DuckDB
-- This does NOT find orders with missing shipped dates
SELECT order_id, order_date, shipped_date
FROM orders
WHERE shipped_date = NULL; -- Returns nothing! NULL = NULL is not TRUE, it's NULL
-- This is correct
SELECT order_id, order_date, shipped_date
FROM orders
WHERE shipped_date IS NULL;Any arithmetic or comparison involving NULL produces NULL. If shipped_date is NULL for some order, then shipped_date - order_date (the number of days to ship) is also NULL for that order, not zero. This propagation of unknowns is logically sound (you can’t subtract a date from an unknown and get a known result), but it catches people off guard constantly.
Most programming languages have two-value boolean logic: TRUE and FALSE. SQL has three: TRUE, FALSE, and NULL (unknown). The WHERE clause only keeps rows where the condition evaluates to TRUE, meaning rows where the condition is NULL are filtered out just like rows where it’s FALSE. This is why WHERE shipped_date = NULL returns nothing: the comparison yields NULL, not TRUE.
The COALESCE function provides a fallback value when data might be NULL:
DuckDB
SELECT
order_id,
order_date,
COALESCE(shipped_date, DATE '2999-12-31') AS effective_ship_date
FROM orders;COALESCE takes any number of arguments and returns the first non-NULL value. Here, orders that haven’t shipped yet get the date 2999-12-31 as a stand-in. That date is a sentinel value: a deliberately impossible value that stands in for “this hasn’t happened yet.” It’s your primary tool for handling missing data in queries.
Why use a far-future date instead of NULL? Because some systems and calculations can’t handle NULL gracefully. If you’re sorting orders by ship date to prioritize the oldest unshipped ones, NULL values sort unpredictably across databases, but 2999-12-31 reliably sorts to the bottom. If you’re computing date differences, NULL propagates into more NULLs, but a sentinel gives you a number you can filter on later.
You’ll encounter sentinel values constantly in real-world databases. Common examples include 9999-12-31 or 2999-12-31 for “no end date” in employee records or contracts, -1 for “unknown” in integer ID columns, and 0000-00-00 in older MySQL systems. They predate widespread NULL support and persist because many legacy systems and ETL pipelines were built around them. The trade-off is that sentinels make NULL-safe operations easier but require everyone to know the convention, and they’ll silently corrupt your aggregates if you forget to exclude them (imagine computing average shipping time when some orders have a ship date of 2999-12-31). Modern practice generally prefers NULL with explicit handling over sentinel values, but you need to recognize both approaches.
8.4.1 Exercises
Write a query that finds all products with a unit price between $15 and $25 that are not discontinued. Return the product name, unit price, and units in stock.
Find all customers in either Germany, France, or Brazil whose contact title contains the word “Manager” (use
LIKE). How many are there?This query has a bug. What’s wrong, and what does it actually return?
DuckDB
SELECT order_id, order_date, shipped_date FROM orders WHERE shipped_date != NULL;Write a query that returns all orders with a
ship_statuscolumn that shows theshipped_dateif it exists or'Not yet shipped'if it’sNULL(hint:CASEis a better fit here thanCOALESCE, because the result mixes a date and a string,tryCOALESCE(shipped_date, 'Not yet shipped')first and see what happens). Also compute ashipping_dayscolumn asshipped_date - order_date. What happens toshipping_daysfor orders that haven’t shipped? How would you useCOALESCEto provide a fallback for that column?Explain, in one sentence each, the difference between these three expressions:
WHERE shipped_date = '2024-01-15',WHERE shipped_date IS NULL, andWHERE shipped_date != '2024-01-15'. Which orders does the third expression exclude that might surprise you?
1.
SELECT product_name, unit_price, units_in_stock FROM products WHERE unit_price BETWEEN 15 AND 25 AND discontinued = 0;2.
SELECT company_name, contact_name, contact_title, country FROM customers WHERE country IN ('Germany', 'France', 'Brazil') AND contact_title LIKE '%Manager%';3. Bug: != NULL evaluates to NULL for every row. WHERE only keeps TRUE rows, so this returns 0 rows. Fix: use WHERE shipped_date IS NOT NULL.
4. COALESCE(shipped_date, ‘Not yet shipped’) errors because DATE and VARCHAR can’t mix. Use CASE instead:
SELECT order_id, order_date, shipped_date,
CASE WHEN shipped_date IS NOT NULL THEN CAST(shipped_date AS VARCHAR) ELSE 'Not yet shipped' END AS ship_status,
shipped_date - order_date AS shipping_days
FROM orders;shipping_days is NULL for unshipped orders. Use COALESCE(shipped_date - order_date, INTERVAL '0 days') or a sentinel like COALESCE(shipped_date - order_date, -1) for a numeric fallback.
5. First: orders shipped on exactly that date. Second: unshipped orders only. Third: orders shipped on any OTHER date,but it EXCLUDES unshipped orders because NULL != '2024-01-15' evaluates to NULL, not TRUE.
8.5 ORDER BY: Sorting Results
Results from a SQL query have no guaranteed order unless you specify one. The ORDER BY clause sorts the output:
DuckDB
SELECT
product_name,
unit_price
FROM products
ORDER BY unit_price;By default, ORDER BY sorts in ascending order (smallest to largest, A to Z). Add DESC for descending order:
DuckDB
-- Most expensive products first
SELECT
product_name,
unit_price
FROM products
ORDER BY unit_price DESC;You can sort by multiple columns. The first column is the primary sort, the second breaks ties:
DuckDB
-- Sort by category, then by price within each category
SELECT
category_id,
product_name,
unit_price
FROM products
ORDER BY category_id, unit_price DESC;NULL values sort last in ascending order and first in descending order by default. You can control this with NULLS FIRST or NULLS LAST if needed.
8.6 LIMIT: Capping the Output
LIMIT restricts how many rows are returned. This is essential for exploration (you rarely want to scroll through 100,000 rows) and for answering “top N” questions:
DuckDB
-- The 10 most expensive products
SELECT
product_name,
unit_price
FROM products
ORDER BY unit_price DESC
LIMIT 10;Always pair LIMIT with ORDER BY. Without ORDER BY, LIMIT 10 gives you 10 arbitrary rows, which is meaningless for analysis.
8.7 Aggregate Functions: Summarizing Data
Individual rows tell you about specific products. Aggregate functions tell you about the collection. They collapse many rows into a single summary value.
The five fundamental aggregates are:
DuckDB
SELECT
COUNT(*) AS total_products,
AVG(unit_price) AS avg_price,
SUM(units_in_stock) AS total_inventory,
MIN(unit_price) AS cheapest,
MAX(unit_price) AS most_expensive
FROM products;COUNT(*) counts rows. AVG, SUM, MIN, and MAX operate on a specific column. All aggregate functions except COUNT(*) ignore NULL values, which is important: AVG(unit_price) computes the average over only the rows where unit_price is not NULL.
COUNT has a subtle distinction worth knowing:
DuckDB
-- COUNT(*) counts all rows, regardless of NULLs
SELECT COUNT(*) FROM orders;
-- COUNT(column) counts only non-NULL values in that column
SELECT COUNT(shipped_date) FROM orders;
-- COUNT(DISTINCT column) counts unique non-NULL values
SELECT COUNT(DISTINCT country) FROM customers;8.8 CASE: Conditional Logic
The CASE expression brings if-then logic into SQL. It evaluates conditions in order and returns the value for the first matching condition:
DuckDB
SELECT
product_name,
unit_price,
CASE
WHEN unit_price >= 50 THEN 'Premium'
WHEN unit_price >= 20 THEN 'Standard'
ELSE 'Budget'
END AS price_tier
FROM products
ORDER BY unit_price DESC;CASE is remarkably useful. You’ll use it to categorize data, create flag columns, handle conditional calculations, and build readable output. Notice the END keyword that closes the expression, and the optional ELSE that provides a default when no conditions match (without ELSE, unmatched rows get NULL).
You can use CASE inside aggregate functions to count or sum conditionally:
DuckDB
SELECT
COUNT(*) AS total_products,
COUNT(CASE WHEN discontinued = 1 THEN 1 END) AS discontinued_count,
COUNT(CASE WHEN units_in_stock = 0 THEN 1 END) AS out_of_stock_count
FROM products;8.8.1 Exercises
Write a query that returns the 5 cheapest products that are not discontinued. Include the product name and price.
Using aggregate functions, answer these questions in a single query: How many products does Northwind carry? What is the average unit price? What is the total value of all inventory (sum of price × stock for all products)?
Using
CASE, write a query that classifies each product’s stock status: “Out of Stock” ifunits_in_stockis 0, “Critical” if stock is at or below the reorder level, “Adequate” otherwise. Return the product name, units in stock, reorder level, and stock status. Sort the results so that out-of-stock and critical products appear first. (We’ll learn to count how many products fall into each category in the Analytical SQL chapter.)What is wrong with this query? Predict what error you’ll get, then run it to confirm.
DuckDB
SELECT product_name, AVG(unit_price) FROM products;Using aggregate functions and
CASEtogether, write a single query that computes: the total number of products, the number of discontinued products, the number of products currently out of stock (not discontinued butunits_in_stock = 0), and the average price of active (not discontinued) products. (Hint: useCASEexpressions insideCOUNTandAVGto compute conditional aggregates without grouping.)
1.
SELECT product_name, unit_price FROM products WHERE discontinued = 0 ORDER BY unit_price ASC LIMIT 5;2.
SELECT COUNT(*) AS total_products, ROUND(AVG(unit_price), 2) AS avg_price, ROUND(SUM(unit_price * units_in_stock), 2) AS total_inventory_value FROM products;3.
SELECT product_name, units_in_stock, reorder_level,
CASE
WHEN units_in_stock = 0 THEN 'Out of Stock'
WHEN units_in_stock <= reorder_level THEN 'Critical'
ELSE 'Adequate'
END AS stock_status
FROM products
ORDER BY CASE WHEN units_in_stock = 0 THEN 1 WHEN units_in_stock <= reorder_level THEN 2 ELSE 3 END;4. Error: product_name is not aggregated and not in GROUP BY. You can’t mix individual columns with aggregate functions without GROUP BY.
5.
SELECT COUNT(*) AS total, COUNT(CASE WHEN discontinued = 1 THEN 1 END) AS discontinued_count,
COUNT(CASE WHEN discontinued = 0 AND units_in_stock = 0 THEN 1 END) AS out_of_stock_active,
ROUND(AVG(CASE WHEN discontinued = 0 THEN unit_price END), 2) AS avg_active_price
FROM products;8.9 Set Operations: Combining Query Results
Sometimes you need to combine the results of two separate queries into one result set. SQL provides set operations for this, and if you’ve encountered set theory in a math course, the concepts will feel familiar: they map directly to union, intersection, and difference.
Unlike joins, which combine columns from different tables side by side, set operations stack rows from multiple queries on top of each other. The key requirement is that both queries must return the same number of columns with compatible data types.
8.9.1 UNION and UNION ALL
UNION combines the results of two queries and removes duplicate rows. UNION ALL does the same but keeps all rows, including duplicates.
DuckDB
-- [STANDARD SQL] Unique cities where we have either a supplier or a customer
SELECT city, country
FROM suppliers
UNION
SELECT city, country
FROM customers
ORDER BY country, city;This query answers: “In which cities do we have a business presence, either through a supplier or a customer?” The UNION removes any city-country pair that appears in both tables, giving you the unique set. If London, UK appears as both a supplier location and a customer location, you’d see it once.
Compare that with UNION ALL:
DuckDB
-- [STANDARD SQL] All cities, keeping duplicates from both tables
SELECT city, country
FROM suppliers
UNION ALL
SELECT city, country
FROM customers
ORDER BY country, city;Run both and compare the row counts. The difference tells you how many city-country pairs appear in both tables.
UNION ALL is significantly faster than UNION (often 3 to 4x) because it skips the duplicate detection and removal step. Use UNION ALL when duplicates are acceptable or when you know they don’t exist. In practice, UNION ALL is far more common in analytical work because you usually want to preserve all records.
In most analytical queries, you also want to know which table each row came from. Adding a label column makes the rows from each query distinct, so UNION ALL is the natural choice:
DuckDB
-- [STANDARD SQL] Labeled directory: every supplier and customer location
SELECT city, country, 'Supplier' AS entity_type
FROM suppliers
UNION ALL
SELECT city, country, 'Customer' AS entity_type
FROM customers
ORDER BY country, city;Notice that using UNION instead of UNION ALL here would make no difference: since the entity_type column is always different between the two queries, no two rows can be identical, so there are no duplicates for UNION to remove. When your queries produce structurally distinct rows like this, UNION ALL is always the right choice.
8.9.2 INTERSECT
INTERSECT returns only the rows that appear in both queries:
DuckDB
-- [STANDARD SQL] Countries where we have BOTH a supplier and a customer
SELECT country FROM suppliers
INTERSECT
SELECT country FROM customers
ORDER BY country;This answers: “Which countries represent both sides of our supply chain?” These are countries where Northwind both sources products and sells to customers, potentially interesting for logistics optimization.
8.9.3 EXCEPT
EXCEPT returns rows from the first query that don’t appear in the second:
DuckDB
-- [STANDARD SQL] Countries where we have customers but no suppliers
SELECT country FROM customers
EXCEPT
SELECT country FROM suppliers
ORDER BY country;This identifies countries where Northwind has customers but no local supplier. The order matters: A EXCEPT B gives you rows in A that aren’t in B, which is different from B EXCEPT A.
A particularly useful application of EXCEPT is data validation, verifying that two queries produce identical results:
DuckDB
-- [STANDARD SQL] Check if two queries return the same results
-- An empty result means the queries match
(SELECT product_id, product_name FROM products WHERE unit_price > 20)
EXCEPT
(SELECT product_id, product_name FROM products WHERE NOT unit_price <= 20);8.9.4 Set Operations and Venn Diagrams
You’ll often see joins explained with Venn diagrams online, but as we’ll discuss in the next chapter, that analogy is misleading for joins. Set operations, however, are exactly where Venn diagrams apply correctly. UNION is A ∪ B (everything in either circle), INTERSECT is A ∩ B (the overlap), and EXCEPT is A − B (in A but not in B). Both sides represent complete rows of the same structure, and the operations combine, overlap, or subtract those rows, which is precisely what Venn diagrams model.
8.9.5 Compatibility Rules
All set operations require the same number of columns in both queries, compatible data types in corresponding positions (you can’t union an integer column with a text column), and column names come from the first query only. ORDER BY applies to the entire combined result and appears once at the end, not within individual queries.
8.9.6 Exercises
Write a query using
INTERSECTto find cities that appear in both thesuppliersandcustomerstables. How many shared cities are there?Write a query using
EXCEPTto find countries where Northwind has customers but no suppliers. What does this suggest about Northwind’s supply chain?A colleague writes this query and gets an error. What’s wrong?
DuckDB
SELECT company_name, city FROM suppliers UNION SELECT city, country FROM customers;Write a query using
UNION ALLthat creates a combined directory of all supplier and customer contacts, with columns forcompany_name,contact_name,country, and a label column ('Supplier'or'Customer'). Why isUNION ALLmore appropriate thanUNIONhere?
1.
SELECT city FROM suppliers INTERSECT SELECT city FROM customers ORDER BY city;2.
SELECT country FROM customers EXCEPT SELECT country FROM suppliers;Returns 11 countries (Belgium, China, Denmark, India, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Venezuela). These are markets where Northwind sells but has no local supplier, meaning all products must be imported from other countries.
3. Error: column types/semantics don’t align. company_name (from suppliers) paired with city (from customers) makes no logical sense even if types match. Set operations require matching column meaning, not just types.
4.
SELECT company_name, contact_name, country, 'Supplier' AS entity_type FROM suppliers
UNION ALL
SELECT company_name, contact_name, country, 'Customer' FROM customers
ORDER BY company_name;UNION ALL is better because the entity_type label makes every row unique, so UNION’s deduplication adds cost with no benefit.
8.10 Putting It All Together
Let’s return to the manager’s original question: “What do we sell, what does it cost, and what’s running low?” Now you have the vocabulary to answer precisely:
DuckDB
SELECT
product_name AS product,
unit_price AS price,
units_in_stock AS stock,
reorder_level AS reorder_point,
units_in_stock - reorder_level AS stock_above_reorder,
CASE
WHEN discontinued = 1 THEN 'Discontinued'
WHEN units_in_stock = 0 THEN 'OUT OF STOCK'
WHEN units_in_stock <= reorder_level THEN 'Reorder Now'
WHEN units_in_stock <= reorder_level * 1.5 THEN 'Running Low'
ELSE 'Adequate'
END AS stock_status
FROM products
ORDER BY stock_above_reorder;This single query retrieves product information, computes how far above (or below) the reorder point each product is, categorizes the stock status (including flagging discontinued items), and sorts so the most urgent items appear first. That’s a complete analytical deliverable in 14 lines of SQL.
To build your intuition, pick a specific product from the result set and trace through the CASE expression by hand. What are the values of discontinued, units_in_stock, and reorder_level for that product? Which WHEN branch does it match? What would happen if you reordered the WHEN clauses? Since CASE evaluates top-to-bottom and stops at the first match, clause order matters, and tracing through specific rows is the best way to verify that your logic handles every scenario correctly.
Read this query carefully, because it demonstrates something important about the whole game: the value isn’t in knowing that WHERE filters rows or ORDER BY sorts results. The value is in translating a business question into a precise, readable query that someone else can understand, verify, and build on. That translation, from human question to structured query, is the skill that matters.
8.11 Chapter Exercises
These exercises require combining multiple concepts from the chapter. Approach each one by identifying the business question first, then building the SQL step by step.
Inventory risk report. Write a single query that identifies all active (not discontinued) products where the current stock is below the reorder level. Return the product name, current stock, reorder level, and the deficit (reorder level minus stock). Sort by the deficit so the most urgent items appear first. This is a query a warehouse manager would actually run.
Price tier classification. Your manager asks: “Show me our product catalog organized by price tier.” Write a query that uses
CASEto classify each product into price tiers: “Budget” (under $10), “Standard” ($10 to $25), “Premium” ($25 to $50), and “Luxury” (over $50). Return the product name, unit price, and price tier. Sort by unit price descending. Then write a second query that uses conditional aggregation (CASE inside COUNT and AVG) to compute, in a single row, how many products fall into each tier and the overall average price of active products. This second query should not useGROUP BY.Execution order detective. Explain why this query produces an error, referencing the SQL execution order:
DuckDB
SELECT product_name, unit_price * units_in_stock AS inventory_value FROM products WHERE inventory_value > 1000;Then rewrite it so it works correctly.
NULL audit. Write a query against the
orderstable that counts, for each column that contains NULLs (shipped_date,shipper_id), how many rows haveNULLvalues and what percentage of the total that represents. This is a common data quality check. (Hint: you can compute everything in a single query usingCOUNT(*)andCOUNT(column)together.)Complete business directory. The Northwind CEO wants a single report listing every company Northwind does business with, including suppliers, customers, and shippers. For each company, show the company name, the type of relationship (
'Supplier','Customer', or'Shipper'), and the country (shippers don’t have a country in the data, so use'N/A'). Sort alphabetically by company name. Which set operation is the right choice here, and why?NULL propagation challenge. Predict the result of each expression before running it, then verify. For any that surprise you, explain why SQL produces that result.
DuckDB
SELECT NULL = NULL AS test_1, NULL != NULL AS test_2, NULL AND TRUE AS test_3, NULL OR TRUE AS test_4, NULL OR FALSE AS test_5, COALESCE(NULL, NULL, 'fallback') AS test_6, 5 + NULL AS test_7, 5 > NULL AS test_8;Then explain in your own words why
WHERE shipped_date != '2024-01-15'does not return orders whoseshipped_dateisNULL.Debug this report. A colleague wrote the following query to produce an inventory summary, but the results don’t look right. There are at least two problems. Find them all and write a corrected version.
DuckDB
SELECT product_name, unit_price AS price, unit_price * units_in_stock AS inventory_value, CASE WHEN inventory_value = 0 THEN 'Out of Stock' WHEN units_in_stock <= reorder_level THEN 'Reorder' ELSE 'OK' END AS status FROM products WHERE discontinued = 0 ORDER BY inventory_value;(Hints: think about execution order, and consider what
inventory_value = 0actually tests. Is a product with a $0 unit price but 100 units in stock really “Out of Stock”?)
1.
SELECT product_name, units_in_stock, reorder_level, reorder_level - units_in_stock AS deficit FROM products WHERE discontinued = 0 AND units_in_stock < reorder_level ORDER BY deficit DESC;2. First query classifies into Budget/Standard/Premium/Luxury with CASE on unit_price thresholds. Second query uses conditional aggregation: COUNT(CASE WHEN unit_price < 10 THEN 1 END) AS budget_count etc., with AVG(CASE WHEN discontinued = 0 THEN unit_price END).
3. Error: WHERE executes before SELECT, so inventory_value alias doesn’t exist yet. Fix: replace WHERE filter with WHERE unit_price * units_in_stock > 1000.
4.
SELECT COUNT(*) AS total_orders,
COUNT(*) - COUNT(shipped_date) AS null_shipped, ROUND(100.0 * (COUNT(*) - COUNT(shipped_date)) / COUNT(*), 2) AS pct_shipped_null,
COUNT(*) - COUNT(shipper_id) AS null_shipper, ROUND(100.0 * (COUNT(*) - COUNT(shipper_id)) / COUNT(*), 2) AS pct_shipper_null
FROM orders;5.
SELECT company_name, 'Supplier' AS type, country FROM suppliers
UNION ALL SELECT company_name, 'Customer', country FROM customers
UNION ALL SELECT company_name, 'Shipper', 'N/A' FROM shippers
ORDER BY company_name;UNION ALL is correct,entity_type makes rows unique, so dedup adds cost with no benefit.
6. test_1=NULL, test_2=NULL, test_3=NULL, test_4=TRUE, test_5=NULL, test_6=‘fallback’, test_7=NULL, test_8=NULL. Key insight: NULL poisons most operations except OR TRUE and COALESCE.
7. Two bugs: (a) CASE uses inventory_value alias which may not be available in CASE context, and (b) inventory_value = 0 is wrong,it catches $0 products with stock, not just out-of-stock items. Fix: use units_in_stock = 0 instead.
8.12 Summary
SQL queries follow a SELECT ... FROM ... WHERE ... ORDER BY ... LIMIT structure, but the database executes them in a different order: FROM first (find the table), then WHERE (filter rows), then SELECT (choose columns), then ORDER BY (sort), then LIMIT (cap output). Understanding this execution order prevents common errors. SELECT retrieves specific columns, and the AS keyword creates aliases for readability. ORDER BY sorts results (defaulting to ascending), and LIMIT caps the number of rows returned.
WHERE filters rows using comparison operators, AND/OR logic, IN for lists, BETWEEN for ranges, and LIKE for text patterns. NULL represents missing data and behaves differently from any concrete value: it is not equal to anything, propagates through operations, and requires IS NULL / IS NOT NULL for testing. SQL uses three-value logic (TRUE, FALSE, NULL), and WHERE only keeps rows where the condition is TRUE.
Aggregate functions (COUNT, SUM, AVG, MIN, MAX) collapse many rows into summary statistics, while CASE expressions add conditional logic, enabling categorization and conditional aggregation within a single query. Set operations (UNION, UNION ALL, INTERSECT, EXCEPT) combine results from multiple queries vertically, stacking rows rather than joining columns, with UNION ALL being the most common in analytical work because it preserves all rows and runs faster than UNION.
8.13 Glossary
- Aggregate Function
-
A function that takes multiple rows as input and returns a single summary value.
COUNT,SUM,AVG,MIN, andMAXare the five fundamental aggregates. - Alias
-
An alternative name assigned to a column or table in a query using the
ASkeyword, improving readability without modifying the underlying data. - CASE Expression
- A SQL construct that evaluates conditions in order and returns the value associated with the first matching condition, providing if-then logic within queries.
- Clause
-
A component of a SQL statement that performs a specific function.
SELECT,FROM,WHERE,ORDER BY, andLIMITare clauses. - COALESCE
-
A function that returns the first non-
NULLvalue from a list of arguments, commonly used to provide fallback values for missing data. - Declarative Language
- A programming paradigm where you describe what result you want rather than how to compute it. SQL is declarative: you specify the desired output, and the database determines the execution strategy.
- DISTINCT
- A keyword that eliminates duplicate rows from query results, returning only unique combinations of the selected columns.
- EXCEPT
- A set operation that returns rows from the first query that do not appear in the second query’s results. Equivalent to the set difference (A − B) in set theory.
- Execution Order
-
The sequence in which the database processes the clauses of a SQL statement, which differs from the order in which they are written. For the clauses covered in this chapter, the logical execution order is
FROM → WHERE → SELECT → ORDER BY → LIMIT. The full order, includingGROUP BYandHAVING(covered in Chapter 10), isFROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. - INTERSECT
- A set operation that returns only the rows that appear in both queries’ results. Equivalent to the intersection (A ∩ B) in set theory.
- NULL
-
A special marker in SQL representing the absence of a value, meaning “unknown” or “missing.”
NULLis not equal to anything, including itself, and requires special handling withIS NULLandIS NOT NULL. - Predicate
-
A condition in a
WHEREclause that evaluates toTRUE,FALSE, orNULLfor each row, determining which rows are included in the result. - Sentinel Value
-
A deliberately impossible or out-of-range value used as a stand-in for missing data, such as
9999-12-31for “no end date” or-1for “unknown ID.” Sentinel values predate widespreadNULLsupport and are common in legacy systems, but they can silently corrupt aggregations if not filtered out. - Set Operation
-
A SQL operation that combines the results of two queries vertically (stacking rows).
UNION,UNION ALL,INTERSECT, andEXCEPTare the four set operations, and they require compatible column structures in both queries. - UNION / UNION ALL
-
Set operations that combine rows from two queries.
UNIONremoves duplicate rows from the combined result, whileUNION ALLpreserves all rows including duplicates and is significantly faster.