10 Analytical SQL
Up to this point, you’ve been asking pointed questions: “Which products cost less than $20?” “Which supplier provides Chai?” These are retrieval questions, and they’re important. But the questions that drive business decisions are usually about patterns: How are sales trending over time? Which product categories generate the most revenue? Who are our top customers, and are they buying more or less than last quarter? Answering these questions requires aggregating, grouping, ranking, and comparing data, and that’s the territory this chapter covers.
This chapter introduces GROUP BY for categorized summaries, HAVING for filtering those summaries, subqueries and Common Table Expressions (CTEs) for building complex logic in stages, and window functions for computations that need both individual row detail and aggregate context. These are the tools that separate someone who can look things up in a database from someone who can analyze it.
10.1 GROUP BY: Summarizing by Category
You’ve already used aggregate functions like COUNT and AVG to compute a single summary across an entire table. GROUP BY lets you compute those same summaries within categories. Instead of “what’s the average product price?”, you can ask “what’s the average product price per category?”
duckdb
-- [STANDARD SQL]
SELECT
category_id,
COUNT(*) AS product_count,
ROUND(AVG(unit_price), 2) AS avg_price,
SUM(units_in_stock) AS total_stock
FROM products
GROUP BY category_id;This query divides the products table into groups (one group per category_id), then computes the aggregates independently within each group. The result has one row per category, not one row per product.
The rule is strict: every column in SELECT must either appear in GROUP BY or be inside an aggregate function. You can’t ask for product_name here because there are many product names within each category, and the database doesn’t know which one to show. This is the single most common GROUP BY error, and the error message you’ll see (something like “column must appear in GROUP BY clause or be used in an aggregate function”) will become familiar fast.
10.1.1 Grouping by Multiple Columns
You can group by multiple columns to create finer-grained summaries:
duckdb
-- [STANDARD SQL]
SELECT
s.country,
c.category_name,
COUNT(*) AS product_count,
ROUND(AVG(p.unit_price), 2) AS avg_price
FROM products AS p
INNER JOIN suppliers AS s ON p.supplier_id = s.supplier_id
INNER JOIN categories AS c ON p.category_id = c.category_id
GROUP BY s.country, c.category_name
ORDER BY s.country, c.category_name;Each unique combination of country and category_name gets its own row and its own aggregate values. This is how you build the kind of cross-tabulated summaries that populate business reports.
When every non-aggregate column in your SELECT should be in the GROUP BY (which is almost always the case), DuckDB lets you write:
duckdb
-- [DUCKDB EXTENSION] GROUP BY ALL infers grouping columns
SELECT
s.country,
c.category_name,
COUNT(*) AS product_count
FROM products AS p
INNER JOIN suppliers AS s ON p.supplier_id = s.supplier_id
INNER JOIN categories AS c ON p.category_id = c.category_id
GROUP BY ALL;GROUP BY ALL automatically groups by every non-aggregate column in SELECT. This is a DuckDB convenience, not standard SQL. The portable equivalent is to list every non-aggregate column explicitly in GROUP BY, as shown above.
10.2 HAVING: Filtering Groups
WHERE filters individual rows before grouping. HAVING filters groups after aggregation. The distinction maps directly to the execution order: WHERE runs before GROUP BY, so it operates on individual rows. HAVING runs after, so it operates on the aggregated results.
“Which countries have more than 3 suppliers?”
duckdb
-- [STANDARD SQL]
SELECT
country,
COUNT(*) AS supplier_count
FROM suppliers
GROUP BY country
HAVING COUNT(*) > 3
ORDER BY supplier_count DESC;You cannot use HAVING without GROUP BY, and you cannot use a column alias in HAVING in standard SQL (because HAVING is processed before SELECT assigns the alias). DuckDB is lenient and allows aliases in HAVING, but writing the full aggregate expression makes your code portable.
Here’s the complete logical execution order with all clauses you’ve learned so far:
output
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT
This means WHERE can’t reference aggregate results (they don’t exist yet), and HAVING can’t reference column aliases from SELECT in standard SQL. When something doesn’t work, trace through this execution order to find out why.
10.2.1 Exercises
Write a query that counts how many products are in each category and computes the average unit price per category. Join to
categoriesto show the category name rather than the ID. Which category has the most products? Which has the highest average price?Write a query that finds all countries with more than 5 customers. Show the country and customer count, sorted by count descending.
This query is supposed to find categories where the average product price exceeds $25, but it has a bug. Find it and fix it:
duckdb
SELECT category_id, AVG(unit_price) AS avg_price FROM products WHERE AVG(unit_price) > 25 GROUP BY category_id;Explain the difference between
WHEREandHAVINGusing the execution order diagram. Why can’tWHEREfilter on aggregate results?
1.
SELECT c.category_name, COUNT(*) AS product_count, ROUND(AVG(p.unit_price), 2) AS avg_price
FROM products AS p
INNER JOIN categories AS c ON p.category_id = c.category_id
GROUP BY c.category_name
ORDER BY product_count DESC;2.
SELECT country, COUNT(*) AS customer_count FROM customers GROUP BY country HAVING COUNT(*) > 5 ORDER BY customer_count DESC;3. Bug: WHERE executes before GROUP BY, so aggregate functions don’t exist yet. Fix: move the condition to HAVING:
SELECT category_id, AVG(unit_price) AS avg_price FROM products GROUP BY category_id HAVING AVG(unit_price) > 25;4. Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. WHERE filters individual rows before any grouping happens, so aggregates haven’t been computed yet. HAVING runs after GROUP BY, when aggregates are available.
10.3 Subqueries: Queries Inside Queries
Sometimes you need the result of one query to drive another. A subquery is a complete SQL query nested inside another query. They come in two main forms.
10.3.1 Scalar Subqueries
A scalar subquery returns a single value and can be used anywhere you’d use a constant:
duckdb
-- [STANDARD SQL] Products priced above the overall average
SELECT
product_name,
unit_price
FROM products
WHERE unit_price > (
SELECT AVG(unit_price) FROM products
)
ORDER BY unit_price;The inner query computes the average price (a single number), and the outer query uses that number as the threshold. The database executes the inner query first, gets the result, and then uses it in the outer query.
10.3.3 Subqueries in FROM
You can use a subquery as a virtual table in the FROM clause:
duckdb
-- [STANDARD SQL] Customers with more than 10 orders, showing their total spend
SELECT
customer_orders.company_name,
customer_orders.order_count,
customer_orders.total_spent
FROM (
SELECT
c.company_name,
COUNT(DISTINCT o.order_id) AS order_count,
ROUND(SUM(od.unit_price * od.quantity), 2) AS total_spent
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY c.company_name
) AS customer_orders
WHERE customer_orders.order_count > 10
ORDER BY customer_orders.total_spent DESC;This pattern works, but nesting subqueries quickly becomes hard to read. When your logic has multiple stages, Common Table Expressions are a better approach.
10.4 CTEs: Building Queries in Stages
A Common Table Expression (CTE) is a named, temporary result set defined at the beginning of a query using the WITH keyword. Think of it as creating a named variable that holds a table, then using that variable in your main query. CTEs make complex queries readable by breaking them into named, logical steps.
duckdb
-- [STANDARD SQL] Same query as above, but readable
WITH customer_orders AS (
SELECT
c.company_name,
c.customer_id,
COUNT(DISTINCT o.order_id) AS order_count,
ROUND(SUM(od.unit_price * od.quantity), 2) AS total_spent
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY c.company_name, c.customer_id
)
SELECT
company_name,
order_count,
total_spent
FROM customer_orders
WHERE order_count > 10
ORDER BY total_spent DESC;The CTE customer_orders computes the aggregated customer data, and the main query filters and sorts it. The logic is identical to the nested subquery version, but the structure communicates your thinking: “first, I calculate each customer’s orders and spend; then, I filter to active customers.”
CTEs were introduced in the SQL:1999 standard and are supported by every major database. They’re one of the most universally useful features in SQL.
10.4.1 Chaining Multiple CTEs
CTEs shine when you need multiple preparation steps:
duckdb
-- [STANDARD SQL] Which product categories are most popular with our top customers?
WITH customer_spend AS (
-- Step 1: Total spend per customer
SELECT
o.customer_id,
SUM(od.unit_price * od.quantity) AS total_spent
FROM orders AS o
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY o.customer_id
),
top_customers AS (
-- Step 2: Identify top 20% by spend
SELECT customer_id
FROM customer_spend
WHERE total_spent > (SELECT PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY total_spent) FROM customer_spend)
),
top_customer_categories AS (
-- Step 3: What categories do top customers buy?
SELECT
cat.category_name,
COUNT(DISTINCT o.order_id) AS order_count,
ROUND(SUM(od.unit_price * od.quantity), 2) AS category_revenue
FROM top_customers AS tc
INNER JOIN orders AS o ON tc.customer_id = o.customer_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
INNER JOIN products AS p ON od.product_id = p.product_id
INNER JOIN categories AS cat ON p.category_id = cat.category_id
GROUP BY cat.category_name
)
-- Final: Display the results
SELECT
category_name,
order_count,
category_revenue
FROM top_customer_categories
ORDER BY category_revenue DESC;Each CTE is a clearly named step in the analysis. Anyone reading this query can trace the logic: calculate spend, identify top customers, analyze their category preferences, display results. Try flattening this into nested subqueries and you’ll appreciate why CTEs exist.
The top_customers CTE introduces PERCENTILE_CONT(0.80) WITHIN GROUP (ORDER BY total_spent), which computes the 80th percentile of customer spending. This is a [STANDARD SQL] ordered-set aggregate function (from SQL:2003). The WITHIN GROUP (ORDER BY ...) clause tells the function how to sort the values before computing the percentile. PERCENTILE_CONT(0.5) gives the median. The syntax looks unusual compared to regular aggregate functions, but it appears frequently in analytical work when you need to identify thresholds based on data distribution rather than fixed cutoff values.
Use subqueries when the nested logic is simple and fits naturally in one place (like a scalar comparison in WHERE). Use CTEs when the logic has multiple stages, when you need to reference the same intermediate result more than once, or when readability matters, which is almost always. In professional work, CTEs are strongly preferred because they make query intent visible and make debugging easier (you can run each CTE independently to check its output).
10.4.2 Exercises
Write a query using a scalar subquery that finds all products priced more than twice the average product price. Return the product name, price, and how much above the average it is.
Rewrite the following nested subquery as a CTE. Which version is easier to read?
duckdb
SELECT supplier_summary.company_name, supplier_summary.product_count FROM ( SELECT s.company_name, COUNT(*) AS product_count FROM suppliers AS s INNER JOIN products AS p ON s.supplier_id = p.supplier_id GROUP BY s.company_name ) AS supplier_summary WHERE supplier_summary.product_count >= 5 ORDER BY supplier_summary.product_count DESC;Write a CTE-based query with two stages: (a) compute total revenue per product (
SUM(unit_price * quantity)fromorder_details), then (b) find the top 10 products by revenue. Show the product name and total revenue.A colleague argues, “CTEs are just subqueries with extra syntax, so why bother?” Write a brief response explaining when CTEs provide real value over subqueries. Give a concrete example.
1.
SELECT product_name, unit_price,
ROUND(unit_price - (SELECT AVG(unit_price) FROM products), 2) AS above_avg
FROM products
WHERE unit_price > 2 * (SELECT AVG(unit_price) FROM products)
ORDER BY unit_price DESC;2. CTE version:
WITH supplier_summary AS (
SELECT s.company_name, COUNT(*) AS product_count
FROM suppliers AS s
INNER JOIN products AS p ON s.supplier_id = p.supplier_id
GROUP BY s.company_name
)
SELECT company_name, product_count FROM supplier_summary
WHERE product_count >= 5 ORDER BY product_count DESC;The CTE separates aggregation from filtering, making the logic clearer.
3.
WITH product_revenue AS (
SELECT p.product_name, ROUND(SUM(od.unit_price * od.quantity), 2) AS total_revenue
FROM order_details AS od
INNER JOIN products AS p ON od.product_id = p.product_id
GROUP BY p.product_id, p.product_name
)
SELECT product_name, total_revenue FROM product_revenue ORDER BY total_revenue DESC LIMIT 10;4. CTEs add value when: (a) logic has multiple stages (each CTE is a named step), (b) you reference the same intermediate result more than once (write it once, use it twice), (c) debugging is needed (run each CTE independently). Example: computing customer spend, filtering top customers, then analyzing their purchases,three named CTEs vs. three levels of nesting.
10.5 Window Functions: Analysis Without Collapse
Window functions are not an advanced topic you can skip. They are a core analytical tool that you’ll use constantly in professional work. Ranking, running totals, period-over-period comparisons, and moving averages are all standard analytical requirements, and window functions are how SQL handles them. They were added to the SQL standard in 2003, and every major database supports them.
In real work, nobody frames requests in SQL terms. Your manager will say “show me each rep’s numbers next to the team average” or “rank our products by margin within each category” or “compare this month to last month.” The skill isn’t knowing window function syntax,it’s recognizing that the phrase “each row alongside a group summary” is a window function problem, and “compare to the previous period” is a LAG problem. As you work through these exercises, practice translating business questions into SQL patterns, because that translation step is where the real expertise lives.
The fundamental tension they solve: GROUP BY gives you summaries but destroys individual row detail. Window functions give you summaries alongside detail, keeping every row intact while adding computed context.
10.5.1 The Core Concept
Compare these two queries:
duckdb
-- [STANDARD SQL] GROUP BY: collapses to one row per category
SELECT
category_id,
AVG(unit_price) AS avg_price
FROM products
GROUP BY category_id;
-- [STANDARD SQL] Window function: keeps every product, adds category average alongside
SELECT
product_name,
category_id,
unit_price,
AVG(unit_price) OVER (PARTITION BY category_id) AS category_avg_price
FROM products;The GROUP BY query returns one row per category. The window function query returns every product row, with the category’s average price added as a new column. Both compute the same averages, but the window function preserves the individual product context.
The OVER clause is what makes a function a window function. Inside OVER, PARTITION BY defines the groups (like GROUP BY but without collapsing), and ORDER BY defines the sequence within each group.
10.5.2 Ranking Functions
Window functions are the standard way to rank, number, and order rows within groups.
duckdb
-- [STANDARD SQL] Rank products by price within each category
SELECT
category_id,
product_name,
unit_price,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY unit_price DESC) AS price_rank,
RANK() OVER (PARTITION BY category_id ORDER BY unit_price DESC) AS price_rank_with_ties
FROM products;ROW_NUMBER assigns a unique sequential number within each partition. RANK does the same, but ties get the same rank (with gaps: 1, 2, 2, 4). DENSE_RANK is like RANK but without gaps (1, 2, 2, 3).
A practical application: finding the top 3 most expensive products in each category:
duckdb
-- [STANDARD SQL] Top 3 products per category (portable approach)
SELECT *
FROM (
SELECT
c.category_name,
p.product_name,
p.unit_price,
ROW_NUMBER() OVER (PARTITION BY p.category_id ORDER BY p.unit_price DESC) AS rn
FROM products AS p
INNER JOIN categories AS c ON p.category_id = c.category_id
) AS ranked
WHERE rn <= 3
ORDER BY category_name, rn;The subquery pattern above is necessary in most databases because you can’t filter on window function results in WHERE (the window function hasn’t been computed yet at the WHERE stage). DuckDB provides QUALIFY, a clause that filters on window function results directly:
duckdb
-- [DUCKDB EXTENSION] QUALIFY filters window function results
SELECT
c.category_name,
p.product_name,
p.unit_price
FROM products AS p
INNER JOIN categories AS c ON p.category_id = c.category_id
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.category_id ORDER BY p.unit_price DESC) <= 3
ORDER BY c.category_name, p.unit_price DESC;QUALIFY is to window functions what HAVING is to GROUP BY: a filter that runs after the computation. This is a DuckDB extension, not part of the SQL standard. The portable alternative is the subquery approach shown above.
10.5.3 LEAD and LAG: Looking at Adjacent Rows
LAG looks at the previous row, and LEAD looks at the next row within a partition. These are indispensable for time-series analysis.
duckdb
-- [STANDARD SQL] Monthly order analysis: compare each month to the previous
WITH monthly_orders AS (
SELECT
DATE_TRUNC('month', order_date) AS order_month,
COUNT(*) AS order_count,
ROUND(SUM(freight), 2) AS total_freight
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
order_month,
order_count,
LAG(order_count) OVER (ORDER BY order_month) AS prev_month_orders,
order_count - LAG(order_count) OVER (ORDER BY order_month) AS month_over_month_change,
total_freight
FROM monthly_orders
ORDER BY order_month;LAG(order_count) OVER (ORDER BY order_month) says “give me the order_count from the previous row, where ‘previous’ is defined by chronological order.” The first month has NULL for the previous value because there is no earlier row.
10.5.4 Running Totals and Moving Averages
Window functions with frame specifications compute running calculations:
duckdb
-- [STANDARD SQL] Running total of freight costs over time
WITH monthly_freight AS (
SELECT
DATE_TRUNC('month', order_date) AS order_month,
ROUND(SUM(freight), 2) AS monthly_freight
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
)
SELECT
order_month,
monthly_freight,
SUM(monthly_freight) OVER (ORDER BY order_month) AS cumulative_freight,
ROUND(
AVG(monthly_freight) OVER (
ORDER BY order_month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
), 2
) AS three_month_moving_avg
FROM monthly_freight
ORDER BY order_month;The ROWS BETWEEN 2 PRECEDING AND CURRENT ROW clause defines the window frame: the subset of rows within the partition that the aggregate considers. For a running total, the frame implicitly stretches from the start of the partition to the current row. For the moving average, we explicitly set a 3-month window (current row plus the two preceding rows).
10.5.5 Exercises
Write a query that ranks all products by unit price (highest first) using
ROW_NUMBER,RANK, andDENSE_RANK. Find a price where at least two products are tied and explain how the three functions differ for those rows.Write a query that shows each product alongside its category’s average price and the difference between the product’s price and its category average. Use a window function, not a correlated subquery. Which products are priced furthest above their category average?
Using
LAG, write a query that shows each order for customer'ALFKI'alongside the number of days since their previous order. Which gap between orders is the longest?Write a query that computes a running total of order count by month (using
DATE_TRUNCto group orders by month, then a window function for the cumulative count). At what month did Northwind surpass 100,000 total orders?Predict what happens if you use
PARTITION BY category_idinside aROW_NUMBERwindow function but forget theORDER BYinsideOVER(). Is the result deterministic? Why or why not?
1.
SELECT product_name, unit_price,
ROW_NUMBER() OVER (ORDER BY unit_price DESC) AS row_num,
RANK() OVER (ORDER BY unit_price DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY unit_price DESC) AS dense_rank_val
FROM products ORDER BY unit_price DESC;For tied prices: ROW_NUMBER gives different numbers (1,2), RANK gives same number with gaps (1,1,3), DENSE_RANK gives same number without gaps (1,1,2).
2.
SELECT product_name, category_id, unit_price,
ROUND(AVG(unit_price) OVER (PARTITION BY category_id), 2) AS category_avg,
ROUND(unit_price - AVG(unit_price) OVER (PARTITION BY category_id), 2) AS diff_from_avg
FROM products ORDER BY diff_from_avg DESC;3.
SELECT order_id, order_date,
LAG(order_date) OVER (ORDER BY order_date) AS prev_order_date,
order_date - LAG(order_date) OVER (ORDER BY order_date) AS days_gap
FROM orders WHERE customer_id = 'ALFKI' ORDER BY order_date;4.
WITH monthly AS (
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*) AS cnt
FROM orders GROUP BY DATE_TRUNC('month', order_date)
)
SELECT month, cnt, SUM(cnt) OVER (ORDER BY month) AS cumulative FROM monthly ORDER BY month;5. Without ORDER BY in OVER(), ROW_NUMBER assigns numbers arbitrarily within each partition. The result is nondeterministic,different runs may produce different numbering.
10.6 Date Functions
Working with time-based data is unavoidable in analytical SQL. DuckDB provides a rich set of date and time functions.
duckdb
SELECT
order_date,
YEAR(order_date) AS order_year,
MONTH(order_date) AS order_month,
DAYNAME(order_date) AS day_of_week,
DATE_TRUNC('quarter', order_date) AS order_quarter,
order_date - LAG(order_date) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS days_since_last_order
FROM orders
WHERE customer_id = 'ALFKI'
ORDER BY order_date;YEAR, MONTH, and DAYNAME extract components from dates. DATE_TRUNC rounds dates down to a specified precision (year, quarter, month, week, day), which is essential for grouping time-series data into periods. The combination of DATE_TRUNC with GROUP BY is one of the most common patterns in analytical SQL.
YEAR(), MONTH(), and DATE_TRUNC() are widely supported but have syntax variations across databases. PostgreSQL uses EXTRACT(YEAR FROM date) as the standard equivalent. SQL Server uses DATEPART(year, date). The concepts are identical; only the function names change. DAYNAME() is DuckDB-specific, and the standard equivalent varies by database.
10.7 A Complete Analysis: ABC Inventory Classification
Let’s bring everything together with a professional supply chain technique. ABC analysis classifies inventory items by their contribution to total value, following the Pareto principle: a small percentage of products (the “A” items) typically account for a large percentage of total inventory value.
The method ranks products by their total sales value, then classifies them into three tiers: “A” items (top 80% of cumulative value), “B” items (next 15%), and “C” items (remaining 5%).
duckdb
-- [STANDARD SQL] ABC Inventory Classification
WITH product_revenue AS (
-- Step 1: Calculate total revenue per product
SELECT
p.product_id,
p.product_name,
c.category_name,
ROUND(SUM(od.unit_price * od.quantity * (1 - od.discount)), 2) AS total_revenue
FROM order_details AS od
INNER JOIN products AS p ON od.product_id = p.product_id
INNER JOIN categories AS c ON p.category_id = c.category_id
GROUP BY p.product_id, p.product_name, c.category_name
),
ranked_products AS (
-- Step 2: Rank by revenue and compute cumulative percentage
SELECT
product_id,
product_name,
category_name,
total_revenue,
SUM(total_revenue) OVER (ORDER BY total_revenue DESC) AS cumulative_revenue,
SUM(total_revenue) OVER () AS grand_total
FROM product_revenue
)
-- Step 3: Classify into ABC tiers
SELECT
product_name,
category_name,
total_revenue,
ROUND(100.0 * cumulative_revenue / grand_total, 1) AS cumulative_pct,
CASE
WHEN 100.0 * cumulative_revenue / grand_total <= 80 THEN 'A'
WHEN 100.0 * cumulative_revenue / grand_total <= 95 THEN 'B'
ELSE 'C'
END AS abc_class
FROM ranked_products
ORDER BY total_revenue DESC;This query combines CTEs, joins across three tables, aggregate functions, window functions for running totals, and a CASE expression for classification. It’s a genuine analytical deliverable: the output tells a supply chain manager which products deserve the most attention. The “A” items need careful forecasting and tight inventory control. The “C” items might be candidates for simplification or discontinuation.
This is the kind of analysis that SQL enables: complex business logic expressed in a readable, reproducible query that anyone can verify and re-run. No copy-pasting between spreadsheets. No manual sorting. The answer updates automatically when new data arrives.
Notice something about this query: every SQL feature it uses, CTEs, window functions, joins, CASE, aggregates, is standard SQL that works in PostgreSQL, SQL Server, Oracle, and every other major database. The techniques you’ve learned are portable. The analytical thinking behind them is even more so.
10.8 Chapter Exercises
These exercises require combining GROUP BY, CTEs, window functions, and joins to answer multi-step analytical questions. Approach each as a real business request.
Quarterly revenue trend. Write a query that computes total revenue by quarter (use
DATE_TRUNC('quarter', order_date)on theorderstable, joined toorder_detailsfor revenue). Then useLAGto compute the quarter-over-quarter change. Which quarter had the largest growth over the preceding quarter?Customer lifetime analysis. Using CTEs, compute each customer’s total lifetime spend, their number of orders, and their average order value. Then rank customers by total spend using
DENSE_RANK. Show the top 10 customers with their rank, company name, total spend, order count, and average order value.Category share analysis. For each product category, compute the total revenue and the percentage of total company revenue it represents. Use a window function with
SUM() OVER ()(emptyOVERclause) to calculate the grand total. Sort by revenue share descending. Which category generates the largest share?Employee performance. Write a query that shows each employee’s total order count and total revenue generated (through orders they handled). Rank employees by revenue. Then use a self-join through
reports_toto add each employee’s manager name. Who is the top-performing employee, and who is their manager?Reorder the clauses. The following query contains valid SQL syntax but the clauses are scrambled. Rewrite it in the correct clause order and explain what the query does:
duckdb
HAVING COUNT(*) >= 3 FROM products ORDER BY avg_price DESC SELECT category_id, ROUND(AVG(unit_price), 2) AS avg_price, COUNT(*) AS cnt GROUP BY category_id LIMIT 5;Moving average report. Build a monthly freight cost report using CTEs and window functions. Step 1: Aggregate total freight per month. Step 2: Compute a 3-month moving average. Step 3: Flag months where actual freight exceeds the moving average by more than 20% (use
CASE). Present the month, actual freight, moving average, and the flag. This is a pattern commonly used in operations monitoring.Portable vs. convenient. Rewrite the following DuckDB-specific query using only standard SQL (no
GROUP BY ALL, noQUALIFY):duckdb
SELECT c.category_name, p.product_name, p.unit_price FROM products AS p INNER JOIN categories AS c ON p.category_id = c.category_id GROUP BY ALL QUALIFY ROW_NUMBER() OVER (PARTITION BY c.category_name ORDER BY p.unit_price DESC) = 1;Which version is more readable? Which would you use in a report that needs to run on PostgreSQL?
Customer geography. Write a query that counts how many customers are in each country, sorted by count descending. Only include countries with more than 5 customers using
HAVING. Then extend it: for each of those countries, also compute the average number of orders per customer. (Hint: you’ll need a subquery or CTE that first counts orders per customer, then groups by country.)Supply chain overlap. Write a query that answers: “For each country where Northwind has both suppliers and customers, how many of each are there?” Start by finding the overlapping countries with
INTERSECT, then use that result (as a CTE or subquery) to count suppliers and customers for those countries. Present the country, supplier count, and customer count side by side.Debug this analysis. A colleague wrote the following query to find “the most expensive product in each category,” but it returns incorrect results. Find the bug and explain why it produces wrong output.
duckdb
SELECT c.category_name, p.product_name, MAX(p.unit_price) AS max_price FROM products AS p INNER JOIN categories AS c ON p.category_id = c.category_id GROUP BY c.category_name, p.product_name ORDER BY max_price DESC;Write a corrected version using a window function, and explain why
GROUP BYalone can’t solve this problem cleanly.Window function prediction. Before running this query, predict the values of
running_totalandrow_numfor the first 5 rows. Then run it and verify.duckdb
SELECT order_date, freight, SUM(freight) OVER (ORDER BY order_date) AS running_total, ROW_NUMBER() OVER (ORDER BY order_date) AS row_num FROM orders WHERE customer_id = 'QUICK' ORDER BY order_date;What would change if you replaced
ROW_NUMBER()withRANK()? Under what condition wouldRANK()andROW_NUMBER()produce different results here?
1.
WITH quarterly AS (
SELECT DATE_TRUNC('quarter', o.order_date) AS qtr,
ROUND(SUM(od.unit_price * od.quantity), 2) AS revenue
FROM orders AS o
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY DATE_TRUNC('quarter', o.order_date)
)
SELECT qtr, revenue, LAG(revenue) OVER (ORDER BY qtr) AS prev_qtr,
ROUND(revenue - LAG(revenue) OVER (ORDER BY qtr), 2) AS qoq_change
FROM quarterly ORDER BY qtr;2.
WITH cs AS (
SELECT c.company_name, ROUND(SUM(od.unit_price * od.quantity), 2) AS total_spend,
COUNT(DISTINCT o.order_id) AS order_count
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY c.customer_id, c.company_name
)
SELECT DENSE_RANK() OVER (ORDER BY total_spend DESC) AS rank, company_name,
total_spend, order_count, ROUND(total_spend / order_count, 2) AS avg_order_value
FROM cs ORDER BY rank LIMIT 10;3.
WITH cr AS (
SELECT c.category_name, ROUND(SUM(od.unit_price * od.quantity), 2) AS revenue
FROM order_details AS od
INNER JOIN products AS p ON od.product_id = p.product_id
INNER JOIN categories AS c ON p.category_id = c.category_id
GROUP BY c.category_name
)
SELECT category_name, revenue,
ROUND(100.0 * revenue / SUM(revenue) OVER (), 1) AS pct_share
FROM cr ORDER BY revenue DESC;4.
WITH ep AS (
SELECT e.employee_id, e.employee_name, e.reports_to,
COUNT(DISTINCT o.order_id) AS order_count,
ROUND(SUM(od.unit_price * od.quantity), 2) AS revenue
FROM employees AS e
INNER JOIN orders AS o ON e.employee_id = o.employee_id
INNER JOIN order_details AS od ON o.order_id = od.order_id
GROUP BY e.employee_id, e.employee_name, e.reports_to
)
SELECT ep.employee_name, ep.order_count, ep.revenue,
RANK() OVER (ORDER BY ep.revenue DESC) AS revenue_rank,
m.employee_name AS manager
FROM ep LEFT JOIN employees AS m ON ep.reports_to = m.employee_id
ORDER BY revenue_rank;5. Correct order:
SELECT category_id, ROUND(AVG(unit_price), 2) AS avg_price, COUNT(*) AS cnt
FROM products
GROUP BY category_id
HAVING COUNT(*) >= 3
ORDER BY avg_price DESC
LIMIT 5;This finds the top 5 categories (by average price) that have at least 3 products.
6.
WITH mf AS (
SELECT DATE_TRUNC('month', order_date) AS month, ROUND(SUM(freight), 2) AS freight
FROM orders GROUP BY DATE_TRUNC('month', order_date)
)
SELECT month, freight,
ROUND(AVG(freight) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW), 2) AS moving_avg,
CASE WHEN freight > 1.2 * AVG(freight) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) THEN 'OVER' ELSE 'Normal' END AS flag
FROM mf ORDER BY month;7. Standard SQL version:
SELECT category_name, product_name, unit_price
FROM (
SELECT c.category_name, p.product_name, p.unit_price,
ROW_NUMBER() OVER (PARTITION BY c.category_name ORDER BY p.unit_price DESC) AS rn
FROM products AS p
INNER JOIN categories AS c ON p.category_id = c.category_id
) ranked WHERE rn = 1 ORDER BY category_name;The DuckDB version is more concise; the standard version is needed for PostgreSQL.
8.
WITH coc AS (
SELECT c.customer_id, c.country, COUNT(DISTINCT o.order_id) AS orders
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.country
)
SELECT country, COUNT(*) AS customer_count, ROUND(AVG(orders), 1) AS avg_orders_per_customer
FROM coc GROUP BY country HAVING COUNT(*) > 5 ORDER BY customer_count DESC;9.
WITH overlap AS (
SELECT country FROM suppliers INTERSECT SELECT country FROM customers
)
SELECT ov.country,
(SELECT COUNT(*) FROM suppliers WHERE country = ov.country) AS supplier_count,
(SELECT COUNT(*) FROM customers WHERE country = ov.country) AS customer_count
FROM overlap AS ov ORDER BY supplier_count DESC;10. Bug: GROUP BY includes product_name, so each group has one product,MAX(unit_price) just returns that product’s price. Fix with window function:
SELECT category_name, product_name, unit_price FROM (
SELECT c.category_name, p.product_name, p.unit_price,
ROW_NUMBER() OVER (PARTITION BY c.category_id ORDER BY p.unit_price DESC) AS rn
FROM products AS p
INNER JOIN categories AS c ON p.category_id = c.category_id
) ranked WHERE rn = 1 ORDER BY category_name;11. running_total is a cumulative sum of freight ordered by date. row_num is a sequential counter. RANK() would differ from ROW_NUMBER() only if two orders share the same order_date,ties get the same rank with a gap.
10.9 Summary
GROUP BY partitions data into categories and computes aggregate summaries within each group, with the strict requirement that all non-aggregate columns in SELECT must appear in the GROUP BY clause. HAVING filters groups after aggregation, operating on aggregate values rather than individual rows. Subqueries embed one query inside another for scalar comparisons or as virtual tables in FROM, while CTEs (Common Table Expressions) with WITH provide a more readable alternative by naming intermediate results as logical steps. CTEs are part of the SQL standard since SQL:1999 and are supported everywhere.
Window functions are a core analytical tool, not an advanced topic, and they compute aggregates alongside individual rows using OVER, with PARTITION BY defining groups and ORDER BY defining sequence. Ranking functions (ROW_NUMBER, RANK, DENSE_RANK) number rows within partitions, while LAG and LEAD access adjacent rows for period-over-period comparisons. Frame specifications (ROWS BETWEEN) enable running totals and moving averages, giving you precise control over which rows contribute to each window computation.
DuckDB extends standard SQL with GROUP BY ALL for automatic grouping and the QUALIFY clause for filtering window function results directly, though the portable alternatives are explicit column listing and subquery wrapping. These tools combine to enable professional analytical techniques like ABC inventory classification, where CTEs, joins, aggregates, window functions, and CASE expressions work together to produce business-ready deliverables.
10.10 Glossary
- ABC Analysis
- A supply chain technique that classifies inventory items into three tiers (A, B, C) based on their contribution to total value, applying the Pareto principle to focus management attention on the highest-value items.
- Common Table Expression (CTE)
-
A named, temporary result set defined with the
WITHkeyword at the beginning of a query. CTEs make complex queries readable by breaking logic into named, sequential steps. Part of the SQL standard since SQL:1999. - Cumulative Sum (Running Total)
-
A window function pattern that computes the sum of a column from the beginning of the partition through the current row, using
SUM(column) OVER (ORDER BY ...). - GROUP BY
- A SQL clause that partitions rows into groups based on one or more columns, after which aggregate functions compute summaries independently within each group.
- GROUP BY ALL
-
A DuckDB extension that automatically infers grouping columns from the non-aggregate columns in
SELECT. Not part of the SQL standard. - HAVING
-
A SQL clause that filters groups after
GROUP BYaggregation, analogous toWHEREbut operating on aggregate values rather than individual rows. - LAG / LEAD
-
Window functions that access the value of a column from a previous (
LAG) or subsequent (LEAD) row within the partition, as defined by theORDER BYin theOVERclause. - PARTITION BY
-
A clause within a window function’s
OVERspecification that divides rows into groups for the window computation, analogous toGROUP BYbut without collapsing rows. - QUALIFY
-
A DuckDB extension that filters rows based on window function results, functioning for window functions as
HAVINGdoes forGROUP BYaggregates. The portable alternative is wrapping the window function in a subquery and filtering withWHERE. - Scalar Subquery
-
A subquery that returns exactly one value (one row, one column), which can be used wherever a single value is expected, such as in a
WHEREcomparison. - Window Frame
-
The subset of rows within a partition over which a window function operates, specified with
ROWS BETWEENorRANGE BETWEENin theOVERclause. - Window Function
-
A function that computes a value for each row using a “window” of related rows defined by
PARTITION BYandORDER BY, without collapsing the result set the wayGROUP BYdoes. Part of the SQL standard since SQL:2003.