7  Databases

You have been using databases your entire life without thinking about it. Every time you check your bank balance, search for a product online, or scan your student ID at the dining hall, a database is answering a question on your behalf. The ATM doesn’t keep your balance in a spreadsheet. The university doesn’t track enrollment in a folder of text files. These systems use databases because when data matters, when it needs to be correct, fast, and accessible to many people at once, flat files aren’t enough.

This chapter introduces what databases are, why different kinds exist, and why we’ll use one called DuckDB for our analytical work. By the end, you’ll have DuckDB running on your machine, a real dataset loaded, and your first SQL query executed. That might sound like a lot of setup, but DuckDB was designed to get out of your way. You’ll be querying data within minutes.

7.1 Why Databases Exist

Think about how you might track inventory for a small warehouse using spreadsheets. You’d probably create an Excel file with columns for product name, quantity, supplier, and location. For a dozen products, this works fine. But as the operation grows, problems emerge quickly.

Your colleague opens the same file and edits it simultaneously, and someone’s changes get overwritten. A new hire accidentally deletes a column. Someone enters “N/A” in a numeric field, and now your formulas break. The file grows to 500,000 rows and takes thirty seconds to open. You need to answer the question “which supplier’s products have the highest defect rate?” but that data lives across three different spreadsheets, and there’s no reliable way to connect them.

A database management system (DBMS) solves all of these problems through structure and rules. It enforces that every product has a valid supplier ID. It prevents two people from corrupting the same record simultaneously. It answers complex questions across millions of rows in milliseconds. It recovers gracefully from crashes. A DBMS is the difference between a filing cabinet where anyone can shove papers in any drawer and a library with a catalog system, a librarian, and a checkout process.

The key insight is that a database is not just a place to store data. It’s a system that manages data: enforcing rules about what goes in, coordinating who can access it, and providing a powerful language for getting answers out.

7.2 Spreadsheets and Databases: Knowing When to Use Each

Before we move on, let’s address the tool you already know well: the spreadsheet. Excel (and Google Sheets) is genuinely excellent software, and dismissing it would be unfair. A spreadsheet combines data entry, storage, analysis, and visualization in a single interface with immediate feedback. You change a cell and see calculation results instantly. Pivot tables give you drag-and-drop aggregation. Built-in charting produces presentation-ready visuals without writing code. Excel is pre-installed on most corporate computers. For quick analysis, what-if scenarios, and ad-hoc exploration on small datasets, nothing beats the speed of a well-built spreadsheet.

But spreadsheets have real limitations that surface as data grows and stakes rise, and engineers should know the specifics.

Excel’s hard row limit is 1,048,576 rows (the older .xls format caps at 65,536). Performance degrades noticeably above 200,000 to 300,000 rows when complex formulas are involved. There is no referential integrity: any data type can go in any cell, and nothing stops someone from typing “TBD” in a price column. Version control is effectively nonexistent, and concurrent editing creates conflicts. Formula auditing across complex cross-references is, as researchers have described it, “hard to detect and hard to correct even if discovered.”

These aren’t hypothetical risks. In 2012, a spreadsheet error contributed to JPMorgan’s “London Whale” incident, a $6.2 billion loss partly caused by an Excel Value-at-Risk model in which a formula divided by a sum instead of an average, understating volatility by roughly a factor of two. In 2020, the UK’s Public Health England lost approximately 16,000 positive COVID-19 test results because they used the old .xls format with its 65,536-row limit, and rows were silently truncated. An ongoing problem in genetics research involves Excel’s autocorrect converting gene names like MARCH1 and SEPT1 to calendar dates. A 2020 study found that roughly 30% of published papers with supplementary Excel gene lists contained these errors, and the problem became so pervasive that the HUGO Gene Nomenclature Committee actually renamed genes (SEPT1 became SEPTIN1) because researchers wouldn’t stop using Excel. Broader research from the European Spreadsheet Risks Interest Group estimates that 88% of spreadsheets contain errors.

The point isn’t that spreadsheets are bad. The point is that different tools serve different purposes.

Use a spreadsheet when you have fewer than 10,000 rows, a single user, ad-hoc analysis needs, quick visualization, or rapid prototyping. Use a database when you have more than 100,000 rows, multiple users, critical data integrity requirements, audit trail needs, or automation requirements. The transition triggers are concrete: performance degradation, hitting row limits, “who has the file open?” conflicts, version chaos, formula errors propagating silently, or file corruption.

Many of the good habits you’ll learn from databases actually transfer back to spreadsheet work. Using Excel Tables (Ctrl+T) enforces unique headers and auto-expands ranges. Keeping one header row with unique column names, storing one fact per cell, and organizing rows as observations and columns as variables are database principles that make spreadsheets more reliable. The formal term for this structure is tidy data, and it’s a concept that spans both tools. You’ll revisit this spreadsheet-vs-database decision throughout your career, and the framework you just learned (row count, user count, integrity requirements) is how experienced engineers make the call.

7.2.1 Exercises

  1. A manufacturing plant tracks machine uptime in a shared Excel file. Three shift supervisors update it at shift changes, and an analyst pulls weekly summaries from it. The file has 800 rows and is growing by about 50 rows per week. What specific problems would you expect to emerge within six months, and at what point would migrating to a database become worthwhile?

  2. For each of the following scenarios, would you recommend a spreadsheet or a database? Briefly justify each choice.

    1. A graduate student collecting survey responses from 200 participants for a thesis project.
    2. An e-commerce company tracking 2 million customer orders per year with data entry from 15 warehouse workers.
    3. A project manager building a one-time cost estimate for a client proposal.
    4. A hospital system storing patient records that must comply with audit and privacy regulations.
  3. The UK COVID-19 data loss occurred because the .xls format has a 65,536 row limit. If the data had been stored in the newer .xlsx format (1,048,576 row limit), would that have prevented the problem? What would a database-first approach have done differently?

1. Within six months: concurrent editing conflicts across shifts, no validation on entries, version chaos from copies. Migration is worthwhile now, not later.

2. A=Spreadsheet (small, single user, ad-hoc). B=Database (2M rows/year, 15 concurrent users, audit trail). C=Spreadsheet (one-time estimate, exploratory). D=Database (patient data needs compliance, audit trails, referential integrity).

3. Not reliably; a larger limit only postpones silent truncation. The root issue was using a spreadsheet for mission-critical, append-only data. A database would enforce validation at entry, provide audit trails, and separate storage from presentation.

7.3 Two Kinds of Work, Two Kinds of Engine

Not all data work looks the same. Consider two very different questions you might ask about inventory data:

“Update the quantity of Widget A to reflect that we just shipped 50 units.” This is a transaction: a small, precise change to a single record that needs to happen immediately and correctly.

“What were our average shipping costs by region for each quarter of last year?” This is an analysis: a broad sweep across potentially millions of records, aggregating and summarizing to reveal patterns.

These two kinds of work have fundamentally different performance requirements, and the database world has developed two architectures to handle them. The distinction between them is one of the most important concepts in data engineering.

7.3.1 Transactions vs. Analytics: The Engine Analogy

If you’ve ever compared gasoline and diesel engines, you already have an intuition for this distinction.

A gasoline engine is responsive and quick. It starts instantly, revs up fast, and handles the stop-and-go rhythm of city driving with ease. It’s optimized for lots of small, quick bursts of power. This is like an OLTP (Online Transaction Processing) database. OLTP systems handle the day-to-day heartbeat of a business: processing credit card swipes, updating inventory counts, recording new customer registrations. Each operation touches a small number of rows and needs to complete in milliseconds. PostgreSQL, MySQL, and SQL Server are OLTP engines. They store data row by row, which means retrieving a complete customer record (name, address, phone, email) is fast, because all those fields sit next to each other on disk.

A diesel engine is built differently. It produces massive torque, efficiently hauling heavy loads over long distances. It’s not as responsive off the line, but it’s relentless once it gets moving. This is like an OLAP (Online Analytical Processing) database. OLAP systems handle the questions that managers, analysts, and engineers ask: “What are our defect rates across all production lines for the past two years?” These queries scan millions of rows, aggregate columns, and compute summaries. DuckDB, ClickHouse, and BigQuery are OLAP engines. They store data column by column, which means computing the average of a single column across 10 million rows is fast, because all those values sit next to each other on disk, and the engine never wastes time reading the columns you didn’t ask about.

Here’s what that looks like in practice. Imagine a table with four columns and millions of rows:

Storage layouts
Row-Oriented Storage (OLTP):
  Row 1: [ID=1] [Name="Widget A"] [Price=12.50] [Stock=340]
  Row 2: [ID=2] [Name="Widget B"] [Price=8.75]  [Stock=1200]
  Row 3: [ID=3] [Name="Gadget C"] [Price=45.00] [Stock=89]
  ...millions more rows...

Column-Oriented Storage (OLAP):
  ID Column:    [1, 2, 3, 4, 5, ...]
  Name Column:  ["Widget A", "Widget B", "Gadget C", ...]
  Price Column: [12.50, 8.75, 45.00, ...]
  Stock Column: [340, 1200, 89, ...]

If you ask “give me everything about product #2,” the row-oriented database wins, because that entire record is stored in one contiguous block. If you ask “what’s the average price across all products?”, the column-oriented database wins, because it only reads the Price column and skips everything else.
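That asymmetry is easy to simulate. Here is a toy sketch in Python (a language later modules will introduce); the table is the diagram’s example data held in ordinary lists, not a real storage engine:

```python
# Toy illustration of the two layouts above. Names and numbers come
# from the chapter's example table, not from any real database.
rows = [                     # row-oriented: each record is contiguous
    (1, "Widget A", 12.50, 340),
    (2, "Widget B", 8.75, 1200),
    (3, "Gadget C", 45.00, 89),
]
columns = {                  # column-oriented: each column is contiguous
    "id":    [1, 2, 3],
    "name":  ["Widget A", "Widget B", "Gadget C"],
    "price": [12.50, 8.75, 45.00],
    "stock": [340, 1200, 89],
}

# "Everything about product #2": one contiguous record in row storage.
print(rows[1])               # (2, 'Widget B', 8.75, 1200)

# AVG(price): column storage touches only the price list and skips
# id, name, and stock entirely.
print(round(sum(columns["price"]) / len(columns["price"]), 2))  # 22.08
```

The same data, two layouts: the row lookup grabs one tuple, while the average never reads anything but prices.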

Neither architecture is “better.” They’re purpose-built for different workloads, just like you wouldn’t tow a trailer with a sports car or race a semi-truck on a circuit. The important thing is choosing the right engine for the work you’re doing. As engineers analyzing data, our work is overwhelmingly analytical, which is why we’ll use an OLAP database.

7.3.2 Exercises

  1. Classify each of the following operations as OLTP or OLAP, and explain which storage layout (row-oriented or column-oriented) would handle it more efficiently.

    1. A barcode scanner at a warehouse register deducts one unit from a product’s stock count.
    2. A supply chain manager asks, “What was our average shipping cost by carrier for Q3?”
    3. A web form creates a new customer account with name, email, and address.
    4. An analyst computes the year-over-year revenue growth rate for each product category.
  2. A colleague says, “Column-oriented databases are just better because analytics is more important than transactions.” What’s wrong with this reasoning? Give an example of a system where row-oriented storage is clearly the right choice.

  3. In the storage diagram above, imagine a table with 50 columns and 10 million rows. You run a query that computes AVG(price). Roughly how much data does a column-oriented engine read versus a row-oriented engine? Why does this difference grow as the number of columns increases?

1. A=OLTP/Row (single barcode scan). B=OLAP/Column (aggregate shipping costs across Q3). C=OLTP/Row (creating one customer record). D=OLAP/Column (year-over-year revenue growth).

2. Flawed; both matter. Transactions are the heartbeat (orders, inventory updates); analytics drive decisions. Most organizations run both (PostgreSQL for OLTP, DuckDB for OLAP). They’re complementary.

3. Column-oriented reads 1 column × 10M rows; row-oriented reads 50 columns × 10M rows = 50× more data. This is why analytical queries see dramatic speedups in columnar engines.

7.4 The Database Landscape

Databases are a mature industry with many providers, each with different strengths. Thinking about them like car manufacturers can help you navigate the landscape without getting lost in marketing.

7.4.1 Server-Based Databases

Most databases you’ll encounter in the professional world run as servers: separate processes (often on separate machines) that accept connections from clients over a network. This client-server architecture lets many users and applications share one database simultaneously, and it’s how organizations centralize their data.

A DoorDash analogy makes this concrete. Think about what happens when you order food through the app. You never walk into the restaurant’s kitchen yourself. Instead, you open the DoorDash app on your phone, browse the menu, and place an order. The app sends your request to the restaurant’s kitchen, which prepares your food. A driver picks it up and delivers the result back to you. You interact with the app, the kitchen does the work, and you never touch the raw ingredients.

Now map that to databases. The database server is the kitchen: the process that stores the data, processes your SQL queries, handles authentication, coordinates access when multiple people query at once, and ensures crash recovery. The database client is the DoorDash app: the software you interact with to send requests and receive results. This might be a graphical tool like pgAdmin or DBeaver, a command-line tool like psql, or an application driver like Python’s psycopg2. The network between them, just like the road between your house and the restaurant, means the client and server don’t need to be on the same machine. Multiple clients can send queries to the same server simultaneously, just like multiple customers can order from the same restaurant at the same time.

To connect a client to a server, you need a connection string specifying where the kitchen is and how to get in, something like postgresql://username:password@host:5432/database. The host is the address (like the restaurant’s location), 5432 is the port (like the specific entrance to use), and the credentials prove you’re allowed to order.
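If you’re curious how a program reads such a string, Python’s standard library can take one apart (the host and credentials below are made up for illustration):

```python
# Dissecting a connection string with the standard library.
# The server name and credentials here are invented examples.
from urllib.parse import urlparse

parts = urlparse("postgresql://username:password@db.example.com:5432/inventory")

print(parts.scheme)            # postgresql -- which protocol to speak
print(parts.hostname)          # db.example.com -- the restaurant's location
print(parts.port)              # 5432 -- the specific entrance to use
print(parts.username)          # username -- proof you're allowed to order
print(parts.path.lstrip("/"))  # inventory -- the database on that server
```

Every database client performs some version of this parsing before it ever opens a network connection.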

This analogy extends naturally to the digital tools you already use. When you open the DoorDash app on your phone, your phone is the client and DoorDash’s servers are, well, the servers. When you open Instagram, your phone is the client and Instagram’s servers handle storing and retrieving posts. Every time you use a web or mobile app, you’re using a client that talks to a server. Database client-server architecture is the same pattern.

PostgreSQL is the Toyota of databases: reliable, versatile, community-driven, and an excellent default choice for almost any workload. It’s open source, has been developed since the 1980s, and handles everything from small applications to large enterprise systems. If someone says “just use Postgres,” that’s rarely bad advice.

MySQL is similar in spirit, also open source and widely deployed. It powers enormous portions of the web (WordPress, Wikipedia) and is known for being approachable and well-documented. Oracle acquired it in 2010, which led to the creation of MariaDB as a community fork.

SQL Server is Microsoft’s offering, tightly integrated with the Microsoft ecosystem (Windows Server, Azure, Excel). Many enterprises, particularly those already invested in Microsoft tools, standardize on it.

Oracle Database is the Rolls-Royce of the database world: powerful, expensive, and dominant in large enterprise environments, particularly financial services and government. You’ll likely encounter it in your career, but you probably won’t choose it for a personal project.

These server databases are essential infrastructure for running businesses, but they come with operational complexity: installation, configuration, user management, backups, network security, and ongoing maintenance. You need to install both the server and a client, configure network access (default ports like 5432 for PostgreSQL or 3306 for MySQL), manage authentication credentials, and handle connection pooling since opening database connections is expensive (1 to 100 milliseconds each). For an analyst who wants to explore a dataset on their laptop, that’s significant overhead.

7.4.2 Embedded Databases

An embedded database takes a radically different approach. Instead of running as a separate server, it compiles directly into your application or runs as a standalone process on your machine. There’s no network, no user management, no configuration. Your database is just a file on your filesystem.

SQLite is the most widely-deployed database in the world, and chances are you’ve never installed it on purpose. It’s embedded in every smartphone, every web browser, every copy of macOS and Windows. It’s lightweight, goes everywhere, and is perfect for its intended use case. SQLite uses row-oriented storage and handles transactional workloads like application state, configuration, and local caching.

DuckDB is the newer counterpart, and the one we’ll use throughout this book. If SQLite is the embedded database for applications, DuckDB is the embedded database for analytics. It uses column-oriented storage, executes queries across multiple CPU cores simultaneously, and is purpose-built for how analysts actually work today: zero-configuration, single-file, and fast on analytical queries right out of the box.

The contrast between SQLite and DuckDB reinforces the OLTP/OLAP distinction. Both are embedded, serverless, single-file databases, but they’re optimized for opposite workloads:

Dimension        SQLite                          DuckDB
Storage          Row-oriented                    Column-oriented
Optimized for    Transactions (OLTP)             Analytics (OLAP)
Execution        Single-threaded, row-at-a-time  Multi-threaded, vectorized batches
Best query type  Point lookups by ID             Aggregations across millions of rows

Many production systems use both: SQLite for transactional writes and application state, DuckDB for analytical reads and data exploration. They’re complementary tools, not competitors.
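The division of labor can be sketched with Python’s built-in sqlite3 module standing in for the OLTP side (the table and numbers are invented; in practice DuckDB would run the analytical query):

```python
import sqlite3

# A hypothetical inventory table, using SQLite for transactional work.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stock (product_id INTEGER PRIMARY KEY, qty INTEGER)")
db.executemany("INSERT INTO stock VALUES (?, ?)", [(1, 340), (2, 1200), (3, 89)])

# OLTP-style work: a small, precise change to one record.
db.execute("UPDATE stock SET qty = qty - 50 WHERE product_id = 1")

# OLAP-style work: an aggregate sweep across every row.
(total,) = db.execute("SELECT SUM(qty) FROM stock").fetchone()
print(total)  # 1579
```

Both statements are plain SQL; the difference is the shape of the work, which is exactly what the two storage layouts optimize for.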

7.4.3 Exercises

  1. Using the DoorDash analogy, explain what happens when:

    1. Two clients send queries to the same server database at the same time.
    2. The database server crashes while a client is waiting for results.
    3. A new team member needs to start querying the database. What do they need from an administrator?
  2. A startup is choosing a database for their new inventory management system. They have 5 employees, expect 10,000 transactions per day, and need their web application to read and write data. Their data analyst also wants to run weekly reports summarizing trends. Recommend a database architecture (which database for what purpose) and justify your choice.

  3. Explain why SQLite and DuckDB are described as “complementary, not competitors” despite both being embedded databases. In what scenario would you use both in the same project?

1. A=Two simultaneous queries go to the server, which coordinates them. B=Client times out; server rolls back the in-progress transaction on restart. C=Admin creates a user account with permissions and provides a connection string.

2. PostgreSQL for the web application’s transactional reads and writes (multiple concurrent users, ACID guarantees), DuckDB for the analyst’s weekly trend reports. The two serve complementary workloads.

3. SQLite = embedded OLTP (mobile app caching account data locally). DuckDB = embedded OLAP (analyzing spending patterns on a laptop). They optimize for opposite workloads and never compete.

7.4.4 Why DuckDB

The choice of DuckDB is deliberate. As analysts and engineers, our primary database interaction is querying data to answer questions. We need a tool that:

  1. Installs in seconds, with no server to configure or maintain.
  2. Handles analytical queries across large datasets efficiently.
  3. Provides a standards-compliant SQL dialect that transfers to other databases.
  4. Integrates naturally with Python, which we’ll use in later modules.

DuckDB satisfies all of these. Under the hood, three architectural features make it fast for analytics. Columnar storage keeps values from the same column contiguous on disk, so computing AVG(sales) across 10 million rows reads only the sales column (maybe 40 MB) rather than entire rows (maybe 500 MB), with compression ratios of 5 to 10x since similar values compress efficiently. Vectorized execution processes data in batches of roughly 2,048 values rather than one row at a time, allowing modern CPUs to process these vectors with hardware-level optimizations for dramatically improved throughput. Morsel-driven parallelism automatically distributes work across all available CPU cores, achieving near-linear scaling for analytical queries without any configuration from you.
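The compression claim is easy to see in miniature. This toy sketch compares how well the same values compress when a column is stored contiguously versus interleaved with another column (made-up numbers; real engines use specialized column codecs, not zlib, but the intuition carries over):

```python
# Why similar values stored together compress well: a column has long
# runs of repeats, while interleaved rows break those runs apart.
import struct
import zlib

prices = [9.99] * 1000 + [19.99] * 1000      # a price column: runs of repeats
ids = list(range(2000))

columnar = struct.pack("2000d", *prices)      # the price column, contiguous
row_wise = b"".join(struct.pack("qd", i, p)   # id and price interleaved per row
                    for i, p in zip(ids, prices))

# The contiguous column compresses far smaller than the interleaved rows.
print(len(zlib.compress(columnar)), len(zlib.compress(row_wise)))
```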

You don’t need to memorize these internals, but understanding them explains why DuckDB can summarize millions of rows on your laptop in seconds while a spreadsheet with the same data would grind to a halt. The architecture is doing real work behind the scenes.

7.4.5 SQL: The Language You’re Really Learning

While DuckDB is our tool, SQL (Structured Query Language) is the skill. SQL has been the dominant language for relational databases since it was developed at IBM in the 1970s, and it has been formalized by the ANSI/ISO standards bodies since 1986 through several major revisions: SQL-92 standardized explicit join syntax, SQL:1999 added Common Table Expressions (CTEs), SQL:2003 brought window functions, SQL:2016 introduced JSON support, and SQL:2023 added graph query capabilities.

The core SQL you’ll learn in this module, SELECT, FROM, WHERE, GROUP BY, ORDER BY, joins, subqueries, CTEs, and window functions, is standard SQL that works across PostgreSQL, MySQL, SQL Server, Oracle, SQLite, and DuckDB. You’re learning a language that has been in continuous use for over 40 years and shows no signs of declining.

7.4.6 Exercises

  1. A job posting requires “experience with SQL.” Based on what you’ve read about ANSI SQL standards, should you be concerned about which specific database the company uses? What would you need to learn that’s new, and what would transfer from this book?

1. Core SQL (SELECT, WHERE, GROUP BY, joins, CTEs, window functions) is standard and works identically across databases. Dialect differences (SQL Server’s CONVERT, PostgreSQL’s ARRAY_AGG) are secondary and easy to pick up with the foundation you build here.

7.5 Setting Up DuckDB

Install DuckDB using your package manager (see Appendix G). Verify the installation:

terminal
duckdb --version

You should see output showing the installed version number. No server to start, no configuration files to edit, no users to create.

7.5.1 The CLI (Briefly)

You can start DuckDB directly in your terminal by typing duckdb:

terminal
duckdb

This drops you into an interactive SQL prompt where you can type queries directly.

DuckDB CLI
DuckDB v1.4.4 (Andium) 6ddac802ff
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D

At first glance, this prompt looks like any other in your terminal, but you’re actually inside DuckDB’s own command-line interface (CLI), which interprets your input as SQL commands rather than shell commands. This is an important distinction: if you type ls or cd here, DuckDB won’t understand, because those are shell commands and you’re no longer in the shell.

The way to tell where you are is to look at your prompt, the text that appears before your cursor. Compare these:

macOS / Linux shell
ozan@macbook ~/cw4e $
Windows PowerShell
PS C:\Users\ozan\cw4e>
DuckDB CLI
D

When you see $ at the end (macOS/Linux) or PS at the beginning followed by a path and > (Windows PowerShell), you’re in your operating system’s shell. Commands like ls, cd, mkdir, and duckdb work here. When you see the bare D, you’re inside DuckDB. Only SQL statements and DuckDB dot-commands (like .exit) work here. Getting these mixed up is one of the most common frustrations when you’re learning, and it never fully goes away; even experienced developers occasionally type ls into the wrong prompt.

Try a quick sanity check:

DuckDB CLI
SELECT 'Hello, DuckDB!' AS greeting;

Notice the semicolon, ;, at the end of the statement. SQL statements always end with a semicolon. This tells DuckDB that you’ve finished your statement and it’s time to execute it. If you press Enter without a semicolon, DuckDB won’t throw an error. Instead, it will wait for more input, letting you spread a long query across multiple lines. That’s a useful feature once you’re writing complex queries, but it can be confusing at first if you’re used to one-line shell commands. If DuckDB seems to be “stuck” waiting, try typing ; and pressing Enter.

You should see a formatted table with your greeting. This confirms that DuckDB is installed and working.

Type .exit or press Ctrl+D to leave the CLI and return to your normal terminal prompt.

Warning: You’re Not in the Shell Anymore

A common mistake is forgetting that the DuckDB CLI is a separate environment. If you see errors like Parser Error: syntax error at or near "ls", it means you’re typing shell commands into DuckDB. Look at your prompt:

  • D → You’re in DuckDB. Type .exit to leave.
  • $ or % at the end → You’re in a macOS/Linux shell.
  • PS C:\...> → You’re in Windows PowerShell.

When in doubt, check the prompt before typing.

7.6 The Northwind Dataset

Throughout this module, we’ll work with an enhanced version of the Northwind database, a dataset representing a fictional company called Northwind Traders that imports and exports specialty food products worldwide. Originally created by Microsoft as a teaching database, Northwind has become one of the most widely-used sample datasets in the SQL world, and for good reason: it models a realistic supply chain with suppliers, products, inventory, customer orders, shipping, and employees.

In the coming chapters, you’ll use this dataset to answer real business questions: Which product categories generate the most revenue? Are there suppliers we depend on too heavily? Which shipping routes are slowest? These are the kinds of questions that analysts and engineers answer every day, and SQL is how they do it.

7.6.1 Getting the Data

Download northwind.duckdb from Appendix F and save it to a dedicated folder like ~/cw4e. Navigate to that directory, verify the file is there with ls, then open the database:

terminal
cd ~/cw4e
ls                        # Confirm northwind.duckdb is present
duckdb northwind.duckdb
Tip: Your First Week on the Job

When you join a company as a data professional, your first task is almost always to figure out what data exists and where it lives. The workflow is exactly what you’re learning here: connect to a database, query information_schema to see what tables and columns exist, and consult the data dictionary if one exists. This process of exploring an unfamiliar schema is a skill you’ll use more often than writing complex queries, so treat these first exercises as rehearsal for a real onboarding experience.

Warning: DuckDB Creates Missing Files Silently

If you provide a filename that doesn’t exist, DuckDB silently creates an empty database instead of warning you. If your information_schema.tables query returns nothing, you likely opened the wrong file or are in the wrong directory. Use ls and pwd to verify your location before launching DuckDB. See Appendix F for detailed troubleshooting.

7.6.2 Exploring the Schema

Before writing queries, it’s worth getting a feel for what’s in the database. The SQL standard defines a set of information_schema views that let you inspect any database’s structure. To see what tables exist, run:

DuckDB CLI
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'main';

This lists every table in the database. You should see ten tables: the nine Northwind entity tables (categories, customers, employees, order_details, orders, products, regions, shippers, and suppliers) plus a data_dictionary table. A data dictionary is a table that describes the other tables: what each table and column represents, what values are expected, and how tables relate to each other. In professional environments, a data dictionary is often the first thing you consult when working with an unfamiliar database. We’ll query ours in the exercises below.

To see the columns and data types for a specific table, query information_schema.columns:

DuckDB CLI
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'products'
ORDER BY ordinal_position;

You’ll see columns like product_name with type VARCHAR (text), unit_price with type DOUBLE (numbers), and discontinued with type BIGINT (whole numbers). These data types determine what operations are valid on each column: you can calculate an average of DOUBLE values but not of VARCHAR values. We’ll explore data types more in Chapter 11, but for now, knowing how to inspect what’s in a database is the important skill.

These information_schema queries are your first tools for orienting yourself in an unfamiliar database. They’re standard SQL that works on PostgreSQL, MySQL, SQL Server, and every other major database. In a professional setting you’ll use them constantly; every new project starts with figuring out what data is available and how it’s structured.

7.7 Your First Query

Let’s close this chapter by actually asking the database a question. Suppose you’ve just started at Northwind Traders and your manager asks: “Which products do we carry, and how are they priced?”

DuckDB CLI
SELECT * FROM products;

This returns every row and every column in the products table. You can see product names, prices, stock levels, and more. Scroll through the data (the CLI paginates long results; press Enter to see more rows, or q to stop). Get a feel for what’s here.

That’s the whole game in miniature: someone has a question, you have data, and SQL is the bridge between them. How much revenue has Northwind generated? Which suppliers are we most dependent on? Which shipping routes are slowest? You’ll be able to answer all of these by the end of this module. In the next chapter, you’ll learn to be precise about which columns you want, which rows matter, and how to sort and summarize the results.

7.7.1 Exercises

  1. Run SELECT * FROM products; and examine the output. How many columns does the products table have? How many rows? Which column would you use to connect a product to its supplier?

  2. Run SELECT * FROM categories; and then SELECT * FROM suppliers;. Without writing any joins (we’ll learn those later), just by reading the data, can you identify which category “Chai” belongs to? Which country supplies the most products?

  3. Try running SELECT * FROM employees; and look at the reports_to column. Some values are numbers and one is empty (NULL). What do you think this column represents? What would it mean for an employee to have a NULL value here?

  4. Run SELECT * FROM data_dictionary; and read through the results. Pick two tables you haven’t explored yet and query information_schema.columns followed by SELECT * to examine them. Does the data dictionary’s description match what you actually see in the tables?

1. The products table has 10 columns and 250 rows. The supplier_id column connects products to suppliers.

2. Chai’s category_id is 1. Looking that up in the categories table, category 1 is “Beverages.” For the supplier question, scan through the suppliers table output and tally countries visually. You’ll find that the USA has the most suppliers. (In the next few chapters, you’ll learn SQL syntax to count and rank automatically, but for now, reading the output is the skill.)

3. The reports_to column stores each employee’s manager (as an employee_id). NULL means the employee has no manager; they’re at the top of the hierarchy.

4. Query SELECT * FROM data_dictionary to see descriptions of all tables. The order_details table is a junction table linking orders to products, with quantity and pricing per line item.

7.8 The DuckDB UI

So far we’ve been working in the DuckDB CLI, which is a good way to build comfort with the terminal and understand that you’re interacting with a database process, not your shell. But for the exploratory analytical work we’ll be doing in the coming chapters, a richer interface is more productive. Starting with version 1.2, DuckDB ships with a full-featured web-based user interface, built in collaboration with MotherDuck. It’s a notebook-style environment similar to Jupyter, but designed specifically for SQL work. You get syntax highlighting, autocompletion, an interactive data catalog, and visual column statistics, all running locally on your machine.

First, exit DuckDB if you’re still in the CLI (.exit or Ctrl+D), then relaunch with the -ui flag:

terminal
duckdb -ui northwind.duckdb

Your default browser will open to http://localhost:4213 with the DuckDB UI loaded. This might seem odd, since opening a browser usually means connecting to the internet, but that’s not what’s happening here. Behind the scenes, DuckDB starts a local web server on your own machine. The localhost in the URL literally means “this computer.” The :4213 is the port number it’s listening on. Think of your computer as an apartment building: localhost is the building’s address, and :4213 is the apartment number. Different programs listen on different ports so they don’t interfere with each other. Your browser is just being used as the display layer, like a fancy window into DuckDB. No data leaves your computer, no internet connection is required, and nobody else can access this interface. When you close DuckDB in the terminal, the web server shuts down and the URL stops working.

This pattern, using a browser as the interface for a locally-running application, is increasingly common in data tools, and you’ll encounter it again with Jupyter notebooks, Quarto previews, and other development tools.

The interface has three main areas. The left panel shows your database catalog: attached databases, tables, and their columns. The center panel is your SQL notebook, with cells where you write and execute queries. Press Cmd+Enter (macOS) or Ctrl+Enter (Windows/Linux) to run a cell. The results appear directly below each cell as a formatted table, and you can click on column headers to see distribution summaries.

We’ll use this UI for nearly all of our SQL work in this module. When you need to write longer, more complex SQL that you want to save and version control, you’ll write .sql files in Zed, your code editor. But for exploration and learning, the notebook interface is ideal.

Tip: In-Memory vs. Persistent Databases

Notice that we launched the UI with duckdb -ui northwind.duckdb, pointing it at our database file. If you run duckdb -ui without a filename, DuckDB creates an in-memory database, meaning the data lives only in your computer’s active memory (RAM), not saved as a file on disk. It’s like writing notes on a whiteboard: useful while you’re working, but gone once you close the program. Always provide the filename when working through these chapters so your work persists between sessions.
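If you do find yourself in an in-memory session by accident, you don’t have to restart. DuckDB can attach a database file after launch (ATTACH is DuckDB syntax; the filename here is this chapter’s database):

DuckDB CLI
ATTACH 'northwind.duckdb';
USE northwind;

After this, your queries run against the attached file, and your work persists to disk just as if you had passed the filename at launch.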

7.9 Chapter Exercises

These exercises bring together concepts from the entire chapter. They require you to think across sections rather than recalling a single idea.

  1. Architecture matching. A regional hospital network is modernizing its data systems. It has three needs:

    i. a system to record patient check-ins, prescription fills, and appointment bookings in real time across 12 clinics;
    ii. a system for the analytics team to study readmission patterns and treatment outcomes across 5 years of historical data; and
    iii. a system for a mobile app that caches the current user’s appointment schedule locally on their phone.

  For each need, recommend a specific database type (OLTP server, OLAP server, embedded OLTP, or embedded OLAP) and name one real database product that fits.
  2. Spreadsheet autopsy. You inherit a 14-tab Excel workbook that tracks product quality inspections for a manufacturing facility. Tab 1 is a lookup table of inspector names. Tabs 2 through 13 are monthly inspection logs (one tab per month) with columns for date, inspector ID, product ID, pass/fail, and notes. Tab 14 contains pivot tables and charts that summarize annual defect rates. The file is 890 MB and takes over a minute to open. Diagnose at least three specific problems with this setup, and describe, at a conceptual level (not SQL syntax), how a database would solve each one.

  3. Engine internals. Explain, in your own words, why computing AVG(unit_price) across 1 million products is faster in DuckDB than in PostgreSQL, even if both databases are running on the same laptop. Your explanation should reference at least two of the three DuckDB architectural features discussed in this chapter (columnar storage, vectorized execution, morsel-driven parallelism).

  4. Tool selection debate. After reading the DuckDB documentation, a colleague discovers that DuckDB can query CSV files directly without creating tables first. They argue: “There’s no reason to ever use CREATE TABLE. Just query the files every time.” Write a response that acknowledges the valid point in their argument but explains when and why creating persistent tables is the better choice.

  5. Predict and verify. Before running each of the following in DuckDB, predict what will happen, then run them and check your prediction.

    A. SELECT 1 + NULL;
    B. SELECT 'hello' > 'Hello';
    C. SELECT * FROM employees WHERE reports_to IS NULL; followed by SELECT * FROM employees WHERE reports_to = NULL;

  Explain each result. In particular, why do the two queries in C return different results?
  6. Real-world forensics. A colleague sends you a CSV file of 50,000 sensor readings from a manufacturing floor. They say “I opened it in Excel to check it, and everything looked fine.” Based on what you learned in this chapter about Excel’s limitations, list three specific things that could have gone wrong during that Excel inspection that wouldn’t happen if they’d loaded the file directly into a database. For each, explain the mechanism of failure.

1. i: an OLTP server such as PostgreSQL (many small, fast transactions across 12 clinics). ii: an OLAP database such as DuckDB (analytical scans over 5 years of historical data). iii: an embedded OLTP database such as SQLite (caching the user’s appointments locally on the phone).

2. Problems include: the same structure is duplicated across 12 monthly tabs, inviting inconsistency; annual analysis requires manually merging those tabs into the pivot tables on tab 14; and the 890 MB file takes over a minute to open. A database addresses each one: a single inspections table with a date column replaces the monthly tabs, SQL queries replace the manual merging and pivot tables, and efficient on-disk storage eliminates the open-time problem.

3. DuckDB is faster because of columnar storage (reads only unit_price column), vectorized execution (processes batches of 2048 values using SIMD), and morsel-driven parallelism (splits work across CPU cores).

4. Querying CSVs directly is fine for one-off exploration. Persistent tables win for repeated queries (cached vs. re-reading), enforced types (rejects invalid data), and discoverability (information_schema, schema documentation).
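To make the trade-off concrete, here is what each approach looks like in DuckDB (the filename is hypothetical; read_csv is DuckDB’s CSV reader):

DuckDB CLI
-- one-off exploration: query the file in place
SELECT * FROM read_csv('inspections.csv') LIMIT 10;

-- repeated analysis: load once into a persistent table
CREATE TABLE inspections AS
SELECT * FROM read_csv('inspections.csv');

The second form pays the parsing cost once and gives you a typed, discoverable table; the first re-reads and re-parses the CSV on every query.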

5. A: NULL (number + NULL = NULL). B: TRUE (‘h’ > ‘H’ in ASCII). C: First query returns 1 row (top manager); second returns 0 (= NULL is always unknown, never TRUE).
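These results aren’t DuckDB quirks: NULL propagation and binary string comparison are standard SQL. As a side demonstration of that portability (not part of the Northwind setup), you can verify the same rules with Python’s built-in sqlite3 module:

```python
import sqlite3

# These NULL and comparison rules are standard SQL, so SQLite
# (Python's built-in database engine) behaves the same way as DuckDB here.
conn = sqlite3.connect(":memory:")

# A: arithmetic with NULL yields NULL (1 plus an unknown value is still unknown).
assert conn.execute("SELECT 1 + NULL").fetchone()[0] is None

# B: 'hello' > 'Hello' is TRUE under the default binary comparison,
# because lowercase 'h' (0x68) sorts after uppercase 'H' (0x48).
assert conn.execute("SELECT 'hello' > 'Hello'").fetchone()[0] == 1

# C: x = NULL never matches any row; only IS NULL finds missing values.
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])
assert conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0] == 0
assert conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0] == 1
```

Every assertion passes, which is the point: the logic of NULL transfers across engines.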

6. Three risks: silent row truncation at Excel’s limit, automatic type conversion (leading zeros stripped, dates reformatted), floating-point precision loss. A database enforces types at import and preserves full precision.

7.10 Summary

Databases exist because structured data at scale requires more discipline than spreadsheets can provide: enforced rules, concurrent access, crash recovery, and a powerful query language. Spreadsheets excel at quick, interactive, small-scale analysis, but documented failures in finance, public health, and science demonstrate the risks of relying on them for large or critical data.

The database world divides into two main architectures. OLTP databases (like PostgreSQL and MySQL) use row-oriented storage optimized for transactions, the small, fast operations that keep businesses running. OLAP databases (like DuckDB and ClickHouse) use column-oriented storage optimized for analytics, the broad aggregations and summaries that inform decisions. Server databases follow a client-server architecture (like ordering through DoorDash, where your app is the client and the kitchen is the server) where separate processes communicate over a network, supporting many simultaneous users but requiring installation and configuration. Embedded databases like SQLite and DuckDB run locally with zero configuration, trading multi-user access for simplicity.

DuckDB is our tool for this module because it combines analytical power (through columnar storage, vectorized execution, and automatic parallelism) with installation simplicity, and it teaches standard SQL that transfers to any relational database. SQL itself is a decades-old standard with core features (SELECT, joins, CTEs, window functions) that work across every major database, and we’ll distinguish clearly between standard SQL and DuckDB-specific extensions throughout this book. The Northwind dataset gives us a realistic supply chain context, with suppliers, products, orders, and customers, to work with throughout the remaining chapters.

7.11 Glossary

ANSI SQL
The standardized version of SQL maintained by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO). Core features like SELECT, JOIN, CTEs, and window functions are defined in the standard and work across all major databases.
Client-Server Architecture
A database deployment model where a server process manages data and accepts connections from separate client applications over a network. PostgreSQL, MySQL, and SQL Server use this model.
Column-Oriented Storage
A database storage format that keeps all values from the same column together on disk, optimizing analytical queries that aggregate across many rows but few columns.
Connection String
A formatted string specifying the information needed to connect to a database server, typically including the host, port, credentials, and database name.
Data Dictionary
A table or document that describes the structure and contents of a database, including what each table and column represents, expected values, and relationships between tables.
DBMS (Database Management System)
Software that manages the storage, retrieval, and manipulation of data in a database, enforcing rules for data integrity, security, and concurrent access.
DuckDB
An open-source, embedded, column-oriented analytical database designed for OLAP workloads. It runs in-process with no server and integrates with Python.
Embedded Database
A database that runs within the application process rather than as a separate server, requiring no network configuration or administration. SQLite and DuckDB are embedded databases.
Localhost
A hostname that refers to the current computer. When a program starts a local web server (like duckdb -ui), localhost in the URL means the browser is connecting to a process on your own machine, not to a remote server on the internet.
OLAP (Online Analytical Processing)
A database workload pattern focused on complex queries that scan and aggregate large volumes of data to support analysis and decision-making.
OLTP (Online Transaction Processing)
A database workload pattern focused on processing many small, fast transactions like inserts, updates, and point lookups that keep day-to-day business operations running.
Port
A number that identifies a specific process or service on a computer. When multiple programs run on the same machine, each listens on a different port (e.g., :4213 for DuckDB, :5432 for PostgreSQL) so they don’t interfere with each other.
Row-Oriented Storage
A database storage format that keeps all values from the same row together on disk, optimizing transactional queries that retrieve or modify complete records.
Server Database
A database that runs as a separate process (often on a dedicated machine), accepting connections from multiple clients over a network. PostgreSQL, MySQL, and SQL Server are server databases.
SQL (Structured Query Language)
The standard language for interacting with relational databases, used to query, manipulate, and define data and database structures. First standardized in 1986 and continuously updated since.
Tidy Data
A data organization principle where each row represents one observation, each column represents one variable, and each cell contains a single value. Applies to both database tables and spreadsheets.
Vectorized Execution
A query processing technique that operates on batches of values (vectors) rather than one row at a time, enabling modern CPUs to process data more efficiently through hardware-level optimizations.