19 DataFrames with Polars
You’ve established the first bridge between SQL and Python. In the previous chapter, you connected Python to DuckDB and moved your work into Marimo notebooks. SQL handles the joins and aggregations inside the database, and Python handles everything around it. But there’s a gap: once SQL results arrive in Python, you need a way to transform them further, to compute derived columns, reshape data, and prepare it for visualization or export.
That’s what DataFrames are for. A DataFrame is a table that lives in Python’s memory: rows and columns, just like a database table or a spreadsheet, but accessible to all the tools in the Python ecosystem. In this chapter, you’ll learn Polars, a modern DataFrame library designed for speed, expressiveness, and correctness. If you’ve been using Marimo SQL cells, you’ve already been working with Polars DataFrames without knowing it, because that’s exactly what SQL cell results are.
19.1 What Is a DataFrame?
A DataFrame is, at its core, a table. It has named columns, each with a consistent data type, and rows of data. If you’ve used SQL, the connection is immediate:
| SQL Concept | DataFrame Equivalent |
|---|---|
| Table | DataFrame |
| Column | Series (a single column) |
| SELECT | .select() |
| WHERE | .filter() |
| GROUP BY | .group_by() |
| ORDER BY | .sort() |
| AS (alias) | .alias() |
The difference is context. SQL operates inside a database engine. DataFrames operate inside Python. When you need to pass data to a Python visualization library, export it to Excel, or apply custom transformation logic that’s awkward to express in SQL, DataFrames are the tool.
You already know how to get data out of a database and into Python as lists of dictionaries (Chapter 18). DataFrames are a more structured, more efficient representation for the same data. They preserve column types, support vectorized operations (operations applied to an entire column at once, rather than row by row in a loop), and integrate with the rest of the Python data ecosystem.
19.2 Why Polars?
If you’ve encountered Python DataFrames before, it was probably through Pandas, the library that defined the concept for Python. Pandas is widely used and has an enormous ecosystem, but it has design decisions from 2008 that create real problems: silent type coercion (integers silently become floats when NaN appears), confusing index behavior, and inconsistent APIs.
Polars is a modern alternative, built from scratch with the benefit of hindsight. It uses Apache Arrow as its memory format (the same format DuckDB uses, which is why conversion between them is nearly free), supports lazy evaluation for query optimization, runs operations in parallel automatically, and never silently changes your data types.
For this book, Polars is the right choice for several reasons. It integrates seamlessly with DuckDB, it’s the default DataFrame library in Marimo SQL cells, and its expression API maps naturally to the SQL concepts you already know. When you encounter Pandas in the wild, the concepts transfer directly, even though the syntax differs.
Install Polars in your project:
terminal
uv add polars
19.3 Polars Fundamentals
19.3.1 Creating DataFrames
You can create a Polars DataFrame from a Python dictionary, a CSV file, a Parquet file, or a DuckDB query result:
creating_dataframes.py
import polars as pl
# From a dictionary
df = pl.DataFrame({
"product": ["Chai", "Chang", "Tofu", "Miso"],
"category": ["Beverages", "Beverages", "Produce", "Condiments"],
"price": [18.00, 19.00, 23.25, 13.00],
"stock": [39, 17, 35, 29],
})
# From a CSV file
df_csv = pl.read_csv("data/products.csv")
# From a Parquet file
df_parquet = pl.read_parquet("data/northwind_flat.parquet")
# From a DuckDB query result
import duckdb
conn = duckdb.connect("data/northwind.duckdb", read_only=True)
df_db = conn.sql("SELECT * FROM products").pl()
19.3.2 Viewing Data
Once you have a DataFrame, several methods let you understand its structure and contents:
viewing.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
print(df.head()) # First 5 rows
print(df.head(10)) # First 10 rows
print(df.tail(3)) # Last 3 rows
print(df.shape) # (num_rows, num_columns) as a tuple
print(df.columns) # List of column names
print(df.schema) # Dictionary of column names to data types
print(df.describe()) # Summary statistics for each column
The .schema property is especially useful. It tells you the exact data type of each column, with no ambiguity:
schema.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
print(df.schema)
# {'order_id': Int64, 'product_name': Utf8, 'unit_price': Float64, ...}
Compare this to CSV files (Section 16.4), where everything arrives as strings and you have to convert manually. DataFrames carry type information with them.
19.3.3 Selecting Columns
The .select() method chooses which columns to keep, analogous to SQL’s SELECT:
select.py
import polars as pl
df = pl.DataFrame({
"product": ["Chai", "Chang", "Tofu"],
"category": ["Beverages", "Beverages", "Produce"],
"price": [18.00, 19.00, 23.25],
"stock": [39, 17, 35],
})
# Select specific columns
print(df.select("product", "price"))
# Select using pl.col() for expressions
print(df.select(
pl.col("product"),
pl.col("price"),
(pl.col("price") * pl.col("stock")).alias("stock_value"),
))
The pl.col() function creates a column expression, a reference to a column that can be transformed, computed, or combined with other expressions. The .alias() method renames the resulting column, like AS in SQL.
19.3.4 Filtering Rows
The .filter() method keeps only rows that match a condition, analogous to SQL’s WHERE:
filter.py
import polars as pl
df = pl.DataFrame({
"product": ["Chai", "Chang", "Tofu", "Miso"],
"price": [18.00, 19.00, 23.25, 13.00],
"discontinued": [False, False, False, True],
})
# Simple filter
print(df.filter(pl.col("price") > 15))
# Multiple conditions with & (and) and | (or)
print(df.filter(
(pl.col("price") > 15) & (~pl.col("discontinued"))
))
# String filter
print(df.filter(pl.col("product").str.starts_with("Ch")))
When combining conditions with & (and) or | (or), each condition must be wrapped in parentheses. This is because Python’s operator precedence rules treat & and | as bitwise operators, which bind tighter than comparison operators. pl.col("price") > 15 & ~pl.col("discontinued") would be parsed incorrectly. Always use (pl.col("price") > 15) & (~pl.col("discontinued")).
19.3.5 Sorting
The .sort() method orders rows, analogous to SQL’s ORDER BY:
sort.py
import polars as pl
df = pl.DataFrame({
    "product": ["Chai", "Chang", "Tofu"],
    "category": ["Beverages", "Beverages", "Produce"],
    "price": [18.00, 19.00, 23.25],
})
print(df.sort("price")) # Ascending (default)
print(df.sort("price", descending=True)) # Descending
print(df.sort("category", "price", descending=[False, True])) # Mixed
19.3.6 Adding and Modifying Columns
The .with_columns() method adds new columns or modifies existing ones, while keeping all other columns:
with_columns.py
import polars as pl
df = pl.DataFrame({
"product": ["Chai", "Chang", "Tofu"],
"price": [18.00, 19.00, 23.25],
"stock": [39, 17, 35],
})
result = df.with_columns(
stock_value=pl.col("price") * pl.col("stock"),
price_category=pl.when(pl.col("price") > 20)
.then(pl.lit("Premium"))
.otherwise(pl.lit("Standard")),
)
print(result)
The .with_columns() call returns a new DataFrame with the additional columns. The original df is unchanged, because Polars DataFrames are immutable. This is the same principle as string operations in Python (Chapter 12): operations return new values rather than modifying in place.
19.3.7 Exercises
1. Create a Polars DataFrame from the Northwind products data by reading data/products.csv. Use .schema to examine the column names and data types. Which columns contain categorical data (text), and which contain numeric data? Then use .select() to choose only the product_name, category_id, and unit_price columns, and .filter() to keep only products with a unit_price greater than 20. How many products pass the filter?
2. Load the denormalized Northwind Parquet file (data/northwind_flat.parquet) into a DataFrame. The file is the result of joining multiple normalized tables. Use .columns to see all 46 column names. How many rows does it contain? What is the difference between using .select("category_name", "product_name") and .with_columns() to keep only those columns?
3. Using the flat file, create a new DataFrame that adds a computed column called revenue (calculated as unit_price * quantity * (1 - discount)). Use .select() to include only order_id, category_name, product_name, unit_price, quantity, discount, and your new revenue column. Then filter to show only orders from the “Beverages” category with revenue greater than 100. Sort the results by revenue in descending order.
1. Reading the CSV and examining schema:
products_schema.py
import polars as pl
df = pl.read_csv("data/products.csv")
print(df.schema)
Categorical columns include product_name (string), category_id (integer category reference), and discontinued (boolean). Numeric columns include unit_price (float), units_in_stock (integer), units_on_order (integer), reorder_level (integer). Filtering:
products_filter.py
import polars as pl
df = pl.read_csv("data/products.csv")
filtered = df.select("product_name", "category_id", "unit_price").filter(
pl.col("unit_price") > 20
)
print(f"Products with price > $20: {len(filtered)}")
2. Loading the flat file and exploring structure:
flat_file_explore.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")
print(df.columns)
The flat file typically contains over 400,000 rows (denormalized order detail lines with product and category information joined). The key difference: .select() immediately discards columns and returns a new DataFrame with only those columns, equivalent to SELECT category_name, product_name FROM ... in SQL. .with_columns() adds new columns while keeping existing ones; the two approaches produce different results if you’re also trying to remove columns. Use .select() when you only want specific columns; use .with_columns() when you’re adding computed columns to existing data.
3. Adding a revenue column and filtering:
beverages_revenue.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
result = df.with_columns(
revenue=pl.col("unit_price") * pl.col("quantity") * (1 - pl.col("discount"))
).select(
"order_id",
"category_name",
"product_name",
"unit_price",
"quantity",
"discount",
"revenue",
).filter(
(pl.col("category_name") == "Beverages") & (pl.col("revenue") > 100)
).sort("revenue", descending=True)
print(result)This chains three operations: add the revenue computed column, select only the desired columns, filter on category and revenue threshold, and sort descending. The parentheses around each filter condition are critical in Polars.
19.4 Polars Expressions
The core idea in Polars is the expression. An expression is a declarative description of a computation on columns. Instead of writing imperative code that loops through rows, you describe what you want computed, and Polars figures out how to do it efficiently.
19.4.1 Column Expressions
The simplest expression is a column reference:
col_expression.py
import polars as pl
print(pl.col("price")) # Reference to the "price" column
print(pl.col("price") * pl.col("stock")) # Element-wise multiplication
print(pl.col("price") * 1.06) # Scalar multiplication (tax)
These expressions don’t do anything on their own. They describe computations that execute when passed to .select(), .filter(), .with_columns(), or .agg().
19.4.2 String Operations
Polars provides string operations through the .str namespace:
string_ops.py
import polars as pl
print(pl.col("product_name").str.to_lowercase())
print(pl.col("product_name").str.starts_with("Ch"))
print(pl.col("product_name").str.contains("sauce", literal=True))
print(pl.col("product_name").str.len_chars())
These correspond to SQL string functions like LOWER(), STARTS_WITH(), and CONTAINS() that you used in the Databases module.
19.4.3 Conditional Expressions
Polars’ when/then/otherwise is the equivalent of SQL’s CASE WHEN:
conditional.py
import polars as pl
# SQL: CASE WHEN price > 50 THEN 'Expensive'
# WHEN price > 20 THEN 'Moderate'
# ELSE 'Budget' END
price_tier = (
pl.when(pl.col("price") > 50).then(pl.lit("Expensive"))
.when(pl.col("price") > 20).then(pl.lit("Moderate"))
.otherwise(pl.lit("Budget"))
)
print(price_tier)
The pl.lit() function wraps a literal value into an expression. It’s needed because the .then() and .otherwise() methods expect expressions, not bare Python values.
19.4.4 Why Expressions Compose
Expressions chain naturally, just like method calls on strings in Python or function composition in SQL:
chaining.py
import polars as pl
df = pl.DataFrame({
"product": ["Chai", "Chang", "Tofu"],
"price": [18.00, 19.00, 23.25],
"stock": [39, 17, 35],
})
result = df.select(
pl.col("product"),
revenue=pl.col("price") * pl.col("stock"),
price_rank=pl.col("price").rank(descending=True).cast(pl.Int32),
is_premium=pl.col("price") > 20,
)
print(result)
Each expression describes a column in the output. Polars evaluates all of them in parallel where possible, which is one reason it’s fast.
19.4.5 Exercises
1. Load the Northwind flat file and use .select() with pl.col() expressions to compute: the product_name from the original data, the unit_price, a new column price_with_tax (price multiplied by 1.08 for 8 percent sales tax using pl.lit(1.08)), and a column called price_category that is “Premium” if unit_price > 50, “Mid-range” if unit_price > 20, and “Budget” otherwise. Sort the result by unit_price descending. Show the first 5 rows.
2. Using the flat file, create a new column called order_value (calculated as unit_price * quantity). Then use .select() to add a column order_value_rank that ranks products within each category by their order value (highest to lowest). Hint: use .rank(descending=True) on the expression. Select only category_name, product_name, order_value, and order_value_rank, sorted by category and rank. Compare this to the SQL window function approach from Chapter 10, where you’d need RANK() OVER (PARTITION BY ... ORDER BY ...). Polars handles the same logic with .rank() on an expression, no OVER clause needed.
3. Take the flat file and build a Polars expression chain that: (a) filters to orders placed in 2023 or later, (b) adds a column discount_amount (unit_price * quantity * discount), (c) adds a column net_revenue (unit_price * quantity * (1 - discount)), (d) sorts by order_date descending, and (e) selects only order_date, product_name, quantity, discount_amount, and net_revenue. Write this as a single chain of method calls (like the example in Section 19.4). How many rows result from the 2023+ filter?
1. Creating computed columns with conditional logic:
price_categories.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
result = df.select(
"product_name",
"unit_price",
price_with_tax=pl.col("unit_price") * pl.lit(1.08),
price_category=pl.when(pl.col("unit_price") > 50).then(pl.lit("Premium"))
.when(pl.col("unit_price") > 20).then(pl.lit("Mid-range"))
.otherwise(pl.lit("Budget")),
).sort("unit_price", descending=True)
print(result.head(5))
Notice the pl.when().then().otherwise() syntax, which maps to SQL’s CASE WHEN. Each branch wraps values in pl.lit() because the branches expect expressions, not bare Python literals.
2. Ranking within groups using expressions:
category_ranking.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
result = df.with_columns(
order_value=pl.col("unit_price") * pl.col("quantity"),
).with_columns(
order_value_rank=pl.col("order_value").rank(descending=True).over("category_name")
).select(
"category_name",
"product_name",
"order_value",
"order_value_rank",
).sort("category_name", "order_value_rank")
print(result)
The .over("category_name") modifier tells Polars to rank within each category separately. In SQL, this would require a window function: RANK() OVER (PARTITION BY category_name ORDER BY order_value DESC).
3. Chaining multiple transformations:
2023_chain.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
result = (
df
.filter(pl.col("order_date") >= pl.lit("2023-01-01").cast(pl.Date))
.with_columns(
discount_amount=pl.col("unit_price") * pl.col("quantity") * pl.col("discount"),
net_revenue=pl.col("unit_price") * pl.col("quantity") * (1 - pl.col("discount")),
)
.sort("order_date", descending=True)
.select("order_date", "product_name", "quantity", "discount_amount", "net_revenue")
)
print(f"Rows from 2023 onwards: {len(result)}")
print(result.head())
This demonstrates method chaining, where each operation returns a new DataFrame that the next operation works on. The parentheses around the entire chain improve readability for multi-line operations. Note that the date comparison requires converting the string literal to a pl.Date type for the comparison to work correctly.
19.5 Aggregation and Grouping
The .group_by().agg() pattern is Polars’ equivalent of SQL’s GROUP BY:
group_by.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
category_summary = df.group_by("category_name").agg(
pl.col("unit_price").mean().alias("avg_price"),
pl.col("order_id").n_unique().alias("order_count"),
(pl.col("unit_price") * pl.col("quantity")).sum().alias("total_revenue"),
)
print(category_summary.sort("total_revenue", descending=True))
Compare this with the equivalent SQL, which you already know:
equivalent.sql
SELECT
category_name,
AVG(unit_price) AS avg_price,
COUNT(DISTINCT order_id) AS order_count,
SUM(unit_price * quantity) AS total_revenue
FROM northwind_flat
GROUP BY category_name
ORDER BY total_revenue DESC
The operations are identical: group the data by category, compute aggregations within each group, and sort the results. The syntax differs, but the mental model is the same. If you can write the SQL, you can write the Polars, and vice versa.
19.5.1 Multiple Aggregations
You can compute as many aggregations as you need in a single .agg() call:
multi_agg.py
import polars as pl
df = pl.read_parquet("data/northwind_flat.parquet")
result = df.group_by("category_name").agg(
pl.col("unit_price").min().alias("min_price"),
pl.col("unit_price").max().alias("max_price"),
pl.col("unit_price").mean().alias("avg_price"),
pl.col("unit_price").std().alias("std_price"),
pl.len().alias("row_count"),
)
print(result)
The pl.len() expression counts rows in each group, equivalent to SQL’s COUNT(*).
19.5.2 When to Aggregate in SQL vs. Polars
Both DuckDB and Polars can perform aggregations efficiently. Here are guidelines for choosing:
Use SQL when the aggregation involves joins across multiple tables. SQL’s join syntax is more concise and the database can optimize the join order. Also use SQL when you want to aggregate before pulling data into Python, reducing the amount of data transferred.
Use Polars when you’re building on top of data already in a DataFrame, when you need Python-specific logic in the aggregation (custom functions, conditional computations), or when you’re chaining multiple transformations that are natural in Polars’ expression API.
19.6 Polars and DuckDB: Best Friends
Polars and DuckDB integrate so tightly that you can use either tool to query the other’s data with essentially zero cost.
19.6.1 DuckDB Querying Polars DataFrames
DuckDB can query a Polars DataFrame directly, as if it were a table:
duckdb_queries_polars.py
import duckdb
import polars as pl
# Create a Polars DataFrame
products = pl.DataFrame({
"name": ["Chai", "Chang", "Tofu", "Miso"],
"category": ["Beverages", "Beverages", "Produce", "Condiments"],
"price": [18.00, 19.00, 23.25, 13.00],
})
# Query the DataFrame with SQL (note: use the variable name as the table name)
result = duckdb.sql("""
SELECT category, AVG(price) AS avg_price
FROM products
GROUP BY category
ORDER BY avg_price DESC
""")
print(result)
DuckDB recognizes the Python variable products as a DataFrame and queries it directly. There’s no data copying because DuckDB and Polars share the same underlying memory format (Apache Arrow). You’ll learn more about why this works in Chapter 25.
19.6.2 Polars DataFrames from DuckDB Results
Going the other direction, the .pl() method converts a DuckDB result to a Polars DataFrame:
duckdb_to_polars.py
import duckdb
import polars as pl
conn = duckdb.connect("data/northwind.duckdb", read_only=True)
# DuckDB handles the join, Polars gets the result
df = conn.sql("""
SELECT
p.product_name,
c.category_name,
p.unit_price,
p.units_in_stock
FROM products AS p
JOIN categories AS c ON p.category_id = c.category_id
""").pl()
# Now use Polars for further transformations
enriched = df.with_columns(
stock_value=pl.col("unit_price") * pl.col("units_in_stock"),
price_tier=pl.when(pl.col("unit_price") > 50).then(pl.lit("Premium"))
.when(pl.col("unit_price") > 20).then(pl.lit("Mid-range"))
.otherwise(pl.lit("Budget")),
)
print(enriched)This pattern, SQL for joins, Polars for transformations, is the recommended workflow for this book. It uses each tool for what it does best.
19.7 The Honest Comparison: Polars vs. SQL
Now that you’ve seen both tools, let’s be direct about the trade-offs.
For analytical workloads, DuckDB’s SQL and Polars are roughly equivalent in performance. Both are columnar, both are vectorized, and both optimize query execution. The choice often comes down to preference and context.
Arguments for SQL. SQL is universal. It works with any database, not just DuckDB. If your future job involves PostgreSQL, SQL Server, or Snowflake, the SQL you learned in the Databases module transfers directly. Polars expressions don’t. SQL is also more concise for complex joins: a five-table join in SQL is a few lines, while the equivalent Polars code is longer and harder to read. And everyone in the data industry knows SQL. It’s the common language.
Arguments for Polars. Polars code lives in Python, which means your editor can lint it, type-check it, and provide autocomplete. If you make a typo in a SQL string, you don’t find out until runtime. If you make a typo in a Polars expression, your editor highlights it immediately. Polars also composes naturally with the rest of your Python code: conditionals, loops, functions, and modules all work normally around Polars operations. And Polars’ lazy evaluation lets you build complex pipelines that the optimizer can rearrange for efficiency.
The pragmatic answer. Use SQL for data retrieval and joins, because SQL was designed for exactly this. Use Polars when you need Python-native transformations, derived columns, or integration with Python libraries. In a Marimo notebook, the natural split is SQL cells for pulling and shaping data, and Python cells for further transformation, visualization, and export. Don’t spend time arguing about which is “better.” The best tool is the one that makes your code readable and your workflow efficient.
19.8 Putting It Together: Northwind Analysis
Here’s a complete analysis that uses SQL for data retrieval and Polars for transformation:
northwind_analysis.py
"""Northwind category analysis using SQL + Polars."""
import duckdb
import polars as pl
conn = duckdb.connect("data/northwind.duckdb", read_only=True)
# Step 1: SQL handles the joins and initial aggregation
monthly_revenue = conn.sql("""
SELECT
c.category_name,
DATE_TRUNC('month', o.order_date) AS order_month,
ROUND(SUM(od.unit_price * od.quantity * (1 - od.discount)), 2) AS revenue,
COUNT(DISTINCT o.order_id) AS order_count
FROM order_details AS od
JOIN orders AS o ON od.order_id = o.order_id
JOIN products AS p ON od.product_id = p.product_id
JOIN categories AS c ON p.category_id = c.category_id
GROUP BY c.category_name, DATE_TRUNC('month', o.order_date)
ORDER BY c.category_name, order_month
""").pl()
# Step 2: Polars handles derived computations
enriched = monthly_revenue.with_columns(
avg_order_value=pl.col("revenue") / pl.col("order_count"),
).with_columns(
# Month-over-month growth rate within each category
revenue_growth=pl.col("revenue").pct_change().over("category_name"),
)
# Step 3: Polars handles filtering and final summary
latest_month = enriched["order_month"].max()
latest_data = enriched.filter(pl.col("order_month") == latest_month)
print(f"Latest month: {latest_month}")
print(latest_data.select(
"category_name",
"revenue",
"order_count",
"avg_order_value",
"revenue_growth",
).sort("revenue", descending=True))
conn.close()
This script demonstrates the complete pattern: SQL retrieves and joins the data, Polars computes derived columns (average order value, month-over-month growth), and the results are ready for visualization (Chapter 20) or export (Chapter 21).
Exercises
Exploring the Flat File
Load the Northwind denormalized Parquet file (data/northwind_flat.parquet) into a Polars DataFrame. Use .shape, .schema, .describe(), and .head() to understand its structure. The file has 46 columns, so focus your summary on the key dimension columns (categories, customers, products) and metric columns (prices, quantities, discounts). Compare this structure with the normalized database schema you learned in the Databases module, noting which tables were joined to produce this flat file.
SQL to Polars Translation
Rewrite these three queries as Polars expressions, using the denormalized flat file (data/northwind_flat.parquet) as the data source:
- All products in the “Beverages” category with price above $15, sorted by price descending. (Hint: filter on category_name and unit_price.)
- The number of distinct products per category.
- The top 5 customers by total revenue, where revenue is SUM(unit_price * quantity * (1 - discount)), grouped by company_name.
For each, write both a SQL version (using conn.sql() against the DuckDB database) and a Polars version (using the DataFrame API on the flat file). Note which approach feels more natural for each task.
Mixed Workflow Notebook
Build a Marimo notebook that uses SQL cells for the joins and initial data retrieval, and Polars cells for transformations. The notebook should answer: “Which product categories had declining revenue in the most recent quarter compared to the previous quarter?” Start by determining the date range of orders in the database (inspect MIN(order_date) and MAX(order_date)) so you know which is the most recent complete quarter and the one before it. Then use SQL to compute quarterly revenue (as SUM(unit_price * quantity * (1 - discount))) by category for those two quarters, Polars to compute the quarter-over-quarter change, and Polars filtering to identify declining categories.
Performance Comparison
Using the Northwind dataset (200,000 orders, 631,000 order detail lines), time these operations in both Polars and DuckDB SQL: count of distinct orders by category, average unit_price by category, and total revenue (SUM(unit_price * quantity * (1 - discount))) by month. For the DuckDB SQL versions, these require joins across order_details, products, categories, and orders. For the Polars versions, load the flat file. Use Python’s time module to measure execution time. Write your findings as comments: are the performance differences meaningful for this dataset size?
Summary
DataFrames are tables that live in Python’s memory, bridging the gap between SQL query results and Python computation. Polars provides a modern, fast, and expressive API for working with DataFrames, with an expression system that maps naturally to the SQL concepts you already know: .select() is SELECT, .filter() is WHERE, .group_by().agg() is GROUP BY, and .sort() is ORDER BY.
The integration between Polars and DuckDB is seamless. DuckDB results convert to Polars DataFrames with .pl() at effectively zero cost, and DuckDB can query Polars DataFrames as if they were database tables. This interoperability lets you use SQL for what it does best (joins, aggregations, data retrieval) and Polars for what it does best (derived computations, Python integration, pipeline composition).
In the next chapter, you’ll add visualization to this workflow. The DataFrames you build with SQL and Polars will become the data source for Altair charts that communicate your findings visually.
Glossary
- aggregation
  Computing a summary value from multiple rows: sum, mean, count, min, max. In Polars, performed with .agg() after .group_by().
- alias
  Renaming a column in the output. In Polars, .alias("new_name"). In SQL, AS new_name.
- column expression
  A Polars expression that references or transforms a column. Created with pl.col("name") and chainable with operations like .mean(), .sum(), .alias().
- DataFrame
  A two-dimensional data structure with named columns and typed values. The Python equivalent of a database table or spreadsheet.
- eager evaluation
  Executing each operation immediately as it’s called. The default mode for most Polars operations.
- expression
  A declarative description of a computation in Polars. Expressions describe what to compute; Polars determines how to compute it efficiently.
- lazy evaluation
  Deferring execution until results are explicitly requested with .collect(). Allows Polars to optimize the entire computation plan.
- Polars
  A modern DataFrame library for Python, built on Apache Arrow. Designed for speed, correctness, and expressiveness.
- Series
  A single column of a Polars DataFrame. Has a name and a consistent data type.
- vectorized operation
  An operation applied to an entire column at once, rather than element by element in a loop. Vectorized operations are significantly faster than loops.
- zero-copy
  Sharing data between tools without duplicating it in memory. DuckDB and Polars achieve zero-copy interchange through their shared Apache Arrow format.