DuckLake brings lakehouse capabilities to DuckDB, providing ACID transactions and time travel on your data lake. SQLMesh adds sophisticated transformation management with incremental processing. Together, they create a powerful, cost-effective alternative to commercial data platforms.
Introduction
Imagine you're building a house. You could buy a pre-fabricated home from a single vendor, but what if you want the freedom to choose your own materials, design your own floor plan, and expand whenever you need? That's the philosophy behind combining DuckLake and SQLMesh - you're building a data infrastructure with best-in-class open-source components that give you control.
DuckLake brings lakehouse capabilities to DuckDB, providing ACID transactions and time travel on your data lake. SQLMesh adds transformation management with incremental processing. Together, they create a powerful, cost-effective lakehouse solution.
Check out the DuckDB SQLMesh documentation for additional details.
Part 1: Understanding the Architecture
Think of our architecture as having three distinct layers:
The Storage Layer (DuckLake): This is where your actual data lives. DuckLake stores data as Parquet files while maintaining a SQL-accessible catalog with ACID guarantees. It's like having a filing system that's both highly organized and tamper-proof.
The Transformation Layer (SQLMesh): This orchestrates how data moves and transforms through your pipeline. SQLMesh tracks what's been processed, manages dependencies, and ensures data quality. Think of it as your project manager, keeping track of what needs to be done and in what order.
The State Layer (Local DuckDB): This is SQLMesh's notebook where it keeps track of its work. Since DuckLake isn't built for the frequent small updates SQLMesh makes to its bookkeeping tables, we need a separate place for SQLMesh to maintain its records.
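To make these layers concrete, here's a minimal sketch, using the file names we'll create later in this tutorial (tutorial.duckdb, dev_metadata.ducklake, data/ducklake, dev_state.db), of how each layer shows up on disk. Don't run this yet - Parts 2 through 4 walk through the real setup step by step:
# layers_sketch.py - a preview of how the three layers map to concrete files
import duckdb

# Storage layer: a DuckLake catalog plus Parquet files under data/ducklake/
lake = duckdb.connect('tutorial.duckdb')
lake.execute("INSTALL ducklake;")
lake.execute("LOAD ducklake;")
lake.execute("ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake (DATA_PATH 'data/ducklake');")

# State layer: an ordinary local DuckDB file that SQLMesh can update freely
state = duckdb.connect('dev_state.db')

# Transformation layer: SQLMesh itself, driven by config.yaml and the models/ directory
lake.close()
state.close()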
Part 2: Prerequisites and Environment Setup
System Requirements
First, ensure you have Python 3.8 or higher installed. You can check this by running:
python --version
If you need to install Python, visit python.org and follow the installation instructions for your operating system.
Creating Your Project Structure
Let's create a well-organized project structure. Each directory has a specific purpose:
# Create the main project directory
mkdir ducklake-sqlmesh-tutorial
cd ducklake-sqlmesh-tutorial
# Create subdirectories for different components
# On macOS/Linux:
mkdir -p models seeds tests audits macros scripts data/ducklake
# On Windows PowerShell (alternative):
New-Item -ItemType Directory -Force -Path models, seeds, tests, audits, macros, scripts, data/ducklake
Here's what each directory will contain:
- models/: Your SQLMesh model definitions (transformations)
- seeds/: Static CSV files for initial data
- tests/: Data quality tests
- audits/: Data validation rules
- macros/: Reusable code snippets
- scripts/: Setup and utility scripts
- data/ducklake/: Where DuckLake will store Parquet files
Setting Up a Python Virtual Environment
Virtual environments keep your project dependencies isolated. Think of it as creating a clean workspace for this specific project:
# Create a virtual environment
python -m venv venv
# Activate it
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# You should see (venv) at the beginning of your command prompt
Installing Required Packages
Now let's install SQLMesh and DuckDB. I'll explain what each package does:
# Install SQLMesh with DuckDB support
pip install "sqlmesh[lsp,duckdb]"
# Install DuckDB (this includes DuckLake as an extension)
pip install "duckdb>=0.10.3"
SQLMesh is our transformation framework - it will manage how data flows through our pipeline. DuckDB is our compute engine, and it includes DuckLake as an extension that adds lakehouse features.
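Before moving on, it's worth confirming both packages import cleanly. A quick sanity check you can run from the activated virtual environment:
# check_install.py - confirm the packages are importable and report their versions
from importlib.metadata import version

import duckdb

print("duckdb:", duckdb.__version__)     # should be 1.3.0 or newer for DuckLake support
print("sqlmesh:", version("sqlmesh"))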
Part 3: Configuring the Integration
This is the most critical part of our setup. We need to configure SQLMesh to keep its internal state outside DuckLake, which isn't built for the frequent UPDATE operations that state tracking requires. Let me explain each configuration option in detail.
Creating the Configuration File
Create a file named config.yaml in your project root:
# config.yaml - Minimal configuration for SQLMesh with DuckDB
gateways:
  dev:
    connection:
      type: duckdb
      database: tutorial.duckdb
    state_connection:
      type: duckdb
      database: dev_state.db
  prod:
    connection:
      type: duckdb
      database: tutorial_prod.duckdb
    state_connection:
      type: duckdb
      database: prod_state.db

default_gateway: dev

model_defaults:
  dialect: duckdb
  start: '2024-01-01'
Understanding the Configuration
Two pieces make this setup work: the DuckLake initialization commands (which the setup script in Part 4 runs) and the state_connection entries in config.yaml.
The initialization commands: These four statements are the secret sauce:
- INSTALL ducklake; - Downloads and installs the DuckLake extension if it's not already present
- LOAD ducklake; - Loads the extension into memory so the current session can use it
- ATTACH ... - Connects to a DuckLake database, creating it if it doesn't exist. The DATA_PATH parameter tells DuckLake where to store its Parquet files
- USE my_ducklake; - This is crucial! It makes DuckLake the active database, so our tables are created there, not in the default DuckDB database
The state_connection: This is our workaround for DuckLake's update limitations. SQLMesh constantly updates small bookkeeping tables to track what it has processed - a workload DuckLake's Parquet-backed storage isn't designed for. By pointing state_connection at a separate local DuckDB file, we give SQLMesh a place to keep its records while our actual data lives in DuckLake.
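If you're curious, the state database is nothing exotic: it's an ordinary DuckDB file. Once you've run your first plan in Part 7, you can open it read-only and list whatever bookkeeping tables SQLMesh has created (the table names are SQLMesh internals and vary by version). A minimal sketch:
# peek_state.py - inspect SQLMesh's state database (run after your first plan)
import duckdb

state = duckdb.connect('dev_state.db', read_only=True)
for (name,) in state.execute("SHOW TABLES;").fetchall():
    print(name)
state.close()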
Part 4: Initializing DuckLake
Now let's set up DuckLake properly. We'll create a script that initializes our lakehouse with the proper structure.
Creating the Setup Script
Create a file scripts/setup_ducklake.py:
# setup_ducklake.py - Initialize our DuckLake lakehouse
import duckdb
import sys

def setup_ducklake():
    """Initialize DuckLake with proper database structure"""
    try:
        # Connect to DuckDB
        print("Connecting to DuckDB...")
        conn = duckdb.connect('tutorial.duckdb')

        # Install and load DuckLake
        print("Installing DuckLake extension...")
        conn.execute("INSTALL ducklake;")
        conn.execute("LOAD ducklake;")

        # Attach DuckLake with explicit data path
        print("Creating DuckLake database...")
        conn.execute("""
            ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake
            (DATA_PATH 'data/ducklake');
        """)

        # Switch to DuckLake as active database
        conn.execute("USE my_ducklake;")

        # Create our schema structure
        print("Creating schemas...")
        schemas = ['raw', 'staging', 'analytics']
        for schema in schemas:
            conn.execute(f"CREATE SCHEMA IF NOT EXISTS {schema};")
            print(f"  ✓ Created schema: {schema}")

        # Verify everything is working
        print("\nVerifying DuckLake setup...")
        result = conn.execute("SELECT current_database();").fetchone()
        print(f"  Current database: {result[0]}")

        # Show the schemas we created
        schemas = conn.execute("""
            SELECT schema_name
            FROM information_schema.schemata
            WHERE catalog_name = 'my_ducklake'
            ORDER BY schema_name;
        """).fetchall()
        print("  Available schemas:")
        for schema in schemas:
            print(f"    - {schema[0]}")

        print("\n✅ DuckLake setup complete!")

    except Exception as e:
        print(f"\n❌ Error during setup: {e}", file=sys.stderr)
        sys.exit(1)
    finally:
        if 'conn' in locals():
            conn.close()

if __name__ == "__main__":
    setup_ducklake()
Run the setup script:
python scripts/setup_ducklake.py
You should see output confirming that DuckLake is set up with three schemas: raw, staging, and analytics. These represent the different layers of our data pipeline.
Part 5: Creating Sample Data
Let's create some realistic e-commerce data to work with. This will help us see how our pipeline processes information.
Creating the Seed Data
First, create a CSV file with sample data. Create seeds/raw_events.csv:
event_id,user_id,event_type,event_timestamp,product_id,revenue
1,101,page_view,2024-01-01 10:00:00,201,0
2,101,add_to_cart,2024-01-01 10:05:00,201,0
3,101,purchase,2024-01-01 10:10:00,201,29.99
4,102,page_view,2024-01-01 11:00:00,202,0
5,102,page_view,2024-01-01 11:02:00,203,0
6,103,page_view,2024-01-02 09:00:00,201,0
7,103,add_to_cart,2024-01-02 09:03:00,201,0
8,103,purchase,2024-01-02 09:08:00,201,29.99
9,104,page_view,2024-01-02 14:00:00,204,0
10,104,purchase,2024-01-02 14:15:00,204,49.99
This data represents user interactions with an e-commerce site. Notice how we have events spanning multiple days - this will be important when we demonstrate incremental processing.
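You can already query the CSV directly with DuckDB to see how events are spread across days - the same per-day boundaries the incremental model will process later. An optional quick look:
# peek_events.py - count seed events per day straight from the CSV
import duckdb

print(duckdb.sql("""
    SELECT DATE(event_timestamp) AS event_date, COUNT(*) AS events
    FROM 'seeds/raw_events.csv'
    GROUP BY 1
    ORDER BY 1;
"""))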
Creating the Seed Model
Now we need to tell SQLMesh how to load this data. Create models/raw_events_seed.sql:
-- raw_events_seed.sql - Load initial event data from CSV
MODEL (
name raw.raw_events, -- Table will be created in the raw schema
kind SEED (
path '../seeds/raw_events.csv' -- Path to our CSV file
),
columns (
event_id INT,
user_id INT,
event_type TEXT,
event_timestamp TIMESTAMP,
product_id INT,
revenue DECIMAL(10,2)
),
grain event_id -- event_id is our unique identifier
);
The SEED model type tells SQLMesh to load data from a file rather than transform existing data. This is perfect for initial data loads or reference tables.
Part 6: Building Data Transformations
Now comes the exciting part - creating transformations that turn raw data into business insights. We'll build three types of models to demonstrate different processing strategies.
Staging Model - Basic Cleaning
Create models/stg_events.sql:
-- stg_events.sql - Clean and standardize raw event data
MODEL (
name staging.stg_events,
kind VIEW, -- Views are rebuilt every time they're queried
description 'Cleaned and standardized event data'
);
SELECT
event_id,
user_id,
event_type,
event_timestamp,
-- Convert timestamp to UTC for consistency
event_timestamp AT TIME ZONE 'UTC' as event_timestamp_utc,
-- Extract date for partitioning
DATE(event_timestamp) as event_date,
product_id,
revenue,
-- Add processing metadata
CURRENT_TIMESTAMP as processed_at
FROM raw.raw_events
WHERE event_id IS NOT NULL -- Basic data quality filter
This staging model performs basic cleaning and standardization.
Full Refresh Model - Complete Recalculation
Create models/daily_revenue_summary.sql:
-- daily_revenue_summary.sql - Daily revenue aggregation (full refresh)
MODEL (
name analytics.daily_revenue_summary,
kind FULL, -- Completely rebuilds the table each run
cron '@daily', -- Intended to run daily
grain event_date -- One row per date
);
SELECT
event_date,
COUNT(DISTINCT user_id) as unique_users,
COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as purchasers,
COUNT(*) as total_events,
COUNT(CASE WHEN event_type = 'purchase' THEN 1 END) as purchases,
COALESCE(SUM(CASE WHEN event_type = 'purchase' THEN revenue END), 0) as total_revenue,
-- Calculate conversion rate
CASE
WHEN COUNT(DISTINCT user_id) > 0
THEN COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) * 100.0 / COUNT(DISTINCT user_id)
ELSE 0
END as conversion_rate
FROM staging.stg_events
GROUP BY event_date
ORDER BY event_date;
This FULL model recalculates everything from scratch each time it runs. It's like taking a fresh inventory count - simple and reliable, but not efficient for large datasets.
Incremental Model - Smart Processing
Now for the star of the show - an incremental model that only processes new data:
Create models/user_activity_incremental.sql:
-- user_activity_incremental.sql - Track user activity incrementally
MODEL (
name analytics.user_activity_incremental,
kind INCREMENTAL_BY_TIME_RANGE (
time_column event_date -- Each incremental run is bounded by event date
),
start '2024-01-01', -- Start processing from this date
cron '@daily', -- Process new data daily
grain (user_id, event_date), -- Unique key combination
audits (
NOT_NULL(columns = (user_id, event_date, events_count)),
FORALL(criteria = (events_count > 0)) -- Ensure positive counts
)
);
WITH user_daily_activity AS (
SELECT
user_id,
DATE(event_timestamp_utc) as event_date,
COUNT(*) as events_count,
COUNT(DISTINCT event_type) as unique_event_types,
SUM(CASE WHEN event_type = 'purchase' THEN revenue ELSE 0 END) as daily_revenue,
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as made_purchase,
MIN(event_timestamp_utc) as first_event_time,
MAX(event_timestamp_utc) as last_event_time
FROM staging.stg_events
WHERE
-- SQLMesh will replace @start_ds and @end_ds with date boundaries
DATE(event_timestamp_utc) >= @start_ds
AND DATE(event_timestamp_utc) < @end_ds
GROUP BY user_id, DATE(event_timestamp_utc)
)
SELECT
user_id,
event_date, -- This must be included since it's our time_column
events_count,
unique_event_types,
daily_revenue,
made_purchase,
first_event_time,
last_event_time,
-- Calculate session duration for the day
EXTRACT(EPOCH FROM (last_event_time - first_event_time)) / 60 as session_duration_minutes,
-- Calculate cumulative metrics
SUM(daily_revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_revenue,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY event_date
) as days_active
FROM user_daily_activity;
This incremental model is where SQLMesh truly shines. The @start_ds and @end_ds variables are dynamically replaced with time boundaries, so the model only processes new time periods. It's like having a smart assistant who knows exactly what work has already been done and only processes what's new.
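To see the substitution for yourself, you can ask SQLMesh to render the model for a specific interval. Here's a sketch using SQLMesh's Python Context API (the sqlmesh render CLI command gives the same view; the exact API surface can vary slightly between SQLMesh versions):
# render_incremental.py - show the SQL SQLMesh would run for a single day's interval
from sqlmesh import Context

context = Context(paths=".")
query = context.render(
    "analytics.user_activity_incremental",
    start="2024-01-03",
    end="2024-01-03",
)
# In the printed SQL, @start_ds and @end_ds have been replaced with concrete dates
print(query.sql(dialect="duckdb", pretty=True))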
Part 7: Running Your Pipeline
Now let's bring our pipeline to life! SQLMesh uses a two-step process: planning and running.
Initialize SQLMesh
First, initialize SQLMesh in your project with DuckDB as the engine:
sqlmesh init duckdb
This command does several important things:
- Creates the necessary SQLMesh directories if they don't exist
- Sets up DuckDB as your default SQL dialect
- Prepares the project structure for DuckDB-specific features
Why specify 'duckdb'? SQLMesh supports multiple SQL engines (Snowflake, BigQuery, PostgreSQL, etc.), each with different SQL syntax and features. By specifying duckdb, SQLMesh:
- Sets the correct SQL dialect for parsing your models
- Configures DuckDB-specific optimizations
- Ensures proper type mappings for DuckDB data types
- Validates your SQL against DuckDB's syntax rules
If you forget to specify the engine, SQLMesh might default to a generic SQL dialect that could cause syntax errors or miss DuckDB-specific optimizations.
Plan Your Changes
The plan command is like a preview - it shows you what SQLMesh intends to do without actually doing it:
sqlmesh plan dev
You'll see output showing:
- New models detected (our seed, staging, and analytics models)
- What tables will be created
- What data will be processed
When prompted "Apply - Backfill and Promote All Changes?", type y
and press Enter.
Understanding What Just Happened
SQLMesh has now:
- Created tables in DuckLake for each model
- Loaded the CSV data into raw.raw_events
- Created the staging view
- Calculated the daily revenue summary
- Processed the incremental user activity data
Let's verify everything worked:
# verify_pipeline.py - confirm SQLMesh created and populated our tables
import duckdb

# Connect to our DuckLake database
conn = duckdb.connect('tutorial.duckdb')
conn.execute("LOAD ducklake;")
conn.execute("ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake;")
conn.execute("USE my_ducklake;")

# Check our tables (dev environment schemas carry the __dev suffix)
print("=== Tables in DuckLake ===")
tables = conn.execute("""
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_catalog = 'my_ducklake'
      AND table_schema IN ('raw__dev', 'staging__dev', 'analytics__dev')
    ORDER BY table_schema, table_name;
""").fetchall()
for schema, table in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {schema}.{table}").fetchone()[0]
    print(f"{schema}.{table}: {count} rows")

# Check daily revenue
print("\n=== Daily Revenue Summary ===")
revenue_rows = conn.execute("""
    SELECT * FROM analytics__dev.daily_revenue_summary
    ORDER BY event_date;
""").fetchall()
for row in revenue_rows:
    date, users, purchasers, events, purchases, total_revenue, rate = row
    print(f"{date}: {users} users, ${total_revenue:.2f} revenue, {rate:.1f}% conversion")

conn.close()
Run this verification script:
python verify_pipeline.py
Part 8: Demonstrating Incremental Processing
This is where the magic happens. Let's add new data and see how SQLMesh intelligently processes only what's new.
Adding New Data
Append this data to your seeds/raw_events.csv file:
11,105,page_view,2024-01-03 10:00:00,201,0
12,105,add_to_cart,2024-01-03 10:05:00,201,0
13,105,purchase,2024-01-03 10:10:00,201,29.99
14,106,page_view,2024-01-03 11:00:00,205,0
15,106,purchase,2024-01-03 11:30:00,205,99.99
Rerun the Pipeline
Now let's process the new data:
# Plan the changes
sqlmesh plan dev
# You'll see that SQLMesh detects the new seed rows and plans a backfill for them
# Apply the changes by answering y when prompted
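Once the plan is applied, you can confirm the 2024-01-03 activity landed in the incremental model without older days being reprocessed. A minimal check (this assumes the dev environment schema is named analytics__dev, as discussed in Part 10):
# check_increment.py - confirm the new day's activity reached the incremental model
import duckdb

conn = duckdb.connect('tutorial.duckdb')
conn.execute("LOAD ducklake;")
conn.execute("ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake;")
conn.execute("USE my_ducklake;")

rows = conn.execute("""
    SELECT event_date, COUNT(*) AS active_users, SUM(daily_revenue) AS revenue
    FROM analytics__dev.user_activity_incremental
    GROUP BY event_date
    ORDER BY event_date;
""").fetchall()
for event_date, active_users, revenue in rows:
    print(f"{event_date}: {active_users} active users, ${revenue:.2f} revenue")
conn.close()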
Part 9: Testing and Quality Assurance
Let's add tests to ensure our pipeline produces accurate results.
Creating Tests
SQLMesh unit tests are written in YAML. Create tests/test_daily_revenue_summary.yaml, which feeds a few rows into staging.stg_events and asserts exactly what analytics.daily_revenue_summary should produce from them:
test_daily_revenue_summary:
  model: analytics.daily_revenue_summary
  inputs:
    staging.stg_events:
      rows:
        - user_id: 101
          event_type: page_view
          event_date: 2024-01-01
          revenue: 0
        - user_id: 101
          event_type: purchase
          event_date: 2024-01-01
          revenue: 29.99
        - user_id: 102
          event_type: page_view
          event_date: 2024-01-01
          revenue: 0
  outputs:
    query:
      rows:
        - event_date: 2024-01-01
          unique_users: 2
          purchasers: 1
          total_events: 3
          purchases: 1
          total_revenue: 29.99
          conversion_rate: 50.0
Run the tests:
sqlmesh test
If the test passes, sqlmesh test reports it like a successful unit test run, confirming your model logic produces the expected output.
Part 10: Troubleshooting Common Issues
Let me address some common problems you might encounter and how to solve them.
Issue 1: "DuckLake extension not found"
If you see this error, your DuckDB version might be too old. Ensure you have DuckDB 1.3.0 or later:
pip install --upgrade "duckdb>=1.3.0"
Issue 2: "Models created in wrong database"
This happens if the USE my_ducklake;
command is missing from your configuration. Check your config.yaml
and ensure all four init_sql
commands are present.
Issue 3: "Test can't find tables"
Remember that in development, SQLMesh adds suffixes to schema names. Your analytics schema becomes analytics_dev
. Always use environment-specific names in tests.
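If you're not sure which schemas exist in your environment, you can list them with a few lines of Python (same connection pattern as our earlier scripts):
# list_schemas.py - show the environment-suffixed schemas SQLMesh created
import duckdb

conn = duckdb.connect('tutorial.duckdb')
conn.execute("LOAD ducklake;")
conn.execute("ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake;")
for (schema,) in conn.execute(
    "SELECT schema_name FROM information_schema.schemata "
    "WHERE catalog_name = 'my_ducklake' ORDER BY schema_name;"
).fetchall():
    print(schema)
conn.close()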
Issue 4: "State database locked"
This can happen if you have multiple SQLMesh processes running. Ensure you've closed all SQLMesh commands before running new ones.
Part 11: Production Considerations
As you move toward production, consider these enhancements:
Using Cloud Storage
DuckLake can store data in S3 or other cloud storage:
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake
(DATA_PATH 's3://my-bucket/ducklake-data/');
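When DATA_PATH points at S3, DuckDB also needs credentials, for example through its secrets manager. Here's a rough sketch from Python - it reuses the bucket path from the snippet above, and the key values are placeholders you'd replace with your own:
# attach_s3_ducklake.py - sketch of attaching a DuckLake catalog backed by S3
import duckdb

conn = duckdb.connect('tutorial_prod.duckdb')
conn.execute("INSTALL ducklake;")
conn.execute("LOAD ducklake;")
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

# Placeholder credentials - swap in your own keys and region
conn.execute("""
    CREATE SECRET (
        TYPE s3,
        KEY_ID 'YOUR_ACCESS_KEY_ID',
        SECRET 'YOUR_SECRET_ACCESS_KEY',
        REGION 'us-east-1'
    );
""")

conn.execute("""
    ATTACH 'ducklake:metadata.ducklake' AS my_ducklake
    (DATA_PATH 's3://my-bucket/ducklake-data/');
""")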
Monitoring and Alerting
Add monitoring to track pipeline health:
-- Create a monitoring model
MODEL (
name analytics.pipeline_health,
kind FULL
);
SELECT
CURRENT_TIMESTAMP as check_time,
(SELECT COUNT(*) FROM analytics.daily_revenue_summary) as revenue_rows,
(SELECT MAX(event_date) FROM analytics.daily_revenue_summary) as latest_date,
DATEDIFF('day',
(SELECT MAX(event_date) FROM analytics.daily_revenue_summary),
CURRENT_DATE
) as days_behind
Conclusion: Your Journey Forward
Congratulations! You've successfully built a modern data lakehouse using DuckLake and SQLMesh. You've learned:
- How to configure SQLMesh to keep its state outside DuckLake, which isn't built for frequent updates
- How to build different types of models (SEED, VIEW, FULL, INCREMENTAL)
- How incremental processing dramatically improves efficiency
- How to test and validate your data transformations
- How to troubleshoot common issues
This foundation enables you to build sophisticated data pipelines that scale efficiently. As your data grows from gigabytes to terabytes, the incremental processing patterns you've learned will save tremendous amounts of time and compute resources. As DuckLake evolves and becomes compatible with other table formats, like Apache Iceberg, its utility will continue to expand. The future is exciting!