Introduction

Imagine you're building a house. You could buy a pre-fabricated home from a single vendor, but what if you want the freedom to choose your own materials, design your own floor plan, and expand whenever you need? That's the philosophy behind combining DuckLake and SQLMesh - you're building a data infrastructure with best-in-class open-source components that give you control.

DuckLake brings lakehouse capabilities to DuckDB, providing ACID transactions and time travel on your data lake. SQLMesh adds transformation management with incremental processing. Together, they create a powerful, cost-effective lakehouse solution. 

Check out the DuckDB SQLMesh documentation for additional details.

Part 1: Understanding the Architecture

Think of our architecture as having three distinct layers:

The Storage Layer (DuckLake): This is where your actual data lives. DuckLake stores data as Parquet files while maintaining a SQL-accessible catalog with ACID guarantees. It's like having a filing system that's both highly organized and tamper-proof.

The Transformation Layer (SQLMesh): This orchestrates how data moves and transforms through your pipeline. SQLMesh tracks what's been processed, manages dependencies, and ensures data quality. Think of it as your project manager, keeping track of what needs to be done and in what order.

The State Layer (Local DuckDB): This is SQLMesh's notebook where it keeps track of its work. Since DuckLake can't be updated (only appended to), we need a separate place for SQLMesh to maintain its records.

Part 2: Prerequisites and Environment Setup

System Requirements

First, ensure you have a recent Python 3 release installed (3.9 or higher is a safe baseline for current SQLMesh versions). You can check this by running:

python --version

If you need to install Python, visit python.org and follow the installation instructions for your operating system.

Creating Your Project Structure

Let's create a well-organized project structure. Each directory has a specific purpose:

# Create the main project directory
mkdir ducklake-sqlmesh-tutorial
cd ducklake-sqlmesh-tutorial

# Create subdirectories for different components
# On macOS/Linux:
mkdir -p models seeds tests audits macros scripts data/ducklake

# On Windows PowerShell (alternative):
New-Item -ItemType Directory -Force -Path models, seeds, tests, audits, macros, scripts, data/ducklake

Here's what each directory will contain:

  • models/: Your SQLMesh model definitions (transformations)
  • seeds/: Static CSV files for initial data
  • tests/: Data quality tests
  • audits/: Data validation rules
  • macros/: Reusable code snippets
  • scripts/: Setup and utility scripts
  • data/ducklake/: Where DuckLake will store Parquet files

Setting Up a Python Virtual Environment

Virtual environments keep your project dependencies isolated. Think of it as creating a clean workspace for this specific project:

# Create a virtual environment
python -m venv venv

# Activate it
# On macOS/Linux:
source venv/bin/activate

# On Windows:
venv\Scripts\activate

# You should see (venv) at the beginning of your command prompt

Installing Required Packages

Now let's install SQLMesh and DuckDB. I'll explain what each package does:

# Install SQLMesh with DuckDB support
pip install "sqlmesh[lsp,duckdb]"

# Install or upgrade DuckDB (DuckLake is installed later as a DuckDB extension)
pip install --upgrade "duckdb>=1.3.0"

SQLMesh is our transformation framework - it will manage how data flows through our pipeline. DuckDB is our compute engine; DuckLake plugs into it as an extension that adds the lakehouse features.
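
If you want to confirm the installation before going further, a quick sanity check like the one below should succeed. It only assumes the duckdb package is importable and that the DuckLake extension can be downloaded (the file name check_install.py is just a suggestion):

# check_install.py - optional sanity check of the DuckDB + DuckLake install
import duckdb

print("DuckDB version:", duckdb.__version__)
con = duckdb.connect()               # throwaway in-memory database
con.execute("INSTALL ducklake;")     # downloads the extension if it isn't cached yet
con.execute("LOAD ducklake;")        # loads it into this session
print("DuckLake extension installed and loaded successfully")
con.close()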

Part 3: Configuring the Integration

This is the most critical part of our setup. We need to configure SQLMesh to work around DuckLake's limitation with UPDATE operations. Let me explain each configuration option in detail.

Creating the Configuration File

Create a file named config.yaml in your project root:

# config.yaml - Minimal configuration for SQLMesh with DuckDB
gateways:
  dev:
    connection:
      type: duckdb
      database: tutorial.duckdb
    state_connection:
      type: duckdb
      database: dev_state.db

  prod:
    connection:
      type: duckdb
      database: tutorial_prod.duckdb
    state_connection:
      type: duckdb
      database: prod_state.db

default_gateway: dev

model_defaults:
  dialect: duckdb
  start: '2024-01-01'

Understanding the Configuration

Let me break down why each piece of this setup matters:

The DuckLake initialization commands: we keep config.yaml minimal and run the DuckLake bootstrap in a separate setup script (Part 4). Four statements are the secret sauce that makes everything work:

  1. INSTALL ducklake; - Downloads and installs the DuckLake extension if it's not already present
  2. LOAD ducklake; - Loads the extension into the current DuckDB session so the ducklake: ATTACH syntax is available
  3. ATTACH... - This connects to a DuckLake database, creating it if it doesn't exist. The DATA_PATH parameter tells DuckLake where to store its Parquet files
  4. USE my_ducklake; - This is crucial! It makes DuckLake the active database, so all our tables are created there, not in the default DuckDB database

The state_connection: This is our workaround for DuckLake's UPDATE limitation. SQLMesh needs to update tables to track what it's processed, but DuckLake only supports INSERT operations. By using a separate local DuckDB file for state, we give SQLMesh a place to keep its records while our actual data lives in DuckLake.
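
Because the state database is just an ordinary DuckDB file, you can open it and look around once you've run your first plan. The exact bookkeeping tables depend on your SQLMesh version, so treat this as an exploratory peek rather than a stable interface:

# inspect_state.py - peek at SQLMesh's state database (run after your first `sqlmesh plan`)
import duckdb

con = duckdb.connect('dev_state.db')  # the dev gateway's state_connection file
print(con.sql("SHOW TABLES"))         # SQLMesh's internal bookkeeping tables
con.close()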

Part 4: Initializing DuckLake

Now let's set up DuckLake properly. We'll create a script that initializes our lakehouse with the proper structure.

Creating the Setup Script

Create a file scripts/setup_ducklake.py:

# setup_ducklake.py - Initialize our DuckLake lakehouse
import duckdb
import sys

def setup_ducklake():
    """Initialize DuckLake with proper database structure"""
    try:
        # Connect to DuckDB
        print("Connecting to DuckDB...")
        conn = duckdb.connect('tutorial.duckdb')

        # Install and load DuckLake
        print("Installing DuckLake extension...")
        conn.execute("INSTALL ducklake;")
        conn.execute("LOAD ducklake;")

        # Attach DuckLake with explicit data path
        print("Creating DuckLake database...")
        conn.execute("""
            ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake
            (DATA_PATH 'data/ducklake');
        """)

        # Switch to DuckLake as active database
        conn.execute("USE my_ducklake;")

        # Create our schema structure
        print("Creating schemas...")
        schemas = ['raw', 'staging', 'analytics']
        for schema in schemas:
            conn.execute(f"CREATE SCHEMA IF NOT EXISTS {schema};")
            print(f"  ✓ Created schema: {schema}")

        # Verify everything is working
        print("\nVerifying DuckLake setup...")
        result = conn.execute("SELECT current_database();").fetchone()
        print(f"  Current database: {result[0]}")

        # Show the schemas we created
        schemas = conn.execute("""
            SELECT schema_name
            FROM information_schema.schemata
            WHERE catalog_name = 'my_ducklake'
            ORDER BY schema_name;
        """).fetchall()

        print("  Available schemas:")
        for schema in schemas:
            print(f"    - {schema[0]}")

        print("\n✅ DuckLake setup complete!")

    except Exception as e:
        print(f"\n❌ Error during setup: {e}", file=sys.stderr)
        sys.exit(1)
    finally:
        if 'conn' in locals():
            conn.close()

if __name__ == "__main__":
    setup_ducklake()

Run the setup script:

python scripts/setup_ducklake.py

You should see output confirming that DuckLake is set up with three schemas: raw, staging, and analytics. These represent the different layers of our data pipeline.

Part 5: Creating Sample Data

Let's create some realistic e-commerce data to work with. This will help us see how our pipeline processes information.

Creating the Seed Data

First, create a CSV file with sample data. Create seeds/raw_events.csv:

event_id,user_id,event_type,event_timestamp,product_id,revenue
1,101,page_view,2024-01-01 10:00:00,201,0
2,101,add_to_cart,2024-01-01 10:05:00,201,0
3,101,purchase,2024-01-01 10:10:00,201,29.99
4,102,page_view,2024-01-01 11:00:00,202,0
5,102,page_view,2024-01-01 11:02:00,203,0
6,103,page_view,2024-01-02 09:00:00,201,0
7,103,add_to_cart,2024-01-02 09:03:00,201,0
8,103,purchase,2024-01-02 09:08:00,201,29.99
9,104,page_view,2024-01-02 14:00:00,204,0
10,104,purchase,2024-01-02 14:15:00,204,49.99

This data represents user interactions with an e-commerce site. Notice how we have events spanning multiple days - this will be important when we demonstrate incremental processing.
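
If you want a quick sanity check of the seed file before wiring it into SQLMesh, you can aggregate it directly with DuckDB. Run this from the project root; at this point you should see two distinct dates:

# A quick, optional look at the seed data with DuckDB
import duckdb

print(duckdb.sql("""
    SELECT CAST(event_timestamp AS DATE) AS event_date,
           COUNT(*) AS events,
           SUM(revenue) AS revenue
    FROM read_csv_auto('seeds/raw_events.csv')
    GROUP BY 1
    ORDER BY 1
"""))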

Creating the Seed Model

Now we need to tell SQLMesh how to load this data. Create models/raw_events_seed.sql:

-- raw_events_seed.sql - Load initial event data from CSV
MODEL (
    name raw.raw_events,  -- Table will be created in the raw schema
    kind SEED (
        path '../seeds/raw_events.csv'  -- Path to our CSV file
    ),
    columns (
        event_id INT,
        user_id INT,
        event_type TEXT,
        event_timestamp TIMESTAMP,
        product_id INT,
        revenue DECIMAL(10,2)
    ),
    grain event_id  -- event_id is our unique identifier
);

The SEED model type tells SQLMesh to load data from a file rather than transform existing data. This is perfect for initial data loads or reference tables.
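
Conceptually, you can think of a SEED load as reading the CSV with the declared column types and materializing the result as a table. The sketch below is only an illustration of that idea against a throwaway in-memory database, not SQLMesh's actual implementation:

# seed_equivalent.py - a rough, hypothetical equivalent of what a SEED load produces
import duckdb

con = duckdb.connect()  # in-memory, so the real lakehouse is untouched
con.execute("""
    CREATE TABLE raw_events AS
    SELECT * FROM read_csv('seeds/raw_events.csv', header = true, columns = {
        'event_id': 'INT',
        'user_id': 'INT',
        'event_type': 'TEXT',
        'event_timestamp': 'TIMESTAMP',
        'product_id': 'INT',
        'revenue': 'DECIMAL(10,2)'
    })
""")
print(con.sql("DESCRIBE raw_events"))
con.close()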

Part 6: Building Data Transformations

Now comes the exciting part - creating transformations that turn raw data into business insights. We'll build three types of models to demonstrate different processing strategies.

Staging Model - Basic Cleaning

Create models/stg_events.sql:

-- stg_events.sql - Clean and standardize raw event data
MODEL (
    name staging.stg_events,
    kind VIEW,  -- Views are rebuilt every time they're queried
    description 'Cleaned and standardized event data'
);

SELECT
    event_id,
    user_id,
    event_type,
    event_timestamp,
    -- Convert timestamp to UTC for consistency
    event_timestamp AT TIME ZONE 'UTC' as event_timestamp_utc,
    -- Extract date for partitioning
    DATE(event_timestamp) as event_date,
    product_id,
    revenue,
    -- Add processing metadata
    CURRENT_TIMESTAMP as processed_at
FROM raw.raw_events
WHERE event_id IS NOT NULL;  -- Basic data quality filter

This staging model performs basic cleaning and standardization.

Full Refresh Model - Complete Recalculation

Create models/daily_revenue_summary.sql:

-- daily_revenue_summary.sql - Daily revenue aggregation (full refresh)
MODEL (
    name analytics.daily_revenue_summary,
    kind FULL,  -- Completely rebuilds the table each run
    cron '@daily',  -- Intended to run daily
    grain event_date  -- One row per date
);

SELECT
    event_date,
    COUNT(DISTINCT user_id) as unique_users,
    COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) as purchasers,
    COUNT(*) as total_events,
    COUNT(CASE WHEN event_type = 'purchase' THEN 1 END) as purchases,
    COALESCE(SUM(CASE WHEN event_type = 'purchase' THEN revenue END), 0) as total_revenue,
    -- Calculate conversion rate
    CASE
        WHEN COUNT(DISTINCT user_id) > 0
        THEN COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) * 100.0 / COUNT(DISTINCT user_id)
        ELSE 0
    END as conversion_rate
FROM staging.stg_events
GROUP BY event_date
ORDER BY event_date;

This FULL model recalculates everything from scratch each time it runs. It's like taking a fresh inventory count - simple and reliable, but not efficient for large datasets.

Incremental Model - Smart Processing

Now for the star of the show - an incremental model that only processes new data:

Create models/user_activity_incremental.sql:

-- user_activity_incremental.sql - Track user activity incrementally
MODEL (
    name analytics.user_activity_incremental,
    kind INCREMENTAL_BY_TIME_RANGE (
        time_column event_date  -- The date column SQLMesh uses to slice intervals
    ),
    start '2024-01-01',  -- Start processing from this date
    cron '@daily',  -- Process new data daily
    grain (user_id, event_date),  -- Unique key combination
    audits (
        not_null(columns := (user_id, event_date, events_count)),
        forall(criteria := (events_count > 0))  -- Ensure positive counts
    )
);

WITH user_daily_activity AS (
    SELECT
        user_id,
        DATE(event_timestamp_utc) as event_date,
        COUNT(*) as events_count,
        COUNT(DISTINCT event_type) as unique_event_types,
        SUM(CASE WHEN event_type = 'purchase' THEN revenue ELSE 0 END) as daily_revenue,
        MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as made_purchase,
        MIN(event_timestamp_utc) as first_event_time,
        MAX(event_timestamp_utc) as last_event_time
    FROM staging.stg_events
    WHERE
        -- SQLMesh replaces @start_ds and @end_ds with the interval's date boundaries (inclusive)
        DATE(event_timestamp_utc) BETWEEN @start_ds AND @end_ds
    GROUP BY user_id, DATE(event_timestamp_utc)
)
SELECT
    user_id,
    event_date,  -- This must be included since it's our time_column
    events_count,
    unique_event_types,
    daily_revenue,
    made_purchase,
    first_event_time,
    last_event_time,
    -- Calculate session duration for the day
    EXTRACT(EPOCH FROM (last_event_time - first_event_time)) / 60 as session_duration_minutes,
    -- Calculate running metrics over the rows processed in this run
    SUM(daily_revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as cumulative_revenue,
    ROW_NUMBER() OVER (
        PARTITION BY user_id
        ORDER BY event_date
    ) as days_active
FROM user_daily_activity;

This incremental model is where SQLMesh truly shines. The @start_ds and @end_ds variables are dynamically replaced with time boundaries, so the model only processes new time periods. It's like having a smart assistant who knows exactly what work has already been done and only processes what's new.
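
To build intuition for what a single run processes, here's a small standalone sketch: the same kind of aggregation, restricted to one day's window, run directly against the seed CSV. The hard-coded boundaries stand in for the values SQLMesh substitutes for @start_ds and @end_ds:

# incremental_window_demo.py - illustrate one interval's worth of processing
import duckdb

start_ds, end_ds = '2024-01-02', '2024-01-02'  # stand-ins for @start_ds / @end_ds
print(duckdb.sql(f"""
    SELECT user_id,
           CAST(event_timestamp AS DATE) AS event_date,
           COUNT(*) AS events_count
    FROM read_csv_auto('seeds/raw_events.csv')
    WHERE CAST(event_timestamp AS DATE) BETWEEN DATE '{start_ds}' AND DATE '{end_ds}'
    GROUP BY 1, 2
    ORDER BY 1
"""))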

Part 7: Running Your Pipeline

Now let's bring our pipeline to life! SQLMesh uses a two-step process: planning and running.

Initialize SQLMesh

First, initialize SQLMesh in your project with DuckDB as the engine:

sqlmesh init duckdb

This command does several important things:

  • Creates the necessary SQLMesh directories if they don't exist
  • Sets up DuckDB as your default SQL dialect
  • Prepares the project structure for DuckDB-specific features

Why specify 'duckdb'? SQLMesh supports multiple SQL engines (Snowflake, BigQuery, PostgreSQL, etc.), each with different SQL syntax and features. By specifying duckdb, SQLMesh:

  • Sets the correct SQL dialect for parsing your models
  • Configures DuckDB-specific optimizations
  • Ensures proper type mappings for DuckDB data types
  • Validates your SQL against DuckDB's syntax rules

If you forget to specify the engine, SQLMesh might default to a generic SQL dialect that could cause syntax errors or miss DuckDB-specific optimizations.

Plan Your Changes

The plan command is like a preview - it shows you what SQLMesh intends to do without actually doing it:

sqlmesh plan dev

You'll see output showing:

  • New models detected (our seed, staging, and analytics models)
  • What tables will be created
  • What data will be processed

When prompted "Apply - Backfill and Promote All Changes?", type y and press Enter.

Understanding What Just Happened

SQLMesh has now:

  1. Created tables in DuckLake for each model
  2. Loaded the CSV data into raw.raw_events
  3. Created the staging view
  4. Calculated the daily revenue summary
  5. Processed the incremental user activity data

Let's verify everything worked:

# verify_pipeline.py - check that the pipeline populated DuckLake
import duckdb

# Connect to our DuckLake database
conn = duckdb.connect('tutorial.duckdb')
conn.execute("LOAD ducklake;")
conn.execute("ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake;")
conn.execute("USE my_ducklake;")

# Check our tables (dev environments get a __dev suffix on schema names)
print("=== Tables in DuckLake ===")
tables = conn.execute("""
    SELECT table_schema, table_name
    FROM information_schema.tables
    WHERE table_catalog = 'my_ducklake'
    AND table_schema IN ('raw', 'staging', 'analytics', 'raw__dev', 'staging__dev', 'analytics__dev')
    ORDER BY table_schema, table_name;
""").fetchall()

for schema, table in tables:
    count = conn.execute(f"SELECT COUNT(*) FROM {schema}.{table}").fetchone()[0]
    print(f"{schema}.{table}: {count} rows")

# Check daily revenue in the dev environment
print("\n=== Daily Revenue Summary ===")
revenue_rows = conn.execute("""
    SELECT * FROM analytics__dev.daily_revenue_summary
    ORDER BY event_date;
""").fetchall()

for date, users, purchasers, events, purchases, revenue, rate in revenue_rows:
    print(f"{date}: {users} users, ${revenue:.2f} revenue, {rate:.1f}% conversion")

conn.close()

Run this verification script:

python verify_pipeline.py

Part 8: Demonstrating Incremental Processing

This is where the magic happens. Let's add new data and see how SQLMesh intelligently processes only what's new.

Adding New Data

Append this data to your seeds/raw_events.csv file:

11,105,page_view,2024-01-03 10:00:00,201,0
12,105,add_to_cart,2024-01-03 10:05:00,201,0
13,105,purchase,2024-01-03 10:10:00,201,29.99
14,106,page_view,2024-01-03 11:00:00,205,0
15,106,purchase,2024-01-03 11:30:00,205,99.99

Rerun the Pipeline

Now let's process the new data:

# Plan the changes
sqlmesh plan dev

# SQLMesh detects the change to the seed data and shows which models need
# to be backfilled. When prompted, type y to apply the changes.
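
After the backfill completes, you can confirm that only the new date was processed by counting rows per day in the incremental table. This assumes the dev environment created the analytics__dev schema, as in the verification script above:

# check_incremental.py - confirm the 2024-01-03 data landed in the dev environment
import duckdb

conn = duckdb.connect('tutorial.duckdb')
conn.execute("LOAD ducklake;")
conn.execute("ATTACH 'ducklake:dev_metadata.ducklake' AS my_ducklake;")
conn.execute("USE my_ducklake;")
print(conn.sql("""
    SELECT event_date, COUNT(*) AS user_days, SUM(daily_revenue) AS revenue
    FROM analytics__dev.user_activity_incremental
    GROUP BY event_date
    ORDER BY event_date
"""))
conn.close()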

Part 9: Testing and Quality Assurance

Let's add tests to ensure our pipeline produces accurate results.

Creating Tests

Create tests/test_daily_revenue_summary.yaml. SQLMesh unit tests are YAML files that feed fixed input rows to a model and compare its query output against the rows you expect. This one exercises our daily revenue model with three hand-built events:

test_daily_revenue_summary:
  model: analytics.daily_revenue_summary
  inputs:
    staging.stg_events:
      rows:
      - user_id: 101
        event_type: page_view
        event_date: 2024-01-01
        revenue: 0
      - user_id: 101
        event_type: purchase
        event_date: 2024-01-01
        revenue: 29.99
      - user_id: 102
        event_type: page_view
        event_date: 2024-01-01
        revenue: 0
  outputs:
    query:
      rows:
      - event_date: 2024-01-01
        unique_users: 2
        purchasers: 1
        total_events: 3
        purchases: 1
        total_revenue: 29.99
        conversion_rate: 50.0

Run the tests:

sqlmesh test

If the test passes, sqlmesh test reports a successful run, which confirms the aggregation logic produces the expected results.

Part 10: Troubleshooting Common Issues

Let me address some common problems you might encounter and how to solve them.

Issue 1: "DuckLake extension not found"

If you see this error, your DuckDB version is probably too old for the ducklake extension. Upgrade to a recent release (1.3.0 or later) and retry:

pip install --upgrade "duckdb>=1.3.0"

Issue 2: "Models created in wrong database"

This happens if the USE my_ducklake; statement never ran, so tables land in the default DuckDB catalog instead of DuckLake. Re-run the setup script from Part 4 and confirm all four initialization commands (INSTALL, LOAD, ATTACH, USE) complete without errors.

Issue 3: "Test can't find tables"

Remember that in development, SQLMesh adds a suffix to schema names: your analytics schema becomes analytics__dev (a double underscore plus the environment name). Always use the environment-specific schema name when querying a development environment directly.

Issue 4: "State database locked"

This can happen if you have multiple SQLMesh processes running. Ensure you've closed all SQLMesh commands before running new ones.

Part 11: Production Considerations

As you move toward production, consider these enhancements:

Using Cloud Storage

DuckLake can store data in S3 or other cloud storage:

ATTACH 'ducklake:metadata.ducklake' AS my_ducklake
(DATA_PATH 's3://my-bucket/ducklake-data/');
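
Writing to S3 also requires DuckDB's httpfs extension and credentials. One way to provide them is DuckDB's secrets manager, shown here as a sketch with placeholder values (the database name and key values are hypothetical; adapt them to your own auth setup):

# s3_attach_sketch.py - hypothetical: configure S3 access before attaching DuckLake
import duckdb

conn = duckdb.connect('production.duckdb')  # placeholder database name
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")
conn.execute("INSTALL ducklake;")
conn.execute("LOAD ducklake;")
conn.execute("""
    CREATE SECRET (
        TYPE S3,
        KEY_ID 'YOUR_ACCESS_KEY_ID',
        SECRET 'YOUR_SECRET_ACCESS_KEY',
        REGION 'us-east-1'
    );
""")
conn.execute("""
    ATTACH 'ducklake:metadata.ducklake' AS my_ducklake
    (DATA_PATH 's3://my-bucket/ducklake-data/');
""")
conn.close()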

Monitoring and Alerting

Add monitoring to track pipeline health:

-- Create a monitoring model
MODEL (
    name analytics.pipeline_health,
    kind FULL
);

SELECT
    CURRENT_TIMESTAMP as check_time,
    (SELECT COUNT(*) FROM analytics.daily_revenue_summary) as revenue_rows,
    (SELECT MAX(event_date) FROM analytics.daily_revenue_summary) as latest_date,
    DATEDIFF('day',
        (SELECT MAX(event_date) FROM analytics.daily_revenue_summary),
        CURRENT_DATE
    ) as days_behind;

Conclusion: Your Journey Forward

Congratulations! You've successfully built a modern data lakehouse using DuckLake and SQLMesh. You've learned:

  1. How to configure SQLMesh to work around DuckLake's UPDATE limitations
  2. How to build different types of models (SEED, VIEW, FULL, INCREMENTAL)
  3. How incremental processing dramatically improves efficiency
  4. How to test and validate your data transformations
  5. How to troubleshoot common issues

This foundation enables you to build sophisticated data pipelines that scale efficiently. As your data grows from gigabytes to terabytes, the incremental processing patterns you've learned will save tremendous amounts of time and compute resources. As DuckLake evolves and becomes compatible with other table formats, like Apache Iceberg, its utility will continue to expand. The future is exciting!
