Skip to main content

Canonical Layer

The canonical layer creates a single source of truth for your business entities. It harmonizes data from multiple sources into unified, reliable datasets that represent core business concepts.

Example Folder Structure

models
└── canonical
├── orders
│ ├── orders.sql
│ └── orders.yml
├── customers
│ ├── customers.sql
│ └── customers.yml
└── products
├── products.sql
└── products.yml

Structure Best Practices

  • ✅ Entity-Based Structure ~ Group by business entity (orders, customers, etc.)
  • ✅ File Naming ~ Use clear, simple entity names (customers, products, etc.)
  • ✅ Single Source of Truth ~ One canonical model per business entity

Purpose

Canonical models serve to:

  • Unify data from multiple sources
  • Resolve conflicts between sources
  • Establish standard definitions
  • Create consistent business entities

Example Scenarios

ScenarioPurposeExample
Multiple SourcesCombine order data from web and retail systemsOrders from Shopify and POS
Conflicting DataResolve different customer details across systemsCustomer details from CRM and ecommerce
Data StandardsEstablish consistent product categorizationUnified product hierarchy

Example Model

-- models/canonical/customers/customers.sql
WITH crm_customers AS (
SELECT * FROM {{ ref('stg_salesforce__customers') }}
),

web_customers AS (
SELECT * FROM {{ ref('stg_shopify__customers') }}
),

unified AS (
SELECT
COALESCE(crm.customer_id, web.customer_id) as customer_id,
-- Prefer CRM data over web data
COALESCE(crm.email, web.email) as email,
COALESCE(crm.name, web.name) as customer_name,
crm.customer_segment,
web.first_order_date,
-- Additional fields and logic
FROM crm_customers crm
FULL OUTER JOIN web_customers web
ON crm.email = web.email
)

SELECT * FROM unified

Materialization

Canonical models should be materialized as tables for optimal performance:

# dbt_project.yml
models:
your_project:
canonical:
+materialized: table
Why Tables?

Frequently accessed by downstream models Complex transformations best stored as tables Serve as stable reference points Better query performance

Documentation and Testing

Thorough documentation and testing are crucial for canonical models:

# models/canonical/customers/customers.yml
version: 2

models:
- name: customers
description: Single source of truth for customer data
columns:
- name: customer_id
description: Unique identifier for customers
tests:
- unique
- not_null
- name: email
description: Primary customer email
tests:
- not_null
- unique

tests:
- dbt_utils.equal_rowcount:
compare_model: ref('stg_salesforce__customers')

Best Practices

Data Quality

  • Implement comprehensive testing
  • Monitor data freshness
  • Track source system changes
  • Document business rules

Performance

  • Index key columns
  • Partition large tables
  • Optimize join conditions
  • Schedule appropriate refresh times

Governance

  • Document data lineage
  • Define clear ownership
  • Maintain version control
  • Monitor usage patterns
Common Pitfalls
  • Creating canonical models without clear business need
  • Insufficient testing of source data integration
  • Poor documentation of business rules
  • Lack of stakeholder alignment
Success Criteria

A good canonical model should:

  1. Have clear, unique business entity definition
  2. Resolve conflicts consistently
  3. Be well-documented and tested
  4. Serve as a reliable reference
  5. Maintain data quality standards