Quality Control with dbt-yaml-guardrails


Fusion compatibility was what pushed my own dbt work into the YAML weeds: predictable structure, sane defaults, fewer surprises when the authoring surface shrinks. The same pressures are landing on vanilla dbt as well—newer Core releases are moving in a similar direction, and dbt Fusion (and the engine behind it) will eventually enforce accepted key lists themselves. Already in dbt Core 1.10+, you get deprecation warnings that nudge projects towards that world.

So yes, this story can start with Fusion—but dbt-yaml-guardrails is deliberately useful whether or not Fusion is on your roadmap. Treat it as a way to tighten property YAML generally, especially around meta: which keys exist, whether they’re required or forbidden, which string values count as valid, and the same kinds of knobs for config and tags depending on how you wire the hooks.

What I tried first

Across the ecosystem, dbt-checkpoint, dbt-autofix, and dbt-project-evaluator each overlap a little with “keep the project healthy.” Checkpoint and project-evaluator are powerful, but they assume a dbt run—manifests or the database—which is appropriate for their scope but heavier than I wanted for quick local and CI feedback on edited YAML alone. dbt-autofix is excellent for deprecation-driven refactors towards 1.10+ and Fusion-style authoring, yet it’s opinionated by design: great when you want the official migration path, less so when your team wants granular, per-team rules (especially on meta) spelt out as separate small checks.

Nothing quite matched “lightweight hooks, configurable allowlists and accepted values, no artifacts required,” so I built dbt-yaml-guardrails: a Python package exposing pre-commit hooks that operate on property YAML shape and conventions. It ships Fusion-oriented defaults where they matter, but is just as usable as a hygiene layer on classic Core-centric projects that want to tighten up before the engine does.

Behaviour and scope are spelt out under specs/ (start with specs/README.md and specs/project-spec.md if you dive in). specs/scope.md makes the Fusion-first stance explicit—while still carving out “plain YAML linting” as in-bounds—and HOOKS.md lists each shipped hook (*-allowed-keys, *-allowed-meta-keys, *-meta-accepted-values, and the rest).

What it looks like in practice

Concretely, it’s just pre-commit wiring: another repo: block in .pre-commit-config.yaml, a rev tag so everyone runs the same release, then one id: per concern so failures point at something you can grep for. The listing below follows the README’s example usage and layers in --fix-legacy-yaml; see the note that follows for what that actually does. Treat the rev: as illustrative and bump it when you next pin dependencies.

Optional legacy YAML rewrites: the six *-allowed-keys hook families and the *-allowed-column-keys hooks for model/seed/snapshot expose --fix-legacy-yaml (default false). Turn it on and the hook performs deterministic edits, then validation, rather than validating alone. v1 renames declaration-site tests to data_tests where dbt expects it. v2 moves top-level meta and tags on each resource entry into config.meta / config.tags (with conflict checks if config already carries those keys). There is no separate “fixer” hook id: it’s the same model-allowed-keys / model-allowed-column-keys entry points, so you opt in per stanza. catalog-allowed-keys and dbt-project-allowed-keys do not offer this flag. Typical pre-commit caveat: true can mean your commit touches more lines than you expected, the same choreography as hooks that autofix formatting.

repos:
  - repo: https://github.com/scrambldchannel/dbt-yaml-guardrails
    rev: v0.7.1
    hooks:
      # Check top-level keys only; config and column key validation delegated to hooks below
      - id: model-allowed-keys
        files: ^models/
        args:
          [
            "--required",
            "description",
            "--check-config",
            "false",
            "--check-columns",
            "false",
            "--fix-legacy-yaml",
            "true",
          ]

      # Require a description on every column entry
      - id: model-allowed-column-keys
        files: ^models/
        args: ["--required", "description", "--fix-legacy-yaml", "true"]

      # Check that model config keys are valid and forbid the use of schema and database
      - id: model-allowed-config-keys
        files: ^models/
        args: ["--forbidden", "schema,database"]

      # Check that if a model has a domain key defined under meta, it must be in a defined list
      - id: model-meta-accepted-values
        files: ^models/
        name: Accepted Domains
        alias: accepted-domains
        args:
          ["--key", "domain", "--values", "sales,hr,finance,all", "--optional"]

      # Check that a model has a required owner key under meta which matches a defined list
      - id: model-meta-accepted-values
        files: ^models/
        name: Accepted Owners
        alias: accepted-owners
        args: ["--key", "owner", "--values", "alex,annemarie,ryu,ken"]

In plain terms: model-allowed-keys only checks model-level shape (a required description, with nested config / columns checks off so the hooks below own those). With --fix-legacy-yaml, it applies rewrites first (tests → data_tests, top-level meta / tags into config), then validates. model-allowed-column-keys enforces a description on each column and runs the same fix pass. model-allowed-config-keys blocks schema / database inside config. accepted-domains and accepted-owners share a hook id and are told apart by alias: and name:; --optional applies to domains, not owners, so a model with no domain passes while a model with no owner fails.
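To make the legacy rewrite concrete, here is a rough before/after sketch of a single property file. The model name, column, and values are hypothetical, and the exact key order and whitespace the hook emits are an assumption; only the two moves themselves (tests to data_tests, top-level meta / tags into config) come from the behaviour described above.

```yaml
# Before: legacy authoring (hypothetical model, for illustration only)
models:
  - name: orders
    description: One row per order
    meta:
      owner: alex
      domain: sales
    tags: [finance]
    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null
---
# After a --fix-legacy-yaml pass (layout of the real output may differ)
models:
  - name: orders
    description: One row per order
    config:
      meta:
        owner: alex
        domain: sales
      tags: [finance]
    columns:
      - name: order_id
        description: Primary key
        data_tests:
          - unique
          - not_null
```

If config already carried meta or tags, the conflict checks mentioned earlier come into play instead of a silent merge.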

Nothing here needs dbt run first; files: ^models/ is just narrowing which paths each hook bothers with. Tune the comma-separated --values lists to whatever your team actually agrees on—you could swap sales / hr for squad names, environments, sensitivity tiers; the pattern is “small, repeatable checks rather than one giant validator that nobody dares touch.”
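As a sketch of that kind of swap, the stanza below reuses the same hook id for a sensitivity tier instead of a domain. The sensitivity key, its three values, and the alias are hypothetical; the flags are the ones already used in the listing above, and the rev: is carried over from it.

```yaml
repos:
  - repo: https://github.com/scrambldchannel/dbt-yaml-guardrails
    rev: v0.7.1
    hooks:
      # Hypothetical variant: if a model sets meta.sensitivity, it must be one of these tiers
      - id: model-meta-accepted-values
        files: ^models/
        name: Accepted Sensitivity Tiers
        alias: accepted-sensitivity
        args:
          ["--key", "sensitivity", "--values", "public,internal,restricted", "--optional"]
```

Dropping --optional would turn the same check into a requirement, mirroring how the owner stanza behaves.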

Spec-driven development

I leaned hard on spec-driven development here: behavioural intent lives in specs/ first (or grows with the code, never as an afterthought), so discussions and reviews stay anchored to documents instead of scrambling to infer rules from scattered Python. The “constitution” in specs/project-spec.md captures goals, repo layout mirrored by hook family modules, expectations for README / HOOKS.md / CHANGELOG, and how tests should line up (testing-strategy.md covers fixtures and conventions).

That structure paid off whenever I iterated quickly: agreeing on yaml-handling.md and family-specific markdown under hook-families/ meant implementation and pytest cases could chase a stable target. It also made it easier to slot in automation—when agentic coding can produce a lot of surface area fast, having written intent in one place curbs the worst kind of drift.

What I’d revisit

The trade-off is real: the hook surface I ended up with is arguably bloated. Its many small CLIs and flags buy flexibility, but they are far from the smallest possible UX. At some point I’ll probably consolidate or reshape the CLI so common patterns don’t repeat quite so visibly.

The other lesson pairs with that velocity. Models and agents make it tempting to ship feature after feature because the scaffolding comes cheap. Fast coding still deserves a deliberate product shape; otherwise “just one more hook” piles up alongside an API you’ll need to tame later. dbt-yaml-guardrails is intentionally pre-commit-first precisely to keep consumption simple even as internals stay flexibly specified.

If you’re squeezing meta into something your governance team can audit, prepping for tighter Core defaults, or heading towards Fusion, a small, YAML-only linter might deserve a slot next to heavier dbt-backed tools—they solve different slices of the same migration.