The open skill registry and execution toolkit that gives AI agents the operating procedures to define, plan, implement, validate, replay, and ship reliable data products with the same discipline used by strong data engineering teams.
Not generic prompts. Operating procedures for building reliable data products.
Skills cover ingestion, transformation, orchestration, streaming, lakehouse, warehousing, governance, quality, legacy modernization, release management, incident recovery, and platform operating models. Each skill is a structured SKILL.md with progressive disclosure for agents.
Vendor-neutral skills adapt to AWS, Azure, GCP, Databricks, Snowflake, Alibaba Cloud, and Apache stacks through platform-specific presets.
Source contracts, dataset contracts, metric contracts, and compliance controls defined before implementation. No guessing grain, freshness, or behavior.
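To make "defined before implementation" concrete, here is a minimal sketch of a dataset contract and a completeness check. The field names (`name`, `owner`, `grain`, `freshness_sla_minutes`, `columns`) and the helper `missing_contract_fields` are assumptions chosen for illustration, not the toolkit's actual contract schema.

```python
# Hypothetical minimum field set; the toolkit's real templates may differ.
REQUIRED_FIELDS = ("name", "owner", "grain", "freshness_sla_minutes", "columns")


def missing_contract_fields(contract: dict) -> list[str]:
    """Return required keys that are absent or empty, in declaration order."""
    return [k for k in REQUIRED_FIELDS if contract.get(k) in (None, "", [], {})]


# A contract that states grain and freshness explicitly.
orders_contract = {
    "name": "orders_daily",
    "owner": "data-platform",
    "grain": ["order_id"],
    "freshness_sla_minutes": 60,
    "columns": {"order_id": "string", "amount": "decimal(18,2)"},
}

# A draft that leaves grain empty and omits everything else.
draft_contract = {"name": "orders_daily", "grain": []}

complete = missing_contract_fields(orders_contract)
gaps = missing_contract_fields(draft_contract)
print(complete)
print(gaps)
```

A check like this would run before any pipeline code is written, so an empty grain or a missing freshness target blocks the work instead of surfacing later as a data bug.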
Idempotent, replayable, backfill-safe patterns are the default. Backfill guards, schema change guards, and release gates built in.
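One common way to get idempotent, replayable loads is to overwrite whole partitions instead of appending rows. The sketch below uses an in-memory stand-in for a table. `PartitionedTable`, `backfill`, and the `max_partitions` guard are hypothetical names for illustration, not part of the toolkit.

```python
class PartitionedTable:
    """In-memory stand-in for a partitioned table (illustrative only)."""

    def __init__(self):
        self.partitions = {}

    def overwrite_partition(self, partition_key, rows):
        # Replace the whole partition, so rerunning a load gives the same state.
        self.partitions[partition_key] = list(rows)

    def row_count(self):
        return sum(len(rows) for rows in self.partitions.values())


def backfill(table, extract, dates, max_partitions=31):
    """Reload each date as a full partition overwrite, refusing oversized ranges."""
    dates = list(dates)
    if len(dates) > max_partitions:
        # Backfill guard: a large range needs an explicit, reviewed plan.
        raise ValueError(f"backfill of {len(dates)} partitions exceeds {max_partitions}")
    for day in dates:
        table.overwrite_partition(day, extract(day))


def extract(day):
    # Assumed deterministic source read: three rows per day.
    return [{"day": day, "order_id": i} for i in range(3)]


table = PartitionedTable()
days = ["2024-01-01", "2024-01-02"]
backfill(table, extract, days)
backfill(table, extract, days)  # replaying the same range does not duplicate rows
print(table.row_count())
```

An append-based load would double the row count on the second run; the overwrite keeps it stable, which is what makes a replay safe.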
Hooks cover session start detection, contract validation, schema safety checks, cost analysis, backfill guards, pipeline review gates, and release readiness checks, so the repository behaves like a workflow system, not a static library. Run hooks before risky operations to catch problems early.
13 end-to-end projects with specs, plans, tasks, smoke tests, rollback demos, and proof paths.
Machine-readable YAML templates for contracts, backfill plans, schema changes, release gates, and incident runbooks.
Opinionated bundles for AWS Lakehouse, Databricks Medallion, Streaming, CI/CD, Privacy, ESG, and more.
A spec-first lifecycle that maps to the entire data product delivery journey.
First-class plugin support for the editors your team already uses.
A full command-palette installer for the VS Code family. Install the complete toolkit, core pack, agent adapters, starter packs, MCP templates, or runnable examples with a single command.
Native plugin for IntelliJ-platform IDEs. Install the data engineering skill toolkit directly into your JetBrains workflow with tool-window integration and project-level configuration.
Native adapters and install surfaces for all major AI coding assistants and IDEs.
.vsix extension
.cursor/rules/
CLAUDE.md + plugin
copilot-instructions
.gemini/commands/
.kiro/steering/
.windsurfrules
.opencode/
AGENTS.md
Plugin .zip
Vendor-neutral skills adapt to your stack through platform-specific presets.
AWS: S3, Glue, Athena, EMR, Lake Formation
Azure: ADLS, Synapse, Purview, Data Factory
GCP: BigQuery, Dataflow, Pub/Sub, Dataplex
Databricks: Delta Lake, Unity Catalog, Auto Loader
Snowflake: Streams, Tasks, Dynamic Tables
Alibaba Cloud: MaxCompute, DataWorks, OSS
Batch & streaming processing
Stream processing engine
Workflow orchestration
Event streaming platform
Open table format
Enterprise data integration
Data integration & quality
Hybrid & cross-cloud patterns
Pre-flight checks that make the repository behave like a workflow system.
Detects repo type, recommends preset, starter pack, and next command
Validates contract files contain minimum required fields
Checks for tests, contracts, rollback notes, lineage, quality gates
Switches session into incident-response posture
Safety questions and validation gates before replay or cutover
Detects risky schema evolution: breaking renames, drops, destructive refreshes
Flags cost hotspots in SQL, dbt, Spark, and warehouse projects
Validates reconciliation, publish control, observability, and rollback evidence
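As an illustration of what a pre-flight schema check from the list above might look like, the sketch below compares two column-to-type mappings and reports drops, likely renames, and type changes. The function name `risky_schema_changes` and the rename heuristic (a dropped column paired with an added column of the same type) are assumptions, not the toolkit's actual hook logic.

```python
def risky_schema_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare column->type mappings and list changes that can break consumers."""
    findings = []
    dropped = [col for col in old if col not in new]
    added = [col for col in new if col not in old]

    for col in dropped:
        # Heuristic: a dropped column plus an added column of the same type
        # is reported as a possible rename rather than a plain drop.
        candidates = [a for a in added if new[a] == old[col]]
        if candidates:
            findings.append(f"possible rename: {col} -> {candidates[0]}")
        else:
            findings.append(f"dropped column: {col}")

    for col in old:
        if col in new and new[col] != old[col]:
            findings.append(f"type change: {col} {old[col]} -> {new[col]}")

    return findings


current = {"id": "bigint", "amt": "double", "note": "string"}
proposed = {"id": "string", "amount": "double"}

findings = risky_schema_changes(current, proposed)
for finding in findings:
    print(finding)
```

Purely additive changes produce no findings here, which matches the usual convention that adding a nullable column is safe while renames, drops, and type changes need review.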
Complete project packs with specs, plans, tasks, smoke tests, and proof paths.
Full lakehouse adoption with zone architecture, normalization jobs, reconciliation checks, and rollback demonstrations.
Bronze-to-silver transformation with Auto Loader, contract validation, and rollback-aware proof paths.
Analytics modeling with DuckDB local execution, semantic layer governance, and quality gates.
Stream processing with replay safety, schema-aware pipelines, and incident recovery patterns.
Managed GCP stream-to-analytics path with observability and SLA management.
Warehouse-to-operational serving with publishing contracts and data sharing patterns.
GDPR-compliant deletion propagation with audit evidence and lineage tracking.
Platform migration with dual-run reconciliation, consumer cutover, and rollback paths.
Staged platform release flow with shadow validation, canary checks, and gated rollout.
Governed sustainability and regulatory reporting with evidence collection and regional controls.
Release readiness through proof: validation, testcase inventory, and security controls.
Feature consistency between training and serving with quality checks and observability.
From ingestion to governance, from streaming to compliance, from legacy to cloud-native.
API/SaaS ingestion, CDC, Debezium, file/SFTP feeds, partner sources, source reliability, and extraction resilience.
Kafka, Flink, Beam, Pub/Sub, Kinesis, schema registries, replay-safe streaming, and ClickHouse real-time analytics.
Delta Lake, Iceberg, Hudi, zone architecture, medallion patterns, dbt, semantic layers, and cost optimization.
PII, PCI, HIPAA, GDPR, regional sovereignty, ESG reporting, audit evidence, and platform-native governance.
Observability, SLA management, incident triage, disaster recovery, resiliency testing, and failure injection.
ETL/ELT strategy, mainframe offload, enterprise ETL modernization for Informatica, Talend, DataStage, and SSIS.
Python pipeline packaging, Scala JVM data jobs, and Java integration services with idiomatic patterns.
Glue Data Catalog, Lake Formation, Unity Catalog, Microsoft Purview, Dataplex, OpenMetadata, and DataHub.
CI/CD release management, Terraform infrastructure, platform operating models, golden paths, and service ownership.
Give your AI agents the discipline of a strong data engineering team. Open source, MIT licensed, production-ready.