Open Source • MIT Licensed • Production-Grade

Data Engineering Agent Skills

The open skill registry and execution toolkit that gives AI agents the operating procedures to define, plan, implement, validate, replay, and ship reliable data products with the same discipline used by strong data engineering teams.

70+ Skills • 14 Presets • 13 Examples • Starter Packs • MCP Configs • 10+ Agent Surfaces

Core Capabilities

Everything You Need for Disciplined Data Delivery

Not generic prompts. Operating procedures for building reliable data products.

🎯

70+ Production-Grade Workflow Skills

Covering ingestion, transformation, orchestration, streaming, lakehouse, warehousing, governance, quality, legacy modernization, release management, incident recovery, and platform operating models. Each skill is a structured SKILL.md with progressive disclosure for agents.

☁️

14 Platform Presets

Vendor-neutral skills adapt to AWS, Azure, GCP, Databricks, Snowflake, Alibaba Cloud, and Apache stacks through platform-specific presets.

📜

Contract-First Design

Source contracts, dataset contracts, metric contracts, and compliance controls are defined before implementation, so nobody has to guess at grain, freshness, or behavior.
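
As a rough illustration of what "defined before implementation" can mean in practice, here is a minimal sketch of a dataset contract and a check that runs against a batch. This is not the toolkit's contract format: `DatasetContract`, its field names, and `check_contract` are hypothetical names invented for this example, and rows are assumed to be plain dictionaries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical contract shape; field names are illustrative only."""
    name: str
    grain: tuple              # columns that must uniquely identify a row
    required_columns: tuple   # columns every row must carry
    max_staleness: timedelta  # freshness budget for the newest record


def check_contract(contract, rows, newest_event_time, now):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []

    # Required columns: every row must carry each of them.
    for i, row in enumerate(rows):
        missing = [c for c in contract.required_columns if c not in row]
        if missing:
            violations.append(f"row {i}: missing columns {missing}")

    # Grain: no two rows may share the same key.
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(row.get(c) for c in contract.grain)
        if key in seen:
            violations.append(f"row {i}: duplicate grain key {key}")
        seen.add(key)

    # Freshness: the newest record must fall inside the staleness budget.
    if now - newest_event_time > contract.max_staleness:
        violations.append("freshness: newest record is older than max_staleness")

    return violations


# Example usage with made-up data.
orders = DatasetContract(
    name="orders_daily",
    grain=("order_id",),
    required_columns=("order_id", "amount"),
    max_staleness=timedelta(hours=24),
)
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
rows = [{"order_id": 1, "amount": 10.0}, {"order_id": 1, "amount": 12.5}]
problems = check_contract(orders, rows, now - timedelta(hours=1), now)
print(problems)
```

The point of the sketch is that grain and freshness become explicit, testable values rather than tribal knowledge; the second row above is reported as a duplicate grain key.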

🔄

Replay-Safe Execution

Idempotent, replayable, backfill-safe patterns are the default, with backfill guards, schema-change guards, and release gates built in.
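
One common way to get replay safety is to overwrite whole partitions instead of appending to them. The sketch below shows that idea only; it is not taken from the toolkit. The `write_partition` and `backfill` helpers are hypothetical, and a plain dictionary stands in for a partitioned table.

```python
def write_partition(table, partition_key, rows):
    """Replace the whole partition instead of appending to it.

    `table` is a dict standing in for a partitioned table (an assumption of
    this sketch). Re-running the same load leaves the same final state.
    """
    table[partition_key] = list(rows)  # overwrite, never extend
    return table


def backfill(table, source_by_day, days):
    """Replay a range of days; safe to re-run because each day is overwritten."""
    for day in days:
        write_partition(table, day, source_by_day.get(day, []))
    return table


# Example usage with made-up data: run the same backfill twice.
source = {"2024-01-01": [{"id": 1}], "2024-01-02": [{"id": 2}, {"id": 3}]}
table = {}
backfill(table, source, ["2024-01-01", "2024-01-02"])
first_run = {day: list(rows) for day, rows in table.items()}
backfill(table, source, ["2024-01-01", "2024-01-02"])  # replay
print(table == first_run)
```

An append-based loader would double the row count on the second run; the overwrite version ends in the same state no matter how many times it is replayed.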

🛡

Operational Hooks & Guardrails

Session start detection, contract validation, schema safety checks, cost analysis, backfill guards, pipeline review gates, and release readiness checks. The repository behaves like a workflow system, not a static library. Run hooks before risky operations to catch problems early.
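
The "run hooks before risky operations" idea can be pictured as a small pre-flight runner: every check runs first, and the operation only proceeds if none of them complains. The sketch below is an assumption about the general pattern, not the repository's hook implementation; `run_with_hooks`, `PreflightFailed`, and the two sample checks are hypothetical.

```python
class PreflightFailed(Exception):
    """Raised when at least one pre-flight check reports a problem."""


def run_with_hooks(operation, checks, context):
    """Run every check first; call `operation` only if all of them pass.

    Each check takes the context dict and returns an error message,
    or None when it passes.
    """
    failures = []
    for check in checks:
        message = check(context)
        if message is not None:
            failures.append(message)
    if failures:
        raise PreflightFailed("; ".join(failures))
    return operation(context)


# Two hypothetical checks, loosely modeled on the guardrails named in the text.
def has_rollback_notes(ctx):
    return None if ctx.get("rollback_notes") else "missing rollback notes"


def has_tests(ctx):
    return None if ctx.get("tests", 0) > 0 else "no tests found"


def deploy(ctx):
    return f"deployed {ctx['pipeline']}"


# Example usage: this context satisfies both checks, so deploy() runs.
result = run_with_hooks(
    deploy,
    [has_rollback_notes, has_tests],
    {"pipeline": "orders", "rollback_notes": "revert to v1", "tests": 4},
)
print(result)
```

Collecting all failures before raising, rather than stopping at the first one, lets the caller see every problem in a single pass.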

📊

Runnable Examples

13 end-to-end projects with specs, plans, tasks, smoke tests, rollback demos, and proof paths.

📚

Templates & Evidence

Machine-readable YAML templates for contracts, backfill plans, schema changes, release gates, and incident runbooks.

🚀

Starter Packs

Opinionated bundles for AWS Lakehouse, Databricks Medallion, Streaming, CI/CD, Privacy, ESG, and more.

Delivery Lifecycle

Seven Commands. One Discipline.

A spec-first lifecycle that maps to the entire data product delivery journey.

📜

/spec

Define

📋

/plan

Break down

🔧

/build

Implement

✅

/validate

Prove

🔍

/review

Inspect

🔄

/backfill

Replay

🚀

/ship

Release

# Quick install
git clone https://github.com/vaquarkhan/data-engineering-agent-skills.git
cd data-engineering-agent-skills
# Install into your project (all agents)
./bootstrap.sh /path/to/project auto
# Or pick specific agents
scripts/install.sh --tool cursor,claude,kiro --target ./my-project
# Run benchmarks
python benchmarks/score_benchmarks.py

IDE Plugins

Install in VS Code & JetBrains

First-class plugin support for the editors your team already uses.

VS Code Extension

A full command-palette installer for the VS Code family. Install the complete toolkit, core pack, agent adapters, starter packs, MCP templates, or runnable examples with a single command.

Works with VS Code, Cursor, Windsurf, VSCodium
Command palette installers for all asset types
Download .vsix from GitHub Releases
Marketplace publish workflow included

// VS Code Command Palette
> DE Skills: Install Full Toolkit
> DE Skills: Install Core Pack
> DE Skills: Install Agent Adapters
> DE Skills: Install Starter Pack
> DE Skills: Install MCP Templates
> DE Skills: Install Examples

JetBrains Plugin

Native plugin for IntelliJ-platform IDEs. Install the data engineering skill toolkit directly into your JetBrains workflow with tool-window integration and project-level configuration.

IntelliJ IDEA, PyCharm, DataGrip, WebStorm, GoLand
Download plugin .zip from GitHub Releases
Marketplace publish workflow ready
Install smoke tests in CI

// JetBrains Tool Window
Tools > Data Engineering Skills
  ▶ Install Full Toolkit
  ▶ Install Core Skills
  ▶ Install Platform Preset
  ▶ Install Starter Pack
  ▶ Run Session Hook

Multi-Agent Support

Works With Every AI Agent

Native adapters and install surfaces for all major AI coding assistants and IDEs.

VS Code

.vsix extension

Cursor

.cursor/rules/

Claude

CLAUDE.md + plugin

Copilot

copilot-instructions

Gemini

.gemini/commands/

Kiro

.kiro/steering/

Windsurf

.windsurfrules

OpenCode

.opencode/

Codex

AGENTS.md

JetBrains

Plugin .zip

Platform Coverage

Every Major Data Platform

Vendor-neutral skills adapt to your stack through platform-specific presets.

AWS

S3, Glue, Athena, EMR, Lake Formation

Azure

ADLS, Synapse, Purview, Data Factory

GCP

BigQuery, Dataflow, Pub/Sub, Dataplex

Databricks

Delta Lake, Unity Catalog, Auto Loader

Snowflake

Streams, Tasks, Dynamic Tables

Alibaba Cloud

MaxCompute, DataWorks, OSS

Apache Spark

Batch & streaming processing

Apache Flink

Stream processing engine

Apache Airflow

Workflow orchestration

Apache Kafka

Event streaming platform

Apache Iceberg

Open table format

Informatica

Enterprise data integration

Talend

Data integration & quality

Multi-Cloud

Hybrid & cross-cloud patterns

Automation Layer

Operational Hooks & Guardrails

Pre-flight checks that make the repository behave like a workflow system.

📌

session-start

Detects repo type, recommends preset, starter pack, and next command

📜

contract-check-pre

Validates contract files contain minimum required fields

🛡

pipeline-review-pre

Checks for tests, contracts, rollback notes, lineage, quality gates

🚨

incident-mode

Switches session into incident-response posture

🔄

backfill-guard

Safety questions and validation gates before replay or cutover

⚠️

schema-change-guard

Detects risky schema evolution: breaking renames, drops, destructive refreshes

💰

cost-check

Flags cost hotspots in SQL, dbt, Spark, and warehouse projects

🚀

release-guard

Validates reconciliation, publish control, observability, and rollback evidence
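
To make one of the hooks above more concrete, here is a minimal sketch of the kind of comparison a schema-change guard might perform. It is an illustration under stated assumptions, not the actual `schema-change-guard` logic: schemas are assumed to be simple `{column: type}` mappings, `diff_schema` is a hypothetical helper, added columns are assumed to be nullable and therefore safe, and destructive refreshes are not covered.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and classify each change."""
    dropped = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])

    findings = []
    for col in dropped:
        findings.append(("breaking", f"column dropped: {col}"))
    for col in retyped:
        findings.append(("breaking", f"type changed: {col} {old[col]} -> {new[col]}"))
    for col in added:
        # Assumption: additive columns are nullable and do not break readers.
        findings.append(("safe", f"column added: {col}"))
    # A drop plus an add in the same change may be a rename; flag for review.
    if dropped and added:
        findings.append(("review", f"possible rename among {dropped} -> {added}"))
    return findings


# Example usage with made-up schemas.
old = {"order_id": "bigint", "amount": "double", "cust": "string"}
new = {"order_id": "bigint", "amount": "decimal(18,2)", "customer_id": "string"}
findings = diff_schema(old, new)
for severity, message in findings:
    print(severity, message)

blocked = any(severity == "breaking" for severity, _ in findings)
```

A name-based diff cannot tell a rename from an unrelated drop and add, which is why the sketch only flags the pair for human review instead of guessing.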

End-to-End Projects

13 Real-World Examples

Complete project packs with specs, plans, tasks, smoke tests, and proof paths.

Lakehouse

AWS S3 + Glue + Athena + Iceberg

Full lakehouse adoption with zone architecture, normalization jobs, reconciliation checks, and rollback demonstrations.

Medallion

Databricks Delta Medallion

Bronze-to-silver transformation with Auto Loader, contract validation, and rollback-aware proof paths.

Analytics

dbt Warehouse Marts

Analytics modeling with DuckDB local execution, semantic layer governance, and quality gates.

Streaming

Kafka + Flink Streaming

Stream processing with replay safety, schema-aware pipelines, and incident recovery patterns.

GCP

GCP Pub/Sub + Dataflow + BigQuery

Managed GCP stream-to-analytics path with observability and SLA management.

Snowflake

Snowflake + dbt + Reverse ETL

Warehouse-to-operational serving with publishing contracts and data sharing patterns.

Governance

Privacy Retention & Deletion

GDPR-compliant deletion propagation with audit evidence and lineage tracking.

Migration

Multi-Cloud Warehouse Cutover

Platform migration with dual-run reconciliation, consumer cutover, and rollback paths.
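
Dual-run reconciliation usually means running the legacy and the new pipeline side by side and comparing their outputs before consumers are switched over. The sketch below shows one simple in-memory form of that comparison. It is not the example project's code: `reconcile` and `safe_to_cut_over` are hypothetical helpers, and the key column is assumed to be unique within each output.

```python
def reconcile(legacy_rows, new_rows, key):
    """Compare two outputs keyed by `key` and report every difference."""
    legacy = {row[key]: row for row in legacy_rows}
    new = {row[key]: row for row in new_rows}
    shared = set(legacy) & set(new)
    return {
        "missing_in_new": sorted(set(legacy) - set(new)),
        "extra_in_new": sorted(set(new) - set(legacy)),
        "mismatched": sorted(k for k in shared if legacy[k] != new[k]),
    }


def safe_to_cut_over(report):
    """Cutover is allowed only when every difference list is empty."""
    return not any(report.values())


# Example usage with made-up outputs from the two platforms.
legacy = [{"id": 1, "total": 10}, {"id": 2, "total": 20}, {"id": 3, "total": 30}]
new = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 4, "total": 40}]
report = reconcile(legacy, new, key="id")
print(report)
print(safe_to_cut_over(report))
```

At warehouse scale the same comparison would typically be pushed down to SQL with counts and checksums per partition; the dictionary version here only illustrates the shape of the evidence a cutover gate needs.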

Release

CI/CD Progressive Release

Staged platform release flow with shadow validation, canary checks, and gated rollout.

ESG

ESG Regulatory Reporting

Governed sustainability and regulatory reporting with evidence collection and regional controls.

Security

Validation & Security Review

Release readiness through proof: validation, testcase inventory, and security controls.

Feature Store Online/Offline Parity

Feature consistency between training and serving with quality checks and observability.

Full Spectrum

Comprehensive Coverage

From ingestion to governance, from streaming to compliance, from legacy to cloud-native.

📊

Ingestion & Extraction

API/SaaS ingestion, CDC, Debezium, file/SFTP feeds, partner sources, source reliability, and extraction resilience.

⚡

Streaming & Messaging

Kafka, Flink, Beam, Pub/Sub, Kinesis, schema registries, replay-safe streaming, and ClickHouse real-time analytics.

🏭

Lakehouse & Warehouse

Delta Lake, Iceberg, Hudi, zone architecture, medallion patterns, dbt, semantic layers, and cost optimization.

🔒

Security & Compliance

PII, PCI, HIPAA, GDPR, regional sovereignty, ESG reporting, audit evidence, and platform-native governance.

🚨

Reliability & Recovery

Observability, SLA management, incident triage, disaster recovery, resiliency testing, and failure injection.

🔧

Modernization

ETL/ELT strategy, mainframe offload, and enterprise ETL modernization for Informatica, Talend, DataStage, and SSIS.

💻

Language-Specific Skills

Python pipeline packaging, Scala JVM data jobs, and Java integration services with idiomatic patterns.

🌐

Platform Governance

Glue Data Catalog, Lake Formation, Unity Catalog, Microsoft Purview, Dataplex, OpenMetadata, and DataHub.

🛠

Operations & Platform

CI/CD release management, Terraform infrastructure, platform operating models, golden paths, and service ownership.

Ready to Level Up Your Data Engineering?
