Open Source • MIT Licensed • Production-Grade

Data Engineering
Agent Skills

The open skill registry and execution toolkit that gives AI agents the operating procedures to define, plan, implement, validate, replay, and ship reliable data products with the same discipline used by strong data engineering teams.

70+ Skills
14 Presets
13 Examples
12 Starter Packs
10 MCP Configs
10+ Agent Surfaces

Everything You Need for
Disciplined Data Delivery

Not generic prompts. Operating procedures for building reliable data products.

🎯

70+ Production-Grade Workflow Skills

Covering ingestion, transformation, orchestration, streaming, lakehouse, warehousing, governance, quality, legacy modernization, release management, incident recovery, and platform operating models. Each skill is a structured SKILL.md with progressive disclosure for agents.

☁️

14 Platform Presets

Vendor-neutral skills adapt to AWS, Azure, GCP, Databricks, Snowflake, Alibaba Cloud, and Apache stacks through platform-specific presets.

📜

Contract-First Design

Source contracts, dataset contracts, metric contracts, and compliance controls are defined before implementation, so nobody has to guess at grain, freshness, or behavior.
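As a rough illustration of what "contract first" can mean in practice, the sketch below declares a dataset's grain and freshness up front and checks that the declaration is internally consistent. The `DatasetContract` class and its field names are hypothetical and chosen for this example; the project's own YAML templates may be shaped differently.

```python
from dataclasses import dataclass, field


# Hypothetical contract shape, for illustration only; the project's real
# templates may use different field names.
@dataclass(frozen=True)
class DatasetContract:
    name: str
    grain: tuple[str, ...]      # columns that uniquely identify one row
    freshness_minutes: int      # maximum acceptable staleness
    columns: dict[str, str] = field(default_factory=dict)  # column -> type


def contract_problems(contract: DatasetContract) -> list[str]:
    """Return readable problems; an empty list means the contract is usable."""
    problems = []
    if not contract.grain:
        problems.append("grain is empty")
    missing = [c for c in contract.grain if c not in contract.columns]
    if missing:
        problems.append(f"grain columns not declared: {missing}")
    if contract.freshness_minutes <= 0:
        problems.append("freshness_minutes must be positive")
    return problems


orders = DatasetContract(
    name="orders_daily",
    grain=("order_id",),
    freshness_minutes=60,
    columns={"order_id": "string", "amount": "decimal(18,2)"},
)
print(contract_problems(orders))  # []
```

Writing the grain and freshness down before any pipeline code exists gives later checks something concrete to validate against.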

🔄

Replay-Safe Execution

Idempotent, replayable, backfill-safe patterns are the default. Backfill guards, schema change guards, and release gates built in.
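One common way to get replay safety is to overwrite whole partitions instead of appending to them, so that running the same backfill twice leaves the table unchanged. The sketch below shows the idea with an in-memory dictionary standing in for a partitioned table. The function names are hypothetical and this is not the toolkit's implementation.

```python
# In-memory stand-in for a partitioned table: partition key -> rows.
Table = dict[str, list[dict]]


def overwrite_partition(table: Table, partition: str, rows: list[dict]) -> None:
    """Replace the whole partition instead of appending, so reruns converge."""
    table[partition] = [dict(r) for r in rows]


def backfill(table: Table, source: Table, partitions: list[str]) -> None:
    """Rebuild each requested partition from source; safe to run repeatedly."""
    for partition in partitions:
        overwrite_partition(table, partition, source.get(partition, []))


source = {"2024-01-01": [{"id": 1}], "2024-01-02": [{"id": 2}, {"id": 3}]}
target: Table = {}

backfill(target, source, ["2024-01-01", "2024-01-02"])
first_run = {p: list(rows) for p, rows in target.items()}
backfill(target, source, ["2024-01-01", "2024-01-02"])  # replay the same range

print(target == first_run)  # True
```

An append-based load would have doubled the rows on the second run; the overwrite keeps the result identical however many times the range is replayed.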

🛡

Operational Hooks & Guardrails

Session start detection, contract validation, schema safety checks, cost analysis, backfill guards, pipeline review gates, and release readiness checks. The repository behaves like a workflow system, not a static library. Run hooks before risky operations to catch problems early.

📊

Runnable Examples

13 end-to-end projects with specs, plans, tasks, smoke tests, rollback demos, and proof paths.

📚

Templates & Evidence

Machine-readable YAML templates for contracts, backfill plans, schema changes, release gates, and incident runbooks.

🚀

Starter Packs

Opinionated bundles for AWS Lakehouse, Databricks Medallion, Streaming, CI/CD, Privacy, ESG, and more.

Seven Commands. One Discipline.

A spec-first lifecycle that maps to the entire data product delivery journey.

📜 /spec: Define
📋 /plan: Break down
🔧 /build: Implement
/validate: Prove
🔍 /review: Inspect
🔄 /backfill: Replay
🚀 /ship: Release
# Quick install
git clone https://github.com/vaquarkhan/data-engineering-agent-skills.git
cd data-engineering-agent-skills

# Install into your project (all agents)
./bootstrap.sh /path/to/project auto

# Or pick specific agents
scripts/install.sh --tool cursor,claude,kiro --target ./my-project

# Run benchmarks
python benchmarks/score_benchmarks.py

Install in VS Code & JetBrains

First-class plugin support for the editors your team already uses.

VS Code Extension

A full command-palette installer for the VS Code family. Install the complete toolkit, core pack, agent adapters, starter packs, MCP templates, or runnable examples with a single command.

  • Works with VS Code, Cursor, Windsurf, VSCodium
  • Command palette installers for all asset types
  • Download .vsix from GitHub Releases
  • Marketplace publish workflow included
// VS Code Command Palette
> DE Skills: Install Full Toolkit
> DE Skills: Install Core Pack
> DE Skills: Install Agent Adapters
> DE Skills: Install Starter Pack
> DE Skills: Install MCP Templates
> DE Skills: Install Examples

JetBrains Plugin

Native plugin for IntelliJ-platform IDEs. Install the data engineering skill toolkit directly into your JetBrains workflow with tool-window integration and project-level configuration.

  • IntelliJ IDEA, PyCharm, DataGrip, WebStorm, GoLand
  • Download plugin .zip from GitHub Releases
  • Marketplace publish workflow ready
  • Install smoke tests in CI
// JetBrains Tool Window
Tools > Data Engineering Skills

  Install Full Toolkit
  Install Core Skills
  Install Platform Preset
  Install Starter Pack
  Run Session Hook

Works With Every AI Agent

Native adapters and install surfaces for all major AI coding assistants and IDEs.

VS Code

.vsix extension

Cursor

.cursor/rules/

Claude

CLAUDE.md + plugin

Copilot

copilot-instructions

Gemini

.gemini/commands/

Kiro

.kiro/steering/

Windsurf

.windsurfrules

OpenCode

.opencode/

Codex

AGENTS.md

JetBrains

Plugin .zip

Every Major Data Platform

Vendor-neutral skills adapt to your stack through platform-specific presets.

AWS

S3, Glue, Athena, EMR, Lake Formation

Azure

ADLS, Synapse, Purview, Data Factory

GCP

BigQuery, Dataflow, Pub/Sub, Dataplex

Databricks

Delta Lake, Unity Catalog, Auto Loader

Snowflake

Streams, Tasks, Dynamic Tables

Alibaba Cloud

MaxCompute, DataWorks, OSS

Apache Spark

Batch & streaming processing

Apache Flink

Stream processing engine

Apache Airflow

Workflow orchestration

Apache Kafka

Event streaming platform

Apache Iceberg

Open table format

Informatica

Enterprise data integration

Talend

Data integration & quality

Multi-Cloud

Hybrid & cross-cloud patterns

Operational Hooks & Guardrails

Pre-flight checks that make the repository behave like a workflow system.

📌

session-start

Detects the repo type and recommends a preset, a starter pack, and the next command

📜

contract-check-pre

Validates contract files contain minimum required fields

🛡

pipeline-review-pre

Checks for tests, contracts, rollback notes, lineage, quality gates

🚨

incident-mode

Switches session into incident-response posture

🔄

backfill-guard

Safety questions and validation gates before replay or cutover

⚠️

schema-change-guard

Detects risky schema evolution: breaking renames, drops, destructive refreshes
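To make the idea concrete, a schema guard of this kind can be approximated by diffing the old and new column definitions. The sketch below is an illustrative assumption, not the hook's actual logic: it treats dropped columns and type changes as risky, treats added columns as safe, and sees a rename as a drop plus an add.

```python
def risky_schema_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare column -> type mappings and list changes that can break consumers."""
    findings = []
    for column, old_type in old.items():
        if column not in new:
            # A rename also lands here, because the old name disappears.
            findings.append(f"dropped column: {column}")
        elif new[column] != old_type:
            findings.append(f"type changed: {column} {old_type} -> {new[column]}")
    return findings


old_schema = {"order_id": "string", "amount": "decimal(18,2)", "region": "string"}
new_schema = {"order_id": "string", "amount": "double", "country": "string"}

for finding in risky_schema_changes(old_schema, new_schema):
    print(finding)
# type changed: amount decimal(18,2) -> double
# dropped column: region
```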

💰

cost-check

Flags cost hotspots in SQL, dbt, Spark, and warehouse projects
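A cost check for SQL can start from a few text heuristics. The sketch below is a deliberately naive illustration with hypothetical rules; it does not parse SQL and is not the rule set the hook ships with.

```python
import re


def sql_cost_flags(sql: str) -> list[str]:
    """Flag a few well-known cost smells using plain text matching."""
    flags = []
    normalized = " ".join(sql.split()).lower()
    if re.search(r"select\s+\*", normalized):
        flags.append("SELECT * reads every column")
    if " where " not in f" {normalized} ":
        flags.append("no WHERE clause; may scan the full table")
    if re.search(r"\bcross join\b", normalized):
        flags.append("CROSS JOIN can multiply row counts")
    return flags


print(sql_cost_flags("SELECT * FROM events"))
# ['SELECT * reads every column', 'no WHERE clause; may scan the full table']
print(sql_cost_flags("SELECT id FROM events WHERE ds = '2024-01-01'"))
# []
```

Text matching like this will produce false positives and misses (comments, subqueries, string literals), so a real check would more likely work from a parsed query or the engine's query plan.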

🚀

release-guard

Validates reconciliation, publish control, observability, and rollback evidence
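A release gate of this shape reduces to a checklist: ship only when every required piece of evidence is present. The sketch below assumes a simple mapping of evidence names to booleans; the key names are hypothetical and only mirror the four areas listed above.

```python
# Hypothetical evidence keys mirroring the four areas named above.
REQUIRED_EVIDENCE = ("reconciliation", "publish_control", "observability", "rollback")


def missing_evidence(evidence: dict[str, bool]) -> list[str]:
    """Return required evidence items that are absent or marked False."""
    return [item for item in REQUIRED_EVIDENCE if not evidence.get(item, False)]


release_evidence = {"reconciliation": True, "publish_control": True, "observability": False}
gaps = missing_evidence(release_evidence)
print(gaps)  # ['observability', 'rollback']
print("ready to ship" if not gaps else "blocked")  # blocked
```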

13 Real-World Examples

Complete project packs with specs, plans, tasks, smoke tests, and proof paths.

Lakehouse

AWS S3 + Glue + Athena + Iceberg

Full lakehouse adoption with zone architecture, normalization jobs, reconciliation checks, and rollback demonstrations.

Medallion

Databricks Delta Medallion

Bronze-to-silver transformation with Auto Loader, contract validation, and rollback-aware proof paths.

Analytics

dbt Warehouse Marts

Analytics modeling with DuckDB local execution, semantic layer governance, and quality gates.

Streaming

Kafka + Flink Streaming

Stream processing with replay safety, schema-aware pipelines, and incident recovery patterns.

GCP

GCP Pub/Sub + Dataflow + BigQuery

Managed GCP stream-to-analytics path with observability and SLA management.

Snowflake

Snowflake + dbt + Reverse ETL

Warehouse-to-operational serving with publishing contracts and data sharing patterns.

Governance

Privacy Retention & Deletion

GDPR-compliant deletion propagation with audit evidence and lineage tracking.

Migration

Multi-Cloud Warehouse Cutover

Platform migration with dual-run reconciliation, consumer cutover, and rollback paths.
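Dual-run reconciliation usually comes down to comparing the output of the legacy and target platforms for the same input. One minimal way to sketch that, assuming both result sets fit in memory as lists of dictionaries with string keys, is to compare row counts plus an order-independent digest. The helper names here are hypothetical.

```python
import hashlib


def table_fingerprint(rows: list[dict]) -> tuple[int, str]:
    """Row count plus a digest that ignores row order and key order."""
    row_digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_digests).encode()).hexdigest()
    return len(rows), combined


def reconcile(legacy_rows: list[dict], target_rows: list[dict]) -> bool:
    """True when both platforms produced the same set of rows."""
    return table_fingerprint(legacy_rows) == table_fingerprint(target_rows)


legacy = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
migrated = [{"total": 20, "id": 2}, {"id": 1, "total": 10}]  # same rows, new order

print(reconcile(legacy, migrated))  # True
print(reconcile(legacy, migrated[:1]))  # False
```

At warehouse scale the same comparison is typically pushed down into SQL aggregates per partition, but the pass/fail logic stays the same.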

Release

CI/CD Progressive Release

Staged platform release flow with shadow validation, canary checks, and gated rollout.

ESG

ESG Regulatory Reporting

Governed sustainability and regulatory reporting with evidence collection and regional controls.

Security

Validation & Security Review

Release readiness through proof: validation, testcase inventory, and security controls.

ML

Feature Store Online/Offline Parity

Feature consistency between training and serving with quality checks and observability.

Comprehensive Coverage

From ingestion to governance, from streaming to compliance, from legacy to cloud-native.

📊

Ingestion & Extraction

API/SaaS ingestion, CDC, Debezium, file/SFTP feeds, partner sources, source reliability, and extraction resilience.

Streaming & Messaging

Kafka, Flink, Beam, Pub/Sub, Kinesis, schema registries, replay-safe streaming, and ClickHouse real-time analytics.

🏭

Lakehouse & Warehouse

Delta Lake, Iceberg, Hudi, zone architecture, medallion patterns, dbt, semantic layers, and cost optimization.

🔒

Security & Compliance

PII, PCI, HIPAA, GDPR, regional sovereignty, ESG reporting, audit evidence, and platform-native governance.

🚨

Reliability & Recovery

Observability, SLA management, incident triage, disaster recovery, resiliency testing, and failure injection.

🔧

Modernization

ETL/ELT strategy, mainframe offload, and enterprise ETL modernization for Informatica, Talend, DataStage, and SSIS.

💻

Language-Specific Skills

Python pipeline packaging, Scala JVM data jobs, and Java integration services with idiomatic patterns.

🌐

Platform Governance

Glue Data Catalog, Lake Formation, Unity Catalog, Microsoft Purview, Dataplex, OpenMetadata, and DataHub.

🛠

Operations & Platform

CI/CD release management, Terraform infrastructure, platform operating models, golden paths, and service ownership.

Ready to Level Up Your
Data Engineering?

Give your AI agents the discipline of a strong data engineering team. Open source, MIT licensed, production-ready.