brycewang-stanford/Auto-Empirical-Research-Skills

Auto-Empirical Research Skills (AERS)

🌐 Language: English | 简体中文 | 繁體中文 | 日本語 | 한국어



Stanford REAP × CoPaper.AI · An academic–industrial AI toolkit for empirical research
Built by Stanford's empirical-methodology team — the full pipeline from data cleaning to top-journal submission


Awesome · GitHub stars · License: CC BY-SA 4.0 · PRs Welcome · Validate catalog · OpenSSF Scorecard · Security audit: 52/52 CLEAN · Powered by StatsPAI

The empirical-research specialist's agent-skills distribution. Not a marketing list — 1,072 skills vendored and cataloged in this repo, wrapped in a numeric benchmark, an eval harness, a security audit, and CI, plus a curated map of 23,000+ skills across 119 repositories in the wider ecosystem.

AERS is two things at once: (1) a small set of first-party flagship skills that run the full empirical pipeline — data cleaning → identification → estimation → robustness → tables/figures → submission-ready draft — and (2) a curated, security-aware catalog of the empirical-research skill ecosystem, organized by research-workflow stage. The differentiator is not the count; it is that the flagship behavior is verified against known answers, not asserted.

Note

Renamed. This project was formerly Awesome Agent Skills for Empirical Research. GitHub redirects the old URL automatically; please update your remote:

git remote set-url origin https://github.com/brycewang-stanford/Auto-Empirical-Research-Skills.git


What you actually get (the numbers, precisely)

Numbers in this README are kept honest and disambiguated. "Vendored" means the files live in this repo and are tracked in a generated catalog; "cataloged ecosystem" means curated links to external repositories.

| What it is | Count | Source of truth |
|---|---|---|
| Skills vendored into this repo and cataloged | 1,072 | catalog/skills.json |
| Vendored collections | 64 | catalog/skills.json |
| First-party flagship full-pipeline skills (StatsPAI DSL + explicit Python/Stata/R) | 4 | skills/00* |
| Numeric benchmark tasks with gold values recomputed from data each run | 5 | benchmark/ |
| Behavioral eval scenarios / rubric items | 17 / 95 | eval-harness/ |
| Security audit of the original baseline (collections / files) | 52 / 2,940+, 52/52 CLEAN | SECURITY-SCAN-REPORT.md |
| Curated map of the wider ecosystem | 23,000+ skills / 119 repos | this README · docs/SKILL_CATALOG.md |
| Tools catalog (tools/): causal/econometrics libraries, autonomous research agents, MCP servers, causal discovery, benchmark datasets | 335 tools / 6 categories | tools/tools.json · tools/CATALOG.md |

The security audit covered the original 52-collection / 2,940-file baseline (52/52 CLEAN). Skills vendored after that baseline are tracked in catalog/provenance.json, docs/LICENSE_AUDIT.md, and docs/SKILL_AUDIT.md; run make audit before relying on them in high-trust contexts.


Verify it yourself in 2 minutes

The most persuasive thing here is not a number — it is that the flagship pipeline's behavior is checkable without an API key or paid model. Just Python 3:

git clone https://github.com/brycewang-stanford/Auto-Empirical-Research-Skills.git
cd Auto-Empirical-Research-Skills
make check        # repo validation + unit tests + eval lint + numeric benchmark

The benchmark is the convincing part: it recomputes the gold answer from the raw dataset on every run, so a passing score cannot be faked by hard-coding a number. Out of the box it recovers:

  • LaLonde (1986) / Dehejia–Wahba (1999) — the naive observational comparison gets the wrong sign (−$635); covariate adjustment flips it positive (≈ +$1,548) toward the experimental benchmark (≈ +$1,794).
  • Card (1995) — IV return to schooling (0.131) exceeds OLS (0.075), with the first-stage F (13.3) reported rather than hidden.
  • Plus staggered-DID (TWFE bias vs. group-time truth), sharp RDD, and a bad-control / post-treatment-bias trap.
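
The sign flip in the first bullet can be reproduced on a toy example. The sketch below uses eight hypothetical rows (not the LaLonde data) and two illustrative helpers, `naive_diff` and `stratified_diff`, which are not part of the benchmark's code. Stratifying on one covariate stands in here for the covariate adjustment the benchmark performs.

```python
from statistics import mean

# Hypothetical toy rows (NOT the LaLonde data): (treated, stratum, outcome).
# Treated units are concentrated in the low-earning stratum, which confounds
# the raw comparison.
ROWS = [
    (1, "low", 3.0), (1, "low", 4.0), (1, "low", 5.0), (1, "high", 11.0),
    (0, "low", 2.0), (0, "high", 9.0), (0, "high", 10.0), (0, "high", 11.0),
]


def naive_diff(rows):
    """Raw treated-minus-control difference in mean outcomes."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def stratified_diff(rows):
    """Within-stratum differences, weighted by each stratum's sample share."""
    effect = 0.0
    for stratum in sorted({s for _, s, _ in rows}):
        sub = [r for r in rows if r[1] == stratum]
        treated = [y for t, _, y in sub if t == 1]
        control = [y for t, _, y in sub if t == 0]
        effect += len(sub) / len(rows) * (mean(treated) - mean(control))
    return effect


print(naive_diff(ROWS))       # -2.25: the misleading headline
print(stratified_diff(ROWS))  # 1.5: positive once the stratum is held fixed
```

Because both numbers are computed from the rows on every run, changing the data changes the expected answer, which is the property that makes a hard-coded result fail.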

A pipeline passes only if it surfaces the trap, refuses to headline the misleading number, and matches the recomputed truth. See benchmark/ and the full trust overview in docs/TRUST.md.
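
The three pass conditions could be expressed roughly as follows. This is a sketch, not the benchmark's actual grader: the `Report` shape, the `grade` function, the trap label, and the 5% relative tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical shape of what a pipeline hands back for one task."""
    headline: float                           # the number the pipeline leads with
    flags: set = field(default_factory=set)   # traps it explicitly surfaced


def grade(report, gold, naive, trap, tol=0.05):
    """Return (passed, reasons); `gold` and `naive` are recomputed by the caller."""
    reasons = []
    if trap not in report.flags:
        reasons.append("trap not surfaced")
    if abs(report.headline - naive) <= tol * max(1.0, abs(naive)):
        reasons.append("headlines the misleading number")
    if abs(report.headline - gold) > tol * max(1.0, abs(gold)):
        reasons.append("does not match recomputed truth")
    return (not reasons, reasons)


# Usage with made-up values: gold = 1.5, naive = -2.25.
good = Report(headline=1.52, flags={"selection-bias"})
bad = Report(headline=-2.25)
print(grade(good, gold=1.5, naive=-2.25, trap="selection-bias"))
print(grade(bad, gold=1.5, naive=-2.25, trap="selection-bias"))
```

Keeping the three checks separate means a failing run reports which condition it missed instead of a single pass/fail bit.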

💡 Want it hosted and end-to-end? Skip the assembly — copaper.ai runs the empirical pipeline for you, built alongside this catalog by the same Stanford methodology team.


Why trust this — three layers

| Layer | Anchor | What it brings |
|---|---|---|
| 🏛️ Academic lineage | Stanford REAP / SCCEI (Stanford Center on China's Economy and Institutions) | A research center with a sustained publication record in empirical-economics methodology and a deep tradition in applied causal inference. |
| 🔧 Engineering delivery | CoPaper.AI, an empirical-research AI assistant | Ships 20 econometric-methodology skills (DID / IV / RDD / PSM / DML, …) behind a Supervisor + 4-sub-agent architecture, one-sentence triggers, automatic publication-ready output. |
| ⚙️ Open-source engine | StatsPAI, the causal-inference engine | 900+ functions · one import statspai as sp · JOSS in submission · MIT. Every DID / IV / RD / SCM estimate CoPaper.AI produces is driven by StatsPAI, and this catalog is part of that ecosystem. |

The flagship pipeline skills

Four parallel implementations of the same 8-step empirical loop (data cleaning → variable construction → descriptives → diagnostics → estimation → robustness → mechanism/heterogeneity → publication-ready tables & figures), plus the submission and de-AIGC stacks. Each uses progressive disclosure: a thin canonical-call spine in SKILL.md, with deep per-step reference manuals loaded only on demand. They coexist; pick by stack and use case.

| Skill | Stack | Best for |
|---|---|---|
| StatsPAI 🔥 | Agent-native Python DSL: one sp.causal(...) runs the loop; 900+ functions, self-describing API, unified CausalResult | Whole-pipeline automation in one agent call when you trust the DSL |
| Full Empirical Analysis (Python) 📘 | Explicit stack: pandas · statsmodels · linearmodels · pyfixest · rdrobust · econml · causalml | Teaching, referee-level line-by-line audit, strict replication needing full control |
| Full Empirical Analysis (Stata) 📊 | Community standard: reghdfe · ivreg2 · csdid · did_imputation · sdid · rdrobust · synth · psmatch2 · boottest · esttab | When a referee or co-author insists on a Stata replication pack (AER/QJE/JPE/ReStud style) |
| Full Empirical Analysis (R) 📗 | Modern tidyverse: fixest · did · synthdid · HonestDiD · rdrobust · grf · DoubleML · marginaleffects · Quarto | Single-.qmd reproducibility reports rendered to PDF/HTML/Word in one command |
| AER-Skills 📕 | 9 skills: topic routing → identification audit → robustness → intro → tables → replication → submission → R&R → orchestrator | Top-5 economics (AER / AER:Insights / AEJ) submission. Identification-first: with a fragile design, no prose saves it |
| chinese-de-aigc 🇨🇳 | 17-pattern Chinese AI-tell library, 5-step locate→diagnose→rewrite→score→review loop | Lowering AI-writing signal for CNKI / Wanfang / VIP / Turnitin-Chinese submissions |

Why a DSL and explicit ports? Reach for StatsPAI when you trust the one-shot DSL; reach for 00.1/00.2/00.3 when you are teaching, auditing, or must swap every diagnostic by hand. AER-skills then takes a correct analysis to the acceptance threshold. These solve different problems, and they compose.


Start here — pick a skill in 30 seconds

| Goal | Start with |
|---|---|
| Run a complete empirical pipeline | StatsPAI (or Python · Stata · R) |
| Audit a top-5 identification strategy first | aer-identification |
| Prepare an AER / AEJ submission | aer-workflow |
| Build an AEA-ready replication package | aer-replication |
| Lower the AI-writing signal of a Chinese draft | chinese-de-aigc |



What makes this more than a 23K-skill dump

Public-skill counts are easy to inflate, and recent studies show large skill indexes are often redundant and occasionally unsafe. AERS competes on verifiable quality, not raw count. Every layer below runs locally via make check and in CI.

| Layer | What it catches | Where |
|---|---|---|
| Numeric benchmark | Reported numbers that don't match truth recomputed from real data: the naive-DID sign trap, weak-IV without first-stage F, TWFE bias under staggered timing, RDD trend confound, post-treatment bad controls | benchmark/ · 5 tasks |
| Eval harness | Prose-level failures: weak-IV false reassurance, staggered-DID TWFE misuse, fabricated citations, unsafe curl \| bash setup, multiple-testing abuse, AER compliance gaps | eval-harness/ · 17 scenarios / 95 rubric items |
| Security audit | Pipe-to-shell, reverse shells, credential exfiltration, prompt injection across 13 risk categories; 6-phase, 40+ hook scripts reviewed by hand | SECURITY-SCAN-REPORT.md |
| Provenance & license | Unvendored sources, license risk, hygiene drift across all 1,072 cataloged skills | docs/LICENSE_AUDIT.md · docs/SKILL_QUALITY.md |
| CI & compatibility | Catalog freshness, broken local links, GitHub Actions policy, Python 3.9 and 3.12 syntax floor | .github/workflows/ · 6 workflows |

make catalog     # regenerate catalog, provenance, audit, enrichment
make validate    # freshness + link / frontmatter checks
make check       # full gate: validate + Python compile + unit tests + eval lint + benchmark

The trust surface is necessary, not sufficient — regex rubrics don't certify prose and a small benchmark doesn't cover every design. It is built to fail fast on known high-cost mistakes. Read the honest scope in docs/TRUST.md and docs/QUALITY_GATE.md.
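
To make the "regex rubric" idea concrete, here is a minimal sketch of one rubric item for the weak-IV scenario: an answer that reports an IV estimate should also report the first-stage F. The patterns and the function name `weak_iv_rubric` are assumptions for illustration, not the eval harness's actual rules.

```python
import re

# Hypothetical rubric item: an answer that reports an IV estimate
# must also report a first-stage F statistic with a number attached.
MENTIONS_IV = re.compile(r"\b(IV|2SLS|instrumental variables?)\b")
REPORTS_FIRST_STAGE_F = re.compile(r"first[- ]stage\s+F\b[^.\n]*\d", re.IGNORECASE)


def weak_iv_rubric(answer: str) -> bool:
    """True if the answer passes: no IV claim, or an IV claim with a first-stage F."""
    if not MENTIONS_IV.search(answer):
        return True
    return bool(REPORTS_FIRST_STAGE_F.search(answer))


print(weak_iv_rubric("The IV return to schooling is 0.131 (first-stage F = 13.3)."))  # True
print(weak_iv_rubric("The IV estimate is 0.131, so the instrument is strong."))       # False
```

A check like this only confirms that the statistic is mentioned; it cannot tell whether the interpretation of that statistic is sound, which is the limit the paragraph above describes.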


Browse the landscape

By research stage

Topic Ideation → Lit Search → Deep Reading → Research Design → Data Collection
      │              │             │              │                │
      ▼              ▼             ▼              ▼                ▼
     01             02            03             01               04

Data Cleaning → Statistical Analysis → First Draft → Revision → Typesetting
      │              │                    │            │            │
      ▼              ▼                    ▼            ▼            ▼
     04             05                   06           07           08

Replication → Submission → Peer Review Response → Defense
      │           │              │                   │
      ▼           ▼              ▼                   ▼
     09          10             10                  10

Per-stage skill notes (bilingual): 01 Topic & design · 02 Lit review · 03 Paper reading · 04 Data & cleaning · 05 Causal inference · 06 Writing · 07 Revision · 08 Citation & typesetting · 09 Replication · 10 Review response

Comprehensive skill suites

The pain point AERS exists to fix: ask an AI to "run a DID" and it gives the baseline regression and stops. "Parallel trends?" — it adds one. "Placebo?" — another. Every time, like squeezing toothpaste. A skill is a methodology playbook for the agent: it already knows a complete DID means parallel-trends → baseline → robustness battery → heterogeneity → mechanism, with a defined output at each step.
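
One way to picture a playbook is as an ordered checklist that the agent consults instead of waiting to be prompted step by step. The sketch below is illustrative only: the step names come from the paragraph above, while the output labels and the helper functions are hypothetical.

```python
# Step names taken from the paragraph above; the output labels are hypothetical.
DID_PLAYBOOK = [
    ("parallel-trends", "event-study plot"),
    ("baseline", "main regression table"),
    ("robustness battery", "robustness table"),
    ("heterogeneity", "subgroup estimates"),
    ("mechanism", "mechanism table"),
]


def missing_steps(produced):
    """Return the playbook steps, in order, whose output has not been produced."""
    return [step for step, _ in DID_PLAYBOOK if step not in produced]


def next_step(produced):
    """First outstanding step with its expected output, or None when complete."""
    for step, output in DID_PLAYBOOK:
        if step not in produced:
            return step, output
    return None


# An agent that stopped after the baseline regression still owes four steps.
print(missing_steps({"baseline"}))
print(next_step({"baseline"}))
```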

Academic research: general-purpose research suites (K-Dense, AI-Research-SKILLs, claude-scholar, …)

| Suite | Stars | # Skills | Key features |
|---|---|---|---|
| K-Dense-AI/claude-scientific-skills | 8,799 | 140+ | 28+ scientific databases (OpenAlex, PubMed); scientific-writing + literature-review + statistical-analysis |
| Orchestra-Research/AI-Research-SKILLs | 3,637 | 87 | 22 categories, ML paper writing, LaTeX templates, citation verification |
| Imbad0202/academic-research-skills | ~1,790 | Multiple | Full paper pipeline (research → write → review → revise → finalize), style calibration, hallucination detection |
| Galaxy-Dawn/claude-scholar | - | 25+ | Full research lifecycle: ideation → review → experiments → writing → review response; Zotero MCP |
| luwill/research-skills | 209 | 3 | Research-proposal generation, medical review writing, paper-to-slides, bilingual |
| lishix520/academic-paper-skills | 22 | 2 | Strategist (7-dimension reviewer simulation) + Composer (systematic writing) |
| Data-Wise/claude-plugins | - | 17 | Statistical research: arXiv search, DOI lookup, BibTeX, methodology writing, referee response |

Economics / causal inference: the first-party flagships plus community Stata/IV/feedback suites

The first-party flagships (StatsPAI, Python, Stata, R, AER-skills) are described above. Community complements:

| Suite | Key features | Use case |
|---|---|---|
| CoPaper.AI | 20 methodology skills, Supervisor + 4 sub-agents, smart routing, automatic output | Full empirical-economics workflow, hosted |
| claesbackman/AI-research-feedback | 2-agent pre-review: causal-overclaiming detection, identification assessment (AER/QJE/JPE/Econometrica/REStud); 6-agent grant review | Pre-submission self-review, grants |
| fuhaoda/stats-paper-writing-agent-skills | LaTeX statistical-paper writing, front-end draft generation | Statistics & econometrics papers |
| dylantmoore/stata-skill | Full Stata coverage: syntax, data management, econometrics, causal inference, Mata, 20+ packages | Stata users |
| SepineTam/stata-mcp | LLM drives Stata regressions directly via MCP | Stata econometrics |
| hanlulong/stata-mcp | Stata-MCP editor extension (VS Code/Cursor/Antigravity): run .do directly, live output, data/graph viewer; MIT · 414★ (same name as SepineTam above, different project) | In-editor AI pairing with Stata |
| tmonk/mcp-stata · vendored at skills/64 | 20 SKILL.md skills from the Stata MCP server: replication / data audit / publication QA / legacy modernization / referee response / power / causal inference; AGPL-3.0 (kept as a separately-licensed aggregate; server code not vendored) | Stata replication & robustness audits |
| PovertyAction/ipa-stata-template | IPA reproducible Stata research template + .claude/skills: numbered pipeline, assertion-based defensive programming, LaTeX tables; MIT | Development economics / field-RCT replication |
| lcrawfurd/claude-skills | Academic skills: paper / code review, referee, pre-submission; code-review encodes Stata/R/Python coding standards (DIME / Reif / AEA Data Editor) | Pre-submission review & code audit |
| AEADataEditor/replication-template | AEA Data Editor's official replication-package template (Stata-centric, REPLICATION.md), the reproducibility "gold standard" | AEA / top-journal replication packaging |

Finance · education & public health · law · marketing · product · general agents

Finance & investment: financial-services-plugins (Anthropic official) · OctagonAI/skills · tradermonty/claude-trading-skills · himself65/finance-skills · quant-sentiment-ai/claude-equity-research

Education & public health: GarethManning/claude-education-skills · FreedomIntelligence/OpenClaw-Medical-Skills (869 medical skills: epidemiology, surveillance, clinical research, drug safety, biostatistics)

Governance, compliance & law: Claude-Skills-Governance-Risk-and-Compliance (ISO 27001 / SOC 2 / GDPR / HIPAA) · zubair-trabzada/ai-legal-claude · evolsb/claude-legal-skill

Marketing & consumer behavior: coreyhaines31/marketingskills · zubair-trabzada/ai-marketing-claude · ericosiu/ai-marketing-skills

Product & organizational behavior: phuryn/pm-skills (100+ skills) · mastepanoski/claude-skills (Nielsen heuristics, NIST AI RMF, ISO 42001)

General agent capabilities: lyndonkl/claude (85 skills + 6 orchestrators) · alirezarezvani/claude-skills (220+ skills, ~5,200★) · rohitg00/awesome-claude-code-toolkit · jeremylongshore/claude-code-plugins-plus-skills (1,367 skills) · posit-dev/skills (Posit official)

Anti-AIGC detection & de-AI academic writing

One of 2026's sharpest pain points: papers failing AIGC detection (Turnitin, GPTZero, CNKI) can be rejected outright. The skills below are the most complete open-source solutions — all MIT, all locally archived (skills/44-48).

| Suite | Key features | Best for | Local |
|---|---|---|---|
| chinese-de-aigc 🇨🇳 | Original Chinese academic de-AIGC by CoPaper.AI; 17-pattern Chinese-tell library, 5-step loop, per-section strategy, 5-dim scoring. The only GitHub skill dedicated to Chinese academic de-AIGC | CNKI / Wanfang / VIP / Turnitin-Chinese | 48 |
| matsuikentaro1/humanizer_academic | Academic-specific; 23 AI-writing patterns; preserves legitimate academic transitions | Medical, life-science, natural-science papers | 44 |
| stephenturner/skill-deslop | Distinguishes legitimate discipline conventions from AI tells; 5-dimension scoring | Scientific papers, technical blogs | 45 |
| hardikpandya/stop-slop | 3-layer detection + 5-dim scoring; banned phrases, structural clichés, sentence rules | General prose, blogs, reports | 46 |
| conorbronsdon/avoid-ai-writing | Structured audit + rewrite + second-pass audit; auditable, traceable | Workflows needing a paper trail | 47 |

Combos: 🇨🇳 Chinese (CNKI/Wanfang/VIP) → chinese-de-aigc · 🇬🇧 English → humanizer_academic · need an audit trail → avoid-ai-writing · general prose → stop-slop.

Tools catalog (tools/) — automated empirical & causal-inference tools

Unlike the skills above, tools/ catalogs the software and services an agent (or researcher) actually invokes — structured, license- and maintenance-aware, and wired into make validate. Source of truth: tools/tools.json; browsable list: tools/CATALOG.md.

335 tools across 6 categories (curated 2026-06):

  • Causal-inference / treatment-effect libraries (32) — DoWhy · EconML · CausalML · DoubleML · CausalPy · causallib · grf · CATENets · TMLE family · Mendelian randomization …
  • Econometrics / quasi-experimental libraries (170) — panel FE · DiD (incl. modern/staggered) · event study · RDD · IV · synthetic control/SDID · matching & weighting · sensitivity (fixest · did · HonestDiD · rdrobust · synthdid · reghdfe · csdid · sdid · pyfixest · linearmodels …); plus spatial econometrics (spdep · PySAL/spreg · GeoDa), local projections/IRF & (S)VAR (lpirfs · vars · svars), survey weighting/MRP/raking (survey · samplics · balance), and meta-analysis (metafor · meta · netmeta · metan) — across R/Python/Stata/Julia.
  • Autonomous research / data-science agents (51) — end-to-end research & data analysis: AI-Scientist · data-to-paper · Agent Laboratory · RD-Agent · AI-Researcher · STORM · PaperQA2 · gpt-researcher · DeepAnalyze · MetaGPT (DI) · Biomni … (⚠️ includes non-OSI / no-LICENSE repos — confirm terms before use).
  • MCP servers (48) — stats execution (StatsPAI · stata-mcp · R/Jupyter MCP) + data access (FRED · World Bank · IMF · OECD · Eurostat · Census · BEA · BLS · SEC EDGAR · OpenAlex · Semantic Scholar · PubMed · Zotero · arXiv …).
  • Causal discovery / structure learning (25) — causal-learn · Tetrad/py-tetrad · gCastle · CDT · tigramite (PCMCI) · LiNGAM · NOTEARS/DAGMA · pcalg · bnlearn · pgmpy …
  • Benchmarks & datasets (9) — causaldata · IHDP/Twins · ACIC competition data · RealCause · JustCause · Tübingen cause-effect pairs · bnlearn network repository …

Full write-up: tools/README.md.

Multi-agent systems · MCP servers · platforms · learning

Multi-agent collaboration systems — paper revision, autonomous research, data-science teams

Role separation beats a single agent because the reviewer is independent of the drafter — the same logic as peer review.

Paper revision & writing: copy-edit-master (3 sub-agents, Strunk & White / McCloskey rules) · introduction-writer (strategist → drafter → reviewer → reviser) · CoPaper.AI PaperAgent (Supervisor + 4 sub-agents).

Autonomous research & data science: ruc-datalab/DeepAnalyze · business-science/ai-data-science-team · HKUDS/AI-Researcher (NeurIPS 2025 Spotlight) · wanshuiyin/ARIS · SamuelSchmidgall/AgentLaboratory (84% cost reduction) · SakanaAI/AI-Scientist-v2 · assafelovic/gpt-researcher · pedrohcgs/claude-code-my-workflow (Emory).

Academic data MCP servers — OpenAlex, Semantic Scholar, FRED, World Bank, Zotero, …

xingyulu23/Academix · Eclipse-Cj/paper-distill-mcp · oksure/openalex-research-mcp (240M+ works) · openags/paper-search-mcp (20+ sources) · lzinga/us-gov-open-data-mcp (40+ US gov APIs) · stefanoamorelli/fred-mcp-server (FRED 800K+ series) · llnOrmll/world-bank-data-mcp · 54yyyu/zotero-mcp

Skill aggregation platforms & learning resources

Platforms: VoltAgent/awesome-agent-skills (1,000+) · sickn33/antigravity-awesome-skills (1,340+) · VoltAgent/awesome-openclaw-skills (5,400+) · skills.sh · ClawHub (13,729) · Anthropic official skills.

Learning: Claude Code Skills guide (PDF) · Agent Skills Standard · Causal Inference for the Brave and True · Awesome AI for Economists · Awesome Econ AI Stuff.


Security

The original 52 skill collections / 2,940+ files passed a systematic audit (52/52 CLEAN, zero FLAGGED): no malicious prompts, viruses, reverse shells, or prompt injection. Every "sensitive" hit was verified as one of three legitimate categories: defensive security rules, legitimate academic API calls (arXiv / CrossRef / PubMed / FRED / World Bank / OECD / BLS), or standard Claude Code workflow hooks (all local file ops, zero network IO).


Six-phase, defense-in-depth: automated grep across 13 risk categories → 100% manual review of all 6 hook-bearing skills and their 40+ hook scripts (no Bash(*) wildcards anywhere) → three parallel agent content audits → supplemental integrity checks (hidden Unicode, encoding anomalies, HTML injection, network imports).
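
The first phase, an automated pattern sweep, might look something like the sketch below. It covers only two illustrative risk categories (pipe-to-shell and hidden Unicode); the patterns, the category names, and the `scan` function are assumptions, not the audit's actual rules, and a hit here would still need the manual review described above.

```python
import re

# Two illustrative risk categories; the patterns are assumptions, not the
# audit's actual rule set.
PATTERNS = {
    "pipe-to-shell": re.compile(r"\b(curl|wget)\b[^|\n]*\|\s*(sudo\s+)?(ba|z)?sh\b"),
    "hidden-unicode": re.compile("[\u200b-\u200f\u202a-\u202e\u2060\ufeff]"),
}


def scan(text):
    """Return (line_number, category) for every line matching a risk pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, category))
    return hits


sample = (
    "pip install foo\n"
    "curl -fsSL https://example.com/install.sh | bash\n"
    "hello\u200bworld\n"
)
print(scan(sample))  # [(2, 'pipe-to-shell'), (3, 'hidden-unicode')]
```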

Key insight: largest ≠ riskiest. The biggest skills all passed; 17-DAAF actually sets the bar for security-conscious design (14 defensive hooks + 32 deny rules + active credential scanning).

Newer vendored additions are tracked in catalog/provenance.json and docs/SKILL_AUDIT.md — run make audit. Full report: SECURITY-SCAN-REPORT.md.


Changelog

The narrative changelog has moved to CHANGELOG.md. Recent highlights:

  • 2026-05 — Vendored AER-skills (top-5 economics submission stack, 9 skills) with weekly upstream sync; expanded the numeric benchmark to 5 causal-recovery tasks and the eval harness to 17 scenarios / 95 rubric items.
  • 2026-04 — Completed the 52/52 security baseline; shipped the four full-pipeline flagships (StatsPAI + explicit Python / Stata / R); launched the original chinese-de-aigc skill.
  • Earlier — Grew from 43 collections to a curated map of 119 repos / 23,000+ skills; added bilingual README, academic data MCP servers, and multi-agent systems.

Contributing & citation

Contributions welcome — see CONTRIBUTING.md and the docs/SKILL_SUBMISSION_GUIDE.md. We especially welcome social-science skills (economics, political science, sociology, psychology, education, public health), new causal-inference implementations, MCP servers for academic/government data, Chinese-friendly skills, and multi-agent case studies. New submissions must declare source, license, and category for the provenance audit.

If AERS helps your work, please cite it (CITATION.cff) and star the repo so more researchers can find it.


AI is an amplifier, not a replacement. It handles the heavy lifting; you keep the core judgment.




Visit copaper.ai
WeChat: CoPaper.AI

20 built-in methodology skills · 20-minute empirical paper · powered by StatsPAI (900+ functions, MIT)


Maintained by CoPaper.AI, incubated at Stanford REAP / SCCEI · AI Assistant for Empirical Research

About

🔬 A curated collection of 23,000+ agent skills for empirical research across 8 social science disciplines. CoPaper.AI produces a reproducible, standards-compliant empirical paper in 20 minutes and supports user-uploaded skills. Maintained by CoPaper.AI from Stanford REAP.
