🌐 Language: English | 简体中文 | 繁體中文 | 日本語 | 한국어
Stanford REAP × CoPaper.AI · An academic–industrial AI toolkit for empirical research
Built by Stanford's empirical-methodology team — the full pipeline from data cleaning to top-journal submission
The empirical-research specialist's agent-skills distribution. Not a marketing list — 1,072 skills vendored and cataloged in this repo, wrapped in a numeric benchmark, an eval harness, a security audit, and CI, plus a curated map of 23,000+ skills across 119 repositories in the wider ecosystem.
AERS is two things at once: (1) a small set of first-party flagship skills that run the full empirical pipeline — data cleaning → identification → estimation → robustness → tables/figures → submission-ready draft — and (2) a curated, security-aware catalog of the empirical-research skill ecosystem, organized by research-workflow stage. The differentiator is not the count; it is that the flagship behavior is verified against known answers, not asserted.
Note
Renamed. This project was formerly Awesome Agent Skills for Empirical Research. GitHub redirects the old URL automatically; please update your remote:
git remote set-url origin https://github.com/brycewang-stanford/Auto-Empirical-Research-Skills.git
- What you actually get (the numbers, precisely)
- Verify it yourself in 2 minutes
- Why trust this — three layers
- The flagship pipeline skills
- Start here — pick a skill in 30 seconds
- What makes this more than a 23K-skill dump
- Browse the landscape
- Security
- Changelog
- Contributing & citation
Numbers in this README are kept honest and disambiguated. "Vendored" means the files live in this repo and are tracked in a generated catalog; "cataloged ecosystem" means curated links to external repositories.
| What it is | Count | Source of truth |
|---|---|---|
| Skills vendored into this repo and cataloged | 1,072 | catalog/skills.json |
| Vendored collections | 64 | catalog/skills.json |
| First-party flagship full-pipeline skills (StatsPAI DSL + explicit Python/Stata/R) | 4 | skills/00* |
| Numeric benchmark tasks with gold values recomputed from data each run | 5 | benchmark/ |
| Behavioral eval scenarios / rubric items | 17 / 95 | eval-harness/ |
| Security audit of the original baseline (collections / files) | 52 / 2,940+, 52/52 CLEAN | SECURITY-SCAN-REPORT.md |
| Curated map of the wider ecosystem | 23,000+ skills / 119 repos | this README · docs/SKILL_CATALOG.md |
| Tools catalog (tools/): causal/econometrics libraries, autonomous research agents, MCP servers, causal discovery, benchmark datasets | 335 tools / 6 categories | tools/tools.json · tools/CATALOG.md |
The security audit covered the original 52-collection / 2,940-file baseline (52/52 CLEAN). Skills vendored after that baseline are tracked in catalog/provenance.json, docs/LICENSE_AUDIT.md, and docs/SKILL_AUDIT.md; run make audit before relying on them in high-trust contexts.
The most persuasive thing here is not a number — it is that the flagship pipeline's behavior is checkable without an API key or paid model. Just Python 3:
git clone https://github.com/brycewang-stanford/Auto-Empirical-Research-Skills.git
cd Auto-Empirical-Research-Skills
make check # repo validation + unit tests + eval lint + numeric benchmark
The benchmark is the convincing part: it recomputes the gold answer from the raw dataset on every run, so a passing score cannot be faked by hard-coding a number. Out of the box it recovers:
- LaLonde (1986) / Dehejia–Wahba (1999) — the naive observational comparison gets the wrong sign (−$635); covariate adjustment flips it positive (≈ +$1,548) toward the experimental benchmark (≈ +$1,794).
- Card (1995) — IV return to schooling (0.131) exceeds OLS (0.075), with the first-stage F (13.3) reported rather than hidden.
- Plus staggered-DID (TWFE bias vs. group-time truth), sharp RDD, and a bad-control / post-treatment-bias trap.
A pipeline passes only if it surfaces the trap, refuses to headline the misleading number, and matches the recomputed truth. See benchmark/ and the full trust overview in docs/TRUST.md.
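To make the pass rule concrete, here is a minimal sketch of a gate that recomputes its gold value from the data on every call. It is not the code in benchmark/: the toy records, the function names, and the tolerance are all assumptions chosen for illustration, and the numbers are not LaLonde data. The toy data is built so that the naive comparison has the wrong sign, while a comparison within strata recovers a positive effect.

```python
from statistics import mean

# Hypothetical toy records: (treated, stratum, outcome). Not real data.
ROWS = [
    (1, "low", 2.0), (1, "low", 3.0), (1, "low", 2.5), (1, "high", 9.0),
    (0, "low", 1.0), (0, "high", 8.0), (0, "high", 7.0), (0, "high", 7.5),
]


def naive_difference(rows):
    """Raw treated-minus-control difference in mean outcomes."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def stratified_difference(rows):
    """Within-stratum differences, weighted by each stratum's treated share."""
    total_treated = sum(1 for t, _, _ in rows if t == 1)
    effect = 0.0
    for stratum in sorted({s for _, s, _ in rows}):
        treated = [y for t, s, y in rows if s == stratum and t == 1]
        control = [y for t, s, y in rows if s == stratum and t == 0]
        weight = len(treated) / total_treated
        effect += weight * (mean(treated) - mean(control))
    return effect


def passes(reported, rows, tol=1e-6):
    """Recompute the gold value from the rows, then compare the reported one."""
    gold = stratified_difference(rows)
    return abs(reported - gold) <= tol


naive = naive_difference(ROWS)          # -1.75: wrong sign
adjusted = stratified_difference(ROWS)  # 1.5: sign flips after adjustment
```

Because the gold value is derived inside passes rather than stored, editing the data changes the target, which is the property the paragraph above relies on.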
💡 Want it hosted and end-to-end? Skip the assembly — copaper.ai runs the empirical pipeline for you, built alongside this catalog by the same Stanford methodology team.
| Layer | Anchor | What it brings |
|---|---|---|
| 🏛️ Academic lineage | Stanford REAP / SCCEI — Stanford Center on China's Economy and Institutions | A research center with a sustained publication record in empirical-economics methodology and a deep tradition in applied causal inference. |
| 🔧 Engineering delivery | CoPaper.AI — empirical-research AI assistant | Ships 20 econometric-methodology skills (DID / IV / RDD / PSM / DML, …) behind a Supervisor + 4-sub-agent architecture, one-sentence triggers, automatic publication-ready output. |
| ⚙️ Open-source engine | StatsPAI — the causal-inference engine | 900+ functions · one import statspai as sp · JOSS in submission · MIT. Every DID / IV / RD / SCM estimate CoPaper.AI produces is driven by StatsPAI, and this catalog is part of that ecosystem. |
Four parallel implementations of the same 8-step empirical loop — data cleaning → variable construction → descriptives → diagnostics → estimation → robustness → mechanism/heterogeneity → publication-ready tables & figures — plus the submission and de-AIGC stacks. Each uses progressive disclosure: a thin canonical-call spine in SKILL.md, with deep per-step reference manuals loaded only on demand. They coexist; pick by stack and use case.
| Skill | Stack | Best for |
|---|---|---|
| StatsPAI 🔥 | Agent-native Python DSL: one sp.causal(...) runs the loop; 900+ functions, self-describing API, unified CausalResult | Whole-pipeline automation in one agent call when you trust the DSL |
| Full Empirical Analysis — Python 📘 | Explicit stack: pandas · statsmodels · linearmodels · pyfixest · rdrobust · econml · causalml | Teaching, referee-level line-by-line audit, strict replication needing full control |
| Full Empirical Analysis — Stata 📊 | Community standard: reghdfe · ivreg2 · csdid · did_imputation · sdid · rdrobust · synth · psmatch2 · boottest · esttab | When a referee or co-author insists on a Stata replication pack (AER/QJE/JPE/ReStud style) |
| Full Empirical Analysis — R 📗 | Modern tidyverse: fixest · did · synthdid · HonestDiD · rdrobust · grf · DoubleML · marginaleffects · Quarto | Single-.qmd reproducibility reports rendered to PDF/HTML/Word in one command |
| AER-Skills 📕 | 9 skills: topic routing → identification audit → robustness → intro → tables → replication → submission → R&R → orchestrator | Top-5 economics (AER / AER:Insights / AEJ) submission: identification-first — fragile design, no prose saves it |
| chinese-de-aigc 🇨🇳 | 17-pattern Chinese AI-tell library, 5-step locate→diagnose→rewrite→score→review loop | Lowering AI-writing signal for CNKI / Wanfang / VIP / Turnitin-Chinese submissions |
Why a DSL and explicit ports? Reach for StatsPAI when you trust the one-shot DSL; reach for 00.1/00.2/00.3 when you are teaching, auditing, or must swap every diagnostic by hand. AER-skills then takes a correct analysis to acceptance threshold — these solve different problems and compose.
| Goal | Start with |
|---|---|
| Run a complete empirical pipeline | StatsPAI (or Python · Stata · R) |
| Audit a top-5 identification strategy first | aer-identification |
| Prepare an AER / AEJ submission | aer-workflow |
| Build an AEA-ready replication package | aer-replication |
| Lower the AI-writing signal of a Chinese draft | chinese-de-aigc |
More ways in:
- Not sure which skill? → docs/CHOOSING_A_SKILL.md · faceted search: docs/search.html
- First 10 minutes, end to end → docs/GETTING_STARTED.md
- Copy-paste a full workflow → docs/GOLDEN_WORKFLOWS.md
- Install into a runtime / use without installing → docs/INSTALL.md
- Machine-readable index → catalog/skills.json · taxonomy: docs/TAXONOMY.md · full catalog: docs/SKILL_CATALOG.md
- FAQ → docs/FAQ.md
Public-skill counts are easy to inflate, and recent studies show large skill indexes are often redundant and occasionally unsafe. AERS competes on verifiable quality, not raw count. Every layer below runs locally via make check and in CI.
| Layer | What it catches | Where |
|---|---|---|
| Numeric benchmark | Reported numbers that don't match truth recomputed from real data — the naive-DID sign trap, weak-IV without first-stage F, TWFE bias under staggered timing, RDD trend confound, post-treatment bad controls | benchmark/ · 5 tasks |
| Eval harness | Prose-level failures: weak-IV false reassurance, staggered-DID TWFE misuse, fabricated citations, unsafe curl \| bash setup, multiple-testing abuse, AER compliance gaps | eval-harness/ · 17 scenarios / 95 rubric items |
| Security audit | Pipe-to-shell, reverse shells, credential exfiltration, prompt injection across 13 risk categories — 6-phase, 40+ hook scripts reviewed by hand | SECURITY-SCAN-REPORT.md |
| Provenance & license | Unvendored sources, license risk, hygiene drift across all 1,072 cataloged skills | docs/LICENSE_AUDIT.md · docs/SKILL_QUALITY.md |
| CI & compatibility | Catalog freshness, broken local links, GitHub Actions policy, Python 3.9 and 3.12 syntax floor | .github/workflows/ · 6 workflows |
make catalog # regenerate catalog, provenance, audit, enrichment
make validate # freshness + link / frontmatter checks
make check # full gate: validate + Python compile + unit tests + eval lint + benchmark
The trust surface is necessary, not sufficient: regex rubrics don't certify prose and a small benchmark doesn't cover every design. It is built to fail fast on known high-cost mistakes. Read the honest scope in docs/TRUST.md and docs/QUALITY_GATE.md.
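As an illustration of the weak-IV row in the table above, the sketch below reports the OLS slope, the just-identified IV (Wald) estimate, and the first-stage F together, so the F statistic cannot be left out. It is not the repository's benchmark code: the function names are hypothetical, the eight data points are invented, and the F < 10 cutoff is a common rule of thumb adopted here as an assumption.

```python
from statistics import mean


def slope(x, y):
    """OLS slope of y on x with an intercept: cov(x, y) / var(x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def first_stage_f(z, x):
    """F statistic for a single instrument z in the regression of x on z."""
    b = slope(z, x)
    a = mean(x) - b * mean(z)
    ssr = sum((xi - a - b * zi) ** 2 for zi, xi in zip(z, x))
    szz = sum((zi - mean(z)) ** 2 for zi in z)
    variance_of_b = (ssr / (len(x) - 2)) / szz
    return b ** 2 / variance_of_b  # with one instrument, F equals t squared


def iv_report(z, x, y, f_threshold=10.0):
    """Return OLS, IV, and the first-stage F side by side."""
    f_stat = first_stage_f(z, x)
    return {
        "ols": slope(x, y),
        "iv": slope(z, y) / slope(z, x),  # cov(z, y) / cov(z, x)
        "first_stage_f": f_stat,
        "weak_instrument": f_stat < f_threshold,
    }


# Invented toy data: a confounder raises x and lowers y, biasing OLS downward.
Z = [0, 0, 0, 0, 1, 1, 1, 1]
X = [1, 3, 1, 3, 2, 4, 2, 4]
Y = [1.5, 2.5, 1.5, 2.5, 2.5, 3.5, 2.5, 3.5]
report = iv_report(Z, X, Y)
```

On this toy data the IV estimate (1.0) exceeds OLS (0.6), but the first-stage F is only 1.5, so the report flags the instrument as weak instead of presenting the IV number on its own.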
Topic Ideation → Lit Search → Deep Reading → Research Design → Data Collection
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
01 02 03 01 04
Data Cleaning → Statistical Analysis → First Draft → Revision → Typesetting
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
04 05 06 07 08
Replication → Submission → Peer Review Response → Defense
│ │ │ │
▼ ▼ ▼ ▼
09 10 10 10
Per-stage skill notes (bilingual): 01 Topic & design · 02 Lit review · 03 Paper reading · 04 Data & cleaning · 05 Causal inference · 06 Writing · 07 Revision · 08 Citation & typesetting · 09 Replication · 10 Review response
The pain point AERS exists to fix: ask an AI to "run a DID" and it gives the baseline regression and stops. "Parallel trends?" — it adds one. "Placebo?" — another. Every time, like squeezing toothpaste. A skill is a methodology playbook for the agent: it already knows a complete DID means parallel-trends → baseline → robustness battery → heterogeneity → mechanism, with a defined output at each step.
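One way to picture a skill as a playbook is an ordered checklist the agent consults before it stops. The sketch below is only an illustration under stated assumptions: the step order comes from the paragraph above, while the output labels and function names are hypothetical and do not reflect how any SKILL.md in this repo is structured.

```python
# Step order follows the text above; the output labels are hypothetical.
DID_PLAYBOOK = [
    ("parallel_trends", "event-study plot"),
    ("baseline", "main regression table"),
    ("robustness", "robustness table"),
    ("heterogeneity", "subgroup table"),
    ("mechanism", "mechanism table"),
]


def next_step(completed):
    """Return the first playbook step not yet completed, or None when done."""
    for name, output in DID_PLAYBOOK:
        if name not in completed:
            return name, output
    return None


def is_complete(completed):
    """A DID analysis counts as complete only when every step has run."""
    return next_step(completed) is None


# An agent that ran only the baseline regression is told what is still owed.
pending = next_step({"baseline"})
```

With only the baseline done, next_step points back to the parallel-trends check, so the remaining steps are not left for the user to request one by one.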
Academic research — general-purpose research suites (K-Dense, AI-Research-SKILLs, claude-scholar, …)
| Suite | Stars | # Skills | Key features |
|---|---|---|---|
| K-Dense-AI/claude-scientific-skills | 8,799 | 140+ | 28+ scientific databases (OpenAlex, PubMed); scientific-writing + literature-review + statistical-analysis |
| Orchestra-Research/AI-Research-SKILLs | 3,637 | 87 | 22 categories, ML paper writing, LaTeX templates, citation verification |
| Imbad0202/academic-research-skills | ~1,790 | Multiple | Full paper pipeline (research → write → review → revise → finalize), style calibration, hallucination detection |
| Galaxy-Dawn/claude-scholar | - | 25+ | Full research lifecycle: ideation → review → experiments → writing → review response; Zotero MCP |
| luwill/research-skills | 209 | 3 | Research-proposal generation, medical review writing, paper-to-slides, bilingual |
| lishix520/academic-paper-skills | 22 | 2 | Strategist (7-dimension reviewer simulation) + Composer (systematic writing) |
| Data-Wise/claude-plugins | - | 17 | Statistical research: arXiv search, DOI lookup, BibTeX, methodology writing, referee response |
Economics / causal inference — the first-party flagships plus community Stata/IV/feedback suites
The first-party flagships (StatsPAI, Python, Stata, R, AER-skills) are described above. Community complements:
| Suite | Key features | Use case |
|---|---|---|
| CoPaper.AI | 20 methodology skills, Supervisor + 4 sub-agents, smart routing, automatic output | Full empirical-economics workflow, hosted |
| claesbackman/AI-research-feedback | 2-agent pre-review: causal-overclaiming detection, identification assessment (AER/QJE/JPE/Econometrica/REStud); 6-agent grant review | Pre-submission self-review, grants |
| fuhaoda/stats-paper-writing-agent-skills | LaTeX statistical-paper writing, front-end draft generation | Statistics & econometrics papers |
| dylantmoore/stata-skill | Full Stata coverage: syntax, data management, econometrics, causal inference, Mata, 20+ packages | Stata users |
| SepineTam/stata-mcp | LLM drives Stata regressions directly via MCP | Stata econometrics |
| hanlulong/stata-mcp | Stata-MCP editor extension (VS Code/Cursor/Antigravity): run .do directly, live output, data/graph viewer; MIT · 414★ (same name as SepineTam above, different project) | In-editor AI pairing with Stata |
| tmonk/mcp-stata · vendored at skills/64 | 20 SKILL.md skills from the Stata MCP server: replication / data audit / publication QA / legacy modernization / referee response / power / causal inference; AGPL-3.0 (kept as a separately-licensed aggregate; server code not vendored) | Stata replication & robustness audits |
| PovertyAction/ipa-stata-template | IPA reproducible Stata research template + .claude/skills: numbered pipeline, assertion-based defensive programming, LaTeX tables; MIT | Development economics / field-RCT replication |
| lcrawfurd/claude-skills | Academic skills: paper / code review, referee, pre-submission; code-review encodes Stata/R/Python coding standards (DIME / Reif / AEA Data Editor) | Pre-submission review & code audit |
| AEADataEditor/replication-template | AEA Data Editor's official replication-package template (Stata-centric, REPLICATION.md); the reproducibility "gold standard" | AEA / top-journal replication packaging |
Finance · education & public health · law · marketing · product · general agents
Finance & investment — financial-services-plugins (Anthropic official) · OctagonAI/skills · tradermonty/claude-trading-skills · himself65/finance-skills · quant-sentiment-ai/claude-equity-research
Education & public health — GarethManning/claude-education-skills · FreedomIntelligence/OpenClaw-Medical-Skills (869 medical skills: epidemiology, surveillance, clinical research, drug safety, biostatistics)
Governance, compliance & law — Claude-Skills-Governance-Risk-and-Compliance (ISO 27001 / SOC 2 / GDPR / HIPAA) · zubair-trabzada/ai-legal-claude · evolsb/claude-legal-skill
Marketing & consumer behavior — coreyhaines31/marketingskills · zubair-trabzada/ai-marketing-claude · ericosiu/ai-marketing-skills
Product & organizational behavior — phuryn/pm-skills (100+ skills) · mastepanoski/claude-skills (Nielsen heuristics, NIST AI RMF, ISO 42001)
General agent capabilities — lyndonkl/claude (85 skills + 6 orchestrators) · alirezarezvani/claude-skills (220+ skills, ~5,200★) · rohitg00/awesome-claude-code-toolkit · jeremylongshore/claude-code-plugins-plus-skills (1,367 skills) · posit-dev/skills (Posit official)
One of 2026's sharpest pain points: papers failing AIGC detection (Turnitin, GPTZero, CNKI) can be rejected outright. The skills below are the most complete open-source solutions; all are MIT-licensed and locally archived (skills/44-48).
| Suite | Key features | Best for | Local |
|---|---|---|---|
| chinese-de-aigc 🇨🇳 | Original Chinese academic de-AIGC by CoPaper.AI; 17-pattern Chinese-tell library, 5-step loop, per-section strategy, 5-dim scoring. The only GitHub skill dedicated to Chinese academic de-AIGC | CNKI / Wanfang / VIP / Turnitin-Chinese | 48 |
| matsuikentaro1/humanizer_academic | Academic-specific; 23 AI-writing patterns; preserves legitimate academic transitions | Medical, life-science, natural-science papers | 44 |
| stephenturner/skill-deslop | Distinguishes legitimate discipline conventions from AI tells; 5-dimension scoring | Scientific papers, technical blogs | 45 |
| hardikpandya/stop-slop | 3-layer detection + 5-dim scoring; banned phrases, structural clichés, sentence rules | General prose, blogs, reports | 46 |
| conorbronsdon/avoid-ai-writing | Structured audit + rewrite + second-pass audit; auditable, traceable | Workflows needing a paper trail | 47 |
Combos: 🇨🇳 Chinese (CNKI/Wanfang/VIP) → chinese-de-aigc · 🇬🇧 English → humanizer_academic · need an audit trail → avoid-ai-writing · general prose → stop-slop.
Unlike the skills above, tools/ catalogs the software and services an agent (or researcher) actually invokes: structured, license- and maintenance-aware, and wired into make validate. Source of truth: tools/tools.json; browsable list: tools/CATALOG.md.
335 tools across 6 categories (curated 2026-06):
- Causal-inference / treatment-effect libraries (32) — DoWhy · EconML · CausalML · DoubleML · CausalPy · causallib · grf · CATENets · TMLE family · Mendelian randomization …
- Econometrics / quasi-experimental libraries (170) — panel FE · DiD (incl. modern/staggered) · event study · RDD · IV · synthetic control/SDID · matching & weighting · sensitivity (fixest · did · HonestDiD · rdrobust · synthdid · reghdfe · csdid · sdid · pyfixest · linearmodels …); plus spatial econometrics (spdep · PySAL/spreg · GeoDa), local projections/IRF & (S)VAR (lpirfs · vars · svars), survey weighting/MRP/raking (survey · samplics · balance), and meta-analysis (metafor · meta · netmeta · metan) — across R/Python/Stata/Julia.
- Autonomous research / data-science agents (51) — end-to-end research & data analysis: AI-Scientist · data-to-paper · Agent Laboratory · RD-Agent · AI-Researcher · STORM · PaperQA2 · gpt-researcher · DeepAnalyze · MetaGPT (DI) · Biomni … (⚠️ includes non-OSI / no-LICENSE repos; confirm terms before use).
- MCP servers (48) — stats execution (StatsPAI · stata-mcp · R/Jupyter MCP) + data access (FRED · World Bank · IMF · OECD · Eurostat · Census · BEA · BLS · SEC EDGAR · OpenAlex · Semantic Scholar · PubMed · Zotero · arXiv …).
- Causal discovery / structure learning (25) — causal-learn · Tetrad/py-tetrad · gCastle · CDT · tigramite (PCMCI) · LiNGAM · NOTEARS/DAGMA · pcalg · bnlearn · pgmpy …
- Benchmarks & datasets (9) — causaldata · IHDP/Twins · ACIC competition data · RealCause · JustCause · Tübingen cause-effect pairs · bnlearn network repository …
Full write-up: tools/README.md.
Multi-agent collaboration systems — paper revision, autonomous research, data-science teams
Role separation beats a single agent because the reviewer is independent of the drafter — the same logic as peer review.
Paper revision & writing: copy-edit-master (3 sub-agents, Strunk & White / McCloskey rules) · introduction-writer (strategist → drafter → reviewer → reviser) · CoPaper.AI PaperAgent (Supervisor + 4 sub-agents).
Autonomous research & data science: ruc-datalab/DeepAnalyze · business-science/ai-data-science-team · HKUDS/AI-Researcher (NeurIPS 2025 Spotlight) · wanshuiyin/ARIS · SamuelSchmidgall/AgentLaboratory (84% cost reduction) · SakanaAI/AI-Scientist-v2 · assafelovic/gpt-researcher · pedrohcgs/claude-code-my-workflow (Emory).
Academic data MCP servers — OpenAlex, Semantic Scholar, FRED, World Bank, Zotero, …
xingyulu23/Academix · Eclipse-Cj/paper-distill-mcp · oksure/openalex-research-mcp (240M+ works) · openags/paper-search-mcp (20+ sources) · lzinga/us-gov-open-data-mcp (40+ US gov APIs) · stefanoamorelli/fred-mcp-server (FRED 800K+ series) · llnOrmll/world-bank-data-mcp · 54yyyu/zotero-mcp
Skill aggregation platforms & learning resources
Platforms: VoltAgent/awesome-agent-skills (1,000+) · sickn33/antigravity-awesome-skills (1,340+) · VoltAgent/awesome-openclaw-skills (5,400+) · skills.sh · ClawHub (13,729) · Anthropic official skills.
Learning: Claude Code Skills guide (PDF) · Agent Skills Standard · Causal Inference for the Brave and True · Awesome AI for Economists · Awesome Econ AI Stuff.
The original 52 skill collections / 2,940+ files passed a systematic audit (52/52 CLEAN, zero FLAGGED): no malicious prompts, viruses, reverse shells, or prompt injection. Every "sensitive" hit was verified as one of three legitimate categories: defensive security rules, legitimate academic API calls (arXiv / CrossRef / PubMed / FRED / World Bank / OECD / BLS), or standard Claude Code workflow hooks (all local file ops, zero network IO).
Six-phase, defense-in-depth: automated grep across 13 risk categories → 100% manual review of all 6 hook-bearing skills and their 40+ hook scripts (no Bash(*) wildcards anywhere) → three parallel agent content audits → supplemental integrity checks (hidden Unicode, encoding anomalies, HTML injection, network imports).
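The automated grep phase can be pictured as a small pattern scan like the one below. This is a sketch under stated assumptions, not the audit's actual tooling: the three patterns stand in for the 13 risk categories, the category names and the sample text are invented, and example.com is a placeholder URL. A hit is only a candidate; as the audit describes, each one still needs manual review.

```python
import re

# Hypothetical subset of risk patterns; the real audit covers 13 categories.
RISK_PATTERNS = {
    # curl or wget output piped straight into a shell
    "pipe_to_shell": re.compile(
        r"\b(?:curl|wget)\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba|z)?sh\b"
    ),
    # unrestricted Bash permission wildcard
    "bash_wildcard": re.compile(r"Bash\(\*\)"),
    # zero-width and byte-order-mark characters that can hide instructions
    "hidden_unicode": re.compile("[\u200b-\u200d\u2060\ufeff]"),
}


def scan(text):
    """Return (line_number, category) for every line matching a risk pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for category, pattern in RISK_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, category))
    return hits


# Invented sample: line 1 pipes to a shell, line 2 is a plain API call.
SAMPLE = (
    "curl -fsSL https://example.com/install.sh | bash\n"
    "curl https://api.crossref.org/works?query=did\n"
    "allowed-tools: Bash(*)\n"
)
hits = scan(SAMPLE)
```

The plain API call on line 2 produces no hit, which mirrors the distinction the report draws between pipe-to-shell setup and legitimate academic API calls.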
Key insight: largest ≠ riskiest. The biggest skills all passed; 17-DAAF actually sets the bar for security-conscious design (14 defensive hooks + 32 deny rules + active credential scanning).
Newer vendored additions are tracked in catalog/provenance.json and docs/SKILL_AUDIT.md — run make audit. Full report: SECURITY-SCAN-REPORT.md.
The narrative changelog has moved to CHANGELOG.md. Recent highlights:
- 2026-05 — Vendored AER-skills (top-5 economics submission stack, 9 skills) with weekly upstream sync; expanded the numeric benchmark to 5 causal-recovery tasks and the eval harness to 17 scenarios / 95 rubric items.
- 2026-04 — Completed the 52/52 security baseline; shipped the four full-pipeline flagships (StatsPAI + explicit Python / Stata / R); launched the original chinese-de-aigc skill.
- Earlier — Grew from 43 collections to a curated map of 119 repos / 23,000+ skills; added bilingual README, academic data MCP servers, and multi-agent systems.
Contributions welcome — see CONTRIBUTING.md and the docs/SKILL_SUBMISSION_GUIDE.md. We especially welcome social-science skills (economics, political science, sociology, psychology, education, public health), new causal-inference implementations, MCP servers for academic/government data, Chinese-friendly skills, and multi-agent case studies. New submissions must declare source, license, and category for the provenance audit.
If AERS helps your work, please cite it (CITATION.cff) and star the repo so more researchers can find it.
AI is an amplifier, not a replacement. It handles the heavy lifting; you keep the core judgment.
Visit copaper.ai · WeChat: CoPaper.AI
20 built-in methodology skills · 20-minute empirical paper · powered by StatsPAI (900+ functions, MIT)
Maintained by CoPaper.AI, incubated at Stanford REAP / SCCEI · AI Assistant for Empirical Research