Claude Code as a GitHub-Native Agent: The Issue-to-Merge Development Loop

Most write-ups of "Claude Code in GitHub" are screenshots of an @claude mention turning into a pull request, captioned as if the magic were the code generation. It isn't. The code generation is the easy part. The hard part — the part that decides whether this works on pull request #50 as well as it did on #1 — is where the state lives. Each Claude session starts with a fresh context window, by design. If your workflow depends on the agent "remembering" the last PR, it degrades the moment the session ends. If your workflow externalizes everything durable into GitHub and repo files, a cold agent rehydrates in seconds and the chat being gone doesn't cost you anything.

This post is a technical report on running Claude Code as an autonomous agent across a GitHub-centric software development lifecycle, from issue intake through merge. It's synthesized from a fact-checked deep-research pass (25 claims adversarially verified, 0 refuted; sources overwhelmingly first-party Anthropic docs plus the anthropics/claude-code-action repo). I've kept a hard line between verified mechanics and vendor-framed efficacy claims, and there's a myth-busting section near the end for the figures that get repeated but don't hold up — including the viral "100% of PRs at Anthropic are written by Claude" stat, which did not appear among the verified claims.

The mental model: durable state vs. ephemeral context

Everything else follows from one idea:

GitHub is the system of record. Claude Code is the agent that moves work through it. The agent's conversation is ephemeral; everything that must survive across PRs lives in GitHub or in repo files.

You don't fight the fresh-context-per-session design by keeping one giant session alive forever. You externalize state into durable, reloadable artifacts and let each session reload only the slice it needs.

        DURABLE (survives every session)            EPHEMERAL (wiped each session)
   ┌──────────────────────────────────────┐      ┌──────────────────────────────┐
   │                                       │      │                              │
   │  CLAUDE.md + repo docs                │      │   conversation context       │
   │     → HOW the codebase works          │      │      → working memory for     │
   │     → auto-injected every session     │      │        the current task       │
   │                                       │      │      → gone when the session  │
   │  GitHub Issues                        │      │        ends                   │
   │     → WHAT to do and WHY (the spec)   │      │                              │
   │     → loaded on demand: gh issue view │      └──────────────────────────────┘
   │                                       │                    ▲
   │  PRs / commits                        │                    │
   │     → WHAT was done, linked to issue  │         a fresh session reloads
   │     → walk: git blame → commit → PR   │         durable state and rebuilds
   │                                       │         this from scratch — cheaply
   └──────────────────────────────────────┘

| Layer | What it holds | How the agent loads it | Persistence |
| --- | --- | --- | --- |
| `CLAUDE.md` + repo docs | How the codebase works (architecture, conventions, commands) | Auto-injected every session | Durable (git) |
| GitHub Issues | What to do and why (task spec, decision log) | On demand via `gh issue view` | Durable (GitHub) |
| PRs / commits | What was done, linked to the issue | `git blame` → commit → PR → issue | Durable (git/GitHub) |
| Conversation context | Working memory for the current task | N/A (wiped each session) | Ephemeral |

A cold agent on PR #50 should understand your codebase as fast as it did on PR #1, because the understanding lives on disk, not in chat. The rest of this post is essentially the operational consequences of taking that seriously.
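To make the reload concrete, here is a minimal sketch of what a cold session has to gather before it starts work. The function name `rehydration_plan` and the returned dictionary shape are hypothetical, and the directory walk is only a model of "root brief plus nested briefs", not a description of how Claude Code's own loader is implemented. The `gh issue view` command is built but not executed.

```python
from pathlib import Path


def rehydration_plan(repo_root: str, issue_number: int, work_dir: str = ".") -> dict:
    """List what a cold session reloads: CLAUDE.md briefs on disk plus the issue spec."""
    root = Path(repo_root).resolve()
    target = (root / work_dir).resolve()
    if root != target and root not in target.parents:
        raise ValueError("work_dir must be inside the repository")

    # Walk from the working directory up to the repo root, collecting briefs.
    briefs = []
    for directory in [target, *target.parents]:
        candidate = directory / "CLAUDE.md"
        if candidate.is_file():
            briefs.append(candidate.relative_to(root).as_posix())
        if directory == root:
            break
    briefs.reverse()  # root brief first, most specific brief last

    return {
        "briefs": briefs,
        # The spec comes from GitHub on demand, never from chat history.
        "issue_command": ["gh", "issue", "view", str(issue_number), "--comments"],
    }
```

Nothing in the plan refers to a previous conversation, which is the point: every input is either a file in git or a record in GitHub.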

The end-to-end loop at a glance

Here's the whole cycle. Every arrow is a documented, verified mechanism, and the following sections walk through each one.

  ┌──────────┐      ┌──────────────┐      ┌──────────────────────────────────┐
  │  ISSUE   │ ───► │   TRIGGER    │ ───► │   Explore → Plan → Implement →    │
  │ (the     │      │  @claude /   │      │            Commit                 │
  │  spec)   │      │  assign /    │      │     (the Claude Code agent)       │
  └──────────┘      │  label       │      └────────────────┬─────────────────┘
                    └──────────────┘                       │
                                                           ▼
                                              ┌──────────────────────────┐
                                              │   PR opened (Closes #N)   │
                                              │        via gh CLI         │
                                              └────────────┬─────────────┘
                                                           │
                                                           ▼
                                              ┌──────────────────────────┐
                                              │     AUTOMATED REVIEW      │
                                              │  action / Code Review app │
                                              │  /code-review / security  │
                                              └────────────┬─────────────┘
                                       new commits         │  severity-tagged
                                    (synchronize) re-fires │  inline comments
                                              ▲            ▼
                                              │   ┌──────────────────────┐
                                              └───│  Human approval +     │
                                                  │  CI gates             │
                                                  └──────────┬───────────┘
                                                             ▼
                                                  ┌──────────────────────┐
                                                  │        MERGE          │
                                                  │  (human / CI logic —  │
                                                  │   never the agent)    │
                                                  └──────────────────────┘

The two load-bearing constraints baked into this picture: the agent enters through GitHub primitives (not a chat box), and it cannot merge — the final gate is always a human or CI logic, never agent self-approval. Both are covered below.

Issue intake & specs (confidence: high — first-party)

GitHub Issues are the durable, traceable task context. Claude is brought into an issue or PR three ways, all auto-detected by the v1 claude-code-action:

                          ┌─────────────────────────────┐
   1.  @claude mention ──►│                             │
       (default phrase,   │      claude-code-action     │──► analyzes code,
        configurable      │           (v1)              │    implements, opens PR
        via trigger_phrase│                             │
                          │   auto-detects MODE:        │
   2.  Assignment      ──►│   • interactive             │
       (assignee_trigger) │       (mention/assign/label)│
                          │   • automation              │
   3.  Label "claude"  ──►│       (explicit workflow    │
       (label_trigger)    │        prompt)              │
                          └─────────────────────────────┘

The v1 action auto-detects mode: interactive (responding to a mention, assignment, or label) vs. automation (an explicit prompt you supply in the workflow). That mode auto-detection is what turns "an issue" into "a PR" without a human writing the connecting code.

Why issues, not chat, hold the spec: the issue thread is a recoverable, timestamped record of the task and the reasoning behind it. The agent's recollection of an issue is ephemeral — so the discipline is: write decisions in the issue, and have the agent re-read it via gh issue view N at the start of each session, rather than trusting it to "remember." This is the same externalization principle from the mental model, applied to the what and the why.
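The trigger and mode logic described above can be modeled in a few lines. This is an illustrative sketch, not the action's actual implementation: the `TriggerConfig` and `Event` classes and the `detect_mode` function are hypothetical names, the assumption that an unset `assignee_trigger` means "disabled" is mine, and so is the assumption that an explicit workflow prompt takes precedence over the interactive triggers.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TriggerConfig:
    trigger_phrase: str = "@claude"          # default phrase per the text
    assignee_trigger: Optional[str] = None   # assumption: unset means disabled
    label_trigger: Optional[str] = "claude"  # label name used in the text


@dataclass
class Event:
    comment_body: str = ""
    assignees: list = field(default_factory=list)
    labels: list = field(default_factory=list)


def detect_mode(event: Event, config: TriggerConfig, workflow_prompt: str = "") -> Optional[str]:
    """Return 'automation', 'interactive', or None when nothing triggers."""
    if workflow_prompt.strip():
        return "automation"  # assumption: an explicit prompt wins
    # Require a word boundary so "@claudette" does not count as a mention.
    phrase = re.escape(config.trigger_phrase) + r"\b"
    mentioned = re.search(phrase, event.comment_body) is not None
    assigned = config.assignee_trigger is not None and config.assignee_trigger in event.assignees
    labeled = config.label_trigger is not None and config.label_trigger in event.labels
    return "interactive" if (mentioned or assigned or labeled) else None
```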

Sources: code.claude.com/docs/en/github-actions, github.com/anthropics/claude-code-action, …/docs/usage.md.

Setup & infrastructure model (confidence: high — first-party)

Install path: run /install-github-app from the Claude Code terminal. It installs the Claude GitHub App, which requests read & write on Contents, Issues, and Pull requests, and configures the API key secret.
Where it runs: the action executes entirely on your own GitHub runner; model API calls go to your chosen provider (Anthropic direct, Amazon Bedrock, Google Vertex, or Foundry). Apart from those model calls, nothing about your code leaves your runner.
Secret name — watch out: claude-code-action uses ANTHROPIC_API_KEY, while the separate anthropics/claude-code-security-review example uses CLAUDE_API_KEY. Copy-paste errors between the two are a common, silent source of "why isn't this running."

   your GitHub repo
        │
        │  /install-github-app  ──► Claude GitHub App
        │                           (Contents R/W, Issues R/W, PRs R/W)
        ▼
   ┌──────────────────────────────────────────────────┐
   │   YOUR GitHub Actions runner                      │
   │                                                   │
   │   claude-code-action ── runs here, on your infra  │
   │        │                                          │
   │        │  ONLY model calls leave the runner ──────┼──► Anthropic API
   │        │                                          │     / Bedrock
   │   your code never leaves ─────────────────────────┤     / Vertex
   │                                                   │     / Foundry
   └──────────────────────────────────────────────────┘
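Because the secret-name mismatch fails quietly, it is cheap to lint for. The sketch below scans workflow text line by line and flags an API-key secret that does not match the action it sits under. It is deliberately naive (no YAML parsing, and it assumes the `uses:` line precedes the secret reference within a step), and `find_secret_mismatches` is a hypothetical helper, not part of either action.

```python
import re

# Expected secret per action, as described in the text above.
EXPECTED_SECRET = {
    "anthropics/claude-code-action": "ANTHROPIC_API_KEY",
    "anthropics/claude-code-security-review": "CLAUDE_API_KEY",
}

USES_RE = re.compile(r"uses:\s*([\w./-]+)@")
SECRET_RE = re.compile(r"secrets\.([A-Z0-9_]+)")


def find_secret_mismatches(workflow_text: str) -> list:
    """Return (action, secret) pairs where a step references the wrong API-key secret."""
    mismatches = []
    current_action = None
    for line in workflow_text.splitlines():
        uses = USES_RE.search(line)
        if uses:
            current_action = uses.group(1)  # remember which step we are inside
        expected = EXPECTED_SECRET.get(current_action)
        if expected is None:
            continue  # not one of the two actions we know about
        for secret in SECRET_RE.findall(line):
            if secret.endswith("API_KEY") and secret != expected:
                mismatches.append((current_action, secret))
    return mismatches
```

Running this over `.github/workflows/*.yml` in a pre-commit hook would catch the copy-paste case before it reaches a runner.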

Sources: code.claude.com/docs/en/github-actions, github.com/anthropics/claude-code-action.

Context persistence across sessions/PRs (confidence: medium-high)

This is the part teams most often get wrong, and the part the whole loop depends on. The fresh-context-per-session design means the only understanding the agent has is what it reloads — so you engineer the reload.

CLAUDE.md — the always-loaded brief. Injected into context at the start of every session. Keep it a map, not the territory: architecture, where things live, conventions ("we do X not Y"), build/test/lint commands, and non-obvious gotchas. It costs tokens on every turn, so keep it lean. Bootstrap a first draft with /init, then prune.
Nested CLAUDE.md. Drop one in a subdirectory; it loads only when Claude works in that area. Keeps the root file small while putting deep detail next to the code it describes.
Reference docs. Long-form docs (ARCHITECTURE.md, runbooks) live in the repo and are pointed to from CLAUDE.md, so Claude reads them on demand instead of you re-explaining each PR.
Cheap re-mapping with subagents. For the unavoidable "re-understand the relevant slice" cost, use an exploration subagent: it sweeps files and returns only the conclusion, keeping the main working context clean. The understanding cost becomes localized instead of polluting (and bloating) the session.
Write learnings back. When Claude discovers something non-obvious during a PR, append it to CLAUDE.md. The doc compounds — each PR makes the next one cheaper.

   PR #1                PR #2                 PR #50
     │                    │                      │
     ▼                    ▼                      ▼
 ┌────────┐   append   ┌────────┐   append   ┌────────┐
 │CLAUDE  │──learning─►│CLAUDE  │──learning─►│CLAUDE  │   ... compounds
 │.md     │            │.md (+) │            │.md(++) │
 └────────┘            └────────┘            └────────┘
     │                    │                      │
  fresh agent          fresh agent           fresh agent
  reads the map        reads a better        reads the best
  + explores           map, explores         map yet — cold start
  the slice            less                  ≈ as fast as PR #1

Note: the research pass confirmed the surrounding workflow strongly but flagged that the official CLAUDE.md / cross-PR re-hydration mechanics were under-represented in the fetched sources. Treat this section as docs-plus-known-behavior, and verify exact mechanics against the current Claude Code memory docs for your version before standardizing on them.

Branching, commits & traceability (confidence: high for workflow; medium for commit-format specifics)

The recommended four-phase workflow (verbatim from Anthropic's best-practices doc) is the agent's inner loop on any task:

   ┌──────────┐    ┌──────────┐    ┌────────────┐    ┌──────────┐
   │ EXPLORE  │───►│  PLAN    │───►│ IMPLEMENT  │───►│  COMMIT  │
   │ read the │    │ lay out  │    │ make the   │    │ + open   │
   │ relevant │    │ the      │    │ change     │    │ a PR     │
   │ code/    │    │ approach │    │            │    │ (Closes  │
   │ context  │    │ (plan    │    │            │    │  #N)     │
   │ first    │    │  mode)   │    │            │    │          │
   └──────────┘    └──────────┘    └────────────┘    └──────────┘
        read-only       think         act              record

Skipping Explore and Plan is the most common way to get a confidently wrong implementation. For non-trivial work, plan mode pays for itself.

gh CLI is the spine. Installing the GitHub CLI lets Claude create issues, open PRs, and read comments as an authenticated client. Without gh, Claude falls back to unauthenticated GitHub API requests, which hit rate limits fast: roughly 60 requests/hour unauthenticated vs. about 5,000 requests/hour authenticated. On a busy repo you are likely to hit the unauthenticated ceiling mid-task.
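A quick back-of-the-envelope calculation shows why the gap matters. The two limits are the approximate figures quoted above; the 150-request task size is a made-up number for illustration only.

```python
import math

UNAUTHENTICATED_PER_HOUR = 60   # approximate figure quoted in the text
AUTHENTICATED_PER_HOUR = 5_000  # approximate figure quoted in the text


def rate_limit_windows(requests_needed: int, limit_per_hour: int) -> int:
    """Number of one-hour rate-limit windows a task needs to finish."""
    return math.ceil(requests_needed / limit_per_hour)


# Hypothetical task: 150 API requests to read an issue, its comments, and PR diffs.
print(rate_limit_windows(150, UNAUTHENTICATED_PER_HOUR))  # 3
print(rate_limit_windows(150, AUTHENTICATED_PER_HOUR))    # 1
```

Under these assumptions the unauthenticated run stalls twice waiting for the limit to reset, while the authenticated run never notices the limit.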

Commit hygiene (medium confidence — confirm against your version): Claude writes a descriptive commit message derived from the diff and recent history (to match your style). Conventions teams commonly enforce via CLAUDE.md:

Conventional commits (feat:, fix:, chore: …) if the repo uses them.
A Co-Authored-By: trailer attributing Claude, so authorship is transparent in git log and git blame.
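If a team wants these two conventions enforced rather than merely requested, a commit-message check is a small amount of code. This is a sketch under assumptions: the list of allowed types is an example subset, the function name `check_commit_message` is hypothetical, and the trailer pattern only checks for a "Name <email>" shape rather than any particular identity.

```python
import re

# Example subset of conventional-commit types; extend to match your repo.
CONVENTIONAL_RE = re.compile(r"^(feat|fix|chore|docs|refactor|test)(\([\w./-]+\))?!?: \S")
TRAILER_RE = re.compile(
    r"^Co-Authored-By: .+ <[^<>\s]+@[^<>\s]+>$",
    re.IGNORECASE | re.MULTILINE,
)


def check_commit_message(message: str) -> list:
    """Return a list of problems; an empty list means the message passes."""
    problems = []
    subject = message.splitlines()[0] if message.strip() else ""
    if not CONVENTIONAL_RE.match(subject):
        problems.append("subject is not a conventional commit")
    if not TRAILER_RE.search(message):
        problems.append("missing Co-Authored-By trailer")
    return problems
```

This could run as a commit-msg hook or as a CI step over the commits in a PR; either way the rule lives in the repo, not in anyone's memory.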

The traceability chain — the payoff. This is what the whole externalization discipline buys you:

   git blame <line>          "who/what touched this line?"
        │
        ▼
   commit                    "what change, and its message"
        │
        ▼
   PR  (Closes #N)           "the reviewed unit of work"
        │
        ▼
   Issue                     "the WHY — spec + decision log"

A PR body that says Closes #N links the work back to its spec permanently. Six months later, any human or a fresh agent can walk that chain back to the original reasoning — even though the chat that produced it is long gone. The conversation was always the disposable part; this chain is the part that lasts.
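The PR-to-issue hop in that chain depends entirely on the closing keyword being present, so it is worth checking mechanically. The sketch below extracts same-repository issue numbers from a PR body using GitHub's closing keywords (close, fix, resolve and their inflections). The function name `linked_issues` is hypothetical, and the pattern ignores cross-repository references such as owner/repo#N.

```python
import re

# GitHub closing keywords: close(s/d), fix(es/ed), resolve(s/d), optionally followed by a colon.
CLOSING_RE = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?):?\s+#(\d+)",
    re.IGNORECASE,
)


def linked_issues(pr_body: str) -> list:
    """Return the sorted, de-duplicated issue numbers this PR body would close."""
    return sorted({int(number) for number in CLOSING_RE.findall(pr_body)})


print(linked_issues("Closes #42 and fixes #7. See #99 for background."))  # [7, 42]
```

A bare mention like "See #99" is intentionally not counted: it creates a cross-reference, but it does not tie the PR to a spec.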

Sources: code.claude.com/docs/en/best-practices, anthropic.com/engineering/claude-code-best-practices, practitioner writeups on Co-Authored-By.

Automated PR review (confidence: high — first-party)

There are four distinct review surfaces, plus a dedicated security-review action. They are genuinely different (different triggers, different infrastructure, different guarantees) and teams pick one or combine them. The single most useful thing to understand here is that they are not interchangeable.

  ┌─────────────────────────────────────────────────────────────────────────┐
  │  1. claude-code-action on pull_request                                    │
  │     fires on opened/synchronize — NO @claude mention needed               │
  │     runs in: your runner          guarantee: whatever prompt you give it  │
  ├─────────────────────────────────────────────────────────────────────────┤
  │  2. Inline comments + "Fix this" links                                    │
  │     buffered + classified after the session (Haiku pass) before posting   │
  │     runs in: your runner          guarantee: noise filtered, fix links    │
  ├─────────────────────────────────────────────────────────────────────────┤
  │  3. /code-review skill (local, adversarial)                               │
  │     fresh subagent sees ONLY the diff + criteria, not your reasoning      │
  │     runs in: your machine         guarantee: independent second opinion   │
  ├─────────────────────────────────────────────────────────────────────────┤
  │  4. Managed Code Review app (research preview; Team/Enterprise)           │
  │     N specialist agents in parallel + a false-positive verification step  │
  │     runs in: Anthropic infra      guarantee: severity-tagged findings     │
  ├─────────────────────────────────────────────────────────────────────────┤
  │  (+) Dedicated security review: anthropics/claude-code-security-review    │
  │      separate action, diff-aware, line-level findings, CLAUDE_API_KEY     │
  └─────────────────────────────────────────────────────────────────────────┘

1. claude-code-action on pull_request. Review every PR automatically on pull_request: types: [opened, synchronize]; no @claude mention is needed. synchronize re-fires on each new commit, which is what closes the comment→fix→re-review loop. Run the bundled review skill (prompt: "/code-review:code-review ..." with plugins: code-review@claude-code-plugins) or one of the "Claude Auto Review" recipes from solutions.md.

2. Inline comments + "Fix this" links. The action posts inline comments via mcp__github_inline_comment__create_inline_comment. With confirmed: true they post immediately; omit it and comments are buffered and classified after the session (via a Haiku pass), so real review comments post while test/probe subagent chatter is filtered out (classify_inline_comments, default true). include_fix_links (default true) adds "Fix this" links that open Claude Code pre-loaded with the context to fix that specific issue.

3. The bundled /code-review skill (local, adversarial). Run it before treating work as done. It reviews the current diff for bugs in a fresh subagent context: the reviewer sees only the diff and your criteria, not the reasoning that produced the change. That adversarial separation is the entire point: a reviewer that shares the author's assumptions inherits the author's blind spots.

4. Managed Code Review GitHub App (research preview; Team/Enterprise). An admin enables it org-wide; per-repo behavior is one of Once after PR creation, After every push, or Manual (@claude review works in any mode). It runs multiple specialized agents in parallel on Anthropic infrastructure, each hunting a different issue class, then a verification step filters false positives. Comments are tagged by severity:

   Important   → fix before merge
   Nit         → minor / optional
   Pre-existing → not introduced by this PR (context, not a blocker)

(+) Dedicated security review. anthropics/claude-code-security-review is a separate official action: diff-aware (analyzes only changed files), posts findings as line-level PR comments, triggered on pull_request with comment-pr: true and pull-requests: write. Note the CLAUDE_API_KEY secret name (different from the main action). The OWASP-Top-10 "security review" in solutions.md is a recommended prompt recipe you copy, not a hardcoded mode.

Sources: …/docs/solutions.md, code.claude.com/docs/en/code-review, …/usage.md, github.com/anthropics/claude-code-security-review.

Merging & gates (confidence: high — first-party)

Claude never merges by fiat, and cannot self-approve. This is a deliberate security boundary, not a missing feature.

The managed Code Review check run always completes with a neutral conclusion — it never blocks merges through branch protection by itself. A failed run never blocks your PR. Your existing human review and CI gates stay intact and authoritative.
Claude cannot submit formal GitHub PR approvals/reviews (it comments via a bot identity). So "letting Claude merge" always routes through human approval or CI logic, never agent self-approval.
To gate on findings, read the machine-readable severity breakdown (e.g. a bughunter-severity JSON comment with per-severity counts) from the check-run output in your own CI — and decide there.

   PR ready
      │
      ▼
   ┌───────────────────────────────┐
   │  CI parses severity breakdown │
   └───────────────┬───────────────┘
                   │
        Important == 0 ?  ──no──►  ✗ blocked  (agent did not block this —
                   │                            your CI logic did)
                  yes
                   │
            CI green ?     ──no──►  ✗ blocked
                   │
                  yes
                   │
        human approval ?  ──no──►  ✗ blocked  (Claude CANNOT supply this —
                   │                            bot identity, by design)
                  yes
                   │
                   ▼
                 MERGE  ✓

The pattern teams converge on: auto-merge is enabled only when (a) there are zero Important findings, (b) all CI is green, and (c) for anything non-trivial, a human has approved. The exact threshold is left to each team; the docs deliberately don't prescribe it. What the docs do guarantee is that the agent is never the thing that flips the final switch.
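The gate itself is a few lines of logic that live in your CI, not in the agent. The sketch below parses a severity breakdown out of check-run text and applies the three conditions. The exact comment format and the JSON key names (here "important", "nit", "pre_existing") are assumptions for illustration; confirm them against the current Code Review docs before relying on them. Treating a missing breakdown as "do not auto-merge" is a design choice of this example.

```python
import json
import re
from typing import Optional

# Assumed shape: "... bughunter-severity: {"important": 0, "nit": 2} ..." somewhere in the text.
SEVERITY_RE = re.compile(r"bughunter-severity:\s*(\{.*?\})")


def parse_severity(check_run_text: str) -> Optional[dict]:
    """Extract the per-severity counts, or None if no breakdown is present."""
    match = SEVERITY_RE.search(check_run_text)
    return json.loads(match.group(1)) if match else None


def merge_allowed(
    severity: Optional[dict],
    ci_green: bool,
    human_approved: bool,
    important_key: str = "important",  # assumed key name; verify against real output
) -> bool:
    """Apply the three-part gate: zero Important findings, CI green, human approval."""
    if severity is None:
        return False  # no breakdown found: fail closed rather than auto-merge
    return bool(severity.get(important_key, 0) == 0 and ci_green and human_approved)
```

Note that the agent appears nowhere in `merge_allowed`: its findings are an input, and the decision belongs to your pipeline and your reviewers.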

Sources: code.claude.com/docs/en/code-review, github.com/anthropics/claude-code-action.

CI, headless & autonomous runs (confidence: high for mechanics; "loops" narrative is qualitative)

Headless mode: claude -p "prompt" runs Claude non-interactively, with no persistent session. It's the documented entry point for CI pipelines, pre-commit hooks, and automated workflows, with text / JSON / streaming-JSON output for programmatic parsing.
Constraining unattended runs: in GitHub Actions, pass CLI flags via claude_args:
- --allowedTools — the permission allowlist of tools that execute without prompting. This is the primary safety mechanism for unattended runs. Example (review-only): --allowedTools "mcp__github_inline_comment__create_inline_comment,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*)".
- --max-turns — caps agentic turns.
- --model — pins the model (keep model IDs out of long-lived docs; they go stale).
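Outside GitHub Actions, the same constraints apply when you invoke headless mode from your own scripts. The sketch below only assembles the argument list for a constrained `claude -p` call; `build_headless_command` is a hypothetical helper, and the model flag is left out on purpose so no model ID ends up baked into the example.

```python
def build_headless_command(prompt, allowed_tools, max_turns, output_format="json"):
    """Assemble argv for a non-interactive, allowlisted, turn-capped run."""
    return [
        "claude", "-p", prompt,
        "--allowedTools", ",".join(allowed_tools),  # the safety allowlist
        "--max-turns", str(max_turns),              # cap on agentic turns
        "--output-format", output_format,           # machine-parseable output
    ]


# Review-only allowlist taken from the example above.
REVIEW_ONLY_TOOLS = [
    "mcp__github_inline_comment__create_inline_comment",
    "Bash(gh pr comment:*)",
    "Bash(gh pr diff:*)",
    "Bash(gh pr view:*)",
]

command = build_headless_command("Review the current PR diff.", REVIEW_ONLY_TOOLS, 15)
```

Passing the result to `subprocess.run(command, capture_output=True, text=True)` as a list avoids shell quoting problems with the parentheses and colons in the tool patterns.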

Self-verification loops ("loops, not prompts"). This is the qualitative core of the autonomous-run story, and it's the one verified Anthropic-sourced narrative worth repeating. The idea: don't hand the agent a prompt and grade the output once — give it a check it can run itself, and let it iterate against that check until it passes.

                ┌─────────────────────────────────────────┐
                │                                         │
                ▼                                         │
   ┌──────────────────┐    ┌──────────┐    pass? ──no──► loop again
   │  write / edit     │──►│  run      │──────┐  (catches its own
   │  code             │   │  build    │      │   mistakes, works
   └──────────────────┘   │  tests    │      │   longer unattended)
                          │  lint     │     yes
                          └──────────┘      │
                                            ▼
                              ≈80%-complete solution
                              → human reviews + takes over
                                for final refinement

Anthropic engineers describe setting up autonomous loops where Claude writes code, runs tests, and iterates continuously: give it an unfamiliar problem, let it run, then review the roughly 80%-complete solution and take over for final refinement. Operationally, the move is to configure Claude to run builds, tests, and lints automatically so it catches its own mistakes — especially effective when you ask it to generate tests before writing code. Patterns: pass/fail check loops, a Writer/Tester split, and Stop-hook goal gates.
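The pass/fail check loop has a simple generic shape, sketched below. The agent itself is abstracted into an `attempt_fix` callable, which is a placeholder for "hand the failures back to Claude"; both function names are hypothetical, and the iteration cap is an arbitrary example value rather than a recommended setting.

```python
import subprocess
from typing import Callable, Sequence


def run_checks(commands: Sequence[Sequence[str]]) -> list:
    """Run each check command (build, tests, lint); return the ones that failed."""
    failed = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(list(command))
    return failed


def iterate_until_green(
    attempt_fix: Callable[[list], None],
    commands: Sequence[Sequence[str]],
    max_iterations: int = 5,
) -> bool:
    """Loop: run checks, hand failures to the fixer, repeat until green or out of budget."""
    for _ in range(max_iterations):
        failed = run_checks(commands)
        if not failed:
            return True
        attempt_fix(failed)  # placeholder for the agent's next edit
    return not run_checks(commands)  # final verdict after the last attempt
```

The cap matters as much as the loop: an unattended run that cannot converge should stop and hand over to a human instead of burning turns.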

The "80%" is descriptive narrative, not a measured benchmark. Sources: Anthropic "How Anthropic teams use Claude Code" PDF; code.claude.com/docs/en/best-practices.

Caveats & myth-busting (read before quoting any of this)

The mechanics above are first-party and verified. The efficacy claims floating around this topic are mostly not, and a few popular ones are wrong. Keeping these straight is the difference between a defensible internal proposal and a hype deck.

The viral stat is unverified. "~100% of PRs and 80–90% of review at Anthropic run by Claude Code" did not appear among the 25 verified claims. Treat it as practitioner/hype framing. The verified, Anthropic-sourced version is the qualitative "loops not prompts" / "80%-complete solution" narrative — cite that.
Claude can't self-approve PRs. It comments via a bot identity; formal approval/merge always needs a human or CI. Don't imply autonomous merge.
Version churn. Behaviors reflect claude-code-action v1 GA and late-2025/2026 docs. Model IDs in examples are illustrative and will go stale — never hardcode them in a blog post or a long-lived runbook.
Managed Code Review is a research preview, limited to Team/Enterprise plans; availability and behavior may change.
Secret-name inconsistency: ANTHROPIC_API_KEY (main action) vs. CLAUDE_API_KEY (security-review). Flag this for anyone copying configs between the two.
"Security review" is a prompt recipe, not a hardcoded built-in mode — the action runs whatever prompt you give it.
Vendor-framed sources. The mechanics are authoritative because they're first-party; the value claims are inherently vendor-framed. Hedge accordingly.

A concrete reference workflow teams can adopt

Repo scaffolding

CLAUDE.md at root (architecture map, conventions, build/test/lint commands), plus nested CLAUDE.md in large subdirectories.
Issue templates that capture acceptance criteria — these become Claude's spec.
gh CLI installed and authenticated wherever Claude runs.
/install-github-app to wire up the GitHub App + API secret.
Branch protection: require CI green + ≥1 human approval on protected branches.

The loop, per task

File an issue with clear acceptance criteria. Trigger Claude (@claude, assign, or label claude).
Claude runs Explore → Plan → Implement → Commit, self-verifying (tests/build/lint) as it goes; opens a PR with Closes #N.
Automated review fires on PR open (Code Review app or claude-code-action on pull_request), posting severity-tagged inline comments; the security-review action runs in parallel.
Claude (or a human) addresses comments; new commits re-trigger review via synchronize.
Gate: CI parses the severity breakdown — merge blocked unless Important == 0 and CI green.
Human approves and merges. The issue auto-closes; the blame → commit → PR → issue chain is now permanent.

Illustrative GitHub Actions snippet (model ID intentionally omitted; verify input names against current claude-code-action docs):

name: Claude Auto Review
on:
  pull_request:
    types: [opened, synchronize, reopened, ready_for_review]
permissions:
  contents: read
  pull-requests: write
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
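        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          # The /code-review:code-review prompt needs the review plugin loaded.
          # Depending on the action version, a plugin marketplace input may also be required; check the docs.
          plugins: "code-review@claude-code-plugins"
          prompt: "/code-review:code-review Review this PR for correctness, security, and clarity."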
          claude_args: >-
            --allowedTools "mcp__github_inline_comment__create_inline_comment,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*)"
            --max-turns 15

Takeaways

A few things that generalize beyond any specific version of the action:

Externalize state or pay for it every session. The agent's context is wiped each session by design. The teams that win treat GitHub and repo files as the memory, and the conversation as scratch. If your workflow needs the agent to "remember," it's already fragile.
CLAUDE.md is the highest-leverage file in the repo. It's auto-loaded every session and it compounds — write learnings back into it and each PR makes the next one cheaper. Keep it a map, not the territory, because you pay its token cost on every turn.
The review surfaces are not interchangeable. The claude-code-action on pull_request, its buffered inline comments, a local adversarial /code-review, the managed parallel-agent app, and the dedicated security-review action solve different problems. Know which one you're actually deploying.
The agent that can't merge is a feature. Claude commenting via a bot identity and being structurally unable to self-approve is the boundary that makes autonomous-ish workflows safe to adopt. The final gate is always human or CI logic.
Separate verified mechanics from vendor narrative. The plumbing is first-party and trustworthy. The "100% of PRs" framing is not in the verified set — use the "loops, not prompts" / "~80%-complete solution" framing instead, and never hardcode a model ID into a doc that has to outlive a release.

The deeper point is that none of the value here comes from the model writing code faster. It comes from putting a capable agent inside a system whose state, traceability, and gates were already designed to survive humans forgetting things — and then refusing to let the agent route around any of it.