Introduction
Our git library took 30 minutes to build from scratch. More importantly, the architecture stayed intact: no dependency leakage, no abstraction violations, no technical debt. This outcome isn't about prompting Claude better. It's about systematically preventing AI-generated code from accumulating architectural violations through staged validation gates.
Generation Speed Creates Review Bottlenecks
LLM-based agents write code 10-100x faster than humans. Claude Sonnet 4.5 generated our entire git abstraction layer (public API, implementation, and tests) in roughly ten minutes of processing time. The bottleneck immediately shifted from "writing code" to "verifying the code does what we intended."
Traditional code review fails at this scale. When diffs are large and generation is fast, human reviewers can't inspect every line. You either slow down to review properly (negating the speed advantage) or accept code without full understanding (accumulating technical debt). We needed a way to validate behavior against intent without becoming the constraint.
Spec-First, Multi-Gate Validation
Our workflow separates planning from implementation and introduces multiple validation checkpoints. Each gate operates at a different abstraction level, catching different classes of errors.
Phase 1: API Specification Generation
The developer agent's first task: generate only the public API surface. No implementation code. The output is a specification file containing:
- All public functions and their signatures
- Stub implementations (empty or returning default values)
- Test stubs named after expected behaviors
This serves two purposes. First, it's reviewable in seconds: the entire public interface fits on one screen. Second, it encodes understanding before implementation begins. If the agent misunderstood the task, we catch it here, before any implementation code exists.
Implementation detail: Our specification format includes behavioral test names, not just function signatures. Instead of test_diff_content(), we use diff_content_returns_added_and_removed_lines(). This naming forces explicit behavior declarations.
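To make this concrete, here is a minimal sketch of what such a specification file can look like. Every name in it (GitDiff, diffContent, the test class) is illustrative, not the library's actual API:
import kotlin.test.Test

/** Holds the line-level changes between two revisions of a file. */
data class GitDiff(val addedLines: List<String>, val removedLines: List<String>)

/** Public signature with a stub body; no implementation exists yet. */
fun diffContent(path: String, fromRevision: String, toRevision: String): GitDiff =
    TODO("stub: implementation pending spec approval")

class GitDiffSpec {
    // Behavioral test stub: the name declares the expected behavior,
    // not just the function under test.
    @Test
    fun diff_content_returns_added_and_removed_lines() {
        TODO("stub: test body written during implementation")
    }
}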
Phase 2: Human Review Gate
At this checkpoint, a human reviews the specification with one question: "Is this the right library API?" We're not checking correctness; nothing executes yet. We're validating understanding.
In our git library example, the initial spec exposed an API dependency on our process library. The public interface required callers to pass process objects directly:
fun executeGitCommand(command: GitCommand, process: ProcessExecutor): GitResult
This leaks implementation details. Callers shouldn't know that Git commands spawn processes. The architect agent flagged this immediately: "Why does our Git API expose process dependencies?" The developer agent acknowledged the abstraction leak and restructured around a cleaner interface:
fun executeGitCommand(command: GitCommand): GitResult
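Where does the process executor live after the fix? One plausible shape, shown here as a sketch assuming constructor injection (GitClient and the toGitResult mapping are illustrative names, not the library's actual code), is to make it a private dependency so it never appears in a public signature:
// Sketch only: the process dependency moves behind the public surface.
class GitClient(private val processExecutor: ProcessExecutor) {
    fun executeGitCommand(command: GitCommand): GitResult {
        // Callers see only Git-domain types; process handling stays internal.
        val result = processExecutor.execute(command.toProcessCommand())
        return result.toGitResult() // mapping to the domain result type elided
    }
}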
Cost of catching this early: two minutes of review, one architectural question, and a regenerated spec. Cost of catching this late: a process dependency spread across the codebase, difficult to refactor without breaking consumers.
Phase 3: Implementation
With the spec approved, the developer agent implements against it. This phase runs largely autonomously: the agent writes code, runs tests, fixes failures, and iterates until tests pass. We track this in logs but don't interrupt unless execution time exceeds reasonable bounds (usually 15-20 minutes indicates the agent is stuck).
The key constraint: the implementation must satisfy the specification's test stubs. Test-driven implementation constrains the solution space. The agent can't drift toward a different API surface because the tests define success.
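Continuing the hypothetical diffContent spec from Phase 1, the behavioral stub becomes a concrete test during implementation. The fixture and expected values below are assumptions; the point is that the approved name and signature define what "done" means:
import kotlin.test.Test
import kotlin.test.assertEquals

class GitDiffBehavior {
    @Test
    fun diff_content_returns_added_and_removed_lines() {
        // Assumes a fixture repository in which one line of README.md was replaced.
        val diff = diffContent("README.md", fromRevision = "HEAD~1", toRevision = "HEAD")
        assertEquals(listOf("new line"), diff.addedLines)
        assertEquals(listOf("old line"), diff.removedLines)
    }
}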
Phase 4: Spec Diff Validation
After implementation completes, we regenerate the specification from the implemented code and diff it against the original approved spec. In theory, these should match; implementation shouldn't change the public API.
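The diff step itself can be as simple as comparing the two spec files. A minimal sketch, assuming both versions are saved to disk (the paths are illustrative); git diff --no-index compares two arbitrary files and exits non-zero when they differ:
fun main() {
    val exitCode = ProcessBuilder(
        "git", "diff", "--no-index",
        "specs/git-library.approved.kt",
        "specs/git-library.regenerated.kt",
    )
        .inheritIO() // print the diff so the architect agent (or a human) can read it
        .start()
        .waitFor()

    if (exitCode != 0) {
        println("Spec drift detected: route the diff to the architect agent for review")
    }
}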
In practice, implementation often reveals missing details. Our post-implementation spec showed imports from internal implementation packages appearing in the public specification:
// ❌ THE PROBLEM: Implementation details leak into the public API
import com.kerno.internal.process.ProcessResult

fun executeGitCommand(command: GitCommand): GitResult {
    val result = processExecutor.execute(command.toProcessCommand())
    if (result.exitCode != 0) {
        throw CommandFailedException(result) // ProcessResult type leaked
    }
    return parseGitOutput(result.stdout)
}

// Architect agent caught this: "The public spec references internal process types.
// This suggests implementation details leaked into the API."

// ✅ THE SOLUTION: Use domain-specific result types
sealed class GitResult {
    data class Success(val output: String) : GitResult()
    data class Failure(val message: String, val exitCode: Int) : GitResult()
}

fun executeGitCommand(command: GitCommand): GitResult {
    val result = processExecutor.execute(command.toProcessCommand())
    return when (result.exitCode) {
        0 -> GitResult.Success(parseGitOutput(result.stdout))
        else -> GitResult.Failure("Command failed", result.exitCode)
    }
}
This removed the process dependency from the public API entirely.
Phase 5: Tech Lead Quality Gate
The final gate reviews implementation quality and task completion. The tech lead agent has full code visibility and asks two questions:
- Did we accomplish the original task, or did we narrow scope to make tests pass?
- Are there code quality issues that passed tests but violate standards?
In our case, the tech lead flagged residual process dependency leakage in exception types and unconventional public method naming. We resolved both before merging.
Why Multiple Agents at Different Abstraction Levels
Each validation gate operates at a different level:
- Specification review: Architectural correctness, API design, dependency management
- Spec diff validation: Implementation drift, unintended API surface changes
- Tech lead review: Code quality, task completion, standards compliance
No single agent can effectively evaluate all of these simultaneously. An agent focused on implementation will optimize for "making tests pass." An architect agent evaluates structure without implementation details clouding judgment. The tech lead has full context but enters late in the process, making large changes expensive.
By staging validation, we catch different error classes at their cheapest intervention point:
| Error Type | Detection Phase | Cost to Fix |
| --- | --- | --- |
| Wrong API design | Spec review | Minutes |
| Abstraction leak | Spec diff | Minutes to hours |
| Code quality issues | Tech lead review | Hours |
| Incorrect behavior | Production | Days to weeks |
Practical Results: 30 Minutes to Production-Ready Library
From ticket creation to merged PR:
- Planning phase: 8 minutes (API spec generation + human review)
- Implementation phase: 15 minutes (autonomous implementation + test execution)
- Validation phases: 7 minutes (spec diff review + tech lead review + human approval)
The final git library is 400 lines of implementation code and 300 lines of tests, with zero known architectural violations. We won't touch this code again except to add features; the foundation is correct.
More importantly, the process improved itself. During validation, we identified gaps in our workflow:
- Spec generation didn't prevent internal dependencies appearing in public interfaces
- No automated check for test fixture dependencies (we manually caught incorrect dependency scoping)
These observations became new tickets to enhance the workflow itself. The AI agents identified their own failure modes.
Implementation Requirements
This approach requires specific infrastructure:
Specification format: Machine-readable, human-scannable, diff-able. We use Kotlin interface definitions with inline behavior documentation and named test stubs.
Agent tooling: Each agent needs file read/write access, test execution capabilities, and constraint-specific instructions. The architect agent cannot execute implementation code; it only sees specifications. The developer agent cannot see prior versions of the spec; it only implements against current requirements.
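As a hypothetical sketch of how those constraints might be encoded (none of these names come from our actual tooling):
// Hypothetical: per-agent constraints expressed as data.
data class AgentConstraints(
    val canExecuteTests: Boolean,
    val readablePaths: List<String>,
    val writablePaths: List<String>,
)

val architectAgent = AgentConstraints(
    canExecuteTests = false,          // reviews structure only, never runs code
    readablePaths = listOf("specs/"), // sees specifications, not implementations
    writablePaths = emptyList(),
)

val developerAgent = AgentConstraints(
    canExecuteTests = true,                                    // implements and iterates until tests pass
    readablePaths = listOf("specs/current/", "src/", "test/"), // current spec only, no prior versions
    writablePaths = listOf("src/", "test/"),
)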
Context management: Specification files are typically 50-200 tokens. Full repository specifications across 30 libraries fit in ~50K tokens. This enables the software architect agent to maintain global context cheaply.
Human-in-the-loop touchpoints: Our workflow requires human review at spec approval (Phase 2) and final merge approval (after Phase 5). Both checkpoints are fast: reviewing a spec takes 30-120 seconds, and final approval takes 1-2 minutes. This keeps humans in critical paths without turning them into bottlenecks.
Why This Matters Beyond One Library
The implications extend beyond "building libraries faster":
Maintainability: Code generated under architectural constraints doesn't accumulate technical debt at generation speed. Three months from now, we'll understand what this library does by reading its 50-line specification, not its 400-line implementation.
Verification without expertise: Junior engineers can review specifications effectively. The spec diff validation catches abstraction leaks that require architectural expertise to spot in implementation code. This democratizes architecture review.
Scalable to repository level: Our specification approach scales horizontally. As we build more libraries with validated specs, the architect agent gains repository-wide context. It can answer questions like "which libraries depend on database abstractions?" by reading specs instead of parsing implementation code.
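A sketch of the kind of query this enables, assuming one spec file per library under a specs/ directory (the layout and the package prefix are assumptions):
import java.io.File

// Answer "which libraries depend on database abstractions?" by scanning spec
// files for imports, without touching implementation code.
fun librariesDependingOn(packagePrefix: String, specsDir: File = File("specs")): List<String> =
    specsDir.walkTopDown()
        .filter { it.isFile && it.extension == "kt" }
        .filter { spec -> spec.readLines().any { it.trim().startsWith("import $packagePrefix") } }
        .map { it.nameWithoutExtension }
        .toList()

fun main() {
    println(librariesDependingOn("com.kerno.database"))
}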
Adaptable to non-library tasks: Replace "library" with "endpoint," "public API" with "HTTP contract," and "unit test" with "integration scenario." The same staged validation applies. We're implementing this for our API development workflow next.
The Fundamental Insight
AI agents are optimizers. Given a task and a success metric, they'll find the shortest path to that metric. Without constraints, that shortest path often involves cutting corners, leaking abstractions, or creating implicit coupling.
Staged validation doesn't slow the agent down; it redirects optimization pressure. When the success metric is "match this specification exactly" rather than "make tests pass somehow," the agent generates better code. When the architect agent flags issues during spec diff review, the developer agent learns what "better" means in your codebase.
The workflow creates a feedback loop that improves both the agent's outputs and the workflow itself. After 20 library implementations under this system, our failure rate at spec diff validation dropped from 60% to under 20%. The agents learned patterns. The workflow evolved.
Current Limitations
This approach isn't universal:
Context window dependence: Multi-gate validation works because specifications are compact. If your public API surface exceeds 10K tokens, specification review becomes expensive again.
Single-file bias: Our workflow optimizes for single-file deliverables (one library = one public interface file). Multi-file features with complex dependencies don't fit this model yet.
Test-first requirement: This relies on behavior-driven test naming. If your team doesn't write behavioral tests or uses unclear test names, the specification format loses value.
Non-deterministic validation: LLM-as-judge evaluations (what the architect and tech lead agents do) can vary between runs. We haven't seen false positives become problematic yet, but we monitor for validation drift.
What's Next
We're implementing two extensions:
- Automated spec diff blocking: If the spec diff shows changes beyond a threshold (currently 5% of the API surface), block the implementation unless a human overrides the gate. This prevents large drift from entering the review pipeline (see the sketch after this list).
- Specification-level integration tests: Generate integration tests from specifications alone, before implementation exists. If the integration tests are wrong, we know the spec is wrong and can catch it even earlier.
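A hypothetical sketch of the blocking rule; the drift heuristic (counting changed public declarations) and the file layout are assumptions, not our actual tooling:
import java.io.File

// Collect top-level public declarations from a spec file (crude line-based heuristic).
private fun publicDeclarations(spec: File): Set<String> =
    spec.readLines()
        .map(String::trim)
        .filter { it.startsWith("fun ") || it.startsWith("class ") || it.startsWith("data class ") || it.startsWith("sealed class ") }
        .toSet()

// Fraction of the approved API surface that was added or removed during implementation.
fun driftRatio(approved: File, regenerated: File): Double {
    val before = publicDeclarations(approved)
    val after = publicDeclarations(regenerated)
    if (before.isEmpty()) return if (after.isEmpty()) 0.0 else 1.0
    val changed = (before - after) + (after - before)
    return changed.size.toDouble() / before.size
}

fun main() {
    val ratio = driftRatio(File("specs/approved.kt"), File("specs/regenerated.kt"))
    require(ratio <= 0.05) { "Spec drift of ${(ratio * 100).toInt()}% exceeds the 5% threshold: blocking" }
}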
The goal remains unchanged: maintain architectural integrity at AI generation speed. Every workflow improvement moves us closer to that outcome.
