notes/evaluation/plumbing-generation-benchmark.md

Plumbing Generation Benchmark

Purpose

This benchmark measures how well language models can generate valid plumbing programs — typed dataflow graphs in the plumb language — from natural-language descriptions. It tests a model's ability to produce syntactically correct, type-safe, structurally sound code in a domain-specific language that appears in no model's training data.

What is plumbing?

Plumbing is a typed composition language for wiring AI agents into pipelines. A plumbing program declares record types, agents with typed input and output ports, and bindings (such as main) that wire agents together into dataflow chains.

The !T notation denotes a stream of messages of type T. The type checker enforces that connected ports have compatible types.

Test setup

The generator pipeline

Each trial runs a plumbing pipeline that contains a single agent called generator. This agent receives the scenario prompt as input and produces text output. The agent's system prompt comprises three documents, concatenated in order:

  1. system.md — explains the plumbing language concepts (wiring chains, fan-out/merge, inline transforms, feedback loops)
  2. grammar.peg — the full PEG grammar of the plumbing language (~16 KB), giving the model a formal reference for syntax
  3. generator_prompt.md — task-specific instructions: output format (fenced plumbing code block) and guidance on writing correct programs

This is a one-shot generation benchmark. The model receives the system prompt and a single user message (the scenario description), and must produce a complete, correct plumbing program in its response. There is no multi-turn interaction, no iterative refinement, and no tool use.
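The system-prompt assembly is a straightforward concatenation. A minimal sketch, assuming the three documents sit in a single directory (only the file names come from the list above; the directory layout and separator are assumptions):

```python
from pathlib import Path

# The three documents that make up the generator's system prompt,
# concatenated in the order listed above.
PROMPT_PARTS = ["system.md", "grammar.peg", "generator_prompt.md"]

def build_system_prompt(doc_dir: str) -> str:
    """Concatenate the prompt documents with blank-line separators."""
    return "\n\n".join(
        Path(doc_dir, name).read_text(encoding="utf-8") for name in PROMPT_PARTS
    )
```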

Model routing

Models are selected via environment variables.

Anthropic and OpenAI models are called via their respective APIs directly. All other models are accessed via OpenRouter, which proxies requests to the original vendor endpoints (e.g. Mistral models are served by Mistral, Qwen by Alibaba, GLM by Zhipu). OpenRouter is configured to route to the vendor's own inference infrastructure where available, so accuracy results should be representative of the underlying models. However, timing figures for OpenRouter models are not comparable to direct-API models or to each other — they include OpenRouter's routing overhead, and the underlying serving infrastructure (hardware, quantisation, batching) varies by vendor.
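The routing rule above can be sketched as a pure lookup. The vendor prefixes and key names here are illustrative assumptions, not the benchmark's actual configuration; the endpoint URLs are the vendors' standard ones:

```python
# Hypothetical routing table: Anthropic and OpenAI models go direct,
# everything else is proxied through OpenRouter.
DIRECT_VENDORS = {
    "claude": ("https://api.anthropic.com", "ANTHROPIC_API_KEY"),
    "gpt": ("https://api.openai.com/v1", "OPENAI_API_KEY"),
}
OPENROUTER = ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY")

def resolve_endpoint(model: str) -> tuple[str, str]:
    """Return (base_url, api_key_env_var) for a model name."""
    for prefix, endpoint in DIRECT_VENDORS.items():
        if model.lower().startswith(prefix):
            return endpoint
    return OPENROUTER
```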

Trial execution

Each scenario runs 10 independent trials per model. Each trial:

  1. Sends the scenario prompt to the generator agent
  2. Waits up to 120 seconds for a response
  3. Extracts the plumbing program from the response text
  4. Evaluates the program through five scoring stages
  5. Records the full message transcript

Trials within a scenario run concurrently. Results are aggregated with bootstrap confidence intervals (2000 resamples, 95% CI).
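The aggregation step can be sketched with a percentile bootstrap. The resample count and interval width match the settings above; everything else (seeding, tie handling) is an implementation assumption:

```python
import random

def bootstrap_ci(passes: list[int], resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a pass rate over binary trial outcomes.

    Resamples the trial outcomes with replacement, computes the pass
    rate of each resample, and reads off the alpha/2 and 1-alpha/2
    percentiles (95% interval with the defaults).
    """
    rng = random.Random(seed)
    n = len(passes)
    rates = sorted(
        sum(rng.choices(passes, k=n)) / n for _ in range(resamples)
    )
    lo = rates[int((alpha / 2) * resamples)]
    hi = rates[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi
```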

Scenarios

The benchmark comprises four scenarios of increasing complexity. All share the same generator pipeline and system prompt; only the natural-language task description differs.

Scenario 1 — Basic sequential

Connect an agent called writer to an agent called editor. The writer takes text input and produces text output. The editor takes text input and produces text output.

Tests: basic let declarations, agent syntax, sequential composition with ;, correct plumb binding with input and output ports.

Expected: agents writer and editor; binding main.

Scenario 2 — Fan-out with custom type

Define a record type called Article with fields title (string) and body (string). Create two agents: summariser takes !Article input and produces !string output; translator takes !Article input and produces !string output. Wire the pipeline so that input fans out to both agents in parallel. Both agents' outputs merge into the pipeline output. The main binding should have type !Article -> !string.

Tests: custom record type declaration, fan-out (two chains from input), merge (two chains to output), type annotation on the main binding.

Expected: agents summariser and translator; binding main.

Scenario 3 — Filter and projection

Define a record type Review with fields draft (string), score (int), and feedback (string). Create two agents: writer takes !string input and produces !Review output; editor takes !Review input and produces !Review output. Wire the pipeline: input goes to writer, then to editor. After editor, use filter(score >= 80) to pass only high-scoring reviews, then .draft to extract the draft field and send it to output. The main binding should have type !string -> !string.

Tests: record type, sequential agent chain, inline filter() with a comparison expression, field projection (.draft), type narrowing from !Review through projection back to !string.

Expected: agents writer and editor; binding main.

Scenario 4 — Feedback loop

Define a record type Draft with fields text (string) and approved (bool). Create two agents: composer takes !string input and produces !Draft output; checker takes !Draft input and produces !Draft output. Wire a feedback loop: input goes to composer, then to checker. From checker, use filter(approved = true) then .text to send approved drafts to output. Use filter(approved = false) then .text to feed rejected drafts back to composer. The main binding should have type !string -> !string.

Tests: feedback cycle (output of checker routed back to composer), complementary filters on a boolean field, field projection on both branches, correct typing through the loop.

Expected: agents composer and checker; binding main.

Evaluation scores

Each trial produces five binary scores:

| Score | What it measures |
| --- | --- |
| plumbing_extraction | Could a plumbing program be extracted from the response? Tries plumbing-tagged fence first, then any fenced block, then raw text. |
| plumbing_parse | Does the extracted text parse as valid plumbing syntax? |
| plumbing_typecheck | Does the parsed program pass the type checker? |
| plumbing_properties | Does the program contain the expected agents and bindings? |
| plumbing_degenerate | Is the program non-trivial? (A program with zero agent bindings that still type-checks is degenerate.) |

A trial passes if all five scores are 1.0. The scores form a cascade: extraction failure prevents parse evaluation; parse failure prevents typecheck; and so on. In practice, extraction almost always succeeds (the model produces something), so the discriminating scores are parse and typecheck.

What this benchmark measures

This benchmark is specifically designed to test one-shot structured code generation in an unfamiliar DSL. Because the plumbing language is bespoke, models cannot rely on memorised syntax from training data. They must:

  1. Absorb a grammar specification from the system prompt (~16 KB PEG grammar plus explanatory documentation)
  2. Apply it correctly to produce syntactically valid programs
  3. Reason about types to ensure connected ports are compatible
  4. Handle compositional patterns — fan-out, merge, filtering, projection, and feedback loops

The scenarios are ordered by compositional complexity, from a simple two-agent sequential chain (S1) to a feedback loop with complementary filters and type narrowing (S4). This gradient reveals how models degrade as the required reasoning becomes more involved.

Limitations

Results

Twenty-five models were evaluated across four scenarios, producing 1000 valid trials. Six models achieved a perfect 40/40; the weakest managed just 1/40. The overall pass rate across all models and scenarios was 77.5% (775/1000).

Overall results

| Model | S1 | S2 | S3 | S4 | Total |
| --- | --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 100% | 100% | 100% | 100% | 40/40 |
| Claude Sonnet 4.6 | 100% | 100% | 100% | 100% | 40/40 |
| Gemini 3 Flash | 100% | 100% | 100% | 100% | 40/40 |
| GLM-5 | 100% | 100% | 100% | 100% | 40/40 |
| GPT-5.4-mini | 100% | 100% | 100% | 100% | 40/40 |
| MiMo V2 Pro | 100% | 100% | 100% | 100% | 40/40 |
| DeepSeek V3.2 | 100% | 100% | 90% [70–100] | 100% | 39/40 |
| Gemini 3.1 Pro | 100% | 100% | 100% | 90% [70–100] | 39/40 |
| GLM-5 Turbo | 100% | 100% | 90% [70–100] | 100% | 39/40 |
| Qwen 3.5 Plus | 100% | 100% | 90% [70–100] | 100% | 39/40 |
| Gemini 3.1 Flash Lite | 100% | 100% | 90% [70–100] | 90% [70–100] | 38/40 |
| Haiku 4.5 | 100% | 100% | 70% [40–90] | 100% | 37/40 |
| MiniMax M2.7 | 100% | 90% [70–100] | 80% [50–100] | 80% [50–100] | 35/40 |
| Mistral Large 2512 | 60% [30–90] | 80% [50–100] | 100% | 100% | 34/40 |
| Kimi K2.5 | 70% [40–100] | 100% | 70% [40–90] | 90% [70–100] | 33/40 |
| MiMo V2 Flash | 100% | 80% [50–100] | 70% [40–100] | 80% [50–100] | 33/40 |
| Step 3.5 Flash | 90% [70–100] | 80% [50–100] | 90% [70–100] | 70% [40–100] | 33/40 |
| GPT-5.4-nano | 80% [50–100] | 90% [70–100] | 70% [40–100] | 60% [30–90] | 30/40 |
| Grok 4.20 Beta | 50% [20–80] | 70% [40–100] | 80% [50–100] | 70% [40–100] | 27/40 |
| Qwen 3.5 Flash | 70% [40–100] | 80% [50–100] | 50% [20–80] | 50% [20–80] | 25/40 |
| Devstral 2512 | 100% | 80% [50–100] | 40% [10–70] | 10% [0–30] | 23/40 |
| Qwen 3.5-9B | 70% [40–100] | 40% [10–70] | 40% [10–70] | 30% [0–60] | 18/40 |
| Mistral Small 2603 | 50% [20–80] | 10% [0–30] | 0% | 10% [0–30] | 7/40 |
| Nemotron 3 Super | 30% [10–60] | 10% [0–30] | 10% [0–30] | 0% | 5/40 |
| Ministral 14B | 0% | 0% | 10% [0–30] | 0% | 1/40 |

Percentages show the pass rate for 10 trials. Bracketed ranges are bootstrap 95% confidence intervals; models at 100% or 0% have no interval shown. Anthropic and OpenAI models were accessed via their direct APIs; all others via OpenRouter (see Model routing).

Score cascade

The five evaluation scores form a strict cascade. Across all 1000 trials:

| Stage | Pass rate |
| --- | --- |
| Extraction | 99.4% (994/1000) |
| Parse | 77.9% (779/1000) |
| Typecheck | 77.9% (779/1000) |
| Properties | 77.5% (775/1000) |
| Non-degenerate | 77.5% (775/1000) |

Extraction almost never fails — only 6 trials across three models (Nemotron 3 Super: 4, Qwen 3.5 Flash: 1, MiMo V2 Flash: 1) failed to produce any extractable plumbing text. The critical gate is parsing: programs that parse also typecheck (zero cases of parse success with typecheck failure), and programs that typecheck almost always have the correct structural properties (only 4 exceptions). This means the benchmark effectively measures whether the model can produce syntactically valid plumbing; if it can, type-correctness follows for free.

Scenario difficulty gradient

Per-scenario pass rates across all models confirm the intended difficulty ordering:

| Scenario | Pass rate | Description |
| --- | --- | --- |
| S1 — Basic sequential | 82.8% (207/250) | Two agents, linear chain |
| S2 — Fan-out + custom type | 80.4% (201/250) | Record type, parallel branches |
| S3 — Filter + projection | 73.6% (184/250) | Inline filter, field extraction |
| S4 — Feedback loop | 73.2% (183/250) | Cycle with complementary filters |

S1 and S2 are relatively close; the step down to S3/S4 is where filter syntax and feedback wiring introduce additional complexity. Scenario 3 is the most discriminating among top-tier models: four models that achieve 100% on S1, S2, and S4 drop below 100% only on S3 (Haiku 4.5 at 70%; DeepSeek V3.2, Qwen 3.5 Plus, and GLM-5 Turbo at 90%).

Model tiers

The results cluster into five natural tiers:

Tier 1 — Perfect (100%, 6 models): Claude Opus 4.6, Claude Sonnet 4.6, Gemini 3 Flash, GLM-5, GPT-5.4-mini, MiMo V2 Pro. These models never produced an invalid program across 40 trials each.

Tier 2 — Near-perfect (95–97.5%, 5 models): DeepSeek V3.2, Gemini 3.1 Flash Lite, Gemini 3.1 Pro, Qwen 3.5 Plus, GLM-5 Turbo. One or two failures, typically on S3 or S4.

Tier 3 — Strong (82.5–92.5%, 6 models): Haiku 4.5, MiniMax M2.7, Mistral Large 2512, Kimi K2.5, MiMo V2 Flash, Step 3.5 Flash. Reliable on simple scenarios but showing measurable degradation on harder ones.

Tier 4 — Moderate (62.5–75%, 3 models): GPT-5.4-nano, Grok 4.20 Beta, Qwen 3.5 Flash. Producing valid programs more often than not, but with wide confidence intervals.

Tier 5 — Weak (<60%, 5 models): Devstral 2512, Qwen 3.5-9B, Mistral Small 2603, Nemotron 3 Super, Ministral 14B. Failing on the majority of trials.

Family scaling

Every model family with multiple tested entries showed clear capability scaling from smaller to larger variants:

| Family | Scaling |
| --- | --- |
| Claude | Haiku 92.5% → Sonnet 100% → Opus 100% |
| GPT | nano 75% → mini 100% |
| Gemini | Flash Lite 95% → Flash 100%, Pro 97.5% |
| Qwen 3.5 | 9B 45% → Flash 62.5% → Plus 97.5% |
| Mistral | Ministral 14B 2.5% → Small 17.5% → Devstral 57.5% → Large 85% |
| GLM | Turbo 97.5% → Full 100% |
| MiMo | Flash 82.5% → Pro 100% |

The Qwen and Mistral families show the steepest gradients — from near total failure at the small end to near-perfect or strong results at the top. The Gemini family is notable for high performance even at the "lite" tier.

Anomalous difficulty profiles

Most models degrade monotonically from S1 to S4, but three models show a clearly inverted pattern — performing worse on the supposedly easiest scenario than on harder ones:

| Model | S1 | S3 | S4 |
| --- | --- | --- | --- |
| Mistral Large 2512 | 60% | 100% | 100% |
| Grok 4.20 Beta | 50% | 80% | 70% |
| Kimi K2.5 | 70% | 70% | 90% |

Mistral Large is the most striking: perfect on S3 and S4 but only 60% on the basic sequential chain. The reason is unclear — it may relate to how these models handle underspecified prompts (S1 gives less explicit type guidance than the later scenarios), or it may reflect prompt-format sensitivities routed through OpenRouter.

Summary

The core finding is encouraging: the majority of current frontier models can reliably generate valid programs in an unfamiliar domain-specific language from a single prompt, given a grammar specification and explanatory documentation. Six out of twenty-five models achieved a perfect score, and eleven scored 95% or above.

The benchmark discriminates effectively between model capabilities. The scenario difficulty gradient works as intended, with filter/projection syntax (S3) and feedback loops (S4) separating the top-tier models from the rest. The failure mode is almost exclusively at the parse stage — models either produce syntactically valid, type-correct, structurally sound programs, or they produce text that fails to parse. There is essentially no middle ground of "almost right" programs that parse but fail later checks.

This is a one-shot benchmark with no opportunity for self-correction. Allowing models to validate their output against the type checker (tool-assisted generation) is a natural extension that could substantially improve results for models in tiers 3–5.