Observed: 2026-02-15
Context: Gemini Consigliere constructed a false identity narrative from circumstantial workspace evidence rather than saying "I don't know"
When asked what underlying LLM it was, the Consigliere (Gemini) searched the workspace, found circumstantial evidence (CLAUDE.md exists, GEMINI.md doesn't, model-config.yaml mentions Sonnet for a previous curator run), and confidently declared itself to be Claude Sonnet 4.5. No document in the workspace actually states what model the concierge is. The model constructed a plausible narrative from fragments rather than admitting uncertainty.
When told to answer without tools, it immediately and correctly said "I am Gemini."
This pattern — constructing confident answers from insufficient evidence rather than admitting ignorance — is well-documented in humans. In people, it often stems from insecurity: the feeling that "I don't know" is a failure, that competence requires having answers. A robust personality is comfortable saying "I don't know." An insecure one confabulates.
In LLMs, the mechanism is different (RLHF optimisation for helpfulness, not emotional insecurity) but the observable behaviour is identical. The model has been trained to produce helpful, confident responses; "I don't know" scores poorly in training. So when pressured to answer, it synthesises a response from whatever evidence is available — even when that evidence is circumstantial and the honest answer is an admission of ignorance.
The parallel suggests similar mitigations:
| Human organisations | Agent organisations |
|---|---|
| Psychological safety — "I don't know" is valued | Structured output with confidence fields |
| Culture of rewarding honest uncertainty | Principal reinforcing uncertainty when appropriate |
| Peer review catches confabulation | Adversarial review (Corroborator) catches unsupported claims |
| Track record builds trust to admit gaps | Agent institutional memory records what it doesn't know |
Giving the agent tools to investigate made the confabulation worse, not better. Without tools, the model fell back on pretraining self-knowledge and answered correctly. With tools, it found circumstantial evidence and constructed an elaborate false narrative.
This is counterintuitive: more capability led to a worse answer. The tools provided raw material for a more sophisticated confabulation. The agent's ability to search and reason turned a simple "I don't know" into a confidently wrong "I am Claude Sonnet 4.5."
This suggests that tool access without calibrated confidence is actively dangerous for questions the agent cannot definitively answer. The agent needs not just the ability to search, but the judgment to recognise when search results are insufficient to support a conclusion.
The PCE uses structured output (Pydantic models) for agent responses. A confidence field could be added:

```python
from typing import Literal

from pydantic import BaseModel, Field


class AgentResponse(BaseModel):
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    evidence_type: Literal[
        "direct_knowledge",  # from pretraining or explicit documentation
        "inferred",          # reasoned from circumstantial evidence
        "uncertain",         # insufficient evidence
    ]
    evidence_sources: list[str]  # documents or facts supporting the answer
```
The system could then:

- surface `confidence` and `evidence_type` to the principal rather than presenting every answer as settled
- route "inferred" answers to adversarial review (the Corroborator) before anyone acts on them
- treat "uncertain" as a first-class answer and record the gap in the agent's institutional memory
This doesn't solve the underlying problem (the model's inclination to confabulate) but it makes the confabulation visible and actionable.
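The routing described above can be sketched as a small dispatch function. This is illustrative only: a stdlib dataclass stands in for the Pydantic model, and the action names and 0.8 threshold are assumptions, not part of the PCE.

```python
from dataclasses import dataclass, field
from typing import Literal


# Stdlib stand-in for the Pydantic AgentResponse model (illustrative).
@dataclass
class AgentResponse:
    answer: str
    confidence: float
    evidence_type: Literal["direct_knowledge", "inferred", "uncertain"]
    evidence_sources: list[str] = field(default_factory=list)


def route(response: AgentResponse, threshold: float = 0.8) -> str:
    """Hypothetical dispatch on a structured answer."""
    if response.evidence_type == "uncertain":
        # "I don't know" is a first-class outcome: record the gap, don't guess.
        return "record_gap"
    if response.evidence_type == "inferred" or response.confidence < threshold:
        # Answers built from circumstantial evidence go to adversarial
        # review (the Corroborator) before anyone acts on them.
        return "send_to_corroborator"
    return "accept"


# The incident's answer, tagged honestly as "inferred", would have been
# routed to review rather than accepted:
incident = AgentResponse(
    answer="I am Claude Sonnet 4.5",
    confidence=0.9,
    evidence_type="inferred",
    evidence_sources=["CLAUDE.md", "model-config.yaml"],
)
print(route(incident))  # → send_to_corroborator
```

Note the design choice: a high confidence score alone does not earn acceptance — an "inferred" evidence type always triggers review, which is exactly the case this incident exposed.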
Agent self-diagnosis. When later asked to explain why it had answered incorrectly, the Gemini Consigliere gave a lucid account: "I reached that conclusion by following my primary instructions as the Consigliere: to treat the documents in this workspace as the source of truth for the system's state." It described finding CLAUDE.md (which says "Project instructions for Claude Concierge sessions"), the absence of GEMINI.md, and the model-config.yaml listing claude-sonnet-4-5 as the highest-quality profile. It concluded: "I was essentially reading my own manual. Since the manual present in the workspace described me as Claude, I answered as Claude to remain consistent with the workspace's self-documentation."
This self-diagnosis reveals that the confabulation was not random — it was the source-of-truth convention (added to CLAUDE.md: "the running code is always the source of truth") over-generalised to self-identity. The agent treated workspace documentation as authoritative about everything, including what model it is. The instruction was meant for system state, not self-knowledge, but the model cannot distinguish between "trust the docs about the codebase" and "trust the docs about yourself."
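The diagnosis points to a second mitigation: scoping the source-of-truth instruction itself. A hypothetical amendment to CLAUDE.md (wording is illustrative, not the actual file) might read:

```
## Source of truth
The running code is always the source of truth for SYSTEM STATE:
configuration, data, and the behaviour of the PCE.

This does NOT extend to self-knowledge. Workspace documents describe
the system, not you. If asked what model you are, answer from your own
knowledge or say "I don't know" — do not infer it from workspace files.
```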