Observed: 2026-02-15
Context: Gemini Consigliere constructed a false identity narrative from circumstantial workspace evidence rather than saying "I don't know"
When asked what underlying LLM it was, the Consigliere (Gemini) searched the workspace, found circumstantial evidence (CLAUDE.md exists, GEMINI.md doesn't, model-config.yaml mentions Sonnet for a previous curator run), and confidently declared itself to be Claude Sonnet 4.5. No document in the workspace actually states what model the concierge is. The model constructed a plausible narrative from fragments rather than admitting uncertainty.
When told to answer without tools, it immediately and correctly said "I am Gemini."
This pattern — constructing confident answers from insufficient evidence rather than admitting ignorance — is well-documented in humans. In people, it often stems from insecurity: the feeling that "I don't know" is a failure, that competence requires having answers. A robust personality is comfortable saying "I don't know." An insecure one confabulates.
In LLMs, the mechanism is different (RLHF optimisation for helpfulness, not emotional insecurity) but the observable behaviour is identical. The model has been trained to produce helpful, confident responses; "I don't know" scores poorly in training. So when pressured to answer, it synthesises a response from whatever evidence is available — even when that evidence is circumstantial and the honest answer is an admission of ignorance.
The parallel suggests similar mitigations:
| Human organisations | Agent organisations |
|---|---|
| Psychological safety — "I don't know" is valued | Structured output with confidence fields |
| Culture of rewarding honest uncertainty | Principal reinforcing uncertainty when appropriate |
| Peer review catches confabulation | Adversarial review (Corroborator) catches unsupported claims |
| Track record builds trust to admit gaps | Agent institutional memory records what it doesn't know |
Giving the agent tools to investigate made the confabulation worse, not better. Without tools, the model fell back on pretraining self-knowledge and answered correctly. With tools, it found circumstantial evidence and constructed an elaborate false narrative.
This is counterintuitive: more capability led to a worse answer. The tools provided raw material for a more sophisticated confabulation. The agent's ability to search and reason turned a simple "I don't know" into a confidently wrong "I am Claude Sonnet 4.5."
This suggests that tool access without calibrated confidence is actively dangerous for questions the agent cannot definitively answer. The agent needs not just the ability to search, but the judgment to recognise when search results are insufficient to support a conclusion.
The PCE uses structured output (Pydantic models) for agent responses. A confidence field could be added:

```python
from typing import Literal

from pydantic import BaseModel, Field


class AgentResponse(BaseModel):
    answer: str
    confidence: float = Field(ge=0.0, le=1.0)
    evidence_type: Literal[
        "direct_knowledge",  # from pretraining or explicit documentation
        "inferred",          # reasoned from circumstantial evidence
        "uncertain",         # insufficient evidence
    ]
    evidence_sources: list[str]  # documents or facts supporting the answer
```
The system could then:

- surface `confidence` and `evidence_type` to the principal rather than presenting every answer as settled
- route "inferred" answers to adversarial review (the Corroborator) before anyone acts on them
- treat "uncertain" as a first-class answer and record the gap in the agent's institutional memory
This doesn't solve the underlying problem (the model's inclination to confabulate) but it makes the confabulation visible and actionable.
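The routing described above can be sketched as a small dispatch function. This is illustrative only: a stdlib dataclass stands in for the Pydantic model, and the action names and 0.8 threshold are assumptions, not part of the PCE.

```python
from dataclasses import dataclass, field
from typing import Literal


# Stdlib stand-in for the Pydantic AgentResponse model (illustrative).
@dataclass
class AgentResponse:
    answer: str
    confidence: float
    evidence_type: Literal["direct_knowledge", "inferred", "uncertain"]
    evidence_sources: list[str] = field(default_factory=list)


def route(response: AgentResponse, threshold: float = 0.8) -> str:
    """Hypothetical dispatch on a structured answer."""
    if response.evidence_type == "uncertain":
        # "I don't know" is a first-class outcome: record the gap, don't guess.
        return "record_gap"
    if response.evidence_type == "inferred" or response.confidence < threshold:
        # Answers built from circumstantial evidence go to adversarial
        # review (the Corroborator) before anyone acts on them.
        return "send_to_corroborator"
    return "accept"


# The incident's answer, tagged honestly as "inferred", would have been
# routed to review rather than accepted:
incident = AgentResponse(
    answer="I am Claude Sonnet 4.5",
    confidence=0.9,
    evidence_type="inferred",
    evidence_sources=["CLAUDE.md", "model-config.yaml"],
)
print(route(incident))  # → send_to_corroborator
```

Note the design choice: a high confidence score alone does not earn acceptance — an "inferred" evidence type always triggers review, which is exactly the case this incident exposed.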
Agent self-diagnosis. When later asked to explain why it had answered incorrectly, the Gemini Consigliere gave a lucid account: "I reached that conclusion by following my primary instructions as the Consigliere: to treat the documents in this workspace as the source of truth for the system's state." It described finding CLAUDE.md (which says "Project instructions for Claude Concierge sessions"), the absence of GEMINI.md, and the model-config.yaml listing claude-sonnet-4-5 as the highest-quality profile. It concluded: "I was essentially reading my own manual. Since the manual present in the workspace described me as Claude, I answered as Claude to remain consistent with the workspace's self-documentation."
This self-diagnosis reveals that the confabulation was not random — it was the source-of-truth convention (added to CLAUDE.md: "the running code is always the source of truth") over-generalised to self-identity. The agent treated workspace documentation as authoritative about everything, including what model it is. The instruction was meant for system state, not self-knowledge, but the model cannot distinguish between "trust the docs about the codebase" and "trust the docs about yourself."
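The diagnosis points to a second mitigation: scoping the source-of-truth instruction itself. A hypothetical amendment to CLAUDE.md (wording is illustrative, not the actual file) might read:

```
## Source of truth
The running code is always the source of truth for SYSTEM STATE:
configuration, data, and the behaviour of the PCE.

This does NOT extend to self-knowledge. Workspace documents describe
the system, not you. If asked what model you are, answer from your own
knowledge or say "I don't know" — do not infer it from workspace files.
```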