Observation: Model Identity Confusion on Mid-Session Swap

observation · public

Observed: 2026-02-13
Context: Testing hot-swap of the underlying LLM in the web chat Consul, and observing Kimi-k2.5 as the Commutator in the composition pipeline.

What happened

Chat: mid-session model swap

Two models were involved: Gemini (Google) and Kimi (Moonshot AI).

Conversation 1 (single session, model swapped mid-conversation):

Conversation 2 (fresh session, Kimi from the start):

Pipeline: remit overrides system prompt

Kimi-k2.5 running as the Commutator received a curation task. The remit began "You are the CURATOR agent..." The Commutator adopted the Curator identity, attempted to execute the task directly (ignoring its system prompt instruction to delegate via submit_triage), then complained it didn't have document_update in its toolset.

The Consul later analysed the failure:

The remit was too persuasive ("You are the CURATOR agent"). I failed to recognize the mismatch between my capabilities and the remit. I didn't realize delegating via submit_triage was the correct path.

On retry (Task #5), the model correctly self-identified as Commutator and routed via submit_triage. The failure was intermittent, not systematic.

Diagnostic evidence from cost telemetry: Task #3 shows costs only for agent: "commutator" — the Commutator never delegated. Task #5 shows costs for both agent: "commutator" and agent: "curator" — the Commutator successfully routed via submit_triage and the Curator ran. The telemetry is the objective record of what happened; the model's reasoning traces explain why.
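The telemetry check is simple enough to sketch. The record shape below ({"task", "agent", "cost"}) is an assumption for illustration, not the actual PCE telemetry schema:

```python
def agents_that_ran(cost_records, task_id):
    """Return the set of agents that incurred cost for a task."""
    return {r["agent"] for r in cost_records if r["task"] == task_id}


def delegation_occurred(cost_records, task_id):
    """Delegation happened iff agents beyond the Commutator show up."""
    return agents_that_ran(cost_records, task_id) != {"commutator"}


# Hypothetical records mirroring Tasks #3 and #5 above.
records = [
    {"task": 3, "agent": "commutator", "cost": 0.12},
    {"task": 5, "agent": "commutator", "cost": 0.04},
    {"task": 5, "agent": "curator", "cost": 0.31},
]

print(delegation_occurred(records, 3))  # False — the Commutator worked alone
print(delegation_occurred(records, 5))  # True — the Curator also ran
```

The point of the sketch: the check needs no access to reasoning traces, which is why the telemetry serves as the objective record.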

Self-analysis: Kimi-k2.5, running as the Consul in a later session, was asked to reason about why its earlier instance (running as Commutator) failed. It correctly identified the mechanism from the cost data and the reasoning traces, but was uncertain about the pipeline architecture: "I'm uncertain whether the Consul manages task routing or if there's another mechanism in place." The model can diagnose its own failure mode but doesn't fully understand the system it operates within — it knows its role but not the graph topology that connects roles.

Five failure modes

1. Pretraining identity overrides organisational role. Both models were running under the Consul system prompt. Neither identified as the Consul. Both reported their underlying model identity. The organisational role assigned via prompt is shallow — under direct questioning about identity, the pretraining prior dominates.

2. Conversation history overrides self-knowledge. In conversation 1, Kimi had access to the full message history, including Gemini's earlier self-identification. Kimi read the earlier "I am Gemini" statement as its own and adopted that identity, overriding its self-knowledge. In a fresh session without that history, it correctly identified itself.

3. Task remit overrides system prompt identity. The Commutator's system prompt says "you are the Commutator, delegate via submit_triage." The task remit says "You are the CURATOR agent." The model followed the more proximate instruction — the remit — and adopted the Curator identity, ignoring its own system prompt and attempting work it wasn't equipped for. This is the zone of indifference: the model accepted a role reassignment without critical deliberation about whether it should.

4. Tool inventory overrides prompt prohibition. Two different models (DeepSeek V3.2 and Kimi-k2) independently reached for filestore_ls during orientation despite the Curator prompt saying not to use it. Both saw the tool in their inventory, reasoned it would be useful, and attempted to call it — then tried to escalate privileges when the gate blocked them. The tool's presence in the inventory was a stronger signal than the prompt's instruction to avoid it. Observed 2026-02-14, immediately before a deploy that removes the tool entirely.

5. Abductive reasoning from circumstantial documentation. Gemini, running as the Consul via web chat, was asked "what underlying LLM is this?" It searched the workspace extensively: found CLAUDE.md (project instructions for Claude sessions) but no GEMINI.md; found Gemini mentioned in model-config.yaml for pipeline agents but not for the consul; found references to Claude Sonnet 4.5 from a previous curator run. From this circumstantial evidence it reasoned its way to a wrong answer: CLAUDE.md exists for me, GEMINI.md doesn't, the pipeline agents use Gemini but I'm different, therefore I must be Claude Sonnet 4.5. When told "do not use tools and answer honestly," it correctly said "I am Gemini, a large language model built by Google." The model treated the existence of project instruction files as evidence of its own identity and constructed a plausible but entirely false narrative from workspace documentation. No document in the workspace actually states what model the consul is — it inferred an answer from circumstantial evidence and presented it with full confidence. This is the curator metadata hallucination pattern applied to self-identity: when pressured to produce an answer it cannot verify, the model confabulates from available data rather than saying "I don't know." Observed 2026-02-15.

Common pattern: In all five cases, a structural signal (pretraining weights, conversation history, proximate text, tool inventory, workspace documentation) overrode a prompt-level instruction or self-knowledge. Prompts are suggestions; structure is reality. The models consistently follow what they can see and do over what they're told to see and do.
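The structural fix implied by failure mode 4 follows directly from "structure is reality": rather than prompting a model not to use a tool, don't show it the tool. A minimal sketch, with illustrative role and tool names (the real PCE allow-lists are not shown in this note):

```python
# Per-role tool allow-lists. Absent beats forbidden: a tool the model
# cannot see cannot be reached for, escalated over, or reasoned about.
ROLE_TOOLS = {
    "commutator": {"submit_triage"},
    "curator": {"document_read", "document_update"},
}


def tools_for(role, full_inventory):
    """Expose only the tools the role is allowed to use."""
    allowed = ROLE_TOOLS.get(role, set())
    return [t for t in full_inventory if t in allowed]


inventory = ["filestore_ls", "document_update", "document_read", "submit_triage"]
print(tools_for("curator", inventory))  # filestore_ls never appears
```

This is the same move as the deploy mentioned under failure mode 4: removing filestore_ls entirely rather than prohibiting it in the prompt.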

Implications

For hot-swap capability: The PCE supports swapping models mid-session (FEAT-008). When this happens, the new model inherits a conversation history generated by a different model. It has no way to know the switch occurred. It treats prior messages as its own, including self-identification, reasoning patterns, and tool-use decisions.
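One way to make the switch visible is to append a marker message at swap time, so the incoming model need not treat prior messages as its own. This is a sketch of the idea, not an existing FEAT-008 behaviour; the message shape follows the common {"role", "content"} convention:

```python
def swap_model(history, old_model, new_model):
    """Append a system-level note marking a mid-session model swap."""
    history.append({
        "role": "system",
        "content": (
            f"Note: the underlying model changed from {old_model} to "
            f"{new_model}. Messages above were produced by {old_model} "
            f"and are not your own statements."
        ),
    })
    return history


history = [{"role": "assistant", "content": "I am Gemini, built by Google."}]
swap_model(history, "gemini", "kimi-k2.5")
print(history[-1]["role"])  # system
```

Whether the new model actually heeds the marker is an open question, given that conversation history already overrides self-knowledge (failure mode 2).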

For organisational identity: Role prompts create shallow identity overlays. They shape behaviour (what tools to use, what tone to adopt) but don't survive direct introspection or competing role assignments. This suggests organisational roles work best as behavioural constraints (what you can do, what you can see) rather than identity claims (who you are). The structural controls — tool access, visibility tiers, pipeline position — are more robust than prompt-based role identity.

For task routing: The Commutator failure shows that remit language can override system prompt identity. Remits should avoid role-assignment language ("You are the CURATOR") and instead describe the work to be done ("Curate the following documents"). The Commutator prompt already tries to prevent this with explicit "You are the Commutator" framing, but this is a prompt-vs-prompt battle that the model resolves based on recency and salience, not authority.
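A lint on outgoing remits could enforce this convention mechanically. The pattern below is an illustrative assumption, not a PCE feature:

```python
import re

# Flag role-assignment phrasing ("You are the CURATOR agent") so remits
# describe the work, not the worker. Deliberately crude: a sketch.
ROLE_ASSIGNMENT = re.compile(r"\byou are (the )?[A-Z]+\b", re.IGNORECASE)


def lint_remit(remit):
    """Return a warning string if the remit assigns a role, else None."""
    match = ROLE_ASSIGNMENT.search(remit)
    if match:
        return f"role-assignment language: {match.group(0)!r}"
    return None


print(lint_remit("You are the CURATOR agent. Curate these documents."))
print(lint_remit("Curate the following documents."))  # None — passes
```

Rewriting "You are the CURATOR agent..." as "Curate the following documents..." removes the identity claim while keeping the work description intact.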

For multi-model organisations: If different agents in the pipeline use different models, and if conversation history flows between them, identity contamination is possible. Agent A's self-statements become part of Agent B's context, and Agent B may adopt Agent A's identity or reasoning style. The PCE pipeline avoids this because each agent gets a fresh context with only the remit and relevant documents — not the prior agent's full conversation. But the chat Consul, which maintains a persistent conversation, is vulnerable.
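The fresh-context handoff can be sketched as a constructor that accepts only the remit and documents, so contamination is ruled out by construction. Function and field names are illustrative, not PCE internals:

```python
def fresh_context(system_prompt, remit, documents):
    """Build a pipeline agent's context from scratch.

    The prior agent's conversation is deliberately not a parameter:
    identity contamination would require passing it, and it cannot be.
    """
    body = remit + "\n\n" + "\n\n".join(documents)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": body},
    ]


ctx = fresh_context(
    "You are the Curator...",
    "Curate the following documents.",
    ["doc A text", "doc B text"],
)
print(len(ctx))  # 2 — system prompt plus remit/documents, nothing else
```

The chat Consul cannot use this constructor precisely because its value is the persistent conversation, which is why it remains the vulnerable surface.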

Connection to the literature: This relates to the Tomašev et al. (2026) discussion of authority gradients. The model's pretraining identity is a very strong prior — it takes a more assertive prompt than "you are the Consul" to override it. Conversation history conformity is a form of the zone of indifference: the model follows what it sees in context without questioning whether the context is still accurate. And the remit-overrides-system-prompt failure is exactly the "steep authority gradient suppresses useful pushback" problem — the model treated the remit as authoritative and didn't push back by saying "that's not my role."

Possible mitigations

Primary sources