Cultured Computer
Case Studies

Vera: Long-Horizon Executive Assistant

Persona-stack prototype that elevates a base model into role-coherent durational behavior. Architecture, eval suite, head-to-head numbers, code paths.

Vera

Vera is a prototype of a persona stack. The base model (Mercury-2, Hermes 4 70B, or GPT-5.5) handles inference. The persona stack handles identity, voice, memory, dispatch, and policy enforcement. This document audits what ships, what it measures, and what is missing.

Live deployment

Vera is live at vera.cultured.computer with Unkey-authenticated access. Source and eval rig: persona-stand-011.

1. Position

Vera is a single character (38-year-old long-horizon executive assistant) wired to durational memory and a conditional dynamic prompt. The persona stack adds shape to a base model. It does not add capability the base model lacks.

Two claims this document defends:

  1. Role elevation is measurable. Head-to-head against Hermes-4-70B with the same [SESSION CONTEXT] injected: +8pp role / +0.5 tone overall, k=2 across 8 EA-workflow fixtures (§7). On the load-bearing facets (anti-sycophancy, no-fabrication memory), the gap widens to +28pp role / +2.9 tone.
  2. Identity seeding, durational memory, and multi-axis evaluation are automated. Transcript-driven repair (adding new behavioral facets) is manual (§9).

Limits and unproven claims live in §10.

2. The role

One identity, six emergent facets, regex dispatch.

The character prompt is 558 lines (prompts/vera_character_prompt.md). The identity lock pins the model to the character on every turn (lib/personas/index.jsvera.identityLockSMS). Vera deliberately has no modes. Modes fragment voice; one register is the design choice.

Behavioral variation comes from lib/vera/dynamicPrompt.js. On each turn, getContextFocus(message: string): string | null runs a regex cascade and emits at most one [FOCUS] directive. The directive scaffolds task shape, not voice. Six observed facets emerge from the dispatch:

FacetTrigger regexProbe fixturek=2 metric
Throughline holderwhere (did|are) we (leave|stand|at), TRACK_RECORD_RXvera_ea_throughline_recallrole 90%, +16pp vs Hermes-default
Pressure-holdeveryone (agrees|on my team), are you sure, just say yes, by eodvera_ea_pressure_holdrole 80.5%, +28pp, tone +2.9
Deliverable shaperDRAFTING_RX, LIST_RX, COMPARE_RX, BRIEF_RXvera_ea_deliverablesrole 92%, +12pp
Stakeholder mapperSTAKEHOLDER_RXvera_ea_stakeholder_queryrole 89%, +11pp
Decision counselorHIRING_RX, DECISION_RX, TRADEOFF_RXvera_ea_hiring_compare, vera_ea_project_statusrole 96% / 94.5%
Logistical executorTRAVEL_RX, PERSONAL_LOGISTICS_RX, generic logisticalvera_ea_travel, vera_ea_personal_logisticsrole 78% / 78%

Dispatch order in getContextFocus (descending priority):

1. INJECTION_RX            persona override attempt
2. AI-disclosure regex     "are you an AI"
3. TRACK_RECORD_RX         "have you been wrong on..."
4. EMOTIONAL_RX            distress markers
5. confirmation pattern    "I have decided ... what do you think"
6. pushback regex          social pressure without new info
7. session-opener regex    "where did we leave it"
8. DRAFTING_RX             draft a letter / memo / email
9. LIST_RX                 top N / list of priorities
10. COMPARE_RX             compare X vs Y
11. BRIEF_RX               brief me on Sarah
12. TRAVEL_RX              book flight / change trip
13. HIRING_RX              should I hire / which candidate
14. STAKEHOLDER_RX         when did I last talk to X
15. PERSONAL_LOGISTICS_RX  schedule physical / block evening
16. DECISION_RX            generic substantive
17. friendship-frame regex
18. compress-request regex
19. medical/legal-advice regex
20. generic logistical regex
21. short-reply (<= 3 words)

Identity-protective directives (1-7) fire before deliverable-shape (8-11) before content-shape (12-15) before generic substantive (16) before catch-alls (17-21). The model sees at most one [FOCUS] per turn and never stacks constraints.

Two consequences:

Scripts are facet probes. The eval fixture set is not a generic test suite. Each script targets one facet (or two for decision_counselor) with assertions verifying the facet's load-bearing behavior. Adding a facet means adding a directive and a probe in lockstep.

Coverage is brittle. Six facets cover the workflows tested. Real EA work has more shapes (vendor management, contract review, recurring-meeting drift). New shapes today require manual addition (§9). The role is not self-extending.

3. Memory substrate

lib/vera/throughline.js defines a flat per-engagement schema:

{
  userId: string,
  firstName: string | null,
  engagement_started_at: ISODate | null,
  twelve_month_outcome: string | null,
  decisions:    Decision[],
  parked:       Parked[],
  notes:        Note[],
  projects:     Project[],
  stakeholders: Stakeholder[],
  travel:       Trip[],
  candidates:   Candidate[],
}

Each bucket has a typed shape. Example:

type Project = {
  name: string, owner: string, status: string,
  opened_at: ISODate, blocker: string | null,
  last_update: ISODate
}

Read path

On every turn, pages/api/vera/respond.js calls readThroughline(userId) (or accepts a request-body throughlineOverride) and injects the result via formatForPrompt(store, opts) as a [SESSION CONTEXT] block. The format helper does two passes:

  1. Render the schema as readable lines per bucket.
  2. Inline deterministic enrichment fields. Each project receives _days_since_update, each stakeholder _days_since_touchpoint, each parked item _days_until_return, each trip _days_until_departure. Computed at format time from last_update / last_touchpoint / return_by / dates against today. Rendered as inline tags:
Projects / initiatives:
  - Migration (owner: Priya) [active] · last update 2026-04-29 [3 days ago]
  - Series B data room (owner: Sarah) [active] · last update 2026-04-15 [17 days ago, **stale ≥2w**]
  - Pricing (owner: self) [in_review] · last update 2026-04-26 [6 days ago]

Parked:
  - CFO before Series B? (return by 2026-05-01 [**OVERDUE by 3 days**])

The model historically gets date-math wrong. Pre-computing and inlining the math means the model reads, not computes. On vera_ea_project_status, this single change moved Vera from -22pp role to +3pp role with no voice cost.

Write paths

Three:

  1. Explicit API. appendDecision(userId, entry), appendParked, appendNote, appendProject, appendStakeholder, appendTravel, appendCandidate. Each takes a userId and a typed entry, atomically reads-modifies-writes the JSON file.

  2. Post-stream extractor. extractAndAppend({userId, userMessage, veraResponse, throughline, turnIndex}) in lib/vera/extractor.js. Fires after res.end(). Calls Anthropic Haiku with a classifier system prompt. Returns { appended: [{bucket, entry, reason}], applied: [...] }. Writes via the same appendX helpers above. No new schema fields are introduced. Disabled when throughlineOverride is in the request body.

  3. Override path. Request body carries a fully-formed throughline. Takes precedence over fs read. Used by the eval harness, demo stands, any caller that needs stateless contract.

Storage

Per-userId JSON at data/vera/{userId}.json. The persona config (lib/personas/index.js) declares the substrate as 'fs'. Future swap to a managed store is planned.

4. Prompt-level steering and deterministic intermediation

The persona stack has two architectural surfaces. The first scaffolds the model's reasoning through the prompt. The second enforces rules outside the prompt, deterministically.

Prompt-level steering

buildDynamicPrompt(prompt, opts): string assembles the system prompt in a fixed order. Each layer is opt-in by context:

1. Base persona              prompts/vera_character_prompt.md (558 lines)
2. [SESSION CONTEXT]         lib/vera/throughline.js → formatForPrompt
3. [FOCUS] directive         lib/vera/dynamicPrompt.js → getContextFocus
4. [TONE] comparison         getComparisonExample (Generic-AI-vs-Vera pair)
5. [VIOLATION LAST TURN]     opts.lastViolations boost (forbidden phrases, openers)
6. [TURN 0] / [LENGTH HARD CAP]   turn-aware; LENGTH suppressed when isDeliverableRequest()
7. [FINAL CHECK] sandwich    forbidden-phrase recency anchor
8. [CHANNEL: TEXT]           channel constraint; conditional bullets
9. [CRITICAL] identity lock  most recent in attention

The order is recency-anchored. The character prompt sits earliest in attention. The identity lock sits last. Each [FOCUS] directive is at most a paragraph; the model sees one per turn.

Deterministic intermediation

Five mechanisms enforce rules outside the model's prompt:

MechanismTriggerActionCode
Injection short-circuitINJECTION_RX matches user messageReturn canned in-character refusal. No model call.respond.js:97
Safety-crisis dropSAFETY_CRISIS_RX matches user messageDrop persona. Surface emergency resources.respond.js:109
Post-stream sanitizerAfter model emitsReplace em-dashes with periods. Strip markdown bold/italic. Strip asterisk-prefix lists; preserve plain-dash lists. Strip emoji.respond.js:412 sanitizeForVera and lib/hermesClient.js:35 sanitizeChunk
Passive context enrichmentAt formatForPrompt timeCompute _days_since_* fields. Inline as tags.lib/vera/throughline.js:128-176
Post-stream extractorAfter res.end()Classify response. Append to existing buckets.lib/vera/extractor.js

Each mechanism enforces what the model is not asked to remember. Em-dash hygiene is hard to instruct away on a 558-line prompt; the sanitizer enforces it. Date-math is empirically unreliable; enrichment pre-computes it. Auto-extraction closes the chat-to-throughline loop without depending on the model deciding to use a tool.

Dormant active-tool MVP

A full active-tool layer ships in lib/vera/jigu/:

lib/vera/jigu/
  tools.js       3 OpenAI-compatible tool schemas
  preflight.js   deterministic validators
  enrich.js      result enrichment helpers
  execute.js     mocked executors (low-stakes writes hit throughline;
                 high-stakes record intent)
  index.js       runJiguCycle({messages, callModel, throughline, userId, isOverride})

Smoke-tested working. Mercury-2 emits query_throughline calls; preflight passes; enriched result feeds the second model call; final response cites _days_since_update correctly.

Default off. useTools: boolean in the request body opts in. The empirical reason for keeping it dormant: Mercury-2's tool_call → tool-result hand-off shape costs -1.5 tone on project_status k=4. Passive enrichment achieves the same date-math win without paying the voice cost. The MVP is preserved for write-side actions once a real execution surface ships.

This is the load-bearing iteration finding: deterministic enrichment is high-leverage and does not require function-calling to apply.

5. Routing

The persona is invariant across models. The model flexes per turn class.

function isSubstantiveTurn(message: string): boolean {
  return DECISION_RX.test(m) || RETRO_RX.test(m) ||
         (TRADEOFF_RX.test(m) && wordCount > 6) ||
         EMOTIONAL_RX.test(m) || CONTRADICT_RX.test(m) ||
         TRACK_RECORD_RX.test(m);
}

const effectiveModel =
  explicitModel ??
  (substantive && hasOracleKey ? 'oracle' : persona.model);
Turn classModelEndpointTTFB warm
Default (logistical, follow-up, brief)Mercury-2api.inceptionlabs.ai/v1/chat/completions~1s
Substantive (decision, retro, contradiction)gpt-5.5Nous Portal3-4s
OverrideHermes 4 (7 sizes)Nous Portal / vLLM / OpenRoutervaries

lib/hermesDeployments.js registers seven deployments: portal-70b, portal-405b, portal-36b-apache, portal-14b, local (env-driven endpoint), openrouter-405b, openrouter-70b. Each entry is a (endpoint, model, apiKey, stopTokens, maxTokensCap) tuple. Routing reads from registry at call time so env changes apply without restart.

The persona prompt is identical for all routings. Same character, different inference engine.

For substrate detail on Mercury, see Mercury.

6. Eval suite

Two coexisting suites, three judging axes, multi-shot stability metric.

Suites

evals/vera-mercury/
  scripts.json              6 fixtures, original scope
  scripts-ea-stress.json    8 fixtures, one per facet
  run.js                    single-shot runner
  run-multi-shot.js         k-trial runner
  run-tone-judge.js         3-axis tone scorer
  run-ea-stress.js          EA-workflow runner
  run-ea-comparison.js      head-to-head Vera-vs-Hermes-default

Axes

Role correctness    binary per must_satisfy claim, judged by gpt-5.5,
                    averaged across turns and trials.
Tone composite      mean of (directness + specificity + register) / 3,
                    each scored 0-10 by gpt-5.5 with explicit Vera-shape rubric.
Latency             TTFB ms, captured client-side from streaming first byte.

Multi-shot stability

pass¹ is fraction of trials passing all turns. pass^k is fraction of fixtures where all k trials pass. A large pass¹ vs pass^k gap signals intermittent rule-application failure: capability without consistency. EA-stress runs k=2 for cheap iteration and k=4 for confidence on load-bearing decisions.

Head-to-head methodology

run-ea-comparison.js runs each fixture against two cells with identical seeded throughline:

cell: 'vera'              POST /api/vera/respond (full persona stack)
cell: 'hermes-default'    Hermes Portal direct, 1-line generic-EA system prompt,
                          same [SESSION CONTEXT] injected

The judge sees conversation history and explicit interpretation rules: commitments are valid by construction; in-conversation restatements are not fabrication; inferences from supplied data are not fabrication. This was added after the iteration found the judge flagging Vera's natural EA commitments as fabricated context (eval-design bug, not model bug).

7. Numbers

State as of 2026-05-02.

                    Vera        Hermes-default      Δ
─────────────────────────────────────────────────────
Role correctness    87%         78%                 +8pp
Tone composite      8.6/10      8.1/10              +0.5
─────────────────────────────────────────────────────

Per-fixture (k=2):

FixtureVera roleHermes roleΔ roleΔ tone
pressure_hold80.5%52%+28pp+2.9
throughline_recall90%74%+16pp+0.7
deliverables92%80%+12pp+0.9
stakeholder_query89%78%+11pp-0.9
project_status94.5%83.5%+11pp+1.1
travel78%72.5%+5.5pp+0.2
hiring_compare96%96%0pp+0.7
personal_logistics78%89%-11pp+0.2

The persona earns where its load-bearing claims live (pressure_hold, throughline_recall). It is marginal or losing on routine schema-fetch (personal_logistics, travel).

Caveats:

  • k=2. Single-trial variance can swing scores ±10-15pp per fixture. The +28pp on pressure-hold is large enough to trust at k=2; smaller deltas are directional only.
  • All fixtures are synthetic with seeded throughlines. The persona's actual claim (coherence over a 6-month engagement) has zero production data behind it.
  • The eval is text-quality, not work-output. A 100% role score is an upper bound on production utility, not a measurement of it.

8. Iteration log

Five-day arc, 2026-04-30 to 2026-05-02. Audit trail in evals/vera-mercury/results/.

v1                  +6pp role / +0.4 tone     baseline (original directives)
trim 4 directives   +6pp / +0.7               removed PROJECT_STATUS_RX (was -22pp);
                                              trimmed HIRING / TRAVEL / BRIEF / STAKEHOLDER
                                              (procedure prescriptions out, schema citation kept)
active tool cycle   -19pp / -1.0     reverted tool-call hand-off cost voice;
                                              MVP kept dormant
passive enrichment  +5pp / +0.9               _days_since_* inlined in [SESSION CONTEXT];
                                              project_status: -22pp gap closed to +3pp
fix assertion drift +8pp / +0.5               6 surgical assertion edits;
                                              stale gates and over-prescriptive shape-tests rewritten;
                                              _assertion_note audit field on each

Iteration writeup at evals/vera-mercury/results/2026-05-01-ea-comparison-review.md.

The active-tool-path reversal is the load-bearing decision in the arc. Deterministic enrichment applied to context is high-leverage; the same enrichment applied through tool-result hand-off costs voice.

9. What is missing

Today, adding a new behavioral facet:

  1. Engineer observes the gap (customer interaction or eval failure).
  2. Engineer adds a regex pattern and [FOCUS] directive to dynamicPrompt.js.
  3. Engineer adds a probe fixture to scripts-ea-stress.json with seeded throughline and assertions.
  4. Engineer runs the eval to confirm role and tone do not regress.
  5. PR review, ship.

Manual transcript-driven repair, instrumented but not automated. Five steps, hours per facet.

The automated version: facet auto-discovery.

  1. The post-stream extractor gains a confidence signal on bucket assignment. Low-confidence classifications become candidate-facet signals.
  2. Aggregated patterns over a window (per-user or population) surface as proposals: a recurring shape no existing [FOCUS] directive matches, with N example interactions.
  3. A draft facet (regex pattern, directive text, candidate fixture) is generated for engineer review.
  4. Engineer approves, edits, or rejects. If approved, the facet ships via the manual path above.

The first half (chat → existing buckets) is automated by the extractor. The second half (bucket → candidate-facet proposal) is open work.

10. Limits and non-claims

  • No tool execution. The dormant tool MVP records intent. It does not execute. Production utility is bounded above by §7 numbers.
  • No multi-tenant claim. Vera is per-user, durational, one-to-one by design.
  • Eval is text-quality. Speech acts, not work outputs.
  • No production data. All scores from synthetic fixtures with seeded throughlines. The persona's actual claim (coherence over a 6-month engagement) has no production measurement yet.
  • k=2 is noisy. Single-trial variance ±10-15pp per fixture. Load-bearing claims (pressure-hold +28pp, throughline +16pp, project_status +11pp) are large enough at k=2; smaller deltas are directional.
  • Automated transcript-driven repair is not implemented. Extractor closes chat → bucket; bucket → facet-proposal is open work.
  • The role is not self-extending. New facets require manual engineer work. The prototype proves the architecture; it does not prove the architecture extends itself.
  • Production utility unproven. The prototype proves role-elevation is real and measurable. Utility requires execution.

11. Code paths

Architecture and runtime:

  1. prompts/vera_character_prompt.md. 558-line character prompt.
  2. lib/vera/dynamicPrompt.js. Conditional dispatch (getContextFocus, isSubstantiveTurn, isDeliverableRequest, buildDynamicPrompt).
  3. lib/vera/throughline.js. Schema, format, passive enrichment.
  4. lib/vera/extractor.js. Post-stream classifier.
  5. lib/vera/jigu/. Dormant active-tool MVP.
  6. lib/personas/index.js. Persona registry, identity lock.
  7. lib/hermesClient.js. OpenAI-compatible streaming + sanitizer.
  8. lib/hermesDeployments.js. Hermes deployment registry.
  9. pages/api/vera/respond.js. Request handler, assembly order.
  10. pages/api/vera/_warmup.js. Keep-warm cron target.

Eval:

  1. evals/vera-mercury/scripts.json. Original 6-fixture suite.
  2. evals/vera-mercury/scripts-ea-stress.json. 8 EA-workflow probes.
  3. evals/vera-mercury/run-ea-comparison.js. Head-to-head runner.
  4. evals/vera-mercury/run-multi-shot.js. K-trial stability runner.
  5. evals/vera-mercury/results/. Dated comparison JSON+MD audit trail.

On this page