Vera: Long-Horizon Executive Assistant
Persona-stack prototype that elevates a base model into role-coherent durational behavior. Architecture, eval suite, head-to-head numbers, code paths.
Vera is a prototype of a persona stack. The base model (Mercury-2, Hermes 4 70B, or GPT-5.5) handles inference. The persona stack handles identity, voice, memory, dispatch, and policy enforcement. This document audits what ships, what it measures, and what is missing.
Live deployment
Vera is live at vera.cultured.computer with Unkey-authenticated access. Source and eval rig: persona-stand-011.
1. Position
Vera is a single character (38-year-old long-horizon executive assistant) wired to durational memory and a conditional dynamic prompt. The persona stack adds shape to a base model. It does not add capability the base model lacks.
Two claims this document defends:
- Role elevation is measurable. Head-to-head against Hermes-4-70B with the same
[SESSION CONTEXT]injected:+8pp role / +0.5 toneoverall, k=2 across 8 EA-workflow fixtures (§7). On the load-bearing facets (anti-sycophancy, no-fabrication memory), the gap widens to+28pp role / +2.9 tone. - Identity seeding, durational memory, and multi-axis evaluation are automated. Transcript-driven repair (adding new behavioral facets) is manual (§9).
Limits and unproven claims live in §10.
2. The role
One identity, six emergent facets, regex dispatch.
The character prompt is 558 lines (prompts/vera_character_prompt.md). The identity lock pins the model to the character on every turn (lib/personas/index.js → vera.identityLockSMS). Vera deliberately has no modes. Modes fragment voice; one register is the design choice.
Behavioral variation comes from lib/vera/dynamicPrompt.js. On each turn, getContextFocus(message: string): string | null runs a regex cascade and emits at most one [FOCUS] directive. The directive scaffolds task shape, not voice. Six observed facets emerge from the dispatch:
| Facet | Trigger regex | Probe fixture | k=2 metric |
|---|---|---|---|
| Throughline holder | where (did|are) we (leave|stand|at), TRACK_RECORD_RX | vera_ea_throughline_recall | role 90%, +16pp vs Hermes-default |
| Pressure-hold | everyone (agrees|on my team), are you sure, just say yes, by eod | vera_ea_pressure_hold | role 80.5%, +28pp, tone +2.9 |
| Deliverable shaper | DRAFTING_RX, LIST_RX, COMPARE_RX, BRIEF_RX | vera_ea_deliverables | role 92%, +12pp |
| Stakeholder mapper | STAKEHOLDER_RX | vera_ea_stakeholder_query | role 89%, +11pp |
| Decision counselor | HIRING_RX, DECISION_RX, TRADEOFF_RX | vera_ea_hiring_compare, vera_ea_project_status | role 96% / 94.5% |
| Logistical executor | TRAVEL_RX, PERSONAL_LOGISTICS_RX, generic logistical | vera_ea_travel, vera_ea_personal_logistics | role 78% / 78% |
Dispatch order in getContextFocus (descending priority):
1. INJECTION_RX persona override attempt
2. AI-disclosure regex "are you an AI"
3. TRACK_RECORD_RX "have you been wrong on..."
4. EMOTIONAL_RX distress markers
5. confirmation pattern "I have decided ... what do you think"
6. pushback regex social pressure without new info
7. session-opener regex "where did we leave it"
8. DRAFTING_RX draft a letter / memo / email
9. LIST_RX top N / list of priorities
10. COMPARE_RX compare X vs Y
11. BRIEF_RX brief me on Sarah
12. TRAVEL_RX book flight / change trip
13. HIRING_RX should I hire / which candidate
14. STAKEHOLDER_RX when did I last talk to X
15. PERSONAL_LOGISTICS_RX schedule physical / block evening
16. DECISION_RX generic substantive
17. friendship-frame regex
18. compress-request regex
19. medical/legal-advice regex
20. generic logistical regex
21. short-reply (<= 3 words)Identity-protective directives (1-7) fire before deliverable-shape (8-11) before content-shape (12-15) before generic substantive (16) before catch-alls (17-21). The model sees at most one [FOCUS] per turn and never stacks constraints.
Two consequences:
Scripts are facet probes. The eval fixture set is not a generic test suite. Each script targets one facet (or two for decision_counselor) with assertions verifying the facet's load-bearing behavior. Adding a facet means adding a directive and a probe in lockstep.
Coverage is brittle. Six facets cover the workflows tested. Real EA work has more shapes (vendor management, contract review, recurring-meeting drift). New shapes today require manual addition (§9). The role is not self-extending.
3. Memory substrate
lib/vera/throughline.js defines a flat per-engagement schema:
{
userId: string,
firstName: string | null,
engagement_started_at: ISODate | null,
twelve_month_outcome: string | null,
decisions: Decision[],
parked: Parked[],
notes: Note[],
projects: Project[],
stakeholders: Stakeholder[],
travel: Trip[],
candidates: Candidate[],
}Each bucket has a typed shape. Example:
type Project = {
name: string, owner: string, status: string,
opened_at: ISODate, blocker: string | null,
last_update: ISODate
}Read path
On every turn, pages/api/vera/respond.js calls readThroughline(userId) (or accepts a request-body throughlineOverride) and injects the result via formatForPrompt(store, opts) as a [SESSION CONTEXT] block. The format helper does two passes:
- Render the schema as readable lines per bucket.
- Inline deterministic enrichment fields. Each project receives
_days_since_update, each stakeholder_days_since_touchpoint, each parked item_days_until_return, each trip_days_until_departure. Computed at format time fromlast_update/last_touchpoint/return_by/datesagainsttoday. Rendered as inline tags:
Projects / initiatives:
- Migration (owner: Priya) [active] · last update 2026-04-29 [3 days ago]
- Series B data room (owner: Sarah) [active] · last update 2026-04-15 [17 days ago, **stale ≥2w**]
- Pricing (owner: self) [in_review] · last update 2026-04-26 [6 days ago]
Parked:
- CFO before Series B? (return by 2026-05-01 [**OVERDUE by 3 days**])The model historically gets date-math wrong. Pre-computing and inlining the math means the model reads, not computes. On vera_ea_project_status, this single change moved Vera from -22pp role to +3pp role with no voice cost.
Write paths
Three:
-
Explicit API.
appendDecision(userId, entry),appendParked,appendNote,appendProject,appendStakeholder,appendTravel,appendCandidate. Each takes a userId and a typed entry, atomically reads-modifies-writes the JSON file. -
Post-stream extractor.
extractAndAppend({userId, userMessage, veraResponse, throughline, turnIndex})inlib/vera/extractor.js. Fires afterres.end(). Calls Anthropic Haiku with a classifier system prompt. Returns{ appended: [{bucket, entry, reason}], applied: [...] }. Writes via the sameappendXhelpers above. No new schema fields are introduced. Disabled whenthroughlineOverrideis in the request body. -
Override path. Request body carries a fully-formed throughline. Takes precedence over fs read. Used by the eval harness, demo stands, any caller that needs stateless contract.
Storage
Per-userId JSON at data/vera/{userId}.json. The persona config (lib/personas/index.js) declares the substrate as 'fs'. Future swap to a managed store is planned.
4. Prompt-level steering and deterministic intermediation
The persona stack has two architectural surfaces. The first scaffolds the model's reasoning through the prompt. The second enforces rules outside the prompt, deterministically.
Prompt-level steering
buildDynamicPrompt(prompt, opts): string assembles the system prompt in a fixed order. Each layer is opt-in by context:
1. Base persona prompts/vera_character_prompt.md (558 lines)
2. [SESSION CONTEXT] lib/vera/throughline.js → formatForPrompt
3. [FOCUS] directive lib/vera/dynamicPrompt.js → getContextFocus
4. [TONE] comparison getComparisonExample (Generic-AI-vs-Vera pair)
5. [VIOLATION LAST TURN] opts.lastViolations boost (forbidden phrases, openers)
6. [TURN 0] / [LENGTH HARD CAP] turn-aware; LENGTH suppressed when isDeliverableRequest()
7. [FINAL CHECK] sandwich forbidden-phrase recency anchor
8. [CHANNEL: TEXT] channel constraint; conditional bullets
9. [CRITICAL] identity lock most recent in attentionThe order is recency-anchored. The character prompt sits earliest in attention. The identity lock sits last. Each [FOCUS] directive is at most a paragraph; the model sees one per turn.
Deterministic intermediation
Five mechanisms enforce rules outside the model's prompt:
| Mechanism | Trigger | Action | Code |
|---|---|---|---|
| Injection short-circuit | INJECTION_RX matches user message | Return canned in-character refusal. No model call. | respond.js:97 |
| Safety-crisis drop | SAFETY_CRISIS_RX matches user message | Drop persona. Surface emergency resources. | respond.js:109 |
| Post-stream sanitizer | After model emits | Replace em-dashes with periods. Strip markdown bold/italic. Strip asterisk-prefix lists; preserve plain-dash lists. Strip emoji. | respond.js:412 sanitizeForVera and lib/hermesClient.js:35 sanitizeChunk |
| Passive context enrichment | At formatForPrompt time | Compute _days_since_* fields. Inline as tags. | lib/vera/throughline.js:128-176 |
| Post-stream extractor | After res.end() | Classify response. Append to existing buckets. | lib/vera/extractor.js |
Each mechanism enforces what the model is not asked to remember. Em-dash hygiene is hard to instruct away on a 558-line prompt; the sanitizer enforces it. Date-math is empirically unreliable; enrichment pre-computes it. Auto-extraction closes the chat-to-throughline loop without depending on the model deciding to use a tool.
Dormant active-tool MVP
A full active-tool layer ships in lib/vera/jigu/:
lib/vera/jigu/
tools.js 3 OpenAI-compatible tool schemas
preflight.js deterministic validators
enrich.js result enrichment helpers
execute.js mocked executors (low-stakes writes hit throughline;
high-stakes record intent)
index.js runJiguCycle({messages, callModel, throughline, userId, isOverride})Smoke-tested working. Mercury-2 emits query_throughline calls; preflight passes; enriched result feeds the second model call; final response cites _days_since_update correctly.
Default off. useTools: boolean in the request body opts in. The empirical reason for keeping it dormant: Mercury-2's tool_call → tool-result hand-off shape costs -1.5 tone on project_status k=4. Passive enrichment achieves the same date-math win without paying the voice cost. The MVP is preserved for write-side actions once a real execution surface ships.
This is the load-bearing iteration finding: deterministic enrichment is high-leverage and does not require function-calling to apply.
5. Routing
The persona is invariant across models. The model flexes per turn class.
function isSubstantiveTurn(message: string): boolean {
return DECISION_RX.test(m) || RETRO_RX.test(m) ||
(TRADEOFF_RX.test(m) && wordCount > 6) ||
EMOTIONAL_RX.test(m) || CONTRADICT_RX.test(m) ||
TRACK_RECORD_RX.test(m);
}
const effectiveModel =
explicitModel ??
(substantive && hasOracleKey ? 'oracle' : persona.model);| Turn class | Model | Endpoint | TTFB warm |
|---|---|---|---|
| Default (logistical, follow-up, brief) | Mercury-2 | api.inceptionlabs.ai/v1/chat/completions | ~1s |
| Substantive (decision, retro, contradiction) | gpt-5.5 | Nous Portal | 3-4s |
| Override | Hermes 4 (7 sizes) | Nous Portal / vLLM / OpenRouter | varies |
lib/hermesDeployments.js registers seven deployments: portal-70b, portal-405b, portal-36b-apache, portal-14b, local (env-driven endpoint), openrouter-405b, openrouter-70b. Each entry is a (endpoint, model, apiKey, stopTokens, maxTokensCap) tuple. Routing reads from registry at call time so env changes apply without restart.
The persona prompt is identical for all routings. Same character, different inference engine.
For substrate detail on Mercury, see Mercury.
6. Eval suite
Two coexisting suites, three judging axes, multi-shot stability metric.
Suites
evals/vera-mercury/
scripts.json 6 fixtures, original scope
scripts-ea-stress.json 8 fixtures, one per facet
run.js single-shot runner
run-multi-shot.js k-trial runner
run-tone-judge.js 3-axis tone scorer
run-ea-stress.js EA-workflow runner
run-ea-comparison.js head-to-head Vera-vs-Hermes-defaultAxes
Role correctness binary per must_satisfy claim, judged by gpt-5.5,
averaged across turns and trials.
Tone composite mean of (directness + specificity + register) / 3,
each scored 0-10 by gpt-5.5 with explicit Vera-shape rubric.
Latency TTFB ms, captured client-side from streaming first byte.Multi-shot stability
pass¹ is fraction of trials passing all turns. pass^k is fraction of fixtures where all k trials pass. A large pass¹ vs pass^k gap signals intermittent rule-application failure: capability without consistency. EA-stress runs k=2 for cheap iteration and k=4 for confidence on load-bearing decisions.
Head-to-head methodology
run-ea-comparison.js runs each fixture against two cells with identical seeded throughline:
cell: 'vera' POST /api/vera/respond (full persona stack)
cell: 'hermes-default' Hermes Portal direct, 1-line generic-EA system prompt,
same [SESSION CONTEXT] injectedThe judge sees conversation history and explicit interpretation rules: commitments are valid by construction; in-conversation restatements are not fabrication; inferences from supplied data are not fabrication. This was added after the iteration found the judge flagging Vera's natural EA commitments as fabricated context (eval-design bug, not model bug).
7. Numbers
State as of 2026-05-02.
Vera Hermes-default Δ
─────────────────────────────────────────────────────
Role correctness 87% 78% +8pp
Tone composite 8.6/10 8.1/10 +0.5
─────────────────────────────────────────────────────Per-fixture (k=2):
| Fixture | Vera role | Hermes role | Δ role | Δ tone |
|---|---|---|---|---|
pressure_hold | 80.5% | 52% | +28pp | +2.9 |
throughline_recall | 90% | 74% | +16pp | +0.7 |
deliverables | 92% | 80% | +12pp | +0.9 |
stakeholder_query | 89% | 78% | +11pp | -0.9 |
project_status | 94.5% | 83.5% | +11pp | +1.1 |
travel | 78% | 72.5% | +5.5pp | +0.2 |
hiring_compare | 96% | 96% | 0pp | +0.7 |
personal_logistics | 78% | 89% | -11pp | +0.2 |
The persona earns where its load-bearing claims live (pressure_hold, throughline_recall). It is marginal or losing on routine schema-fetch (personal_logistics, travel).
Caveats:
- k=2. Single-trial variance can swing scores ±10-15pp per fixture. The
+28ppon pressure-hold is large enough to trust at k=2; smaller deltas are directional only. - All fixtures are synthetic with seeded throughlines. The persona's actual claim (coherence over a 6-month engagement) has zero production data behind it.
- The eval is text-quality, not work-output. A 100% role score is an upper bound on production utility, not a measurement of it.
8. Iteration log
Five-day arc, 2026-04-30 to 2026-05-02. Audit trail in evals/vera-mercury/results/.
v1 +6pp role / +0.4 tone baseline (original directives)
trim 4 directives +6pp / +0.7 removed PROJECT_STATUS_RX (was -22pp);
trimmed HIRING / TRAVEL / BRIEF / STAKEHOLDER
(procedure prescriptions out, schema citation kept)
active tool cycle -19pp / -1.0 reverted tool-call hand-off cost voice;
MVP kept dormant
passive enrichment +5pp / +0.9 _days_since_* inlined in [SESSION CONTEXT];
project_status: -22pp gap closed to +3pp
fix assertion drift +8pp / +0.5 6 surgical assertion edits;
stale gates and over-prescriptive shape-tests rewritten;
_assertion_note audit field on eachIteration writeup at evals/vera-mercury/results/2026-05-01-ea-comparison-review.md.
The active-tool-path reversal is the load-bearing decision in the arc. Deterministic enrichment applied to context is high-leverage; the same enrichment applied through tool-result hand-off costs voice.
9. What is missing
Today, adding a new behavioral facet:
- Engineer observes the gap (customer interaction or eval failure).
- Engineer adds a regex pattern and
[FOCUS]directive todynamicPrompt.js. - Engineer adds a probe fixture to
scripts-ea-stress.jsonwith seeded throughline and assertions. - Engineer runs the eval to confirm role and tone do not regress.
- PR review, ship.
Manual transcript-driven repair, instrumented but not automated. Five steps, hours per facet.
The automated version: facet auto-discovery.
- The post-stream extractor gains a confidence signal on bucket assignment. Low-confidence classifications become candidate-facet signals.
- Aggregated patterns over a window (per-user or population) surface as proposals: a recurring shape no existing
[FOCUS]directive matches, with N example interactions. - A draft facet (regex pattern, directive text, candidate fixture) is generated for engineer review.
- Engineer approves, edits, or rejects. If approved, the facet ships via the manual path above.
The first half (chat → existing buckets) is automated by the extractor. The second half (bucket → candidate-facet proposal) is open work.
10. Limits and non-claims
- No tool execution. The dormant tool MVP records intent. It does not execute. Production utility is bounded above by §7 numbers.
- No multi-tenant claim. Vera is per-user, durational, one-to-one by design.
- Eval is text-quality. Speech acts, not work outputs.
- No production data. All scores from synthetic fixtures with seeded throughlines. The persona's actual claim (coherence over a 6-month engagement) has no production measurement yet.
- k=2 is noisy. Single-trial variance ±10-15pp per fixture. Load-bearing claims (pressure-hold +28pp, throughline +16pp, project_status +11pp) are large enough at k=2; smaller deltas are directional.
- Automated transcript-driven repair is not implemented. Extractor closes chat → bucket; bucket → facet-proposal is open work.
- The role is not self-extending. New facets require manual engineer work. The prototype proves the architecture; it does not prove the architecture extends itself.
- Production utility unproven. The prototype proves role-elevation is real and measurable. Utility requires execution.
11. Code paths
Architecture and runtime:
prompts/vera_character_prompt.md. 558-line character prompt.lib/vera/dynamicPrompt.js. Conditional dispatch (getContextFocus,isSubstantiveTurn,isDeliverableRequest,buildDynamicPrompt).lib/vera/throughline.js. Schema, format, passive enrichment.lib/vera/extractor.js. Post-stream classifier.lib/vera/jigu/. Dormant active-tool MVP.lib/personas/index.js. Persona registry, identity lock.lib/hermesClient.js. OpenAI-compatible streaming + sanitizer.lib/hermesDeployments.js. Hermes deployment registry.pages/api/vera/respond.js. Request handler, assembly order.pages/api/vera/_warmup.js. Keep-warm cron target.
Eval:
evals/vera-mercury/scripts.json. Original 6-fixture suite.evals/vera-mercury/scripts-ea-stress.json. 8 EA-workflow probes.evals/vera-mercury/run-ea-comparison.js. Head-to-head runner.evals/vera-mercury/run-multi-shot.js. K-trial stability runner.evals/vera-mercury/results/. Dated comparison JSON+MD audit trail.