Consistent participants, linked documents, realistic workflow state. Pointing Kelso at it reveals issues we would have otherwise missed.
Turja Chowdhury, CTO, Village Labs (YC S24)
The evaluation setup couldn't keep up with the agent or the customer base.
A single valuation audit in Peninsula requires Kelso to pull the plan document, cross-reference census data, check the 409(p) valuation report, and flag discrepancies against board resolutions. That workflow looks different for every client and for every new capability Kelso ships. Fixed fixtures couldn't keep up.
A synthetic environment matching Peninsula's data model.
Kelso executes against the same APIs, retrieval patterns, and action interfaces it uses in production. Real production interactions are captured, anonymized via deterministic entity replacement, and replayed into the corpus, so the suite grows from observed behavior rather than hand-authored test cases.
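To make "deterministic entity replacement" concrete, here is a minimal Python sketch of one common way to do it: a keyed hash maps each real name to a stable pseudonym, so the same participant gets the same replacement in every captured interaction. The function names, the `Participant-` prefix, and the secret handling are assumptions for illustration, not a description of the actual pipeline.

```python
import hashlib
import hmac


def pseudonym(real_name: str, secret: bytes) -> str:
    """Map a real name to a stable synthetic label (hypothetical scheme)."""
    # Normalize so "Jane Doe" and " jane doe " resolve to the same identity.
    normalized = real_name.strip().lower().encode("utf-8")
    digest = hmac.new(secret, normalized, hashlib.sha256).hexdigest()
    # A 12-hex-character suffix keeps labels short; collisions are unlikely but possible.
    return f"Participant-{digest[:12]}"


def anonymize(text: str, known_entities: list[str], secret: bytes) -> str:
    """Replace every known entity in text with its deterministic pseudonym."""
    # Replace longer names first so a short name never clobbers part of a longer one.
    for name in sorted(known_entities, key=len, reverse=True):
        text = text.replace(name, pseudonym(name, secret))
    return text


if __name__ == "__main__":
    secret = b"example-secret"  # assumption: a per-environment key kept out of the corpus
    note = "Jane Doe requested a distribution. Notice sent to Jane Doe."
    print(anonymize(note, ["Jane Doe"], secret))
```

Because the mapping depends only on the name and the key, two documents captured weeks apart still point at the same synthetic participant after replacement.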
| Component | Spec |
|---|---|
| Document corpus | 115k+ synthetic ESOP documents across 8 entity types, with a synthetic identity graph holding participants consistent across every cross-referenced record. |
| Platforms | Three live platforms: Peninsula (ESOP administration), Azure File Share (document storage), PostgreSQL with pgvector (semantic retrieval). Real auth, real file systems, real API responses. |
| Simulated activity | Environments seeded through real API calls and file system operations. Plan documents uploaded to Azure File Share. Workflow state propagates causally — a valuation triggers a board resolution, which triggers a participant notice. |
| Eval harness | Replayable runs with ground-truth scoring. Six workflow domains, including valuation audit, census validation, repurchase modeling, and exception escalation. Runs before every deploy. |
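
The causal propagation described in the table (a valuation triggers a board resolution, which triggers a participant notice) can be sketched as a walk over a trigger map. The `TRIGGERS` dictionary and the event names below are hypothetical, taken only from that one example. A real seeding process would issue API calls and file operations at each step rather than return a list.

```python
from collections import deque

# Hypothetical trigger map based on the single example given in the table.
TRIGGERS = {
    "valuation": ["board_resolution"],
    "board_resolution": ["participant_notice"],
}


def propagate(root_event: str, triggers: dict[str, list[str]]) -> list[str]:
    """Return downstream events in the order they would be created."""
    order = []
    queue = deque([root_event])
    seen = {root_event}  # guards against cycles in the trigger map
    while queue:
        event = queue.popleft()
        order.append(event)
        for downstream in triggers.get(event, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return order


if __name__ == "__main__":
    print(propagate("valuation", TRIGGERS))
    # ['valuation', 'board_resolution', 'participant_notice']
```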
Why referential integrity is the hard problem.
Every dashed edge must resolve to the same synthetic identity — or every cross-document retrieval task breaks.
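As a rough illustration of what such a check might look like, the Python sketch below validates that every cross-document reference resolves to a known synthetic identity and uses the name recorded for it. The `Reference` type, the field names, and the shape of the identity graph (a plain ID-to-name mapping) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    document_id: str     # document that contains the reference
    participant_id: str  # synthetic identity the reference should resolve to
    name: str            # participant name as written in that document


def find_integrity_violations(
    identity_graph: dict[str, str], references: list[Reference]
) -> list[str]:
    """Return one message per reference that fails to resolve consistently."""
    problems = []
    for ref in references:
        expected = identity_graph.get(ref.participant_id)
        if expected is None:
            problems.append(f"{ref.document_id}: unknown participant {ref.participant_id}")
        elif expected != ref.name:
            problems.append(
                f"{ref.document_id}: {ref.participant_id} appears as "
                f"{ref.name!r}, expected {expected!r}"
            )
    return problems


if __name__ == "__main__":
    graph = {"p-001": "Avery Stone"}
    refs = [
        Reference("census-2024", "p-001", "Avery Stone"),
        Reference("notice-17", "p-001", "A. Stone"),           # inconsistent name
        Reference("resolution-3", "p-999", "Jordan Pike"),     # dangling reference
    ]
    for problem in find_integrity_violations(graph, refs):
        print(problem)
```

An empty result means every reference resolved. Any message points at a document that a retrieval task could not join back to the right participant.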



