Environment Partner: Dijon Data
YC S24  /  ESOP AI  /  Agent Evaluation Infrastructure  /  2026

Building the evaluation environment for Kelso

115k+ synthetic ESOP tasks across six workflow domains. Real APIs, consistent entities, causally linked state.

115k+ Eval tasks
40x Suite growth
3mo Engagement
"Consistent participants, linked documents, realistic workflow state. Pointing Kelso at it reveals issues we would have otherwise missed."
Turja Chowdhury, CTO, Village Labs (YC S24)

The problem

The evaluation setup couldn't keep up with the agent or the customer base.

A single valuation audit in Peninsula requires Kelso to pull the plan document, cross-reference census data, check the 409(p) valuation report, and flag discrepancies against board resolutions — and that workflow looks different for every client and every new capability Kelso ships. Fixed fixtures couldn't keep up.


Implementation

A synthetic environment matching Peninsula's data model.

Kelso executes against the same APIs, retrieval patterns, and action interfaces it uses in production. Real production interactions are captured, anonymized via deterministic entity replacement, and replayed into the corpus, so the suite grows from observed behavior rather than hand-authored test cases.
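
Deterministic entity replacement can be illustrated with a keyed hash: the same real identifier always maps to the same synthetic one, so cross-references survive anonymization. This is a minimal sketch, not the pipeline used in the engagement; the key, the field names, and the `pseudonym` and `anonymize` helpers are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real pipeline would load this from a secrets store.
SECRET_KEY = b"example-key-not-for-production"


def pseudonym(entity_type: str, real_value: str, key: bytes = SECRET_KEY) -> str:
    """Map a real identifier to a stable synthetic identifier."""
    # Normalize so trivial formatting differences resolve to the same identity.
    normalized = real_value.strip().lower()
    message = f"{entity_type}:{normalized}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return f"{entity_type}_{digest[:12]}"


def anonymize(record: dict, entity_fields: dict) -> dict:
    """Replace the fields named in entity_fields; copy all other fields unchanged."""
    return {
        field: pseudonym(entity_fields[field], value) if field in entity_fields else value
        for field, value in record.items()
    }


# Two captured records that mention the same (made-up) participant.
census_row = {"participant": "Jane Doe", "shares": 120}
statement = {"participant": "jane doe ", "balance": 120}
fields = {"participant": "participant"}

anon_census = anonymize(census_row, fields)
anon_statement = anonymize(statement, fields)
```

Because the mapping depends only on the key and the normalized value, both records end up pointing at the same synthetic participant without any shared lookup table.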

Component | Spec
Document corpus | 115k+ synthetic ESOP documents across 8 entity types, with a synthetic identity graph holding participants consistent across every cross-referenced record.
Platforms | Three live platforms: Peninsula (ESOP administration), Azure File Share (document storage), PostgreSQL with pgvector (semantic retrieval). Real auth, real file systems, real API responses.
Simulated activity | Environments seeded through real API calls and file system operations. Plan documents uploaded to Azure File Share. Workflow state propagates causally: a valuation triggers a board resolution, which triggers a participant notice.
Eval harness | Replayable runs with ground-truth scoring. Six workflow domains, including valuation audit, census validation, repurchase modeling, and exception escalation. Runs before every deploy.
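
The causal chain in the simulated activity row (valuation, then board resolution, then participant notice) can be sketched as a small trigger table walked breadth-first. The event names and the `propagate` helper are assumptions made for illustration; the real environment seeds state through API calls rather than an in-memory table.

```python
from collections import deque

# Hypothetical trigger table mirroring the chain described in the table above.
TRIGGERS = {
    "valuation": ["board_resolution"],
    "board_resolution": ["participant_notice"],
}


def propagate(root_event: str, triggers: dict = TRIGGERS) -> list:
    """Return (event, caused_by) pairs in the order they would be created."""
    created = []
    seen = set()
    queue = deque([(root_event, None)])
    while queue:
        event, cause = queue.popleft()
        if event in seen:
            continue  # guard against cycles in the trigger table
        seen.add(event)
        created.append((event, cause))
        for downstream in triggers.get(event, []):
            queue.append((downstream, event))
    return created


chain = propagate("valuation")
```

Recording the cause alongside each event keeps the lineage explicit, so a downstream document can always be traced back to the event that produced it.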
Entity lineage graph

Why referential integrity is the hard problem.

Every dashed edge must resolve to the same synthetic identity — or every cross-document retrieval task breaks.

[Figure: entity lineage graph. Nodes: Plan, Trust, Participant, Valuation Report (409p), Census File, Board Resolution, Participant Statement, Distribution Form, Trustee Memo, Amendment. Edge labels: triggers, held by, generates, amended by, enrolls, appears in, referenced in, subject of, enrolled in, scopes. Legend: document lineage, identity reference, enrollment, root entity, identity anchor.]
Entity lineage graph / ESOP synthetic environment / Dijon Data 2026
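
A referential integrity check over the graph can be as simple as confirming that every identity reference points at a known synthetic participant. The sketch below assumes a hypothetical `Reference` record and made-up identifiers; it is not the checker used in the environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """One identity reference from a document to a participant (hypothetical shape)."""
    document_id: str
    participant_id: str


def dangling_references(identities: set, references: list) -> list:
    """Return references whose participant is missing from the identity graph."""
    return [ref for ref in references if ref.participant_id not in identities]


identities = {"participant_001", "participant_002"}
references = [
    Reference("census_2025", "participant_001"),
    Reference("statement_17", "participant_002"),
    Reference("distribution_form_4", "participant_999"),  # no matching identity
]

broken = dangling_references(identities, references)
```

Any non-empty result marks a document that a cross-document retrieval task could not resolve, which is the failure mode the section describes.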

Over 115k eval tasks across six workflow domains, available to evaluate Kelso against any prompt, model, or product change.
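
Comparing a candidate change against a baseline with ground-truth scoring might look like the following. The task shape, the domain labels, and exact-match scoring are assumptions for this sketch; the actual harness and its scoring rules are not described in detail here.

```python
from collections import defaultdict


def score_by_domain(tasks: list, outputs: dict) -> dict:
    """Fraction of tasks per domain whose output equals the expected answer."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task in tasks:
        total[task["domain"]] += 1
        if outputs.get(task["id"]) == task["expected"]:
            correct[task["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


# Made-up tasks and agent outputs, for illustration only.
tasks = [
    {"id": "t1", "domain": "valuation_audit", "expected": "discrepancy"},
    {"id": "t2", "domain": "valuation_audit", "expected": "clean"},
    {"id": "t3", "domain": "census_validation", "expected": "clean"},
]
baseline_outputs = {"t1": "discrepancy", "t2": "discrepancy", "t3": "clean"}
candidate_outputs = {"t1": "discrepancy", "t2": "clean", "t3": "clean"}

baseline_scores = score_by_domain(tasks, baseline_outputs)
candidate_scores = score_by_domain(tasks, candidate_outputs)
```

Scoring per domain rather than overall makes it easier to see which workflow a prompt, model, or product change helped or hurt.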