Problem
The problem
Single-shot LLM calls are unreliable for multi-step tasks, and chain-of-thought prompting hides failure modes. OnboardAI treats onboarding as an autonomous workflow: an explicit planner decomposes the task, an executor takes actions, and a validator gates every result, with a decision loop that replans when things go wrong.
System Design
How it's built
Architecture
Planner Executor Validator Loop
01. User Intent: Natural language goal
02. Planner: Decomposes into steps
03. Executor: Tool use + actions
04. Validator: Checks outputs
05. Decision Loop: Retry / replan / accept
06. Final Output: Grounded, verified result
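The six stages above can be sketched as a single control loop. This is a minimal illustration, not OnboardAI's actual code: `run`, `RunResult`, the callable signatures, and the `max_replans` default are all assumptions made for the example, and the planner, executor, and validator are plain functions standing in for model-backed agents.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunResult:
    """Hypothetical result type: what the loop hands back to the caller."""
    accepted: bool
    outputs: list[str] = field(default_factory=list)
    replans: int = 0


def run(
    goal: str,
    plan: Callable[[str, str], list[str]],   # (goal, feedback) -> steps
    execute: Callable[[str], str],           # step -> output
    validate: Callable[[str, str], bool],    # (step, output) -> passed?
    max_replans: int = 2,                    # assumed cap, not from the source
) -> RunResult:
    feedback = ""
    for replans in range(max_replans + 1):
        steps = plan(goal, feedback)
        if not steps:
            # An empty plan is treated as a failed attempt, not a success.
            feedback = "planner returned no steps"
            continue
        outputs: list[str] = []
        for step in steps:
            output = execute(step)
            if not validate(step, output):
                # Feed the failure back to the planner and start a new plan.
                feedback = f"validation failed at step: {step}"
                break
            outputs.append(output)
        else:
            # Every step passed validation: accept the result.
            return RunResult(accepted=True, outputs=outputs, replans=replans)
    # Replan budget exhausted: return an explicit failure, never partial output.
    return RunResult(accepted=False, replans=max_replans)


# Toy stand-ins for the three roles, used only to exercise the loop.
def demo_plan(goal: str, feedback: str) -> list[str]:
    if feedback:
        return ["create account", "send email"]
    return ["create account", "bad step"]


demo_result = run(
    "onboard a new user",
    demo_plan,
    str.upper,
    lambda step, output: "BAD" not in output,
)
```

In this toy run the first plan fails validation at its second step, the planner receives the feedback, and the second plan is accepted after one replan.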
Tech Stack
What powers it
Python, Gemini API, Agent Systems, FastAPI, Async
Challenges
What was hard
- Designing a shared state schema every agent can read and write safely
- Preventing infinite loops in the replan cycle without giving up too early
- Handling tool errors as first-class signals instead of noisy exceptions
- Keeping latency reasonable when three agents are stacked per step
- Making the trace legible to a human reviewer
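Several of these challenges meet in the shared state object. The sketch below shows one possible shape, assuming a single `record` method as the only write path; `ToolResult`, `RunState`, and `call_tool` are hypothetical names for illustration, not the project's schema. Tool failures are captured as values that the validator and planner can inspect, and every write appends a one-line trace entry for a human reviewer.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class ToolResult:
    """Outcome of one tool call; errors are data, not raised exceptions."""
    ok: bool
    value: Any = None
    error: str = ""


@dataclass
class RunState:
    """Hypothetical shared state that all three roles read and write."""
    goal: str
    results: dict[str, ToolResult] = field(default_factory=dict)
    trace: list[str] = field(default_factory=list)

    def record(self, step: str, result: ToolResult) -> None:
        # Single write path: store the result and log a readable trace line.
        self.results[step] = result
        status = "ok" if result.ok else f"error: {result.error}"
        self.trace.append(f"{step} -> {status}")


def call_tool(tool: Callable[..., Any], *args: Any) -> ToolResult:
    """Run a tool and convert any exception into a failed ToolResult."""
    try:
        return ToolResult(ok=True, value=tool(*args))
    except Exception as exc:
        return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


# Toy usage: the built-in int() stands in for a real tool.
state = RunState(goal="demo")
state.record("parse_age", call_tool(int, "42"))
state.record("parse_name", call_tool(int, "abc"))
```

Freezing `ToolResult` means a recorded outcome cannot be altered later, which keeps the trace consistent with what each role actually saw.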
Design Decisions
Why it's built this way
Explicit roles over free-form swarms
Planner, Executor, and Validator have distinct contracts. Bounded roles are debuggable; free-form swarms are not.
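One way to make such contracts explicit in Python is structural typing with `typing.Protocol`. The method names and signatures below are assumptions chosen for illustration; the source does not specify the actual interfaces.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Planner(Protocol):
    def plan(self, goal: str) -> list[str]: ...


@runtime_checkable
class Executor(Protocol):
    def execute(self, step: str) -> str: ...


@runtime_checkable
class Validator(Protocol):
    def validate(self, step: str, output: str) -> bool: ...


class CommaPlanner:
    """Toy planner: splits a comma-separated goal into steps."""

    def plan(self, goal: str) -> list[str]:
        return [part.strip() for part in goal.split(",") if part.strip()]


planner = CommaPlanner()
# runtime_checkable only verifies that the method exists, not its signature.
is_planner = isinstance(planner, Planner)
is_executor = isinstance(planner, Executor)
steps = planner.plan("create account, send welcome email")
```

Because each role exposes a single method, a failure in the trace can be attributed to exactly one contract, which is the debuggability argument made above.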
Validator as a hard gate
Nothing is returned to the user until the validator signs off — trading a bit of latency for correctness.
Gemini for planning
Long context and structured output support made it a natural fit for the planner role.
Bounded decision loop
A hard cap on replans and an explicit failure state prevent runaway cost and infinite loops.
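A bounded decision policy of this kind might look like the following sketch. The `Decision` enum, the `decide` function, and both default limits are hypothetical; in particular, the separate retry budget is an assumption added here, since the source only mentions a cap on replans.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    RETRY = "retry"     # re-run the same step
    REPLAN = "replan"   # ask the planner for a new plan
    FAIL = "fail"       # explicit terminal failure state


def decide(
    valid: bool,
    retries_used: int,
    replans_used: int,
    max_retries: int = 1,   # assumed default
    max_replans: int = 2,   # assumed default
) -> Decision:
    """Pick the next action from the validator verdict and remaining budget."""
    if valid:
        return Decision.ACCEPT
    if retries_used < max_retries:
        return Decision.RETRY
    if replans_used < max_replans:
        return Decision.REPLAN
    # Both budgets are spent: stop instead of looping again.
    return Decision.FAIL
```

Since every non-accepting branch consumes a finite budget, the loop terminates as long as the caller increments the counters, and `FAIL` gives callers a state to handle rather than a timeout to guess at.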
Lessons Learned
What I'd tell my past self
- Reliability in agent systems comes from constraints, not from model scale.
- A validator step is worth more than a bigger planner model.
- Trace logs are the primary UX of agent systems — invest in them early.