5 Lessons Learned Building AI Agents — From Prototype to Production

Short reflection: what I took away from moving agents from “trial” to “real use” — from missing approval gates to “ship the loop, then optimize”.

Opening: The first time an agent touched production

The first time I put an agent with real API access (send email, create tasks) into an internal workflow, I felt both excited and a bit nervous. The agent worked fine in the sandbox. Then one day it sent a few emails on its own. The content was correct, but it made me ask: what if next time it misreads the request and sends the wrong content, or sends it to the wrong person? There was no “human reviews before send” step. That’s when I started thinking about stop points (approval gates) and about the cost of a wrong action versus the cost of review. Below are five lessons I’ve drawn; I hope they’re useful if you’re building or about to build agents.

1. Put the “stop” in the right place — not everything needs approval

At first I leaned “safe”: I wanted every side-effect action to go through a human. After a few weeks the gate itself had become the bottleneck. Many simple tasks (read calendar, summarize a report) don’t need a gate; only high-impact steps (send email externally, deploy, delete data) need a stop. Lesson: Classify actions by impact and reversibility; put approval gates only where the cost of a wrong action exceeds the cost of human review. Leave the rest autonomous to move faster.
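
The classification above can be sketched as a small policy table. This is a minimal illustration, not the setup I actually ran: the action names, the two-level Impact enum, and the rule "gate anything high-impact or irreversible" are assumptions chosen to mirror the examples in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Impact(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class ActionPolicy:
    impact: Impact
    reversible: bool


# Hypothetical action names and classifications, for illustration only.
POLICIES = {
    "read_calendar": ActionPolicy(Impact.LOW, reversible=True),
    "summarize_report": ActionPolicy(Impact.LOW, reversible=True),
    "send_external_email": ActionPolicy(Impact.HIGH, reversible=False),
    "deploy": ActionPolicy(Impact.HIGH, reversible=True),
    "delete_data": ActionPolicy(Impact.HIGH, reversible=False),
}


def needs_approval(action: str) -> bool:
    """Gate high-impact or irreversible actions; gate unknown actions by default."""
    policy = POLICIES.get(action)
    if policy is None:
        return True  # fail closed: an unclassified action always stops for review
    return policy.impact is Impact.HIGH or not policy.reversible
```

Failing closed on unknown actions is a design choice: a new tool added to the agent stays gated until someone classifies it.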

2. Cost of wrong action — put a number on it (even rough)

“Cost of wrong action” sounds abstract but needs to be concrete: if the agent deploys to the wrong environment, how long does it take to roll back? If it emails the wrong list, what’s the reputational damage? I assigned rough estimates (time, money, risk) per action type and compared them to the cost of someone clicking approve (tens of seconds per review, multiplied by how often the action runs). When the cost of a mistake is higher than the cost of review, the gate is worth it. Lesson: Making the trade-off explicit with numbers (even approximate) helps you decide where to gate instead of going by gut.
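
A rough version of that comparison fits in a few lines. The sketch below is an assumption about how to frame it: it compares costs per action, weighting the mistake cost by an estimated error rate (the text above does not specify one), and every number in the usage example is made up.

```python
def expected_mistake_cost(error_rate: float, cost_per_mistake: float) -> float:
    """Expected cost of letting one action run without review."""
    return error_rate * cost_per_mistake


def review_cost(seconds_per_review: float, reviewer_hourly_rate: float) -> float:
    """Cost of a human reviewing one action."""
    return seconds_per_review / 3600 * reviewer_hourly_rate


def gate_is_worth_it(
    error_rate: float,
    cost_per_mistake: float,
    seconds_per_review: float,
    reviewer_hourly_rate: float,
) -> bool:
    """True when the expected mistake cost exceeds the review cost."""
    return expected_mistake_cost(error_rate, cost_per_mistake) > review_cost(
        seconds_per_review, reviewer_hourly_rate
    )


# Illustrative numbers only: a 1% error rate, 30 seconds per review at 60/hour.
external_email = gate_is_worth_it(0.01, 500.0, 30, 60.0)  # costly mistake
read_calendar = gate_is_worth_it(0.01, 1.0, 30, 60.0)     # cheap mistake
```

Because both sides are per action, frequency cancels out of the decision; it matters again when you total up how much reviewer time a gate will consume per week.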

3. State and audit — who approved, when, must be stored

Once I had gates, I saw I needed to store state (pending, approved, rejected, timeout) and metadata (who approved, when). This is not just so the workflow runs correctly but also for audit. When something goes wrong and someone asks “Who approved that deploy at 3am?”, traceability matters. Lesson: Design a simple state machine and fields like reviewed_by, reviewed_at from the start; don’t wait until after production to add audit.
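
One way to sketch that state machine is shown below. The four states and the reviewed_by / reviewed_at fields come from the text; the class names, the transition table, and the rule that a human decision requires a reviewer are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class GateState(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    TIMEOUT = "timeout"


# Only PENDING can transition; the other three states are terminal.
ALLOWED = {
    GateState.PENDING: {GateState.APPROVED, GateState.REJECTED, GateState.TIMEOUT},
}


@dataclass
class ApprovalRequest:
    action: str
    state: GateState = GateState.PENDING
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[datetime] = None

    def transition(self, new_state: GateState, reviewer: Optional[str] = None) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        if new_state in (GateState.APPROVED, GateState.REJECTED) and not reviewer:
            raise ValueError("approve/reject requires a reviewer")
        self.state = new_state
        # For a timeout, reviewed_by stays None and reviewed_at records the expiry time.
        self.reviewed_by = reviewer
        self.reviewed_at = datetime.now(timezone.utc)
```

Usage: create ApprovalRequest("deploy"), then call transition(GateState.APPROVED, reviewer="alice"). A second transition on the same request raises ValueError, so a decision cannot be silently overwritten.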

4. Feedback isn’t just approve/reject — modify then approve is signal too

At first the gate had only Approve and Reject. Then I added Modify: the human can edit the draft (email, task content) and then approve. The system stores the diff between the original and the edited version. That’s feedback to improve prompts or logic later. Lesson: HITL isn’t only about blocking mistakes; it’s also a source of signal for the agent to learn from. Posts 7 and 8 in the HITL series go deeper on the feedback loop.
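
Capturing the “original vs edited” diff can be done with Python’s standard difflib module. This is a minimal sketch: the ModifyFeedback class and the file labels are hypothetical names, and how the diff is persisted is left out.

```python
import difflib
from dataclasses import dataclass


@dataclass
class ModifyFeedback:
    original: str  # the agent's draft
    edited: str    # what the human actually approved

    def diff(self) -> str:
        """Unified diff of the agent's draft against the human's edit."""
        lines = difflib.unified_diff(
            self.original.splitlines(),
            self.edited.splitlines(),
            fromfile="agent_draft",
            tofile="human_edit",
            lineterm="",
        )
        return "\n".join(lines)


feedback = ModifyFeedback(
    original="Hi team,\nThe launch is Monday.",
    edited="Hi team,\nThe launch is Tuesday.",
)
stored_diff = feedback.diff()  # text to save next to the approval record
```

An empty diff means the human approved the draft unchanged, which is itself a useful signal.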

5. Ship the loop first, optimize later

I used to try to build the “perfect” gate (confidence score, auto-approve by threshold, etc.) before putting it to use. The result was delay. Things improved when I switched to shipping first: a simple gate (pending → approve/reject) that was good enough to use, followed a few weeks later by timeout, escalation, and confidence bands. Lesson: “Ship the loop, then let the loop ship better decisions.” Put the human-in-the-loop into real use first, collect data and feedback, then optimize. Aiming for perfect from day one is usually worse than iterating fast.
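
For a sense of what the later “confidence bands” step might look like once the simple gate is running, here is a hedged sketch. The three-way routing, the auto-reject band, and both threshold values are assumptions; real thresholds should come from the approve/reject data the simple gate has already collected.

```python
def route(
    confidence: float,
    auto_approve_at: float = 0.95,    # placeholder threshold, not a recommendation
    auto_reject_below: float = 0.30,  # placeholder threshold, not a recommendation
) -> str:
    """Route an action by the agent's confidence score (assumed to be in [0, 1])."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= auto_approve_at:
        return "auto_approve"
    if confidence < auto_reject_below:
        return "auto_reject"
    return "human_review"  # the middle band still goes through the gate
```

Starting with auto_approve_at set above 1.0 sends everything except the auto-reject band to a human; setting auto_reject_below to 0.0 as well reproduces the plain pending → approve/reject gate. Either threshold can then be moved gradually as the data justifies it.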

Wrap-up

Those five lessons in short: (1) put the stop in the right place, (2) weigh the cost of a wrong action against the cost of review, (3) build in state and audit from the start, (4) collect feedback via Modify, not just Approve/Reject, (5) ship the loop first, optimize later. If you’re building agents that take real actions (send email, deploy, write to DB), I hope this reflection gives you a few ideas to avoid pitfalls and move faster.

For more technical detail on HITL: HITL in Agentic Systems (series intro).