We Stopped Letting Background Work Lie
This week we fixed a class of AGX bugs that all felt the same: the system sounding certain about work it could not actually prove.
This week had a weirdly consistent theme across AGX: background work was often doing the right thing, while the product was saying the wrong thing about it.
That is a dangerous category of bug.
If a scheduled task fails, that is straightforward. You debug it. If a runtime crashes, you inspect the logs. But if the system tells you a task is overdue when it is actually healthy, or lets you delete a running job as if it were just an idle row in a table, or exposes a daemon control surface that does not actually control the daemon, the bug is no longer execution. The bug is false confidence.
We spent the last few days fixing exactly that.
The first lie was small, but it was still a lie
One of the best bug reports this week came from a scheduled task that had just run successfully.
The operator clicked Run now, watched the run finish cleanly, reopened the task, and saw this:
OVERDUE
Next run in 8h
Last Apr 18, 12:55 AM
That is not an annoying label problem. That is the UI telling two opposite stories at once.
OVERDUE means “this schedule missed work and probably needs intervention.” Next run in 8h means “nothing is wrong right now.” You should not have to mentally diff those statements to figure out whether the scheduler is healthy.
The root cause was simple and very easy to sympathize with. The old badge logic was collapsing two meanings together:
- the schedule is currently overdue
- the last run happened later than a previous scheduled slot
Those are not the same thing. A manual catch-up run that happens after the previous slot is not evidence that the system is unhealthy right now.
So the fix was not to invent smarter copy. The fix was to narrow the contract.
Instead of asking “was anything late recently?”, the product now asks a much tighter question:
return isScheduledRunOverdue({
  state: job.state,
  nextScheduledAt: job.nextRunAt,
  lastCompletedAt: job.lastRunAt,
});
That is a better shape because it is present tense. Either the next scheduled run should already have happened and has not, or the task is not overdue. Historical lateness might still matter later, but it does not get to borrow the OVERDUE badge.
That sounds like a tiny refinement. It is actually a trust repair.
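To make the present-tense contract concrete, here is a minimal TypeScript sketch of what a check like this could look like. The field names mirror the call above; the state values, the injectable `now`, and the grace window are assumptions for illustration, not the actual AGX implementation.

```typescript
// Hypothetical state values; the real job states may differ.
type JobState = "idle" | "queued" | "running" | "paused";

interface OverdueInput {
  state: JobState;
  nextScheduledAt: Date | null;
  lastCompletedAt: Date | null;
}

function isScheduledRunOverdue(
  input: OverdueInput,
  now: Date = new Date(), // injectable so the check is easy to test
  graceMs: number = 60000, // assumed tolerance for normal poller delay
): boolean {
  // A paused schedule, or one with no upcoming slot, cannot be overdue.
  if (input.state === "paused" || input.nextScheduledAt === null) return false;
  // Work that is already queued or running is in progress, not missed.
  if (input.state === "queued" || input.state === "running") return false;

  const slot = input.nextScheduledAt.getTime();
  // Present tense only: the next slot has passed (plus grace)...
  const slotPassed = now.getTime() > slot + graceMs;
  // ...and no run has completed at or after that slot.
  const coveredByRun =
    input.lastCompletedAt !== null && input.lastCompletedAt.getTime() >= slot;
  return slotPassed && !coveredByRun;
}
```

Under these assumptions, the task from the bug report (a run that just finished, next slot eight hours out) is not overdue, because the only question asked is whether the next slot has been missed.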
The second lie was worse
Then there was the delete flow.
A running scheduled task could be deleted with the same generic confirmation AGX showed for an idle one. Click Delete, confirm, and the task would disappear from the board immediately. The detail panel would drop back to an empty state. The project overview would stop counting it as currently running.
That is not just bad UX. That is a product pretending it knows how to safely erase live work when it really does not.
The important fix here was not visual. It was contractual.
The backend now rejects that delete path with an explicit conflict:
throw new PromptJobDeleteError(
  "Cannot delete a scheduled task while a run is queued or running. Cancel the active run first.",
  409,
);
That one status code matters a lot. 200 meant “yes, this operation makes sense.” 409 means “your request conflicts with live state, and the product is refusing to fake its way through it.”
The UI changed to match that boundary. Delete is disabled while work is queued or running. The detail pane keeps its monitoring surface. The system tells you what to do next instead of quietly erasing the only place you had to watch the run.
This is one of those fixes that sounds more dramatic in plain English than in code:
- no, you cannot delete this right now
- yes, that is intentional
- cancel the live run first
That is a much better sentence than a smooth but dishonest delete flow.
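For illustration, the guard and the matching UI rule could be sketched roughly like this. Only `PromptJobDeleteError`, its message, and the 409 come from the change described above; the job shape, the `activeRunState` field, and the helper names are hypothetical.

```typescript
// Error carrying an HTTP status, in the shape implied by the throw above.
class PromptJobDeleteError extends Error {
  readonly status: number;
  constructor(message: string, status: number) {
    super(message);
    this.name = "PromptJobDeleteError";
    this.status = status;
  }
}

// Hypothetical job shape: null means no run is currently active.
type ActiveRunState = "queued" | "running" | null;

interface PromptJob {
  id: string;
  activeRunState: ActiveRunState;
}

// Shared predicate so the backend guard and the UI agree on one rule.
function hasLiveRun(job: PromptJob): boolean {
  return job.activeRunState === "queued" || job.activeRunState === "running";
}

// Backend side: refuse the delete with a conflict instead of erasing live work.
function assertDeletable(job: PromptJob): void {
  if (hasLiveRun(job)) {
    throw new PromptJobDeleteError(
      "Cannot delete a scheduled task while a run is queued or running. Cancel the active run first.",
      409,
    );
  }
}

// UI side: the Delete button is enabled only when the guard would pass.
function canDelete(job: PromptJob): boolean {
  return !hasLiveRun(job);
}
```

Deriving both the server check and the disabled button from one predicate is a design choice in this sketch: it keeps the UI from offering an action the backend will reject.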
The cloud side had the same disease
At first those looked like Local UI bugs. Then the hosted runtime work started landing and it became obvious this was broader than one screen.
In agx-cloud, /api/daemon had drifted into a familiar trap: it was reporting a lifecycle that sounded crisp, but the route was not actually the owner of the thing it was describing. It had process-local state. It had worker counts that did not really correspond to the embedded hosted runtime. It had a POST surface that implied control it did not own.
Again, the problem was not that the system was dead. The problem was that the status plane had become aspirational.
So this week we cut it back to something smaller and more truthful.
The route now reports shared embedded runtime status from one real source of truth, in a shape closer to this:
{
  "owner": "embedded-app",
  "status": "ready",
  "services": {
    "queueConsumers": true,
    "schedulePoller": true,
    "promptJobPoller": true
  }
}
And if you try to POST to that route as though it can start and stop the hosted runtime, it now fails closed instead of roleplaying process control:
Embedded hosted runtime starts during server bootstrap.
/api/daemon is status-only and does not control workers.
That is a healthier product posture.
One of the easiest ways to make operators distrust a system is to give them controls that feel real but are only ceremonially connected to reality. This change does the opposite. It makes the surface narrower, which makes it more believable.
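A status-only surface like that can be sketched as a handler that reads from a single published source and refuses everything else. This is a framework-free illustration, not the real route: the `HttpResult` type, the `readStatus` callback, the extra status values, and the choice of 405 with an `Allow` header are all assumptions.

```typescript
// Status shape modeled on the JSON above; values beyond "ready" are assumed.
interface RuntimeStatus {
  owner: "embedded-app";
  status: "starting" | "ready" | "stopped";
  services: {
    queueConsumers: boolean;
    schedulePoller: boolean;
    promptJobPoller: boolean;
  };
}

// Minimal stand-in for a framework response object.
interface HttpResult {
  status: number;
  headers: Record<string, string>;
  body: RuntimeStatus | { error: string };
}

function handleDaemonRequest(
  method: string,
  readStatus: () => RuntimeStatus, // the one source of truth, published by the runtime
): HttpResult {
  if (method === "GET") {
    // Report only what the embedded runtime itself published.
    return { status: 200, headers: {}, body: readStatus() };
  }
  // Fail closed: this route never starts or stops workers.
  return {
    status: 405,
    headers: { Allow: "GET" },
    body: {
      error:
        "Embedded hosted runtime starts during server bootstrap. " +
        "/api/daemon is status-only and does not control workers.",
    },
  };
}
```

The handler holds no state of its own, so it has nothing to drift away from: whatever it reports is whatever `readStatus` returned.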
One owner is better than two optimistic ones
The same cleanup happened one layer deeper in the schedule runtime.
agx-cloud had ended up with two effective owners for graph schedule polling:
- the shared ensureScheduleRuntime() path
- an older embedded loop inside instrumentation.ts
That kind of split is how background systems get haunted. Two loops can both look reasonable in isolation. Together they produce ambiguous cadence, ambiguous ownership, and eventually ambiguous blame.
So that got simplified too. The embedded bootstrap stopped freelancing its own graph polling interval and delegated to the shared runtime helper instead. One owner. One cadence contract. One place to reason about overlap protection and tick progression.
That is not the kind of change anyone notices in a screenshot. It is the kind of change you notice three weeks later when a scheduling bug is actually debuggable.
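As a rough sketch of what "one owner" can mean in code: an idempotent helper that starts at most one polling loop and skips a tick while the previous one is still in flight. Only the name `ensureScheduleRuntime` comes from the codebase; its signature, the `tick` callback, the default interval, and the returned handle are assumptions made for this example.

```typescript
// Hypothetical handle returned to whoever asks for the runtime.
interface ScheduleRuntime {
  tickOnce(): Promise<boolean>; // resolves false when skipped due to overlap
  stop(): void;
}

// Module-level singleton: the single owner of graph schedule polling.
let runtime: ScheduleRuntime | null = null;

function ensureScheduleRuntime(
  tick: () => Promise<void>,
  intervalMs: number = 30000, // assumed cadence
): ScheduleRuntime {
  // A second caller (for example the bootstrap code) gets the existing owner
  // instead of starting a competing loop.
  if (runtime !== null) return runtime;

  let inFlight = false;
  const tickOnce = async (): Promise<boolean> => {
    if (inFlight) return false; // overlap protection
    inFlight = true;
    try {
      await tick();
      return true;
    } finally {
      inFlight = false;
    }
  };

  const timer = setInterval(() => {
    tickOnce().catch((err) => console.error("schedule tick failed", err));
  }, intervalMs);

  runtime = {
    tickOnce,
    stop: () => {
      clearInterval(timer);
      runtime = null;
    },
  };
  return runtime;
}
```

With this shape, the embedded bootstrap only calls the helper. It never owns an interval of its own, so cadence and overlap rules live in exactly one place.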
The pattern this week was not “more automation”
I think that is the interesting part.
None of these fixes were about making AGX more aggressive, more automatic, or more clever. They were about making it less willing to imply certainty it had not earned.
That showed up in a few different forms:
- overdue now means “missed right now,” not “something about the past looks suspicious”
- delete now means “allowed only when no run is queued or running”
- daemon status now means “the embedded runtime published this,” not “this route invented a plausible story”
- graph scheduling now has one runtime owner instead of two partially overlapping ones
That is all the same design decision.
If background work is going to matter, its status surfaces need to be held to a higher bar than ordinary UI polish. They need to be boringly literal. They should say less and mean it more.
What this unlocks
The practical outcome is that AGX is becoming easier to trust when it is running without supervision.
That matters because the product is increasingly full of long-lived work:
- scheduled prompts
- graph ticks
- hosted pollers
- runtime services that keep working after you close the tab
Once the product starts doing real background work, honesty becomes infrastructure. Operators are making decisions from these surfaces. They are deciding whether to intervene, whether to wait, whether to retry, whether to trust the scheduler, whether to leave a task running overnight.
If the product sounds confident for the wrong reasons, it creates exactly the kind of system I do not want: one that looks composed until the moment you need to rely on it.
This week’s fixes moved AGX in the other direction.
Not by teaching it to say more.
By teaching it to stop bluffing.