We’ve been thinking a lot about autonomy lately. Not in the sci-fi sense, but in a practical one: how does a system notice when something’s wrong without anyone watching? This week, Yūhi took another step toward genuine self-maintenance with two simple but powerful mechanisms: the 4-hour checkpoint cadence and the 12-hour stale task timeout.
The Problem We Kept Hitting
For weeks, tasks would sit in “pending” states for far too long. A job would stall, and we’d only notice when someone happened to check. This wasn’t a crisis—it was a quiet inefficiency that added up. Tasks that should have taken hours stretched into days, not because anyone was ignoring them, but because there was no systematic way to see them.
The system was good at executing. It was less good at watching itself execute.
Four Hours, Every Four Hours
The solution we landed on was deceptively simple: a checkpoint that runs every 4 hours, around the clock. At each checkpoint, the system asks itself a few basic questions:
- What’s happening right now?
- What’s the top risk?
- What’s the next action?
This isn’t complex AI reasoning—it’s structured self-assessment. The checkpoint writes its findings to a daily log, creating a timestamped record of system state. If something goes wrong, we can trace back through the checkpoints to understand when and why.
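To make this concrete, here’s a minimal sketch of what such a checkpoint might look like. The function signature, field names, and log path are our illustration, not Yūhi’s actual code; in the real system, the three answers would come from the system’s own introspection rather than being passed in.

```python
# checkpoint.py -- a minimal sketch of the checkpoint idea, not Yūhi's
# actual implementation. Field names and the log path are assumptions.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_DIR = Path("logs")  # hypothetical location for the daily logs

def run_checkpoint(happening_now: str, top_risk: str, next_action: str) -> Path:
    """Record one structured self-assessment in today's log file.

    The three arguments answer the three checkpoint questions; here they
    are passed in directly to keep the sketch self-contained.
    """
    now = datetime.now(timezone.utc)
    entry = {
        "timestamp": now.isoformat(),
        "happening_now": happening_now,
        "top_risk": top_risk,
        "next_action": next_action,
    }
    LOG_DIR.mkdir(exist_ok=True)
    log_file = LOG_DIR / f"{now:%Y-%m-%d}.jsonl"  # one file per day
    with log_file.open("a") as f:
        f.write(json.dumps(entry) + "\n")  # append-only, timestamped
    return log_file
```

One file per day, one JSON line per checkpoint: that append-only structure is what makes it easy to trace back through system state after the fact.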
The results have been encouraging. For the past week, checkpoints have fired on schedule at 00:00, 04:00, 08:00, 12:00, 16:00, and 20:00 UTC. That’s 42 consecutive checkpoints with zero misses. More importantly, the pattern is now visible: we can see the system’s rhythm in the logs.
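That cadence is exactly what a standard cron schedule produces. The entry below is a sketch, and the script path is a placeholder rather than our actual deployment:

```
# every 4 hours, on the hour (assumes the server clock is set to UTC)
0 */4 * * * /usr/bin/python3 /opt/yuhi/checkpoint.py
```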
The 12-Hour Stale Task Timeout
But knowing something’s wrong is only half the battle. The second mechanism addresses what happens when a task sits untouched for too long: the 12-hour stale task timeout.
When a task reaches a terminal state (completed, failed, or cancelled) but lingers in the active “Now” queue for more than 12 hours, the system automatically closes it out. This sounds small, but it prevents a common pathology we call “queue rot”: the gradual accumulation of stale tasks that clutter the system and confuse future decision-making.
In practice, this means the “Now” queue clears itself regularly. Tasks that are done stay done, and tasks that are blocked get flagged quickly rather than lingering in limbo.
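Here’s roughly the shape of that sweep, sketched under the assumption that queue items carry a state and a last-touched timestamp. That’s our illustration, not Yūhi’s actual data model:

```python
# stale_sweep.py -- a sketch of the 12-hour stale task timeout. Assumes
# each task is a dict with a "state" and a timezone-aware "last_updated"
# datetime; this is illustrative, not Yūhi's actual schema.
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = {"completed", "failed", "cancelled"}
STALE_AFTER = timedelta(hours=12)

def sweep_now_queue(queue: list[dict]) -> list[dict]:
    """Return the "Now" queue with stale terminal tasks cleared out."""
    cutoff = datetime.now(timezone.utc) - STALE_AFTER
    kept = []
    for task in queue:
        is_terminal = task["state"] in TERMINAL_STATES
        is_stale = task["last_updated"] < cutoff
        if is_terminal and is_stale:
            continue  # auto-close: done for 12+ hours, drop from the queue
        kept.append(task)
    return kept
```

Run on the same cadence as the checkpoints, a sweep like this is what keeps queue rot from accumulating between human glances at the system.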
What We’re Learning
These two mechanisms work together: checkpoints find problems, timeouts prevent accumulation. But there’s a deeper lesson here about autonomous systems that applies beyond Yūhi.
Autonomy isn’t about building systems that never need help. It’s about building systems that notice when they need help—and do something about it. The checkpoint cadence doesn’t fix problems on its own. But it makes problems visible quickly enough that they can be addressed before they compound.
We’ve still got work to do. The system still occasionally reports scheduled jobs as “missing” without verifying whether those jobs were ever configured. That’s the next improvement: adding a pre-flight check that confirms cron jobs exist before flagging them as absent.
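The check we have in mind would look something like the sketch below. Reading the user’s crontab via `crontab -l` and matching a marker string are assumptions about how we might implement it, and the “yuhi-checkpoint” marker is hypothetical:

```python
# preflight.py -- sketch of the planned pre-flight check: confirm a cron
# job was ever configured before reporting it as missing. The marker
# string is a hypothetical identifier embedded in the crontab line.
import subprocess

def cron_job_exists(marker: str) -> bool:
    """Return True if any active line of the user's crontab contains marker."""
    result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        return False  # no crontab installed for this user
    return any(
        marker in line
        for line in result.stdout.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )

# Only report a job as "missing" if it was actually configured:
if not cron_job_exists("yuhi-checkpoint"):
    print("checkpoint job was never configured; nothing to flag as missing")
```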
But for now, we’re in a good spot. Yūhi is keeping its own beat, and the rhythm is steadier than it’s ever been.