Project Retrospectives
When index cleanup lies, agents make unsafe decisions
A practical note on debugging Satori's clear/reindex path by treating remote vector state as an explicit lifecycle contract.
One Satori bug looked simple from the outside: clear an index, then create it again. The remote vector store was reachable, the credentials were valid, and the codebase path was correct. Still, the workflow could end in a confusing state.
The mistake would have been treating it as a connectivity issue. The real problem was lifecycle semantics.
The failure shape
Satori keeps local snapshot state and remote Milvus/Zilliz collection state. A clear operation has to affect both. If dropCollection() returns cleanly, the path is obvious. The hard case is when the remote operation times out.
A timeout does not mean “collection still exists.” It also does not mean “collection is gone.” It means the local process does not know.
That uncertainty matters. If Satori clears local metadata after an indeterminate remote delete, it can report success while the cloud collection may still exist. If a later create/reconcile path sees stale remote state, the user gets behavior that feels random.
The dangerous version is this:
local snapshot removed
remote delete timed out
tool reports clear success
next create sees unknown remote collection state
That is not cleanup. That is hiding uncertainty behind a neat response.
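One way to keep that uncertainty visible is to give "unknown" its own value instead of folding it into a boolean. The sketch below is illustrative only: the type and function names (RemoteState, ProbeResult, classifyProbe, canReportClearSuccess) are assumptions made for this example and are not taken from Satori's code.

```typescript
// Hypothetical types for illustration; not Satori's actual definitions.
type RemoteState = "present" | "absent" | "unknown";

type ProbeResult =
  | { kind: "answered"; exists: boolean } // the backend replied in time
  | { kind: "timeout" };                  // no reply before the deadline

// A timeout is neither "still exists" nor "gone", so it maps to "unknown".
function classifyProbe(result: ProbeResult): RemoteState {
  if (result.kind === "timeout") {
    return "unknown";
  }
  return result.exists ? "present" : "absent";
}

// Success may only be reported when absence has been proven.
function canReportClearSuccess(state: RemoteState): boolean {
  return state === "absent";
}
```

With three values, the dangerous sequence above has nowhere to hide: a timed-out delete yields "unknown", and "unknown" never passes the success check.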
How I narrowed it down
I treated the bug as a state-machine problem:
- What local state exists before clear?
- What remote state is proven after dropCollection()?
- What should happen when the follow-up probe also times out?
- Which operation is allowed to destroy local metadata?
- What should the agent tell the user before retrying?
That framing made the fix smaller. The goal was not to retry forever or hide the backend timeout. The goal was to avoid lying about success.
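Those questions can be written down as a small transition table. This is a minimal sketch under assumed names: the states, events, and the step function are hypothetical and only illustrate the framing. The point it encodes is that removing local metadata is legal from exactly one state, the one where remote absence has been proven.

```typescript
// Hypothetical lifecycle states and events, named for illustration only.
type ClearState = "indexed" | "delete_pending" | "remote_absent" | "cleared";
type ClearEvent =
  | "drop_sent"
  | "probe_absent"
  | "probe_present"
  | "probe_timeout"
  | "local_removed";

// Each state lists the only events it accepts and the state they lead to.
const transitions: Record<ClearState, Partial<Record<ClearEvent, ClearState>>> = {
  indexed: { drop_sent: "delete_pending" },
  delete_pending: {
    probe_absent: "remote_absent",   // absence proven
    probe_present: "indexed",        // collection still exists; a retry starts over
    probe_timeout: "delete_pending", // still unknown; nothing local may change
  },
  remote_absent: { local_removed: "cleared" },
  cleared: {},
};

// Applies one event and rejects anything the table does not allow.
function step(state: ClearState, event: ClearEvent): ClearState {
  const next = transitions[state][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: "${event}" while in state "${state}"`);
  }
  return next;
}
```

In this sketch, calling step("delete_pending", "local_removed") throws, which is the table's way of saying that a pending delete is not permission to destroy local state.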
The fix
The rule became: do not report clear success unless the cloud collection is verified gone.
If delete times out but a follow-up probe proves absence, local cleanup can proceed. If delete and probe remain indeterminate, Satori preserves local state and returns a retryable backend timeout. Force reindex cleanup follows the same principle: do not remove local metadata before remote cleanup is verified.
That changed the behavior from “best effort cleanup” to an explicit lifecycle transition.
The implementation became a verification loop rather than a blind cleanup:
if (!(await vectorDatabase.hasCollection(collectionName))) {
  return { collectionName, attempts: 0, verifiedAbsent: true };
}
for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
  await vectorDatabase.dropCollection(collectionName);
  const stillExists = await vectorDatabase.hasCollection(collectionName);
  if (!stillExists) {
    return { collectionName, attempts: attempt, verifiedAbsent: true };
  }
  await sleep(initialBackoffMs * Math.pow(backoffMultiplier, attempt - 1));
}
throw new RemoteCollectionDeletePendingError(collectionName, maxAttempts);
The important behavior is not the retry count. It is the refusal to convert “delete requested” into “delete proven.”
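A caller of that loop still has to decide what happens to local metadata when verification fails. The following is a sketch of one possible ordering, not the actual Satori code: ClearDeps, verifyRemoteAbsent, removeLocalSnapshot, ClearOutcome, and clearIndex are hypothetical names. As a simplification, it treats any rejection from verification as "remote state unproven".

```typescript
// Hypothetical dependencies; names are illustrative, not Satori's actual API.
interface ClearDeps {
  // Resolves only once the remote collection is proven absent; rejects
  // otherwise (for example with a delete-pending or timeout error).
  verifyRemoteAbsent(collectionName: string): Promise<void>;
  removeLocalSnapshot(collectionName: string): void;
}

type ClearOutcome =
  | { status: "cleared"; collectionName: string }
  | { status: "pending"; collectionName: string; cause: string };

async function clearIndex(deps: ClearDeps, collectionName: string): Promise<ClearOutcome> {
  try {
    await deps.verifyRemoteAbsent(collectionName);
  } catch (error) {
    // Remote absence is unproven, so local metadata is left untouched.
    const cause = error instanceof Error ? error.message : String(error);
    return { status: "pending", collectionName, cause };
  }
  // Only reached after remote absence was proven.
  deps.removeLocalSnapshot(collectionName);
  return { status: "cleared", collectionName };
}
```

The ordering is the whole design: local removal sits after the awaited verification, so there is no code path on which a timeout can reach it.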
The tradeoff
This makes the tool more annoying in the short term. A user may see a retryable backend timeout instead of a clean success message. But that annoyance is correct. The system should not destroy the only known-good local state when the remote state is unproven.
For agent workflows, that honesty matters even more. If the tool claims success, the agent will build on that claim.
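Because the agent acts on whatever the tool returns, the response itself should make it hard to mistake a pending delete for a finished one. Here is a sketch of what such a response could look like. The field names and message wording are assumptions for illustration and do not describe Satori's actual tool output.

```typescript
// Hypothetical response shape; field names are illustrative only.
type ClearToolResponse =
  | { ok: true; collectionName: string }
  | {
      ok: false;
      collectionName: string;
      reason: "backend_timeout";
      retryable: boolean;
      localStatePreserved: boolean;
      nextAction: string;
    };

// Renders the response as text an agent can relay without guessing.
function toAgentMessage(response: ClearToolResponse): string {
  if (response.ok) {
    return `Cleared "${response.collectionName}": remote collection verified absent.`;
  }
  const parts = [
    `Clear did not complete for "${response.collectionName}" (${response.reason}).`,
    "Remote collection state is unknown.",
    response.localStatePreserved
      ? "Local snapshot was preserved."
      : "Local snapshot was NOT preserved.",
    response.retryable ? "This is safe to retry." : "Do not retry automatically.",
    `Next safe action: ${response.nextAction}`,
  ];
  return parts.join(" ");
}
```

The failure branch never uses the word "cleared", so an agent that skims the message has nothing to build a false success on.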
What I learned
Distributed state bugs punish optimism. A neat local response is not enough when remote state is still unknown.
The better engineering move was to keep the state machine honest: preserve known-good local state, classify uncertainty clearly, and make the next safe action obvious.