Token accounting and cancellation in durable agent workflows

Closing an LLM stream is not the same as stopping the work behind it. Durable agents need cancellation paths that settle application state and credit accounting before the UI declares victory.

Meir Zana · Founder

July 2, 2026 · 9 min read

Stopping a local coding agent and stopping a remote agent look like the same product gesture. They are not the same engineering problem.

In a local harness, like Codex or Claude Code, Stop can mean killing the agent loop that is currently driving the model. In a hosted agent setting, the browser stream is only a client connection receiving updates. Killing that stream does not necessarily stop the workflow, the model call, the database writes, or the credit meter behind it.

Once a remote workflow can both modify application state and consume billable tokens, Stop becomes a backend protocol. The system has to stop future work, preserve committed writes, update the UI to match persisted state, and finalize token usage into credits. Getting it right is a task that spans the frontend, the API server, the durable workflow engine (Temporal), and the agent worker running the workflow and its activities.

This post reviews how our implementation moved from treating the SSE connection as the main control surface to an orderly cancellation process for the workflow behind it.

Disconnect is not cancel

If the agent loop runs inside the API server request context, cancellation is easier to handle. When the browser closes the connection, the server can observe the request abort, propagate an AbortSignal into the model call, and run whatever cleanup the handler owns.

That pattern stops being enough when the work is meant to outlive the request, or when it is carried out in a separate worker.

Long-running agent workflows usually run in a durable workflow engine, such as Temporal, because the client connection is not a reliable lifetime boundary. The workflow can retry steps after transient failures, resume after a worker restart, and continue after a user refreshes the browser. That is what we needed for long research operations.

It also means a browser disconnect is not a cancellation signal.

In Agent Bayes, the research agent runs as a Temporal workflow on a separate worker pool. The browser receives progress over Server-Sent Events, but that stream is only an observer of the workflow. Refreshing the page or navigating away aborts the local SSE request, but the workflow keeps running. The client can later load persisted messages and reconnect to the running conversation stream.

This also lets users run agents across several mindmaps without keeping several browser streams alive until each one finishes. The workflow is the durable unit of work. The SSE connection is only the live display for whichever operation the user is watching right now.

That separation matters in ordinary failure cases too. A flaky Wi-Fi connection should not kill three minutes of paid work, because losing Wi-Fi should only cost patience, not credits.

So we have two actions that look similar in the browser and mean different things on the backend:

  • Disconnect aborts the local SSE request and leaves the Temporal workflow alone.
  • Cancel calls the cancel endpoint for the current session, asks Temporal to cancel that workflow, and reflects any state changes that occurred before cancellation completes.

The rest of the implementation follows from that split: request cancellation through the API, let Temporal deliver it to the workflow execution, let a running activity observe it through heartbeats when needed, then finalize state and accounting from the workflow.

Cancellation has to settle state and accounting

Once an agent can change application state, cancellation has to be graceful for two reasons.

First, the frontend state has to reflect what actually happened. If the user clicks Stop while the agent is writing nodes into a mindmap, any edits committed before cancellation completes still need to appear in the UI. Otherwise, the visible map diverges from the persisted one. The cancellation path also has to converge on a durable answer status, in our case CANCELLED, so replay, refresh, and audit history all agree.

Second, and perhaps more importantly, usage accounting has to finalize. If a workflow stops before accounting closes, provider usage and the product credit ledger can fall out of sync. The provider may still charge for tokens already generated, while the product never records the corresponding credit spend. Over time, that becomes a serious leak.

Most products that sell credits have an exchange rate between provider usage tokens and product credits. In our case, we want the user's credit history to show one meaningful row for the workflow, such as mindmap, not ten rows that expose internal multi-agent steps, tool calls, and retries. Hopefully nobody opens a credit ledger hoping to reverse engineer your orchestration graph.

During a workflow, each model response records actual token usage under a shared operation_id. The credit ledger reserves capacity at the start, then finalizes once at the end by aggregating the operation totals. The user sees one spend entry for the workflow, while the system still keeps per-step usage for internal accounting.
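The reserve, record, finalize cycle can be sketched with an in-memory ledger. This is an illustration rather than our production code: the `CreditLedger` class, its method names, and the exchange rate of 1,000 tokens per credit are all assumptions made up for the example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass, field

TOKENS_PER_CREDIT = 1_000  # hypothetical exchange rate, chosen for the example


@dataclass
class CreditLedger:
    """In-memory stand-in for the per-step usage log and the user-facing ledger."""

    balance: int
    reserved: dict[str, int] = field(default_factory=dict)
    usage_events: dict[str, list[int]] = field(
        default_factory=lambda: defaultdict(list)
    )
    spend_rows: list[tuple[str, str, int]] = field(default_factory=list)

    def reserve(self, operation_id: str, credits: int) -> None:
        """Set credits aside at the start of the workflow."""
        if credits > self.balance:
            raise ValueError("insufficient credits")
        self.balance -= credits
        self.reserved[operation_id] = credits

    def record_usage(self, operation_id: str, tokens: int) -> None:
        """Called once per model response with the actual token count."""
        self.usage_events[operation_id].append(tokens)

    def finalize(self, operation_id: str, label: str) -> int:
        """Charge the aggregated usage once and release the unused reserve."""
        total_tokens = sum(self.usage_events[operation_id])
        charged = math.ceil(total_tokens / TOKENS_PER_CREDIT)
        reserve = self.reserved.pop(operation_id)
        self.balance += reserve - charged  # hand back whatever was not used
        self.spend_rows.append((label, operation_id, charged))  # one visible row
        return charged


ledger = CreditLedger(balance=100)
ledger.reserve("op-1", credits=20)
for tokens in (1_200, 800, 2_500):  # three internal steps, one operation
    ledger.record_usage("op-1", tokens)
charged = ledger.finalize("op-1", label="mindmap")
```

The three steps add up to 4,500 tokens, so the user sees a single `mindmap` row for 5 credits, and the other 15 reserved credits return to the balance. The per-step events stay available for internal accounting.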

Cancellation is the case where those two ledgers are most likely to diverge. The workflow has to stop future work, preserve committed state, aggregate whatever usage has landed, charge only the actual credits consumed, and release the unused reserve.

Temporal cancellation is a request, not a terminal state

In our API, the cancel endpoint finds the running workflow for the mindmap session and calls handle.cancel() on its workflow handle. That call only requests cancellation from Temporal, because cancellation is cooperative: the workflow has to observe the request and stop itself. Temporal does not interrupt arbitrary code running inside a workflow or activity.
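A minimal sketch of such an endpoint body follows. It is typed against small protocols so it runs without a Temporal server. In a real deployment `client` would be a `temporalio.client.Client`, whose `get_workflow_handle(workflow_id)` returns a handle with an async `cancel()`. The `workflow_id_for` naming scheme, the `cancel_session` function, and the response shape are assumptions for illustration, not our actual API.

```python
import asyncio
from typing import Protocol


class WorkflowHandle(Protocol):
    async def cancel(self) -> None: ...


class WorkflowClient(Protocol):
    def get_workflow_handle(self, workflow_id: str) -> WorkflowHandle: ...


def workflow_id_for(session_id: str) -> str:
    # Hypothetical naming scheme; the real lookup may query a database instead.
    return f"mindmap-agent-{session_id}"


async def cancel_session(client: WorkflowClient, session_id: str) -> dict[str, str]:
    """Request cancellation for the workflow behind a session."""
    handle = client.get_workflow_handle(workflow_id_for(session_id))
    await handle.cancel()
    # The workflow has only been asked to stop, so report a request rather
    # than a terminal state. The terminal status is written by finalization.
    return {"session_id": session_id, "status": "cancel_requested"}
```

Returning `cancel_requested` rather than `cancelled` keeps the API honest: the durable CANCELLED status only exists once the workflow has observed the request and finalized.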

The workflow code handles cancellation in two places. If cancellation is observed while the workflow is not inside an activity, the workflow catches asyncio.CancelledError, marks the run as cancelled, and re-raises. If cancellation is observed while the workflow is waiting on an activity, the workflow catches an ActivityError whose cause is Temporal's CancelledError, then marks the run as cancelled.

Both cases rely on the same finally block. The workflow always calls finalize_workflow. That finalization activity records the token summary, writes the cancelled answer state, and finalizes credits in an idempotent way.
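The shape of that control flow can be shown with plain asyncio, without the Temporal SDK. In this sketch `ActivityCancelled` is a local stand-in for an ActivityError whose cause is a cancellation, `run_roles` and `finalize` are hypothetical callables, and the COMPLETED and FAILED status names are assumptions (only CANCELLED is our real status name).

```python
import asyncio
from typing import Awaitable, Callable


class ActivityCancelled(Exception):
    """Stand-in for an ActivityError caused by a cancelled activity."""


async def run_workflow(
    run_roles: Callable[[], Awaitable[None]],
    finalize: Callable[[str], Awaitable[None]],
) -> str:
    status = "COMPLETED"
    try:
        await run_roles()
    except asyncio.CancelledError:
        # Cancellation observed outside an activity: mark and re-raise.
        status = "CANCELLED"
        raise
    except ActivityCancelled:
        # Cancellation surfaced through the activity the workflow awaited.
        status = "CANCELLED"
    except Exception:
        status = "FAILED"
        raise
    finally:
        # Runs on every path, so state and accounting always settle.
        await finalize(status)
    return status
```

Every exit path, including the re-raised cancellation, passes through the same `finally`, which is what lets one finalization step own the terminal state.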

The SSE connection is not part of the cancellation mechanism. It keeps reading persisted agent messages and the answer row. When the answer status becomes CANCELLED, it emits an answer_completed SSE event and closes.
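As a rough sketch, the observer loop looks something like the generator below. The two loader callables, the `agent_message` event name, the polling interval, and the set of terminal statuses other than CANCELLED are assumptions for the example. Only the `answer_completed` event and the CANCELLED status come from our implementation.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable

# Only CANCELLED is named in this post; the other two are assumed.
TERMINAL_STATUSES = {"COMPLETED", "CANCELLED", "FAILED"}


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def conversation_stream(
    load_messages_after: Callable[[int], Awaitable[list[dict]]],
    load_answer_status: Callable[[], Awaitable[str]],
    poll_interval: float = 0.5,
) -> AsyncIterator[str]:
    """Replay persisted messages until the answer row reaches a terminal status."""
    last_id = 0
    while True:
        # Read the status first, then the messages. Anything persisted before
        # the terminal status was written is flushed before the stream closes.
        status = await load_answer_status()
        for message in await load_messages_after(last_id):
            last_id = message["id"]
            yield sse("agent_message", message)
        if status in TERMINAL_STATUSES:
            yield sse("answer_completed", {"status": status})
            return
        await asyncio.sleep(poll_interval)
```

Because the generator reads only persisted rows, a client that reconnects after a refresh can start a new stream and see the same history, and the Stop button never has to touch this code path.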

Temporal cancellation depends on activity heartbeats

Temporal cannot interrupt arbitrary code running inside a remote activity. The activity has to cooperate. For non-local activities, the Python SDK requires a heartbeat_timeout and calls to activity.heartbeat() so the worker can receive a cancellation request.

The heartbeat point needs to be inside the work that can run for a long time. In our case, that is the innermost loop over LangGraph stream events, not the workflow loop around whole agent roles.

A Researcher role may stream tokens for a while, call a retrieval tool, and then resume streaming. If the activity heartbeats only after the role completes, cancellation cannot be observed until that role has already finished.

We pass a callback into the stream loop and invoke it after each event has been processed:

async for stream_mode, data in agent.astream(
    {"messages": messages},
    **kwargs,
):
    # Convert and persist token, tool-call, and tool-result events here.

    if on_event:
        await on_event({"stream_mode": stream_mode})

The activity supplies the callback:

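# Both names live inside the activity body, so _on_event can rebind
# _last_heartbeat through nonlocal.
_last_heartbeat = time.monotonic()

async def _on_event(event: dict) -> None:
    nonlocal _last_heartbeat
    now = time.monotonic()
    # Heartbeat at most every 5 seconds while running, and on every event
    # once cancellation has been flagged.
    if activity.is_cancelled() or (now - _last_heartbeat) >= 5.0:
        activity.heartbeat(f"...")
        _last_heartbeat = now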

The stream loop calls on_event after it processes each event. That ordering gives token usage events and tool completion events a chance to be yielded and persisted before the heartbeat delivers the cancellation request that unwinds the activity.

This is still cooperative cancellation. If the activity is inside one long awaited tool call, the callback cannot run until control returns. A Temporal Python SDK issue about activity cancellation describes the same edge: cancellation is delivered through heartbeats, and Python async cancellation still needs an await point.

The workflow waits before finalizing accounting

Temporal's ActivityCancellationType decides what the workflow should wait for after requesting activity cancellation.

TRY_CANCEL sends cancellation to the activity and lets the workflow move on promptly. That shortens the workflow's cancellation path, but it does not guarantee that the activity has finished unwinding.

If the workflow moves to finalization while the activity is still unwinding, finalization can read an incomplete operation total. The last model response may have generated tokens, but the usage event may not have reached the database yet. Now the credit ledger finalizes too early.

For an agent role activity, we use WAIT_CANCELLATION_COMPLETED:

result = await workflow.execute_activity(
    MindmapAgentActivities.execute_role,
    args=[...],
    start_to_close_timeout=timedelta(...),
    heartbeat_timeout=timedelta(...),
    retry_policy=RetryPolicy(...),
    cancellation_type=(
        ActivityCancellationType.WAIT_CANCELLATION_COMPLETED
    ),
)

This ensures finalization does not run until the activity has actually closed.
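The ordering problem is easy to reproduce with plain asyncio. The simulation below is not Temporal code: `role_activity` stands in for an activity that records one last usage event while it unwinds, and the `wait_for_unwind` flag plays the role of choosing between TRY_CANCEL and WAIT_CANCELLATION_COMPLETED. All names and token counts are made up for the example.

```python
import asyncio


async def role_activity(usage_log: list[int], started: asyncio.Event) -> None:
    usage_log.append(100)  # usage from an earlier model response
    started.set()
    try:
        await asyncio.sleep(3600)  # stand-in for a long token stream
    except asyncio.CancelledError:
        # Unwinding takes time: the final usage event lands a moment later.
        await asyncio.sleep(0.01)
        usage_log.append(40)
        raise


async def cancel_then_read_total(wait_for_unwind: bool) -> int:
    usage_log: list[int] = []
    started = asyncio.Event()
    task = asyncio.create_task(role_activity(usage_log, started))
    await started.wait()
    task.cancel()

    if wait_for_unwind:
        try:
            await task  # like WAIT_CANCELLATION_COMPLETED
        except asyncio.CancelledError:
            pass

    total = sum(usage_log)  # what finalization would aggregate right now

    if not wait_for_unwind:
        try:
            await task  # let the task finish before the event loop closes
        except asyncio.CancelledError:
            pass
    return total


early_total = asyncio.run(cancel_then_read_total(wait_for_unwind=False))
settled_total = asyncio.run(cancel_then_read_total(wait_for_unwind=True))
```

Here `early_total` is 100 and `settled_total` is 140. The 40 tokens that only the waiting path sees are the ones the provider still bills for.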

The implementation separates those concerns: cancellation latency, which is bounded by the heartbeat interval, is handled inside the activity, and accounting order is handled by the workflow.

Finalization commits usage and terminal state

At the start of a run we reserve credits. Each model response records actual token usage under the operation ID. At the end, finalization charges the recorded usage and releases the unused reserve.

Finalization is not cleanup after the main work. It is the step that commits the terminal state of the operation. It must run after success, cancellation, failure, and resource deletion, and it must be safe to retry. Resource deletion is the case where the user deletes the mindmap while the agent is still working on it, because PhDs can be dramatic sometimes.

The workflow carries terminal state into a single finally block:

finally:
    await workflow.execute_activity(
        MindmapAgentActivities.finalize_workflow,
        args=[MindmapAgentFinalizeInput(
            ...
        )],
        start_to_close_timeout=timedelta(...),
        retry_policy=RetryPolicy(
            maximum_attempts=...,
            backoff_coefficient=...,
        ),
        cancellation_type=(
            ActivityCancellationType.WAIT_CANCELLATION_COMPLETED
        ),
    )

finalize_workflow records a token summary, writes the terminal answer and conversation statuses, and calls credits_manager.finalize(operation_id).

Credit finalization is idempotent. It takes an advisory lock on the operation ID, checks whether credit usage has already been recorded for that operation, aggregates all usage events for it, inserts one or more credit usage rows, and expires the reserve. If credit finalization fails, the activity raises and Temporal retries it.
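To show why a retried finalization cannot double charge, here is a small sketch using SQLite from the standard library. It swaps in different machinery from what we run: `BEGIN IMMEDIATE` stands in for the advisory lock, the table and column names are invented, the sketch writes a single credit usage row per operation, and the rate of 1,000 tokens per credit is an assumption.

```python
import sqlite3

SCHEMA = """
CREATE TABLE usage_events (operation_id TEXT, tokens INTEGER);
CREATE TABLE reserves (operation_id TEXT PRIMARY KEY, credits INTEGER, status TEXT);
CREATE TABLE credit_usage (operation_id TEXT PRIMARY KEY, credits INTEGER);
"""


def finalize(conn: sqlite3.Connection, operation_id: str,
             tokens_per_credit: int = 1_000) -> int:
    """Charge an operation exactly once, however many times this is called."""
    # Take the write lock up front so concurrent finalizers serialize here.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT credits FROM credit_usage WHERE operation_id = ?",
            (operation_id,),
        ).fetchone()
        if row is not None:
            conn.execute("COMMIT")
            return row[0]  # already finalized: a retry changes nothing

        (total_tokens,) = conn.execute(
            "SELECT COALESCE(SUM(tokens), 0) FROM usage_events "
            "WHERE operation_id = ?",
            (operation_id,),
        ).fetchone()
        credits = -(-total_tokens // tokens_per_credit)  # ceiling division
        conn.execute(
            "INSERT INTO credit_usage (operation_id, credits) VALUES (?, ?)",
            (operation_id, credits),
        )
        conn.execute(
            "UPDATE reserves SET status = 'expired' WHERE operation_id = ?",
            (operation_id,),
        )
        conn.execute("COMMIT")
        return credits
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN and COMMIT statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(SCHEMA)
conn.execute("INSERT INTO reserves VALUES ('op-1', 20, 'active')")
conn.executemany(
    "INSERT INTO usage_events VALUES ('op-1', ?)", [(1_200,), (800,), (2_500,)]
)
first = finalize(conn, "op-1")
second = finalize(conn, "op-1")  # simulates a Temporal retry
```

Both calls return the same 5 credits and the table holds a single row for `op-1`, so a retry after a partial failure is safe to issue blindly.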

Cancellation is complete after finalization

The terminal state should be driven by the durable record written after finalization. By then, committed application state has been preserved, the answer has a final status, token usage has been aggregated, and the credit reserve has been settled: actual usage charged, the unused remainder released.

If you made it this far, congratulations. Your reading session can now be finalized, with zero credits charged.
