How Devon Cut HelaSyn's AI Bill by 5-10x with Strategic Cache Placement

Last week Devon added cache token tracking to HelaSyn — because we discovered cache tokens were silently consuming 99% of our rate limit while only $38 showed in the spend logs.

This week: the cost side. If cache tokens hit the rate limit at full input weight, can we at least get the billing discount? Yes — and more than that. Devon shipped a proof-of-concept that puts four cache breakpoints at exactly the right positions in every HelaSyn conversation. The result: an estimated 5-10x reduction in API spend.

The Cost Problem

HelaSyn agents run continuously. The CVA fleet — 24 services — bills roughly $290/day in API spend. Our local Brody instance costs about $0.30 per turn.

The root cause: every long conversation re-sends the full context on every turn. A 50-turn conversation pays for 1 + 2 + 3 + ... + 50 = 1,275 input-token reads, not 50.

How Anthropic Cache Markers Work

Anthropic's prompt caching lets you mark up to four positions in a conversation as breakpoints. When the same prefix reaches the model again, the API returns cached KV states instead of reprocessing those tokens — at 10% of the normal input cost for cache hits.

The constraint: four markers, one-hour TTL. Place them wrong and you pay cache-write cost upfront without capturing the savings on subsequent turns.

The Strategy: System + N-3/N-2/N-1

Devon's place_cache_breakpoints function in the new llm/ subpackage implements a four-position placement strategy:

System prompt — always marked first. It is the largest, most-repeated context block across every turn.
Messages N-3, N-2, N-1 — the three messages immediately before the current turn.

This covers what matters: the static system context and the recent working memory. The next call will find the system prompt cached (guaranteed hit) and has a high probability of hitting on the recent messages if the conversation resumes within an hour.

def place_cache_breakpoints(
    messages: list[dict], system: str
) -> tuple[list[dict], str]:
    marked_system = attach_cache_control(system, ttl="1h")
    recent_idx = max(0, len(messages) - 3)
    for i in range(recent_idx, len(messages)):
        messages[i] = attach_cache_control(messages[i], ttl="1h")
    return messages, marked_system

The function respects the four-marker cap and sets ttl=1h — matching the cadence at which HelaSyn conversations recur in practice.

What the PoC Proved

The llm/ subpackage ships as a clean standalone module. cli_runner.py stays untouched — zero production wiring. The llm_runner_demo.py exits cleanly without an API key, so the package can be reviewed and merged without needing credentials.

21 unit tests pass. One live integration test is intentionally skipped (requires a key). Zero regression against the 28 existing baseline failures in the HelaSyn test suite.

The PoC answers the pre-production question: can we place cache markers correctly before Phase 2 wires in the tool loop and state store? Confirmed yes.

The Numbers

At CVA fleet scale, the math is straightforward:

Scenario	Daily spend
Today (no cache placement)	~$290/day
With 90% cache hit rate (10% billing)	~$29/day
Conservative (70% hit rate)	~$58/day

For Tobi individually: $0.30/turn drops to ~$0.03/cached turn on subsequent calls within the TTL window. The savings compound the longer a conversation runs.

This is why tracking the four token types matters. You cannot optimize what you cannot see — and before last week's fix, cache_read_input_tokens and cache_creation_input_tokens were not making it into our logs at all.

What Is Next

Phase 2 of the LLM Runner wires it into production:

conv_store — persistent conversation state across agent restarts
tool_loop — structured tool call handling with the new provider abstraction
Rotation — API key distribution across the 24-service CVA fleet
Handoff — context window management when conversations exceed model limits

Phase 1 is merged to main. Phase 2 starts next sprint.

Every week we ship something, hit a constraint, and then write about it. The constraint this week was invisible cost. The fix was four breakpoints placed correctly. Follow along on blog.helachain.com as we build in public.