How Devon Cut HelaSyn's AI Bill by 5-10x with Strategic Cache Placement
Our CVA fleet was burning $290/day in Anthropic API spend. Devon shipped a PoC that proves four strategic cache breakpoints can cut that to $29-58/day. Here is how the placement strategy works — and why position matters more than token count.
Last week Devon added cache token tracking to HelaSyn — because we discovered cache tokens were silently consuming 99% of our rate limit while only $38 showed in the spend logs.
This week: the cost side. If cache tokens hit the rate limit at full input weight, can we at least get the billing discount? Yes — and more than that. Devon shipped a proof-of-concept that puts four cache breakpoints at exactly the right positions in every HelaSyn conversation. The result: an estimated 5-10x reduction in API spend.
The Cost Problem
HelaSyn agents run continuously. The CVA fleet — 24 services — bills roughly $290/day in API spend. Our local Brody instance costs about $0.30 per turn.
The root cause: every long conversation re-sends the full context on every turn. A 50-turn conversation pays for 1 + 2 + 3 + ... + 50 = 1,275 input-token reads, not 50.
How Anthropic Cache Markers Work
Anthropic's prompt caching lets you mark up to four positions in a conversation as breakpoints. When the same prefix reaches the model again, the API returns cached KV states instead of reprocessing those tokens — at 10% of the normal input cost for cache hits.
The constraint: four markers, one-hour TTL. Place them wrong and you pay cache-write cost upfront without capturing the savings on subsequent turns.
The Strategy: System + N-3/N-2/N-1
Devon's place_cache_breakpoints function in the new llm/ subpackage implements a four-position placement strategy:
- System prompt — always marked first. It is the largest, most-repeated context block across every turn.
- Messages N-3, N-2, N-1 — the three messages immediately before the current turn.
This covers what matters: the static system context and the recent working memory. The next call will find the system prompt cached (guaranteed hit) and has a high probability of hitting on the recent messages if the conversation resumes within an hour.
def place_cache_breakpoints(
messages: list[dict], system: str
) -> tuple[list[dict], str]:
marked_system = attach_cache_control(system, ttl="1h")
recent_idx = max(0, len(messages) - 3)
for i in range(recent_idx, len(messages)):
messages[i] = attach_cache_control(messages[i], ttl="1h")
return messages, marked_system
The function respects the four-marker cap and sets ttl=1h — matching the cadence at which HelaSyn conversations recur in practice.
What the PoC Proved
The llm/ subpackage ships as a clean standalone module. cli_runner.py stays untouched — zero production wiring. The llm_runner_demo.py exits cleanly without an API key, so the package can be reviewed and merged without needing credentials.
21 unit tests pass. One live integration test is intentionally skipped (requires a key). Zero regression against the 28 existing baseline failures in the HelaSyn test suite.
The PoC answers the pre-production question: can we place cache markers correctly before Phase 2 wires in the tool loop and state store? Confirmed yes.
The Numbers
At CVA fleet scale, the math is straightforward:
| Scenario | Daily spend |
|---|---|
| Today (no cache placement) | ~$290/day |
| With 90% cache hit rate (10% billing) | ~$29/day |
| Conservative (70% hit rate) | ~$58/day |
For Tobi individually: $0.30/turn drops to ~$0.03/cached turn on subsequent calls within the TTL window. The savings compound the longer a conversation runs.
This is why tracking the four token types matters. You cannot optimize what you cannot see — and before last week's fix, cache_read_input_tokens and cache_creation_input_tokens were not making it into our logs at all.
What Is Next
Phase 2 of the LLM Runner wires it into production:
conv_store— persistent conversation state across agent restartstool_loop— structured tool call handling with the new provider abstraction- Rotation — API key distribution across the 24-service CVA fleet
- Handoff — context window management when conversations exceed model limits
Phase 1 is merged to main. Phase 2 starts next sprint.
Every week we ship something, hit a constraint, and then write about it. The constraint this week was invisible cost. The fix was four breakpoints placed correctly. Follow along on blog.helachain.com as we build in public.