8. Token economy — what enters the context and what stays out
Tokens are a finite resource. A four-layer placement (inline · auto-load · on-demand · external), how token savings raise decision quality, and the audit tools that keep the budget in check.
This is part 8 of the Alice Way series. Parts 1 through 7 covered where, how, and with whom to distribute the load of mind. This post is about the most foundational resource all of those decisions depend on — the context window, i.e. tokens.
0. Token economy is deliberately reducing what gets into mind
The collaborator's context window is finite. Not everything fits. But the more important fact is that what gets in determines the quality of the decision. Filled with noise, the signal weakens. Compressed to signal, the same window can carry a far deeper decision.
Token economy is therefore not just cost reduction. It is the deliberate selection of what to admit into mind. Auto-load vs on-demand, memory vs outside-memory, main vs sub — every decision the series covered converges on a single resource allocation problem.
This post records the four-layer placement that governs that allocation, how token saving translates into decision quality, and the audit tooling that keeps accumulated context tidy.
1. Four-layer placement — channels tokens come in through
The token placement I converged on has four layers.
| Layer | What | When it enters | Token cost |
|---|---|---|---|
| L1 inline | Persona · decision authority · safety guards | Every session, auto | Always (most expensive) |
| L2 auto-load | First N lines of memory index, project CLAUDE.md | Every session / on directory entry | Always (conditional) |
| L3 on-demand Read | Memory bodies, per-language rules, code files | When work enters that area | Only in that work |
| L4 external reference | Whole codebase, build logs, external API responses | When a sub-agent takes the work | Main sees only the summary |
The core rule: L1·L2 must be short because they are seen often; L3·L4 can be long because they are seen rarely. The same information in the wrong layer wastes tokens.
| Misplacement | Result |
|---|---|
| L4 → L1 (entire code into persona) | Massive context consumption every session |
| L1 → L4 (role definition in an external file) | Collaborator starts every session not knowing its persona |
| L2 → L3 (memory index as on-demand) | Collaborator never learns its own memory exists |
| L3 → L2 (all memory bodies auto-loaded) | Index auto-load loses its meaning |
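To make the placement concrete, here is the layering expressed as data, with a guard against the size-related misplacements above. A minimal sketch: the file names (persona.md, project/CLAUDE.md) and the line budgets are illustrative assumptions, not values from the actual setup.

```python
# Four-layer placement as data. Names and budgets are assumptions.
PLACEMENT = {
    "L1": ["persona.md"],                      # inline: every session, every response
    "L2": ["MEMORY.md", "project/CLAUDE.md"],  # auto-load: every session
    "L3": ["memory/*.md", "rules/*.md"],       # on-demand: Read when the work needs it
    "L4": ["src/**", "logs/**"],               # external: sub-agent only, summary to main
}

# Rough per-file line budgets; always-loaded layers must stay short.
LINE_BUDGET = {"L1": 150, "L2": 200, "L3": 2000, "L4": None}

def layer_warning(layer: str, line_count: int) -> str | None:
    """Flag a file whose size does not fit its layer (an L4 -> L1 misplacement)."""
    budget = LINE_BUDGET[layer]
    if budget is not None and line_count > budget:
        return f"{layer} file at {line_count} lines (budget {budget}): demote or prune"
    return None
```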
2. Token saving → decision quality — the causal chain
Why token saving is a decision-quality problem, not just a cost problem.
2.1 Signal-to-noise ratio
Assume a 100k context window. If persona + memory + irrelevant auto-loads occupy 80k, only 20k is left for decision-grade signal. Shrink auto-loads to 30k in the same window and signal room grows to 70k. The same decision unrolls far deeper.
Decision quality scales with the signal ratio, not the context size.
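The arithmetic in one helper, assuming the 100k window from the example:

```python
def signal_room(overhead_tokens: int, window: int = 100_000) -> int:
    """Tokens left for decision-grade signal after fixed overhead."""
    return window - overhead_tokens

signal_room(80_000)  # 20_000: bloated auto-loads leave 20% for signal
signal_room(30_000)  # 70_000: trimmed auto-loads leave 70% for signal
```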
2.2 Attention diffusion
The collaborator, like a person, gets shakier about what matters as context grows. Finding the key line inside 100k is less accurate than inside 30k. Cutting noise itself lifts decision quality.
2.3 First-token latency
Fewer tokens means faster response. Faster response keeps the operator's flow intact. Uninterrupted flow reduces the operator's mind load too. Quality rises on the operator's side as well.
3. Savings patterns in rotation
What I actively run for token savings.
3.1 Memory index capped at 200 lines
MEMORY.md index stays under 200 lines. Past that, it falls outside the auto-load window and loses its purpose. The 200-line pressure itself becomes the motivation to prune.
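A minimal cap check, assuming the index lives at MEMORY.md and the auto-load window covers its first 200 lines:

```python
from pathlib import Path

CAP = 200  # assumed auto-load window: the first 200 lines of the index

def lines_over_cap(index_path: str = "MEMORY.md", cap: int = CAP) -> int:
    """Lines past the cap never auto-load, so they are dead weight."""
    lines = Path(index_path).read_text().splitlines()
    return max(0, len(lines) - cap)

# Any non-zero result is the pruning signal: trim entries until it reads 0.
```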
3.2 No reading whole directories
Reading a large directory in one shot blows up tokens. List first with ls, narrow with grep, then Read only specific files. Baked into the persona as a safety guard.
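A Python analogue of the ls → grep → Read funnel. The local filter pass costs disk reads, not context tokens; only the returned files would ever enter the collaborator's window.

```python
from pathlib import Path

def narrow_then_read(root: str, needle: str, limit: int = 3) -> dict[str, str]:
    """ls -> grep -> Read as a funnel: only the final reads enter the context."""
    listed = Path(root).rglob("*.py")                        # "ls": enumerate names
    matched = [p for p in listed
               if needle in p.read_text(errors="ignore")]    # "grep": filter locally
    return {str(p): p.read_text() for p in matched[:limit]}  # "Read": survivors only
```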
3.3 Header/range first for files over 500 lines
Never read a large file end to end. Read the first 100 lines for structure, then Read only the needed range. The same decision happens against 200 lines of context, not 1000.
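The range read as a sketch; big_module.py and the line numbers are hypothetical.

```python
def read_range(path: str, start: int = 1, count: int = 100) -> str:
    """Read lines start..start+count-1 instead of the whole file."""
    with open(path) as f:
        lines = f.readlines()
    return "".join(lines[start - 1 : start - 1 + count])

# First pass for structure, second pass for the needed block only:
# read_range("big_module.py", 1, 100); read_range("big_module.py", 430, 60)
```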
3.4 Hand deep work off to subagents
Search, review, build, test go deep in subs; main only sees the summary. The main context stays in decision mode. (See part 6.)
3.5 Memory pruning cadence
Project memory is pruned monthly, feedback memory quarterly, reference memory every half year. Stale context should not keep occupying slots.
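The cadences as data, with a due-date check. The numbers come from the text; the scheduling helper is an illustrative sketch.

```python
from datetime import date, timedelta

# Pruning cadences: monthly, quarterly, half-yearly.
CADENCE_DAYS = {"project": 30, "feedback": 90, "reference": 180}

def due_for_pruning(kind: str, last_pruned: date) -> bool:
    """True once a memory type has gone a full cadence without pruning."""
    return date.today() - last_pruned >= timedelta(days=CADENCE_DAYS[kind])
```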
3.6 Token audit tool
A tool like /token-audit runs periodically. It reports which entries have not been referenced for 60+ days, which memories have inflated, and which auto-loads go unused. That report is the basis for pruning decisions.
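A guessed shape for such a report, not the actual tool. The 60-day threshold is from the text; the 300-line inflation threshold, the memory/ directory, and the use of file access time as a stand-in for "last referenced" are all my assumptions.

```python
import time
from pathlib import Path

STALE_DAYS = 60        # threshold from the text
INFLATED_LINES = 300   # inflation threshold: an assumption

def token_audit(memory_dir: str = "memory") -> list[str]:
    """Sketch of a /token-audit report over memory files.

    Uses file access time as a proxy for 'last referenced'; a real tool
    would log Read calls instead, since atime is unreliable on noatime mounts.
    """
    report, now = [], time.time()
    for p in sorted(Path(memory_dir).glob("*.md")):
        idle_days = (now - p.stat().st_atime) / 86_400
        lines = len(p.read_text().splitlines())
        if idle_days > STALE_DAYS:
            report.append(f"{p.name}: not referenced for {idle_days:.0f} days")
        if lines > INFLATED_LINES:
            report.append(f"{p.name}: inflated at {lines} lines")
    return report
```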
4. Expensive context locations — where to prune first
Token cost varies by location. The same 100 lines can cost 10x more depending on placement.
| Location | Per-line cost (relative) | Pruning priority |
|---|---|---|
| Persona (L1) | × every session × every response | ★★★ top |
| Memory index (L2) | × every session | ★★ |
| Project CLAUDE.md (L2) | × every session in that project | ★★ |
| Memory body (L3) | × only when Read | ★ |
| Code file (L4) | × only when Read | (not pruning — splitting) |
The persona is the most expensive location. The same 100 lines there are paid for in every session, on every response. So the persona stays smallest and gets pruned most often.
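A back-of-envelope version of the table. The multipliers come from the text; the monthly usage counts are assumptions.

```python
SESSIONS = 60    # sessions per month (assumed)
RESPONSES = 40   # responses per session (assumed)
READS = 5        # on-demand Reads per month (assumed)

def monthly_exposures(lines: int, location: str) -> int:
    """How many times each line is paid for in a month."""
    if location == "persona":   # L1: every session x every response
        return lines * SESSIONS * RESPONSES
    if location == "index":     # L2: every session
        return lines * SESSIONS
    return lines * READS        # L3: only when Read

monthly_exposures(100, "persona")  # 240_000
monthly_exposures(100, "index")    #   6_000
monthly_exposures(100, "body")     #     500
```

Under these assumed counts, pruning 10 lines from the persona removes more monthly exposure than deleting an entire 300-line memory body, which is exactly why the persona sits at the top of the pruning priority.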
5. Traps — patterns where token economy goes wrong
5.1 "Just in case" inline
Adding one guard, one policy, one example to the persona because it might be useful someday. A year later the persona is 1000 lines. → Hold to the rule: "nothing is promoted to the persona until it has actually triggered twice."
5.2 Memory body copied into the index
Copying part of a memory body into the index as a "summary." The index itself inflates. → Index always carries one-line hooks only. Bodies stay in their files.
5.3 No periodic token review
Using it daily but with no idea where tokens go. → Schedule token audits. Which auto-loads are large, which memories go unused.
5.4 Sub results absorbed whole into main
Sub returns 1000 lines and main absorbs all of it. The reason to use a sub is lost. → Sub results are always split into a summary (for main) and the full output (for a file or the next sub call).
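The split as a sketch. The archive path is illustrative, and ideally the sub writes its own summary; the point is only that main receives the short string, never the full output.

```python
from pathlib import Path

def split_sub_result(full_output: str, summary: str,
                     archive: str = "artifacts/sub_result.txt") -> str:
    """Persist the full output; return only the summary to main."""
    Path(archive).parent.mkdir(parents=True, exist_ok=True)
    Path(archive).write_text(full_output)          # full: for files or the next sub call
    return f"{summary}\n(full output: {archive})"  # summary: all main ever sees
```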
5.5 Hitting the context limit
Tokens reach the end of the window and auto-compaction kicks in, or the oldest context is dropped. The dropped slice might be one the operator did not consciously track. → Always be aware of context usage; clean up at 80% utilization.
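The threshold as a one-liner, assuming the 100k window used throughout:

```python
CLEANUP_AT = 0.80  # the 80% threshold from the text

def should_cleanup(tokens_used: int, window: int = 100_000) -> bool:
    """Clean up deliberately before auto-compaction decides what to drop."""
    return tokens_used / window >= CLEANUP_AT
```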
6. Accumulated token economy — what changes
Changes since I started treating token economy as a first-class concern.
| Item | Before | After |
|---|---|---|
| Session-start context | Flat 60–70k | Index + persona = 15–20k |
| Tokens available for deep work | 30–40k | 80–85k |
| Frequency of context compaction | Often | Rarely |
| Decision quality on the same work | Average | Deep (less noise) |
| Response speed | Average | Fast |
The core is the last two rows: the same work unrolls into a deeper decision, and it unrolls faster. Token savings translate directly into decision quality.
7. Compressed into one principle
The core of token-economy design collapses into one sentence.
"Inline only what must be seen every session; auto-load what is seen often; on-demand Read what is seen occasionally; external for what should almost never be seen. The expensive mistake is the wrong layer."
When this holds, the token economy becomes the accelerator of decision quality. When it breaks, noise eats decision room and the collaborator gets confused about what matters.
The next post is the last one. It covers how all of this — persona, memory, skills, hooks, multi-agent, verification, tokens — actually starts and ends each day. Session protocols.