8. Token economy — what enters the context and what stays out
Tokens are a finite resource. A four-layer placement (inline · auto-load · on-demand · external), how token savings raise decision quality, and the audit tools that keep the budget in check.
This is part 8 of the Alice Way series. Parts 1 through 7 covered where, how, and with whom to distribute the load of mind. This post is about the most foundational resource all of those decisions depend on — the context window, i.e. tokens.
0. Token economy is deliberately reducing what gets into mind
The collaborator's context window is finite. Not everything fits. But the more important fact is that what gets in determines the quality of the decision. Filled with noise, the signal weakens. Compressed to signal, the same window can carry a far deeper decision.
Token economy is therefore not just cost reduction. It is the deliberate selection of what to admit into mind. Auto-load vs on-demand, memory vs outside-memory, main vs sub — every decision the series covered converges on a single resource allocation problem.
This post records the four-layer placement that governs that allocation, how token saving translates into decision quality, and the audit tooling that keeps accumulated context tidy.
1. Four-layer placement — channels tokens come in through
The token placement I converged on has four layers.
| Layer | What | When it enters | Token cost |
|---|---|---|---|
| L1 inline | Persona · decision authority · safety guards | Every session, auto | Always (most expensive) |
| L2 auto-load | First N lines of memory index, project CLAUDE.md | Every session / on directory entry | Always (conditional) |
| L3 on-demand Read | Memory bodies, per-language rules, code files | When work enters that area | Only in that work |
| L4 external reference | Whole codebase, build logs, external API responses | When a sub-agent takes the work | Main sees only the summary |
The core rule: L1·L2 must be short because they are seen often; L3·L4 can be long because they are seen rarely. The same information in the wrong layer wastes tokens.
| Misplacement | Result |
|---|---|
| L4 → L1 (entire code into persona) | Massive context consumption every session |
| L1 → L4 (role definition in an external file) | Collaborator starts every session not knowing its persona |
| L2 → L3 (memory index as on-demand) | Collaborator never learns its own memory exists |
| L3 → L2 (all memory bodies auto-loaded) | Index auto-load loses its meaning |
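To make the placement concrete, here is the layering expressed as data, with a guard against the size-related misplacements above. A minimal sketch: the file names (persona.md, project/CLAUDE.md) and the line budgets are illustrative assumptions, not values from the actual setup.

```python
# Four-layer placement as data. Names and budgets are assumptions.
PLACEMENT = {
    "L1": ["persona.md"],                      # inline: every session, every response
    "L2": ["MEMORY.md", "project/CLAUDE.md"],  # auto-load: every session
    "L3": ["memory/*.md", "rules/*.md"],       # on-demand: Read when the work needs it
    "L4": ["src/**", "logs/**"],               # external: sub-agent only, summary to main
}

# Rough per-file line budgets; always-loaded layers must stay short.
LINE_BUDGET = {"L1": 150, "L2": 200, "L3": 2000, "L4": None}

def layer_warning(layer: str, line_count: int) -> str | None:
    """Flag a file whose size does not fit its layer (an L4 -> L1 misplacement)."""
    budget = LINE_BUDGET[layer]
    if budget is not None and line_count > budget:
        return f"{layer} file at {line_count} lines (budget {budget}): demote or prune"
    return None
```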
2. Token saving → decision quality — the causal chain
Why token saving is a decision-quality problem, not just a cost problem.
2.1 Signal-to-noise ratio
Assume a 100k context window. If persona + memory + irrelevant auto-loads occupy 80k, only 20k is left for decision-grade signal. Shrink auto-loads to 30k in the same window and signal room grows to 70k. The same decision unrolls far deeper.
Decision quality scales with the signal ratio, not the context size.
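The arithmetic in one helper, assuming the 100k window from the example:

```python
def signal_room(overhead_tokens: int, window: int = 100_000) -> int:
    """Tokens left for decision-grade signal after fixed overhead."""
    return window - overhead_tokens

signal_room(80_000)  # 20_000: bloated auto-loads leave 20% for signal
signal_room(30_000)  # 70_000: trimmed auto-loads leave 70% for signal
```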
2.2 Attention diffusion
The collaborator, like a person, gets shakier about what matters as context grows. Finding the key line inside 100k is less accurate than inside 30k. Cutting noise itself lifts decision quality.
2.3 First-token latency
Fewer tokens means faster response. Faster response keeps the operator's flow intact. Uninterrupted flow reduces the operator's mind load too. Quality rises on the operator's side as well.
3. Savings patterns in rotation
What I actively run for token savings.
3.1 Memory index capped at 200 lines
MEMORY.md index stays under 200 lines. Past that, it falls outside the auto-load window and loses its purpose. The 200-line pressure itself becomes the motivation to prune.
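A minimal cap check, assuming the index lives at MEMORY.md and the auto-load window covers its first 200 lines:

```python
from pathlib import Path

CAP = 200  # assumed auto-load window: the first 200 lines of the index

def lines_over_cap(index_path: str = "MEMORY.md", cap: int = CAP) -> int:
    """Lines past the cap never auto-load, so they are dead weight."""
    lines = Path(index_path).read_text().splitlines()
    return max(0, len(lines) - cap)

# Any non-zero result is the pruning signal: trim entries until it reads 0.
```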
3.2 No reading whole directories
Reading a large directory in one shot blows up tokens. List first with ls, narrow with grep, then Read only specific files. Baked into the persona as a safety guard.
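A Python analogue of the ls → grep → Read funnel. The local filter pass costs disk reads, not context tokens; only the returned files would ever enter the collaborator's window.

```python
from pathlib import Path

def narrow_then_read(root: str, needle: str, limit: int = 3) -> dict[str, str]:
    """ls -> grep -> Read as a funnel: only the final reads enter the context."""
    listed = Path(root).rglob("*.py")                        # "ls": enumerate names
    matched = [p for p in listed
               if needle in p.read_text(errors="ignore")]    # "grep": filter locally
    return {str(p): p.read_text() for p in matched[:limit]}  # "Read": survivors only
```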
3.3 Header/range first for files over 500 lines
Never read a large file end to end. Read the first 100 lines for structure, then Read only the needed range. The same decision happens against 200 lines of context, not 1000.
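The range read as a sketch; big_module.py and the line numbers are hypothetical.

```python
def read_range(path: str, start: int = 1, count: int = 100) -> str:
    """Read lines start..start+count-1 instead of the whole file."""
    with open(path) as f:
        lines = f.readlines()
    return "".join(lines[start - 1 : start - 1 + count])

# First pass for structure, second pass for the needed block only:
# read_range("big_module.py", 1, 100); read_range("big_module.py", 430, 60)
```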
3.4 Hand deep work off to subagents
Search, review, build, test go deep in subs; main only sees the summary. The main context stays in decision mode. (See part 6.)
3.5 Memory pruning cadence
Project memory is pruned monthly, feedback memory quarterly, reference memory every half year. Stale context should not keep occupying slots.
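The cadences as data, with a due-date check. The numbers come from the text; the scheduling helper is an illustrative sketch.

```python
from datetime import date, timedelta

# Pruning cadences: monthly, quarterly, half-yearly.
CADENCE_DAYS = {"project": 30, "feedback": 90, "reference": 180}

def due_for_pruning(kind: str, last_pruned: date) -> bool:
    """True once a memory type has gone a full cadence without pruning."""
    return date.today() - last_pruned >= timedelta(days=CADENCE_DAYS[kind])
```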
3.6 Token audit tool
A tool like /token-audit runs periodically. It reports which entries have not been referenced for 60+ days, which memories have inflated, and which auto-loads go unused. That report is the basis for pruning decisions.
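A guessed shape for such a report, not the actual tool. The 60-day threshold is from the text; the 300-line inflation threshold, the memory/ directory, and the use of file access time as a stand-in for "last referenced" are all my assumptions.

```python
import time
from pathlib import Path

STALE_DAYS = 60        # threshold from the text
INFLATED_LINES = 300   # inflation threshold: an assumption

def token_audit(memory_dir: str = "memory") -> list[str]:
    """Sketch of a /token-audit report over memory files.

    Uses file access time as a proxy for 'last referenced'; a real tool
    would log Read calls instead, since atime is unreliable on noatime mounts.
    """
    report, now = [], time.time()
    for p in sorted(Path(memory_dir).glob("*.md")):
        idle_days = (now - p.stat().st_atime) / 86_400
        lines = len(p.read_text().splitlines())
        if idle_days > STALE_DAYS:
            report.append(f"{p.name}: not referenced for {idle_days:.0f} days")
        if lines > INFLATED_LINES:
            report.append(f"{p.name}: inflated at {lines} lines")
    return report
```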
4. Expensive context locations — where to prune first
Token cost varies by location. The same 100 lines can cost 10x more depending on placement.
| Location | Per-line cost (relative) | Pruning priority |
|---|---|---|
| Persona (L1) | × every session × every response | ★★★ top |
| Memory index (L2) | × every session | ★★ |
| Project CLAUDE.md (L2) | × every session in that project | ★★ |
| Memory body (L3) | × only when Read | ★ |
| Code file (L4) | × only when Read | (not pruning — splitting) |
The persona is the most expensive location. The same 100 lines there are paid for in every session, on every response. So the persona stays smallest and gets pruned most often.
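A back-of-envelope version of the table. The multipliers come from the text; the monthly usage counts are assumptions.

```python
SESSIONS = 60    # sessions per month (assumed)
RESPONSES = 40   # responses per session (assumed)
READS = 5        # on-demand Reads per month (assumed)

def monthly_exposures(lines: int, location: str) -> int:
    """How many times each line is paid for in a month."""
    if location == "persona":   # L1: every session x every response
        return lines * SESSIONS * RESPONSES
    if location == "index":     # L2: every session
        return lines * SESSIONS
    return lines * READS        # L3: only when Read

monthly_exposures(100, "persona")  # 240_000
monthly_exposures(100, "index")    #   6_000
monthly_exposures(100, "body")     #     500
```

Under these assumed counts, pruning 10 lines from the persona removes more monthly exposure than deleting an entire 300-line memory body, which is exactly why the persona sits at the top of the pruning priority.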
5. Traps — patterns where token economy goes wrong
5.1 "Just in case" inline
Adding one guard, one policy, one example to the persona because it might be useful someday. A year later the persona is 1000 lines. → Hold to the rule: "nothing is promoted to the persona until it has actually triggered twice."
5.2 Memory body copied into the index
Copying part of a memory body into the index as a "summary." The index itself inflates. → Index always carries one-line hooks only. Bodies stay in their files.
5.3 No periodic token review
Using it daily but with no idea where tokens go. → Schedule token audits. Which auto-loads are large, which memories go unused.
5.4 Sub results absorbed whole into main
Sub returns 1000 lines and main absorbs all of it. The reason to use a sub is lost. → Sub results are always split into a summary (for main) and the full output (for a file or the next sub call).
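The split as a sketch. The archive path is illustrative, and ideally the sub writes its own summary; the point is only that main receives the short string, never the full output.

```python
from pathlib import Path

def split_sub_result(full_output: str, summary: str,
                     archive: str = "artifacts/sub_result.txt") -> str:
    """Persist the full output; return only the summary to main."""
    Path(archive).parent.mkdir(parents=True, exist_ok=True)
    Path(archive).write_text(full_output)          # full: for files or the next sub call
    return f"{summary}\n(full output: {archive})"  # summary: all main ever sees
```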
5.5 Hitting the context limit
Tokens reach the end of the window and auto-compaction kicks in, or the oldest context is dropped. The dropped slice might be one the operator did not consciously track. → Always be aware of context usage; clean up at 80% utilization.
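The threshold as a one-liner, assuming the 100k window used throughout:

```python
CLEANUP_AT = 0.80  # the 80% threshold from the text

def should_cleanup(tokens_used: int, window: int = 100_000) -> bool:
    """Clean up deliberately before auto-compaction decides what to drop."""
    return tokens_used / window >= CLEANUP_AT
```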
6. Accumulated token economy — what changes
Changes since I started treating token economy as a first-class concern.
| Item | Before | After |
|---|---|---|
| Session-start context | Flat 60–70k | Index + persona = 15–20k |
| Tokens available for deep work | 30–40k | 80–85k |
| Frequency of context compaction | Often | Rarely |
| Decision quality on the same work | Average | Deep (less noise) |
| Response speed | Average | Fast |
The core is the last two rows: the same work unrolls into a deeper decision, and it unrolls faster. Token savings translate directly into decision quality.
7. Compressed into one principle
The core of token-economy design collapses into one sentence.
"Inline only what must be seen every session; auto-load what is seen often; on-demand Read what is seen occasionally; external for what should almost never be seen. The expensive mistake is the wrong layer."
When this holds, the token economy becomes the accelerator of decision quality. When it breaks, noise eats decision room and the collaborator gets confused about what matters.
The next post is the last one. It covers how all of this — persona, memory, skills, hooks, multi-agent, verification, tokens — actually starts and ends each day. Session protocols.