Per-Layer Embeddings — Trading Flash for DRAM on Edge Models
The Problem PLE Solves
A token’s embedding vector must simultaneously carry:
- Raw identity — what word this is (“bank”)
- Contextual potential — which meaning applies (money vs. river)
In wide models (hidden size 8,192+), both signals fit without interference. In narrow edge models (hidden size 1,536), they collide: the river sense and the money sense of "bank" compete for the same coordinates, and every downstream layer inherits that collision. No amount of downstream projection can unflatten meanings that were crushed in the initial lookup.
How PLE Works
Instead of a single shared embedding table, each decoder layer gets its own 256-dimensional lookup table. When a token arrives, every layer performs its own fresh lookup. The main hidden state no longer needs to preserve raw token identity across all layers, because each layer has a dedicated identity signal.
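A minimal PyTorch sketch of that idea follows. It assumes the per-layer identity signal is projected up to the hidden size and merged by simple addition; the class names, the stand-in transformer block, and the additive merge are illustrative assumptions, not Gemma's actual implementation.

```python
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    """One decoder layer that owns its own per-layer embedding (PLE) table."""
    def __init__(self, vocab_size: int, hidden_size: int, ple_dim: int = 256):
        super().__init__()
        self.ple_table = nn.Embedding(vocab_size, ple_dim)    # this layer's own lookup table
        self.ple_proj = nn.Linear(ple_dim, hidden_size, bias=False)  # lift 256-d signal to hidden size
        self.block = nn.Sequential(                            # stand-in for attention + MLP
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        identity = self.ple_proj(self.ple_table(token_ids))    # fresh identity lookup at this layer
        return hidden + self.block(hidden + identity)          # hidden state need not carry identity

class PLEDecoder(nn.Module):
    """Stack of PLE layers; token ids travel alongside the hidden state."""
    def __init__(self, vocab_size: int, hidden_size: int = 1536, num_layers: int = 4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, hidden_size)  # shared input embedding
        self.layers = nn.ModuleList(
            [PLEDecoderLayer(vocab_size, hidden_size) for _ in range(num_layers)]
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.tok_embed(token_ids)
        for layer in self.layers:
            hidden = layer(hidden, token_ids)   # every layer re-looks-up the same token ids
        return hidden

# Usage: out = PLEDecoder(vocab_size=32_000)(torch.randint(0, 32_000, (1, 8)))
```

The design choice to highlight: each layer's lookup depends only on the token id, not on the hidden state, which is what lets the tables be treated as static, read-only data rather than live activations.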
Cost: In Gemma 4 E2B, PLE tables consume 2.35B parameters — 46% of the entire 5.1B budget.
Why it works on phones: PLE tables are static lookups, read once per token per layer. They do not need fast DRAM. They sit in flash storage (128 GB available on a typical phone), making the 4.7 GB footprint effectively free.
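The back-of-the-envelope arithmetic behind those figures, assuming 2-byte (bf16/fp16) weight storage (the script is illustrative only):

```python
ple_params = 2.35e9        # PLE table parameters (figure from above)
total_params = 5.1e9       # full E2B parameter budget
bytes_per_param = 2        # assumes bf16 / fp16 storage

print(f"Share of budget: {ple_params / total_params:.0%}")               # ~46%
print(f"Flash footprint: {ple_params * bytes_per_param / 1e9:.2f} GB")   # ~4.70 GB
```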
Why servers skip it: At wider hidden sizes (2,816–5,376), the representational collision is not a fatal problem. Spending 46% of parameters on a technique that only pays off when DRAM is the binding constraint makes no sense on an 80 GB HBM server.
Evidence It Works
Gemma 4 E2B (2.3B effective parameters) scores 37.5% on AIME 2026 — beating Gemma 3 27B’s 20.8% despite being 12x smaller. This strongly suggests the narrow hidden state is free to reason instead of spending capacity remembering token identity.
Related Notes
- Edge vs Server Model Architecture - Why One DNA Cannot Serve Both — the constraint flip driving PLE adoption
- Google Gemma 4 Will Change How AI Is Deployed — source article
- AI Memory Crowding - HBM Eats Consumer Device Budgets — DRAM scarcity context