Per-Layer Embeddings — Trading Flash for DRAM on Edge Models
The Problem PLE Solves
A token’s embedding vector must simultaneously carry:
- Raw identity — what word this is (“bank”)
- Contextual potential — which meaning applies (money vs. river)
In wide models (hidden size 8,192+), both signals fit without interference. In narrow edge models (hidden size 1,536), they collide: the river sense and the money sense of "bank" compete for the same coordinates, and every downstream layer inherits that collision. No amount of downstream projection can unflatten meanings that were crushed in the initial lookup.
How PLE Works
Instead of a single shared embedding table, each decoder layer gets its own 256-dimensional lookup table. When a token arrives, every layer performs its own fresh lookup. The main hidden state no longer needs to preserve raw token identity across all layers, because each layer has a dedicated identity signal.
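A minimal PyTorch sketch of that idea follows. It assumes the per-layer identity signal is projected up to the hidden size and merged by simple addition; the class names, the stand-in transformer block, and the additive merge are illustrative assumptions, not Gemma's actual implementation.

```python
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    """One decoder layer that owns its own per-layer embedding (PLE) table."""
    def __init__(self, vocab_size: int, hidden_size: int, ple_dim: int = 256):
        super().__init__()
        self.ple_table = nn.Embedding(vocab_size, ple_dim)    # this layer's own lookup table
        self.ple_proj = nn.Linear(ple_dim, hidden_size, bias=False)  # lift 256-d signal to hidden size
        self.block = nn.Sequential(                            # stand-in for attention + MLP
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        identity = self.ple_proj(self.ple_table(token_ids))    # fresh identity lookup at this layer
        return hidden + self.block(hidden + identity)          # hidden state need not carry identity

class PLEDecoder(nn.Module):
    """Stack of PLE layers; token ids travel alongside the hidden state."""
    def __init__(self, vocab_size: int, hidden_size: int = 1536, num_layers: int = 4):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, hidden_size)  # shared input embedding
        self.layers = nn.ModuleList(
            [PLEDecoderLayer(vocab_size, hidden_size) for _ in range(num_layers)]
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.tok_embed(token_ids)
        for layer in self.layers:
            hidden = layer(hidden, token_ids)   # every layer re-looks-up the same token ids
        return hidden

# Usage: out = PLEDecoder(vocab_size=32_000)(torch.randint(0, 32_000, (1, 8)))
```

The design choice to highlight: each layer's lookup depends only on the token id, not on the hidden state, which is what lets the tables be treated as static, read-only data rather than live activations.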
Cost: In Gemma 4 E2B, PLE tables consume 2.35B parameters — 46% of the entire 5.1B budget.
Why it works on phones: PLE tables are static lookups, read once per token per layer. They do not need fast DRAM. They sit in flash storage (128 GB available on a typical phone), making the 4.7 GB footprint effectively free.
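The back-of-the-envelope arithmetic behind those figures, assuming 2-byte (bf16/fp16) weight storage (the script is illustrative only):

```python
ple_params = 2.35e9        # PLE table parameters (figure from above)
total_params = 5.1e9       # full E2B parameter budget
bytes_per_param = 2        # assumes bf16 / fp16 storage

print(f"Share of budget: {ple_params / total_params:.0%}")               # ~46%
print(f"Flash footprint: {ple_params * bytes_per_param / 1e9:.2f} GB")   # ~4.70 GB
```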
Why servers skip it: At wider hidden sizes (2,816–5,376), the representational collision is not a fatal problem. Spending 46% of parameters on a technique that only pays off when DRAM is the binding constraint makes no sense on an 80 GB HBM server.
Evidence It Works
Gemma 4 E2B (2.3B effective parameters) scores 37.5% on AIME 2026 — beating Gemma 3 27B’s 20.8% despite being 12x smaller. This strongly suggests the narrow hidden state is free to reason instead of spending capacity remembering token identity.
Related Notes
- Edge vs Server Model Architecture - Why One DNA Cannot Serve Both — the constraint flip driving PLE adoption
- Google Gemma 4 Will Change How AI Is Deployed — source article
- AI Memory Crowding - HBM Eats Consumer Device Budgets — DRAM scarcity context