First look: Gemma 4 E2B — Google's thinking model that runs on your laptop

I just ran `ollama run gemma4:e2b` on my poor little AMD laptop — no GPU, 16 GB of RAM soldered to the motherboard with zero upgrade path — and watched a 2-billion-parameter model go full internal monologue on me.
7.2 gigs pulled down, no CUDA required. Here are some first impressions.
What does Gemma 4 actually do differently?
Imagine most AI models as a single really smart person trying to answer every question themselves. Gemma 4 is more like a team of specialists sitting behind one desk. When a question comes in, only the experts who are relevant wake up and do the work — the rest stay quiet and save energy. Google calls this a Mixture-of-Experts (MoE) architecture.
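The "only the relevant experts wake up" idea is easy to sketch. Here's a toy top-k router in Python — the sizes, the random weights, and the softmax-over-chosen-experts mixing are illustrative stand-ins, not Gemma's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 8   # toy sizes, not Gemma's real configuration
TOP_K = 2       # only this many experts do work per token
D_MODEL = 16

# Each "expert" is just a small weight matrix in this sketch.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS))

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token to its top-k experts and mix their outputs."""
    logits = token @ router                  # score every expert
    top = np.argsort(logits)[-TOP_K:]        # pick the k best scorers
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over just the chosen k
    # Only the selected experts run; the other N_EXPERTS - TOP_K stay idle.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(D_MODEL))
print(out.shape)  # (16,)
```

The point of the sketch: the compute cost scales with TOP_K, not N_EXPERTS, which is exactly why a big-parameter-count model can stay cheap per token.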
The other big deal: this thing can see, hear, and read. Text, images, audio — all natively, no duct-taping a speech-to-text service onto the side. And it has a 128K token context window, so it can hold roughly the length of a novel in its working memory at once.
Oh, and it thinks out loud. Every response starts with a visible reasoning trace — step-by-step logic before the final answer. More on that in a second.
What does "E2B" mean?
E2B = Edge, 2 Billion (parameters). The "edge" part means it's designed to run on your device — a laptop, a phone, even a Raspberry Pi 5 — instead of needing a datacenter. Google achieved this with a trick called Per-Layer Embeddings (PLE): instead of one giant lookup table shared across the whole network, each layer gets its own tiny, specialized one. Think of it like giving every floor of a building its own concise cheat sheet instead of one massive encyclopedia in the lobby. The result is a model with 5.1B total parameters that only activates about 2.3B of them at inference time.
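The 5.1B-total / ~2.3B-active split is worth doing the arithmetic on. The figures come from the specs above; the reading that the gap is mostly those per-layer tables, which can sit in slower storage instead of the hot path, is my interpretation of the cheat-sheet analogy:

```python
# Numbers from the model specs: 5.1B total parameters, ~2.3B active at inference.
# Everything else here is back-of-envelope arithmetic, not Google's published math.
total = 5.1e9
active = 2.3e9

offloadable = total - active  # roughly the part that never has to be "hot" at once
print(f"{offloadable / 1e9:.1f}B parameters "
      f"(~{offloadable / total:.0%} of the model) can stay out of fast memory")
```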
Will it run on my machine?
| Precision | RAM needed |
|---|---|
| 4-bit (Q4) | ~3–4 GB |
| 8-bit (Q8) | ~5–8 GB |
| Full 16-bit | ~10 GB |
I can confirm: it runs on a budget AMD laptop with integrated graphics and 16 GB of non-upgradeable RAM. Ollama uses Q4 quantization by default, so it fits comfortably with room to spare. No GPU needed — it just runs on CPU, and while it's not blazing fast, it works. That's the whole point of an edge model.
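The table's numbers roughly follow from parameter count times bytes per weight. A quick sanity check — raw weight storage only, which is why real-world usage (KV cache, activations, runtime overhead) lands a bit higher than these figures:

```python
PARAMS = 5.1e9  # total parameter count from the specs above

def weight_gb(bits_per_weight: float) -> float:
    """Raw weight storage in GB, ignoring KV cache and runtime overhead."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("Q4", 4), ("Q8", 8), ("FP16", 16)]:
    print(f"{name:>4}: ~{weight_gb(bits):.1f} GB of weights")
```

Q4 works out to ~2.6 GB of raw weights, which is how a 5.1B-parameter model squeezes into the ~3–4 GB row once overhead is added.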
What's the deal with "kwont"?
You'll see people online saying things like "running the 4-bit quant." They mean quantization — and despite looking like a typo, it's a real term. Google's own inference memory guide breaks down the tradeoffs across BF16, SFP8, and Q4_0 precision.
A model's "brain" is billions of numbers (weights). By default each number is stored in high precision — like writing every price to 16 decimal places. Quantization is rounding those numbers to use fewer decimal places. A 4-bit quant rounds aggressively (fewer decimals, smaller file, less memory), while 16-bit keeps the full precision.
The tradeoff: lower precision = faster and lighter, but slightly less accurate. For a casual conversation or quick task you'd be hard-pressed to notice the difference at Q4. The Ollama default (Q4_K_M) is the sweet spot — roughly 55–60% less memory with minimal quality loss. If you want to go deeper, Unsloth's Gemma 4 guide covers the full quantization matrix.
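The rounding is concrete enough to demo in a few lines. This is a toy symmetric 4-bit scheme — 16 integer levels and one scale per block — not the fancier Q4_K_M grouping Ollama actually ships, but the round-trip error it shows is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(8).astype(np.float32)

# Toy 4-bit symmetric quantization: map each weight to an integer in [-8, 7]
# with a single shared scale. Real schemes (Q4_K_M etc.) use per-block scales
# and smarter grouping, but the principle is identical.
scale = np.abs(weights).max() / 7
q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # fits in 4 bits
restored = q.astype(np.float32) * scale

print("original :", np.round(weights, 3))
print("restored :", np.round(restored, 3))
print("max error:", float(np.abs(weights - restored).max()))
```

Each value ends up within half a quantization step of where it started — small individual errors that mostly wash out across billions of weights, which is why Q4 is so hard to notice in casual use.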
What's exciting to explore
- Image + text reasoning on-device. Feeding it screenshots and asking it to explain UI layouts or debug visual issues, without anything leaving my machine. Gemma 4 supports variable resolution and aspect ratio natively.
- Audio input. The E2B handles audio natively — voice memos, meeting recordings, transcription + summarization in one pass.
- Function calling. It supports native tool use and structured JSON output, which opens the door to local agentic workflows — a tiny model orchestrating scripts and APIs on your own hardware.
- The 128K context window. Dropping in an entire codebase file or long document and asking questions, locally.
- Thinking mode as a teaching tool. Watching the reasoning trace isn't just a novelty — it's a window into how the model arrives at answers. The configurable thinking mode lets you dial the reasoning depth up or down.
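Of those, function calling is the easiest one to poke at today. Ollama's chat endpoint accepts a `tools` array of JSON-schema function definitions; here's a sketch of the request payload (the `get_weather` tool is a made-up example, and the model tag is the one from earlier in this post):

```python
import json

# Hypothetical local tool we want to expose to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # made-up example tool, not a real API
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "gemma4:e2b",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "stream": False,
}

# With a local Ollama server running, this would POST to its chat endpoint:
#   requests.post("http://localhost:11434/api/chat", json=payload)
print(json.dumps(payload, indent=2))
```

If the model decides a tool is needed, the response carries a `tool_calls` entry instead of plain text — your script runs the function and feeds the result back as the next message. That loop, entirely on your own hardware, is the whole "local agent" pitch.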
The joy of watching it think about "hi"
My first prompt was the most boring thing possible: just "hi".
Six deliberate reasoning steps. For "hi." It analyzed my intent, brainstormed three candidate responses, evaluated them against a friendliness heuristic, ran a self-correction pass to trim the fat, and then replied. There's something genuinely charming about a model running on your own machine, weighing the philosophical implications of the word "hi" through a visible chain of thought, only to arrive at the most wholesome response possible.
It's like watching a golden retriever carefully consider the optimal way to greet you at the door.
This is running locally. For free. On a laptop that can't even upgrade its RAM. What a time to be tinkering.
Resources
A roundup of everything referenced above and the most useful links I've found for getting started with Gemma 4.
Official
- Gemma 4 announcement — Google's launch blog post with the full vision
- Gemma 4 model overview — specs, memory tables, architecture breakdown
- Gemma 4 model card — benchmarks, training details, safety evaluations
- Gemma on Ollama — one-command local install
- Gemma 4 on Hugging Face — all model weights and variants
- Gemma 4 on Kaggle — model downloads + notebooks
Capabilities deep-dives
- Thinking / reasoning mode — configuring the thinking trace depth
- Vision: image understanding — variable resolution, aspect ratios, OCR
- Vision: video understanding — video input on larger models
- Audio understanding — native speech recognition on E2B/E4B
- Function calling — tool use and structured output
Community & guides
- Welcome Gemma 4 — Hugging Face blog — deep technical walkthrough with code examples
- A Visual Guide to Gemma 4 — Maarten Grootendorst's illustrated architecture guide
- Running Gemma 4 locally with Ollama — step-by-step setup walkthrough
- Gemma 4 edge deployment — E2B/E4B on phones, Pi, and IoT
- Unsloth: Gemma 4 fine-tuning — quantization matrix and local training
- Gemma 4 hardware requirements — detailed RAM/VRAM breakdown by model size
- Google AI Studio — try the 31B model in-browser, no setup
License
Gemma 4 is released under Apache 2.0 — fully open, commercially permissive, no strings.