
DEV Community

Paige Bailey for Google AI

Hacking with multimodal Gemma 4 in AI Studio

We’re in an incredibly fun era for building. The friction between "I have a weird idea" and "I have a working prototype" is basically zero, especially with the release of Gemma 4, which is now available via the Gemini API and Google AI Studio.

Whether you want to deeply inspect model reasoning or you're just trying to build a pipeline to auto-caption an archive of historical web comics and obscure wiki trivia, you can now hit open-weights models directly from your code without needing to provision a massive GPU rig first.

Here’s a look at the architecture, how to use it, and how to go from the UI to production code in one click.

The Models: Apache 2.0, MoE, and 256k Context

Before we look at the API, the biggest detail about Gemma 4 is the license: it's released under Apache 2.0. This means total developer flexibility and commercial permissiveness. You can prototype with the Gemini API, and eventually run it anywhere from a local rig to your own cloud infrastructure.

The benchmarks are also genuinely impressive. The 31B model is currently sitting at #3 on the Arena AI text leaderboard, out-competing models many times its size.

When you drop into Google AI Studio, you'll see two primary models in the picker:

  • Gemma 4 31B IT: The flagship dense model. It has a massive 256K context window — perfect for dumping in entire codebases, massive log files, or huge JSON datasets.
  • Gemma 4 26B A4B IT: A Mixture-of-Experts (MoE) architecture. It's highly efficient, only activating roughly 4 billion parameters per inference. High throughput, lower cost.

(Note: There are also E2B and E4B "Edge" models meant for local on-device deployment that feature native audio input, but we're focusing on the AI Studio API today. I recommend that you go download and test the smaller models locally, though!)
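Picking between the two comes down to context needs versus cost. Here's a minimal sketch of that decision as a helper function — note that `gemma-4-31b-it` is the model ID shown in the generated code later in this post, while the MoE ID is an assumption based on the same naming pattern:

```typescript
// Model IDs: the dense flagship ID comes from AI Studio's generated code;
// the MoE ID is a hypothetical guess following the same naming convention.
const GEMMA_4_DENSE = 'gemma-4-31b-it';    // 256K context, flagship quality
const GEMMA_4_MOE = 'gemma-4-26b-a4b-it';  // assumed ID: ~4B active params, cheaper

/** Pick a Gemma 4 model based on cost sensitivity (a sketch, not official guidance). */
function pickGemmaModel(opts: { costSensitive: boolean }): string {
  // The MoE activates only ~4B parameters per inference, so it's the
  // natural default when throughput and cost matter more than peak quality.
  return opts.costSensitive ? GEMMA_4_MOE : GEMMA_4_DENSE;
}
```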

Multimodal Inputs + Chain of Thought

Text is great, but Gemma 4 is natively multimodal. Let's say you want to build a pipeline to reverse-engineer prompts from a folder of distinct images.

In AI Studio, you can drop images directly into the playground alongside your prompt.

The Prompt:

"Generate descriptions of each of these images, and a prompt that I could give to an image generation model to replicate each one."

Because the Gemma models support advanced reasoning, after you click Run, you can click the Thoughts toggle to step through the model's chain-of-thought process before it generates its final output.

If you love understanding the "why" behind model logic, or you're trying to debug why an agent went off the rails, this level of transparency is incredibly useful.
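You can get at the same transparency programmatically. Assuming the Gemini API convention where reasoning summaries come back as parts flagged with `thought: true` (enabled via `thinkingConfig: { includeThoughts: true }`), a small helper can split the thoughts from the final answer:

```typescript
// Minimal shape of a response part, assuming the Gemini API convention
// where thought-summary parts are flagged with `thought: true`.
interface Part {
  text?: string;
  thought?: boolean;
}

/** Separate chain-of-thought summaries from the final answer text. */
function splitThoughts(parts: Part[]): { thoughts: string[]; answer: string } {
  const thoughts = parts
    .filter(p => p.thought && p.text)
    .map(p => p.text as string);
  const answer = parts
    .filter(p => !p.thought && p.text)
    .map(p => p.text)
    .join('');
  return { thoughts, answer };
}
```

Useful when you're logging agent traces and want the reasoning in one column and the output in another.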

Shipping the code

The bridge between "playing around in a UI" and "writing a script" should be exactly one click. Once you have your prompt, your images, and your reasoning configuration dialed in perfectly, click the Get Code button in the top right corner.

You can grab the exact payload required for TypeScript, Python, Go, or standard cURL. Best of all, if you toggle "Include prompt/history", it automatically handles the base64 encoding of your images and explicitly sets the thinkingConfig parameters in the code for you.

Here's what the TypeScript output looks like when you want to use Gemma 4's reasoning capabilities via the SDK:

import { GoogleGenAI } from '@google/genai';

// Initialize the client
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Configure Gemma 4 reasoning logic
const config = {
  thinkingConfig: {
    thinkingLevel: 'HIGH',
  }
};

const response = await ai.models.generateContent({
  model: 'gemma-4-31b-it',
  contents: 'Tell me a fascinating, obscure story from internet history.',
  config: config
});

console.log(response.text);
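For the multimodal case, the generated code inlines your images as base64. Here's a sketch of building that payload yourself, assuming the `inlineData` part shape used by the `@google/genai` SDK (the file path and prompt are placeholders):

```typescript
import { readFileSync } from 'node:fs';

// Build a multimodal `contents` payload for ai.models.generateContent,
// assuming the @google/genai inlineData part shape.
function imagePrompt(imagePath: string, mimeType: string, prompt: string) {
  // Base64-encode the raw image bytes, as AI Studio's generated code does
  const data = readFileSync(imagePath).toString('base64');
  return [
    {
      role: 'user',
      parts: [
        { inlineData: { mimeType, data } }, // the image
        { text: prompt },                   // the instruction
      ],
    },
  ];
}

// Usage (sketch):
// const response = await ai.models.generateContent({
//   model: 'gemma-4-31b-it',
//   contents: imagePrompt('./comic.png', 'image/png',
//     'Describe this image and write a prompt to replicate it.'),
// });
```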

Go build open-source things!

Having Apache 2.0 open-weights models accessible via a fast API completely changes the calculus for weekend projects. Whether you're building a script to summarize deeply technical whitepapers, analyze visual data natively, or wire up autonomous multi-step code generation agents—the friction is basically gone.

I can't wait to see what you build! Let me know in the comments what rabbit hole you're pointing Gemma at first. Happy hacking this weekend. :)

Top comments (9)

Swift

The native multimodal capability is HUGE. I did a project with Gemma 3 last year and spent so much time working around preprocessing and ingesting non-text data.

One stat I've started looking at more closely with open models is the ratio of context window size per billion active parameters. Idea being that the higher the number, the less resources we need for larger and longer running tasks. Caveat being that Mixture of Experts models will have a much better ratio than dense models by definition, but still interesting.

| Model | Params (total / active) | Context | Context per Active B |
| --- | --- | --- | --- |
| Llama 4 Scout | 109B / 17B | 10M | 588K per B |
| Gemma 4 26B-A4B | 26B / 3.8B | 256K | 67K per B |
| DeepSeek Coder V2 Lite | 16B / 2.4B | 128K | 53K per B |
| Qwen 3.5-27B | 27B / 27B | 256K | 9K per B |
| Gemma 3 27B | 27B / 27B | 128K | 5K per B |
| DeepSeek R1 32B | 32B / 32B | 128K | 4K per B |
| Command R | 35B / 35B | 128K | 4K per B |
| Llama 3.3 70B | 70B / 70B | 128K | 2K per B |
| Qwen 3 MoE | 235B / 22B | 32K | 1.5K per B |
| Mistral Small 3 | 24B / 24B | 32K | 1.4K per B |
| Phi-4 | 14B / 14B | 16K | 1.1K per B |

Llama 4 Scout doesn't really have accurate benchmarks still and DeepSeek Coder V2 Lite is both code focused and has a more restrictive license. So the reality is nothing is close to Gemma 4 26B-A4B right now!
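The ratio in the table is just the context window divided by active parameters in billions. A quick sketch, using the figures as quoted above:

```typescript
// Context-per-active-billion-parameters: context window (tokens)
// divided by active parameters (in billions).
function contextPerActiveB(contextTokens: number, activeParamsB: number): number {
  return contextTokens / activeParamsB;
}

// Two rows from the table (figures as quoted in the comment):
console.log(Math.round(contextPerActiveB(256_000, 3.8) / 1000));    // Gemma 4 26B-A4B: ~67 (67K per B)
console.log(Math.round(contextPerActiveB(10_000_000, 17) / 1000));  // Llama 4 Scout: ~588 (588K per B)
```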

Winter

This is awesome, so excited to try it out!

Though I did have a thought: given the rapid emergence of new large language models, both open/closed source and locally deployable/cloud‑hosted, how should one approach selecting the most appropriate model for a particular task? Is the process primarily empirical testing, or is there a more systematic methodology?

In addition, under what circumstances would it be rational to choose a hosted, larger‑parameter Gemma model over a hosted Gemini model, assuming both are accessible via online APIs? At first glance this seems suboptimal if both are hosted, unless the hosted Gemma offering allows per‑tenant or per‑request weight customization that is not available with Gemini.

Romanch Roshan Singh

been trying to build a local-first desktop app for a while and handling images + text together has always been the tricky part. this might actually be a great fit for what I'm building!

Talen

The fact that Gemma 4 is released under Apache 2.0 is a massive win for the offline AI future. While the API is great for prototyping, having these open-weights means we can move multimodal capabilities to local edge devices where Wifi and service simply don't exist. The opportunity for building offline apps with native image/audio inference is unbelievable. This could be a vital bridge for technological access in humanitarian crises or war zones. Giving people the power of high-level reasoning and data processing locally when they’ve been systematically cut off from the global grid.

Sanidhya Goel

The easier it is to get hands-on with local LLMs, the easier it becomes to get AI into devices. Memory-efficient LLMs are the next big thing: the less infra they require, the cheaper they are to run and the easier adoption becomes.

Super excited to try this out

John Walters

How does the fact that these are open-weights models change things for me as a developer of these kinds of apps?

sneh1117

wow what a completely different perspective

Shayan King

Hello please help me
