Google launches Gemma 4, an on-device multimodal model
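Why it matters
Gemma 4 12B is Google's encoder-free unified multimodal model, pitched at high-performance inference running locally on a laptop. For AI engineers this considerably lowers the barrier to on-device deployment, and it can stand in for cloud API calls in some scenarios.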
Editor's take
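The encoder-free architecture is the key trade-off. Most multimodal models pair the language model with a separate vision encoder, typically a ViT; Llama 3.2 Vision works this way, and GPT-4V and Claude 3 are generally assumed to, although their architectures are not public. Gemma 4 instead sends image patches through only a lightweight embedding step and feeds them into the same transformer as the text tokens. That simplifies the pipeline, but it also places very high demands on training data quality, because the shared transformer has to learn visual features that a pretrained encoder would otherwise supply.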
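To make the "encoder-free" idea concrete, here is a minimal sketch of the kind of embedding module described in the comments quoted in the community section: one matrix multiplication per patch, a factorized X/Y position lookup, and a normalization. It is not Gemma 4's implementation. The patch size, model width, use of RMS normalization, the order of the steps, and every name (`patchify`, `rms_norm`, `embed_patches`) are assumptions chosen for illustration, and it uses pure Python so it runs without dependencies.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches.

    Returns (grid_row, grid_col, flat_pixels) tuples.
    Assumes H and W are exact multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            flat = [ch
                    for y in range(r, r + patch)
                    for x in range(c, c + patch)
                    for ch in image[y][x]]
            out.append((r // patch, c // patch, flat))
    return out


def rms_norm(vec, eps=1e-6):
    """Scale a vector so its root-mean-square is roughly 1 (assumed norm type)."""
    scale = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / scale for v in vec]


def embed_patches(image, patch, proj, x_table, y_table):
    """Turn an image into d_model-sized token vectors without a vision encoder."""
    d_model = len(proj[0])
    tokens = []
    for row, col, flat in patchify(image, patch):
        # Single matrix multiplication: (patch_dim,) @ (patch_dim, d_model).
        vec = [sum(f * proj[i][j] for i, f in enumerate(flat))
               for j in range(d_model)]
        # Factorized position: one lookup by column (X) plus one by row (Y).
        vec = [v + x_table[col][j] + y_table[row][j]
               for j, v in enumerate(vec)]
        tokens.append(rms_norm(vec))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    size, patch, channels, d_model = 8, 4, 3, 16  # hypothetical toy sizes
    grid = size // patch
    image = [[[rng.random() for _ in range(channels)]
              for _ in range(size)] for _ in range(size)]
    proj = [[rng.gauss(0, 1) for _ in range(d_model)]
            for _ in range(patch * patch * channels)]
    x_table = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(grid)]
    y_table = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(grid)]
    tokens = embed_patches(image, patch, proj, x_table, y_table)
    print(len(tokens), "image tokens of width", len(tokens[0]))
```

Nothing here involves attention or depth before the shared transformer: it is a linear map plus position information. Whether that still counts as an "encoder" is the question the comment thread argues over.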
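At 12B, the goal is not to compete on capability with 70B+ models but to claim the niche of "good enough, fully local, zero API cost." That position has mostly been held by Llama 3.2 Vision and the community around Apple's MLX; Google is now entering it with an official release.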
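If your product involves image understanding and you don't want to send user data to the cloud, this model is worth testing first. Pay attention to its actual memory footprint and quantization options: inference cost for a 12B text-only model and a 12B multimodal model is not the same, because every image adds tokens on top of the text prompt.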
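As a rough check on the memory question, the sketch below estimates the weight footprint of a 12B-parameter model at a few common precisions and compares it with a 16 GB machine, the configuration discussed in the comments. It counts weights only; the KV cache, activations, and image tokens come on top, and real quantized formats carry some extra overhead. The function names and the precision table are illustrative assumptions, not part of any Gemma tooling.

```python
# Back-of-envelope weight memory for a dense model at different precisions.
# Weights only: KV cache, activations, and image tokens add to this figure.

BITS_PER_PARAM = {"fp16": 16, "int8": 8, "int4": 4}  # assumed precision options


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_PARAM[precision]
    return n_params * bits / 8 / 1e9


def fraction_of_ram(n_params: float, precision: str, ram_gb: float) -> float:
    """Return the share of system memory taken by the weights alone."""
    return weight_memory_gb(n_params, precision) / ram_gb


if __name__ == "__main__":
    n_params = 12e9  # 12B parameters, as stated in the article
    for precision in BITS_PER_PARAM:
        gb = weight_memory_gb(n_params, precision)
        share = fraction_of_ram(n_params, precision, 16.0)
        print(f"{precision}: {gb:.1f} GB of weights, {share:.0%} of 16 GB")
```

At int8 this gives 12 GB, or 75% of a 16 GB machine, which matches the figure quoted in the comments and leaves little headroom for the operating system and the cache.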
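Community feedback
Opinions divided · 175 comments
Core debate: is "encoder-free" a marketing term or an architectural innovation? Does a projection layer count as an encoder?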
The big story here is the encoder-free part, which I still don't fully understand. > Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations. That's technically encoding, just without using
> That's technically encoding Isn't that just projecting the patches into the d_model size vectors that the models takes? >I am assuming that involves of quantization 12B model in 16GB seems very reasonable to me, int8 is top quality for running models.
The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input." 12B at int8 would take up 12G memory, or 75% of the system memory which technically fits within 16GB but t