Summarized by Dodly:
Google's Gemma 12B: Local AI Gets Smarter and Smaller
Summary
Google's new Gemma 12B model fills a gap in the on-device AI lineup, offering strong capabilities at a manageable size. The smaller E2B and E4B models were optimized for phones and for multimodal input such as audio and video, and the larger 31B models required more powerful hardware; the 12B model sits between them. A key innovation is its "unified" architecture, which integrates the vision and audio encoders directly into the model, reducing the need for separate components and making it more efficient. The unified 12B model outperforms some larger models on specific benchmarks and handles text, image, and audio input.

A significant advancement is the introduction of quantization-aware training (QAT) for this model. QAT trains the model with the effects of low-precision quantization taken into account, so it keeps most of its quality after quantization while requiring less memory and processing power. In tests, the Gemma 12B QAT model showed strong instruction following, successfully executing complex multi-step chains of tool calls, such as scraping websites, summarizing content, and generating PDFs with images, and it outperformed some competitors on similar tasks.

Audio and video input are limited to 30 and 60 seconds respectively, but this can be managed with pre-processing. The unified 12B model also offers a larger 256K context window than the earlier mobile-focused models. Overall, it is a significant step toward making powerful AI accessible on everyday devices.
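The summary does not say what the pre-processing for the 30-second audio and 60-second video limits looks like. One plausible approach, sketched below as an assumption rather than anything described in the source, is to split a longer recording into consecutive windows that each fit under the limit and send them to the model one at a time. The function name `split_into_windows` and the constants are hypothetical, and the sketch only computes time ranges; actually cutting the media would need a separate audio or video tool.

```python
from typing import List, Tuple

# Per-clip limits quoted in the summary (seconds).
AUDIO_LIMIT_S = 30.0
VIDEO_LIMIT_S = 60.0


def split_into_windows(duration_s: float, limit_s: float) -> List[Tuple[float, float]]:
    """Return consecutive (start, end) times in seconds that cover
    [0, duration_s], with no window longer than limit_s.

    Hypothetical helper: the chunking strategy is an assumption,
    not something the summarized source specifies.
    """
    if duration_s < 0 or limit_s <= 0:
        raise ValueError("duration must be >= 0 and limit must be > 0")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        # Clamp the final window to the end of the recording.
        end = min(start + limit_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


if __name__ == "__main__":
    # A 75-second audio clip needs three windows under a 30-second limit.
    print(split_into_windows(75.0, AUDIO_LIMIT_S))
```

Hard cuts at fixed times can split a word or a scene in half, so a real pipeline might prefer overlapping windows or cuts at silences; that refinement is left out here to keep the sketch short.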