IBM released Granite 4.0 3B Vision on March 31, a vision-language model engineered specifically for enterprise document data extraction rather than general image understanding. The model ships as a 0.5B parameter LoRA adapter that loads on top of IBM's Granite 4.0 Micro base model, creating a dual-mode system that can handle text-only requests without the vision overhead. Built with a SigLIP vision encoder and "DeepStack" architecture that injects visual features across 8 transformer layers, it focuses on three core tasks: converting charts to CSV/code, extracting tables to HTML/JSON, and pulling semantic key-value pairs from forms.
This represents a notable departure from the "bigger is better" multimodal trend. While companies chase GPT-4V and Gemini capabilities, IBM built something narrow and practical. The model was trained on ChartNet, a million-scale dataset focused on chart understanding, plus a "code-guided" pipeline that aligns plotting code with rendered images and their underlying data tables. That training approach matters: most vision models are terrible at structured extraction because they're optimized for natural-language descriptions, not precise data parsing.
The Apache 2.0 license and local deployment story differentiate this from cloud-only alternatives. Multiple sources highlight integration with IBM's Docling document parser and vLLM inference support, suggesting this targets teams building RAG systems or automated document pipelines who need to keep data on-premises. The 3B parameter count makes it feasible to run locally, and the modular LoRA design means you avoid loading vision weights for text-only tasks.
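In a RAG or document pipeline, a model that emits tables as HTML still needs a post-processing step to get row-level records. The snippet below is a generic, stdlib-only sketch of that step, assuming well-formed HTML table output; it is not Docling or IBM code, and the example table is invented.

```python
from html.parser import HTMLParser

class TableToRecords(HTMLParser):
    """Collect the cells of a simple HTML table into rows of strings.
    Generic post-processing sketch, assuming well-formed <tr>/<td> markup."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], [], None

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._cell = []          # start buffering cell text

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

def table_html_to_json(html: str) -> list[dict]:
    """Treat the first row as headers and map the rest to dicts."""
    parser = TableToRecords()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical model output for a chart/table extraction request
html = ("<table><tr><th>Region</th><th>Revenue</th></tr>"
        "<tr><td>EMEA</td><td>1.2M</td></tr></table>")
records = table_html_to_json(html)
# records == [{"Region": "EMEA", "Revenue": "1.2M"}]
```

Records in this shape drop straight into a JSON store or a retrieval index, which is where the Docling/vLLM integration story points.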
For developers dealing with enterprise document processing, this could be significant. Most existing solutions either make expensive API calls to frontier models or struggle with the structured-extraction accuracy that enterprise workflows demand. A locally runnable, Apache-licensed model that handles complex tables and charts reliably fills a real gap, assuming it delivers on IBM's accuracy claims.
