Microsoft Phi-3 Vision vs. Google PaliGemma: The Open AI Showdown

DALL·E 2024 05 30 09.43.14 A fun and playful representation of an open source showdown between PaliGemma and Phi 3 Vision. On the left a friendly AI robot with the PaliGemma lo

Fresh from the release of PaliGemma, Microsoft would not be outdone with their own new open source vision language model, Phi-3 Vision. I am keen to dive into how they differ and which one is best for our customers. While the primary use case and the data used for training these vision language models is understanding graphs, diagrams and user screens, it’s amazing to see how the current rapid rate of advancement in VLM’s is poliferating into adjacent fields such as video analytics.

TLDR : New open source model from Microsoft called Phi3-Vision, shows promise for a small open-source vision language model. Initial testing for optical character recognition (OCR) is slightly less accurate but much faster, will fine tune and train to increase accuracy so we can deploy to customer sites to drive more automation. Faster = less compute = cheaper solutions for our customers.

Benefits of Microsoft’s Phi-3 Vision Model

Microsoft’s Phi-3 Vision, a 4.2 billion parameter multimodal model, offers several significant benefits:

  • Compact yet Powerful: Despite its smaller size, Phi-3 Vision demonstrates performance on par with larger models, thanks to a meticulously curated training dataset. This makes it efficient and capable of running on devices with limited computational resources, including modern smartphones.
  • Advanced Multimodal Capabilities: The model excels at processing and understanding both text and images. It integrates a CLIP ViT-L/14 image encoder with a phi-3-mini-128K-instruct transformer decoder, allowing it to handle high-resolution images and various aspect ratios dynamically.
  • Diverse Pre-Training Data: Phi-3 Vision was trained on a comprehensive dataset that includes interleaved image-text documents, synthetic OCR data from PDFs, and datasets for chart and table comprehension. This extensive and varied training enhances its ability to handle a wide range of visual and textual inputs.
  • Enhanced Reasoning and Contextual Understanding: The model’s training methodology focuses on reasoning and contextual understanding, which are crucial for applications requiring high-level cognitive capabilities, such as summarising complex scenes or interpreting intricate charts.
  • Open-Source Flexibility: Similar to PaliGemma, Phi-3 Vision is also open-source. This allows for extensive customization and fine-tuning to meet specific project requirements. Developers can leverage the open-source nature to adapt the model for unique applications and optimise its performance for particular tasks.
table comparison

                                  Source –

Architectural Differences Between PaliGemma and Phi-3 Vision

While both PaliGemma and Phi-3 Vision are designed to handle multimodal inputs, their architectures reflect different design philosophies and capabilities:

  1.  Size and Deployment:
    1. PaliGemma is a larger open-source model optimised for flexibility and customization, making it suitable for a variety of high-performance applications.
    2. Phi-3 Vision, with its smaller size, is optimised for efficiency and can run on devices with limited resources, such as smartphones, without sacrificing performance.
  2. Model Components:
    1. PaliGemma leverages a transformer-based architecture that excels at integrating text and image inputs, allowing for complex scene understanding and contextual generation.
    2. Phi-3 Vision combines a CLIP-based image encoder with a transformer decoder, allowing seamless integration and efficient processing of mixed text and image inputs.
  3. Training Data and Methodology:
    1. PaliGemma benefits from its open-source nature, enabling extensive customisation and fine-tuning with domain-specific data.
    2. Phi-3 Vision utilises a unique training dataset composed of heavily filtered web data and synthetic data, focusing on maintaining a balance between model size and performance through an optimal data regime.

Optical Character Recognition

Demonstrates strong OCR capabilities, leveraging its advanced image-text processing to excel in reading text from various visual contexts. The model’s ability to understand and contextualise text within images further enhances its OCR performance, making it suitable for diverse applications like document digitisation and automated data extraction. For pure licence plate recognition it was not quite as accurate as PaliGemma on blurrier images, tight angles or dirty plates.

How Phi-3 Vision Could Help Our Customers

At DDI Labs, the integration of Phi-3 Vision could significantly enhance our capabilities and offerings across various industries:

Improving License Plate Recognition – Utilising Phi-3 Vision’s superior OCR capabilities, we can enhance the accuracy and efficiency of our licence plate recognition systems. This is crucial for automation in industries like parking management and vehicle tracking.

Automating Edge Cases – With the ability to process both visual and textual inputs, Phi-3 Vision can handle complex scenarios, such as drivers arriving at a site without a formal booking. The model can scan additional documentation and extract necessary information, determining site access or denial seamlessly.

Smart Surveillance Systems – Phi-3 Vision’s ability to understand and contextualise visual data can significantly enhance our security systems. It can identify suspicious activities and understand the context of scenes captured by cameras, generating detailed daily reports in natural language for security teams to review.

Customisation and Flexibility -Being open-source, Phi-3 Vision invites innovation and collaboration. This allows us to tailor the model to our specific needs, unlocking new possibilities in automation and AI-driven solutions. Whether enhancing licence plate recognition systems or developing advanced automation solutions, Phi-3 Vision’s capabilities show immense promise.

P.S here is the original image Dalle generated from the title of this blog.