Power of PaliGemma


I’m thrilled to share my insights on PaliGemma, a new open-source AI model from Google. The model is interesting because it is multi-modal: it takes both text and vision inputs and generates natural-language outputs. While its announcement was overshadowed by larger closed-source models such as ChatGPT and Gemini, I am keen to dive into how we might be able to use PaliGemma at DDI.

TL;DR: PaliGemma is a new open-source model from Google that shows real promise as a small vision-language model. Initial testing on optical character recognition (OCR) has been highly accurate; the next step is to fine-tune and train it so we can deploy it to customer sites and drive more automation.

Key Features

Comprehensive Image Understanding

PaliGemma can identify and segment objects within an image, such as cars, people, or animals, and provide detailed context. This sets it apart from most other large language models trained for image input, which tend to perform very poorly at detecting where in an image an object is. While these models still have a way to go before they outperform traditional convolutional neural networks at pure detection, their ability to provide further context for objects in images or videos is remarkable. For example: “this vehicle appears unsafe due to no tarp over load”. Normally you would have to train a detector to find the truck in the scene, pass its output to a classifier to decide tarp/no tarp, and finally hard-code the logic for a safety flag. With PaliGemma, I was able to simply define its role in assessing safety and pass in a single image to generate the correct response. This saves hours and cost on custom AI projects for our customers.
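To make the saving concrete, here is a minimal sketch of the traditional multi-stage pipeline described above, written in plain Python. The detector and classifier outputs are hypothetical stand-ins (the `label` and `tarp_covered` fields are my own invention, not a real API), but they show how much hand-written glue the old approach needs:

```python
# Sketch of the traditional pipeline: detector -> classifier -> hard-coded flag.
# The detection dictionaries are hypothetical stand-ins for model outputs.

def flag_unsafe_loads(detections):
    """Return a safety message for every truck whose load is not covered."""
    flags = []
    for det in detections:
        # Stage 1 output: the object detector found a truck in the scene.
        if det["label"] != "truck":
            continue
        # Stage 2 output: a separate classifier judged tarp / no tarp.
        if not det["tarp_covered"]:
            # Stage 3: a hand-coded business rule produces the safety flag.
            flags.append(f"vehicle {det['id']} appears unsafe: no tarp over load")
    return flags

detections = [
    {"id": "A12", "label": "truck", "tarp_covered": False},
    {"id": "B07", "label": "car", "tarp_covered": True},
]
print(flag_unsafe_loads(detections))
```

With PaliGemma, all three stages collapse into a single image plus a prompt along the lines of “assess whether this vehicle is safe to enter site”, which is exactly where the time and cost savings come from.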

Open-Source Flexibility

Unlike closed models such as ChatGPT and Gemini, PaliGemma is open-source, which allows me to fine-tune and customise it for specific tasks. This flexibility means I can adapt the model to our unique needs, making it an invaluable tool for a range of projects. It also means that as models advance, we can swap out the vision or language component for a more accurate result.

Image Source: https://github.com/google-research/big_vision/raw/main/big_vision/configs/proj/paligemma/paligemma.png
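One way to keep that swap-out flexibility cheap in our own application code is to put the model behind a small interface, so a better vision or language backbone can be dropped in later without touching the callers. A minimal sketch, where the class and method names are my own and not part of any PaliGemma API:

```python
from typing import Protocol


class VisionLanguageModel(Protocol):
    """Anything that maps (image bytes, prompt) to a text answer."""
    def generate(self, image: bytes, prompt: str) -> str: ...


class SafetyChecker:
    """Application code depends only on the interface, not a specific model."""
    def __init__(self, model: VisionLanguageModel):
        self.model = model

    def assess(self, image: bytes) -> str:
        return self.model.generate(image, "Is this vehicle safe to enter site?")


class StubModel:
    """Stand-in for a real PaliGemma wrapper; swap in a newer model later
    by providing any object with the same generate() method."""
    def generate(self, image: bytes, prompt: str) -> str:
        return "unsafe: no tarp over load"


checker = SafetyChecker(StubModel())
print(checker.assess(b""))
```

The design choice here is simply dependency injection: the day a stronger open-source model arrives, only the wrapper class changes.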

Advanced Optical Character Recognition (OCR)

In my initial tests, PaliGemma’s OCR capabilities have proven to be among the best in the open-source domain. This makes it an excellent tool for applications like licence plate recognition, where high accuracy is crucial for automation across industries. There is also scope to read permits, licences, work orders and dockets, driving automation through accurate data capture.
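Whichever model produces the OCR text, the raw string still needs cleaning and validation before it can drive automation such as gate control. A minimal sketch, assuming a simple alphanumeric plate format — real plate formats vary by jurisdiction, so the pattern here is illustrative only:

```python
import re

# Illustrative pattern: 1-6 uppercase letters/digits, e.g. "ABC123".
# Real plate formats differ by jurisdiction; adjust per deployment.
PLATE_PATTERN = re.compile(r"^[A-Z0-9]{1,6}$")


def normalise_plate(ocr_text: str):
    """Clean raw OCR output and return the plate, or None if it fails validation."""
    # OCR commonly introduces spaces, hyphens and lowercase characters.
    cleaned = re.sub(r"[\s\-]", "", ocr_text.upper())
    return cleaned if PLATE_PATTERN.fullmatch(cleaned) else None


print(normalise_plate("abc 123"))    # cleaned and accepted
print(normalise_plate("??NOISE!!"))  # rejected as invalid
```

The same normalise-then-validate step applies to permit numbers and docket references: the model does the reading, but a deterministic check decides whether the result is trustworthy enough to act on.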

How PaliGemma could help DDI Labs

I am always on the lookout for cutting-edge AI solutions that can solve real-world problems across various industries. PaliGemma has presented numerous opportunities for us to enhance our capabilities and offerings.

Enhanced Optical Character Recognition (OCR) – PaliGemma has demonstrated exceptional performance in OCR, particularly in reading and interpreting complex texts in challenging conditions. This capability allows me to:

  1. Improve Licence Plate Recognition: Utilise PaliGemma’s superior OCR capabilities to enhance the accuracy and efficiency of our licence plate recognition systems. 
  2. Automate edge cases: By combining a driver’s voice input with an image of their vehicle, we can automate edge cases where drivers arrive at site without a formal booking. The ability to scan additional documentation and extract key information would be crucial in deciding whether to grant or deny site access.
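The edge-case flow in point 2 can be sketched as a simple decision function that combines whatever the model extracts from the driver’s documents with a lookup against the booking system. All of the field names and rules below are hypothetical placeholders for whatever a real site would actually require:

```python
def decide_access(extracted: dict, bookings: set) -> str:
    """Grant, deny, or escalate based on fields extracted from scanned documents.

    `extracted` stands in for the fields a vision-language model pulled from
    the driver's paperwork; the keys used here are invented for this sketch.
    """
    booking_ref = extracted.get("booking_ref")
    if booking_ref in bookings:
        return "grant"
    # No formal booking: fall back to supporting documentation.
    if extracted.get("work_order") and extracted.get("site_induction"):
        return "escalate to site manager"
    return "deny"


bookings = {"BK-1001", "BK-1002"}
print(decide_access({"booking_ref": "BK-1001"}, bookings))
print(decide_access({"work_order": "WO-55", "site_induction": True}, bookings))
print(decide_access({}, bookings))
```

Keeping the decision rule in plain code like this, with the model only supplying the extracted fields, also makes the grant/deny logic auditable for the customer.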

Next week, we will look at the latest open-source model from Microsoft, Phi-3 Vision.