A Weekend AI Project: Making a Visual Assistant for People with Vision Impairments

by Dmitrii Eliuseev | Feb, 2024


Running a multimodal LLaVA model, camera, and speech synthesis

Image by Enoc Valenzuela, Unsplash

Modern large multimodal models (LMMs) can process not only text but also different types of data. Indeed, “a picture is worth a thousand words,” and this functionality can be crucial for interacting with the real world. In this “weekend project,” I will use a free LLaVA (Large Language-and-Vision Assistant) model, a camera, and a speech synthesizer; we will make an AI assistant that can help people with vision impairments. In the same way as in previous parts, all components will run fully offline without any cloud cost.

Without further ado, let’s get into it!

Components

In this project, I will use several components:

  • A LLaVA model, which combines a large language model and a visual encoder with the help of a special projection matrix. This allows the model to understand not only text but also image prompts. I will be using the LlamaCpp library to run the model (despite its name, it can run not only LLaMA but also LLaVA models).
  • The Streamlit Python library, which allows us to make an interactive UI. Using the camera, we can take a picture and ask the LMM different questions about it (for example, we can ask the model to describe the image).
  • A TTS (text-to-speech) model, which will convert the LMM’s answer into speech so that a person with a vision impairment can listen to it. For the text conversion, I will use an MMS-TTS (Massively Multilingual Speech TTS) model made by Facebook; see the sketch after this list.
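To make the last two pieces concrete, here is a minimal sketch of the camera UI and the TTS part, assuming Streamlit’s built-in st.camera_input widget and the facebook/mms-tts-eng checkpoint from the Transformers library; the layout and the placeholder answer are my assumptions, not the final app:

```python
# pip install streamlit transformers torch
import streamlit as st
import torch
from transformers import AutoTokenizer, VitsModel

MMS_TTS_MODEL = "facebook/mms-tts-eng"  # English MMS-TTS checkpoint

@st.cache_resource
def load_tts():
    """Load the MMS-TTS model once and keep it cached between Streamlit reruns."""
    model = VitsModel.from_pretrained(MMS_TTS_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(MMS_TTS_MODEL)
    return model, tokenizer

def text_to_speech(text: str):
    """Convert a text answer into a (waveform, sampling_rate) pair."""
    model, tokenizer = load_tts()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform
    return waveform[0].numpy(), model.config.sampling_rate

st.title("Visual Assistant")
image_file = st.camera_input("Take a picture")  # built-in Streamlit camera widget
if image_file is not None:
    # Placeholder: the LLaVA call that produces the answer is covered below
    answer = "A description of the image would go here."
    st.write(answer)
    waveform, rate = text_to_speech(answer)
    st.audio(waveform, sample_rate=rate)
```

Starting this with streamlit run app.py opens the page in a local browser; the same URL is also reachable from a smartphone on the local network.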

As promised, all listed components are free to use, don’t need any cloud API, and can work fully offline. From a hardware perspective, the model can run on any Windows or Linux laptop or tablet (an 8 GB GPU is recommended but not mandatory), and the UI can work in any browser, even on a smartphone.

Let’s get started.

LLaVA

LLaVA (Large Language-and-Vision Assistant) is an open-source large multimodal model that combines a vision encoder and an LLM for visual and language understanding. As mentioned before, I will use LlamaCpp to load the model. This…
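As a rough illustration of how that loading looks with the llama-cpp-python bindings (the GGUF and projector file names below are placeholders, and the exact arguments may vary between library versions):

```python
# pip install llama-cpp-python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The chat handler loads the CLIP projector that maps image features into the LLM
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-7b-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,       # increased so the image embedding fits into the context
    logits_all=True,  # required by some versions of the bindings for LLaVA
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

# Images are passed as URLs or base64 data URIs inside the chat messages
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who describes images."},
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "data:image/jpeg;base64,<BASE64_IMAGE>"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        },
    ]
)
print(response["choices"][0]["message"]["content"])
```

In the app, the JPEG bytes returned by st.camera_input can be base64-encoded and placed into the data URI above.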
