Easy-to-use AI that explains images

MLBoy
3 min read · Jul 2, 2023


An AI that describes images properly

Give it an image and it returns a text description. It uses a library called LAVIS, and it's super easy to use.

Usage

install

pip install salesforce-lavis
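
If the PyPI package lags behind the newest models, you can also install from source, following the repository README:

git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .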

List of available models

from lavis.models import model_zoo
print(model_zoo)

==================================================
Architectures                  Types
==================================================
albef_classification           ve
albef_feature_extractor        base
albef_nlvr                     nlvr
albef_pretrain                 base
albef_retrieval                coco, flickr
albef_vqa                      vqav2
alpro_qa                       msrvtt, msvd
alpro_retrieval                msrvtt, didemo
blip_caption                   base_coco, large_coco
blip_classification            base
blip_feature_extractor         base
blip_image_text_matching       base, large
blip_nlvr                      nlvr
blip_pretrain                  base
blip_retrieval                 coco, flickr
blip_vqa                       vqav2, okvqa, aokvqa
blip2_opt                      pretrain_opt2.7b, pretrain_opt6.7b, caption_coco_opt2.7b, caption_coco_opt6.7b
blip2_t5                       pretrain_flant5xl, pretrain_flant5xl_vitL, pretrain_flant5xxl, caption_coco_flant5xl
blip2_feature_extractor        pretrain, pretrain_vitL, coco
blip2                          pretrain, pretrain_vitL, coco
blip2_image_text_matching      pretrain, pretrain_vitL, coco
pnp_vqa                        base, large, 3b
pnp_unifiedqav2_fid
img2prompt_vqa                 base
clip_feature_extractor         ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
clip                           ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
gpt_dialogue                   base

execution

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load sample image
raw_image = Image.open("cosplay_girl.jpg").convert("RGB")

# loads BLIP caption base model, with finetuned checkpoints on the MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate caption
model.generate({"image": image})

‘a person is pouring syrup on a stack of pancakes’
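
generate also accepts sampling options (documented in the LAVIS README), so you can ask for several candidate captions instead of a single beam-search result. A quick sketch:

# sample three diverse captions with nucleus sampling instead of beam search
model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)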

ask about the image

from lavis.models import load_model_and_preprocess
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
# ask a question about the image
question = "What color is this person's hair?"
# preprocess the image and the question text
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)
model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")

‘berries’
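
Every architecture in the model zoo is loaded through the same load_model_and_preprocess call, so trying another model is just a matter of picking a name and model_type pair from the list above. For example, a BLIP-2 captioner (a sketch assuming the checkpoints listed earlier and enough GPU memory for the 2.7B language model):

# BLIP-2 with an OPT-2.7B language model, finetuned on COCO captioning
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
model.generate({"image": image})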

🐣

I'm a freelance engineer.
For work inquiries, feel free to contact me with a brief description of the project.
rockyshikoku@gmail.com

I build applications using machine learning and AR technology.

I also post information about machine learning and AR.

GitHub

Twitter
Medium
