Building an Image Similarity Search Engine with FAISS and CLIP | by Lihi Gur Arie, PhD | Aug, 2024

A guided tutorial explaining how to search your image dataset with text or image queries, using CLIP embeddings and FAISS indexing.

Image generated by the author on the Flux-Pro platform

Have you ever wanted to find a particular image in an endless image dataset, but found searching for it too tedious? In this tutorial we'll build an image similarity search engine to easily find images using either a text query or a reference image. For your convenience, the complete code for this tutorial is provided at the bottom of the article as a Colab notebook.

Pipeline Overview

The semantic meaning of an image can be represented by a numerical vector called an embedding. Comparing these low-dimensional embedding vectors, rather than the raw images, allows for efficient similarity searches. For each image in the dataset, we'll create an embedding vector and store it in an index. When a text query or a reference image is provided, its embedding is generated and compared against the indexed embeddings to retrieve the most similar images.

Here's a brief overview:

  1. Embedding: The embeddings of the images are extracted using the CLIP model.
  2. Indexing: The embeddings are stored in a FAISS index.
  3. Retrieval: With FAISS, the embedding of the query is compared against the indexed embeddings to retrieve the most similar images.

CLIP Model

The CLIP (Contrastive Language-Image Pre-training) model, developed by OpenAI, is a multi-modal vision and language model that maps images and text to the same latent space. Since we will use both image and text queries to search for images, we will use the CLIP model to embed our data. For further reading about CLIP, you can check out my previous article here.
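To make the shared latent space concrete, here is a minimal sketch (assuming the sentence-transformers package and a placeholder local file example.jpg) that embeds a caption and an image with the same model and compares them with cosine similarity:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# A caption and an image are mapped into the same embedding space
text_emb = model.encode('a photo of a dog')
image_emb = model.encode(Image.open('example.jpg'))  # placeholder image path

# Higher cosine similarity = closer semantic meaning
print(util.cos_sim(text_emb, image_emb))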

FAISS Index

FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta. It is built around the Index object, which stores the database embedding vectors. FAISS enables efficient similarity search and clustering of dense vectors, and we will use it to index our dataset and retrieve the images that most resemble the query.
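As a quick, standalone illustration of the core FAISS API (separate from the tutorial pipeline, with arbitrary random vectors standing in for real embeddings):

import faiss
import numpy as np

d = 512                                                 # embedding dimension
db_vectors = np.random.rand(100, d).astype(np.float32)  # dummy database vectors

index = faiss.IndexFlatIP(d)                            # exact inner-product index
index.add(db_vectors)                                   # store the database vectors

query = np.random.rand(1, d).astype(np.float32)
scores, ids = index.search(query, 5)                    # retrieve the 5 most similar vectors
print(ids, scores)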

Step 1 — Dataset Exploration

To create the image dataset for this tutorial, I collected 52 images of various subjects from Pexels. To get a feel for the data, let's look at 10 random images:
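The grid itself was rendered in the notebook; a minimal sketch for displaying 10 random images (assuming matplotlib is installed and that the placeholder path below points at the dataset folder) could look like this:

import os
import random
from glob import glob

import matplotlib.pyplot as plt
from PIL import Image

IMAGES_PATH = '/path/to/images/dataset'  # placeholder path to the dataset folder

image_paths = glob(os.path.join(IMAGES_PATH, '**/*.jpg'), recursive=True)
sample_paths = random.sample(image_paths, 10)

# Show the sampled images in a 2 x 5 grid
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for ax, img_path in zip(axes.flatten(), sample_paths):
    ax.imshow(Image.open(img_path))
    ax.axis('off')
plt.show()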

Step 2 — Extract CLIP Embeddings from the Image Dataset

To extract CLIP embeddings, we'll first load the CLIP model using the HuggingFace SentenceTransformer library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('clip-ViT-B-32')

Next, we'll create a function that iterates through our dataset directory with glob, opens each image with PIL's Image.open, and generates an embedding vector for each image with CLIP's model.encode. It returns a list of the embedding vectors and a list of the paths of the images in our dataset:

import os
from glob import glob

from PIL import Image

def generate_clip_embeddings(images_path, model):
    # Collect all .jpg files in the dataset directory (including subfolders)
    image_paths = glob(os.path.join(images_path, '**/*.jpg'), recursive=True)

    embeddings = []
    for img_path in image_paths:
        image = Image.open(img_path)
        # Encode the image into a CLIP embedding vector
        embedding = model.encode(image)
        embeddings.append(embedding)

    return embeddings, image_paths

IMAGES_PATH = '/path/to/images/dataset'

embeddings, image_paths = generate_clip_embeddings(IMAGES_PATH, model)

Step 3 — Generate FAISS Index

The next step is to create a FAISS index from the list of embedding vectors. FAISS offers several distance metrics for similarity search, including Inner Product (IP) and L2 (Euclidean) distance.

FAISS also offers various indexing options. It can use approximation or compression techniques to handle large datasets efficiently while balancing search speed and accuracy. In this tutorial we will use a 'Flat' index, which performs a brute-force search by comparing the query vector against every single vector in the dataset, guaranteeing exact results at the cost of higher computational complexity.
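For reference, larger datasets often use an approximate index instead. A rough sketch of one such option, an IVF index that clusters the vectors and searches only the nearest clusters (not used in this tutorial; it assumes the embeddings list from Step 2):

import faiss
import numpy as np

vectors = np.array(embeddings).astype(np.float32)
dimension = vectors.shape[1]

nlist = 8                       # number of clusters; tune to the dataset size
quantizer = faiss.IndexFlatIP(dimension)
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

ivf_index.train(vectors)        # learn the cluster centroids
ivf_index.add(vectors)
ivf_index.nprobe = 2            # number of clusters to visit at search time

With our small dataset, the exact Flat index below is both simple and fast enough: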

import faiss
import numpy as np

def create_faiss_index(embeddings, image_paths, output_path):
    dimension = len(embeddings[0])
    index = faiss.IndexFlatIP(dimension)
    index = faiss.IndexIDMap(index)

    vectors = np.array(embeddings).astype(np.float32)

    # Add vectors to the index with sequential IDs
    index.add_with_ids(vectors, np.array(range(len(embeddings))))

    # Save the index to disk
    faiss.write_index(index, output_path)
    print(f"Index created and saved to {output_path}")

    # Save the image paths alongside the index
    with open(output_path + '.paths', 'w') as f:
        for img_path in image_paths:
            f.write(img_path + '\n')

    return index

OUTPUT_INDEX_PATH = "/content/vector.index"
index = create_faiss_index(embeddings, image_paths, OUTPUT_INDEX_PATH)

faiss.IndexFlatIP initializes an index for Inner Product similarity, wrapped in a faiss.IndexIDMap to associate each vector with an ID. Next, index.add_with_ids adds the vectors to the index with sequential IDs, and the index is saved to disk together with the image paths.

The index can be used immediately or saved to disk for future use. To load the FAISS index, we will use the following function:

def load_faiss_index(index_path):
    index = faiss.read_index(index_path)
    with open(index_path + '.paths', 'r') as f:
        image_paths = [line.strip() for line in f]
    print(f"Index loaded from {index_path}")
    return index, image_paths

index, image_paths = load_faiss_index(OUTPUT_INDEX_PATH)

Step 4 — Retrieve Images by a Text Query or a Reference Image

With our FAISS index built, we can now retrieve images using either text queries or reference images. If the query is an image path, it is opened with PIL's Image.open. Next, the query embedding vector is extracted with CLIP's model.encode.

def retrieve_similar_images(query, model, index, image_paths, top_k=3):
    # Query preprocessing: if the query is an image path, load the image;
    # otherwise treat it as a text query
    if query.endswith(('.png', '.jpg', '.jpeg', '.tiff', '.bmp', '.gif')):
        query = Image.open(query)

    # Embed the query with CLIP and shape it as a (1, dim) float32 array
    query_features = model.encode(query)
    query_features = query_features.astype(np.float32).reshape(1, -1)

    distances, indices = index.search(query_features, top_k)

    retrieved_images = [image_paths[int(idx)] for idx in indices[0]]

    return query, retrieved_images

The retrieval happens in the index.search method, which performs a k-Nearest Neighbors (kNN) search to find the k vectors most similar to the query vector. We can adjust the value of k by changing the top_k parameter. In our implementation the similarity score is the inner product between the query and the indexed vectors, which is equivalent to cosine similarity when the embeddings are L2-normalized. The function returns the query and a list of the retrieved image paths.
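If you want the scores to be true cosine similarities, one option (an addition to the code above, not part of the original pipeline) is to L2-normalize both the database vectors and the query with FAISS before adding and searching:

import faiss
import numpy as np

# Normalize the database vectors in place before index.add_with_ids(...)
vectors = np.array(embeddings).astype(np.float32)
faiss.normalize_L2(vectors)

# Normalize the query the same way before index.search(...)
query_features = model.encode('ball').astype(np.float32).reshape(1, -1)
faiss.normalize_L2(query_features)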

Search with a Text Query:

Now we are ready to examine the search results. The helper function visualize_results displays the query and the retrieved images; you can find it in the accompanying Colab notebook.
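If you prefer to follow along without the notebook, a minimal stand-in for visualize_results (assuming matplotlib, and that image queries arrive as PIL images while text queries are plain strings) might look like this:

import matplotlib.pyplot as plt
from PIL import Image

def visualize_results(query, retrieved_images):
    # Show the retrieved images side by side, titled with the query
    fig, axes = plt.subplots(1, len(retrieved_images),
                             figsize=(5 * len(retrieved_images), 5), squeeze=False)
    title = query if isinstance(query, str) else 'image query'
    fig.suptitle(f"Results for query: {title}")
    for ax, img_path in zip(axes[0], retrieved_images):
        ax.imshow(Image.open(img_path))
        ax.axis('off')
    plt.show()

Let's retrieve the 3 most similar images for the text query 'ball':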

query = 'ball'
query, retrieved_images = retrieve_similar_images(query, model, index, image_paths, top_k=3)
visualize_results(query, retrieved_images)

Retrieved images for the query: 'a ball'

For the query 'animal' we get:

Retrieved images for the query: 'animal'

Search with a Reference Image:

query = '/content/drive/MyDrive/Colab Notebooks/my_medium_projects/Image_similarity_search/image_dataset/pexels-w-w-299285-889839.jpg'
query, retrieved_images = retrieve_similar_images(query, model, index, image_paths, top_k=3)
visualize_results(query, retrieved_images)

Query and retrieved images

As we can see, we get quite impressive results from an off-the-shelf pre-trained model. When we searched with a reference image of an eye painting, besides finding the original image, the search returned one match with eyeglasses and one with a different painting. This demonstrates the different aspects of the semantic meaning of the query image.

You can try other queries in the provided Colab notebook to see how the model performs with different text and image inputs.

In this tutorial we built a basic image similarity search engine using CLIP and FAISS. The retrieved images shared a similar semantic meaning with the query, indicating the effectiveness of the approach. Although CLIP shows good results as a zero-shot model, it might perform poorly on out-of-distribution data and fine-grained tasks, and it inherits the bias of the data it was trained on. To overcome these limitations you can try other CLIP-like pre-trained models, such as those in OpenCLIP, or fine-tune CLIP on your own custom dataset.
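As a pointer only (not part of the tutorial code), loading an OpenCLIP checkpoint with the open_clip library could look roughly like this; the model name, pretrained tag, and example.jpg path are placeholders:

import open_clip
import torch
from PIL import Image

# Example OpenCLIP checkpoint; other names and tags are listed in the open_clip repository
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

with torch.no_grad():
    image_emb = model.encode_image(preprocess(Image.open('example.jpg')).unsqueeze(0))
    text_emb = model.encode_text(tokenizer(['a ball']))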

Congratulations on making it all the way here. Click 👍 to show your appreciation and raise the algorithm's self-esteem 🤓

Want to learn more?

Colab Notebook Link