Build a (recipe) recommender chatbot using RAG and hybrid search (Part I) | by Sebastian Bahr | Mar, 2024


For this project, we'll use recipes from Public Domain Recipes. All recipes are stored as markdown files in this GitHub repository. For this tutorial, I already did some data cleaning and created features from the raw text input. If you're keen on doing the data cleaning part yourself, the code is available on my GitHub repository.

The dataset consists of the following columns:

  • title: the title of the recipe
  • date: the date the recipe was added
  • tags: a list of tags that describe the meal
  • introduction: an introduction to the recipe; the content varies strongly between records
  • ingredients: all needed ingredients. Note that I removed the quantities, as they are not needed for creating embeddings and could, on the contrary, lead to undesired recommendations.
  • directions: all required steps you need to perform to cook the meal
  • recipe_type: indicator whether the recipe is vegan, vegetarian, or regular
  • output: contains the title, ingredients, and directions of the recipe and will later be provided to the chat model as input.

Let's take a look at the distribution of the recipe_type feature. We see that the majority (60%) of the recipes include fish or meat and aren't vegetarian-friendly. Roughly 35% are vegetarian-friendly and only 5% are vegan-friendly. This feature will be used as a hard filter for retrieving matching recipes from the vector database.

import re
import json
import spacy
import torch
import openai
import vertexai
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from transformers import AutoModelForMaskedLM, AutoTokenizer
from pinecone import Pinecone, ServerlessSpec
from vertexai.language_models import TextEmbeddingModel
# utils_google and utils_openai are my own helper modules for authentication;
# each authenticate() is called right after its import, so the shadowing is harmless
from utils_google import authenticate
credentials, PROJECT_ID, service_account, pinecone_API_KEY = authenticate()
from utils_openai import authenticate
OPENAI_API_KEY = authenticate()

openai_client = openai.OpenAI(api_key=OPENAI_API_KEY)

REGION = "us-central1"
vertexai.init(project=PROJECT_ID,
              location=REGION,
              credentials=credentials)

pc = Pinecone(api_key=pinecone_API_KEY)

# download spacy model
#!python -m spacy download en_core_web_sm

recipes = pd.read_json("recipes_v2.json")
recipes.head()

# plot the share of each recipe type (use value_counts for both labels and
# heights so bars and labels stay aligned)
recipe_shares = recipes.recipe_type.value_counts(normalize=True)
plt.bar(recipe_shares.index, recipe_shares.values)
plt.show()
Distribution of recipe types

Hybrid search uses a combination of sparse and dense vectors and a weighting factor alpha, which allows adjusting the importance of the dense vector in the retrieval process. In the following, we'll create dense vectors based on the title, tags, and introduction, and sparse vectors based on the ingredients. By adjusting alpha we can therefore later on determine how much "attention" should be paid to ingredients the user mentioned in their query, as sketched below.
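To make the role of alpha concrete, here is a minimal sketch of the scaling convention commonly used for Pinecone hybrid queries. The helper name hybrid_scale and the sparse dictionary format are assumptions at this point (the format is only built later in this tutorial), not code from the pipeline itself:

# minimal sketch, assuming the common convention where the dense query is
# scaled by alpha and the sparse query by (1 - alpha): alpha = 1 yields a
# pure dense search, alpha = 0 a pure sparse (keyword-like) search
def hybrid_scale(dense, sparse, alpha):
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense]
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return scaled_dense, scaled_sparse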

Before creating the embeddings, a new feature needs to be created that contains the combined information of the title, the tags, and the introduction.

recipes["dense_feature"] = recipes.title + "; " + recipes.tags.apply(lambda x: str(x).strip("[]").change("'", "")) + "; " + recipes.introduction
recipes["dense_feature"].head()

Finally, before diving deeper into the generation of the embeddings, we'll take a look at the output column. The second part of the tutorial will be all about creating a chatbot using OpenAI that is able to answer user questions using knowledge from our recipe database. Therefore, after finding the recipes that best match the user query, the chat model needs some information to build its answer on. That's where the output column is used, as it contains all the needed information for an adequate answer.

# example output
{'title': 'Creamy Mashed Potatoes',
 'ingredients': 'The quantities here are for about 4 adult portions. If you are planning on eating this as a side dish, it might be more like 6-8 portions. * 1kg potatoes * 200ml milk* * 200ml mayonnaise* * ~100g cheese * Garlic powder * 12-16 strips of bacon * Butter * 3-4 green onions * Black pepper * Salt *You can play with the proportions depending on how creamy or dry you want the mashed potatoes to be.',
 'directions': '1. Peel and cut the potatoes into medium sized pieces. 2. Put the potatoes in a pot with some water so that it covers the potatoes and boil them for about 20-30 minutes, or until the potatoes are soft. 3. About ten minutes before removing the potatoes from the boiling water, cut the bacon into little pieces and fry it. 4. Warm up the milk and mayonnaise. 5. Shred the cheese. 6. When the potatoes are done, remove all the water from the pot, add the warm milk and mayonnaise mix, add some butter, and mash with a potato masher or a blender. 7. Add some salt, black pepper and garlic powder to taste and continue mashing the mixture. 8. Once the mixture is somewhat homogeneous and the potatoes are properly mashed, add the shredded cheese and fried bacon and mix a little. 9. Serve and top with chopped green onions.'}

Further, a unique identifier needs to be added to each recipe, which allows retrieving the records of the recommended candidate recipes and their output.

recipes["ID"] = vary(len(recipes))

Generate sparse embeddings

The next step involves creating sparse embeddings for all 360 observations. To calculate these embeddings, a more sophisticated method than the frequently used TF-IDF or BM25 approach is applied: the SPLADE (Sparse Lexical and Expansion) model. A detailed explanation of SPLADE can be found here. Dense embeddings have the same shape for each text input, regardless of the number of tokens in the input. In contrast, sparse embeddings contain a weight for each unique token in the input. The dictionary below represents a sparse vector, where the token ID is the key and the assigned weight is the value.
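For illustration, here is a made-up miniature example of that format; the IDs and weights are invented, only the structure matches what the code below produces:

# illustrative only: {token ID: weight}
{1012: 0.41, 2314: 1.83, 30012: 0.27}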

model_id = "naver/splade-cocondenser-ensembledistil"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

def to_sparse_vector(text, tokenizer, model):
    tokens = tokenizer(text, return_tensors="pt")
    output = model(**tokens)
    # SPLADE activation: log(1 + ReLU(logits)), masked by the attention mask
    # and max-pooled over the sequence dimension
    vec = torch.max(
        torch.log(1 + torch.relu(output.logits)) * tokens.attention_mask.unsqueeze(-1), dim=1
    )[0].squeeze()

    # keep only the non-zero weights as a {token_id: weight} dictionary
    cols = vec.nonzero().squeeze().cpu().tolist()
    weights = vec[cols].cpu().tolist()
    sparse_dict = dict(zip(cols, weights))
    return sparse_dict

sparse_vectors = []

for i in tqdm(range(len(recipes))):
    sparse_vectors.append(to_sparse_vector(recipes.iloc[i]["ingredients"], tokenizer, model))

recipes["sparse_vectors"] = sparse_vectors

Sparse embeddings of the first recipe
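If you want an intuition for what these weights encode, you can map the token IDs back to readable tokens. A small optional sketch using the tokenizer and DataFrame from above (not part of the original pipeline):

# map the token IDs of the first recipe's sparse vector back to readable
# tokens and print the ten highest-weighted ones
first = recipes["sparse_vectors"].iloc[0]
id2token = {idx: tok for tok, idx in tokenizer.get_vocab().items()}
readable = {id2token[idx]: round(weight, 2) for idx, weight in first.items()}
print(sorted(readable.items(), key=lambda kv: kv[1], reverse=True)[:10])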

Generating dense embeddings

At this point of the tutorial, some costs will arise if you use a text embedding model from VertexAI (Google) or OpenAI. However, if you use the same dataset, the costs will be at most $5. The cost may vary if you use a dataset with more records or longer texts, as you are charged by token. If you don't want to incur any costs but still want to follow the tutorial, particularly the second part, you can download the pandas DataFrame recipes_with_vectors.pkl with pre-generated embedding data from my GitHub repository.
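Because billing is per token, you can estimate the cost upfront. A small sketch, assuming the tiktoken package (not used elsewhere in this tutorial):

# estimate how many tokens the dense_feature column will consume with
# OpenAI's embedding model (tiktoken maps text-embedding-3-small to the
# cl100k_base encoding)
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")
total_tokens = recipes["dense_feature"].apply(lambda t: len(enc.encode(t))).sum()
print(f"~{total_tokens} tokens to embed")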

You can choose to use either VertexAI or OpenAI to create the embeddings. OpenAI has the advantage of being easy to set up with an API key, whereas VertexAI requires logging into the Google Console, creating a project, and adding the VertexAI API to your project. Additionally, the OpenAI model allows you to specify the number of dimensions for the dense vector. Nevertheless, both of them create state-of-the-art dense embeddings.

Using the VertexAI API

# running this code will incur costs !!!
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@003")

def to_dense_vector(text, model):
    dense_vectors = model.get_embeddings([text])
    return [dense_vector.values for dense_vector in dense_vectors][0]

dense_vectors = []

for i in tqdm(range(len(recipes))):
    dense_vectors.append(to_dense_vector(recipes.iloc[i]["dense_feature"], model))

recipes["dense_vectors"] = dense_vectors

Using the OpenAI API

# running this code will incur costs !!!

# Create dense embeddings using OpenAI's text embedding model, truncated to
# 768 dimensions to match the VertexAI embeddings
model = "text-embedding-3-small"

def to_dense_vector_openAI(text, client, model, dimensions):
    response = client.embeddings.create(model=model, dimensions=dimensions, input=[text])
    # the embedding vector sits in the data attribute of the response
    return response.data[0].embedding

dense_vectors = []

for i in tqdm(range(len(recipes))):
    dense_vectors.append(to_dense_vector_openAI(recipes.iloc[i]["dense_feature"], openai_client, model, 768))

recipes["dense_vectors"] = dense_vectors

Upload data to the vector database

After generating the sparse and dense embeddings, we have all the necessary data to upload them to a vector database. In this tutorial, Pinecone will be used, as they allow performing a hybrid search using sparse and dense vectors and offer a serverless pricing schema with $100 of free credits. To perform a hybrid search later on, the similarity metric needs to be set to dot product. If we would only perform a dense instead of a hybrid search, we would be able to select one of these similarity metrics: dot product, cosine, and Euclidean distance. More information about similarity metrics and how they calculate the similarity between two vectors can be found here.
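Conceptually, the dot product metric is what makes hybrid scoring additive: the match of a document can be thought of as the sum of its dense and sparse dot products with the query. A toy illustration with made-up numbers:

import numpy as np

# made-up query/document vectors in both representations
dense_q, dense_d = np.array([0.2, 0.7]), np.array([0.5, 0.1])
sparse_q, sparse_d = np.array([1.2, 0.0, 0.8]), np.array([0.9, 0.4, 0.0])

# hybrid score = dense dot product + sparse dot product
hybrid_score = float(np.dot(dense_q, dense_d) + np.dot(sparse_q, sparse_d))
print(hybrid_score)  # 0.17 + 1.08 = 1.25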

# load the pandas DataFrame with pre-generated embeddings if you
# did not generate them in the last step
recipes = pd.read_pickle("recipes_with_vectors.pkl")

# if you need to delete an existing index
pc.delete_index("index-name")

# create a new index
pc.create_index(
    name="recipe-project",
    dimension=768,  # adjust if needed
    metric="dotproduct",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-west-2"
    )
)

pc.describe_index("recipe-project")

Congratulations on creating your first Pinecone index! If the embedding model you used creates vectors with a different number of dimensions, make sure to adjust the dimension argument.

Now it's time to upload the data to the newly created Pinecone index.

# upsert to Pinecone in batches
def sparse_to_dict(data):
    # convert a {token_id: weight} dict into Pinecone's sparse vector format
    dict_ = {"indices": list(data.keys()),
             "values": list(data.values())}
    return dict_

batch_size = 100
index = pc.Index("recipe-project")

for i in tqdm(range(0, len(recipes), batch_size)):
    i_end = min(i + batch_size, len(recipes))
    meta_batch = recipes.iloc[i: i_end][["ID", "recipe_type"]]
    meta_dict = meta_batch.to_dict(orient="records")

    sparse_batch = recipes.iloc[i: i_end]["sparse_vectors"].apply(lambda x: sparse_to_dict(x))
    dense_batch = recipes.iloc[i: i_end]["dense_vectors"]

    upserts = []

    ids = [str(x) for x in range(i, i_end)]
    for id_, meta, sparse_, dense_ in zip(ids, meta_dict, sparse_batch, dense_batch):
        upserts.append({
            "id": id_,
            "sparse_values": sparse_,
            "values": dense_,
            "metadata": meta
        })

    index.upsert(upserts)

index.describe_index_stats()

If you're curious about what the uploaded data looks like, log in to Pinecone, select the newly created index, and have a look at its items. For now, we don't need to pay attention to the score, as it's generated by default and indicates the match with a vector randomly generated by Pinecone. However, later we'll calculate the similarity of the embedded user query with all items in the vector database and retrieve the k most similar items. Further, each item contains an item ID generated by Pinecone and the metadata, which consists of the recipe ID and its recipe_type. The dense embeddings are stored in Values and the sparse embeddings in Sparse Values.

The first three items of the index (Image by author)

We can fetch the information from above using the Pinecone Python SDK. Let's take a look at the stored information of the first item, with the index item ID 50.

index.fetch(ids=["50"])

As in the Pinecone dashboard, we get the item ID of the element, its metadata, the sparse values, and the dense values, which are stored in the list at the bottom of the truncated output.
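As a preview of Part II, a hybrid query against this index could look roughly like the sketch below; dense_query and sparse_query are placeholders for an embedded user query (scaled with alpha as described earlier), and the metadata filter implements the hard recipe_type filter mentioned at the beginning:

# hypothetical hybrid query (Part II covers this in detail):
# dense_query is a list of 768 floats, sparse_query a
# {"indices": [...], "values": [...]} dict
result = index.query(
    vector=dense_query,
    sparse_vector=sparse_query,
    filter={"recipe_type": {"$eq": "vegan"}},  # hard filter on the metadata
    top_k=3,
    include_metadata=True,
)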
