Are GPTs Good Embedding Models? A surprising experiment to show that… | by Yu-Cheng Tsai | May 2024

A surprising experiment to show that the devil is in the details

Image by the author using DALL-E

With the growing number of embedding models available, choosing the right one for your machine learning applications can be challenging. Fortunately, the MTEB leaderboard provides a comprehensive range of ranking metrics for various natural language processing tasks.

Top 5 embedding models from the MTEB leaderboard as of May 17th, 2024

When you visit the site, you will notice that the top 5 embedding models are Generative Pre-trained Transformers (GPTs). This might lead you to think that GPT models are the best choice for embeddings. But is this really true? Let's run an experiment to find out.

Embeddings are tensor representations of text: the text is converted into token IDs, which are then projected into a tensor space.

By feeding text into a neural network model and performing a forward pass, you can obtain embedding vectors. However, the actual process is a bit more involved. Let's break it down step by step:

  1. Convert the text into token IDs
  2. Pass the token IDs into a neural network
  3. Return the outputs of the neural network

In the first step, I use a tokenizer to do this. model_inputs is the tensor representation of the text content, "some questions.".

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

messages = [
    {
        "role": "user",
        "content": "some questions.",
    },
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to("cuda")
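
If you want to confirm what model_inputs holds, an optional inspection (not part of the original walkthrough) could look like this:

print(model_inputs.shape)                 # a (batch_size, sequence_length) tensor of token IDs
print(tokenizer.decode(model_inputs[0]))  # decoding the IDs recovers the chat-templated text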

The second step is straightforward: forward-pass the model_inputs into a neural network. The logits of the generated tokens can be accessed via .logits. torch.no_grad() disables gradient tracking because the model is only being used for inference, not training.

import torch

# Forward pass in inference mode (no gradients tracked)
with torch.no_grad():
    logits = model(model_inputs).logits

The third step is a bit tricky. GPT models are decoder-only, and their token generation is autoregressive. In simple terms, the last token of a completed sentence has attended to all of the preceding tokens in the sentence. Therefore, the output of the last token contains all the affinity scores (attention) from the preceding tokens.

Bingo! You are most interested in the last token because of the attention mechanism in transformers.

The output dimension of the GPT models implemented in Hugging Face is (batch size, input token size, vocabulary size). To get the last-token output for all batches, I can perform a tensor slice.

import torch

# Slice out the logits of the last token for every example in the batch
with torch.no_grad():
    last_token_logits = model(model_inputs).logits[:, -1, :]

To measure the quality of these GPT embeddings, you can use cosine similarity. The higher the cosine similarity, the closer the semantic meaning of the sentences.

import torch

def compute_cosine_similarity(vec1, vec2):
    # dim=1 compares along the embedding dimension for each item in the batch
    cos = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
    return cos(vec1, vec2)
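
As a quick sanity check with made-up vectors (not embeddings from the experiment), orthogonal vectors should score near 0 and identical vectors near 1:

a = torch.tensor([[1.0, 0.0]])
b = torch.tensor([[0.0, 1.0]])
print(compute_cosine_similarity(a, b))  # ~0 for orthogonal vectors
print(compute_cosine_similarity(a, a))  # ~1 for identical vectors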

Let's create some utility functions that allow us to loop through a list of question and answer pairs and see the results. Mistral 7B Instruct v0.1, one of the great open-source models, is used for this experiment.

import torch
from termcolor import colored
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and move it to the same device as the inputs
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1"
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

def generate_last_token_embeddings(question, max_new_tokens=30):
    messages = [
        {
            "role": "user",
            "content": question,
        },
    ]
    encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
    model_inputs = encodeds.to("cuda")
    with torch.no_grad():
        # Use the logits of the last token as the embedding
        return model(model_inputs).logits[:, -1, :]

def get_similarities(questions, answers):
    for question in questions:
        for answer in answers:
            q_embedding, a_embedding = (
                generate_last_token_embeddings(question),
                generate_last_token_embeddings(answer),
            )
            similarity = compute_cosine_similarity(q_embedding, a_embedding)
            print(colored(f"question: {question} and ans: {answer}", "green"))
            print(colored(f"result: {similarity}", "blue"))

questions = ["Where is the headquarter of OpenAI?", "What is GPU?"]
answers = [
    "OpenAI is based at San Francisco.",
    "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly",
]
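
To reproduce the results shown in the figure below, you would then presumably call the helper on the two lists (this call is not shown in the original snippet):

get_similarities(questions, answers)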

Cosine similarities for Mistral 7B Instruct v0.1 (Image by the author)

For the first question and answer pair:

  • Question: "Where is the headquarter of OpenAI?"
  • Answer: "OpenAI is based at San Francisco."
  • Cosine Similarity: 0.96

For the second question and answer pair:

  • Question: "What is GPU?"
  • Answer: "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly."
  • Cosine Similarity: 0.94

For an irrelevant pair:

  • Question: "Where is the headquarter of OpenAI?"
  • Answer: "A graphics processing unit (GPU) is an electronic circuit that can perform mathematical calculations quickly."
  • Cosine Similarity: 0.90

For the worst pair:

  • Question: "What is GPU?"
  • Answer: "OpenAI is based at San Francisco."
  • Cosine Similarity: 0.93

These results suggest that using GPT models, in this case Mistral 7B Instruct v0.1, as embedding models may not yield great results in terms of distinguishing between relevant and irrelevant pairs. But why are GPT models still among the top 5 embedding models?

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "intfloat/e5-mistral-7b-instruct"
).to("cuda")
Cosine similarities for e5-mistral-7b-instruct (Image by the author)

Repeating the same evaluation procedure with a different model, e5-mistral-7b-instruct, which is one of the top open-source models on the MTEB leaderboard and is fine-tuned from Mistral 7B Instruct, I find that the cosine similarities for the relevant question and answer pairs are 0.88 and 0.84 for the OpenAI and GPU questions, respectively. For the irrelevant question and answer pairs, the similarity drops to 0.56 and 0.67. These findings suggest that e5-mistral-7b-instruct is a much-improved model for embeddings. What makes such an improvement?

The contrastive loss function

Delving into the paper behind e5-mistral-7b-instruct, the key is the use of contrastive loss to further fine-tune the Mistral model.

Unlike GPTs that are trained or further fine-tuned using cross-entropy loss between predicted tokens and labeled tokens, contrastive loss aims to maximize the distance between negative pairs and minimize the distance between positive pairs.

This blog post covers the concept in greater detail. The sim function calculates the cosine similarity between two vectors. In the contrastive loss, the denominator sums over both the positive and the negative examples. The rationale behind contrastive loss is that we want the term for the positive pair to be as close to 1 as possible, since log(1) = 0 represents the optimal loss.
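
To make this concrete, here is a minimal sketch of an InfoNCE-style contrastive loss, assuming cosine similarity as the sim function, a single positive, a handful of negatives, and a temperature hyperparameter; the exact recipe in the e5 paper may differ:

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, neg_embs, temperature=0.05):
    # sim(): cosine similarity, scaled by a temperature
    pos_sim = F.cosine_similarity(query_emb, pos_emb, dim=-1) / temperature
    neg_sims = F.cosine_similarity(query_emb.unsqueeze(0), neg_embs, dim=-1) / temperature
    # Numerator: exp(positive similarity); denominator: exp() summed over the positive and all negatives
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sims])
    # -log softmax of the positive entry; minimized when the positive pair dominates the negatives
    return -torch.log_softmax(logits, dim=0)[0]

# Toy example with random, made-up embeddings
query = torch.randn(8)
positive = query + 0.1 * torch.randn(8)  # close to the query
negatives = torch.randn(4, 8)            # unrelated
print(contrastive_loss(query, positive, negatives))

Minimizing this loss pushes the similarity of the positive pair up while pushing the negatives down, which is exactly the discriminative behavior that the raw Mistral embeddings above were missing.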

In this post, I have highlighted a common pitfall of using GPTs as embedding models without fine-tuning. My evaluation suggests that after fine-tuning GPTs with contrastive loss, the embeddings become more meaningful and discriminative. By understanding the strengths and limitations of GPT models, and by leveraging customized losses like contrastive loss, you can make more informed decisions when selecting and utilizing embedding models for your machine learning projects. I hope this post helps you choose GPT models wisely for your applications, and I look forward to hearing your feedback! 🙂
