Evaluating Text Generation in Large Language Models | by Mina Ghashami | Jan, 2024


Metrics to measure the gap between neural text and human text

Image from unsplash.com

In recent years, large language models have shown a tremendous ability to generate human-like text. There are many metrics that measure how close a text generated by a large language model is to a reference human text; in fact, bridging this gap is an active area of research.

In this post, we look into two well-known metrics for automatically evaluating machine-generated text.

Suppose you are given a reference text that is human-generated and a candidate text that is generated by an LLM. To compute the semantic similarity between these two texts, BERTScore computes pairwise cosine similarities between their token embeddings. See the image below:

Image from [1]

Here the reference text is “the weather is cold today” and the machine-generated candidate text is “it is freezing today”. If we compute an n-gram similarity, these two texts will receive a low score, even though we know they are semantically very similar. So BERTScore computes a contextual embedding for every token in both the reference text and the candidate text, and based on these embedding vectors it computes the pairwise cosine similarities.
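To make this concrete, here is a minimal sketch of the pairwise-similarity step using the Hugging Face transformers library. The choice of bert-base-uncased, the use of the last hidden layer, and the inclusion of the [CLS]/[SEP] special tokens are simplifying assumptions; the official BERTScore implementation selects models and layers more carefully.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    """Return L2-normalized contextual embeddings, one row per token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    return hidden / hidden.norm(dim=-1, keepdim=True)

ref = token_embeddings("the weather is cold today")
cand = token_embeddings("it is freezing today")

# With unit-norm rows, a matrix product gives every pairwise cosine
# similarity at once: sim[i, j] = cos(ref_token_i, cand_token_j).
sim = ref @ cand.T
```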

Image from [1]

Based on the pairwise cosine similarities, we can compute precision, recall, and the F1 score as follows (see the sketch after this list):

  • Recall: for every token in the reference text, take the maximum cosine similarity against the candidate tokens, then average these maxima
  • Precision: for every token in the candidate text, take the maximum cosine similarity against the reference tokens, then average these maxima
  • F1 score: the harmonic mean of precision and recall
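Continuing the sketch above, a greedy matching over the rows and columns of sim gives all three scores:

```python
# Each row of sim is a reference token; each column is a candidate token.
recall = sim.max(dim=1).values.mean()     # best match for every reference token
precision = sim.max(dim=0).values.mean()  # best match for every candidate token
f1 = 2 * precision * recall / (precision + recall)
```

In practice you would likely reach for the authors' bert_score package, which wraps these steps and adds refinements such as baseline rescaling.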

BERTScore [1] also proposes a modification to the above scores called “importance weighting”. Importance weighting accounts for the fact that rare words that are shared between two sentences are more…
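In [1], this weighting is done with inverse document frequency (idf): each reference token's best-match similarity is scaled by its idf, so rare tokens contribute more to the score. A minimal sketch, assuming idf values have already been computed from a reference corpus (the random weights below are placeholders only):

```python
# Hypothetical idf weights, one per reference token; a real implementation
# derives these from token frequencies in a reference corpus.
ref_idf = torch.rand(sim.shape[0])

# idf-weighted recall: a weighted average of each reference token's
# best-match similarity, normalized by the total idf mass.
weighted_recall = (ref_idf * sim.max(dim=1).values).sum() / ref_idf.sum()
```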
