
A High Level Guide to LLM Evaluation Metrics | by David Hundley | Feb, 2024



Developing an understanding of a variety of LLM benchmarks & scores, including an intuition of when they may be of value for your purpose

17 min read

13 hours ago

Title card created by the author

It seems that on almost a weekly basis, a new large language model (LLM) is launched to the public. With each announcement, the provider touts performance numbers that can sound pretty impressive. The challenge I've found is that these press releases reference a wide breadth of performance metrics. While a few show up more often than others, there unfortunately isn't simply one or two "go to" metrics. If you want to see a tangible example of this, check out the page for GPT-4's performance. It references many different benchmarks and scores!

The first natural question one might have is, "Why can't we simply agree to use a single metric?" In short, there is no clean way to assess LLM performance, so each performance metric seeks to provide a quantitative assessment of one focused domain. Additionally, many of these performance metrics have "sub-metrics" that calculate the metric slightly differently from the original. When I initially started researching this blog post, my intention was to cover every single one of these benchmarks and scores, but I quickly discovered that if I were to do so, we'd be covering over 50 different metrics!

Because assessing each individual metric isn't exactly feasible, what I discovered is that we can chunk these various benchmarks and scores into categories based on what they're generally trying to assess. In the remainder of this post, we'll cover these categories and also provide specific examples of popular metrics that fall under each of them. The goal is that you can walk away from this post with a general sense of which performance metrics you should be assessing for your specific use case.

The six categories we'll assess in this post include the following. Please note: there isn't really an "industry standard" for how these categories are defined. I created these categories based on how I hear them referenced most often:

  1. General knowledge benchmarks
