Merge Large Language Models with mergekit | by Maxime Labonne | Jan, 2024

Create your own models easily, no GPU required!

Image by author

Model merging is a technique that combines two or more LLMs into a single model. It's a relatively new and experimental method to create new models for cheap (no GPU required). Model merging works surprisingly well and has produced many state-of-the-art models on the Open LLM Leaderboard.

In this tutorial, we will implement it using the mergekit library by Charles Goddard. More specifically, we will review four merge methods and provide examples of configurations. Then, we will use mergekit to create our own model, Marcoro14-7B-slerp, which became the best-performing model on the Open LLM Leaderboard (02/01/24).

The code is available on GitHub and Google Colab. I recommend using my automated notebook to easily run mergekit: 🥱 LazyMergekit.

Image by author

In this section, we will focus on four methods currently implemented in mergekit. Note that there are other methods, such as linear and Task Arithmetic. If you're interested in papers on model merging, I recommend this excellent collection on Hugging Face.

1. SLERP

Spherical Linear Interpolation (SLERP) is a method used to smoothly interpolate between two vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors reside.

There are several reasons to prefer SLERP over a traditional linear interpolation. For example, in high-dimensional spaces, linear interpolation can lead to a decrease in the magnitude of the interpolated vector (i.e., it reduces the scale of the weights). Moreover, the change in direction of the weights often represents more meaningful information (like feature learning and representation) than the magnitude of change.

SLERP is implemented using the following steps:

  1. Normalize the input vectors to unit length, ensuring they represent directions rather than magnitudes.
  2. Calculate the angle between these vectors using their dot product.
  3. If the vectors are nearly collinear, it defaults to linear interpolation for efficiency. Otherwise, SLERP computes scale factors based on the interpolation factor t (t=0 = 100% of the first model, t=1 = 100% of the second model) and the angle between the vectors.
  4. These factors are used to weigh the original vectors, which are then summed to obtain the interpolated vector (see the sketch below).
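
To make these steps concrete, here is a minimal NumPy sketch of SLERP between two weight vectors. It only illustrates the math above and is not mergekit's actual implementation; the collinearity threshold is an arbitrary choice:

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors (illustrative sketch)."""
    # Step 1: work on unit-length copies so the angle reflects directions, not magnitudes
    v0_n = v0 / (np.linalg.norm(v0) + eps)
    v1_n = v1 / (np.linalg.norm(v1) + eps)
    # Step 2: angle between the two directions
    dot = np.clip(np.dot(v0_n, v1_n), -1.0, 1.0)
    # Step 3: nearly collinear vectors fall back to plain linear interpolation
    if np.abs(dot) > 0.9995:
        return (1 - t) * v0 + t * v1
    theta = np.arccos(dot)
    sin_theta = np.sin(theta)
    # Step 4: scale factors derived from t and the angle, applied to the original vectors
    s0 = np.sin((1 - t) * theta) / sin_theta
    s1 = np.sin(t * theta) / sin_theta
    return s0 * v0 + s1 * v1
```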

SLERP is currently the most popular merging method, but it is limited to combining only two models at a time. It is still possible to hierarchically combine multiple models, as shown in Mistral-7B-Merge-14-v0.1.

Example of configuration:

slices:
  - sources:
      - model: OpenPipe/mistral-ft-optimized-1218
        layer_range: [0, 32]
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: OpenPipe/mistral-ft-optimized-1218
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

This is a classic SLERP configuration, applied to every layer of both models. Note that we input a gradient of values for the interpolation factor t. The parameters for the self-attention and MLP layers will use different combinations of OpenPipe/mistral-ft-optimized-1218 and mlabonne/NeuralHermes-2.5-Mistral-7B. The other layers are a 50/50 mixture of the two models.
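
For intuition, the short sketch below shows how such a gradient of t values might be spread across the 32 layers of the slice. It assumes the gradient is interpolated linearly over the layer range; the helper function is purely illustrative, not mergekit's code:

```python
import numpy as np

def layer_t_values(gradient, num_layers):
    """Spread a gradient of t values linearly across the layers of a slice (illustrative)."""
    anchors = np.linspace(0, num_layers - 1, num=len(gradient))
    return np.interp(np.arange(num_layers), anchors, gradient)

# Per-layer t for the self-attention weights in the config above
print(layer_t_values([0, 0.5, 0.3, 0.7, 1], 32))
```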

You can find the final model on the Hugging Face Hub at mlabonne/NeuralPipe-7B-slerp.

2. TIES

Introduced in this paper by Yadav et al., TIES-Merging is designed to efficiently merge multiple task-specific models into a single multitask model. It addresses two main challenges in model merging:

  • Redundancy in model parameters: It identifies and eliminates redundant parameters within task-specific models. This is achieved by focusing on the changes made during fine-tuning, identifying the top-k% most significant changes, and discarding the rest.
  • Disagreement between parameter signs: Conflicts arise when different models suggest opposing adjustments to the same parameter. TIES-Merging resolves these conflicts by creating a unified sign vector that represents the most dominant direction of change across all models.

TIES-Merging is divided into the following three steps (a short sketch follows the list):

  1. Trim: Reduces redundancy in task-specific models by retaining only a fraction of the most significant parameters (density parameter) and resetting the rest to zero.
  2. Elect Sign: Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude.
  3. Disjoint Merge: Averages parameter values that align with the unified sign vector, excluding zero values.
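
The following PyTorch sketch walks through these three steps on a single parameter tensor. It is a simplified illustration under the assumption that all models share the same tensor shape, not mergekit's actual implementation:

```python
import torch

def ties_merge(base, finetuned, density=0.5, weights=None):
    """Illustrative TIES-Merging of one parameter tensor (not mergekit's actual code)."""
    weights = weights or [1.0] * len(finetuned)
    deltas = []
    for ft, w in zip(finetuned, weights):
        # Trim: keep only the top-density fraction of fine-tuning changes, zero the rest
        delta = ft - base
        k = max(1, int(density * delta.numel()))
        threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
        delta = torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta))
        deltas.append(w * delta)
    deltas = torch.stack(deltas)
    # Elect Sign: dominant direction per parameter, measured by cumulative magnitude
    elected_sign = torch.sign(deltas.sum(dim=0))
    # Disjoint Merge: average only the non-zero deltas that agree with the elected sign
    agree = (torch.sign(deltas) == elected_sign) & (deltas != 0)
    merged_delta = (deltas * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)
    return base + merged_delta
```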

Unlike SLERP, TIES can merge multiple models at a time.

Example of configuration:

models:
  - model: mistralai/Mistral-7B-v0.1
    # no parameters necessary for base model
  - model: OpenPipe/mistral-ft-optimized-1218
    parameters:
      density: 0.5
      weight: 0.5
  - model: mlabonne/NeuralHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
dtype: float16

With this config, we use Mistral-7B as a base model to calculate the delta weights. We merge the same two models: mistral-ft-optimized-1218 (50%) and NeuralHermes-2.5-Mistral-7B (30%) with normalization. Here, the density means that we're only retaining 50% of the parameters of each model (the other half comes from the base model).

Note that the sum of the weights is not equal to 1. It means that we will reduce the scale of the final weights (not really recommended!). This config is inspired by the parameters provided by the author of OpenHermes-2.5-neural-chat-7b-v3-1-7B.

You can find the final model on the Hugging Face Hub at mlabonne/NeuralPipe-7B-ties.

3. DARE

Introduced by Yu et al. (2023), DARE uses an approach similar to TIES with two main differences:

  • Pruning: DARE randomly resets fine-tuned weights to their original values (those of the base model).
  • Rescaling: DARE rescales the weights to keep the expectations of model outputs approximately unchanged. It adds the rescaled weights of both (or more) models to the weights of the base model with a scale factor (see the sketch below).
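
As a rough illustration of this drop-and-rescale idea on a single parameter tensor (a hedged sketch, not mergekit's code; the dare_ties flavour would additionally apply the TIES sign election afterwards):

```python
import torch

def dare(base, finetuned, density=0.53):
    """Illustrative DARE drop-and-rescale on one parameter tensor (not mergekit's code)."""
    delta = finetuned - base
    # Pruning: randomly reset a fraction (1 - density) of the fine-tuned deltas to zero
    keep_mask = torch.bernoulli(torch.full_like(delta, density))
    # Rescaling: divide by the keep probability so expected outputs stay roughly unchanged
    rescaled_delta = delta * keep_mask / density
    return base + rescaled_delta
```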

Mergekit's implementation of this method has two flavours: with the sign election step of TIES (dare_ties) or without (dare_linear).

Example of configuration:

models:
  - model: mistralai/Mistral-7B-v0.1
    # No parameters necessary for base model
  - model: samir-fama/SamirGPT-v1
    parameters:
      density: 0.53
      weight: 0.4
  - model: abacusai/Slerp-CM-mist-dpo
    parameters:
      density: 0.53
      weight: 0.3
  - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.2
    parameters:
      density: 0.53
      weight: 0.3
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  int8_mask: true
dtype: bfloat16

In this configuration, we merge three different models based on Mistral-7B using dare_ties. This time, I chose weights that sum to 1, which seems to perform better. The density parameter is a little higher than what's recommended, but it looks like it worked well for this model.

You can find it on the Hugging Face Hub at mlabonne/Daredevil-7B. It's also the best merge model in this article, outperforming even Marcoro14-7B-slerp.

4. Passthrough

The passthrough method differs significantly from the previous ones. By concatenating layers from different LLMs, it can produce models with an exotic number of parameters (e.g., 9B with two 7B parameter models). These models are often referred to as "frankenmerges" or "Frankenstein models" by the community.

This technique is very experimental, but it managed to create impressive models, like goliath-120b using two Llama 2 70B models. The recently released SOLAR-10.7B-v1.0 also uses the same idea, called depth-up scaling in their paper.

Example of configuration:

slices:
  - sources:
      - model: OpenPipe/mistral-ft-optimized-1218
        layer_range: [0, 32]
  - sources:
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B
        layer_range: [24, 32]
merge_method: passthrough
dtype: bfloat16

The resulting frankenmerge will have all 32 layers from the first model and 8 additional layers from the second model. This creates a frankenmerge with a total of 40 layers and 8.99B parameters. This config is inspired by GML-Mistral-merged-v1.

You can find the final model on the Hugging Face Hub at mlabonne/NeuralPipe-9B-merged.

In this section, we will use mergekit to load a merge configuration, run it, and upload the resulting model to the Hugging Face Hub.

First of all, we install mergekit directly from source as follows:

!git clone https://github.com/cg123/mergekit.git
!cd mergekit && pip install -q -e .

In the following block, we load the merge configuration in a YAML format. We also specify the name of the merged model for future use. You can copy/paste any configuration from the previous section here.

This time, we will use two different models: Marcoroni-7B-v3 and Mistral-7B-Merge-14-v0.1 and merge them with the SLERP method. We save the config as a yaml file to be used as input in the merge command.

import yaml

MODEL_NAME = "Marcoro14-7B-slerp"
yaml_config = """
slices:
  - sources:
      - model: AIDC-ai-business/Marcoroni-7B-v3
        layer_range: [0, 32]
      - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.1
        layer_range: [0, 32]
merge_method: slerp
base_model: AIDC-ai-business/Marcoroni-7B-v3
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

"""

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

We run the merge command with the following parameters:

  • --copy-tokenizer to copy the tokenizer from the base model
  • --allow-crimes and --out-shard-size to chunk the models into smaller shards that can be computed on a CPU with low RAM
  • --lazy-unpickle to enable the experimental lazy unpickler for lower memory usage

In addition, some models can require the --trust_remote_code flag (this is not the case with Mistral-7B).

This command will download the weights of all the models listed in the merge configuration and run the selected merge method (it should take ~10 minutes).

# Merge models
!mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle

The model is now merged and saved in the `merge` directory. Before uploading it, we can create a README file with all the information required for reproducibility. The following code block defines a Jinja template and automatically fills it with the data from the merge configuration.

!pip install -qU huggingface_hub

from huggingface_hub import ModelCard, ModelCardData
from jinja2 import Template

username = "mlabonne"

template_text = """
---
license: apache-2.0
tags:
- merge
- mergekit
- lazymergekit
{%- for model in models %}
- {{ model }}
{%- endfor %}
---

# {{ model_name }}

{{ model_name }} is a merge of the following models using [mergekit](https://github.com/cg123/mergekit):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```
"""

# Create a Jinja template object
jinja_template = Template(template_text.strip())

# Get list of models from config
data = yaml.safe_load(yaml_config)
if "models" in data:
    models = [data["models"][i]["model"] for i in range(len(data["models"])) if "parameters" in data["models"][i]]
elif "parameters" in data:
    models = [data["slices"][0]["sources"][i]["model"] for i in range(len(data["slices"][0]["sources"]))]
elif "slices" in data:
    models = [data["slices"][i]["sources"][0]["model"] for i in range(len(data["slices"]))]
else:
    raise Exception("No models or slices found in yaml config")

# Fill the template
content = jinja_template.render(
    model_name=MODEL_NAME,
    models=models,
    yaml_config=yaml_config,
    username=username,
)

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')

Now that we have a model card, we can push the entire folder to the Hub.

from google.colab import userdata
from huggingface_hub import HfApi

username = "mlabonne"

# Defined in the secrets tab in Google Colab
api = HfApi(token=userdata.get("HF_TOKEN"))

api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model"
)
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)

The model is now available on the Hugging Face Hub at mlabonne/Marcoro14-7B-slerp. In another notebook, we can try the model in 4-bit precision on a free T4 GPU using the following code:

!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "mlabonne/Marcoro14-7B-slerp"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)

We're asking the question "What is a Large Language Model?" and got this output:

A large language model is a type of artificial intelligence (AI) system that has been trained on vast amounts of text data. It is designed to understand and generate human-like language, making predictions on what words or phrases might come next in a sentence or document. These models use complex algorithms and neural network architectures to learn from the data and improve their performance over time. Some well-known large language models include GPT-3 from OpenAI and BERT from Google.

It's looking good, but we need a more comprehensive evaluation. For this kind of general-purpose model, there are a few interesting benchmarks:

  • Chatbot Arena, which compiles an Elo-based LLM leaderboard based on human votes.
  • MT-bench (same link), which uses GPT-4 as a judge to grade model responses on a set of multi-turn questions.
  • NousResearch benchmark suite, which aggregates four benchmarks: AGIEval, GPT4ALL, TruthfulQA, and Bigbench. GPT4ALL itself includes HellaSwag, OpenBookQA, Winogrande, ARC-Easy, ARC-Challenge, BoolQ, and PIQA.
  • Open LLM Leaderboard, which aggregates six benchmarks: ARC, HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA.

Unfortunately, we can't submit our model to the Chatbot Arena. Instead, I chose to evaluate it using the Open LLM Leaderboard and NousResearch benchmarks.

I submitted our model to the Open LLM Leaderboard ("🚀 Submit here!" tab). As shown in the introduction, it ranked as the best 7B parameter model on the leaderboard. Here are the complete results:

Image by author

The problem with the Open LLM Leaderboard is that these benchmarks are public. It means that people can train LLMs on the test data to get better results. By merging the best models, we also contaminate our own results. It is safe to assume that Marcoro14-7B-slerp is contaminated and some models used in this merge have been trained on the test set. If you want to create the best model and not hack the leaderboard, I recommend only using non-merge models to create your own merges.

This is why we don't want to rely solely on the OpenLLM Leaderboard. For the NousResearch benchmark suite, I used 🧐 LLM AutoEval to compute the scores automatically with a simple Colab notebook. Here are the results compared to the excellent OpenHermes-2.5-Mistral-7B:

Image by author

We get a significant improvement over this model on every benchmark. Note that the NousResearch benchmark suite shares some tasks with the Open LLM Leaderboard: ARC-Challenge, TruthfulQA, HellaSwag, and Winogrande. To the best of my knowledge, Bigbench is the only benchmark that is 100% different (feel free to contact me if that's not the case). However, one of the models we used in this merge could still have been trained on Bigbench.

In this article, we introduced the concept of merging LLMs with four different methods. We detailed how SLERP, TIES, DARE, and passthrough work and provided examples of configurations. Finally, we ran SLERP with mergekit to create Marcoro14-7B-slerp and uploaded it to the Hugging Face Hub. We obtained excellent performance on two benchmark suites: Open LLM Leaderboard (best-performing 7B model) and NousResearch. If you want to create your own merges, I recommend using my automated notebook 🥱 LazyMergekit.

Another way of combining multiple models is to merge them in a Mixture of Experts (MoE) architecture. In the next article, we'll discuss how to do this in detail and create our own Mixtral-like model. If you liked this article, please follow me on Medium and Twitter @mlabonne.


