LLMs for Everybody: Running the HuggingFace Text Generation Inference in Google Colab | by Dmitrii Eliuseev | Jan, 2024

Experimenting with Large Language Models for Free (Part 3)

Image by Markus Spiske, Unsplash

In the first part of this story, we used a free Google Colab instance to run a Mistral-7B model and extract information with the FAISS (Facebook AI Similarity Search) vector database. In the second part, we used a LLaMA-13B model and the LangChain library to build a chat with text summarization and other features. In this part, I will show how to use the HuggingFace 🤗 Text Generation Inference (TGI). TGI is a toolkit that allows us to run a large language model (LLM) as a service. As in the previous parts, we will test it in a Google Colab instance, completely for free.

Text Generation Inference

Text Generation Inference (TGI) is a production-ready toolkit for deploying and serving large language models (LLMs). Running an LLM as a service allows us to use it with different clients, from Python notebooks to mobile apps (a minimal client sketch follows the list below). It is interesting to test TGI's functionality, but it turned out that its system requirements are quite high, and not everything works as smoothly as expected:

  • A free Google Colab instance provides only 12.7 GB of RAM, which is often not enough to load a 13B or even a 7B model "in one piece." The AutoModelForCausalLM class from HuggingFace lets us use "sharded" models that were split into smaller chunks. This works well in plain Python (see the loading sketch after this list), but for some reason the same functionality does not work in TGI, and the instance crashes with a "not enough memory" error.
  • VRAM size can be a second issue. In my tests with TGI v1.3.4, 8-bit quantization worked well with the bitsandbytes library, but 4-bit quantization (the bitsandbytes-nf4 option) did not. I specifically verified this in Colab Pro on a 40 GB NVIDIA A100 GPU: even with bitsandbytes-nf4 or bitsandbytes-fp4 enabled, the required VRAM size was 16.4 GB, which is too high for a free Colab instance (and even for Colab Pro users, the 40 GB NVIDIA A100 costs 2–4x more per hour than the 16 GB NVIDIA T4).
  • TGI needs Rust to be installed. A free Google Colab instance does not provide a full-fledged terminal, so a proper installation is also a challenge (see the setup sketch after this list).
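For comparison, here is a minimal sketch of how sharded loading works in plain Python with the transformers library. The model ID and quantization settings are illustrative, not a tested configuration, and bitsandbytes and accelerate need to be installed first:

```python
# A minimal sketch: loading a sharded model with AutoModelForCausalLM.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # illustrative model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the checkpoint shards across GPU and CPU as
# they are loaded, so the full model never has to fit into RAM at once;
# load_in_8bit=True enables bitsandbytes 8-bit quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)
```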
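Installing Rust and starting the server can still be attempted from notebook cells. The sketch below uses the standard rustup one-liner and TGI's text-generation-launcher; the model ID, port, and log file name are illustrative, and it assumes the launcher binary has already been built and is on the PATH:

```python
# Colab notebook cells (the "!" prefix runs a shell command).

# Install Rust non-interactively; rustup's -y flag is needed because
# a free Colab instance has no interactive terminal:
!curl -sSf https://sh.rustup.rs | sh -s -- -y

# Make cargo and rustc visible to later cells:
import os
os.environ["PATH"] += ":/root/.cargo/bin"

# Start TGI in the background with 8-bit bitsandbytes quantization
# (the 4-bit bitsandbytes-nf4 option did not work in my tests):
!nohup text-generation-launcher \
    --model-id mistralai/Mistral-7B-Instruct-v0.1 \
    --quantize bitsandbytes --port 8080 > tgi.log &
```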
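Once the server is up, any client can query it over HTTP. Here is a minimal client sketch using the InferenceClient class from the huggingface_hub library; the local endpoint URL and the prompt are illustrative:

```python
# A minimal client sketch, assuming a TGI server is already
# listening on localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient(model="http://127.0.0.1:8080")
answer = client.text_generation(
    "What is the distance from the Earth to the Moon?",
    max_new_tokens=128,
)
print(answer)
```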
