Introduction
In the era of big data and advanced artificial intelligence, language models have emerged as formidable tools capable of processing and generating human-like text. Large Language Models (LLMs) like ChatGPT are general-purpose bots capable of having conversations on many topics. However, LLMs can also be fine-tuned on domain-specific data, making them more accurate and on-point for enterprise-specific questions.
Many industries and applications require fine-tuned LLMs for various reasons:
- Improved performance from a chatbot trained on specific data
- Confidentiality concerns with using black-box models like OpenAI's ChatGPT
- Prohibitive API costs for large-scale applications
The challenge with fine-tuning an LLM is that, without optimizations, the computational resources required to train a billion-parameter model can be prohibitive. Fortunately, recent research has produced techniques that make it possible to fine-tune LLMs on smaller GPUs. In this blog, we explore these techniques using the Falcon 7B model and financial data on a Colab GPU.
QLoRA: Efficient Fine-Tuning with Quantization and Low-Rank Adaptation
QLoRA (Quantized Low-Rank Adaptation) combines quantization and low-rank adaptation to achieve efficient fine-tuning of AI models. It reduces the memory required for fine-tuning LLMs without a drop in performance. This method enables a 7 billion parameter model to be fine-tuned on a 16GB GPU, a 33 billion parameter model on a 24GB GPU, and a 65 billion parameter model on a 48GB GPU.
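To see why 4-bit storage makes such a difference, here is a back-of-envelope calculation of the memory needed just to hold the weights of a 7 billion parameter model (it ignores activations, gradients, and optimizer state, so real training needs headroom beyond these figures):

```python
# Weight-storage memory for a 7B-parameter model at different precisions
# (ignores activations, gradients, and optimizer state)
params = 7e9
print(f"fp32 : {params * 4 / 1e9:.1f} GB")    # ~28.0 GB
print(f"fp16 : {params * 2 / 1e9:.1f} GB")    # ~14.0 GB
print(f"4-bit: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB
```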
LoRA (Low-Rank Adaptation) involves injecting small sets of trainable parameters into each layer of the Transformer architecture while fine-tuning, greatly reducing the number of trainable parameters. Original model weights remain frozen, and only these adapters are updated during training, maintaining a small memory footprint.
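As a minimal sketch of what configuring such adapters looks like with HuggingFace's PEFT library (the rank, alpha, and dropout values below are illustrative choices; `query_key_value` is the name of Falcon's fused attention projection and a common LoRA target for this model):

```python
from peft import LoraConfig

# Illustrative hyperparameters; tune r (rank) and lora_alpha for your task
lora_config = LoraConfig(
    r=16,                                # rank of the low-rank update matrices
    lora_alpha=32,                       # scaling factor applied to the update
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```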
Quantization reduces the memory needed to store model weights by converting them to smaller data types such as 8-bit or 4-bit precision. In QLoRA, model weights are stored in a 4-bit data type (NormalFloat, or NF4), while all computations are performed in 16-bit floating point, maintaining accuracy while reducing memory usage.
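In code, this storage/compute split is expressed as a quantization config passed when the model is loaded. A sketch using transformers' BitsAndBytesConfig (this `model` is reused in the training sketch later):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # QLoRA's 4-bit NormalFloat type
    bnb_4bit_compute_dtype=torch.float16,  # run computations in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at release
)
```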
HuggingFace Support for Fine-Tuning
HuggingFace has released several libraries that facilitate the fine-tuning of LLMs:
- PEFT Library: Supports Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA.
- Quantization Support: Many models can be loaded in 8-bit and 4-bit precision via the bitsandbytes module.
- Accelerate Library: Provides features that reduce the memory requirements of models.
- Supervised Fine-Tuning Trainer (SFTTrainer): A trainer class in the TRL library for supervised fine-tuning of LLMs.
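Assuming a Colab-style environment, these libraries can be installed with pip (package names as of this writing):

```bash
pip install -q transformers accelerate peft bitsandbytes trl datasets
```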
Applying QLoRA: A Case Study with Falcon 7B on Alpaca Finance Dataset
To demonstrate the practical application of QLoRA, we fine-tuned the Falcon-7B model on a financial dataset using Google Colab. This hands-on example illustrates how the techniques discussed can be implemented in a real-world scenario.
Training Falcon 7B on Alpaca Finance Dataset
We successfully fine-tuned the Falcon-7B model on the Alpaca-Finance dataset, which consists of approximately 70K finance question-and-answer pairs. The dataset is available on HuggingFace's dataset hub and can be loaded directly from there.
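For example, the dataset can be pulled straight from the hub. The hub id below is an assumption for illustration; substitute the identifier of the Alpaca-Finance dataset you are using:

```python
from datasets import load_dataset

# Hub id assumed for illustration; replace with the actual dataset identifier
dataset = load_dataset("gbharti/finance-alpaca", split="train")
print(dataset[0])  # each row holds an instruction (question) and an output (answer)
```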
The training process involved:
- Loading the Pre-trained Model: We started by loading the pre-trained Falcon-7B model from HuggingFace with the AutoModelForCausalLM class, setting the storage type to 4-bit and the computation type to FP16 (as in the quantization snippet above).
- Creating Adapters: Adapters, extra trainable layers attached to Transformer modules, were created to hold our fine-tuned weights. They were attached to the linear layers and the attention query/key/value projections of the Transformer modules for the best accuracy.
- Initializing the SFTTrainer Class: We brought everything together by initializing the SFTTrainer class, providing the dataset, tokenizer, formatting function, and max_seq_length; a condensed sketch follows this list.
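Here is how these pieces fit together, reusing the `model`, `lora_config`, and `dataset` variables from the earlier snippets. The hyperparameters and prompt template are illustrative rather than the exact notebook settings, and the argument names follow the TRL API at the time of writing:

```python
from transformers import AutoTokenizer, TrainingArguments
from trl import SFTTrainer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer defines no pad token

def formatting_func(batch):
    # Turn each instruction/answer pair into a single training string
    return [
        f"### Question: {q}\n### Answer: {a}"
        for q, a in zip(batch["instruction"], batch["output"])
    ]

training_args = TrainingArguments(
    output_dir="falcon-7b-finance",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=100,
    fp16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,              # 4-bit Falcon-7B loaded earlier
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,  # LoRA adapter config from the earlier snippet
    tokenizer=tokenizer,
    formatting_func=formatting_func,
    max_seq_length=512,
)
trainer.train()
```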
The model trained for about an hour, and the training loss dropped significantly within the first 100 steps. The code for this fine-tuning process, including the necessary configurations, is detailed in the accompanying Colab Notebook.
Inference with Alpaca Finance Dataset
Post-training, the fine-tuned model was tested with various finance-related questions to assess its performance. When asked about the income needed to retire, the model generated a comprehensive response covering rules of thumb for 401K and other pension plans, emphasizing the importance of personalized financial planning. Similarly, questions about portfolio diversification yielded insightful answers, highlighting the model's capability to handle finance-specific inquiries effectively.
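A minimal generation sketch, assuming the fine-tuned `model` and `tokenizer` from above; the prompt mirrors the training template, and the sampling parameters are illustrative:

```python
prompt = "### Question: How much income do I need to retire comfortably?\n### Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```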
Conclusion
Fine-tuning LLMs on custom datasets has become more accessible, allowing businesses to create their own "private GPT" models hosted locally on commercial GPUs. These models offer a ChatGPT-like interface tailored to specific enterprise needs.