A Practical Guide: Fine-Tuning Large Language Models with HuggingFace
Co-authors: Srijith Rajamohan, Ahmed Salhin, Todd Cook, Josh Frazier
Every new release of Large Language Models (LLMs) tends to push performance to new heights, often surpassing previous benchmark results (e.g., Massive Multitask Language Understanding, or MMLU). This progress has sparked numerous applications built on the largest and most capable of these models. In our previous post, we discussed the scaling laws of LLMs and explained why larger models are better at predicting the next token.
However, the journey from prototyping an LLM demo to a functional production system is not without its challenges. User privacy and trust are paramount, especially when operating in the accounting and finance domain. Ensuring that LLMs enhance our applications and deliver value to our customers remains our top priority.
We have observed that model responses drift over time when using proprietary models such as GPT-4 and its variants. This is understandable because the model weights are updated, but it creates noticeable differences in inference behavior that can make the performance of downstream applications unstable. Additionally, we found it hard to obtain deterministic results even after setting every available random seed and choosing a greedy decoding strategy. Moreover, we realized that GPT-4 does not respond well to certain questions specific to our field and can generate hallucinated results.
To mitigate these issues, we decided to train our own accounting domain-specific models, and we worked with the Amazon AWS team to leverage their infrastructure and expertise to provide state-of-the-art training and inference compute resources. However, a key challenge in fine-tuning LLMs is their high GPU memory consumption. In the following section, we explore model parallelism and data parallelism, two strategies commonly used to address this issue.
Model Parallelism and Data Parallelism
When training an LLM, the choice between model parallelism (MP) and data parallelism (DP) depends on whether the model fits into a single GPU. MP is used when the model's weights are too large for a single GPU; it shards the model weights across multiple GPUs. DP, on the other hand, shards the data across several GPUs, with each GPU holding a complete copy of the model weights.
In the case of DP, PyTorch offers the Distributed Data Parallel (DDP) feature, which wraps the model object and allows you to launch training via its torchrun launcher utility.
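As a minimal sketch (not from the original post), wrapping a placeholder model in DDP and launching it with torchrun might look like the following; the model is a stand-in and the script name is hypothetical.

## minimal_ddp.py -- illustrative DDP sketch with a placeholder model
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

This script would then be launched with something like torchrun --nproc_per_node=8 minimal_ddp.py, spawning one process per GPU.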
In the case of MP, PyTorch natively supports Fully Sharded Data Parallel (FSDP), which shards model weights across GPUs. Additionally, DeepSpeed, developed by Microsoft, is another popular library for MP that we used in our previous post.
As you might have noticed, using these parallelism techniques requires modifications to your code. To simplify this process, HuggingFace offers the accelerate package, which handles the complex wrapping functions and GPU device placement. This makes it much easier to use these parallelism techniques without writing your own parallelization boilerplate.
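For illustration, here is a minimal sketch (not from the original post) of how accelerate wraps a model, optimizer, and dataloader; the model and data below are placeholders.

## Illustrative accelerate training loop with a placeholder model and random data
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(1024, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.randn(64, 1024), torch.randint(0, 2, (64,))
    ),
    batch_size=8,
)

## accelerate handles device placement and the DDP/FSDP/DeepSpeed wrapping
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()

The same script can then be launched with accelerate launch, using a config selected via accelerate config or passed with --config_file, as we do later in this post.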
Prepare the Dataset for Instruction Fine-Tuning
First, we curate our dataset into question-answer pairs. Because LLMs operate in a token-in, token-out fashion, the text in each pair is converted into tokens using the tokenizer object. For instruction-tuned models, the pairs must conform to the instruction format that is unique to each model. We can use the tokenizer.apply_chat_template() function to achieve that.
def apply_chat_template(
    datapoint,
    tokenizer,
    prefix="",
):
    ## Format a training example: the user question plus the labelled answer
    messages = [
        {"role": "user", "content": f"""{prefix}: [{datapoint["input"]}]"""},
        {"role": "assistant", "content": datapoint["output"]},
    ]
    datapoint["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return datapoint


def apply_chat_template_test(
    datapoint,
    tokenizer,
    prefix="",
):
    ## Format a test example: only the user question; the assistant turn is
    ## left open (add_generation_prompt=True) so the model generates the answer
    print("using chat template prompting.")
    messages = [
        {"role": "user", "content": f"""{prefix}: [{datapoint["input"]}]"""},
    ]
    datapoint["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return datapoint
Here is the complete code snippet for preparing the dataset.
from sklearn.model_selection import train_test_split
from datasets import Dataset
import pandas as pd

df = pd.read_csv(data_path)

## split the dataset into train and test
X_train, X_test = train_test_split(
    df, test_size=0.00001, random_state=42
)

## reformat the input and output to conform with the instruction format
X_train = pd.DataFrame(
    X_train.apply(
        lambda row: apply_chat_template(row, tokenizer, prefix),
        axis=1,
    ),
    columns=["text"],
)
X_test = pd.DataFrame(
    X_test.apply(
        lambda row: apply_chat_template_test(row, tokenizer, prefix),
        axis=1,
    ),
    columns=["text"],
)

## load the data into Dataset objects
train_data = Dataset.from_pandas(X_train)
test_data = Dataset.from_pandas(X_test)
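Before moving on, it can help to sanity-check the formatting; a quick inspection (our own addition, not part of the original snippet) prints one formatted sample to confirm the chat template was applied as expected.

## Inspect one formatted training sample to verify the chat template output
print(train_data[0]["text"])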
Model Training
The LLM is trained in a supervised fashion because we provide the labelled answers to the questions. To enable this training process, HuggingFace provides the SFTTrainer() object.
Load pre-trained model and tokenizer
First, we load the pre-trained model and tokenizer using HuggingFace's AutoModelForCausalLM and AutoTokenizer objects. Some models (e.g., the Phi-3 series) require trust_remote_code=True, so we set this argument to True by default.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
import torch

## Load the model
def load_model(
    base_model,
):
    return AutoModelForCausalLM.from_pretrained(
        base_model,
        trust_remote_code=True,
    )

## Load the tokenizer
def load_tokenizer(
    base_model,
):
    return AutoTokenizer.from_pretrained(
        base_model,
        trust_remote_code=True,
    )
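As a usage sketch, these helpers take the Hugging Face model ID of your chosen base model; the ID below is only an example and not necessarily the model we trained.

## Example usage; replace the model ID with the base model you are fine-tuning
base_model = "microsoft/Phi-3-mini-4k-instruct"
model = load_model(base_model)
tokenizer = load_tokenizer(base_model)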
Before starting training, we set a seed for the process to ensure reproducibility.
from transformers import set_seed
## Set seed for reproducibility
set_seed(123)
LoRA Fine-Tuning
Training LLMs from scratch requires enormous compute and data (e.g., Llama 3.1 405B was trained on roughly 15 trillion tokens using about 16,000 H100 GPUs). Fortunately, fine-tuning LLMs requires far less compute, especially when only the linear layers are fine-tuned.
LoRA takes advantage of low-rank matrix approximation, in which a high-dimensional matrix is approximated by the product of two smaller matrices (e.g., of rank 1), reducing the number of updated parameters from 25 to 10 in the example shown in Fig 1. This approach is rooted in the mathematical technique of Singular Value Decomposition (SVD). By applying it to the weight matrices in the attention layers of a transformer model, LoRA greatly reduces the number of parameters that need to be trained.
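To make the parameter arithmetic concrete, here is a small illustrative sketch (our own example, mirroring the numbers in Fig 1): updating a 5 x 5 weight matrix directly touches 25 values, while a rank-1 factorization trains two 5 x 1 factors, i.e., only 10 values.

import torch

d, r = 5, 1                    # weight dimension and LoRA rank
W = torch.zeros(d, d)          # frozen pre-trained weight (placeholder)
A = torch.randn(d, r) * 0.01   # trainable low-rank factor
B = torch.zeros(r, d)          # second factor, initialized to zero so the initial update is zero
delta_W = A @ B                # rank-r update applied as W + delta_W

full_params = d * d            # 25 parameters to update the full matrix
lora_params = d * r + r * d    # 10 parameters with the rank-1 factorization
print(full_params, lora_params)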
The peft package provides the prepare_model_for_kbit_training function and the LoraConfig object, which allow you to add tunable weights to targeted linear layers while freezing the weights of the other layers.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def add_lora_layers(
    model,
    lora_alpha=<an integer>,
    r=<an integer>,
    target_modules="all-linear",
):
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    )
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=<a floating number>,
        r=r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=target_modules,
    )
    return get_peft_model(model, peft_config), peft_config
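As a usage sketch (our own illustration, with placeholder hyperparameter values rather than the ones we actually used), you can attach the LoRA layers and confirm how small the trainable fraction is:

## Example usage with illustrative hyperparameter values
model, peft_config = add_lora_layers(
    model,
    lora_alpha=16,
    r=8,
    target_modules="all-linear",
)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts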
Define model hyperparameters
We need to define the hyper-parameters for training the model. We choose the cosine learning rate schedule, which provides a smooth decay. You should carefully select the per-device batch size and the number of gradient accumulation steps to avoid exhausting your GPU memory. The effective batch size is the per-device batch size multiplied by the number of gradient accumulation steps and the number of GPUs in a cluster (node); for example, Nvidia packs and ships 8 A100 GPUs in a single node for cloud service providers (CSPs), as illustrated in the quick calculation below. Additionally, we save the top three checkpoints that achieve the lowest cross-entropy loss.
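As a quick illustration (the numbers here are ours, not the values we trained with), a per-device batch size of 4 with 8 gradient accumulation steps on an 8-GPU node gives an effective batch size of 4 x 8 x 8 = 256.

## Effective batch size = per-device batch size * gradient accumulation steps * number of GPUs
per_device_train_batch_size = 4  # illustrative value
gradient_accumulation_steps = 8  # illustrative value
num_gpus = 8                     # e.g., one node with 8 A100s
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)      # 256

The full set of training arguments is shown below.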
from transformers import EarlyStoppingCallback, TrainingArguments, trainer_utils
import torch

## bf16 is supported on GPUs with compute capability >= 8.0 (e.g., A100)
major, _ = torch.cuda.get_device_capability()

## Set up training hyperparameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=<an integer>,
    per_device_train_batch_size=<an integer>,
    per_device_eval_batch_size=<an integer>,
    gradient_accumulation_steps=<an integer>,
    optim="paged_adamw_32bit",
    save_steps=<an integer>,
    logging_steps=<an integer>,
    learning_rate=<a floating number>,
    weight_decay=<a floating number>,
    fp16=False if major >= 8 else True,
    bf16=True if major >= 8 else False,
    max_grad_norm=<a floating number>,
    warmup_ratio=<a floating number>,
    group_by_length=True,
    lr_scheduler_type="cosine",
    gradient_checkpointing_kwargs={"use_reentrant": False},
    disable_tqdm=False,
    resume_from_checkpoint=True,
    seed=123,
    log_level="info",
    remove_unused_columns=True,
    ...
)
Apply SFTTrainer Object
Next, we initiate the training by instantiating the SFTTrainer() object, passing it the training dataset along with the hyper-parameters. We also incorporate a callback to end training early if the loss fails to decrease for more than our defined patience.
from trl import SFTTrainer

## Set up the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=<an integer>,
            early_stopping_threshold=<a floating number>,
        ),
    ],
)
Log model metrics
Conveniently, we can log the hyper-parameters and loss metrics using Python's logging module.
import datasets
import logging
import sys
import transformers

logger = logging.getLogger(__name__)

## Set up logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = training_arguments.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
Finally, we log metrics for both the train and test datasets, including the loss and the hyper-parameters used during training.
resume_from_checkpoint = True
if resume_from_checkpoint:
    print("Continue training from the last checkpoint.")
    train_result = trainer.train(
        resume_from_checkpoint=resume_from_checkpoint
    )
else:
    train_result = trainer.train()

metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

## Evaluation
tokenizer.padding_side = "left"
metrics = trainer.evaluate()
metrics["eval_samples"] = len(test_data)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
Launch with Accelerate and DeepSpeed Config
We can launch the training job from the command line by putting all of the code up to this point into a Python script. The DeepSpeed config is passed via the --config_file flag. Please ensure that gradient_accumulation_steps and gradient_clipping match the values defined in the TrainingArguments. num_machines defines the number of nodes used, and num_processes defines the number of GPUs in the node.
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: <an integer>
  gradient_clipping: <a floating number>
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file "deepspeed_stage3.yaml" <your training script containing all of the code up to this point>
Trade-offs: Precision and Speed
We want to end this post by highlighting another important trade-off to consider: precision vs. speed. Fig 2 shows the relationship between different data types and compute speeds, measured in TFLOPS, as detailed in Nvidia's A100 white paper. For a deeper understanding of compute speed, you can refer to the appendix in our earlier blog post.
Different data types offer varying exponent ranges and precisions. Additionally, different GPUs, such as the V100 and A100, support different data types, so it is important to check which data types your GPUs support. For instance, according to Fig 2 above, FP16 provides 8x the throughput of FP32 in operations per second (compute speed), albeit with a reduced exponent range and precision. We suggest carefully reviewing both the PyTorch CUDA notes and your GPU's white paper, and experimenting with your data to find the right balance between precision and speed. For example, the following code snippet enables the use of TF32 for matrix multiplication.
## https://pytorch.org/docs/stable/notes/cuda.html#tf32-on-ampere
## The flag below controls whether to allow TF32 on matmul. This flag defaults to False
## in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True
## The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True
Conclusion
In this post, we demonstrated how to fine-tune LLMs using MP techniques with tools provided by HuggingFace. By sharing these experiences, we hope you can train your own LLMs and own the weights. Please keep in mind that even fine-tuned LLMs can produce hallucinated responses; while fine-tuning can reduce the chance of hallucination, it does not eliminate it. Developers and scientists should design AI products with the understanding that LLMs still generate the most likely tokens based on the preceding context and can produce ungrounded or non-factual responses. Despite these limitations, small, domain-specific language models are preferable because they can be optimized to understand the jargon of a particular field and have lower inference costs than large, closed-source language models.