A Practical Guide: Fine-Tuning Large Language Models with HuggingFace
Co-authors: Srijith Rajamohan, Ahmed Salhin, Todd Cook, Josh Frazier
Every new release of Large Language Models (LLMs) tends to push performance to new heights, often surpassing previous benchmark results (e.g., Massive Multitask Language Understanding, or MMLU). This progress has sparked numerous applications built on the largest and most capable of these models. In our previous post, we discussed the scaling laws of LLMs and explained why larger models are better at predicting the next token.
However, the journey from prototyping an LLM demo to a functional production system is not without its challenges. User privacy and trust are paramount, especially when operating in the accounting and finance domain. Ensuring that LLMs enhance our applications and deliver value to our customers remains our top priority.
We have observed that model responses drift over time when using proprietary models such as GPT-4 and its variants. This is understandable because the model weights are updated, but it creates noticeable differences in inference behavior that can make the performance of downstream applications unstable. Additionally, we found it hard to obtain deterministic results even after setting every available random seed and choosing a greedy decoding strategy. Moreover, we realized that GPT-4 does not respond well to certain questions specific to our field and can generate hallucinated results.
To mitigate these issues, we decided to train our own accounting domain-specific models, and we worked with the Amazon AWS team to leverage their infrastructure and expertise to provide state-of-the-art training and inference compute resources. However, a key challenge in fine-tuning LLMs is their high GPU memory consumption. In the following section, we explore model parallelism and data parallelism, two strategies commonly used to address this issue.
Model Parallelism and Data Parallelism
When training an LLM, the choice between model parallelism (MP) and data parallelism (DP) depends on whether the model fits into a single GPU. MP is used when the model's weights are too large for a single GPU; it shards the model weights across multiple GPUs. DP, on the other hand, shards the data across several GPUs, with each GPU holding a complete copy of the model weights.
In the case of DP, PyTorch offers the Distributed Data Parallel (DDP) feature, which wraps the model object and allows you to launch training via its torchrun launcher utility.
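As a minimal sketch (not from the original post), wrapping a placeholder model in DDP and launching it with torchrun might look like the following; the model is a stand-in and the script name is hypothetical.

## minimal_ddp.py -- illustrative DDP sketch with a placeholder model
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler and train as usual ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

This script would then be launched with something like torchrun --nproc_per_node=8 minimal_ddp.py, spawning one process per GPU.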
In the case of MP, PyTorch natively supports Fully Sharded Data Parallel (FSDP), which shards model weights across GPUs. Additionally, DeepSpeed, developed by Microsoft, is another popular library for MP that we used in our previous post.
As you might have noticed, using these parallelism techniques requires modifications to your code. To simplify this process, HuggingFace offers the accelerate package, which handles the complex wrapping functions and GPU device placement. This makes it much easier to use these parallelism techniques without writing your own parallelization boilerplate.
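For illustration, here is a minimal sketch (not from the original post) of how accelerate wraps a model, optimizer, and dataloader; the model and data below are placeholders.

## Illustrative accelerate training loop with a placeholder model and random data
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(1024, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.randn(64, 1024), torch.randint(0, 2, (64,))
    ),
    batch_size=8,
)

## accelerate handles device placement and the DDP/FSDP/DeepSpeed wrapping
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()

The same script can then be launched with accelerate launch, using a config selected via accelerate config or passed with --config_file, as we do later in this post.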
Prepare the Dataset for Instruction Fine-Tuning
First, we curate our dataset into question-answer pairs. Because LLMs operate in a token-in, token-out fashion, the text in each pair is converted into tokens using the tokenizer object. For instruction-tuned models, the pairs must conform to the instruction format that is unique to each model. We can use the tokenizer.apply_chat_template() function to achieve that.
def apply_chat_template(
    datapoint,
    tokenizer,
    prefix="",
):
    ## Format a training example: the user question plus the labelled answer
    messages = [
        {"role": "user", "content": f"""{prefix}: [{datapoint["input"]}]"""},
        {"role": "assistant", "content": datapoint["output"]},
    ]
    datapoint["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return datapoint


def apply_chat_template_test(
    datapoint,
    tokenizer,
    prefix="",
):
    ## Format a test example: only the user question; the assistant turn is
    ## left open (add_generation_prompt=True) so the model generates the answer
    print("using chat template prompting.")
    messages = [
        {"role": "user", "content": f"""{prefix}: [{datapoint["input"]}]"""},
    ]
    datapoint["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return datapoint
Here is the complete code snippet for preparing the dataset.
from sklearn.model_selection import train_test_split
from datasets import Dataset
import pandas as pd

df = pd.read_csv(data_path)

## split the dataset into train and test
X_train, X_test = train_test_split(
    df, test_size=0.00001, random_state=42
)

## reformat the input and output to conform with the instruction format
X_train = pd.DataFrame(
    X_train.apply(
        lambda row: apply_chat_template(row, tokenizer, prefix),
        axis=1,
    ),
    columns=["text"],
)
X_test = pd.DataFrame(
    X_test.apply(
        lambda row: apply_chat_template_test(row, tokenizer, prefix),
        axis=1,
    ),
    columns=["text"],
)

## load the data into Dataset objects
train_data = Dataset.from_pandas(X_train)
test_data = Dataset.from_pandas(X_test)
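Before moving on, it can help to sanity-check the formatting; a quick inspection (our own addition, not part of the original snippet) prints one formatted sample to confirm the chat template was applied as expected.

## Inspect one formatted training sample to verify the chat template output
print(train_data[0]["text"])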
Model Training
The LLM is trained in a supervised fashion because we provide the labelled answers to the questions. To enable this training process, HuggingFace provides the SFTTrainer() object.
Load pre-trained model and tokenizer
First, we load the pre-trained model and tokenizer using HuggingFace's AutoModelForCausalLM and AutoTokenizer objects. Some models (e.g., the Phi-3 series) require trust_remote_code=True, so we set this argument to True by default.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
import torch

## Load the model
def load_model(
    base_model,
):
    return AutoModelForCausalLM.from_pretrained(
        base_model,
        trust_remote_code=True,
    )

## Load the tokenizer
def load_tokenizer(
    base_model,
):
    return AutoTokenizer.from_pretrained(
        base_model,
        trust_remote_code=True,
    )
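As a usage sketch, these helpers take the Hugging Face model ID of your chosen base model; the ID below is only an example and not necessarily the model we trained.

## Example usage; replace the model ID with the base model you are fine-tuning
base_model = "microsoft/Phi-3-mini-4k-instruct"
model = load_model(base_model)
tokenizer = load_tokenizer(base_model)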
Before starting training, we set a seed for the process to ensure reproducibility.
from transformers import set_seed
## Set seed for reproducibility
set_seed(123)
LoRA Fine-Tuning
Training LLMs from scratch requires enormous compute and data (e.g., Llama 3.1 405B was trained on roughly 15 trillion tokens using about 16,000 H100 GPUs). Fortunately, fine-tuning LLMs requires far less compute, especially when only the linear layers are fine-tuned.
LoRA takes advantage of low-rank matrix approximation, in which a high-dimensional matrix is approximated by the product of two smaller matrices (e.g., of rank 1), reducing the number of updated parameters from 25 to 10 in the example shown in Fig 1. This approach is rooted in the mathematical technique of Singular Value Decomposition (SVD). By applying it to the weight matrices in the attention layers of a transformer model, LoRA greatly reduces the number of parameters that need to be trained.
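To make the parameter arithmetic concrete, here is a small illustrative sketch (our own example, mirroring the numbers in Fig 1): updating a 5 x 5 weight matrix directly touches 25 values, while a rank-1 factorization trains two 5 x 1 factors, i.e., only 10 values.

import torch

d, r = 5, 1                    # weight dimension and LoRA rank
W = torch.zeros(d, d)          # frozen pre-trained weight (placeholder)
A = torch.randn(d, r) * 0.01   # trainable low-rank factor
B = torch.zeros(r, d)          # second factor, initialized to zero so the initial update is zero
delta_W = A @ B                # rank-r update applied as W + delta_W

full_params = d * d            # 25 parameters to update the full matrix
lora_params = d * r + r * d    # 10 parameters with the rank-1 factorization
print(full_params, lora_params)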
The peft package provides the prepare_model_for_kbit_training function and the LoraConfig object, which allow you to add tunable weights to targeted linear layers while freezing the weights of the other layers.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

def add_lora_layers(
    model,
    lora_alpha=<an integer>,
    r=<an integer>,
    target_modules="all-linear",
):
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
    )
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=<a floating number>,
        r=r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=target_modules,
    )
    return get_peft_model(model, peft_config), peft_config
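As a usage sketch (our own illustration, with placeholder hyperparameter values rather than the ones we actually used), you can attach the LoRA layers and confirm how small the trainable fraction is:

## Example usage with illustrative hyperparameter values
model, peft_config = add_lora_layers(
    model,
    lora_alpha=16,
    r=8,
    target_modules="all-linear",
)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts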
Define model hyperparameters
We need to define the hyper-parameters for training the model. We choose the cosine learning rate schedule, which provides a smooth decay. You should carefully select the per-device batch size and the number of gradient accumulation steps to avoid exhausting your GPU memory. The effective batch size is the per-device batch size multiplied by the number of gradient accumulation steps and the number of GPUs in a cluster (node); for example, Nvidia packs and ships 8 A100 GPUs in a single node for cloud service providers (CSPs), as illustrated in the quick calculation below. Additionally, we save the top three checkpoints that achieve the lowest cross-entropy loss.
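As a quick illustration (the numbers here are ours, not the values we trained with), a per-device batch size of 4 with 8 gradient accumulation steps on an 8-GPU node gives an effective batch size of 4 x 8 x 8 = 256.

## Effective batch size = per-device batch size * gradient accumulation steps * number of GPUs
per_device_train_batch_size = 4  # illustrative value
gradient_accumulation_steps = 8  # illustrative value
num_gpus = 8                     # e.g., one node with 8 A100s
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)      # 256

The full set of training arguments is shown below.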
from transformers import EarlyStoppingCallback, TrainingArguments, trainer_utils
import torch

## bf16 is supported on GPUs with compute capability >= 8.0 (e.g., A100)
major, _ = torch.cuda.get_device_capability()

## Set up training hyperparameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=<an integer>,
    per_device_train_batch_size=<an integer>,
    per_device_eval_batch_size=<an integer>,
    gradient_accumulation_steps=<an integer>,
    optim="paged_adamw_32bit",
    save_steps=<an integer>,
    logging_steps=<an integer>,
    learning_rate=<a floating number>,
    weight_decay=<a floating number>,
    fp16=False if major >= 8 else True,
    bf16=True if major >= 8 else False,
    max_grad_norm=<a floating number>,
    warmup_ratio=<a floating number>,
    group_by_length=True,
    lr_scheduler_type="cosine",
    gradient_checkpointing_kwargs={"use_reentrant": False},
    disable_tqdm=False,
    resume_from_checkpoint=True,
    seed=123,
    log_level="info",
    remove_unused_columns=True,
    ...
)
Apply SFTTrainer Object
Next, we initiate the training by instantiating the SFTTrainer() object, passing it the training dataset along with the hyper-parameters. We also incorporate a callback to end training early if the loss fails to decrease for more than our defined patience.
from trl import SFTTrainer

## Set up the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
    callbacks=[
        EarlyStoppingCallback(
            early_stopping_patience=<an integer>,
            early_stopping_threshold=<a floating number>,
        ),
    ],
)
Log model metrics
Conveniently, we can log the hyper-parameters and loss metrics using Python's logging module.
import datasets
import logging
import sys
import transformers

logger = logging.getLogger(__name__)

## Set up logging
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = training_arguments.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
Finally, we log metrics for both the train and test datasets, including the loss and the hyper-parameters used during training.
resume_from_checkpoint = True
if resume_from_checkpoint:
    print("Continue training from the last checkpoint.")
    train_result = trainer.train(
        resume_from_checkpoint=resume_from_checkpoint
    )
else:
    train_result = trainer.train()

metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

## Evaluation
tokenizer.padding_side = "left"
metrics = trainer.evaluate()
metrics["eval_samples"] = len(test_data)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
Launch with Accelerate and DeepSpeed Config
We can launch the training job from the command line by putting all of the code up to this point into a Python script. The DeepSpeed config is passed via the --config_file flag. Please ensure that gradient_accumulation_steps and gradient_clipping match the values defined in the TrainingArguments. num_machines defines the number of nodes used, and num_processes defines the number of GPUs in the node.
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: <an integer>
  gradient_clipping: <a floating number>
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file "deepspeed_stage3.yaml" <your training script containing all of the code up to this point>
Trade-offs: Precision and Speed
We want to end this post by highlighting another important trade-off to consider: precision vs. speed. Fig 2 shows the relationship between different data types and compute speeds, measured in TFLOPS, as detailed in Nvidia's A100 white paper. For a deeper understanding of compute speed, you can refer to the appendix in our earlier blog post.
Different data types offer varying exponent ranges and precisions. Additionally, different GPUs, such as the V100 and A100, support different data types, so it is important to check which data types your GPUs support. For instance, according to Fig 2 above, FP16 provides 8x the throughput of FP32 in operations per second (compute speed), albeit with a reduced exponent range and precision. We suggest carefully reviewing both the PyTorch CUDA notes and your GPU's white paper, and experimenting with your data to find the right balance between precision and speed. For example, the following code snippet enables the use of TF32 for matrix multiplication.
## https://pytorch.org/docs/stable/notes/cuda.html#tf32-on-ampere
## The flag below controls whether to allow TF32 on matmul. This flag defaults to False
## in PyTorch 1.12 and later.
torch.backends.cuda.matmul.allow_tf32 = True
## The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = True
Conclusion
In this post, we demonstrated how to fine-tune LLMs using MP techniques with tools provided by HuggingFace. By sharing these experiences, we hope you can train your own LLMs and own the weights. Please keep in mind that even fine-tuned LLMs can produce hallucinated responses; while fine-tuning can reduce the chance of hallucination, it does not eliminate it. Developers and scientists should design AI products with the understanding that LLMs still generate the most likely tokens based on the preceding context and can produce ungrounded or non-factual responses. Despite these limitations, small, domain-specific language models are preferable because they can be optimized to understand the jargon of a particular field and have lower inference costs than large, closed-source language models.