Finetuning an LLM to do classification

Introduction

Many researchers are using large language models (LLMs) to automatically classify and label text data. While models like OpenAI's GPT-4 have shown impressive capabilities in such tasks, they often come with significant drawbacks: high API costs and suboptimal performance on domain-specific tasks. For this reason it may be useful to finetune a smaller model on your custom dataset and use it for your classification task.

In this post, we’ll walk through the process of finetuning a Llama model for a custom classification task. We will be using the open-source package unsloth to finetune the model. Unsloth is a good choice due to its speed and ease of use, but it requires a GPU and usually internet access, which might make it difficult to run on an HPC cluster.

Setting up unsloth (and an environment)

Let’s start by setting up an environment with unsloth. The HPC I have access to has no internet access on its GPU nodes and no conda installation, so I will be using a pip installation of unsloth. First, on a node with internet access, set up a virtual environment and install unsloth:

python -m venv unsloth_env
source unsloth_env/bin/activate
pip install unsloth

Let’s also install some other useful packages here:

pip install datasets
pip install trl
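
If you can grab a short interactive session on a GPU node, a quick check like the following confirms that the key packages import and that a GPU is visible (unsloth itself will not run without one); this snippet is just an illustrative sanity check, not part of the original setup:

import torch
import datasets
import trl

print(torch.__version__, torch.cuda.is_available())  # should print True on a GPU node
print(datasets.__version__, trl.__version__)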

Downloading a model

With unsloth installed, the next step is to download the model we want to finetune. For me this was quite a bit of work: unsloth can only run on GPU nodes, so its built-in download (the model is fetched automatically when you load it – see below) was not an option on my CPU-only node with internet access. This may be a fringe problem, so feel free to skip this part and just let unsloth download the model when loading it.

Unable to use unsloth to download the model directly, I turned to the Hugging Face Hub, which unfortunately also failed because the huggingface_hub package downloads the model in a way that is not compatible with unsloth.

Finally, I decided to simply clone the model repository and use git-lfs to download the model files. Git-lfs is a tool that lets you download large files from a repository; however, due to the lack of admin access on the HPC I was not able to install it there, so I resorted to downloading the files to my local machine and then uploading them to the HPC.

git lfs install
git clone https://huggingface.co/unsloth/Llama-3.2-3B-bnb-4bit

Then upload the files to the HPC, e.g. using rsync:

rsync -avz --progress Llama-3.2-3B-bnb-4bit/ your_username@your_hpc.com:~/path/to/your/project/

The finetuning data

Now that we have the model and unsloth ready on the server, we need to get our finetuning data ready. In my case, the task is to detect whether a specific Reddit user might be interested in a subreddit that is suggested to them, based on all of their previous posts.

I prepared the users' expressed interest and their activity as a Hugging Face dataset; it looks something like this:

from datasets import load_from_disk

ds = load_from_disk("users_interest")
print(ds['train'])
# Dataset({
#     features: ['body', 'label'],
#     num_rows: 1614219
# })

The individual features are body, which contains all of the user's posts, and label, which encodes the user's expressed interest in the community as a 1 or 0. body meanwhile looks like this:

<start_of_post> First post of the user.<end_of_post> <start_of_post>Second post<end_of_post>
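
For illustration, a string in this format could be produced from a list of post texts with a small helper; posts and build_body below are hypothetical and only show the general idea:

def build_body(posts):
    # Hypothetical helper: wrap each post in the markers shown above and join them
    return " ".join(f"<start_of_post>{post}<end_of_post>" for post in posts)

posts = ["First post of the user.", "Second post"]
print(build_body(posts))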

This data is clearly not yet in the format we need to finetune our LLM. LLMs typically expect a chat structure, often modeled on the OpenAI API message format. We can convert our data as follows:

def relabel(label):
    if label == 0:
        return "no interest"
    if label == 1:
        return "interest"

def make_conversation(text, label):
    return {
        'conversation': [
            {'role': 'user',
             'content': f"""Determine if these posts and comments belong to a user that is interested in the new community or not:
              {text}
              """
             },
            {'role': 'assistant',
             'content': relabel(label)}
        ]
    }

ds = ds.map(lambda x: make_conversation(x['body'], x['label']))

This will restructure the data so that each example has the conversation feature.
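
To sanity-check the conversion, it helps to look at a single converted example; it should contain one user message with the posts and one assistant message that is either "interest" or "no interest":

# Inspect the first converted example
print(ds['train'][0]['conversation'])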

We will need to do some more restructuring to prepare the exact format needed by our model, but for this we need the model's custom tokenizer. We can use unsloth to help us. Let’s load the model and the tokenizer:

from unsloth import FastLanguageModel, is_bfloat16_supported
from unsloth.chat_templates import get_chat_template

max_seq_length = 8192 # we will reuse this below when setting up the trainer

model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "your/model/path/model_path",
        max_seq_length = max_seq_length,
        dtype = None,
        load_in_4bit = True,
        local_files_only = True, # ensures that the model loads without internet access
        device_map = "auto"
    )

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
    )

The code does three things: first, it loads the model and the tokenizer from the local model files. Second, it wraps the model in a PEFT (LoRA) adapter for training. Last, it loads the chat template needed for the model. Since we use Llama 3.2 we can use the Llama 3.1 chat template (it's the same).
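
If you want to see how small the trainable part of the model actually is, the object returned by get_peft_model normally exposes print_trainable_parameters (assuming it is a standard PEFT wrapper, which it usually is):

# Should report only a small fraction of parameters as trainable (the LoRA adapters)
model.print_trainable_parameters()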

With the model and tokenizer loaded, we can apply the last step for the formatting:

def formatting_prompts_func(examples, tokenizer):
    convos = examples["conversation"]
    texts = [tokenizer.apply_chat_template(convo,
                tokenize = False,
                add_generation_prompt = False
                )
                for convo in convos
            ]
    return { "text" : texts, }

ds = ds.map(formatting_prompts_func,
    batched = True,
    fn_kwargs = {"tokenizer" : tokenizer}
)
train = ds['train']

This creates a field called text, which contains each conversation rendered with the chat template for the Llama 3.2 model. We then extract only the train split of the dataset.
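
It is worth printing one formatted example to confirm the template was applied; with the Llama 3.1 template the text should contain special tokens along the lines of <|start_header_id|>user<|end_header_id|> and <|eot_id|> around the messages:

# Inspect one fully formatted training example
print(train[0]['text'])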

Train

Training is quite easy. We will use Hugging Face’s TRL library for training:

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = True, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 30,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

trainer.train()

Thanks to unsloth, the training should be relatively fast. Finally, save the model:

model.save_pretrained_merged(
    "llama3-2-3b-interest-finetuned",
    tokenizer,
    save_method = "merged_16bit",
    )
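
As a quick check before moving on to inference, you can list the output directory; the exact file names depend on the library versions, but you should see a model config, tokenizer files, and safetensors weight shards:

import os

# List the files written by save_pretrained_merged
print(sorted(os.listdir("llama3-2-3b-interest-finetuned")))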

Inference

With our newly trained model we might want to actually do some inference. Unsloth allows this directly, but personally I wanted to use vllm, which has a nice API for inference with finetuned models and makes it easy to use multiple GPUs.

First, make sure you have vllm installed. If not, you can install it via pip:

pip install vllm

Now, let’s see how we can use vllm to run inference. We’ll need to load the model and then use the LLM class from vllm to generate text.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from unsloth.chat_templates import get_chat_template


model_path = "llama3-2-3b-interest-finetuned"

llm = LLM(model=model_path) # use tensor_parallel_size=2 for 2 GPUs

sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=5)
# We use temperature 0 for reproducibility.


# It's crucial to format the prompt exactly as it was during training.
user_posts_example = "<start_of_post> I love hiking in the mountains! <end_of_post> <start_of_post> Just bought a new tent. <end_of_post>"

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1") 

# Format the prompt using the tokenizer
conversation = [
    {"role": "user", "content": f"""Determine if these posts and comments belong to a user that is interested in the new community or not:
              {user_posts_example}
              """},
]

formatted_prompt = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True
)

# Run inference
outputs = llm.generate(formatted_prompt, sampling_params)

# Print the outputs
for output in outputs:
    generated_text = output.outputs[0].text.strip()
    print(f"User posts: {user_posts_example}")
    print(f"Model prediction: {generated_text}")
    # The generated_text will be "interest" or "no interest"

This script initializes vllm with your finetuned model, sets up sampling parameters, formats an example prompt according to the Llama 3.1 chat template, and then generates the classification. Remember that the prompt structure, especially the roles (user, assistant) and any system messages, must mirror what the model saw during finetuning for optimal performance.
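
In practice you will usually want to classify many users at once. vllm's generate accepts a list of prompts and returns the outputs in the same order, so a batched version could look like this sketch (user_bodies is a hypothetical list of body strings, reusing the tokenizer, llm, and sampling_params objects from above):

# Hypothetical list of body strings, one per user
user_bodies = [
    user_posts_example,
    "<start_of_post> Thinking about selling my old climbing gear. <end_of_post>",
]

prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user",
          "content": f"Determine if these posts and comments belong to a user that is interested in the new community or not:\n{body}"}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for body in user_bodies
]

# vllm batches the requests internally and returns one output per prompt, in order
outputs = llm.generate(prompts, sampling_params)
predictions = [o.outputs[0].text.strip() for o in outputs]
print(predictions)  # each entry should be "interest" or "no interest"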