TL;DR: By fine-tuning a small Llama model (Llama 3.2 1B Instruct) on a Google Colab T4 GPU for about 1 hour and 40 minutes, we achieved 95% accuracy on a news article classification task, outperforming GPT-4's score of 87.2%. At just 1 billion parameters, the model is small enough to deploy in resource-constrained environments. This tutorial walks you through the process, demonstrating how you can leverage open-source models to build efficient, high-performing AI systems without relying on external APIs.
Introduction
You can find the complete Jupyter notebook for this tutorial on my GitHub repository. Feel free to follow along and experiment with the code!
Large Language Models (LLMs) like GPT-4 have revolutionized Natural Language Processing (NLP), offering remarkable capabilities in text generation, classification, translation, and more. However, relying on external APIs introduces limitations such as usage costs, data privacy concerns, and dependency on third-party services.
What if you could achieve better results without these constraints?
In this tutorial, we'll explore how to fine-tune an open-source Llama model for a straightforward task: classifying news articles into topics. We'll compare the performance with GPT-4 and demonstrate how our fine-tuned model not only matches but surpasses it in accuracy. Additionally, we'll highlight how the small size of the model (1 billion parameters) allows for easy deployment, even on devices with limited computational resources.
Why Fine-Tune an Open-Source Model?
Cost Efficiency: Eliminate recurring costs associated with API calls to commercial services.
Data Privacy: Keep sensitive data in-house without sending it to external servers.
Customization: Tailor the model to your specific domain or task.
Performance: Achieve high accuracy by leveraging specialized data.
Control: Full access to model internals for debugging and optimization.
Ease of Deployment: Small models can be deployed on devices with limited resources.
Overview
We'll cover the following steps:
Setting Up the Environment
Loading the Model and Tokenizer
Preparing the Dataset
Configuring LoRA for Fine-Tuning
Training the Model
Evaluating the Model
Comparing with GPT-4
Inference
Conclusion
1. Setting Up the Environment
Configuring Google Colab for GPU
To run this tutorial, we'll use Google Colab, which provides free access to GPUs, including the NVIDIA T4. Here's how to set it up:
Open a New Notebook:
Go to Google Colab and create a new notebook.
Change Runtime Type:
Click on "Runtime" in the menu bar.
Select "Change runtime type".
In the dialog box, set "Hardware accelerator" to "GPU".
Ensure the "GPU Type" is set to "T4". If it isn't, disconnect and reconnect the runtime until you're assigned a T4 GPU.
Install Required Libraries
Now, we'll install the necessary libraries:
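In a Colab cell, one pip command covers all of them (a minimal setup; you may want to pin versions for reproducibility):

```python
!pip install -q transformers datasets accelerate peft trl bitsandbytes
```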
transformers: For model and tokenizer.
datasets: To load and process datasets.
accelerate: Helps with distributed training.
peft: For Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA.
trl: Transformer Reinforcement Learning, the library that provides the SFTTrainer we'll use for fine-tuning.
bitsandbytes: Enables 8-bit optimizers for efficient training.
Logging into Hugging Face
We'll need to log in to Hugging Face to access models and datasets:
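A minimal login cell using the huggingface_hub helper:

```python
from huggingface_hub import notebook_login

notebook_login()  # opens an interactive widget asking for your token
```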
You'll be prompted to enter your Hugging Face token. If you don't have one, you can sign up for a free account at huggingface.co and create a token in your settings.
2. Loading the Model and Tokenizer
Selecting the Model
We're using the open-source Llama 3.2 1B Instruct model:
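Its identifier on the Hugging Face Hub (note that the Llama 3.2 repositories are gated, so you may first need to accept Meta's license on the model page):

```python
model_id = "meta-llama/Llama-3.2-1B-Instruct"
```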
This model is relatively small, with 1 billion parameters, requiring approximately 6 GB of GPU memory for fine-tuning and around 2 GB for inference. This makes it suitable for deployment in resource-constrained environments and for fine-tuning on limited computational resources like a Google Colab T4 GPU, which has 16 GB of VRAM.
Hardware Requirements and Model Size
The Llama 3.2 1B Instruct model, with its 1 billion parameters, strikes a balance between performance and resource utilization:
Fine-Tuning Requirements:
GPU Memory: Approximately 6 GB of VRAM is needed during fine-tuning.
RAM: Around 12 GB of system RAM is recommended.
Disk Space: The model and datasets require about 4 GB of storage.
Inference Requirements:
GPU Memory: Requires about 2 GB of VRAM for inference.
Can Run on CPU: For smaller workloads, inference can be performed on a CPU with sufficient RAM.
This makes the model accessible for those without high-end hardware. A Google Colab T4 GPU, which comes with 16 GB of VRAM, is more than sufficient for this task.
Tip: If you're running into memory issues, consider reducing the per_device_train_batch_size or using techniques like gradient accumulation.
Setting Torch Dtype and Attention Implementation
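A small sketch that picks both settings from the GPU's compute capability; the Colab T4 (Turing) supports neither bfloat16 nor FlashAttention-2, so it falls back to float16 and eager attention:

```python
import torch

# Ampere GPUs (compute capability >= 8.0) support bfloat16 and FlashAttention-2;
# the Colab T4 is Turing (7.5), so we fall back to float16 and eager attention.
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"
```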
torch_dtype: Determines the precision of computations.
bfloat16: For newer GPUs (Ampere and later) with native bfloat16 support.
float16: For older GPUs, such as the T4.
attn_implementation: Chooses the attention mechanism implementation.
flash_attention_2: Faster, but requires compatible hardware (Ampere or newer; not available on the T4).
eager: Standard implementation.
Loading the Model
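A minimal loading cell (a sketch; the label names follow AG News's four categories):

```python
from transformers import AutoModelForSequenceClassification

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
label2id = {label: i for i, label in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=4,
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch_dtype,
    attn_implementation=attn_implementation,
    device_map="auto",
)
```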
AutoModelForSequenceClassification: A model class for classification tasks.
num_labels: Number of classes in the dataset.
id2label & label2id: Dictionaries mapping between labels and IDs.
attn_implementation: Passes the chosen attention implementation to the model.
Loading the Tokenizer
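A sketch of the tokenizer setup. Llama tokenizers ship without a padding token, so we reuse the end-of-sequence token and register it with the model config:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"  # add padding tokens on the right
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token
model.config.pad_token_id = tokenizer.pad_token_id
```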
AutoTokenizer: Automatically loads the appropriate tokenizer.
padding_side: Ensures that padding tokens are added to the right.
pad_token: Padding token for batch processing.
3. Preparing the Dataset
We'll use the AG News dataset, a collection of news articles labeled into four categories.
Loading the Dataset
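Loading it from the Hugging Face Hub (the dataset ships with 120,000 training and 7,600 test articles):

```python
from datasets import load_dataset

dataset = load_dataset("ag_news")
print(dataset)  # train: 120,000 examples; test: 7,600 examples
```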
Tokenizing the Dataset
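A first tokenization pass, without padding, so we can inspect the raw sequence lengths:

```python
def preprocess_function(examples):
    # Truncate anything longer than the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

tokenized_dataset = dataset.map(preprocess_function, batched=True)
```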
preprocess_function: Tokenizes the text and truncates sequences longer than the model's maximum input length.
batched=True: Processes multiple examples at once for efficiency.
Visualizing Sequence Lengths
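One way to plot the distribution (a sketch using matplotlib, which is preinstalled in Colab):

```python
import matplotlib.pyplot as plt

lengths = [len(ids) for ids in tokenized_dataset["train"]["input_ids"]]
plt.hist(lengths, bins=50)
plt.xlabel("Tokens per article")
plt.ylabel("Number of articles")
plt.title("Tokenized sequence lengths (train split)")
plt.show()
```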

This helps us decide on an appropriate max_seq_len, the maximum number of tokens to keep after truncation.
Setting Maximum Sequence Length
We set max_seq_len to 128 to balance capturing enough context against computational efficiency.
Re-tokenizing with Padding and Truncation
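Re-running the tokenization, now with a fixed length:

```python
max_seq_len = 128

def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_len,
        padding="max_length",
    )

tokenized_dataset = dataset.map(preprocess_function, batched=True)
```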
max_length: Ensures sequences are truncated to max_seq_len.
padding="max_length": Pads sequences to max_seq_len.
4. Configuring LoRA for Fine-Tuning
LoRA (Low-Rank Adaptation) is a fine-tuning method that reduces memory and compute requirements by updating only a small subset of parameters: instead of modifying the existing weights, it adds small trainable rank-decomposition matrices alongside them, letting us fine-tune large models efficiently.
Setting Up LoRA
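A sketch of the configuration; the r, lora_alpha, and lora_dropout values here are illustrative assumptions rather than the notebook's exact settings:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # sequence classification
    r=16,                         # rank of the low-rank matrices
    lora_alpha=32,                # scaling factor for the LoRA updates
    lora_dropout=0.05,            # dropout applied to the LoRA layers
    target_modules="all-linear",  # attach LoRA to every linear layer
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```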
task_type: Specifies the task (sequence classification).
r: Rank of the LoRA matrices; a trade-off between resource usage and performance.
lora_alpha: A scaling factor for LoRA updates.
lora_dropout: Dropout rate applied to LoRA layers.
target_modules: Applying LoRA to all linear layers for comprehensive fine-tuning.
5. Training the Model
Setting Hyperparameters
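A sketch of the arguments described below; the concrete values are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3.2-1b-ag-news",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    learning_rate=2e-4,
    fp16=True,  # float16 training on the T4
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=100,
)
```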
output_dir: Directory to save model checkpoints.
num_train_epochs: Number of training epochs.
per_device_train_batch_size: Batch size per GPU/CPU during training.
gradient_accumulation_steps: Number of steps to accumulate gradients before updating.
evaluation_strategy: How often to run evaluation.
fp16: Enables 16-bit floating point precision for faster training.
load_best_model_at_end: Loads the best model based on the metric.
metric_for_best_model: Metric to determine the best model.
Defining Metrics
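A typical implementation with scikit-learn (preinstalled in Colab):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```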
compute_metrics: Function to compute evaluation metrics like accuracy, F1 score, precision, and recall.
Initializing the Trainer
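A sketch matching the description below; argument names vary across trl versions (recent releases move packing into SFTConfig), and the 1,000-sample evaluation subset is an assumption:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].shuffle(seed=42),
    eval_dataset=tokenized_dataset["test"].select(range(1000)),  # eval subset
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    packing=False,  # keep explicit padding instead of packing sequences
)
```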
train_dataset: Shuffled training dataset.
eval_dataset: Subset of the test dataset for evaluation during training.
SFTTrainer: Trainer class from TRL (Transformer Reinforcement Learning).
packing: Set to False to control padding explicitly.
Training
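Kick off fine-tuning:

```python
trainer.train()
```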
This will start the fine-tuning process. On the Colab T4, training takes approximately 1 hour and 40 minutes.

6. Evaluating the Model
Evaluating on a Subset
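Using the evaluation subset configured in the trainer:

```python
metrics = trainer.evaluate()
print(metrics)
```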

Evaluating on the Entire Test Set
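Passing the full test split explicitly:

```python
full_metrics = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
print(full_metrics)
```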

Expected Results:
We've achieved 95% accuracy, outperforming GPT-4 on the same task.
7. Comparing with GPT-4
To evaluate GPT-4 on the same test set, we used the OpenAI API. However, due to the costs associated with API usage, we limited our evaluation to 500 samples from the test dataset. We crafted a prompt asking GPT-4 to classify each news article into one of the four categories: World, Sports, Business, or Sci/Tech.
Evaluation Methodology:
Prompt Design: We provided the article text and instructed GPT-4 to return only the category name (a code sketch follows this list).
Data Processing:
Used the same test dataset for a fair comparison.
Handled API rate limits and ensured compliance with OpenAI's usage policies.
Metrics Computed:
Accuracy
Precision
Recall
F1 Score
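A minimal sketch of the classification call using the openai Python client; the exact prompt wording and model string used in the original evaluation may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_with_gpt4(article_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the news article into exactly one of these "
                    "categories: World, Sports, Business, Sci/Tech. "
                    "Return only the category name."
                ),
            },
            {"role": "user", "content": article_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```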
You can find the complete code and methodology for the GPT-4 evaluation in this GitHub repository.
GPT-4 Evaluation Results:
After evaluating GPT-4 on 500 samples from the test set, it reached an accuracy of 87.2%.
Comparison with Our Fine-Tuned Model:
GPT-4 (500 test samples): 87.2% accuracy
Fine-tuned Llama 3.2 1B Instruct: 95% accuracy
Our fine-tuned Llama model outperforms GPT-4 by a significant margin on this specific task.
8. Inference
Let's test our fine-tuned model with an example from the test set.
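A minimal sketch (the sample index is arbitrary):

```python
import torch

sample = dataset["test"][0]
inputs = tokenizer(
    sample["text"],
    truncation=True,
    max_length=max_seq_len,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

predicted = id2label[logits.argmax(dim=-1).item()]
print(f"Predicted: {predicted} | Actual: {id2label[sample['label']]}")
```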

Sample Output:
Our model correctly classified the news article into the Business category.
9. Conclusion
By fine-tuning an open-source Llama 3.2 1B Instruct model with LoRA, we achieved 95% accuracy on the AG News classification task, outperforming GPT-4's 87.2%. This demonstrates that with appropriate fine-tuning, smaller models can surpass larger, more expensive models on specific tasks.
Key Takeaways
Performance: Fine-tuned models can outperform general-purpose models like GPT-4 on specialized tasks.
Cost Efficiency: Avoid recurring costs associated with API calls to commercial services.
Data Privacy: Keep sensitive data in-house without sending it to external servers.
Customization: Tailor the model to your specific domain or task.
Accessibility: Utilize open-source models to democratize AI capabilities.
Next Steps
Experiment with Different Models: Try other open-source models and compare performance.
Optimize Hyperparameters: Adjust training arguments and LoRA configurations for better results.
Scale Up: Fine-tune on larger datasets or for more epochs.
Deploy the Model: Integrate the fine-tuned model into your applications.
About Datawizz.ai
At Datawizz.ai, we empower companies to own and optimize their AI with specialized, efficient models. Our mission is to make advanced AI accessible and customizable, allowing businesses to harness the full potential of their data.
References
Hugging Face Transformers - Llama 3.2 1B Instruct
Jupyter Notebooks in GitHub Repository: DatawizzAI Blogs
Note: The code snippets provided are intended for educational purposes and may require adjustments based on your environment and hardware capabilities.
Disclaimer: The performance comparison is based on specific configurations and datasets. Results may vary based on hardware, data preprocessing, and other factors.