TL;DR: By fine-tuning a small Llama model (Llama 3.2 1B Instruct) on a Google Colab T4 GPU for about 1 hour and 40 minutes, we achieved 95% accuracy on a news article classification task, outperforming GPT-4's score of 87.2%. At just 1 billion parameters, the model is small enough to deploy in resource-constrained environments. This tutorial walks you through the process, demonstrating how you can leverage open-source models to build efficient, high-performing AI systems without relying on external APIs.
Introduction
You can find the complete Jupyter notebook for this tutorial on my GitHub repository. Feel free to follow along and experiment with the code!
Large Language Models (LLMs) like GPT-4 have revolutionized Natural Language Processing (NLP), offering remarkable capabilities in text generation, classification, translation, and more. However, relying on external APIs introduces limitations such as usage costs, data privacy concerns, and dependency on third-party services.
What if you could achieve better results without these constraints?
In this tutorial, we'll explore how to fine-tune an open-source Llama model for a straightforward task: classifying news articles into topics. We'll compare the performance with GPT-4 and demonstrate how our fine-tuned model not only matches but surpasses it in accuracy. Additionally, we'll highlight how the small size of the model (1 billion parameters) allows for easy deployment, even on devices with limited computational resources.
Why Fine-Tune an Open-Source Model?
Cost Efficiency: Eliminate recurring costs associated with API calls to commercial services.
Data Privacy: Keep sensitive data in-house without sending it to external servers.
Customization: Tailor the model to your specific domain or task.
Performance: Achieve high accuracy by leveraging specialized data.
Control: Full access to model internals for debugging and optimization.
Ease of Deployment: Small models can be deployed on devices with limited resources.
Overview
We'll cover the following steps:
Setting Up the Environment
Loading the Model and Tokenizer
Preparing the Dataset
Configuring LoRA for Fine-Tuning
Training the Model
Evaluating the Model
Comparing with GPT-4
Inference
Conclusion
1. Setting Up the Environment
Configuring Google Colab for GPU
To run this tutorial, we'll use Google Colab, which provides free access to GPUs, including the NVIDIA T4. Here's how to set it up:
Open a New Notebook:
Go to Google Colab and create a new notebook.
Change Runtime Type:
Click on "Runtime" in the menu bar.
Select "Change runtime type".
In the dialog box, set "Hardware accelerator" to "GPU".
Ensure the "GPU Type" is set to "T4". If it isn't, disconnect and reconnect the runtime until you're assigned a T4 GPU.
Install Required Libraries
Now, we'll install the necessary libraries:
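In a Colab cell, one pip command covers all of them (a minimal setup; you may want to pin versions for reproducibility):

```python
!pip install -q transformers datasets accelerate peft trl bitsandbytes
```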
transformers: For model and tokenizer.
datasets: To load and process datasets.
accelerate: Helps with distributed training.
peft: For Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA.
trl: Transformer Reinforcement Learning, the library that provides the SFTTrainer we'll use for fine-tuning.
bitsandbytes: Enables 8-bit optimizers for efficient training.
Logging into Hugging Face
We'll need to log in to Hugging Face to access models and datasets:
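A minimal login cell using the huggingface_hub helper:

```python
from huggingface_hub import notebook_login

notebook_login()  # opens an interactive widget asking for your token
```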
You'll be prompted to enter your Hugging Face token. If you don't have one, you can sign up for a free account at huggingface.co and create a token in your settings.
2. Loading the Model and Tokenizer
Selecting the Model
We're using the open-source Llama 3.2 1B Instruct model:
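Its identifier on the Hugging Face Hub (note that the Llama 3.2 repositories are gated, so you may first need to accept Meta's license on the model page):

```python
model_id = "meta-llama/Llama-3.2-1B-Instruct"
```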
This model is relatively small, with 1 billion parameters, requiring approximately 6 GB of GPU memory for fine-tuning and around 2 GB for inference. This makes it suitable for deployment in resource-constrained environments and for fine-tuning on limited computational resources like a Google Colab T4 GPU, which has 16 GB of VRAM.
Hardware Requirements and Model Size
The Llama 3.2 1B Instruct model, with its 1 billion parameters, strikes a balance between performance and resource utilization:
Fine-Tuning Requirements:
GPU Memory: Approximately 6 GB of VRAM is needed during fine-tuning.
RAM: Around 12 GB of system RAM is recommended.
Disk Space: The model and datasets require about 4 GB of storage.
Inference Requirements:
GPU Memory: Requires about 2 GB of VRAM for inference.
Can Run on CPU: For smaller workloads, inference can be performed on a CPU with sufficient RAM.
This makes the model accessible for those without high-end hardware. A Google Colab T4 GPU, which comes with 16 GB of VRAM, is more than sufficient for this task.
Tip: If you're running into memory issues, consider reducing the per_device_train_batch_size or using techniques like gradient accumulation.
Setting Torch Dtype and Attention Implementation
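A small sketch that picks both settings from the GPU's compute capability; the Colab T4 (Turing) supports neither bfloat16 nor FlashAttention-2, so it falls back to float16 and eager attention:

```python
import torch

# Ampere GPUs (compute capability >= 8.0) support bfloat16 and FlashAttention-2;
# the Colab T4 is Turing (7.5), so we fall back to float16 and eager attention.
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
    attn_implementation = "flash_attention_2"
else:
    torch_dtype = torch.float16
    attn_implementation = "eager"
```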
torch_dtype: Determines the precision of computations.
bfloat16: For newer GPUs (Ampere and later) with native bfloat16 support.
float16: For older GPUs, such as the T4.
attn_implementation: Chooses the attention mechanism implementation.
flash_attention_2: Faster, but requires compatible hardware (Ampere or newer; not available on the T4).
eager: Standard implementation.
Loading the Model
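A minimal loading cell (a sketch; the label names follow AG News's four categories):

```python
from transformers import AutoModelForSequenceClassification

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
label2id = {label: i for i, label in id2label.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=4,
    id2label=id2label,
    label2id=label2id,
    torch_dtype=torch_dtype,
    attn_implementation=attn_implementation,
    device_map="auto",
)
```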
AutoModelForSequenceClassification: A model class for classification tasks.
num_labels: Number of classes in the dataset.
id2label & label2id: Dictionaries mapping between labels and IDs.
attn_implementation: Passes the chosen attention implementation to the model.
Loading the Tokenizer
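A sketch of the tokenizer setup. Llama tokenizers ship without a padding token, so we reuse the end-of-sequence token and register it with the model config:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"  # add padding tokens on the right
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token
model.config.pad_token_id = tokenizer.pad_token_id
```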
AutoTokenizer: Automatically loads the appropriate tokenizer.
padding_side: Ensures that padding tokens are added to the right.
pad_token: Padding token for batch processing.
3. Preparing the Dataset
We'll use the AG News dataset, a collection of news articles labeled into four categories.
Loading the Dataset
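Loading it from the Hugging Face Hub (the dataset ships with 120,000 training and 7,600 test articles):

```python
from datasets import load_dataset

dataset = load_dataset("ag_news")
print(dataset)  # train: 120,000 examples; test: 7,600 examples
```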
Tokenizing the Dataset
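A first tokenization pass, without padding, so we can inspect the raw sequence lengths:

```python
def preprocess_function(examples):
    # Truncate anything longer than the model's maximum input length
    return tokenizer(examples["text"], truncation=True)

tokenized_dataset = dataset.map(preprocess_function, batched=True)
```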
preprocess_function: Tokenizes the text and truncates sequences longer than the model's maximum input length.
batched=True: Processes multiple examples at once for efficiency.
Visualizing Sequence Lengths
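One way to plot the distribution (a sketch using matplotlib, which is preinstalled in Colab):

```python
import matplotlib.pyplot as plt

lengths = [len(ids) for ids in tokenized_dataset["train"]["input_ids"]]
plt.hist(lengths, bins=50)
plt.xlabel("Tokens per article")
plt.ylabel("Number of articles")
plt.title("Tokenized sequence lengths (train split)")
plt.show()
```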

This helps us decide on an appropriate max_seq_len, the maximum number of tokens to keep after truncation.
Setting Maximum Sequence Length
We set max_seq_len to 128 to balance capturing enough context against computational efficiency.
Re-tokenizing with Padding and Truncation
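Re-running the tokenization, now with a fixed length:

```python
max_seq_len = 128

def preprocess_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_len,
        padding="max_length",
    )

tokenized_dataset = dataset.map(preprocess_function, batched=True)
```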
max_length: Ensures sequences are truncated to max_seq_len.
padding="max_length": Pads sequences to max_seq_len.
4. Configuring LoRA for Fine-Tuning
LoRA (Low-Rank Adaptation) is a fine-tuning method that reduces memory and compute requirements by updating only a small subset of parameters: instead of modifying the existing weights, it adds small trainable rank-decomposition matrices alongside them, letting us fine-tune large models efficiently.
Setting Up LoRA
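A sketch of the configuration; the r, lora_alpha, and lora_dropout values here are illustrative assumptions rather than the notebook's exact settings:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # sequence classification
    r=16,                         # rank of the low-rank matrices
    lora_alpha=32,                # scaling factor for the LoRA updates
    lora_dropout=0.05,            # dropout applied to the LoRA layers
    target_modules="all-linear",  # attach LoRA to every linear layer
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```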
task_type: Specifies the task (sequence classification).
r: Rank of the LoRA matrices; a trade-off between resource usage and performance.
lora_alpha: A scaling factor for LoRA updates.
lora_dropout: Dropout rate applied to LoRA layers.
target_modules: Applying LoRA to all linear layers for comprehensive fine-tuning.
5. Training the Model
Setting Hyperparameters
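A sketch of the arguments described below; the concrete values are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama-3.2-1b-ag-news",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    learning_rate=2e-4,
    fp16=True,  # float16 training on the T4
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    logging_steps=100,
)
```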
output_dir: Directory to save model checkpoints.
num_train_epochs: Number of training epochs.
per_device_train_batch_size: Batch size per GPU/CPU during training.
gradient_accumulation_steps: Number of steps to accumulate gradients before updating.
evaluation_strategy: How often to run evaluation.
fp16: Enables 16-bit floating point precision for faster training.
load_best_model_at_end: Loads the best model based on the metric.
metric_for_best_model: Metric to determine the best model.
Defining Metrics
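A typical implementation with scikit-learn (preinstalled in Colab):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```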
compute_metrics: Function to compute evaluation metrics like accuracy, F1 score, precision, and recall.
Initializing the Trainer
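A sketch matching the description below; argument names vary across trl versions (recent releases move packing into SFTConfig), and the 1,000-sample evaluation subset is an assumption:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].shuffle(seed=42),
    eval_dataset=tokenized_dataset["test"].select(range(1000)),  # eval subset
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    packing=False,  # keep explicit padding instead of packing sequences
)
```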
train_dataset: Shuffled training dataset.
eval_dataset: Subset of the test dataset for evaluation during training.
SFTTrainer: Trainer class from TRL (Transformer Reinforcement Learning).
packing: Set to False to control padding explicitly.
Training
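Kick off fine-tuning:

```python
trainer.train()
```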
This will start the fine-tuning process. On the Colab T4, training takes approximately 1 hour and 40 minutes.

6. Evaluating the Model
Evaluating on a Subset
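Using the evaluation subset configured in the trainer:

```python
metrics = trainer.evaluate()
print(metrics)
```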

Evaluating on the Entire Test Set
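Passing the full test split explicitly:

```python
full_metrics = trainer.evaluate(eval_dataset=tokenized_dataset["test"])
print(full_metrics)
```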

Expected Results:
We've achieved 95% accuracy, outperforming GPT-4 on the same task.
7. Comparing with GPT-4
To evaluate GPT-4 on the same test set, we used the OpenAI API. However, due to the costs associated with API usage, we limited our evaluation to 500 samples from the test dataset. We crafted a prompt asking GPT-4 to classify each news article into one of the four categories: World, Sports, Business, or Sci/Tech.
Evaluation Methodology:
Prompt Design: We provided the article text and instructed GPT-4 to return only the category name (a code sketch follows this list).
Data Processing:
Used the same test dataset for a fair comparison.
Handled API rate limits and ensured compliance with OpenAI's usage policies.
Metrics Computed:
Accuracy
Precision
Recall
F1 Score
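A minimal sketch of the classification call using the openai Python client; the exact prompt wording and model string used in the original evaluation may differ:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_with_gpt4(article_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the news article into exactly one of these "
                    "categories: World, Sports, Business, Sci/Tech. "
                    "Return only the category name."
                ),
            },
            {"role": "user", "content": article_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```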
You can find the complete code and methodology for the GPT-4 evaluation in this GitHub repository.
GPT-4 Evaluation Results:
After evaluating GPT-4 on 500 samples from the test set, it reached an accuracy of 87.2%.
Comparison with Our Fine-Tuned Model:
GPT-4 (500 test samples): 87.2% accuracy
Fine-tuned Llama 3.2 1B Instruct: 95% accuracy
Our fine-tuned Llama model outperforms GPT-4 by a significant margin on this specific task.
8. Inference
Let's test our fine-tuned model with an example from the test set.
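A minimal sketch (the sample index is arbitrary):

```python
import torch

sample = dataset["test"][0]
inputs = tokenizer(
    sample["text"],
    truncation=True,
    max_length=max_seq_len,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

predicted = id2label[logits.argmax(dim=-1).item()]
print(f"Predicted: {predicted} | Actual: {id2label[sample['label']]}")
```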

Sample Output:
Our model correctly classified the news article into the Business category.
9. Conclusion
By fine-tuning an open-source Llama 3.2 1B Instruct model with LoRA, we achieved 95% accuracy on the AG News classification task, outperforming GPT-4's 87.2%. This demonstrates that with appropriate fine-tuning, smaller models can surpass larger, more expensive models on specific tasks.
Key Takeaways
Performance: Fine-tuned models can outperform general-purpose models like GPT-4 on specialized tasks.
Cost Efficiency: Avoid recurring costs associated with API calls to commercial services.
Data Privacy: Keep sensitive data in-house without sending it to external servers.
Customization: Tailor the model to your specific domain or task.
Accessibility: Utilize open-source models to democratize AI capabilities.
Next Steps
Experiment with Different Models: Try other open-source models and compare performance.
Optimize Hyperparameters: Adjust training arguments and LoRA configurations for better results.
Scale Up: Fine-tune on larger datasets or for more epochs.
Deploy the Model: Integrate the fine-tuned model into your applications.
About Datawizz.ai
At Datawizz.ai, we empower companies to own and optimize their AI with specialized, efficient models. Our mission is to make advanced AI accessible and customizable, allowing businesses to harness the full potential of their data.
References
Hugging Face Transformers - Llama 3.2 1B Instruct
Jupyter Notebooks in GitHub Repository: DatawizzAI Blogs
Note: The code snippets provided are intended for educational purposes and may require adjustments based on your environment and hardware capabilities.
Disclaimer: The performance comparison is based on specific configurations and datasets. Results may vary based on hardware, data preprocessing, and other factors.