Oleksandr Parshakov

Legacy Projects

CIFAR-10 Image Classification with Deep Learning

(2021)

Summary: This project explores deep learning for a computer vision task on the CIFAR-10 dataset, a standard benchmark for 10-class image classification. The project proceeds in two phases. The initial phase involves experimenting with a basic Convolutional Neural Network (CNN) containing just two convolutional layers using ReLU and Sigmoid activations, followed by experiments with Multi-Layer Perceptrons (MLPs), to demonstrate the importance of convolutional layers for image data and the relative performance of different activation functions. This initial CNN achieved only 63% accuracy. The second phase focuses on developing a more complex CNN architecture, drawing inspiration from VGG-style networks in terms of depth and filter sizes, with the goal of exceeding 80% accuracy on a personal computer equipped with a single GPU. By incorporating techniques like batch normalisation and multi-scale convolutional filters, the final model achieved 88% accuracy on the CIFAR-10 test set, surpassing the initial goal and demonstrating the effectiveness of the chosen approach within the given resource constraints.
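As an illustration of the kind of building block described above—batch normalisation combined with multi-scale convolutional filters—here is a minimal PyTorch sketch; the channel counts and kernel sizes are illustrative assumptions, and the actual architecture lives in the GitHub repo.

import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel 3x3 and 5x5 convolutions, each followed by batch normalisation
    and ReLU, concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch // 2),
            nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=5, padding=2),
            nn.BatchNorm2d(out_ch // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)

# A batch of CIFAR-10-sized inputs: 32x32 RGB images
x = torch.randn(8, 3, 32, 32)
print(MultiScaleBlock(3, 64)(x).shape)  # torch.Size([8, 64, 32, 32])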

GitHub repo: github.com/lzrdGreen/Models-for-CIFAR-10

Relevant skills: Python, PyTorch, Scikit-Learn, matplotlib, numpy, pandas

Figure: loss for the training and validation sets.


Application of BERT, a Transformer-based language model, to check the correctness of a sentence in English

(September 2021)

Summary: This project tackled the challenge of grammatical error detection in English using Natural Language Processing (NLP) by fine-tuning a pre-trained BERT model with the CoLA dataset. The implementation was completed on a personal computer with a single GTX 1070 GPU, demonstrating the accessibility of advanced NLP techniques without requiring high-performance computing clusters. By validating the model's performance on real-world examples, the project showcased BERT's potential for nuanced linguistic tasks, contributing to understanding fine-tuning techniques and paving the way for practical applications like grammar checkers and language learning tools.
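A condensed sketch of the fine-tuning recipe—a pre-trained BERT encoder with a binary classification head trained on acceptable/unacceptable sentence labels in the CoLA style; the model name, optimiser settings, and toy batch below are illustrative assumptions rather than the notebook's exact code.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Binary head: label 1 = grammatically acceptable, 0 = unacceptable (CoLA convention)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["The cat sat on the mat.", "Cat the mat sat on the."],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)   # returns both loss and logits
out.loss.backward()
optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)   # 1 = sentence judged acceptable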

GitHub repo: English Grammar Tester

Relevant skills: Python, PyTorch, Scikit-Learn, matplotlib, numpy, pandas


LLM Fine-Tuning

In 2024-25, compact LLMs emerged as a promising area of research.

My 2021 project, "Application of BERT to check the correctness of a sentence," has found an unexpected continuation. While BERT—a masked language model typically fine-tuned for tasks such as question answering and sentence classification—represented one approach, generative LLMs now offer an alternative path.

I'm exploring these models through hands-on training on my personal laptop, which limits me to small-scale/compact LLMs.

LoRA, DoRA, and DDoRA

Follow-up Studies and Results (May 2025)

The initial success of our April 2025 investigation into LoRA, DoRA and DDoRA on the IMDb dataset prompted further exploration. Subsequent attempts to replicate these results on a more stubborn dataset proved much less fruitful. However, that persistent struggle led to a crucial pivot: the development of a set of diagnostic tools and training techniques for tackling the harder task. Applying these tools back to the well-understood IMDb dataset allowed a more detailed examination of LoRA's internal dynamics. By examining the magnitudes and gradients of the low-rank matrices A and B across layers during DistilBERT fine-tuning on IMDb, a non-obvious result emerged: substantial dropout applied after the projection with matrix A, and before the final projection with matrix B, compels lora.B to become more actively involved in learning, partly taking over training from lora.A. While this dropout strategy did not significantly alter LoRA's performance on IMDb, it provided a critical insight that should prove particularly useful when applying Double DoRA (DDoRA), and possibly DoRA itself, to IMDb, where the small magnitude of the B matrix is a serious issue. Here are the key observations:

Our analysis revealed a consistent pattern: the magnitudes of the B matrices remained significantly smaller (by approximately three orders of magnitude) than those of the A matrices throughout training. This is expected due to the near-zero initialisation of B, designed to minimise initial disruption to the pretrained weights. Despite their small size, the gradients of B (|∇B|) were notably larger than those of A (|∇A|), particularly in the earlier layers of the network. This suggests that while the B matrix starts small, it undergoes more active learning, especially in the initial stages of adaptation.

Furthermore, the experiments with dropout (applied after the A projection) indicated its role as a regulariser, encouraging more robust adaptation in the B matrix without a significant drop in performance. Notably, dropout consistently amplified the gradient magnitudes of B across almost all layers, suggesting that the added noise and sparsity during training compel B to learn more aggressively.

These findings underscore the importance of the directional updates facilitated by the A and B matrices. Even with a small magnitude in B, the effective low-rank update (ΔW=α⋅AB) can be meaningful due to the larger values in A and the scaling factor α. The comparatively large gradients observed for B suggest that the direction of these updates, rather than just the magnitude, plays a crucial role in the parameter-efficient adaptation process. The small magnitude of B may even contribute to better generalisation by preventing large, potentially overfitting weight changes.

This deeper understanding of LoRA's internal mechanics, particularly the interplay between the magnitudes and gradients of the A and B matrices and the impact of dropout, provides a valuable foundation for strategically navigating the more complex parameter space of Double DoRA (DDoRA).
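To make the dropout placement at the centre of these observations concrete, here is a minimal sketch of a LoRA-wrapped linear layer with dropout inserted after the A projection and before the B projection; the rank, alpha, and dropout values are illustrative assumptions, not the exact settings used in the notebooks.

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update x -> x A B,
    with dropout between the A and B projections."""
    def __init__(self, base: nn.Linear, r=8, alpha=16, dropout=0.3):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # only the adapter is trained
        self.lora_A = nn.Parameter(torch.empty(base.in_features, r))
        self.lora_B = nn.Parameter(torch.zeros(r, base.out_features))  # near-zero start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = x @ self.lora_A                      # down-projection with A
        h = self.dropout(h)                      # noise here pushes B to adapt
        return self.base(x) + self.scaling * (h @ self.lora_B)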

Our subsequent DDoRA investigation involved a two-stage training approach to ensure stability and effective adaptation. In the initial stage (epochs 1-2), a high dropout rate (0.3) was applied within the LoRA path. This made the output of the A projection noisy, compelling the B matrix to compensate: ||B|| grew to 50-60% of ||A||, and B adapted actively, with consistently larger gradients than A (|∇B| > |∇A|). This confirmed dropout's critical role in making B a full, active partner in the learned LoRA basis. Despite the additional complexity and potential instability introduced by DDoRA's m_in, m_out, and directional scale factors, training remained stable, with no explosive norms or vanishing gradients.
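Norm ratios such as ||B||/||A|| and the per-layer gradient comparisons quoted above can be tracked with a small logging pass after each backward step; a sketch, assuming parameters named lora_A/lora_B as in the module above.

def log_lora_stats(model):
    """Print ||A||, ||B|| and their gradient norms per LoRA parameter after loss.backward()."""
    for name, p in model.named_parameters():
        if "lora_A" in name or "lora_B" in name:
            grad_norm = p.grad.norm().item() if p.grad is not None else 0.0
            print(f"{name:60s}  |W| = {p.detach().norm().item():9.3e}  |grad| = {grad_norm:9.3e}")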

In the second stage (epochs 3-4), dropout was reduced to 0.1, and a custom optimiser was employed with lr_B_scale = 0.5 to allow A and B to synthesise their learned subspaces. This stage maintained healthy gradients for both A and B (all norms in the ~1e-5 to 7e-5 range), with B norms consistently smaller than A norms (~0.1-0.17 vs ~0.22-0.28, respectively), reflecting the conservative updates to B. LoRA(x) magnitudes were largest in the FFN layers, and, crucially, Layer 5 showed persistent active training, indicating that freezing its weights would be premature.
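The custom optimiser essentially amounts to separate parameter groups with scaled learning rates for lora_A, lora_B, and the magnitude/scale vectors; a sketch under that assumption, with defaults mirroring the values mentioned in the text where they are given.

import torch

def build_optimizer(model, base_lr=1e-3, lr_B_scale=0.5, lr_scale_params=1.0):
    """Separate learning rates for lora_A, lora_B and the magnitude/scale vectors."""
    a_params, b_params, scale_params = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_B" in name:
            b_params.append(p)
        elif "lora_A" in name:
            a_params.append(p)
        else:                                      # m_in / m_out / per-head scale factors
            scale_params.append(p)
    return torch.optim.AdamW([
        {"params": a_params,     "lr": base_lr},
        {"params": b_params,     "lr": base_lr * lr_B_scale},
        {"params": scale_params, "lr": base_lr * lr_scale_params},
    ])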

For the final two epochs (5-6), even smaller learning rates were set (base_lr=1e-3, lr_B_scale=1.0, lr_scale_params=1.0). While the intent was to potentially freeze FFN and the last attention layer, the observed training dynamics, particularly in Layer 5, suggested that freezing was not yet necessary. The results from this final stage demonstrated continued stability: |LoRA(x)| magnitudes remained healthy (ranging from 1e4 to 1e5), and |∇A| and |∇B| were non-zero across all layers 0-4, indicating active learning was still occurring. Overall, the training showed no obvious overfitting or catastrophic drift, with validation accuracy/F1 score consistently maintaining around 92.6–92.8% from early steps onward. For full details on the experimental setup and training dynamics, please refer to the complete notebook here.

In an alternative training setup, a small 5% dropout was additionally applied directly in the DDoRA forward path (F.dropout(lora_output, p=0.05)). This was intended to prevent overfitting to small artifacts and distribute the useful signal more broadly across the low-rank space, leading to even greater training stability after 4 epochs.

Following this, an experiment was conducted where Layer 5 FFN parameters were frozen for one epoch, then unfrozen for another, with adapted learning rates. The goal was to allow other layers to "catch up" in their learning. However, analysis of the training outcomes indicated that this freezing strategy did not provide the intended benefit, as Layer 5's FFN modules continued to show active participation in learning. For full details, please refer to the notebook here.
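The freeze/unfreeze cycle is just a matter of toggling requires_grad on the Layer 5 FFN adapter parameters between epochs; a sketch assuming DistilBERT's module naming (transformer.layer.5.ffn) and the lora_A/lora_B/m_in/m_out parameter names used in the sketches above.

def set_layer5_ffn_trainable(model, trainable: bool):
    """Freeze or unfreeze the adapter parameters inside DistilBERT's Layer 5 FFN."""
    for name, p in model.named_parameters():
        if "transformer.layer.5.ffn" in name and any(
                k in name for k in ("lora_A", "lora_B", "m_in", "m_out")):
            p.requires_grad = trainable

# Illustrative schedule: freeze for one epoch so other layers catch up, then unfreeze.
# set_layer5_ffn_trainable(model, False)   # epoch with Layer 5 FFN frozen
# set_layer5_ffn_trainable(model, True)    # following epoch, unfrozen again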

In a third experimental notebook, the DDoRA training was reproduced with initial settings across 4 epochs. Throughout these initial 4 epochs, a slightly smaller dropout of 2% was applied directly within the DDoRA forward path (F.dropout(lora_output, p=0.02)). This continuous application of dropout further contributed to stable training.

Following this, an experimental stage began to investigate whether restarting a layer's learning could be beneficial: the LoRA parameters (A and B matrices) in Layer 5 were reinitialized with random values (with B initialized at a smaller magnitude than A, as is standard). After this reinitialization, the model was trained for two more epochs (epochs 5-6) with a learning rate of 0.01.
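The reinitialisation step can be sketched as follows, again assuming the lora_A/lora_B naming and DistilBERT's layer paths; B is restarted near zero, at a much smaller magnitude than A, as in the standard initialisation.

import math
import torch.nn as nn

def reinit_layer5_lora(model):
    """Re-randomise the LoRA matrices in Layer 5 so the layer learns afresh."""
    for name, p in model.named_parameters():
        if "transformer.layer.5" not in name:
            continue
        if "lora_A" in name:
            nn.init.kaiming_uniform_(p, a=math.sqrt(5))   # fresh random A
        elif "lora_B" in name:
            nn.init.normal_(p, std=1e-4)                  # B restarted near zero, smaller than A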

The most striking difference after reinitialization was the drastically lower magnitudes for both A and B matrices in Layer 5 compared to the preceding layers (0-4). This indicated that the reinitialization effectively "reset" the learned weights. While Layer 5's gradients were non-zero across all its attention and FFN sub-layers, their magnitudes were generally lower than in layers 0-4. Despite this, the LoRA(x) magnitudes for Layer 5 FFN layers remained notably high, suggesting that even with smaller A and B magnitudes, these adapters quickly re-learned to contribute significantly to the model's overall output. This initial experiment showed that while reinitializing a layer could force fresh learning, it wasn't a "magic bullet" for performance enhancement in this DDoRA setup, as it didn't fundamentally alter the established performance trajectory of the already well-trained preceding layers.

Following the reinitialization experiment, training continued for another two epochs (epochs 7-8) with an even lower learning rate (base_lr=3e-3). This final stage aimed to further refine the adaptation process. The metrics from these last two epochs indicate that the model maintained its stability and healthy gradient flow across all layers (0-5). LoRA(x) magnitudes remained robust, further confirming that the DDoRA adapters were effectively contributing to the model's output. Even in Layer 5, despite its earlier reinitialization and lower A and B magnitudes compared to other layers, its LoRA(x) magnitudes remained substantial, particularly in the FFN layers. This suggests that Layer 5's reinitialized adapters successfully re-engaged in learning and adapted to contribute meaningfully to the model's performance, even at a lower learning rate. For full details, please refer to the notebook here.

A staged training strategy for Double DoRA on IMDb—starting with high dropout to activate B, then gradually reducing it while tuning learning rates—revealed that dropout plays a crucial role in balancing the dynamics of A and B, enabling stable adaptation and strong performance without overfitting. Applying a small dropout directly in the DDoRA path and experimenting with Layer 5 freezing and reinitialization showed that targeted perturbations can improve training stability and provoke renewed adaptation, but they do not drastically alter the overall performance trajectory: DDoRA layers—especially Layer 5—consistently re-engage and contribute meaningfully despite resets or conservative updates.

(April 2025)

Summary: This project investigates parameter-efficient fine-tuning of large language models, specifically applying Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA) for sentiment classification on the IMDb dataset using DistilBERT. We observed that the standard application of DoRA offers limited improvements over LoRA in language models due to the vanishing magnitude of weight updates stemming from zero-initialised low-rank matrices. To address this, we introduced a trainable per-head scaling mechanism, enabling effective directional updates and significantly enhancing performance. Building on these findings, we propose Double DoRA (DDoRA), a natural extension of DoRA that applies geometric adaptation at both the input and output of linear layers, increasing representational flexibility while maintaining parameter efficiency. Our results demonstrate that per-head scaling stabilises training and that DDoRA provides additional degrees of freedom for model adaptation, offering deeper insights into the fine-tuning dynamics of different network components.
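As a rough, assumption-laden sketch of the DDoRA idea—learned magnitude vectors on both the input and output side of the low-rank path—the module below is meant only to make "geometric adaptation at both ends" concrete; the exact formulation, including the per-head scaling and any normalisation, lives in the April 2025 notebooks.

import math
import torch
import torch.nn as nn

class DDoRALinear(nn.Module):
    """Sketch only: frozen base layer plus a low-rank path, with trainable
    magnitude vectors m_in (input side) and m_out (output side) rescaling
    the adapted directions at both ends of the layer."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.empty(base.in_features, r))
        self.lora_B = nn.Parameter(torch.zeros(r, base.out_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r
        self.m_in = nn.Parameter(torch.ones(base.in_features))    # input-side magnitudes
        self.m_out = nn.Parameter(torch.ones(base.out_features))  # output-side magnitudes

    def forward(self, x):
        lora_out = ((x * self.m_in) @ self.lora_A) @ self.lora_B
        return self.base(x) + self.scaling * lora_out * self.m_out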

GRPO Fine-Tuning of Gemma 3-1B-it

(March 2025)

Summary: Tiny Large Language Models (LLMs) like Qwen2.5-0.5B and TinyLlama-1.1B seem to lack reasoning capabilities. This study explores fine-tuning of Gemma 3-1B-it, the smallest model in Google's recent Gemma 3 family, using GRPO and a targeted reward system on the 'causal_judgement' subset of the BBH dataset. This resulted in a promising accuracy improvement, demonstrating the model's enhanced reasoning capabilities.
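A hedged sketch of what such a GRPO run might look like with the TRL library, rewarding completions that state the expected Yes/No answer for 'causal_judgement'-style questions; the dataset file, column names, and trainer settings below are assumptions rather than the project's exact setup.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Reward completions that state the expected Yes/No answer early on.
# The dataset is assumed to have a "prompt" column (plain strings) and a "target"
# column with the gold answer; extra columns are passed to the reward function by TRL.
def correctness_reward(completions, target, **kwargs):
    return [1.0 if t.lower() in c.lower()[:200] else 0.0
            for c, t in zip(completions, target)]

# Hypothetical file produced by preprocessing the BBH 'causal_judgement' subset.
dataset = load_dataset("json", data_files="causal_judgement_prompts.json")["train"]

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="gemma3-grpo",
                    num_generations=4,
                    max_completion_length=256),
    train_dataset=dataset,
)
trainer.train()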

Fine-Tuning of Qwen2.5-0.5B-Instruct Model

(February 2025)

Qwen2.5-0.5B-Instruct is a tiny LLM with half a billion parameters. This study examines various fine-tuning approaches (SFT and DPO, both independently and in combination) against each other and the baseline model.

I assessed the fine-tuning techniques (Table 1 below) using perplexity (PPL) as the evaluation metric—it quantifies how well the model predicts the next token, with lower scores indicating better performance. Each model version was evaluated on a diverse set of prompts and compared against the baseline Qwen2.5-0.5B-Instruct model without fine-tuning.
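For reference, per-prompt perplexity is just the exponential of the average next-token cross-entropy under the model; a minimal sketch of that computation for a causal LM (whether PPL is measured over the prompt alone or over the model's continuation is a detail of the actual notebooks).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """PPL = exp(mean next-token cross-entropy) over the given text."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])   # HF shifts labels internally
    return torch.exp(out.loss).item()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
print(perplexity(model, tokenizer, "What is AI?"))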

I first investigated two approaches: Supervised Fine-Tuning (SFT) on the conversational HuggingFaceH4/ultrachat_200k dataset, and Direct Preference Optimisation (DPO) on the argilla/distilabel-intel-orca-dpo-pairs dataset of accepted/rejected response pairs. Contrary to the common belief that human preference data benefits LLM performance, DPO showed virtually no improvement over the baseline and proved challenging for my RTX 4080 GPU—the session with the best evaluation_loss ended in a runtime error near completion, and more stable settings yielded slightly worse evaluation_loss results. In contrast, Supervised Fine-Tuning demonstrated significant improvements, producing the best models according to the PPL metric. The same Jupyter notebook contains evaluation runs for both the baseline model and the pure DPO model loaded from its best checkpoint.
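The two recipes correspond roughly to TRL's SFTTrainer and DPOTrainer; the compressed sketch below uses assumed hyperparameters, and note that TRL's argument names vary between versions and that the preference dataset's columns need mapping to prompt/chosen/rejected before DPO (omitted here).

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

# Supervised fine-tuning on the chat dataset (conversational "messages" format)
sft_trainer = SFTTrainer(
    model=model_name,
    train_dataset=load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft"),
    args=SFTConfig(output_dir="qwen-sft", num_train_epochs=1, per_device_train_batch_size=2),
)
sft_trainer.train()

# Preference optimisation on accepted/rejected pairs; the dataset's columns must
# first be mapped to prompt/chosen/rejected (omitted here for brevity).
dpo_trainer = DPOTrainer(
    model="qwen-sft",                 # or model_name for pure DPO from the baseline
    train_dataset=load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train"),
    args=DPOConfig(output_dir="qwen-dpo", num_train_epochs=1, per_device_train_batch_size=1),
)
dpo_trainer.train()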

Given DPO's discouraging results, I tested applying DPO after initial SFT (see the DPOafterSFT notebook). This approach worsened PPL, though less severely than pure DPO. I then applied a second round of SFT to create an SFT-DPO-SFT sequence, which yielded significant improvements nearly matching pure SFT. However, determining which metrics best reflect human preferences remains challenging.

Notably, all models performed strongest on creative writing tasks (e.g., "Write a scene from a play..."). Interestingly, while most factual prompts received good PPL scores, the specific prompt "Give me three facts about London" proved challenging for all models—possibly because the abundance of potential facts makes selection difficult.

Table 1.

Perplexity for Fine-Tuned Qwen2.5-0.5B-Instruct Model Using Various Techniques

Prompt | Supervised Fine-Tuning | Direct Preference Optimisation | DPOafterSFT | SFT-DPO-SFT | Baseline
What is AI? | 88.66 | 178.1 | 104.6 | 93.62 | 179.5
Tell me something interesting about Albert Einstein. | 60.97 | 124.8 | 74.59 | 61.76 | 118.6
Tell me something about Large Language Models. | 86.54 | 120.5 | 94.39 | 87.35 | 119.8
What is geometry? Explain it step by step. | 50.75 | 80.79 | 60.04 | 54.14 | 80.46
Explain the concept of entropy in simple terms. | 42.76 | 60.68 | 44.75 | 42.43 | 60.53
Tell me something about Jean Baudrillard. | 50.86 | 82.91 | 55.48 | 51.20 | 81.40
Who was David Hilbert? | 91.98 | 179.7 | 131.0 | 99.74 | 176.0
Give me three facts about London. | 108.7 | 204.4 | 120.2 | 109.3 | 200.9
Tell a short story about enemies who eventually became friends, why did it happen? | 86.02 | 117.0 | 97.68 | 86.90 | 114.8
Write a scene from a play where two men are having a philosophical debate about the nature of consciousness. | 24.19 | 32.03 | 25.83 | 24.79 | 31.64
Imagine you are a time traveler who has just arrived in the remote future. Describe what you observe that is significantly different from today. | 27.74 | 36.18 | 30.26 | 28.76 | 36.16
Tell me something about love. | 76.32 | 138.5 | 87.30 | 77.46 | 136.0
