Oleksandr Parshakov

Legacy Projects

CIFAR-10 Image Classification with Deep Learning

(2021)

Summary: This project explores deep learning for a computer vision task on the CIFAR-10 dataset, a standard benchmark for 10-class image classification. The project proceeds in two phases. The initial phase involves experimenting with a basic Convolutional Neural Network (CNN) containing just two convolutional layers using ReLU and Sigmoid activations, followed by experiments with Multi-Layer Perceptrons (MLPs), to demonstrate the importance of convolutional layers for image data and the relative performance of different activation functions. This initial CNN achieved only 63% accuracy. The second phase focuses on developing a more complex CNN architecture, drawing inspiration from VGG-style networks in terms of depth and filter sizes, with the goal of exceeding 80% accuracy on a personal computer equipped with a single GPU. By incorporating techniques like batch normalisation and multi-scale convolutional filters, the final model achieved 88% accuracy on the CIFAR-10 test set, surpassing the initial goal and demonstrating the effectiveness of the chosen approach within the given resource constraints.
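As an illustration of the kind of building block described above—batch normalisation combined with multi-scale convolutional filters—here is a minimal PyTorch sketch; the channel counts and kernel sizes are illustrative assumptions, and the actual architecture lives in the GitHub repo.

import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel 3x3 and 5x5 convolutions, each followed by batch normalisation
    and ReLU, concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch // 2),
            nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=5, padding=2),
            nn.BatchNorm2d(out_ch // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return torch.cat([self.branch3(x), self.branch5(x)], dim=1)

# A batch of CIFAR-10-sized inputs: 32x32 RGB images
x = torch.randn(8, 3, 32, 32)
print(MultiScaleBlock(3, 64)(x).shape)  # torch.Size([8, 64, 32, 32])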

GitHub repo: github.com/lzrdGreen/Models-for-CIFAR-10

Relevant skills: Python, PyTorch, Scikit-Learn, matplotlib, numpy, pandas

Figure: loss for the training and validation sets.


Application of BERT, a Transformer-based language model, to check the correctness of a sentence in English

(September 2021)

Summary: This project tackled the challenge of grammatical error detection in English using Natural Language Processing (NLP) by fine-tuning a pre-trained BERT model with the CoLA dataset. The implementation was completed on a personal computer with a single GTX 1070 GPU, demonstrating the accessibility of advanced NLP techniques without requiring high-performance computing clusters. By validating the model's performance on real-world examples, the project showcased BERT's potential for nuanced linguistic tasks, contributing to understanding fine-tuning techniques and paving the way for practical applications like grammar checkers and language learning tools.
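A condensed sketch of the fine-tuning recipe—a pre-trained BERT encoder with a binary classification head trained on acceptable/unacceptable sentence labels in the CoLA style; the model name, optimiser settings, and toy batch below are illustrative assumptions rather than the notebook's exact code.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Binary head: label 1 = grammatically acceptable, 0 = unacceptable (CoLA convention)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["The cat sat on the mat.", "Cat the mat sat on the."],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
out = model(**batch, labels=labels)   # returns both loss and logits
out.loss.backward()
optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)   # 1 = sentence judged acceptable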

GitHub repo: English Grammar Tester

Relevant skills: Python, PyTorch, Scikit-Learn, matplotlib, numpy, pandas


LLM Fine-Tuning

In 2024-25, compact LLMs emerged as a promising area of research.

My 2021 project, "Application of BERT to check the correctness of a sentence," has found an unexpected continuation. While BERT—a masked language model typically fine-tuned for tasks such as question answering and sentence classification—represented one approach, generative LLMs now offer an alternative path.

I'm exploring these models through hands-on training on my personal laptop, which limits me to small-scale/compact LLMs.

LoRA, DoRA, and DDoRA

Follow-up Studies and Results (May 2025)

The initial success of our April 2025 investigation into LoRA, DoRA and DDoRA on the IMDb dataset prompted further exploration. Subsequent attempts to replicate these results on a more stubborn dataset proved much less fruitful. However, that persistent struggle led to a crucial pivot: the development of a set of diagnostic tools and training techniques for tackling the harder task. Applying these tools back to the well-understood IMDb dataset allowed a more detailed examination of LoRA's internal dynamics. By examining the magnitudes and gradients of the low-rank matrices A and B across layers during DistilBERT fine-tuning on IMDb, a non-obvious result emerged: substantial dropout applied after the projection with matrix A, and before the final projection with matrix B, compels lora.B to become more actively involved in learning, partly taking over training from lora.A. While this dropout strategy did not significantly alter LoRA's performance on IMDb, it provided a critical insight that should prove particularly useful when applying Double DoRA (DDoRA), and possibly DoRA itself, to IMDb, where the small magnitude of the B matrix is a serious issue. Here are the key observations:

Our analysis revealed a consistent pattern: the magnitudes of the B matrices remained significantly smaller (by approximately three orders of magnitude) than those of the A matrices throughout training. This is expected due to the near-zero initialisation of B, designed to minimise initial disruption to the pretrained weights. Despite their small size, the gradients of B (|∇B|) were notably larger than those of A (|∇A|), particularly in the earlier layers of the network. This suggests that while the B matrix starts small, it undergoes more active learning, especially in the initial stages of adaptation.

Furthermore, the experiments with dropout (applied after the A projection) indicated its role as a regulariser, encouraging more robust adaptation in the B matrix without a significant drop in performance. Notably, dropout consistently amplified the gradient magnitudes of B across almost all layers, suggesting that the added noise and sparsity during training compel B to learn more aggressively.

These findings underscore the importance of the directional updates facilitated by the A and B matrices. Even with a small magnitude in B, the effective low-rank update (ΔW=α⋅AB) can be meaningful due to the larger values in A and the scaling factor α. The comparatively large gradients observed for B suggest that the direction of these updates, rather than just the magnitude, plays a crucial role in the parameter-efficient adaptation process. The small magnitude of B may even contribute to better generalisation by preventing large, potentially overfitting weight changes.

This deeper understanding of LoRA's internal mechanics, particularly the interplay between the magnitudes and gradients of the A and B matrices and the impact of dropout, provides a valuable foundation for strategically navigating the more complex parameter space of Double DoRA (DDoRA).
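To make the dropout placement at the centre of these observations concrete, here is a minimal sketch of a LoRA-wrapped linear layer with dropout inserted after the A projection and before the B projection; the rank, alpha, and dropout values are illustrative assumptions, not the exact settings used in the notebooks.

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update x -> x A B,
    with dropout between the A and B projections."""
    def __init__(self, base: nn.Linear, r=8, alpha=16, dropout=0.3):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # only the adapter is trained
        self.lora_A = nn.Parameter(torch.empty(base.in_features, r))
        self.lora_B = nn.Parameter(torch.zeros(r, base.out_features))  # near-zero start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = x @ self.lora_A                      # down-projection with A
        h = self.dropout(h)                      # noise here pushes B to adapt
        return self.base(x) + self.scaling * (h @ self.lora_B)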

Our subsequent DDoRA investigation involved a two-stage training approach to ensure stability and effective adaptation. In the initial stage (epochs 1-2), a high dropout rate (0.3) was applied within the LoRA path. This made the output of the A projection noisy, compelling the B matrix to compensate: ||B|| grew to 50-60% of ||A||, and B adapted actively, with consistently larger gradients than A (|∇B| > |∇A|). This confirmed dropout's critical role in making B a full, active partner in the learned LoRA basis. Despite the additional complexity and potential instability introduced by DDoRA's m_in, m_out, and directional scale factors, training remained stable, with no explosive norms or vanishing gradients.
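Norm ratios such as ||B||/||A|| and the per-layer gradient comparisons quoted above can be tracked with a small logging pass after each backward step; a sketch, assuming parameters named lora_A/lora_B as in the module above.

def log_lora_stats(model):
    """Print ||A||, ||B|| and their gradient norms per LoRA parameter after loss.backward()."""
    for name, p in model.named_parameters():
        if "lora_A" in name or "lora_B" in name:
            grad_norm = p.grad.norm().item() if p.grad is not None else 0.0
            print(f"{name:60s}  |W| = {p.detach().norm().item():9.3e}  |grad| = {grad_norm:9.3e}")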

In the second stage (epochs 3-4), dropout was reduced to 0.1, and a custom optimiser was employed with lr_B_scale = 0.5 to allow A and B to synthesise their learned subspaces. This stage maintained healthy gradients for both A and B (all norms in the ~1e-5 to 7e-5 range), with B norms consistently smaller than A norms (~0.1-0.17 vs ~0.22-0.28, respectively), reflecting the conservative updates to B. LoRA(x) magnitudes were largest in the FFN layers, and, crucially, Layer 5 showed persistent active training, indicating that freezing its weights would be premature.
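The custom optimiser essentially amounts to separate parameter groups with scaled learning rates for lora_A, lora_B, and the magnitude/scale vectors; a sketch under that assumption, with defaults mirroring the values mentioned in the text where they are given.

import torch

def build_optimizer(model, base_lr=1e-3, lr_B_scale=0.5, lr_scale_params=1.0):
    """Separate learning rates for lora_A, lora_B and the magnitude/scale vectors."""
    a_params, b_params, scale_params = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lora_B" in name:
            b_params.append(p)
        elif "lora_A" in name:
            a_params.append(p)
        else:                                      # m_in / m_out / per-head scale factors
            scale_params.append(p)
    return torch.optim.AdamW([
        {"params": a_params,     "lr": base_lr},
        {"params": b_params,     "lr": base_lr * lr_B_scale},
        {"params": scale_params, "lr": base_lr * lr_scale_params},
    ])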

For the final two epochs (5-6), even smaller learning rates were set (base_lr=1e-3, lr_B_scale=1.0, lr_scale_params=1.0). While the intent was to potentially freeze FFN and the last attention layer, the observed training dynamics, particularly in Layer 5, suggested that freezing was not yet necessary. The results from this final stage demonstrated continued stability: |LoRA(x)| magnitudes remained healthy (ranging from 1e4 to 1e5), and |∇A| and |∇B| were non-zero across all layers 0-4, indicating active learning was still occurring. Overall, the training showed no obvious overfitting or catastrophic drift, with validation accuracy/F1 score consistently maintaining around 92.6–92.8% from early steps onward. For full details on the experimental setup and training dynamics, please refer to the complete notebook here.

In an alternative training setup, a small 5% dropout was additionally applied directly in the DDoRA forward path (F.dropout(lora_output, p=0.05)). This was intended to prevent overfitting to small artifacts and distribute the useful signal more broadly across the low-rank space, leading to even greater training stability after 4 epochs.

Following this, an experiment was conducted where Layer 5 FFN parameters were frozen for one epoch, then unfrozen for another, with adapted learning rates. The goal was to allow other layers to "catch up" in their learning. However, analysis of the training outcomes indicated that this freezing strategy did not provide the intended benefit, as Layer 5's FFN modules continued to show active participation in learning. For full details, please refer to the notebook here.
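The freeze/unfreeze cycle is just a matter of toggling requires_grad on the Layer 5 FFN adapter parameters between epochs; a sketch assuming DistilBERT's module naming (transformer.layer.5.ffn) and the lora_A/lora_B/m_in/m_out parameter names used in the sketches above.

def set_layer5_ffn_trainable(model, trainable: bool):
    """Freeze or unfreeze the adapter parameters inside DistilBERT's Layer 5 FFN."""
    for name, p in model.named_parameters():
        if "transformer.layer.5.ffn" in name and any(
                k in name for k in ("lora_A", "lora_B", "m_in", "m_out")):
            p.requires_grad = trainable

# Illustrative schedule: freeze for one epoch so other layers catch up, then unfreeze.
# set_layer5_ffn_trainable(model, False)   # epoch with Layer 5 FFN frozen
# set_layer5_ffn_trainable(model, True)    # following epoch, unfrozen again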

In a third experimental notebook, the DDoRA training was reproduced with initial settings across 4 epochs. Throughout these initial 4 epochs, a slightly smaller dropout of 2% was applied directly within the DDoRA forward path (F.dropout(lora_output, p=0.02)). This continuous application of dropout further contributed to stable training.

Following this, an experimental stage began to investigate whether restarting a layer's learning could be beneficial: the LoRA parameters (A and B matrices) in Layer 5 were reinitialized with random values (with B initialized at a smaller magnitude than A, as is standard). After this reinitialization, the model was trained for two more epochs (epochs 5-6) with a learning rate of 0.01.
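The reinitialisation step can be sketched as follows, again assuming the lora_A/lora_B naming and DistilBERT's layer paths; B is restarted near zero, at a much smaller magnitude than A, as in the standard initialisation.

import math
import torch.nn as nn

def reinit_layer5_lora(model):
    """Re-randomise the LoRA matrices in Layer 5 so the layer learns afresh."""
    for name, p in model.named_parameters():
        if "transformer.layer.5" not in name:
            continue
        if "lora_A" in name:
            nn.init.kaiming_uniform_(p, a=math.sqrt(5))   # fresh random A
        elif "lora_B" in name:
            nn.init.normal_(p, std=1e-4)                  # B restarted near zero, smaller than A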

The most striking difference after reinitialization was the drastically lower magnitudes for both A and B matrices in Layer 5 compared to the preceding layers (0-4). This indicated that the reinitialization effectively "reset" the learned weights. While Layer 5's gradients were non-zero across all its attention and FFN sub-layers, their magnitudes were generally lower than in layers 0-4. Despite this, the LoRA(x) magnitudes for Layer 5 FFN layers remained notably high, suggesting that even with smaller A and B magnitudes, these adapters quickly re-learned to contribute significantly to the model's overall output. This initial experiment showed that while reinitializing a layer could force fresh learning, it wasn't a "magic bullet" for performance enhancement in this DDoRA setup, as it didn't fundamentally alter the established performance trajectory of the already well-trained preceding layers.

Following the reinitialization experiment, training continued for another two epochs (epochs 7-8) with an even lower learning rate (base_lr=3e-3). This final stage aimed to further refine the adaptation process. The metrics from these last two epochs indicate that the model maintained its stability and healthy gradient flow across all layers (0-5). LoRA(x) magnitudes remained robust, further confirming that the DDoRA adapters were effectively contributing to the model's output. Even in Layer 5, despite its earlier reinitialization and lower A and B magnitudes compared to other layers, its LoRA(x) magnitudes remained substantial, particularly in the FFN layers. This suggests that Layer 5's reinitialized adapters successfully re-engaged in learning and adapted to contribute meaningfully to the model's performance, even at a lower learning rate. For full details, please refer to the notebook here.

A staged training strategy for Double DoRA on IMDb—starting with high dropout to activate B, then gradually reducing it while tuning learning rates—revealed that dropout plays a crucial role in balancing the dynamics of A and B, enabling stable adaptation and strong performance without overfitting. Applying a small dropout directly in the DDoRA path and experimenting with Layer 5 freezing and reinitialization showed that targeted perturbations can improve training stability and provoke renewed adaptation, but they do not drastically alter the overall performance trajectory: DDoRA layers—especially Layer 5—consistently re-engage and contribute meaningfully despite resets or conservative updates.

(April 2025)

Summary: This project investigates parameter-efficient fine-tuning of large language models, specifically applying Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA) for sentiment classification on the IMDb dataset using DistilBERT. We observed that the standard application of DoRA offers limited improvements over LoRA in language models due to the vanishing magnitude of weight updates stemming from zero-initialised low-rank matrices. To address this, we introduced a trainable per-head scaling mechanism, enabling effective directional updates and significantly enhancing performance. Building on these findings, we propose Double DoRA (DDoRA), a natural extension of DoRA that applies geometric adaptation at both the input and output of linear layers, increasing representational flexibility while maintaining parameter efficiency. Our results demonstrate that per-head scaling stabilises training and that DDoRA provides additional degrees of freedom for model adaptation, offering deeper insights into the fine-tuning dynamics of different network components.
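As a rough, assumption-laden sketch of the DDoRA idea—learned magnitude vectors on both the input and output side of the low-rank path—the module below is meant only to make "geometric adaptation at both ends" concrete; the exact formulation, including the per-head scaling and any normalisation, lives in the April 2025 notebooks.

import math
import torch
import torch.nn as nn

class DDoRALinear(nn.Module):
    """Sketch only: frozen base layer plus a low-rank path, with trainable
    magnitude vectors m_in (input side) and m_out (output side) rescaling
    the adapted directions at both ends of the layer."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.empty(base.in_features, r))
        self.lora_B = nn.Parameter(torch.zeros(r, base.out_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r
        self.m_in = nn.Parameter(torch.ones(base.in_features))    # input-side magnitudes
        self.m_out = nn.Parameter(torch.ones(base.out_features))  # output-side magnitudes

    def forward(self, x):
        lora_out = ((x * self.m_in) @ self.lora_A) @ self.lora_B
        return self.base(x) + self.scaling * lora_out * self.m_out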

GRPO Fine-Tuning of Gemma 3-1B-it

(March 2025)

Summary: Tiny Large Language Models (LLMs) like Qwen2.5-0.5B and TinyLlama-1.1B seem to lack reasoning capabilities. This study explores fine-tuning of Gemma 3-1B-it, the smallest model in Google's recent Gemma 3 family, using GRPO and a targeted reward system on the 'causal_judgement' subset of the BBH dataset. This resulted in a promising accuracy improvement, demonstrating the model's enhanced reasoning capabilities.
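A hedged sketch of what such a GRPO run might look like with the TRL library, rewarding completions that state the expected Yes/No answer for 'causal_judgement'-style questions; the dataset file, column names, and trainer settings below are assumptions rather than the project's exact setup.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Reward completions that state the expected Yes/No answer early on.
# The dataset is assumed to have a "prompt" column (plain strings) and a "target"
# column with the gold answer; extra columns are passed to the reward function by TRL.
def correctness_reward(completions, target, **kwargs):
    return [1.0 if t.lower() in c.lower()[:200] else 0.0
            for c, t in zip(completions, target)]

# Hypothetical file produced by preprocessing the BBH 'causal_judgement' subset.
dataset = load_dataset("json", data_files="causal_judgement_prompts.json")["train"]

trainer = GRPOTrainer(
    model="google/gemma-3-1b-it",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="gemma3-grpo",
                    num_generations=4,
                    max_completion_length=256),
    train_dataset=dataset,
)
trainer.train()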

Fine-Tuning of Qwen2.5-0.5B-Instruct Model

(February 2025)

Qwen2.5-0.5B-Instruct is a tiny LLM with half a billion parameters. This study examines various fine-tuning approaches (SFT and DPO, both independently and in combination) against each other and the baseline model.

I assessed the fine-tuning techniques (Table 1 below) using perplexity (PPL) as the evaluation metric—it quantifies how well the model predicts the next token, with lower scores indicating better performance. Each model version was evaluated on a diverse set of prompts and compared against the baseline Qwen2.5-0.5B-Instruct model without fine-tuning.
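For reference, per-prompt perplexity is just the exponential of the average next-token cross-entropy under the model; a minimal sketch of that computation for a causal LM (whether PPL is measured over the prompt alone or over the model's continuation is a detail of the actual notebooks).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """PPL = exp(mean next-token cross-entropy) over the given text."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])   # HF shifts labels internally
    return torch.exp(out.loss).item()

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
print(perplexity(model, tokenizer, "What is AI?"))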

I first investigated two approaches: Supervised Fine-Tuning (SFT) on the conversational HuggingFaceH4/ultrachat_200k dataset, and Direct Preference Optimisation (DPO) on the argilla/distilabel-intel-orca-dpo-pairs dataset of accepted/rejected response pairs. Contrary to the common belief that human preference data benefits LLM performance, DPO showed virtually no improvement over the baseline and proved challenging for my RTX 4080 GPU—the session with the best evaluation_loss ended in a runtime error near completion, and more stable settings yielded slightly worse evaluation_loss results. In contrast, Supervised Fine-Tuning demonstrated significant improvements, producing the best models according to the PPL metric. The same Jupyter notebook contains evaluation runs for both the baseline model and the pure DPO model loaded from its best checkpoint.
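The two recipes correspond roughly to TRL's SFTTrainer and DPOTrainer; the compressed sketch below uses assumed hyperparameters, and note that TRL's argument names vary between versions and that the preference dataset's columns need mapping to prompt/chosen/rejected before DPO (omitted here).

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"

# Supervised fine-tuning on the chat dataset (conversational "messages" format)
sft_trainer = SFTTrainer(
    model=model_name,
    train_dataset=load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft"),
    args=SFTConfig(output_dir="qwen-sft", num_train_epochs=1, per_device_train_batch_size=2),
)
sft_trainer.train()

# Preference optimisation on accepted/rejected pairs; the dataset's columns must
# first be mapped to prompt/chosen/rejected (omitted here for brevity).
dpo_trainer = DPOTrainer(
    model="qwen-sft",                 # or model_name for pure DPO from the baseline
    train_dataset=load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train"),
    args=DPOConfig(output_dir="qwen-dpo", num_train_epochs=1, per_device_train_batch_size=1),
)
dpo_trainer.train()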

Given DPO's discouraging results, I tested applying DPO after initial SFT (see the DPOafterSFT notebook). This approach worsened PPL, though less severely than pure DPO. I then applied a second round of SFT to create an SFT-DPO-SFT sequence, which yielded significant improvements nearly matching pure SFT. However, determining which metrics best reflect human preferences remains challenging.

Notably, all models performed strongest on creative writing tasks (e.g., "Write a scene from a play..."). Interestingly, while most factual prompts received good PPL scores, the specific prompt "Give me three facts about London" proved challenging for all models—possibly because the abundance of potential facts makes selection difficult.

Table 1.

Perplexity for Fine-Tuned Qwen2.5-0.5B-Instruct Model Using Various Techniques

Prompt | Supervised Fine-Tuning | Direct Preference Optimisation | DPOafterSFT | SFT-DPO-SFT | Baseline
What is AI? | 88.66 | 178.1 | 104.6 | 93.62 | 179.5
Tell me something interesting about Albert Einstein. | 60.97 | 124.8 | 74.59 | 61.76 | 118.6
Tell me something about Large Language Models. | 86.54 | 120.5 | 94.39 | 87.35 | 119.8
What is geometry? Explain it step by step. | 50.75 | 80.79 | 60.04 | 54.14 | 80.46
Explain the concept of entropy in simple terms. | 42.76 | 60.68 | 44.75 | 42.43 | 60.53
Tell me something about Jean Baudrillard. | 50.86 | 82.91 | 55.48 | 51.20 | 81.40
Who was David Hilbert? | 91.98 | 179.7 | 131.0 | 99.74 | 176.0
Give me three facts about London. | 108.7 | 204.4 | 120.2 | 109.3 | 200.9
Tell a short story about enemies who eventually became friends, why did it happen? | 86.02 | 117.0 | 97.68 | 86.90 | 114.8
Write a scene from a play where two men are having a philosophical debate about the nature of consciousness. | 24.19 | 32.03 | 25.83 | 24.79 | 31.64
Imagine you are a time traveler who has just arrived in the remote future. Describe what you observe that is significantly different from today. | 27.74 | 36.18 | 30.26 | 28.76 | 36.16
Tell me something about love. | 76.32 | 138.5 | 87.30 | 77.46 | 136.0
