Fine-tuning

LoRA

Training a model with many weights is time- and resource-consuming because every weight must be loaded into memory and updated, which is hard at scale. To avoid this, LoRA introduces a low-rank matrix decomposition.

Instead of updating all the weights, LoRA focuses on tracking the changes induced by fine-tuning and representing these changes in a compact form. It leverages low-rank matrix decomposition, which allows a large matrix to be approximated by the product of two smaller matrices.

Example

Imagine a 5×5 matrix as a storage unit with 25 spaces. With rank r = 1, LoRA breaks it down into two smaller matrices through matrix decomposition: a 5×1 matrix (5 spaces) and a 1×5 matrix (5 spaces). This reduces the total storage requirement from 25 to just 10, making the update more compact.
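The storage example above can be sketched in a few lines of numpy (the matrix values are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-1 decomposition of a 5x5 update: 25 values -> 5 + 5 = 10
A = rng.standard_normal((5, 1))  # 5 trainable values
B = rng.standard_normal((1, 5))  # 5 trainable values

delta_W = A @ B                  # reconstructs a full 5x5 (rank-1) matrix

print(delta_W.shape)             # (5, 5)
print(A.size + B.size)           # 10 trainable parameters instead of 25
```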

W = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8],
     [0.9, 1.0, 1.1, 1.2],
     [1.3, 1.4, 1.5, 1.6]]
 

we freeze W and only train two small matrices: A and B.

Let’s say Rank = 2. Then:

  • A is 4×2
  • B is 2×4

So with rank 2 you train 4×2 + 2×4 = 16 parameters, the same as W itself, because this example matrix is tiny; the savings appear when the matrix dimensions are large relative to the rank (rank 1 here would already cut it to 8).

import numpy as np
 
# Original frozen weights
W = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5, 1.6]
])
 
# Trainable LoRA matrices (random for example)
A = np.random.randn(4, 2)  # down projection
B = np.random.randn(2, 4)  # up projection
 
# Scaled update (α = 2, rank = 2) => scale = α / rank = 1.0
scaling_factor = 1.0
delta_W = scaling_factor * (A @ B)
 
# Final weight = frozen + lora update
W_lora = W + delta_W
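To see why this matters at scale, compare parameter counts for a transformer-sized weight matrix (the 4096×4096 size and rank 8 are illustrative choices, not from any specific model):

```python
d = 4096          # hidden size of one weight matrix (illustrative)
r = 8             # LoRA rank

full_params = d * d          # training W directly
lora_params = d * r + r * d  # training A (d x r) and B (r x d)

print(full_params)   # 16777216
print(lora_params)   # 65536, roughly 0.4% of the full matrix
```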
 

We can choose different ranks for the LoRA matrices. For example:

W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]   # Identity for simplicity
 
- Our weight matrix W is 4×4
 
**LoRA with Rank = 1**
A = [[1],
     [2],
     [3],
     [4]]          # Shape: (4×1)
 
B = [[1, 0, -1, 0]]  # Shape: (1×4)
 
 
**LoRA with Rank = 2**
A = [[1, 0],
     [0, 1],
     [1, 1],
     [0, 0]]
 
B = [[1, 2, 0, 1],
     [0, 1, 1, 0]]
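A quick check, using the matrices above, that the resulting updates really have the chosen rank:

```python
import numpy as np

# Rank-1 example from above
A1 = np.array([[1], [2], [3], [4]])
B1 = np.array([[1, 0, -1, 0]])
delta1 = A1 @ B1              # a 4x4 matrix of rank 1

# Rank-2 example from above
A2 = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
B2 = np.array([[1, 2, 0, 1], [0, 1, 1, 0]])
delta2 = A2 @ B2              # a 4x4 matrix of rank 2

print(np.linalg.matrix_rank(delta1))  # 1
print(np.linalg.matrix_rank(delta2))  # 2
```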
 

The QLoRA paper tested rank values from 8 to 256 and found:

“If LoRA is applied to all layers, the rank has little to no effect on downstream performance.”

  • Because many tasks don’t need complex updates — the pre-trained model already “knows” most things, and you’re just nudging it.

Alpha

Alpha determines a scaling factor applied to the weight changes before they are added to the original model weights. This factor is calculated as alpha divided by rank.
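In code the scaling is just alpha divided by rank, applied to the A·B product (the alpha=16, rank=8 values here are illustrative):

```python
import numpy as np

rank, alpha = 8, 16
scaling = alpha / rank         # 2.0, doubles the adapter's contribution

rng = np.random.default_rng(0)
A = rng.standard_normal((4, rank))
B = rng.standard_normal((rank, 4))

delta_W = scaling * (A @ B)    # the update that gets added to frozen W
```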

Dropout

Dropout is a percentage that randomly sets some parameters to zero during training. Its purpose is to help avoid overfitting, where the model performs well only on its training data but poorly on new, unseen data.
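A minimal sketch of (inverted) dropout, the standard form used on the LoRA input path: zero a fraction p of values and rescale the survivors so the expected magnitude is unchanged. The function name and shapes are illustrative:

```python
import numpy as np

def dropout(x, p, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero a fraction p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x                      # dropout is disabled at inference
    mask = rng.random(x.shape) >= p   # keep each value with prob 1-p
    return x * mask / (1.0 - p)

x = np.ones(10)
y = dropout(x, p=0.3)
print(y)  # some entries are 0, the rest are scaled to 1/0.7
```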

Parameters and their required RAM

QLoRA

QLoRA quantizes the frozen base-model weights down to 4-bit precision (the NF4 format) and trains LoRA adapters in higher precision on top of the quantized model, cutting memory use dramatically.
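The core quantization idea can be sketched with simple absmax quantization (QLoRA itself uses the more involved 4-bit NormalFloat format plus double quantization; this sketch just shows the round-trip principle):

```python
import numpy as np

def absmax_quantize(W, bits=8):
    """Absmax quantization: map floats to signed integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit
    scale = np.abs(W).max() / qmax
    q = np.round(W / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from integers and the stored scale."""
    return q.astype(np.float32) * scale

W = np.array([[0.1, -0.5], [0.9, -1.2]], dtype=np.float32)
q, scale = absmax_quantize(W)
W_hat = dequantize(q, scale)
print(np.abs(W - W_hat).max())  # small reconstruction error
```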

LoRA:

  • Introduce two low-rank matrices, A and B, to work alongside the weight matrix W.
  • Adjust these matrices instead of the behemoth W, making updates manageable.

LoRA-FA (Frozen-A):

  • Takes LoRA a step further by freezing matrix A.
  • Only matrix B is tweaked, reducing the activation memory needed.

VeRA:

  • All about efficiency: matrices A and B are fixed and shared across all layers.
  • Focuses on tiny, trainable scaling vectors in each layer, making it super memory-friendly.

Delta-LoRA:

  • A twist on LoRA: adds the difference (delta) between products of matrices A and B across training steps to the main weight matrix W.
  • Offers a dynamic yet controlled approach to parameter updates.

LoRA+:

  • An optimized variant of LoRA where matrix B gets a higher learning rate. This tweak leads to faster and more effective learning.
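The LoRA+ idea can be sketched as a manual SGD step where B simply uses a larger learning rate than A (the 16× ratio and all shapes here are illustrative, not values from the LoRA+ paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2)) * 0.01
B = np.zeros((2, 4))             # B starts at zero, as in standard LoRA

lr_A = 1e-4
lr_B = 16 * lr_A                 # LoRA+: higher learning rate for B

# One SGD step, given some upstream gradient dL/d(delta_W)
grad_delta = rng.standard_normal((4, 4))
grad_A = grad_delta @ B.T        # chain rule through delta_W = A @ B
grad_B = A.T @ grad_delta

A -= lr_A * grad_A               # note: B == 0 means grad_A == 0 on step 1
B -= lr_B * grad_B
```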

Resources

Axolotl

Axolotl is a wrapper on top of the Hugging Face libraries and provides numerous example configuration files in YAML format, covering a variety of common use cases and model types.

  • Supports full fine-tuning, lora, qlora, relora, and gptq
  • Customize configurations using a simple YAML file or CLI overrides
base_model: codellama/CodeLlama-7b-hf
model_type: LlamaForCausalLM
tokenizer_type: CodeLlamaTokenizer
 
load_in_8bit: true
load_in_4bit: false
strict: false
 
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out
 
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
 
adapter: lora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
 
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
 
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
 
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
 
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
 
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

LLM fine tuning made easy https://axolotl.ai/

Methods

Unsloth

Unsloth is a lightweight library for faster LLM fine-tuning.

Fine-tune: https://www.clarifai.com/

Lamini

https://docs.lamini.ai/

Ray train

https://docs.ray.io/en/latest/train/train.html?_gl=1*fhc3w3*_gcl_au*MTk4OTE5MzY1Ni4xNzM0ODU1MTA0

https://www.lamini.ai/

https://openpipe.ai/
Train higher-quality, faster models that continuously improve.

https://www.tensorzero.com/ [very nice, need to check]

OpenPipe for fine-tuning and OctoAI for deployment

Posttrain finetune

  • Huggingface TRL
  • Open RLHF
  • veRL
  • nemo RL

Supervised fine-tuning

Labeled prompt pairs

  1. Input Sequence: Prompt + response (e.g., “Translate to French: Hello! Bonjour!“)
  2. Tokenization: Sequence is tokenized into IDs.
  3. Labels:
    • Prompt tokens: Set to -100 to ignore them.
    • Response tokens: Set to their actual token IDs to guide the model in learning the correct output.
  4. Prediction: The model predicts the next token at each position.
    • For prompt tokens: The model is not penalized.
    • For response tokens: The model is penalized based on how far its prediction is from the correct token.
  5. Loss Calculation: Loss is only calculated for response tokens, comparing predicted vs. actual tokens.
  6. Backpropagation: Model weights are updated based on the loss from the response tokens.
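The labeling scheme above can be sketched with hypothetical token IDs (a real pipeline would get these from a tokenizer; the numbers here are made up):

```python
# Hypothetical token IDs for "Translate to French: Hello! Bonjour!"
prompt_ids   = [101, 7, 42, 13, 99]   # "Translate to French: Hello!"
response_ids = [55, 23, 4]            # " Bonjour!"

input_ids = prompt_ids + response_ids

# Labels: -100 masks prompt tokens out of the loss;
# response tokens keep their real IDs so they are supervised.
IGNORE_INDEX = -100
labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids

print(input_ids)  # [101, 7, 42, 13, 99, 55, 23, 4]
print(labels)     # [-100, -100, -100, -100, -100, 55, 23, 4]
```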

Example

Combine the prompt with the answer:
`"Translate to French: Hello! Bonjour!"`

We pass the whole sequence to the LLM and it predicts the next token at each position, starting from "Translate" and so on. We check whether each next-token prediction is correct, but we ignore that check for the question tokens; the loss is only computed on the answer tokens.

Direct preference optimization

Online reinforcement learning

MUST WATCH