Keras embedding layer
Keras offers an embedding layer that can be used in neural networks for text data, such as RNNs (recurrent neural networks). It is defined as the first layer of a larger architecture. The embedding layer takes three main arguments:
- input_dim: Integer. Size of the vocabulary, i.e. maximum integer index+1.
- output_dim: Integer. Dimension of the dense embedding.
- input_length: Integer. Length of the input sequences, when it is constant. This argument is required if you are going to connect Flatten and then Dense layers downstream (without it, the shape of the dense outputs cannot be computed).
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
# Sample training data
texts = [
"I love this movie, it is fantastic!",
"The movie was okay, not great.",
"I did not like the movie, it was boring.",
"Fantastic film, I enjoyed every moment!",
"Terrible movie, I won't watch it again."
]
labels = [1, 0, 0, 1, 0] # 1 for positive, 0 for negative
# Tokenize the text
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# Pad sequences to ensure uniform length
max_length = 10
data = pad_sequences(sequences, maxlen=max_length)
# Convert labels to a numpy array
labels = np.array(labels)
# Define vocabulary size and embedding dimensions
vocab_size = 10000
embedding_dim = 50
# Create the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid')) # For binary classification
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Print the model summary
model.summary()
# Train the model
model.fit(data, labels, epochs=10, batch_size=2, validation_split=0.2)
# Sample test data
test_texts = [
"I really enjoyed this film, it was excellent!",
"The film was not good, I did not enjoy it.",
]
test_labels = [1, 0] # 1 for positive, 0 for negative
# Tokenize the test text
test_sequences = tokenizer.texts_to_sequences(test_texts)
# Pad sequences to ensure uniform length
test_data = pad_sequences(test_sequences, maxlen=max_length)
# Convert test labels to a numpy array
test_labels = np.array(test_labels)
# Evaluate the model on the test data
loss, accuracy = model.evaluate(test_data, test_labels, batch_size=2)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")
# Make predictions on the test data
predictions = model.predict(test_data)
# Print the predictions and corresponding test labels
for i, prediction in enumerate(predictions):
    print(f"Text: {test_texts[i]}")
    print(f"Predicted: {'Positive' if prediction[0] > 0.5 else 'Negative'}")
    print(f"Actual: {'Positive' if test_labels[i] == 1 else 'Negative'}\n")
Open Source Embedding models
Two key resources for open-source embeddings:
- Sentence Transformers (sbert.net): This Python framework simplifies loading and using various embedding models, including the popular “all-mpnet-base-v2” and “all-MiniLM-L6-v2” (see the example after this list).
- Hugging Face (huggingface.co): This platform hosts a vast collection of machine learning models and datasets, including the “Massive Text Embedding Benchmark” (MTEB) project, which ranks and evaluates embedding models.
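As a quick illustration, loading one of these models with the sentence-transformers framework takes only a few lines. This is a minimal sketch, assuming the package is installed (pip install sentence-transformers) and the model can be downloaded from Hugging Face:
# Load an open-source embedding model and encode a few sentences.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
sentences = ["I love this movie", "Terrible movie, I won't watch it again."]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)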
Choosing the Right Embedding Model
- Task: Different models specialize in different tasks, like semantic search, clustering, or bitext mining.
- Performance: The MTEB leaderboard offers a valuable resource for comparing model performance across various tasks.
- Dimension Size: Smaller dimensions generally result in faster computation and lower memory requirements, especially for similarity searches.
- Sequence Length: Models have limitations on the input length (measured in tokens), impacting how you process longer documents; both this limit and the embedding dimension can be checked programmatically, as shown below.
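A small sketch for inspecting these properties, reusing the model loaded above (attribute and method names follow the sentence-transformers API):
# Inspect embedding dimension and input length limit of the loaded model.
print(model.get_sentence_embedding_dimension())  # e.g. 384 for all-MiniLM-L6-v2
print(model.max_seq_length)                      # maximum input length in tokens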
Optimizing Embedding Generation
- Mean Pooling: This aggregation method combines multiple embeddings into a single representative embedding, essential for sentence-level comparisons. A token-level embedding model returns one vector per input token, so a sentence of n tokens produces an n × d matrix of embeddings; mean pooling averages over the token axis to obtain a single d-dimensional sentence embedding (see the sketch after this list).
- Normalization: Normalizing embeddings (creating unit vectors) enables accurate comparisons using methods like the dot product.
- Quantization: This technique reduces the precision of model weights, shrinking the model size and potentially improving inference speed.
- Caching: Transformers.js automatically caches models in the browser, significantly speeding up subsequent inference operations.
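A minimal numpy sketch of mean pooling followed by normalization. It assumes token_embeddings is the per-token output of an embedding model (here random placeholder values) and attention_mask marks real tokens versus padding:
import numpy as np

# Hypothetical per-token output of an embedding model: 6 tokens, dimension 384.
token_embeddings = np.random.rand(6, 384)
attention_mask = np.array([1, 1, 1, 1, 1, 0])  # last position is padding

# Mean pooling: average only over the real (non-padding) tokens.
mask = attention_mask[:, None]                   # shape (6, 1)
sentence_embedding = (token_embeddings * mask).sum(axis=0) / mask.sum()

# Normalization: unit length, so the dot product equals cosine similarity.
sentence_embedding = sentence_embedding / np.linalg.norm(sentence_embedding)
print(sentence_embedding.shape)  # (384,)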
Model Merging
Model merging is an efficient alternative to fine-tuning that leverages the work of the open-source community. It involves combining the weights of different fine-tuned models to create a new model with enhanced capabilities. This technique has proven highly effective, as demonstrated by the dominance of merged models in performance benchmarks.
Merging Techniques
- SLERP (Spherical Linear Interpolation): Interpolates the weights of two models using spherical linear interpolation. Different interpolation factors can be applied to various layers, allowing for fine-grained control (see the sketch after this list).
- DARE (Drop And REscale): Reduces redundancy in model parameters through pruning and rescaling of weights. This technique allows merging multiple models simultaneously.
- Pass-Through: Concatenates layers from different LLMs, including the possibility of concatenating layers from the same model (self-merging).
- Mixture of Experts (MoE): Combines feed-forward network layers from different fine-tuned models, using a router to select the appropriate layer for each token and layer. This technique can be implemented without fine-tuning by initializing the router using embeddings calculated from positive prompts.
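Below is a minimal numpy sketch of SLERP applied to a single pair of weight tensors, flattened to vectors, with t as the interpolation factor. Real merging tools (for example mergekit) apply this layer by layer across full checkpoints; this toy version only shows the interpolation itself:
import numpy as np

def slerp(w1, w2, t, eps=1e-8):
    # Spherical linear interpolation between two flattened weight vectors.
    v1 = w1 / (np.linalg.norm(w1) + eps)
    v2 = w2 / (np.linalg.norm(w2) + eps)
    omega = np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))  # angle between the models
    if omega < eps:  # nearly parallel weights: fall back to linear interpolation
        return (1 - t) * w1 + t * w2
    return (np.sin((1 - t) * omega) * w1 + np.sin(t * omega) * w2) / np.sin(omega)

# Toy example: merge one layer's weights from two fine-tuned models.
layer_a = np.random.rand(4096)
layer_b = np.random.rand(4096)
merged = slerp(layer_a, layer_b, t=0.5)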
Advantages of Model Merging:
- No GPU requirement, making it highly efficient.
- Ability to leverage existing fine-tuned models from the open-source community.
- Proven effectiveness in producing high-quality models.
1 Bit LLM
In BitNet b1.58, every weight in a Transformer is represented as a ternary value in {-1, 0, 1} instead of a floating-point number.
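A numpy sketch of the absmean-style ternary quantization described for BitNet b1.58: weights are scaled by the mean of their absolute values, then rounded and clipped to {-1, 0, 1}. This illustrates the weight representation only, not the full training recipe:
import numpy as np

def ternary_quantize(W, eps=1e-8):
    # Absmean quantization: scale by the mean absolute weight,
    # then round and clip every entry to -1, 0, or 1.
    gamma = np.abs(W).mean() + eps
    W_ternary = np.clip(np.round(W / gamma), -1, 1)
    return W_ternary, gamma

W = np.random.randn(4, 4).astype(np.float32)
W_ternary, gamma = ternary_quantize(W)
print(W_ternary)          # entries are only -1, 0, or 1
print(W_ternary * gamma)  # dequantized approximation of W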
Resources
- Pretrained Transformers for Text Ranking: BERT and Beyond
- Principal Component Analysis
- Deep Learning AI Short Courses
- ChromaDB Tutorial on DataCamp
- Visualize Vector Embeddings in a RAG System
- Natural Language Processing Specialization on Coursera
- Distributed Representations of Sentences and Documents
- A Gentle Introduction to Doc2Vec
- Word2Vec Archive
- Mastering LLM Techniques: Inference and Optimization
- LLAMA3 Documentation
- Gensim Documentation and Examples
- TensorBoard Documentation
- Evaluation of RAG Systems
- Sentence Transformers on Hugging Face
- Local RAG with Ollama and Weaviate
- Video Lectures from ESWC 2016 on Machine Learning
- https://huyenchip.com/2023/04/11/llm-engineering.html
- https://github.com/rasbt/LLMs-from-scratch
- Reasonable and good explanations of how things work, with no hype and no vendor content
- AI by hand
- Neural Networks From Scratch
Books