Dataset
The Hugging Face Datasets library is used to download data from the Hugging Face Hub. We can use the list_datasets() function to see what datasets are available on the Hub:
from datasets import list_datasets
all_datasets = list_datasets()
from datasets import load_dataset
emotions = load_dataset("emotion")
# emotions is a DatasetDict with train, validation and test splits
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 16000
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 2000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 2000
})
})
train_ds = emotions["train"]
# convert to pandas
import pandas as pd
emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()
# to load a dataset from a local file
load_dataset("csv", data_files="my_file.csv")
Datasets are memory-mapped using Apache Arrow and cached locally. This means only the data that is actually needed gets loaded into memory, which makes it possible to work with datasets larger than the system memory.
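A minimal sketch of what this means in practice (reusing the emotion dataset loaded above): rows are read lazily from the Arrow file on disk, so accessing a few examples does not materialize the whole split in memory.
from datasets import load_dataset

emotions = load_dataset("emotion")
train_ds = emotions["train"]
# rows are fetched from the memory-mapped Arrow file only when accessed
print(train_ds[0])                         # a single example
print(train_ds.select(range(3))["text"])   # a small slice; only these rows are materialized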
Transformer
Transformers is a Python library for working with pre-trained natural language processing (NLP) models. It is a wrapper that bundles the tokenizer, the model, and the post-processing needed to turn the raw model output into something usable.
The AutoTokenizer class belongs to a larger set of “auto” classes whose job is to automatically retrieve the model’s configuration, pretrained weights, or vocabulary from the name of the checkpoint.
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
encoded_text = tokenizer(text)
print(encoded_text)
# both approaches load the same tokenizer
from transformers import DistilBertTokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)
- Models can be exported to ONNX and run with ONNX Runtime across a wide range of devices.
Inference
Inference is the process of using a trained model to make predictions on new data. As this process can be compute-intensive, running on a dedicated server can be an interesting option. The huggingface_hub
library provides an easy way to call a service that runs inference for hosted models. There are several services you can connect to:
- Inference API: a service that allows you to run accelerated inference on Hugging Face’s infrastructure for free. This service is a fast way to get started, test different models, and prototype AI products.
- Inference Endpoints: a product to easily deploy models to production. Inference is run by Hugging Face in a dedicated, fully managed infrastructure on a cloud provider of your choice.
https://huggingface.co/docs/hub/en/models-widgets
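A hedged sketch of calling the serverless Inference API from Python via huggingface_hub (the model name and token below are placeholders; any hosted text-generation model works):
from huggingface_hub import InferenceClient

# model and token are illustrative placeholders
client = InferenceClient(model="mistralai/Mistral-7B-Instruct-v0.2", token="hf_...")
print(client.text_generation("Hugging Face is", max_new_tokens=30))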
Pipelines
It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
Some of the currently available pipelines are:
- feature-extraction (get the vector representation of a text)
- fill-mask
- ner (named entity recognition)
- question-answering
- sentiment-analysis
- summarization
- text-generation
- translation
- zero-shot-classification (see the example below)
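One of these, zero-shot-classification, lets us label text with categories chosen at inference time, without fine-tuning. A minimal sketch (the pipeline downloads a default checkpoint for this task):
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)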
Using any model from the Hub in a pipeline
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")
generator(
"In this course, we will teach you how to",
max_length=30,
num_return_sequences=2,
)
- Checkpoints: These are the weights that will be loaded in a given architecture.
Pipelines under the hood
The first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a tokenizer, which is responsible for:
- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the AutoTokenizer class and its from_pretrained() method with the checkpoint name of our model:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Note: Transformer models only accept tensors as input.
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
{
'input_ids': tensor([
[ 101, 1045, 1005, 2310, 2042, ...],
[ 101, 1045, 5223, ....]
]),
'attention_mask': tensor([
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
])
}
The output itself is a dictionary containing two keys, input_ids and attention_mask. input_ids holds the token IDs for each sentence, and attention_mask indicates which tokens the model should attend to and which are padding.
We can use Transformers without having to worry about which ML framework is used as a backend; it might be PyTorch, TensorFlow, or Flax for some models, and the library takes care of it. To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument, as shown above.
The next step is to pass the tokenized inputs to the model:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # AutoModel has no task head, so it returns hidden states rather than logits
A Transformer model processes text and produces a high-dimensional vector of hidden states, which represent the model’s contextual understanding of the input
These hidden states are usually fed into another part of the model called a head. The head transforms these high-dimensional vectors into a format suitable for a specific task.
Example:
Classification Head
- Purpose: Used for tasks where the goal is to assign an input to one of several predefined categories (e.g., sentiment analysis, image classification).
- How It Works:
- The output from the transformer layers (hidden states) is fed into a linear layer, which projects the high-dimensional vectors down to the number of classes.
- A softmax function is then applied to convert these scores into probabilities for each class.
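A minimal PyTorch sketch of such a classification head (the hidden size and number of labels are illustrative assumptions, not tied to a specific checkpoint):
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):  # illustrative sizes
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # use the hidden state of the first ([CLS]) token as the sequence representation
        cls_hidden = hidden_states[:, 0, :]
        logits = self.linear(cls_hidden)          # project down to one score per class
        return torch.softmax(logits, dim=-1)      # convert scores to probabilities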
Sequence Generation Head
- Purpose: Used for tasks where the goal is to generate a sequence of outputs, such as text generation or machine translation.
- How It Works:
- Typically, this involves a decoder structure that predicts the next token in the sequence based on the previous tokens and the context provided by the encoder.
- It often uses techniques like beam search or greedy decoding to generate coherent sequences.
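A hedged sketch of greedy decoding with a small causal LM (distilgpt2 is just an example checkpoint; in practice model.generate() handles this loop for us):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
input_ids = tokenizer("In this course, we will teach you how to", return_tensors="pt").input_ids
for _ in range(20):
    logits = model(input_ids).logits                          # [batch, seq_len, vocab_size]
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)
print(tokenizer.decode(input_ids[0]))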
Different head architectures are designed for specific tasks. Some examples are:
- ForCausalLM
- ForMaskedLM
- ForMultipleChoice
- ForQuestionAnswering
- ForSequenceClassification
- ForTokenClassification
The output of the head often requires further processing to make sense of it. For instance, the raw output of the model (logits) may need to be converted into probabilities using a softmax layer.
Postprocessing the output
The model outputs logits, not probabilities; to convert them we need a softmax layer. (Transformers models output logits because the loss function used for training generally fuses the last activation function, such as softmax, with the actual loss function, such as cross-entropy.)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Define the model checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Input sentences
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
# Tokenize the input sentences
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# Pass the inputs through the model
outputs = model(**inputs)
# Convert logits to probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Print the predictions and labels
print(predictions)
print(model.config.id2label)
Models
The AutoModel class in Hugging Face acts as a wrapper for different model architectures within the library. This class can intelligently determine the appropriate model architecture for a given checkpoint and instantiate a model with that architecture. For instance, if we load a BERT checkpoint, AutoModel will automatically instantiate a BERT model.
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
Direct Instantiation: If we know the specific model type we want to use, we can directly use the class corresponding to its architecture. For example, we can use BertModel directly to create a BERT model.
Model Configuration: Models are built based on a configuration object, like BertConfig for BERT models. This configuration contains attributes that define the model's architecture, such as the hidden state size (hidden_size) and the number of Transformer layers (num_hidden_layers).
Model Initialization: Creating a model from the default configuration initializes it with random values. Such a model needs to be trained before it can be used effectively.
Loading Pre-trained Models: Pre-trained models can be loaded using the from_pretrained() method. This method takes a model identifier (e.g., "bert-base-cased") and downloads and caches the model weights. Using pre-trained models is crucial to save time, resources, and minimize environmental impact.
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
Checkpoint-Agnostic Code: The AutoModel class allows you to write checkpoint-agnostic code, which means your code can work with different checkpoints, even if the architecture is different, as long as the checkpoints are trained for similar tasks.
Saving Models: The save_pretrained() method saves the model to your disk in two files: config.json and pytorch_model.bin.
- config.json: contains the model's architecture and metadata.
- pytorch_model.bin: known as the state dictionary, contains the model's weights.
Model Inference: Once loaded, models can be used for inference. This involves tokenizing the input text and converting it into tensors that the model can understand.
- The model() function is then called with these tensors to generate predictions.
- For specific tasks like sentiment analysis, we might use specialized models like AutoModelForSequenceClassification.
- The output of a model often requires further processing, such as converting logits into probabilities using a softmax layer, before interpretation.
from transformers import AutoConfig, AutoModel, AutoTokenizer
# Model identifier
model_id = "bert-base-cased"
# Loading the configuration, tokenizer, and model
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
# Example input sequences
sequences = ["Hello!", "Cool.", "Nice!"]
# Tokenization
encoded_sequences = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# Model inference (return_tensors="pt" already gives PyTorch tensors, so we can pass them straight to the model)
output = model(**encoded_sequences)
# Saving the model
model.save_pretrained("my_model_directory")
# Accessing configuration attributes
print(config.hidden_size)
print(config.num_hidden_layers)
Models expect a batch of inputs
Batching involves sending multiple sentences to the model at once.
sequence = "I've been waiting for a HuggingFace course my whole life."
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail: the model expects a batch (2-D tensor), but input_ids is 1-D
model(input_ids)
So to work with the model we need to pass the input as a batch:
input_ids = torch.tensor([ids])  # wrapping ids in [] adds the batch dimension (2-D tensor)
print("Input IDs:", input_ids)
output = model(input_ids)
print("Logits:", output.logits)
Another thing we need to handle when batching is padding.
Padding the inputs: Sentences in a batch often have different lengths. To address this, padding is used. Padding involves adding a special token called the “padding token” to shorter sentences, making all sentences in the batch the same length. This is essential because tensors require a rectangular shape.
The padding token ID can be found in tokenizer.pad_token_id
But with a huge dataset (say 1 TB), padding every example to the global maximum token length is inefficient. To avoid this we can pad batch-wise (dynamic padding), for example with DataCollatorWithPadding; see the sketch after the snippet below.
def tokenize_and_pad(batch):
return tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
# Apply the map function
tokenized_dataset = dataset.map(tokenize_and_pad, batched=True)
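A minimal sketch of dynamic padding with DataCollatorWithPadding (assuming, as above, a dataset with a "text" column): tokenize without padding, then let the collator pad each batch to its own longest sequence.
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # no padding here; the collator pads each batch on the fly
    return tokenizer(batch["text"], truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True)  # `dataset` assumed loaded as earlier
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# pass data_collator to a Trainer, or as the collate_fn of a PyTorch DataLoader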
Attention masks: When using padding, it’s crucial to use attention masks. Attention masks are tensors that guide the model to focus on the actual tokens and ignore the padding tokens. This ensures accurate results, as attention layers in Transformers models contextualize each token. Without attention masks, padding tokens would be incorrectly considered in the attention mechanism.
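A quick sketch showing why the mask matters (same sentiment checkpoint as above): padding a sequence without a mask changes its logits, while masking the pad tokens restores the original result.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

ids = tokenizer("I hate this so much!")["input_ids"]
padded = ids + [tokenizer.pad_token_id] * 4

print(model(torch.tensor([ids])).logits)      # reference logits
print(model(torch.tensor([padded])).logits)   # differs: the pad tokens are attended to
mask = torch.tensor([[1] * len(ids) + [0] * 4])
print(model(torch.tensor([padded]), attention_mask=mask).logits)  # matches the reference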
Longer sequences: Transformer models have limits on the sequence length they can process; most handle up to 512 or 1024 tokens. To handle longer sequences:
- Utilize models specifically designed for long sequences, such as Longformer or LED.
- Truncate the sequences to the maximum supported length using the max_sequence_length parameter (see the truncation sketch below).
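A minimal truncation sketch (long_text is a placeholder string; in the tokenizer call the limit is passed via the max_length argument):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
long_text = "..."  # placeholder for a document longer than the model's limit
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most 512 tokens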
Tokenizer
- See also OpenAI's tokenizer implementation: https://github.com/openai/tiktoken
The process of converting text to numbers is called encoding.
Encoding involves two steps:
- Tokenization: splitting text into smaller units (tokens) such as words, characters, or subwords.
- Conversion to Input IDs: mapping each token to a unique numerical identifier from the tokenizer’s vocabulary.
AutoTokenizer:
- This class is a more generic tokenizer class that acts as a wrapper, enabling you to load tokenizers for different model architectures without explicitly specifying the tokenizer class.
- It can intelligently determine the correct tokenizer class based on the model checkpoint name we provide.
- For instance, if we use AutoTokenizer.from_pretrained("bert-base-cased"), it will automatically load the BertTokenizer class because the checkpoint name indicates a BERT model.
Methods
- tokenizer(): performs the complete encoding process, converting raw text into input IDs.
- tokenize(): handles only the tokenization step, splitting the text into tokens.
- convert_tokens_to_ids(): converts a list of tokens into their corresponding numerical IDs.
- decode(): performs the reverse operation of encoding, converting a list of input IDs back into a text string.
from transformers import AutoTokenizer
# Load the pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Input text
text = "This is a sample text for tokenization and decoding."
# Tokenize the input text
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
["this","i",...]
# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Input IDs:", input_ids)
# Decode the input IDs back to text
decoded_text = tokenizer.decode(input_ids)
print("Decoded Text:", decoded_text)
print(tokenizer.vocab_size)        # vocabulary size of the tokenizer
print(tokenizer.model_max_length)  # maximum sequence length the model supports
- Fast tokenizers use Rust under the hood (the Hugging Face Tokenizers library).
Preprocessing the data
Padding and Truncation: Sentences often have varying lengths. To create uniform input tensors, the tokenizer uses padding (adding a special padding token to shorter sequences) and truncation (shortening sequences that exceed the model’s maximum length).
batch_sentences = [
"But what about second breakfast?",
"Don't think he knows about second breakfast, Pip.",
"What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
Trainer
Transformers provides a Trainer class optimized for training Transformers models, making it easier to start training without manually writing your own training loop. The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
The Trainer is built on PyTorch, so it’s not suitable for projects that use Keras or TensorFlow
The Trainer may not be the best choice for highly specialized training logic. In such cases, the Accelerate library offers more fine-grained control.
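A hedged end-to-end sketch with the Trainer API on the emotion dataset from earlier (the checkpoint, output directory, and hyperparameters are illustrative only):
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=6)  # emotion has 6 classes

emotions = load_dataset("emotion")
tokenized = emotions.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

training_args = TrainingArguments(output_dir="distilbert-emotion", num_train_epochs=1)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # dynamic padding per batch
)
trainer.train()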
Hugging Face X Langchain
langchain-huggingface
from langchain_huggingface import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(
model_id="microsoft/Phi-3-mini-4k-instruct",
task="text-generation",
pipeline_kwargs={
"max_new_tokens": 100,
"top_k": 50,
"temperature": 0.1,
},
)
llm.invoke("Hugging Face is")
Accessing inference for a serverless hosted model
from langchain_huggingface import HuggingFaceEndpoint
llm = HuggingFaceEndpoint(
repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
task="text-generation",
max_new_tokens=100,
do_sample=False,
)
llm.invoke("Hugging Face is")
https://huggingface.co/blog/langchain
File formats
GGUF (Generic Graph Universal Format)
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
- represents models as graphs, where nodes represent operations (such as layers or functions) and edges represent the flow of data between these operations. This graph-based structure allows the format to capture intricate model architectures efficiently.
Quantization with GGUF
Llama.cpp: An LLM inference engine built on top of GGML. It provides higher-level primitives for loading and running Llama-like models and includes tools for quantizing models and running inference on quantized checkpoints.
Typical Quantization Workflow: A typical workflow using GGUF quantization involves four steps:
- Install Llama.cpp, which provides the necessary binaries.
- Ensure the full-precision model is in the GGUF format.
- Run llama.quantize to shrink the model.
- Run inference on the quantized model using llama.cli.
Refer here for more: https://github.com/iuliaturc/gguf-docs
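A hedged sketch of running a quantized GGUF checkpoint from Python with llama-cpp-python (the file path is a placeholder; any model produced by the workflow above should work):
from llama_cpp import Llama

# model_path is a placeholder for a GGUF file quantized with llama.cpp
llm = Llama(model_path="models/model-q4_k_m.gguf", n_ctx=2048)
output = llm("Q: What is GGUF used for? A:", max_tokens=64)
print(output["choices"][0]["text"])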
ONNX
ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers
ONNX enables exporting trained models from one framework and importing them into another, facilitating seamless transitions between different tools.
An ONNX model file contains a graph representing the model’s structure. This graph is a collection of computation nodes, analogous to the layers in a neural network
Each computation node represents a specific operation, defined by an operator. These operators map to deep learning conventions and encompass functions like activation functions (ReLU, sigmoid, tanh), convolutional operations, and more.
ONNX supports standard data types for tensors (int8, int16, bool, float16, etc.) as well as non-tensor types like sequences and maps for traditional machine learning
ONNX Runtime is an open-source inference engine that implements the ONNX standard. Its primary goal is to provide high-performance inference across a wide range of platforms and hardware. ONNX Runtime achieves this through a pluggable architecture that allows the integration of different execution providers, each optimized for specific hardware.
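A minimal sketch of running an already-exported ONNX file with ONNX Runtime (the file name and input shape are assumptions for illustration):
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder for an exported model; the dummy input assumes an image model
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy_input})
print([o.shape for o in outputs])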
CTranslate2
CTranslate2 is a fast, lightweight inference engine and model format designed specifically for Transformer models.
Traditional Transformer models (e.g., from Hugging Face Transformers + PyTorch):
- Are big (they carry training-related state such as gradients and optimizer states on top of the weights).
- Are slow on CPUs, unless you use big servers or GPUs.
- Have many layers and options useful for training, not inference.
CTranslate2 strips all that out:
- Keeps only the inference path of the model.
- Rewrites layers for fast CPU/GPU execution.
- Applies quantization to reduce memory & improve performance.
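A hedged sketch of the CTranslate2 flow (the paths and GPT-2 checkpoint are illustrative; the model must first be converted with the ct2-transformers-converter CLI that ships with CTranslate2):
# Conversion step (run once, outside Python):
#   ct2-transformers-converter --model openai-community/gpt2 --output_dir gpt2-ct2 --quantization int8
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
generator = ctranslate2.Generator("gpt2-ct2", device="cpu")

# CTranslate2 works on token strings, so tokenize first
start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hugging Face is"))
results = generator.generate_batch([start_tokens], max_length=30)
print(tokenizer.decode(results[0].sequences_ids[0]))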
| Tool / Framework | Optimized For | Backend | Quantization | Language Support | Notes |
|---|---|---|---|---|---|
| ONNX Runtime | General-purpose inference | CPU, GPU, DirectML | ✅ int8, float16 | All major languages | Widely supported and flexible |
| GGML / GGUF | LLMs & speech models | CPU (no GPU needed) | ✅ int4, int8, f16 | C/C++, Python, WASM | Ultra-efficient on CPU, popular with Whisper, LLaMA |
| TensorRT | NVIDIA GPU inference | CUDA GPU | ✅ int8, float16 | Python, C++ | Extremely fast, but GPU-only |
| OpenVINO | Intel CPU + VPU | CPU, iGPU, MyriadX | ✅ int8, float16 | Python, C++ | Best for Intel edge devices |
| TFLite | Mobile and embedded devices | CPU, GPU, Edge TPU | ✅ int8, float16 | Python, Java, Swift | Lightweight and mobile-friendly |
| DeepSparse | CPU-only (AVX512) | CPU | ✅ int8 | Python | Best on modern Intel CPUs |
| FasterTransformer | High-speed GPU inference | CUDA GPU | ✅ int8, float16 | C++, Python | Made by NVIDIA, great for large models |
| vLLM | Fast LLM serving | GPU | ✅ (less on quant) | Python | Great batching + throughput |
| MLC / WebLLM | WASM, WebGPU | Web browser, iOS, Android | ✅ int4, int8 | JavaScript, Python | Runs LLMs in browser/mobile! |
Transformers.js v3
We can use Transformers.js to run models in the browser using WebGPU (WebGPU is a new web standard for accelerated graphics and compute).
import { pipeline } from "@huggingface/transformers";
// Create a feature-extraction pipeline
const extractor = await pipeline(
"feature-extraction",
"mixedbread-ai/mxbai-embed-xsmall-v1",
{ device: "webgpu" },
);
// Compute embeddings
const texts = ["Hello world!", "This is an example sentence."];
const embeddings = await extractor(texts, { pooling: "mean", normalize: true });
console.log(embeddings.tolist());
// [
// [-0.016986183822155, 0.03228696808218956, -0.0013630966423079371, ... ],
// [0.09050482511520386, 0.07207386940717697, 0.05762749910354614, ... ],
// ]
Resources
- Huggingface 🤗 is all you need for NLP and beyond
- https://paperswithcode.com/sota
- https://playground.tensorflow.org/
- https://github.com/ajinkyakolhe112/Huggingface-NLP-COURSE-NOTES/tree/main/1.%20Introduction%20to%20Huggingface/3.%20Fine-tuning%20a%20Pretrained%20model
- Visualizer for neural network, deep learning and machine learning models: https://netron.app/
- Deep Learning Visualization Toolkit: https://github.com/PaddlePaddle/VisualDL
Depending on the model used, requests can use up to 128,000 tokens shared between prompt and completion. Some models, like GPT-4 Turbo, have different limits on input and output tokens.
There are often creative ways to solve problems within the limit, e.g. condensing your prompt, breaking the text into smaller pieces, etc.
Models
- https://huggingface.co/myshell-ai/MeloTTS-English (text-to-audio)
- Bloom
- openai-community/gpt2
- https://huggingface.co/1bitLLM
- SmolLM models are designed for local deployment and have low memory footprints, making them suitable for devices like smartphones
- Human-Like-Llama-3-8B-Instruct
- Human-Like-Qwen-2.5-7B-Instruct
Here are some interesting and lightweight LLMs available on Hugging Face that you can use for various projects:
- DistilBERT: A smaller, faster, and cheaper version of BERT, retaining 97% of its language understanding while being 60% faster. Great for text classification and sentiment analysis.
- TinyBERT: An even smaller version of BERT, optimized for mobile and edge devices. It's useful for applications requiring low latency.
- ALBERT: A lightweight model that reduces the parameters of BERT while maintaining performance. It's great for tasks like text classification and question answering.
- MiniLM: A compact model that balances speed and performance, making it suitable for a range of NLP tasks, including summarization and dialogue systems.
- ELECTRA: This model is more sample-efficient than traditional masked language models, making it great for text generation and understanding tasks with fewer resources.
- T5 (Text-to-Text Transfer Transformer): Though larger, you can find smaller variants. T5 is versatile, allowing you to tackle various tasks by framing them as text generation problems.
- GPT-Neo: An open-source alternative to GPT-3, with smaller versions available. Good for creative writing, chatbots, and text generation projects.
- BART (with smaller configurations): BART is great for text generation and summarization tasks. Smaller configurations can be effective for various applications without being too heavy.
- Flan-T5: A variant of T5 that's fine-tuned on a diverse set of tasks. It's useful for applications needing generalization across multiple NLP tasks.
- CodeGen: A model designed for code generation tasks. If you're interested in building tools related to programming or code assistance, this could be a fun choice.
- LLaVA