Ollama
- Can be run via Docker or as a native service to serve models
- Models are stored in /usr/share/ollama/.ollama/models (when run as a system service)
- which contains blobs/ and manifests/registry.ollama.ai/
- blobs contain the actual model data (the GGUF weights)
Llamafile
An open-source project by Mozilla that aims to democratize access to AI by letting users run large language models locally on their own machines, including on CPUs.
- Single File Executable: Llamafile distributes large language models as single-file executables, simplifying the process of running these models across various operating systems, CPU architectures, and GPU architectures. This portability eliminates the need for complex installations and ensures accessibility across a wide range of devices.
- Focus on CPU Inference: Recognizing the limitations of relying solely on expensive and power-hungry GPUs, Llamafile emphasizes the potential of CPUs for running large language models. This focus democratizes access to AI by leveraging readily available and affordable hardware found in computers worldwide.
- Llamafile utilizes Cosmopolitan, a tool that enables the creation of single-file executables that can run on multiple operating systems, achieving this through a clever hack involving a Unix shell script embedded in the MS-DOS stub of a portable executable.
- Outer Loop Unrolling for Prompt Processing: A key optimization technique involves unrolling the outer loop in matrix multiplication operations, which constitute a significant portion of LLM computations.
What is loop unrolling?
for i in range(5):
    print(i)
# after unrolling, this becomes:
print(0)
print(1)
print(2)
print(3)
print(4)
How to run
- Download the llamafile for the model, make it executable, and run it as:
./model-name
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
Using llamafile with external weights
curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
./llamafile.exe -m mistral.gguf
Running Ollama models
When we download a new model with ollama, all its metadata will be stored in a manifest file under ~/.ollama/models/manifests/registry.ollama.ai/library/. The directory and manifest file name are the model name as returned by ollama list. For instance, for llama3:latest the manifest file will be named .ollama/models/manifests/registry.ollama.ai/library/llama3/latest.
The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose mediaType is application/vnd.ollama.image.model is the one referring to the model’s GGUF file.
Each sha256 digest is also used as a filename in the ~/.ollama/models/blobs directory (if you look into that directory you’ll see only those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the llama3:latest GGUF file digest is sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29, you can run llamafile as follows:
cd /usr/share/ollama/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29
To get a model's sha256 digest, look under /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/modelname/latest
or check the Ollama logs.
Note: the paths above apply only when Ollama runs as a system service. If you run ollama serve yourself and pull models as a regular user, they are stored under your home directory (~/.ollama/models) instead.
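As a convenience, here is a minimal Python sketch that resolves the GGUF blob path from a manifest. It assumes the system-service install paths above, that the manifest follows the OCI-style layout with a "layers" list, and uses llama3:latest purely as an example tag.
import json
from pathlib import Path

# Hedged sketch: find the GGUF blob for an Ollama model by reading its manifest.
# Assumes Ollama runs as a system service; llama3:latest is just an example tag.
models = Path("/usr/share/ollama/.ollama/models")
manifest = json.loads((models / "manifests/registry.ollama.ai/library/llama3/latest").read_text())

for layer in manifest["layers"]:
    if layer["mediaType"] == "application/vnd.ollama.image.model":
        # digests look like "sha256:<hex>"; blob filenames use "sha256-<hex>"
        print(models / "blobs" / layer["digest"].replace(":", "-"))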
By default, when pulling a model, Ollama may use a highly compressed variant (e.g. Q4, i.e. 4-bit quantization), which can reduce accuracy. If we want a more accurate model, pull a higher-precision tag, e.g. dolphin2.2-mistral:7b-q6_K (Q6).
sudo OLLAMA_HOST="0.0.0.0" ollama serve
PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving, with support for Hugging Face models; a minimal usage sketch follows the feature list below.
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
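A minimal offline-inference sketch; the model id is only an example, and any Hugging Face causal LM supported by vLLM would work. The OpenAI-compatible server is started separately (e.g. vllm serve <model> in recent versions).
from vllm import LLM, SamplingParams

# Minimal vLLM offline-inference sketch; facebook/opt-125m is only an example model id.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)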
How it works under the hood
PagedAttention
When generating tokens one-by-one for multiple users (a.k.a. batching requests), you:
- Store previous tokens in the KV cache.
- Pre-allocate memory assuming maximum sequence length (often over-allocating).
- Face 3 types of fragmentation:
- Internal fragmentation: Reserved more than needed.
- Reservation fragmentation: Reserved space not yet used.
- External fragmentation: Scattered free space not usable for new large sequences.
This results in low memory utilization, like only 20-38% of the KV cache being used effectively.
Just like how an OS solves fragmentation using pages, PagedAttention:
- Splits memory into fixed-size blocks (pages).
- Allocates only one block at a time, avoiding over-reserving.
- Uses a block table (like a page table) to map logical token positions to physical memory blocks.
- Enables block sharing — if two users share the same prompt, they can reuse the same blocks!
This change boosts memory utilization massively up to 96%. That translates into more requests served concurrently and lower latency.
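A toy sketch of the block-table idea (conceptual only, not vLLM's actual implementation): logical token slots map to fixed-size physical blocks that are allocated one at a time and can be shared between sequences.
# Toy illustration of a paged KV cache (conceptual, not vLLM's code).
BLOCK_SIZE = 16                      # tokens per physical KV block
free_blocks = list(range(1024))      # pool of physical block ids
block_tables = {}                    # seq_id -> list of physical block ids (the "page table")
token_counts = {}                    # seq_id -> tokens stored so far

def append_token(seq_id):
    """Store one more token's KV; allocate a new block only when the current one is full."""
    n = token_counts.get(seq_id, 0)
    if n % BLOCK_SIZE == 0:                                  # no partially filled block left
        block_tables.setdefault(seq_id, []).append(free_blocks.pop())
    token_counts[seq_id] = n + 1

def fork(parent_id, child_id):
    """Requests sharing the same prompt can reuse blocks (copy-on-write in the real system)."""
    block_tables[child_id] = list(block_tables[parent_id])
    token_counts[child_id] = token_counts[parent_id]

for _ in range(40):                  # 40 tokens -> 3 blocks, instead of a max-length reservation
    append_token("user-A")
print(len(block_tables["user-A"]))   # 3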
Radix Attention
- Multiple users send prompts to the model.
- These prompts often share common prefixes.
- Radix attention (like radix trees) groups shared prefixes together so the system avoids recomputing the same parts of a prompt for each request.
- This minimizes memory use and maximizes reuse across the batch.
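A toy sketch of the prefix-reuse idea (conceptual only, not SGLang's radix-tree implementation): cache work per token prefix so a request only computes the part that was not already seen.
# Conceptual prefix cache: map token prefixes to (pretend) cached KV state.
prefix_cache = {}

def longest_cached_prefix(tokens):
    """Length of the longest already-computed prefix, so only the suffix needs recomputing."""
    for end in range(len(tokens), 0, -1):
        if tuple(tokens[:end]) in prefix_cache:
            return end
    return 0

def process(tokens):
    start = longest_cached_prefix(tokens)
    for end in range(start + 1, len(tokens) + 1):
        prefix_cache[tuple(tokens[:end])] = f"kv-for-{end}-tokens"  # placeholder for real KV blocks
    return len(tokens) - start       # tokens actually computed

shared_system_prompt = [1, 2, 3, 4]
print(process(shared_system_prompt + [5]))   # 5 tokens computed
print(process(shared_system_prompt + [6]))   # shared prefix reused, only 1 token computed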
Resources
- https://www.youtube.com/watch?v=PcvxdWJOyUE&t=1752s&pp=0gcJCb8Ag7Wk3p_U
- https://www.youtube.com/watch?v=G1WNlLxPLSE&t=739s
MobileLLM
https://github.com/facebookresearch/MobileLLM
JAN
https://jan.ai/
SGLang
https://github.com/sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
SGKernel: Implements CUDA kernels for attention, normalization, activation, and GEMM. Contributions are welcome for those familiar with CUDA programming.
SGRouter: Handles cache-aware routing, supporting the S version of SGLang published last year.
SRT (Python part): The core of SGLang as an LLM inference runtime, supporting features like disaggregation, constrained decoding, function calling, OpenAI compatible server, and a wide range of models. Users can reference existing model implementations (e.g., Llama) to add support for custom models
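A minimal frontend sketch, hedged: the function/gen names follow the SGLang frontend docs as I recall them, and it assumes an SGLang server is already running locally on port 30000 (e.g. started with python -m sglang.launch_server).
import sglang as sgl

# Hedged sketch of the SGLang frontend; assumes a server is already running on port 30000.
@sgl.function
def answer(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = answer.run(question="What is SGLang?")
print(state["answer"])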
LLM Compressor
llmcompressor is an easy-to-use library for optimizing models for deployment with vllm, including:
- Comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- safetensors-based file format compatible with vllm
- Large model support via accelerate
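A hedged one-shot quantization sketch: the import paths and argument names follow the llm-compressor README as I recall them and may differ between versions, so treat them as assumptions and check the project docs.
from llmcompressor import oneshot                      # older versions expose llmcompressor.transformers.oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Hedged sketch of weight-only INT4 (W4A16) quantization; ids and arguments are examples.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",        # example model id
    dataset="open_platypus",                           # example calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-W4A16",                 # safetensors output, loadable by vLLM
    max_seq_length=2048,
    num_calibration_samples=512,
)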
Check out the compressed models here: https://huggingface.co/neuralmagic
Podman AI Lab
Podman AI Lab is the easiest way to work with Large Language Models (LLMs) on your local developer workstation. Find a catalog of recipes, leverage a curated list of open source models, and experiment with and compare the models. Get ahead of the curve and take your development to new heights with Podman AI Lab! There are many ways to run models locally. This extension fits perfectly into your local container workflow and exposes LLMs through inference APIs that you can directly access from your application containers. Beyond that, you can use playgrounds to optimize your inference parameters and recipes that help you with ready-made examples. Check it out here.
Ramalama
The RamaLama tool facilitates local management and serving of AI models.
On first run, RamaLama inspects your system for GPU support, falling back to CPU support if no GPUs are present.
RamaLama uses container engines like Podman or Docker to pull the appropriate OCI image with all of the software necessary to run an AI model for your system's setup.
Running in containers eliminates the need for users to configure the host system for AI. After this initialization, RamaLama runs the AI models within a container based on the OCI image. RamaLama pulls a container image specific to the GPUs discovered on the host system. These images are tied to the minor version of RamaLama. For example, RamaLama version 1.2.3 on an NVIDIA system pulls quay.io/ramalama/cuda:1.2. To override the default image, use the --image option.
prima.cpp
https://github.com/Lizonghang/prima.cpp
LLM Katan
https://yossiovadia.github.io/semantic-router/e2e-tests/llm-katan/terminal-demo.html