Ollama

  • Uses a Docker/OCI-registry-style layout to store and distribute models
  • Models are stored in /usr/share/ollama/.ollama/models
  • which contains blobs/ and manifests/registry.ollama.ai/
  • blobs contain the actual model data (e.g. the GGUF weights)

Llamafile

An open-source project by Mozilla that aims to democratize access to AI by enabling users to run large language models locally on their own machines, including on CPUs.

  • Single File Executable: Llamafile distributes large language models as single file executables, simplifying the process of running these models across various operating systems, CPU architectures, and GPU architectures. This portability eliminates the need for complex installations and ensures accessibility across a wide range of devices.

  • Focus on CPU Inference: Recognizing the limitations of relying solely on expensive and power-hungry GPUs, Llamafile emphasizes the potential of CPUs for running large language models. This focus democratizes access to AI by leveraging readily available and affordable hardware found in computers worldwide

  • Llamafile utilizes Cosmopolitan, a tool that enables the creation of single file executables that can run on multiple operating systems, achieving this through a clever hack involving a Unix shell script embedded in the MS-DOS stub of a portable executable

  • Outer Loop Unrolling for Prompt Processing: A key optimization technique involves unrolling the outer loop of the matrix multiplication operations that make up a significant portion of LLM computation (see the matmul sketch after the loop-unrolling example below).

What is loop unrolling?

 
for i in range(5):
    print(i)

# will be unrolled into:

print(0)
print(1)
print(2)
print(3)
print(4)
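
To connect this back to the Llamafile bullet above: in a matrix multiplication the outer loop runs over output rows, and unrolling it lets each loaded element of the other operand be reused for several output rows at once. Below is a toy Python sketch of that idea; llamafile's real kernels are hand-tuned C++/SIMD, and the function here is purely illustrative.

def matmul_outer_unrolled(A, B):
    # Outer loop over output rows, unrolled by a factor of 2:
    # each iteration computes two rows of C, so every element of B
    # that is loaded gets reused for both rows.
    n, k, m = len(A), len(A[0]), len(B[0])
    assert n % 2 == 0  # keep the sketch simple: even number of rows
    C = [[0.0] * m for _ in range(n)]
    for i in range(0, n, 2):
        for j in range(m):
            acc0 = acc1 = 0.0
            for p in range(k):
                b = B[p][j]              # loaded once, used for two rows
                acc0 += A[i][p] * b
                acc1 += A[i + 1][p] * b
            C[i][j] = acc0
            C[i + 1][j] = acc1
    return C

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [0, 1]]                     # identity, so the result equals A (as floats)
print(matmul_outer_unrolled(A, B))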

How to run

  • Download the llamafile for the model, make it executable, and run it as ./model-name:
 
wget https://huggingface.co/Mozilla/Meta-Llama-3.1-8B-Instruct-llamafile/resolve/main/Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
 
chmod +x Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
 
./Meta-Llama-3.1-8B-Instruct.Q6_K.llamafile
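
By default the llamafile starts a local web UI and an HTTP API (typically on http://127.0.0.1:8080; the port and default mode can vary between versions, so check the startup output). Assuming that default port and the OpenAI-compatible /v1/chat/completions endpoint of the embedded llama.cpp server, a minimal query from Python looks like this:

import json
import urllib.request

# Chat request against the local llamafile server; port 8080 is an
# assumption -- use whatever the startup output reports.
payload = {
    "model": "local",  # single-model server; the name is effectively ignored
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])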
 

Using llamafile with external weights

 
curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.11/llamafile-0.8.11

chmod +x llamafile.exe

curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

./llamafile.exe -m mistral.gguf

Running ollama models

When we download a new model with ollama, all its metadata is stored in a manifest file under ~/.ollama/models/manifests/registry.ollama.ai/library/. The directory and manifest file names correspond to the model name and tag as returned by ollama list. For instance, for llama3:latest the manifest file will be ~/.ollama/models/manifests/registry.ollama.ai/library/llama3/latest.

The manifest maps each file related to the model (e.g. GGUF weights, license, prompt template, etc) to a sha256 digest. The digest corresponding to the element whose mediaType is application/vnd.ollama.image.model is the one referring to the model’s GGUF file.

Each sha256 digest is also used as a filename in the ~/.ollama/models/blobs directory (if you look into that directory you’ll see only those sha256-* filenames). This means you can directly run llamafile by passing the sha256 digest as the model filename. So if e.g. the llama3:latest GGUF file digest is sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29, you can run llamafile as follows:

cd /usr/share/ollama/.ollama/models/blobs
llamafile -m sha256-00e1317cbf74d901080d7100f57580ba8dd8de57203072dc6f668324ba545f29

To get a model's sha256 digest, look in its manifest under /usr/share/ollama/.ollama/models/manifests/registry.ollama.ai/library/<modelname>/latest, or check the Ollama logs.
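
This lookup can also be scripted. A minimal sketch, assuming the manifest layout described above (OCI-style JSON with a layers array, digests written as sha256:<hex> and stored as sha256-<hex> files under blobs/); the base path depends on whether Ollama runs as a service or as your own user, and the helper name is just for illustration:

import json
from pathlib import Path

# Service install path; use Path.home() / ".ollama/models" for a user install.
BASE = Path("/usr/share/ollama/.ollama/models")

def gguf_blob_for(model: str, tag: str = "latest") -> Path:
    # Read the model's manifest and return the blob path of its GGUF weights,
    # i.e. the layer whose mediaType is application/vnd.ollama.image.model.
    manifest = BASE / "manifests/registry.ollama.ai/library" / model / tag
    data = json.loads(manifest.read_text())
    for layer in data.get("layers", []):
        if layer.get("mediaType") == "application/vnd.ollama.image.model":
            digest = layer["digest"]                 # e.g. "sha256:00e1317c..."
            return BASE / "blobs" / digest.replace(":", "-")
    raise FileNotFoundError(f"no model layer found in {manifest}")

print(gguf_blob_for("llama3"))  # pass the printed path to: llamafile -m <path>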

Note: the paths above apply when Ollama runs as a system service. If you instead run ollama serve manually as your own user and pull models from there, they are stored under your home directory (~/.ollama/models).

By default, ollama pull may fetch a heavily quantized variant of a model (e.g. Q4, i.e. 4-bit quantization), which can cost some accuracy. If you want a more accurate model, pull a higher-precision tag explicitly, e.g. dolphin2.2-mistral:7b-q6_K (Q6).

To expose the Ollama server on all network interfaces:

sudo OLLAMA_HOST="0.0.0.0" ollama serve

PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving, with support for Hugging Face models.

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism and pipeline parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
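
A minimal offline-inference sketch using vLLM's Python API (the model id below is just an example; any supported Hugging Face model id works and is downloaded on first use):

from vllm import LLM, SamplingParams

# Load the model and generate with nucleus sampling.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)

For serving, vLLM also ships the OpenAI-compatible API server listed above; see its docs for the launch command matching your installed version.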

How it works under the hood

PagedAttention: When generating tokens one-by-one for multiple users (i.e. batching requests), you:

  • Store previous tokens in the KV cache.
  • Pre-allocate memory assuming maximum sequence length (often over-allocating).
  • Face 3 types of fragmentation:
    • Internal fragmentation: Reserved more than needed.
    • Reservation fragmentation: Reserved space not yet used.
    • External fragmentation: Scattered free space not usable for new large sequences.

This results in low memory utilization, like only 20-38% of the KV cache being used effectively.

Just like an OS solves memory fragmentation with paging, PagedAttention:

  • Splits memory into fixed-size blocks (pages).
  • Allocates only one block at a time, avoiding over-reserving.
  • Uses a block table (like a page table) to map logical token positions to physical memory blocks.
  • Enables block sharing — if two users share the same prompt, they can reuse the same blocks!

This boosts memory utilization massively, up to ~96%. That translates into more requests served concurrently and lower latency.
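
A toy sketch of the block-table idea (an illustration of the concept only, not vLLM's actual data structures): each sequence keeps a small table mapping logical KV-cache blocks to physical blocks in a shared pool, and sequences with a common prefix can point at the same physical blocks.

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative)

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # free physical block ids
        self.ref_count = {}                  # physical block id -> number of users

    def allocate(self):
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1           # prefix sharing: no copy needed

pool = BlockPool(num_blocks=16)

# Sequence A's block table: logical block i -> physical block id (up to 12 tokens).
seq_a = [pool.allocate() for _ in range(3)]

# Sequence B shares A's first two blocks (same prompt prefix) and only
# allocates one new block for its own continuation.
seq_b = seq_a[:2] + [pool.allocate()]
for block in seq_a[:2]:
    pool.share(block)

print("A's block table:", seq_a)
print("B's block table:", seq_b)
print("free physical blocks left:", len(pool.free))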

Radix Attention

  • Multiple users send prompts to the model.
  • These prompts often share common prefixes.
  • Radix attention (like radix trees) groups shared prefixes together so the system avoids recomputing the same parts of a prompt for each request.
  • This minimizes memory use and maximizes reuse across the batch.
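
A toy sketch of the prefix-reuse idea (illustration only, not SGLang's RadixAttention implementation): each incoming prompt is matched against a tree of previously seen token prefixes, and only the unseen suffix needs fresh computation.

class PrefixCache:
    def __init__(self):
        self.root = {}  # token -> child node

    def insert(self, tokens):
        # Return how many leading tokens were already cached (reusable),
        # then cache the remainder for future requests.
        node, reused = self.root, 0
        for tok in tokens:
            if tok in node:
                reused += 1
            else:
                node[tok] = {}
            node = node[tok]
        return reused

cache = PrefixCache()
a = "You are a helpful assistant. Summarize this article".split()
b = "You are a helpful assistant. Translate this sentence".split()
print(cache.insert(a))  # 0 reused: first request computes everything
print(cache.insert(b))  # 5 reused: the shared system-prompt prefix is not recomputed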

Resources

MobileLLM

https://github.com/facebookresearch/MobileLLM

JAN

https://jan.ai/

SGLang

https://github.com/sgl-project/sglang

SGLang is a fast serving framework for large language models and vision language models.

SGKernel: Implements CUDA kernels for attention, normalization, activation, and GEMM. Contributions are welcome for those familiar with CUDA programming.

SGRouter: Handles cache-aware routing, supporting the S version of SGLang published last year.

SRT (Python part): The core of SGLang as an LLM inference runtime, supporting features like disaggregation, constrained decoding, function calling, an OpenAI-compatible server, and a wide range of models. Users can reference existing model implementations (e.g., Llama) to add support for custom models.

LLM Compressor

llmcompressor is an easy-to-use library for optimizing models for deployment with vllm, including:

  • Comprehensive set of quantization algorithms for weight-only and activation quantization
  • Seamless integration with Hugging Face models and repositories
  • safetensors-based file format compatible with vllm
  • Large model support via accelerate

Check out the compressed models here: https://huggingface.co/neuralmagic

Podman AI Lab

Podman AI Lab is the easiest way to work with Large Language Models (LLMs) on your local developer workstation. Find a catalog of recipes, leverage a curated list of open source models, and experiment with and compare the models. Get ahead of the curve and take your development to new heights with Podman AI Lab! There are many ways to run models locally. This extension fits perfectly into your local container workflow and exposes LLMs through inference APIs that you can directly access from your application containers. Beyond that, you can use playgrounds to optimize your inference parameters and recipes that help you with ready-made examples. Check it out here.

Ramalama

The RamaLama tool facilitates local management and serving of AI models.

On first run RamaLama inspects your system for GPU support, falling back to CPU support if no GPUs are present.

RamaLama uses container engines like Podman or Docker to pull the appropriate OCI image with all of the software necessary to run an AI model for your system's setup.

Running in containers eliminates the need for users to configure the host system for AI. After initialization, RamaLama runs the AI models within a container based on the OCI image. RamaLama pulls a container image specific to the GPUs discovered on the host system. These images are tied to the minor version of RamaLama. For example, RamaLama version 1.2.3 on an NVIDIA system pulls quay.io/ramalama/cuda:1.2. To override the default image, use the --image option.

prima.cpp

https://github.com/Lizonghang/prima.cpp

LLM Katan

https://yossiovadia.github.io/semantic-router/e2e-tests/llm-katan/terminal-demo.html