Llamafile, Mozilla’s portable LLM runner, gets GPU support and a rebuilt core
Running a large language model on a single machine without cloud access or a container runtime remains a priority for practitioners working in air-gapped or resource-constrained environments. Llamafile, Mozilla-AI’s project for packaging and running LLMs as self-contained executables, has received its most significant architectural overhaul to date with version 0.10.0.

A rebuild from the ground up
The 0.10.0 release is the product of a deliberate decision to reconstruct Llamafile's core from scratch. The stated goal was a Cosmopolitan-based llama.cpp build that preserved the two properties central to the project: portability across operating systems and hardware architectures, and the ability to bundle model weights directly inside a Llamafile executable. The rebuild also brought the project's llama.cpp dependency up to date, incorporating recent model support that older builds lacked.
GPU acceleration returns
One of the most significant changes in 0.10.0 is the restoration of GPU support, which had been absent from earlier iterations of the rebuilt codebase. Metal support for macOS ARM64 arrived first, in December 2025, implemented by compiling a small module with the Xcode Command Line Tools; it works in both the terminal interface and server mode. CUDA support for Linux followed in February 2026.
GPU support for Windows is listed as still missing in the current release.
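Since Llamafile executables accept llama.cpp-style command-line options, GPU offload can be requested at launch. The sketch below is illustrative: the model file name is a placeholder, and it assumes version 0.10.0 retains llama.cpp's -ngl (--n-gpu-layers) flag.

```shell
# Offload as many layers as possible to the GPU (Metal on macOS ARM64,
# CUDA on Linux). Model file name is a placeholder; -ngl is llama.cpp's
# standard GPU-layer option, assumed to carry over to the rebuilt core.
./Qwen3.5-0.8B-Q8.llamafile -ngl 999 -p "Hello"
```

If no GPU is available, the same invocation falls back to CPU inference, which is how the project's Raspberry Pi 5 throughput figure was measured.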
Terminal interface and server mode
Version 0.10.0 adds a terminal user interface (TUI) that lets users chat with a loaded model directly from the command line, while server mode, reached via the --server flag, exposes the model over HTTP. Together with the one-shot CLI, the release supports three distinct operational modes: chat, CLI, and server.
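The three modes can be sketched as follows. Only --server is confirmed by the release notes; the bare chat invocation and the -p one-shot flag are assumptions based on llama.cpp's conventional interface, and the model file name is a placeholder.

```shell
# Chat mode: launching with no mode flag is assumed to open the TUI.
./model.llamafile

# CLI mode: one-shot generation, assuming llama.cpp's -p prompt flag.
./model.llamafile -p "Summarize the release notes in one sentence"

# Server mode: the --server flag is documented in the release.
./model.llamafile --server
```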
Multimodal and speech capabilities
The mtmd API is now accessible through the TUI, enabling multimodal model access directly from the terminal. Models tested with this capability include llava 1.6, Qwen3-VL, and Ministral 3. Support for image input via the --image flag has been added to CLI mode.
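A multimodal CLI invocation might look like the following. The --image flag is confirmed by the release; the model and image file names, and the use of -p for the accompanying prompt, are illustrative assumptions.

```shell
# Ask a vision-capable model about a local image from CLI mode.
# --image is documented in 0.10.0; file names are placeholders,
# and -p follows llama.cpp's usual prompt convention.
./llava-v1.6-mistral-7b.llamafile --image photo.jpg -p "Describe this image"
```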
Whisper, OpenAI's speech recognition model, was added in the most recent update cycle (March 2026), extending the project beyond text-only inference.
Model support and example llamafiles
Mozilla-AI distributes a set of prebuilt example llamafiles bundled with version 0.10.0. These range from the 1.6 GB Qwen3.5 0.8B Q8 quantization, which the project reports generates approximately 8 tokens per second on a Raspberry Pi 5 without a GPU, up to a 19 GB Qwen3.5 27B Q5 quantization. Other included models are Ministral 3 3B Instruct, llava v1.6 mistral 7b, Apertus 8B Instruct, gpt-oss 20b, and LFM2 24B A2B.
The llama.cpp dependency has been updated to commit 7f5ee54, which adds support for Qwen3.5 models.
Windows users face a practical constraint: the operating system enforces a 4 GB maximum executable file size, which most of the current example llamafiles exceed. The project supports using external weights as a workaround.
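The external-weights workaround amounts to keeping the GGUF file outside the executable so the binary itself stays under 4 GB. A hedged sketch, assuming the rebuilt core retains llama.cpp's -m model-path flag (file names are placeholders):

```shell
# Windows workaround: a small llamafile.exe loads weights from a
# separate GGUF file instead of embedding them. -m is llama.cpp's
# standard model-path option, assumed to carry over; names are illustrative.
llamafile.exe -m qwen3.5-27b-q5.gguf --server
```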
Build system and dependencies
The build system was simplified early in the 0.10.0 development cycle: CMake was dropped in favor of a custom BUILD.mk file, and dependencies are drawn from the llama.cpp vendor directory. The project now targets cosmocc 4.0.2. The zipalign utility has been added as a Git submodule so that upstream updates from its maintainer can be tracked.
What remains pending
The project lists several capabilities not yet carried over to the new build. Stable diffusion code exists in the repository but has not been ported. The pledge() and SECCOMP sandboxing features are absent. The standalone Llamafiler embeddings server has been rolled back in favor of llama.cpp's built-in embeddings endpoint. Some CLI arguments that worked in earlier versions are not yet operational.
Integration tests were added in March 2026, along with “skill documents” intended for use with AI assistants.
