mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-06-28 04:30:15 -05:00
170 lines
6.5 KiB
Markdown
170 lines
6.5 KiB
Markdown
# Build and use ik_llama.cpp with CPU or CPU+CUDA
|
|
|
|
Built on top of [ikawrakow/ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [llama-swap](https://github.com/mostlygeek/llama-swap)
|
|
|
|
Commands are provided for Podman and Docker.
|
|
|
|
CPU or CUDA sections under [Prebuilt](#Prebuilt)/[Build](#Build) and [Run]($Run) are enough to get up and running.
|
|
|
|
## Overview
|
|
|
|
- [Prebuilt](#Prebuilt)
|
|
- [Build](#Build)
|
|
- [Run](#Run)
|
|
- [Troubleshooting](#Troubleshooting)
|
|
- [Extra Features](#Extra)
|
|
- [Credits](#Credits)
|
|
|
|
## Prebuilt Docker images
|
|
|
|
Pull one of the available images from `ghcr.io`. [View all tags](https://github.com/ikawrakow/ik_llama.cpp/pkgs/container/ik-llama-cpp/versions?filters%5Bversion_type%5D=tagged)
|
|
|
|
```bash
|
|
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-swap
|
|
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-server
|
|
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-full
|
|
|
|
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-swap
|
|
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-server
|
|
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-full
|
|
```
|
|
|
|
## Build
|
|
|
|
The project uses Docker Bake for building multiple targets efficiently.
|
|
|
|
Clone the repository: `git clone https://github.com/ikawrakow/ik_llama.cpp`
|
|
|
|
Use `docker-bake`.
|
|
|
|
```bash
|
|
docker buildx create --name ik-llama-builder --use
|
|
```
|
|
|
|
### CPU Variant
|
|
|
|
```bash
|
|
VARIANT=cpu docker buildx bake --builder ik-llama-builder --load full swap
|
|
```
|
|
|
|
Or with custom tags:
|
|
|
|
```bash
|
|
REPO_OWNER=yourname VARIANT=cpu docker buildx bake --builder ik-llama-builder --load \
|
|
-f ./docker-bake.hcl \
|
|
full swap
|
|
```
|
|
|
|
### CUDA Variant
|
|
|
|
First, set the CUDA version and GPU architecture in `ik_llama-cuda.Containerfile`:
|
|
- `CUDA_DOCKER_ARCH`: Your GPU's compute capability (e.g., `86` for RTX 30*, `89` for RTX 40*, `12.0` for RTX 50*)
|
|
- `CUDA_VERSION`: CUDA Toolkit version (e.g., `12.6.2`, `13.1.1`)
|
|
|
|
```bash
|
|
VARIANT=cu12 docker buildx bake --builder ik-llama-builder --load full swap
|
|
```
|
|
|
|
### Build Targets
|
|
|
|
Builds two image tags per variant:
|
|
|
|
- **`full`**: Includes `llama-server`, `llama-quantize`, and other utilities.
|
|
- **`swap`**: Includes only `llama-swap` and `llama-server`.
|
|
|
|
## Run
|
|
|
|
- Download `.gguf` model files to your favorite directory (e.g., `/my_local_files/gguf`).
|
|
- Map it to `/models` inside the container.
|
|
- Open browser `http://localhost:9292` and enjoy the features.
|
|
- API endpoints are available at `http://localhost:9292/v1` for use in other applications.
|
|
|
|
### CPU
|
|
|
|
```bash
|
|
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
|
|
```
|
|
|
|
```bash
|
|
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
|
|
```
|
|
|
|
### CUDA
|
|
|
|
- Install Nvidia Drivers and CUDA on the host.
|
|
- For Docker, install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
|
|
- For Podman, install [CDI Container Device Interface](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html)
|
|
- Identify your GPU:
|
|
- [CUDA GPU Compute Capability](https://developer.nvidia.com/cuda/gpus) (e.g., `8.6` for RTX30*, `8.9` for RTX40*, `12.0` for RTX50*)
|
|
- [CUDA Toolkit supported version](https://developer.nvidia.com/cuda-toolkit-archive)
|
|
|
|
```bash
|
|
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
|
|
```
|
|
|
|
```bash
|
|
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
- If CUDA is not available, use `ik_llama-cpu` instead.
|
|
- If models are not found, ensure you mount the correct directory: `-v /my_local_files/gguf:/models:ro`
|
|
- If you need to install `podman` or `docker` follow the [Podman Installation](https://podman.io/docs/installation) or [Install Docker Engine](https://docs.docker.com/engine/install) for your OS.
|
|
|
|
## Extra
|
|
|
|
- **Custom commit**: Build a specific `ik_llama.cpp` commit by modifying the Containerfile or using build args.
|
|
|
|
```bash
|
|
docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full
|
|
```
|
|
|
|
- **Using the tools in the `full` image**:
|
|
|
|
```bash
|
|
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
|
|
# ./llama-quantize ...
|
|
# python3 gguf-py/scripts/gguf_dump.py ...
|
|
# ./llama-perplexity ...
|
|
# ./llama-sweep-bench ...
|
|
```
|
|
|
|
```bash
|
|
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
|
|
# ./llama-quantize ...
|
|
# python3 gguf-py/scripts/gguf_dump.py ...
|
|
# ./llama-perplexity ...
|
|
# ./llama-sweep-bench ...
|
|
```
|
|
|
|
- **Customize `llama-swap` config**: Save the `./docker/ik_llama-cpu-swap.config.yaml` or `./docker/ik_llama-cuda-swap.config.yaml` locally (e.g., under `/my_local_files/`) then map it to `/app/config.yaml` inside the container appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...`.
|
|
|
|
- **Run in background**: Replace `-it` with `-d`: `podman run -d ...` or `docker run -d ...`. To stop it: `podman stop ik_llama` or `docker stop ik_llama`.
|
|
|
|
- **GGML_NATIVE**: If you build the image on a different machine, change `-DGGML_NATIVE=ON` to `-DGGML_NATIVE=OFF` in the `.Containerfile`.
|
|
|
|
- **KV quantization types**: To use more KV quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
|
|
|
|
- **Cleanup unused CUDA images**: If you experiment with several `CUDA_VERSION`, delete unused images (they are several GB):
|
|
```bash
|
|
podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \
|
|
podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04
|
|
```
|
|
|
|
- **Build without `llama-swap`**: Change `--target swap` to `--target server` in docker-bake or Containerfiles.
|
|
|
|
- **Pre-made quants**: Look for premade quants from [ubergarm](https://huggingface.co/ubergarm/models).
|
|
|
|
- **GGUF tools**: Build custom quants with [Thireus](https://github.com/Thireus/GGUF-Tool-Suite)'s tools.
|
|
|
|
- **Download prebuilt binaries**: Download from [ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA](https://github.com/Thireus/ik_llama.cpp).
|
|
|
|
- **KoboldCPP experience**: [Croco.Cpp is a fork of KoboldCPP inferring GGUF/GGML models on CPU/Cuda with KoboldAI's UI. It's powered partly by IK_LLama.cpp, and compatible with most of Ikawrakow's quants except Bitnet.](https://github.com/Nexesenex/croco.cpp)
|
|
|
|
## Credits
|
|
|
|
All credits to the awesome community:
|
|
|
|
[llama-swap](https://github.com/mostlygeek/llama-swap)
|