Llama.cpp

Beschreibung

llama.cpp ist eine C/C++ Implementierung für Inference von Large Language Models (LLMs). Es unterstützt verschiedene Backends (CPU, Vulkan, ROCm, CUDA) und ermöglicht das Ausführen von quantisierten Modellen im GGUF-Format.

Download

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Installation

Vulkan Build

Vulkan SDK herunterladen entpacken, ins Verzeichnis wechseln und source setup-env.sh ausführen. Dann ins Verzeichnis wechseln wo llama.cpp installiert werden soll. Dort dann

cmake -B build-vulkan -DGGML_VULKAN=1
cmake --build build-vulkan --config Release -- -j $(nproc)

ROCm Build

gfx906 = Mi 50 Support
gfx1100 = 7900 XTX Support

# ROCm Umgebung einrichten
echo 'export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
echo 'export HSA_OVERRIDE_GFX_VERSION=9.0.6' >> ~/.bashrc  # Für MI50/MI60
source ~/.bashrc

# llama.cpp kompilieren
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 \
    -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build --config Release -- -j 8

Für mehrere GPU-Architekturen gleichzeitig:

# Für MI50 (gfx906) und RX 7900 XTX (gfx1100)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="gfx906;gfx1100" \
    -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build --config Release -- -j 8

Konfiguration

llama-server als systemd Service einrichten

Service-Datei erstellen:

sudo nano /etc/systemd/system/llama-server.service

Service-Datei Inhalt (Multi-GPU mit ROCm):

[Unit]
Description=Llama.cpp ROCm Multi-GPU Server
After=network.target

[Service]
Type=simple
User=username
Group=username
WorkingDirectory=/home/username/llama.cpp/build/bin

# Multi-GPU Konfiguration
Environment="HIP_VISIBLE_DEVICES=0,1"
Environment="HSA_OVERRIDE_GFX_VERSION=9.0.6"
Environment="PATH=/opt/rocm/bin:/usr/local/bin:/usr/bin:/bin"
Environment="LD_LIBRARY_PATH=/opt/rocm/lib"

# Server mit optimalen Multi-GPU Einstellungen
ExecStart=/home/username/llama.cpp/build/bin/llama-server \
  -m /home/username/models/model.gguf \
  --split-mode row \
  --tensor-split 0.5,0.5 \
  -ngl 99 \
  -fa 1 \
  --host 0.0.0.0 \
  --port 8080 \
  -c 32768 \
  -b 2048 \
  -ub 2048 \
  --threads 8 \
  --parallel 1 \
  --jinja

Restart=always
RestartSec=10
LimitNOFILE=65535
LimitMEMLOCK=infinity
StandardOutput=journal
StandardError=journal
SyslogIdentifier=llama-server

[Install]
WantedBy=multi-user.target

Service aktivieren und starten:

# Service neu laden
sudo systemctl daemon-reload

# Service aktivieren (Auto-Start beim Boot)
sudo systemctl enable llama-server

# Service starten
sudo systemctl start llama-server

# Status prüfen
sudo systemctl status llama-server

# Logs anzeigen
sudo journalctl -u llama-server -f

Service Management:

# Service stoppen
sudo systemctl stop llama-server

# Service neustarten
sudo systemctl restart llama-server

# Service deaktivieren
sudo systemctl disable llama-server

==== Manuelle Server-Starts ====
'''Multi-GPU (ROCm) - Optimal:'''
<syntaxhighlight lang="bash" line="1">
HIP_VISIBLE_DEVICES=0,1 ./llama-server \
  -m ~/models/model.gguf \
  --split-mode row \
  --tensor-split 0.5,0.5 \
  -ngl 99 \
  -fa 1 \
  --host 0.0.0.0 \
  --port 8080 \
  -c 32768 \
  -b 2048 \
  -ub 2048 \
  --threads 8 \
  --parallel 1 \
  --jinja

Update

cd ~/llama.cpp
git pull
cmake --build build --config Release -- -j $(nproc)

# Service neu starten falls aktiv
sudo systemctl restart llama-server

Test

Bekannte Probleme

Error 1 [ 34%] Linking HIP shared library ../../../bin/libggml-hip.so [ 34%] Built target ggml-hip gmake[1]: *** [CMakeFiles/Makefile2:1775: ggml/src/CMakeFiles/ggml-cpu.dir/all] Error 2 gmake: *** [Makefile:146: all] Error 2

Lösung:

GCC 13.3.0 konnte damit nicht umgehen, aber GCC 14.2.0 hat bessere C++-Standard-Compliance und kann diesen Header-Konflikt korrekt auflösen – deshalb funktioniert die Kompilierung jetzt mit der neueren Compiler-Version.
sudo update-alternatives --remove gcc /usr/bin/gcc-13
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 50
sudo update-alternatives --config gcc
# Wähle gcc-14

sudo update-alternatives --remove g++ /usr/bin/g++-13
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 50
sudo update-alternatives --config g++
# Wähle g++-14

Nützliche Links

Llama.cpp

Inhaltsverzeichnis

Beschreibung

Download

Installation

Vulkan Build

ROCm Build

Konfiguration

llama-server als systemd Service einrichten

Update

Test

Bekannte Probleme

Nützliche Links

Navigationsmenü

Llama.cpp

Beschreibung

Download

Installation

Vulkan Build

ROCm Build

Konfiguration

llama-server als systemd Service einrichten

Update

Test

Bekannte Probleme

Nützliche Links

Navigationsmenü

Suche