vLLM
=== Description ===
Install Docker as usual.
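One common way to do this (an assumption here, not part of this wiki's setup; it presumes a Debian/Ubuntu-like host) is Docker's convenience script:
<syntaxhighlight lang="bash">
# Assumption: Debian/Ubuntu-like host; official Docker convenience script
curl -fsSL https://get.docker.com | sh
# Optional: allow running docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER
</syntaxhighlight>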


=== Download ===
Normal (ROCm):
<syntaxhighlight lang="bash">
docker pull rocm/vllm-dev:nightly
</syntaxhighlight>
gfx906:
<syntaxhighlight lang="bash">
docker pull nalanzeyu/vllm-gfx906
</syntaxhighlight>
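Which image fits depends on the GPU's gfx target. A quick way to check it (assuming the host ROCm tools, i.e. rocminfo, are installed):
<syntaxhighlight lang="bash">
# Print the gfx target of the installed GPU, e.g. gfx906 or gfx1201
rocminfo | grep -o 'gfx[0-9a-f]*' | head -n 1
</syntaxhighlight>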


=== Run ===
Variant 1:
<syntaxhighlight lang="bash">
docker run -it --rm --shm-size=8g --device=/dev/kfd --device=/dev/dri \
     --group-add video -p 8086:8000 \
     -v /mnt/share/models:/models \
     nalanzeyu/vllm-gfx906 \
     vllm serve /models/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit --served-model-name Homelab --max-model-len 30000 --enable-auto-tool-choice --tool-call-parser hermes
</syntaxhighlight>
Variant 2, tested 18 December 2025:
<syntaxhighlight lang="bash">
sudo docker run -it --rm --network=host \
--group-add=video --ipc=host --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined --device /dev/kfd \
--device /dev/dri \
-v /home/hendrik/.lmstudio/models/:/app/models \
-e HF_HOME="/app/models" \
-e HF_TOKEN="<TOKEN>" \
-e NCCL_P2P_DISABLE=1 \
-e VLLM_CUSTOM_OPS=all \
-e VLLM_ROCM_USE_AITER=0 \
-e SAFETENSORS_FAST_GPU=1 \
-e PYTORCH_TUNABLEOP_ENABLED=1 \
rocm/vllm-dev:nightly
</syntaxhighlight>
For gfx1201:
<syntaxhighlight lang="bash">
sudo docker run -it --rm --network=host \
--group-add=video --ipc=host --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined --device /dev/kfd \
--device /dev/dri \
-v /home/hendrik/.lmstudio/models/:/app/models \
-e HF_HOME="/app/models" \
-e HF_TOKEN="<TOKEN>" \
-e NCCL_P2P_DISABLE=1 \
-e VLLM_CUSTOM_OPS=all \
-e VLLM_ROCM_USE_AITER=0 \
-e SAFETENSORS_FAST_GPU=1 \
-e PYTORCH_TUNABLEOP_ENABLED=1 \
kyuz0/vllm-therock-gfx1201
</syntaxhighlight>
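Variant 2 and the gfx1201 command start the container without launching a model; the vllm serve commands below are then run inside it. A quick sanity check that the GPU is visible in the container (assuming rocm-smi is included in the image):
<syntaxhighlight lang="bash">
# Inside the container: list the AMD GPUs detected by ROCm
rocm-smi
</syntaxhighlight>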
 
 
Without tensor parallelism:
<syntaxhighlight lang="bash">
vllm serve Qwen/Qwen3-VL-8B-Thinking --served-model-name Homelab --max-model-len 4096 --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3
</syntaxhighlight>
With tensor parallelism (--tp 2):
<syntaxhighlight lang="bash">
vllm serve Qwen/Qwen3-VL-8B-Thinking --served-model-name Homelab --tp 2 --max-model-len 4096 --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3
</syntaxhighlight>
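To check that the server is up, the OpenAI-compatible API can be queried (port 8086 as mapped in Variant 1, or the default 8000 when running with --network=host):
<syntaxhighlight lang="bash">
# List the models the server exposes; adjust the port to the chosen variant
curl http://localhost:8086/v1/models
</syntaxhighlight>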
Benchmark:
<syntaxhighlight lang="bash">
vllm bench serve --num-prompts 1 --dataset-name=random --input-len 512 --output-len 128 --model Qwen/Qwen3-4B-Instruct-2507-FP8
</syntaxhighlight>
=== Test ===
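A minimal smoke test against the running server (assuming the Variant 1 setup: --served-model-name Homelab, API on port 8086) could look like this:
<syntaxhighlight lang="bash">
# Simple chat completion against the OpenAI-compatible endpoint
curl http://localhost:8086/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Homelab",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
</syntaxhighlight>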
=== Known Issues ===

=== Useful Links ===