# Intel Gaudi

`dstack` supports running dev environments, tasks, and services on Intel Gaudi GPUs via SSH fleets.
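If you don't have an SSH fleet yet, a minimal fleet configuration is sketched below. The fleet name, host address, user, and key path are placeholders, not values from this example:

```yaml
type: fleet
name: gaudi-fleet  # hypothetical name

# SSH credentials and the addresses of the Gaudi hosts
ssh_config:
  user: ubuntu                 # placeholder user
  identity_file: ~/.ssh/id_rsa # placeholder key path
  hosts:
    - 192.168.100.10           # placeholder address of a Gaudi node
```

Applying this file with `dstack apply` registers the hosts, after which the runs below can be scheduled on them.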
## Deployment
Serving frameworks like vLLM and TGI support Intel Gaudi. Here's an example of a service that deploys DeepSeek-R1-Distill-Llama-70B on Gaudi using TGI:
```yaml
type: service
name: tgi

image: ghcr.io/huggingface/tgi-gaudi:2.3.1
env:
  - HF_TOKEN
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - PORT=8000
  - OMPI_MCA_btl_vader_single_copy_mechanism=none
  - TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true
  - PT_HPU_ENABLE_LAZY_COLLECTIVES=true
  - MAX_TOTAL_TOKENS=2048
  - BATCH_BUCKET_SIZE=256
  - PREFILL_BATCH_BUCKET_SIZE=4
  - PAD_SEQUENCE_TO_MULTIPLE_OF=64
  - ENABLE_HPU_GRAPH=true
  - LIMIT_HPU_GRAPH=true
  - USE_FLASH_ATTENTION=true
  - FLASH_ATTENTION_RECOMPUTE=true
commands:
  - text-generation-launcher
    --sharded true
    --num-shard $DSTACK_GPUS_NUM
    --max-input-length 1024
    --max-total-tokens 2048
    --max-batch-prefill-tokens 4096
    --max-batch-total-tokens 524288
    --max-waiting-tokens 7
    --waiting-served-ratio 1.2
    --max-concurrent-requests 512
port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: gaudi2:8

# Uncomment to cache downloaded models
#volumes:
#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
```
And here's the same model deployed with vLLM, via Habana's `vllm-fork`:

```yaml
type: service
name: deepseek-r1-gaudi

image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  - HABANA_VISIBLE_DEVICES=all
  - OMPI_MCA_btl_vader_single_copy_mechanism=none
commands:
  - git clone https://github.com/HabanaAI/vllm-fork.git
  - cd vllm-fork
  - git checkout habana_main
  - pip install -r requirements-hpu.txt
  - python setup.py develop
  - vllm serve $MODEL_ID
    --tensor-parallel-size 8
    --trust-remote-code
    --download-dir /data
port: 8000
model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

resources:
  gpu: gaudi2:8

# Uncomment to cache downloaded models
#volumes:
#  - /root/.cache/huggingface/hub:/root/.cache/huggingface/hub
```
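Because both services set `model`, the deployed model is also exposed through an OpenAI-compatible endpoint once the run is up. A minimal sketch of a request, assuming a hypothetical gateway domain `gateway.example.com` in place of your own:

```shell
$ curl https://gateway.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
          "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
          "messages": [{"role": "user", "content": "What is 2 + 2?"}],
          "max_tokens": 128
        }'
```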
## Fine-tuning
Below is an example of LoRA fine-tuning of DeepSeek-R1-Distill-Qwen-7B using Optimum for Intel Gaudi and DeepSpeed with the lvwerra/stack-exchange-paired dataset:
```yaml
type: task
name: trl-train

image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0
env:
  - MODEL_ID=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
  - WANDB_API_KEY
  - WANDB_PROJECT
commands:
  - pip install --upgrade-strategy eager optimum[habana]
  - pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
  - git clone https://github.com/huggingface/optimum-habana.git
  - cd optimum-habana/examples/trl
  - pip install -r requirements.txt
  - pip install wandb
  - DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python ../gaudi_spawn.py --world_size $DSTACK_GPUS_NUM --use_deepspeed sft.py
    --model_name_or_path $MODEL_ID
    --dataset_name "lvwerra/stack-exchange-paired"
    --deepspeed ../language-modeling/llama2_ds_zero3_config.json
    --output_dir="./sft"
    --do_train
    --max_steps=500
    --logging_steps=10
    --save_steps=100
    --per_device_train_batch_size=1
    --per_device_eval_batch_size=1
    --gradient_accumulation_steps=2
    --learning_rate=1e-4
    --lr_scheduler_type="cosine"
    --warmup_steps=100
    --weight_decay=0.05
    --optim="paged_adamw_32bit"
    --lora_target_modules "q_proj" "v_proj"
    --bf16
    --remove_unused_columns=False
    --run_name="sft_deepseek_70"
    --report_to="wandb"
    --use_habana
    --use_lazy_mode

resources:
  gpu: gaudi2:8
```
To fine-tune DeepSeek-R1-Distill-Llama-70B on eight Gaudi 2 accelerators, you can partially offload parameters to CPU memory via the DeepSpeed configuration file. For more details, refer to parameter offloading.
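As a rough sketch, CPU parameter offloading is enabled through the `offload_param` block of a ZeRO stage 3 DeepSpeed config. This mirrors the stock DeepSpeed schema rather than the exact contents of `llama2_ds_zero3_config.json`:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

The `offload_param` block is what moves ZeRO-3-partitioned parameters to host memory, trading accelerator memory for host-device transfer overhead.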
## Applying a configuration
Once the configuration is ready, run `dstack apply -f <configuration file>`.
```shell
$ dstack apply -f examples/inference/vllm/.dstack.yml

 #  BACKEND  REGION  RESOURCES                        SPOT  PRICE
 1  ssh      remote  152xCPU, 1007GB, 8xGaudi2:96GB   yes   $0     idle

Submit a new run? [y/n]: y

Provisioning...
---> 100%
```
## Source code
The source code for this example can be found in `examples/llms/deepseek/tgi/intel`, `examples/llms/deepseek/vllm/intel`, and `examples/llms/deepseek/trl/intel`.