This article walks through a production-grade deployment path for DeepSeek V4: using vLLM as the inference engine, Docker for containerization, Kubernetes for orchestration, and Prometheus for monitoring to build a stable service. It addresses common LLM deployment pain points such as slow startup, insufficient shared memory, and coarse-grained autoscaling.
Technical Specifications at a Glance
| Parameter | Details |
|---|---|
| Model | DeepSeek V4 |
| Inference Engine | vLLM 0.8.4 |
| Languages | Python 3.11, YAML, Bash |
| Container Base | CUDA 12.4.1 + Ubuntu 22.04 |
| Orchestration Platform | Kubernetes v1.29 |
| GPU Environment | 4 × NVIDIA A100 80GB |
| Storage Protocol | NFS (ReadOnlyMany) |
| Monitoring Protocol | Prometheus Metrics |
| Core Dependencies | vllm, prometheus-client, NVIDIA GPU Operator |
| Reference Popularity | Not applicable; this is a production deployment guide with no open-source repository star count to cite |
This Architecture Targets Production LLM Deployments
Getting DeepSeek V4 to run is not difficult. Keeping it stable over time is the real challenge. In production, container images, GPU scheduling, model weight mounts, health checks, and monitoring alerts must form a closed operational loop. Otherwise, the service can become unstable under high concurrency or during restarts.
The original approach splits the system into three layers: image build, Kubernetes orchestration, and observability. This layered design works well for enterprise internal GPU clusters and also leaves room for later integration with canary releases, gateway authentication, and cost accounting.
You Should Remember These Core Deployment Takeaways First
| Layer | Approach | Core Tooling | Risk Area |
|---|---|---|---|
| Model Inference | vLLM as a service | vllm serve / API Server | CUDA compatibility |
| Containerization | Multi-stage build | Dockerfile | Image size and dependency drift |
| Orchestration and Scheduling | Deployment + HPA | Kubernetes | GPU resources and probe timeouts |
| Monitoring and Alerting | Prometheus + Grafana | /metrics | Incomplete metrics |
| Log Collection | Loki + Promtail | Structured logs | Incomplete troubleshooting chain |
# Verify that Kubernetes correctly detects GPU resources
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Verify that the GPU plugin is healthy
kubectl get pods -n gpu-operator | grep nvidia-device-plugin
Use these commands before deployment to confirm that the GPU Operator and device plugin are healthy.
Container Images Must Prioritize the CUDA and vLLM Compatibility Matrix
In this setup, CUDA 12.4 is more stable than 12.6: some vLLM kernels may fail to compile against newer CUDA versions, so the image can build successfully yet still fail at runtime.
The value of a multi-stage Dockerfile is that it separates dependency installation from the runtime environment. This reduces image contamination and makes later image security scanning and cache reuse easier.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# Install Python 3.11 and base utilities
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 python3.11-venv python3-pip curl wget git \
&& rm -rf /var/lib/apt/lists/*
# Standardize the python command version to avoid script ambiguity
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
&& update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
FROM base AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt # Install inference and monitoring dependencies
FROM base AS runtime
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/dist-packages /usr/local/lib/python3.11/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY healthcheck.sh /app/healthcheck.sh
RUN chmod +x /app/healthcheck.sh
EXPOSE 8000
EXPOSE 9090
HEALTHCHECK --interval=30s --timeout=10s --retries=3 CMD /app/healthcheck.sh
This Dockerfile establishes a reusable runtime baseline for large-model inference.
Minimal Dependencies and Health Checks Should Stay Simple
vllm==0.8.4
prometheus-client==0.21.0
#!/bin/bash
# Check the service health endpoint and return a non-zero status on failure
curl -sf http://localhost:8000/health || exit 1
Keep dependencies to a minimum, and let the health check verify only that the core service is alive; anything more just adds noise.
Kubernetes Orchestration Must Be Designed Around Model Loading and GPU Communication
For a large model like DeepSeek V4, the orchestration focus is not simply whether it can start. The real questions are how long it takes to become Ready, how to avoid killing it prematurely, and how to align GPU resources correctly. At minimum, the deployment manifests should cover the namespace, model storage volume, Deployment, Service, and HPA.
Model weights are mounted from NFS in read-only mode, which avoids redistributing the 140 GB weight files to every replica. For the full BF16 version, this is a key technique for controlling delivery speed and disk usage.
apiVersion: v1
kind: PersistentVolume
metadata:
name: deepseek-v4-weights
spec:
capacity:
storage: 200Gi
accessModes:
- ReadOnlyMany
nfs:
server: 10.0.1.50
path: /exports/models/deepseek-v4
persistentVolumeReclaimPolicy: Retain
This configuration exposes the model weights to multiple inference Pods through a shared read-only volume.
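The Deployment in the next section consumes this volume through a PersistentVolumeClaim, which is not shown in the original manifests. Below is a minimal sketch, assuming the claim name deepseek-v4-weights-pvc and static binding to the PersistentVolume above; adjust the name and namespace to your environment.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-v4-weights-pvc # assumed name; must match the claim referenced by the Deployment
  namespace: llm-serving
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: "" # empty class forces static binding to the pre-created PV
  volumeName: deepseek-v4-weights # bind explicitly to the NFS PersistentVolume above
  resources:
    requests:
      storage: 200Gi
Because the claim is ReadOnlyMany, every replica mounts the same weights concurrently instead of copying them.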
Deployment Probes and Shared Memory Determine Stability
What really determines service stability is startupProbe, readinessProbe, and /dev/shm. Large models often need 2 to 5 minutes to load. If you reuse probe values from a typical web service, Kubelet may restart the container before the model finishes initialization.
In addition, vLLM tensor parallelism relies on shared memory for GPU-to-GPU communication. If the default /dev/shm is too small, it can trigger CrashLoopBackOff, which is one of the most common production issues in practice.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-v4
  template:
    metadata:
      labels:
        app: deepseek-v4
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      containers:
        - name: vllm
          image: your-registry.com/deepseek-v4-vllm:latest # pin an immutable tag in production instead of latest
          # The image defines no ENTRYPOINT; this launch command is an example, adjust the model path and flags to your environment
          command: ["vllm", "serve", "/models/deepseek-v4"]
          args: ["--tensor-parallel-size", "2", "--port", "8000"]
          ports:
            - containerPort: 8000 # OpenAI-compatible API and /metrics
          resources:
            requests:
              cpu: "8"
              memory: "64Gi"
              nvidia.com/gpu: "2"
            limits:
              cpu: "16"
              memory: "128Gi"
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-weights
              mountPath: /models/deepseek-v4
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30 # Leave enough time for model loading
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 3 # Take the Pod out of rotation quickly if it stops responding
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: deepseek-v4-weights-pvc # claim bound to the NFS PersistentVolume above
            readOnly: true
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi # Expand shared memory to avoid TP communication crashes
This configuration addresses the two most common large-model issues: slow startup and insufficient shared memory.
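The manifest list for this layer also calls for a Service, and the routing example later in this article targets deepseek-v4-svc.llm-serving:8000. A minimal ClusterIP sketch, assuming the app: deepseek-v4 label applied to the Pod template above:
apiVersion: v1
kind: Service
metadata:
  name: deepseek-v4-svc
  namespace: llm-serving
spec:
  type: ClusterIP
  selector:
    app: deepseek-v4 # must match the Pod template labels in the Deployment
  ports:
    - name: http
      port: 8000 # OpenAI-compatible API and /metrics endpoint
      targetPort: 8000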
Autoscaling Should Track Real Inference Load Instead of CPU Utilization
For LLM services, CPU and memory do not accurately represent load. A better approach is to use metrics exposed by vLLM, such as the number of running requests and waiting queue length, and let HPA scale on actual business load.
This setup uses vllm_num_requests_running as the primary metric. When concurrent requests per Pod exceed the threshold, it scales out. Scale-in should use a longer stabilization window to avoid excessive GPU resource churn.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-v4-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-v4
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "20" # Scale based on real per-Pod concurrency
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600 # Example value; a longer window avoids GPU resource churn on scale-in
This HPA configuration aligns autoscaling more closely with the actual pressure curve of the model service.
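Note that the HPA cannot read vllm_num_requests_running on its own; the metric has to be surfaced through the custom metrics API, typically via Prometheus Adapter or KEDA. The following is a minimal Prometheus Adapter rule sketch, assuming the adapter is already installed and pointed at the Prometheus instance configured below:
rules:
  custom:
    - seriesQuery: 'vllm_num_requests_running{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
With a rule like this in place, the Pods metric above resolves to the per-Pod running request count that vLLM reports.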
The Monitoring Stack Must Cover Latency, Throughput, Queue Depth, and KV Cache
vLLM exposes a built-in /metrics endpoint, which provides a solid foundation for observability. Prometheus should scrape only vllm_.* metrics to reduce irrelevant noise and lower storage pressure.
The most important question is not whether monitoring is connected, but whether it watches the right metrics. In large-model services, waiting queue depth, P99 latency, time to first token, and KV cache utilization reflect service health much more accurately than CPU utilization.
scrape_configs:
  - job_name: 'deepseek-v4-vllm'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      # Scrape only Pods that set the prometheus.io/scrape annotation used in the Deployment
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'vllm_.*'
        action: keep # Keep only vLLM-related metrics
This scrape configuration keeps Prometheus focused on the core telemetry of the inference service.
Prioritize These Production Metrics First
| Metric | Meaning | Recommended Threshold |
|---|---|---|
| vllm_num_requests_running | Number of requests currently being processed | Alert when > 50 |
| vllm_num_requests_waiting | Number of queued requests | Alert when > 20 |
| vllm_gpu_cache_usage_perc | KV cache utilization | Alert when > 95% |
| vllm_avg_generation_throughput_toks_per_s | Generation throughput | Alert when < 100 |
| vllm_e2e_request_latency_seconds | End-to-end latency | P99 > 30s |
| vllm_time_to_first_token_seconds | Time to first token | P99 > 5s |
groups:
- name: deepseek-v4-alerts
rules:
- alert: HighRequestQueueDepth
expr: vllm_num_requests_waiting{job="deepseek-v4-vllm"} > 20
for: 2m
- alert: GPUCacheNearlyFull
expr: vllm_gpu_cache_usage_perc{job="deepseek-v4-vllm"} > 0.95
for: 1m
- alert: HighP99Latency
expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 30
for: 3m
These alert rules cover three core risk categories: queue buildup, cache saturation, and high latency.
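The metrics table above also sets thresholds for time to first token and generation throughput, which these three rules do not cover. A hedged sketch of two additional rules under the same groups list, assuming vllm_time_to_first_token_seconds is exposed as a histogram like the end-to-end latency metric:
  - name: deepseek-v4-latency-extras
    rules:
      - alert: HighTimeToFirstToken
        expr: histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket{job="deepseek-v4-vllm"}[5m])) > 5
        for: 3m
      - alert: LowGenerationThroughput
        expr: vllm_avg_generation_throughput_toks_per_s{job="deepseek-v4-vllm"} < 100
        for: 5m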
Most High-Frequency Production Failures Cluster Around Four Issues
First, CPU or memory limits are set too low, which leads to OOMKilled during model loading. Second, the startupProbe timeout is too short, so the Pod keeps restarting before it reaches Ready. Third, tensor parallel communication crosses NUMA boundaries or nodes, which sharply degrades NCCL performance. Fourth, /dev/shm is too small, which directly crashes the inference process.
What these issues have in common is that they are not code bugs. They are resource modeling mistakes. That is why deployment documentation must encode these constraints explicitly in YAML instead of relying on tribal knowledge.
Hybrid Routing Is a Practical Strategy for Balancing Cost and Model Capability
When the local cluster handles frequent internal workloads and closed-source cloud models handle low-frequency but complex workloads, you can route requests through a unified OpenAI-compatible interface. This approach protects your GPU investment while preserving access to stronger model capabilities when needed.
from openai import OpenAI
# Local service: handles frequent internal workloads
local_client = OpenAI(
api_key="not-needed",
base_url="http://deepseek-v4-svc.llm-serving:8000/v1"
)
# Cloud service: handles tasks that require stronger model capability
cloud_client = OpenAI(
api_key="your-ofox-key",
base_url="https://api.ofox.ai/v1"
)
def smart_route(prompt: str, task_type: str = "general"):
"""Route requests to the local or cloud model based on task type"""
if task_type in ("code_review", "doc_gen"):
return local_client.chat.completions.create(
model="deepseek-v4",
messages=[{"role": "user", "content": prompt}],
max_tokens=4096,
stream=True
) # Route frequent tasks to the local service first to reduce long-term cost
return cloud_client.chat.completions.create(
model="gpt-5.5",
messages=[{"role": "user", "content": prompt}],
max_tokens=4096,
stream=True
) # Route complex tasks to the cloud for stronger generalization
This code shows a basic implementation pattern for combining local private deployment with a cloud API.
Engineering Stability Matters More Than Isolated Peak Performance
For enterprise adoption of DeepSeek V4, optimizing a single parameter is not enough. The effective approach is to combine version pinning, probe tuning, shared memory sizing, metrics collection, and business-aware autoscaling.
If your team has stable GPU capacity, self-hosting can deliver lower marginal cost and stronger data control. If request volume fluctuates heavily or you need to switch among multiple models, a hybrid approach is usually more practical.
FAQ
1. Why does DeepSeek V4 in Kubernetes often restart shortly after startup?
The most common reason is that the startupProbe timeout is too short. Large models usually need several minutes to load. If the probe is too aggressive, Kubelet marks the container as failed and restarts it before the service is actually ready.
2. Why can a vLLM container still enter CrashLoopBackOff even when a GPU is available?
A very common cause is that the default /dev/shm is too small. Tensor parallelism relies on shared memory for inter-GPU communication. If you do not enlarge shared memory with emptyDir, the process can exit unexpectedly.
3. Why is CPU utilization not recommended as the primary HPA metric for LLM services?
Because CPU usage does not accurately represent inference pressure. For large models, the number of running requests, queue depth, TTFT, and KV cache utilization are much closer to the real bottlenecks and can significantly improve autoscaling decisions.