DeepSeek V4 Local Deployment and Production Monitoring: Dockerfile, Kubernetes, and Prometheus End-to-End Guide

This article walks through a production-grade deployment path for DeepSeek V4: vLLM as the inference engine, Docker for containerization, Kubernetes for orchestration, and Prometheus for monitoring, combined into a stable service. It addresses common LLM deployment pain points such as slow startup, insufficient shared memory, and coarse-grained autoscaling.

Technical Specifications at a Glance

Parameter | Details
Model | DeepSeek V4
Inference Engine | vLLM 0.8.4
Languages | Python 3.11, YAML, Bash
Container Base | CUDA 12.4.1 + Ubuntu 22.04
Orchestration Platform | Kubernetes v1.29
GPU Environment | 4 × NVIDIA A100 80GB
Storage Protocol | NFS (ReadOnlyMany)
Monitoring Protocol | Prometheus Metrics
Core Dependencies | vllm, prometheus-client, NVIDIA GPU Operator

This Architecture Targets Production LLM Deployments

Getting DeepSeek V4 to run is not difficult. Keeping it stable over time is the real challenge. In production, container images, GPU scheduling, model weight mounts, health checks, and monitoring alerts must form a closed operational loop. Otherwise, the service can become unstable under high concurrency or during restarts.

The original approach splits the system into three layers: image build, Kubernetes orchestration, and observability. This layered design works well for enterprise internal GPU clusters and also leaves room for later integration with canary releases, gateway authentication, and cost accounting.

You Should Remember These Core Deployment Takeaways First

Layer | Approach | Core Tooling | Risk Area
Model Inference | vLLM as a service | vllm serve / API Server | CUDA compatibility
Containerization | Multi-stage build | Dockerfile | Image size and dependency drift
Orchestration and Scheduling | Deployment + HPA | Kubernetes | GPU resources and probe timeouts
Monitoring and Alerting | Prometheus + Grafana | /metrics | Incomplete metrics
Log Collection | Loki + Promtail | Structured logs | Incomplete troubleshooting chain

# Verify that Kubernetes correctly detects GPU resources
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'

# Verify that the GPU plugin is healthy
kubectl get pods -n gpu-operator | grep nvidia-device-plugin

Use these commands before deployment to confirm that the GPU Operator and device plugin are healthy.

Container Images Must Prioritize the CUDA and vLLM Compatibility Matrix

In this setup, CUDA 12.4 is more stable than 12.6. The reason is that some vLLM kernels may fail to compile on newer CUDA versions, which can allow the image build to succeed but still cause runtime failures.

The value of a multi-stage Dockerfile is that it separates dependency installation from the runtime environment. This reduces image contamination and makes later image security scanning and cache reuse easier.

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install Python 3.11 and base utilities
RUN apt-get update && apt-get install -y --no-install-recommends \
  python3.11 python3.11-venv python3-pip curl wget git \
  && rm -rf /var/lib/apt/lists/*

# Standardize the python command version to avoid script ambiguity
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
  && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

FROM base AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt  # Install inference and monitoring dependencies

FROM base AS runtime
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/dist-packages /usr/local/lib/python3.11/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY healthcheck.sh /app/healthcheck.sh
RUN chmod +x /app/healthcheck.sh

EXPOSE 8000
EXPOSE 9090
HEALTHCHECK --interval=30s --timeout=10s --start-period=300s --retries=3 CMD /app/healthcheck.sh  # start-period keeps slow model loading from counting as failures

# Default start command for local container runs; the Kubernetes Deployment can override these args
CMD ["vllm", "serve", "/models/deepseek-v4", "--tensor-parallel-size", "2", "--port", "8000"]

This Dockerfile establishes a reusable runtime baseline for large-model inference.

Minimal Dependencies and Health Checks Should Stay Simple

requirements.txt:

vllm==0.8.4
prometheus-client==0.21.0

healthcheck.sh:

#!/bin/bash
# Check the service health endpoint and return a non-zero status on failure
curl -sf http://localhost:8000/health || exit 1

Keep dependencies to a minimum, and let the health check verify only whether the core service is alive so you avoid adding unnecessary noise.

Kubernetes Orchestration Must Be Designed Around Model Loading and GPU Communication

For a large model like DeepSeek V4, the orchestration focus is not simply whether it can start. The real questions are how long it takes to become Ready, how to avoid killing it prematurely, and how to align GPU resources correctly. At minimum, the deployment manifests should cover the namespace, model storage volume, Deployment, Service, and HPA.

Model weights are mounted from NFS in read-only mode, which avoids redistributing the 140 GB weight files to every replica. For the full BF16 version, this is a key technique for controlling delivery speed and disk usage.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: deepseek-v4-weights
spec:
  capacity:
    storage: 200Gi
  accessModes:
    - ReadOnlyMany
  nfs:
    server: 10.0.1.50
    path: /exports/models/deepseek-v4
  persistentVolumeReclaimPolicy: Retain

This configuration exposes the model weights to multiple inference Pods through a shared read-only volume.
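The Pods consume the PV through a matching PersistentVolumeClaim. A minimal sketch, assuming the claim lives in the llm-serving namespace (the names here are illustrative and should match whatever the Deployment references):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-v4-weights
  namespace: llm-serving
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""              # Empty class forces static binding to the pre-created PV
  volumeName: deepseek-v4-weights   # Bind directly to the PV defined above
  resources:
    requests:
      storage: 200Gi
```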

Deployment Probes and Shared Memory Determine Stability

What really determines service stability is startupProbe, readinessProbe, and /dev/shm. Large models often need 2 to 5 minutes to load. If you reuse probe values from a typical web service, Kubelet may restart the container before the model finishes initialization.

In addition, vLLM tensor parallelism relies on shared memory for GPU-to-GPU communication. If the default /dev/shm is too small, it can trigger CrashLoopBackOff, which is one of the most common production issues in practice.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-v4  # selector is required and must match the template labels below
  template:
    metadata:
      labels:
        app: deepseek-v4
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
      containers:
        - name: vllm
          image: your-registry.com/deepseek-v4-vllm:latest
          resources:
            requests:
              cpu: "8"
              memory: "64Gi"
              nvidia.com/gpu: "2"
            limits:
              cpu: "16"
              memory: "128Gi"
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-weights
              mountPath: /models/deepseek-v4
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
            failureThreshold: 30  # Total budget of 60s + 30 x 10s = 360s for model loading
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10  # Takes over traffic gating once startup succeeds
      volumes:
        - name: model-weights
          persistentVolumeClaim:
            claimName: deepseek-v4-weights  # Claim name is illustrative; bind it to the NFS PV above
            readOnly: true
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi  # Expand shared memory to avoid TP communication crashes

This configuration addresses the two most common large-model issues: slow startup and insufficient shared memory.

Autoscaling Should Track Real Inference Load Instead of CPU Utilization

For LLM services, CPU and memory do not accurately represent load. A better approach is to use metrics exposed by vLLM, such as the number of running requests and waiting queue length, and let HPA scale on actual business load.

This setup uses vllm_num_requests_running as the primary metric. When concurrent requests per Pod exceed the threshold, it scales out. Scale-in should use a longer stabilization window to avoid excessive GPU resource churn.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepseek-v4-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-v4  # scaleTargetRef is required; point it at the Deployment above
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running
        target:
          type: AverageValue
          averageValue: "20"  # Scale based on real per-Pod concurrency
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # Longer scale-in window to avoid GPU resource churn

This HPA configuration aligns autoscaling more closely with the actual pressure curve of the model service.
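One caveat the manifest glosses over: a Pods-type metric only reaches the HPA through the custom metrics API, which in a Prometheus setup usually means installing prometheus-adapter. A sketch of the corresponding adapter rule, assuming prometheus-adapter is deployed; this rule is an addition for illustration, not part of the original setup:

```yaml
rules:
  - seriesQuery: 'vllm_num_requests_running{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    # Gauge metric, so expose the current average rather than a rate
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Without a rule like this, the HPA reports "unable to fetch metrics" and never scales.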

The Monitoring Stack Must Cover Latency, Throughput, Queue Depth, and KV Cache

vLLM exposes a built-in /metrics endpoint, which provides a solid foundation for observability. Prometheus should scrape only vllm_.* metrics to reduce irrelevant noise and lower storage pressure.

The most important question is not whether monitoring is connected, but whether it watches the right metrics. In large-model services, waiting queue depth, P99 latency, time to first token, and KV cache utilization reflect service health much more accurately than CPU utilization.

scrape_configs:
  - job_name: 'deepseek-v4-vllm'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - llm-serving
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: "true"
        action: keep  # Honor the prometheus.io/scrape annotation set on the Deployment
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'vllm_.*'
        action: keep  # Keep only vLLM-related metrics

This scrape configuration keeps Prometheus focused on the core telemetry of the inference service.

Prioritize These Production Metrics First

Metric | Meaning | Recommended Threshold
vllm_num_requests_running | Number of requests currently being processed | Alert when > 50
vllm_num_requests_waiting | Number of queued requests | Alert when > 20
vllm_gpu_cache_usage_perc | KV cache utilization | Alert when > 95%
vllm_avg_generation_throughput_toks_per_s | Generation throughput | Alert when < 100 tok/s
vllm_e2e_request_latency_seconds | End-to-end latency | Alert when P99 > 30s
vllm_time_to_first_token_seconds | Time to first token | Alert when P99 > 5s

groups:
  - name: deepseek-v4-alerts
    rules:
      - alert: HighRequestQueueDepth
        expr: vllm_num_requests_waiting{job="deepseek-v4-vllm"} > 20
        for: 2m
      - alert: GPUCacheNearlyFull
        expr: vllm_gpu_cache_usage_perc{job="deepseek-v4-vllm"} > 0.95
        for: 1m
      - alert: HighP99Latency
        expr: histogram_quantile(0.99, rate(vllm_e2e_request_latency_seconds_bucket[5m])) > 30
        for: 3m

These alert rules cover three core risk categories: queue buildup, cache saturation, and high latency.
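The TTFT and throughput thresholds from the table can be enforced the same way. The two rules below are a sketch to append to the rule group above; the `_bucket` suffix assumes vLLM exports TTFT as a histogram, matching the end-to-end latency rule:

```yaml
      - alert: SlowTimeToFirstToken
        expr: histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m])) > 5
        for: 3m
      - alert: LowGenerationThroughput
        expr: vllm_avg_generation_throughput_toks_per_s{job="deepseek-v4-vllm"} < 100
        for: 5m
```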

Most High-Frequency Production Failures Cluster Around Four Issues

First, CPU or memory limits are set too low, which leads to OOMKilled during model loading. Second, the startupProbe timeout is too short, so the Pod keeps restarting before it reaches Ready. Third, tensor parallel communication crosses NUMA boundaries or nodes, which sharply degrades NCCL performance. Fourth, /dev/shm is too small, which directly crashes the inference process.

What these issues have in common is that they are not code bugs. They are resource modeling mistakes. That is why deployment documentation must encode these constraints explicitly in YAML instead of relying on tribal knowledge.
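Because these failures are resource-modeling mistakes, it helps to do the arithmetic before writing YAML. A back-of-envelope sketch of KV cache capacity, with purely illustrative model dimensions (not DeepSeek V4's actual config):

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_mem_gib: float, weight_gib: float,
                      per_token_bytes: int, overhead_gib: float = 4.0) -> int:
    """Tokens the KV cache can hold after weights and runtime overhead."""
    free = (gpu_mem_gib - weight_gib - overhead_gib) * 1024**3
    return int(free // per_token_bytes)


# Illustrative numbers: 2 x A100 80GB per Pod, 140 GiB of BF16 weights split across both
per_tok = kv_cache_bytes_per_token(layers=60, kv_heads=8, head_dim=128)
tokens = max_cached_tokens(gpu_mem_gib=2 * 80, weight_gib=140, per_token_bytes=per_tok)
print(per_tok, tokens)  # bytes per token, then total cacheable tokens
```

Dividing the cacheable tokens by your average context length gives a ceiling on concurrent requests, which is exactly the number the HPA threshold and the cache-saturation alert should be derived from.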

Hybrid Routing Is a Practical Strategy for Balancing Cost and Model Capability

When the local cluster handles frequent internal workloads and closed-source cloud models handle low-frequency but complex workloads, you can route requests through a unified OpenAI-compatible interface. This approach protects your GPU investment while preserving access to stronger model capabilities when needed.

from openai import OpenAI

# Local service: handles frequent internal workloads
local_client = OpenAI(
    api_key="not-needed",
    base_url="http://deepseek-v4-svc.llm-serving:8000/v1"
)

# Cloud service: handles tasks that require stronger model capability
cloud_client = OpenAI(
    api_key="your-ofox-key",
    base_url="https://api.ofox.ai/v1"
)

def smart_route(prompt: str, task_type: str = "general"):
    """Route requests to the local or cloud model based on task type"""
    if task_type in ("code_review", "doc_gen"):
        return local_client.chat.completions.create(
            model="deepseek-v4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=4096,
            stream=True
        )  # Route frequent tasks to the local service first to reduce long-term cost
    return cloud_client.chat.completions.create(
        model="gpt-5.5",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4096,
        stream=True
    )  # Route complex tasks to the cloud for stronger generalization

This code shows a basic implementation pattern for combining local private deployment with a cloud API.

Engineering Stability Matters More Than Isolated Peak Performance

For enterprise adoption of DeepSeek V4, optimizing a single parameter is not enough. The effective approach is to combine version pinning, probe tuning, shared memory sizing, metrics collection, and business-aware autoscaling.

If your team has stable GPU capacity, self-hosting can deliver lower marginal cost and stronger data control. If request volume fluctuates heavily or you need to switch among multiple models, a hybrid approach is usually more practical.
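The break-even reasoning above can be made concrete. The rates below are purely illustrative placeholders, not measured figures:

```python
def self_host_usd_per_1k_tokens(gpu_hour_usd: float, gpus: int,
                                tokens_per_second: float) -> float:
    """Marginal cost per 1K generated tokens for a fully utilized self-hosted node."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * gpus / tokens_per_hour * 1000


# Illustrative: 4 GPUs at $2.50/GPU-hour, sustained 1500 tok/s across the node
cost = self_host_usd_per_1k_tokens(2.50, 4, 1500)
print(round(cost, 4))
```

The key sensitivity is utilization: at 10% sustained load the same node costs ten times as much per token, which is precisely why fluctuating request volume pushes the calculation toward the hybrid approach.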

FAQ

1. Why does DeepSeek V4 in Kubernetes often restart shortly after startup?

The most common reason is that the startupProbe timeout is too short. Large models usually need several minutes to load. If the probe is too aggressive, Kubelet marks the container as failed and restarts it before the service is actually ready.

2. Why can a vLLM container still enter CrashLoopBackOff even when a GPU is available?

A very common cause is that the default /dev/shm is too small. Tensor parallelism relies on shared memory for inter-GPU communication. If you do not enlarge shared memory with emptyDir, the process can exit unexpectedly.

3. Why is CPU utilization not recommended as the primary HPA metric for LLM services?

Because CPU usage does not accurately represent inference pressure. For large models, the number of running requests, queue depth, TTFT, and KV cache utilization are much closer to the real bottlenecks and can significantly improve autoscaling decisions.

Summary

This article reconstructs a production-ready DeepSeek V4 deployment architecture for GPU clusters. It covers Dockerfile-based containerization, Kubernetes orchestration, HPA autoscaling, Prometheus/Grafana monitoring, and common troubleshooting patterns. It is a strong fit for teams that need to run large-model inference services reliably in production.