[AI Readability Summary]
A Kubernetes application appeared to hang, but the root cause was not the application itself. Calico mistakenly identified the node address as the nerdctl0 bridge IP, which broke VXLAN forwarding between nodes and caused TCP connection timeouts to the target Pod. This article walks through the failure path, packet capture evidence, root cause, and the correct fix.
Technical Specifications Snapshot
| Parameter | Details |
|---|---|
| Scenario Type | Kubernetes cluster network troubleshooting |
| Core Components | Calico, calico-node, Tigera Operator |
| Network Protocols | TCP/IP, VXLAN |
| Language/Environment | Shell, Linux, kubectl |
| Symptom | Pod IP reachability issues, TCP connection timeout |
| Core Dependencies | Calico CNI, containerd/nerdctl, tcpdump |
The failure first appeared as a hanging application request rather than a direct error
The entry point was typical: requests to the demo service stalled for a long time, and the target Pod produced no new logs. That strongly suggested the request never reached the application process and was blocked somewhere lower in the network path.
kubectl get pods -n deepflow-otel-spring-demo -o wide
curl 10.244.228.139:8090/shop/full-test
# Returns an error after waiting for a long time
# curl: (28) Failed to connect to 10.244.228.139 port 8090
These commands help confirm Pod placement and reproduce the TCP connection timeout.
Empty logs are a key signal
If kubectl logs -f shows no new request logs, you can usually rule out blocked business threads, endpoint exceptions, and application-layer 5xx errors first. If even the entry logs are missing, the fault is more likely in the Service, routing, or CNI forwarding layer.
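A quick way to check this in practice is to tail the Pod's logs while reproducing the request; the Pod name below is a placeholder for the demo Pod, so substitute your own.
kubectl logs -f <pod-name> -n deepflow-otel-spring-demo --since=2m
# No new log lines while the curl is hanging means the request never reached the process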
Packet capture showed the traffic went directly into the Calico data plane
A packet capture on the client showed that traffic targeting the Pod IP went directly into Calico interface forwarding rather than through a traditional Layer 4 proxy path. At that point, the investigation should shift to inter-node tunnels, routing, and node address detection.
tcpdump -v -i any dst 10.244.228.139 -w pod.pcap
ip a s vxlan.calico
These commands verify whether the target traffic enters the VXLAN tunnel and inspect the local vxlan.calico interface state.
The vxlan.calico interface information confirmed that the cluster was using Calico VXLAN. Since the target Pod was located on ce-demo-2 and the request failed during connection establishment, the calico-node status on that node became the next high-priority checkpoint.
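To make that shift concrete, a small follow-up sketch is to read the capture back and check which device carries the Pod subnet; the capture file name comes from the tcpdump command above, and the exact route output depends on Calico's block allocation.
tcpdump -nn -r pod.pcap 'tcp[tcpflags] & tcp-syn != 0'
# Repeated SYNs with no SYN-ACK mean the handshake never completes
ip route | grep vxlan.calico
# The Pod subnet hosted on ce-demo-2 should be routed via vxlan.calico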
The unhealthy calico-node on the affected node already exposed a direct signal
After checking the calico-system namespace, the calico-node Pod on ce-demo-2 showed 0/1 Running. This state is risky: the Pod process may exist, but the core readiness probe has not passed, so the data plane is not trustworthy.
kubectl get pods -n calico-system -o wide
kubectl get node -o yaml | grep 'IPv4Address'
These commands cross-check Calico node component status against the currently registered IPv4Address for each node.
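For the kubelet's view of why readiness is failing, describing the unhealthy Pod surfaces the probe events; the Pod name calico-node-sbxrk is the one referenced later in this article.
kubectl describe pod -n calico-system calico-node-sbxrk | grep -iA3 'readiness\|unhealthy'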
The most important anomaly was that one of the three nodes had its address recorded as 10.4.0.1/24 instead of the service IP on the physical NIC ens160. That indicated Calico had likely chosen the wrong primary host address.
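One way to see that mismatch per node at a glance is to print the projectcalico.org/IPv4Address annotation that calico-node writes onto each Node object:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.projectcalico\.org/IPv4Address}{"\n"}{end}'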
The root cause was that first-found autodetection selected the nerdctl0 bridge
A closer inspection of nerdctl0 on ce-demo-2 showed that it was a bridge device with the address 10.4.0.1/24. That interface is valid for the container runtime, but it is not the expected host egress address for Calico node identification.
ip address show nerdctl0
ip -d link show nerdctl0
brctl show nerdctl0
These commands confirm the interface type, address, and bridge properties of nerdctl0.
The autodetection mechanism explains why Calico picked the wrong IP
The current DaemonSet used IP_AUTODETECTION_METHOD=first-found. In this mode, Calico picks the first valid IP based on interface enumeration order. The documentation notes that local interfaces such as the Docker bridge are excluded, but nerdctl0 is not part of the default exclusion list, so it became a valid candidate.
kubectl describe daemonset calico-node -n calico-system | grep IP_AUTODETECTION_METHOD
kubectl logs -n calico-system calico-node-sbxrk -c calico-node | grep -i 'nerdctl0'
These commands confirm the autodetection strategy and identify the interface Calico actually selected from the logs.
The log explicitly showed "Using autodetected IPv4 address on interface nerdctl0: 10.4.0.1/24", which is the central piece of evidence in the complete root-cause chain.
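To see which addresses first-found has to choose from on a node, listing the IPv4 addresses in kernel enumeration order is usually enough; interface names and ordering will differ per host.
ip -4 -o addr show | awk '{print $2, $4}'
# The first entry that is not on the exclusion list is what first-found picks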
The correct fix depends on the Calico deployment model
If Calico had been deployed directly from manifests, updating the DaemonSet environment variable would be enough. In this case, however, Calico was installed through the Tigera Operator, which reconciles the DaemonSet and reverts manual edits, so the correct fix was to update the Installation custom resource instead.
kubectl get Installation
kubectl edit installation default
These commands locate the Operator-managed installation configuration entry point.
Replace the default firstFound: true with an explicit interface selection under spec.calicoNetwork in the Installation resource:
nodeAddressAutodetectionV4:
  interface: ens160 # Specify the service NIC to avoid selecting a bridge interface
This configuration forces Calico to autodetect the node IPv4 address on ens160 rather than on whichever interface happens to be enumerated first.
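If you prefer a one-shot change over kubectl edit, a merge-patch sketch like the following should work as well; the null entry removes firstFound so that only one detection method remains set.
kubectl patch installation default --type merge -p \
  '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"firstFound":null,"interface":"ens160"}}}}'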
If the environment is not Operator-based, you can instead set the following in calico-node:
- name: IP_AUTODETECTION_METHOD
  value: interface=ens160 # You can also use a regex such as interface=ens.*
This configuration constrains Calico’s node address selection logic through an environment variable.
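For a manifest-based install, the same change can be applied in place with kubectl set env; the namespace is typically kube-system for manifest installs, so adjust it if your deployment differs.
kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=ens160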
After the fix, validate both the control plane and the data plane
After the configuration update, calico-node returned to 1/1 Running on all three nodes, and the IP for ce-demo-2 reverted to 10.51.0.101. That confirmed node addressing and the network control plane were consistent again.
kubectl get pods -n calico-system -o wide | grep calico-node
kubectl describe daemonset -n calico-system calico-node | grep IP_AUTODETECTION_METHOD
curl 10.244.228.139:8090/shop/full-test
These commands verify component recovery, the propagated autodetection setting, and restoration of the application request.
The application request ultimately returned a successful JSON response, proving that the fix not only resolved the readiness issue but also restored real TCP connectivity to the target Pod.
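As an extra data-plane check, assuming shell access on ce-demo-2, the registered node address and the local endpoint of the VXLAN device should now both correspond to ens160.
kubectl get node ce-demo-2 -o jsonpath='{.metadata.annotations.projectcalico\.org/IPv4Address}'
ip -d link show vxlan.calico
# Run on ce-demo-2; the local address shown should be the ens160 IP, not 10.4.0.1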
This case shows that cluster network failures must be traced down through an evidence chain
On the surface, this looked like an application issue because the service appeared to hang. In reality, it was a VXLAN forwarding failure caused by incorrect CNI node address autodetection. The absence of application logs, the failed connection establishment in the packet capture, the anomalous node address, and the Calico log entry pointing to the wrong interface together formed a high-confidence diagnostic chain.
FAQ
Q1: Why were there no application logs if the application itself was not broken?
A1: Because the request failed during TCP connection establishment, the traffic never reached the Pod process. When there are no entry logs, you should first suspect the network, Service, CNI, or node forwarding path.
Q2: Why did first-found select nerdctl0?
A2: This mode selects the first valid IP according to interface enumeration order. Since nerdctl0 is not a Docker bridge and is not in the default exclusion list, Calico treated it as a valid candidate interface.
Q3: How can production environments prevent this issue from happening again?
A3: Do not rely on default autodetection. Explicitly configure nodeAddressAutodetectionV4.interface or IP_AUTODETECTION_METHOD=interface=..., and after any change, verify that the Node IPv4Address matches the physical NIC.
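A minimal verification sketch for this check, run on a node and assuming ens160 is the expected NIC and the hostname matches the Node name:
REGISTERED=$(kubectl get node "$(hostname)" -o jsonpath='{.metadata.annotations.projectcalico\.org/IPv4Address}')
ACTUAL=$(ip -4 -o addr show ens160 | awk '{print $4}')
echo "registered=${REGISTERED} actual=${ACTUAL}" # The two values should be identical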
Core takeaway: This article reconstructs a real Kubernetes request timeout incident in which Calico, running in first-found mode, mistakenly selected the nerdctl0 bridge IP. That error caused Pod network connection failures. The article provides the root cause, the evidence chain, the remediation steps, and practical troubleshooting commands and configuration guidance you can apply directly.