DolphinDB Time-Series Operations Guide: Troubleshooting 5 High-Frequency Failures in Real-Time Pipelines

This article focuses on DolphinDB operations for real-time time-series pipelines and systematically breaks down five failure points most likely to cause production incidents: write interruptions, subscription backlog, memory overflow, disk I/O stalls, and duplicate consumption in primary-secondary setups. The core issues usually come down to parameter mismatches and architectural design flaws. Keywords: DolphinDB, time-series pipeline, operations troubleshooting.

Technical Specifications Snapshot

Tech Stack: DolphinDB time-series database / stream computing
Primary Languages: DolphinDB Script, SQL-style scripting
Key Protocols: TCP, stream subscription mechanism
Scenario Types: IoT, monitoring, water utilities, real-time factor computation
Core Dependencies: Stream tables, subscription mechanism, persistent storage, shared metadata tables

These Issues Are Fundamentally Stream Systems Engineering Problems

The original case highlights one critical fact: DolphinDB itself is usually not the primary bottleneck. Production incidents are more often caused by pipeline configuration, consumption models, and state management patterns. In real-time time-series scenarios in particular, habits carried over from offline processing or generic middleware often lead teams to the wrong parameter choices.

Figure: system pipeline diagram showing the continuous path from data collection through stream ingestion and subscription consumption to downstream analytics. It illustrates why fault diagnosis should proceed layer by layer, from ingress connectivity to stream table state, subscription lag, and finally the storage layer.

Therefore, you should not troubleshoot this kind of system by staring at isolated error messages. Instead, build a standardized diagnostic path around five layers: ingress connectivity, stream table buffering, subscription consumption, persistence strategy, and high-availability state.

It is best to standardize one troubleshooting sequence first

# 1. Check network and connection errors first
grep -E "timeout|failed|disconnect" dolphindb.log

# 2. Then check whether subscriptions are backlogged
# Primary goal: verify whether production speed exceeds consumption speed

# 3. Check memory usage and retained stream table data
# Primary goal: verify whether expired state has not been reclaimed

# 4. Check disk I/O and flush mode
iostat -x 1 10

# 5. Finally verify offset consistency between primary and secondary
# Primary goal: prevent duplicate consumption after failover

This sequence helps turn “intermittent failures” into observable metric-driven problems much faster.
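
Steps 2 and 3 above are stated as goals rather than commands. On the DolphinDB side they map to built-in status functions; the sketch below is a minimal starting point, and the exact columns returned (queueDepth, processedMsgCount, and so on) may differ slightly between versions.

// Subscription backlog: per-worker queue depth and processed message counts
stat = getStreamingStat()
select topic, queueDepth, processedMsgCount, lastErrMsg from stat.subWorkers

// Memory pressure: per-session memory usage on the current node
getSessionMemoryStat()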

Real-Time Write Interruptions Are Usually Triggered by Connection Parameter Mismatches

In the original case, write interruptions happened during off-peak hours late at night. That is already a strong signal: if the interruption is not driven by peak traffic, then timeout settings, retry strategy, or interference from background inspection jobs should be your first suspects.

During troubleshooting, first confirm whether the fault occurs across nodes or across branch sites at the same time. If the interruption is global, the issue is usually not an isolated write thread on one machine. It is more likely a uniform degradation at the ingress connection layer. Log entries such as timeout and write failed are direct evidence.

Fixing write interruptions requires changing three configuration areas at once

# Pseudocode: align timeout strategies between collectors and the server
server_tcp_timeout = 60   # Extend server timeout to avoid false disconnects during short jitter
client_retry_interval = 10 # Retry every 10 seconds on the collector side to avoid reconnect storms
client_retry_times = 5     # Increase retry attempts to improve recovery probability
pause_inspection = "02:00-04:00"  # Avoid interference from scheduled inspections during early morning hours

These adjustments align connection recovery cadence with the server-side tolerance window and reduce repeated disconnect loops.

The Root Cause of Subscription Stalls Is Often Serial Consumption, Not Insufficient Compute

When teams see consumer lag surge, they often assume the compute logic is too heavy. But the original case shows that the execution time of each individual rule was actually low. The real problem was that multiple rules were pushed into the same subscription channel, which created a default single-threaded queue.

In DolphinDB’s stream subscription model, the way you organize consumption matters more than the runtime of a single calculation. Even a newly added rule that “does not look expensive” can create sustained latency if it is stacked into the same serial subscription queue.

Subscription splitting should be designed around priority and resource isolation

// Split high-priority alerts and low-priority aggregation into different subscriptions
// Core principle: do not let critical alerting tasks share one queue with lower-value computations
subscribeTable(tableName=`tick_stream, actionName=`alert_consumer, handler=processAlert)
subscribeTable(tableName=`tick_stream, actionName=`log_consumer, handler=processLog)

The essence of this split is to reduce head-of-line blocking so that critical paths get consumed first.
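
Splitting helps only if the two subscriptions do not end up on the same executor thread. A hedged sketch: assuming the node configuration sets subExecutors to at least 2, the following repeats the two subscriptions above with the hash parameter added, which pins each one to a specific subscription executor so the alert path cannot queue behind the log path.

// With subExecutors >= 2 configured on the node, hash chooses which executor thread serves each subscription
subscribeTable(tableName=`tick_stream, actionName=`alert_consumer, handler=processAlert, hash=0)
subscribeTable(tableName=`tick_stream, actionName=`log_consumer, handler=processLog, hash=1)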

OOM on Compute Nodes Usually Comes from Unbounded State Growth

The OOM issue in the case was not caused by a sudden large query. It came from long-term accumulation of sliding-window state. Keeping stream tables entirely in memory, without persistence or retention policies, may look simple, but it allows expired data to keep consuming memory indefinitely.

For scenarios with 24-hour windows, tens of thousands of measurement points, and long-running processes, state lifecycle management is just as important as business logic. If state cannot age out naturally, the system will eventually trigger GC jitter or even restart during some off-peak window.

Memory governance must implement persistence and cleanup strategies together

// Configure a persistent stream table and a state cleanup strategy
// Core principle: make historical data evictable and the state space shrinkable
// (requires persistenceDir to be set in the node configuration; the schema here is illustrative)
st = streamTable(10000:0, `ts`deviceId`value, [TIMESTAMP, SYMBOL, DOUBLE])
enableTableShareAndPersistence(table=st, tableName="sensor_stream", cacheSize=2000000, retentionMinutes=10080)
// For keyed window engines, set garbageSize (and, where supported, keyPurgeFreqInSecond)
// at engine creation so expired key state is purged instead of accumulating

This type of configuration changes memory usage from continuous growth to controlled fluctuation.
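
To make "controlled fluctuation" concrete, here is a minimal sketch of a keyed 24-hour sliding aggregation whose expired per-key state is purged periodically. It assumes sensor_stream is the shared, persisted stream table from the previous snippet with columns ts, deviceId, and value; the output schema and every threshold are illustrative assumptions, not the original case's configuration.

// Output table for 1-minute averages over a 24h sliding window (schema is illustrative)
dailyOut = table(10000:0, `ts`deviceId`avgValue, [TIMESTAMP, SYMBOL, DOUBLE])

// garbageSize bounds how much expired per-key history the engine keeps before purging it
engine = createTimeSeriesEngine(name="dailyAvg", windowSize=86400000, step=60000,
    metrics=<[avg(value)]>, dummyTable=sensor_stream, outputTable=dailyOut,
    timeColumn=`ts, keyColumn=`deviceId, garbageSize=50000)

// Feed the engine from the persisted stream table
subscribeTable(tableName="sensor_stream", actionName="dailyAvg", handler=append!{engine}, msgAsTable=true)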

Disk I/O Stalls Are Fundamentally a Mismatch Between Flush Strategy and Business Tolerance

Synchronous flush is appropriate for extreme reliability requirements, but it is usually a poor fit for high-frequency sensor ingestion. In the original case, the issue was not a sudden drop in disk performance. It was the combination of business peak traffic and a “flush every single record” strategy that saturated I/O.

For that reason, flush strategy should never be configured in isolation. First define how much data loss the business can actually tolerate. If sensors can buffer and resend data, then trading a small loss window for higher throughput with batched asynchronous flushing is usually the better design.

Batched asynchronous flushing fits peak traffic better than per-record synchronous persistence

# Pseudocode: trigger batched flush by row count or time
flush_rows = 5000          # Flush when a row threshold is reached
flush_interval_ms = 1000   # Or flush at most once per second
flush_mode = "async"       # Use asynchronous flush to reduce real-time I/O jitter

The key value of this strategy is that it merges many small random I/O operations into larger sequential I/O batches.
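
In DolphinDB's subscription layer, the row-count-or-time trigger above maps naturally onto the batchSize and throttle parameters of subscribeTable. A minimal sketch follows; the DFS path dfs://sensordb, the readings table, and the flushBatch helper are illustrative assumptions.

// Downstream DFS table that receives batched appends (database path and table name are assumed)
dfsReadings = loadTable("dfs://sensordb", "readings")

// Each invocation receives one batch as a single table because msgAsTable=true
def flushBatch(mutable dest, msg) {
    dest.append!(msg)
}

// Invoke the handler once 5000 rows accumulate, or at least once per second, whichever comes first
subscribeTable(tableName="sensor_stream", actionName="batchFlush",
    handler=flushBatch{dfsReadings}, msgAsTable=true, batchSize=5000, throttle=1)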

Duplicate Consumption in High Availability Usually Indicates Missing Offset Governance

When results double after a primary-secondary switchover, you are looking at one of the most dangerous and most easily overlooked problems in stream systems. If the primary and secondary each maintain their own offsets, progress divergence during failover is inevitable.

To achieve truly seamless takeover, you must externalize offsets into shared storage, and both nodes must use the same progress source. In that model, failover only changes the execution node. It does not change the consumption boundary.

Shared offset storage is the simplest reliable HA pattern

// Shared offset storage ensures consistent consumption progress between primary and secondary.
// The offset argument must be an integer, so first read the latest recorded position from the
// shared metadata table (the column names actionName / offsetValue are assumptions).
offsetTbl = loadTable("dfs://meta", "shared_offset")
lastOffset = exec max(offsetValue) from offsetTbl where actionName = "tick_factor_consumer"
subscribeTable(
    tableName=`tick_stream,
    actionName=`tick_factor_consumer,
    handler=processTick,
    offset=lastOffset,
    reconnect=true
)

The core idea expressed by this configuration is simple: progress belongs to the system, not to any single node.
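
Restoring from the shared position covers recovery, but the consumer also has to record progress as it runs. One way to sketch that, under the same assumed shared_offset schema (actionName, offsetValue, updateTime), is the handlerNeedMsgId option, which passes each batch's offset to the handler; processTickWithOffset is a hypothetical wrapper around the existing business handler.

// Wrap the business handler and append the processed offset back to the shared metadata table
def processTickWithOffset(mutable progress, msg, msgId) {
    processTick(msg)
    progress.append!(table(["tick_factor_consumer"] as actionName, [msgId] as offsetValue, [now()] as updateTime))
}

// The same subscription as above, now recording its progress; handlerNeedMsgId makes DolphinDB pass msgId
subscribeTable(tableName=`tick_stream, actionName=`tick_factor_consumer,
    handler=processTickWithOffset{offsetTbl}, msgAsTable=true,
    offset=lastOffset, reconnect=true, handlerNeedMsgId=true)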

Building a Standardized Operations Baseline Matters More Than Fighting One Fire at a Time

These five failure types lead to one consistent conclusion: real-time system stability does not come from engine capability alone. It depends on parameter consistency, state lifecycle control, subscription isolation, and high-availability metadata governance working together.

You should turn the following items into a pre-release checklist: TCP timeout consistency, subscription split strategy, stream table persistence, garbage collection thresholds, flush mode, and shared offset tables. That approach eliminates most non-engine production incidents before they happen.

FAQ

Q1: What should I check first when DolphinDB writes suddenly stop?

First determine whether the interruption is global, then inspect logs for timeout, disconnect, and write failed. If it happens at a fixed time window and not during peak load, prioritize TCP timeout settings, collector retry intervals, and background inspection jobs.

Q2: If subscription lag keeps growing, why is it not necessarily a compute capacity issue?

Because a single subscription may form a serial queue by default. Even if each rule executes quickly, multiple tasks sharing the same consumption channel will still create persistent queueing, which shows up as offsets failing to keep up with production speed.

Q3: How can I avoid duplicate consumption after primary-secondary failover?

Do not let primary and secondary nodes maintain separate local offsets. Store offsets in a shared metadata table so both nodes read and update the same progress. After failover, the standby node can continue consuming from the latest shared position.

Core Summary

This article reconstructs five typical failure patterns in DolphinDB real-time time-series pipeline operations: write interruptions, subscription stalls, compute-node OOM, disk I/O jitter, and duplicate consumption after failover. It also provides a directly actionable troubleshooting sequence, configuration guidance, and high-availability governance recommendations.