DolphinDB Data Import and Export Guide: CSV, JSON, Parquet, and Database Sync Explained

DolphinDB provides a unified data exchange model for files, columnar storage, and external databases. It efficiently handles CSV, JSON, and Parquet import/export, bulk synchronization, and format conversion, helping solve compatibility, performance, and automation challenges in industrial IoT and time-series data processing.

Technical Specification Snapshot

Parameter | Description
Platform/Language | DolphinDB scripting language
Supported Protocols/Interfaces | File system, MySQL plugin, ODBC/PostgreSQL
Applicable Scenarios | IoT time-series data, batch ETL, database synchronization
Core Dependencies | mysql plugin, odbc plugin, Parquet/JSON/text import and export functions
Core Functions | loadText, saveText, loadJSON, saveJSON, loadParquet, saveParquet

DolphinDB uses a unified function model to cover mainstream data exchange scenarios

DolphinDB import and export capabilities extend far beyond plain text files. They cover structured text, columnar files, and external databases. For developers, the key advantage is a consistent function model: the semantics stay uniform, the learning curve stays low, and ETL pipelines become easier to assemble.

Common input functions include loadText, loadJSON, and loadParquet, while output functions include saveText, saveJSON, and saveParquet. If the target is a distributed database table, you can go further and use loadTextEx and loadParquetEx to write data directly into storage.

Supported formats and core function mapping

Data Type | Typical Formats | Read Functions | Write Functions
Text | CSV, TXT | loadText | saveText
JSON | JSON, nested JSON | loadJSON / parseJson | saveJSON
Columnar | Parquet | loadParquet | saveParquet
Distributed Ingestion | Partitioned table import | loadTextEx / loadParquetEx | saveTextEx
External Databases | MySQL, PostgreSQL | mysql::loadTable / odbc::query | mysql::saveTable / odbc::execute

The core value of this mapping is simple: developers can choose functions by data shape instead of redesigning the entire processing pipeline.
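
The same uniformity makes format conversion a two-step script: read with one family of functions, write with another. A minimal sketch of a CSV-to-Parquet conversion, assuming the plugin that provides saveParquet is already loaded and using illustrative file paths:

// Read a CSV file into memory, then persist it as a compressed columnar file
t = loadText("/data/sensor_data.csv")
saveParquet(t, "/data/sensor_data.parquet")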

CSV remains the most universal entry point for data exchange

CSV works well for cross-system exchange, offline collection, and manual validation. It offers strong compatibility, but weak type information. For that reason, you should explicitly define a schema during import whenever possible to prevent time and numeric columns from being inferred incorrectly.

// Define the schema for the sensor data table; loadText expects a table with
// "name" and "type" columns describing each field
schema = table(
    `device_id`timestamp`temperature`humidity as name,
    `INT`DATETIME`DOUBLE`DOUBLE as type
)

// Import the CSV with an explicit schema to avoid type inference errors
t = loadText("/data/sensor_data.csv", schema=schema, skipRows=1)

// Add a "format" column to the schema so timestamp strings such as
// "2024-01-01 08:00:00" are converted correctly to time types
schema2 = table(
    `device_id`timestamp`temperature`humidity as name,
    `INT`DATETIME`DOUBLE`DOUBLE as type,
    ["", "yyyy-MM-dd HH:mm:ss", "", ""] as format
)
t = loadText("/data/sensor_data.csv", schema=schema2)

This code provides stable structured CSV ingestion, with the schema's type and format columns used to control type quality.

For exports, saveText supports delimiters, append mode, and header control. That makes it suitable for downstream scripts, reporting tools, or temporary audit files.

// Build sample data
t = table(
    1..100 as device_id,
    2024.01.01 + (0..99) as date,
    20.0 + rand(10.0, 100) as temperature,   // temperatures in [20, 30)
    40.0 + rand(20.0, 100) as humidity       // humidity in [40, 60)
)

// Export to a CSV file
saveText(t, "/output/sensor_data.csv", delimiter=',', header=true)

This code quickly exports an in-memory table to a standard CSV file for exchange and backup.

Large file imports should prioritize parallelism and distributed ingestion

Once a CSV file reaches millions of rows, the main bottleneck is usually not syntax parsing but I/O throughput and single-node memory capacity. Two optimization paths address this: ploadText for parallel reading and loadTextEx for direct ingestion into partitioned tables.
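
For the first path, ploadText splits the file and parses it with multiple worker threads, returning an ordinary in-memory table. A minimal sketch (the file path is illustrative):

// Parallel read of a large CSV; the result behaves like any other in-memory table
t = ploadText("/data/sensor_data_large.csv")

The second path skips the in-memory table entirely and writes straight into a partitioned table: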

// Create a distributed database partitioned by date
db = database("dfs://iot_data", VALUE, 2024.01.01..2024.12.31)

// Import the CSV directly into a distributed table; loadTextEx takes the
// partitioning columns before the file path
loadTextEx(db, "sensor_data", `timestamp, "/data/sensor_data.csv")

This code bypasses an intermediate table and writes large files directly into distributed storage, reducing memory pressure.

JSON is better suited for semi-structured data and API integration

JSON provides stronger structural expressiveness, especially for API responses, nested objects, and device event payloads. The tradeoff is higher parsing cost, so read/write performance is typically weaker than CSV and Parquet.

// Read a JSON file
t = loadJSON("/data/sensor_data.json")

// Parse a JSON string
jsonStr = '{"device_id":1,"temperature":25.5,"humidity":50.0}'
data = parseJson(jsonStr)

This code demonstrates two JSON entry points: file-level import and string-level parsing.

For nested JSON, DolphinDB can directly access values by object path. This works well when device metadata and measurement payloads are wrapped together.

// Parse a nested JSON object
nestedJson = '{"device":{"id":1,"name":"sensor_001"},"data":{"temperature":25.5,"humidity":50.0}}'
data = parseJson(nestedJson)

// Extract nested fields
device_id = data.device.id
temperature = data.data.temperature

This code extracts key fields from nested JSON and is well suited for flattening structures before loading data into a database.
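
A minimal sketch of that flattening step, reusing the fields extracted above (the target database and table names are illustrative and assume a matching schema):

// Assemble the extracted fields into a one-row table, then append it to a DFS table
flat = table([device_id] as device_id, [temperature] as temperature)
loadTable("dfs://iot_data", "sensor_events").append!(flat)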

Parquet is the preferred format for analytical workloads

If you need long-term storage, compressed persistence, and column-oriented access, Parquet usually outperforms CSV and JSON. Its strengths are columnar storage, high compression, and cross-platform compatibility.

// Read only the required columns to reduce I/O and memory usage
t = loadParquet(
    "/data/sensor_data.parquet",
    columns=`device_id`timestamp`temperature
)

// Export Parquet using snappy compression
saveParquet(t, "/output/sensor_data.parquet", compression="snappy")

This code demonstrates two major Parquet advantages: column pruning and compressed storage.

Parquet has clear advantages in analytical pipelines

Feature | Engineering Value
Columnar storage | Reads only required columns and reduces scan cost
High compression ratio | Saves more space than CSV, often by several times
Query-friendly | Better suited for analytical workloads and batch computation
Ecosystem compatibility | Easy to integrate with Spark, Hive, and similar systems

For log archiving, offline analytics, and lakehouse staging, Parquet should take priority over CSV.

External database synchronization turns DolphinDB into an ETL hub

DolphinDB integrates with MySQL and PostgreSQL through plugins. The core approach is to establish a connection, pull tables or query results into DolphinDB, and then write processed data back into the target database when needed.

// Load the MySQL plugin (loadPlugin takes the plugin description file;
// adjust the path to your installation) and establish a connection
loadPlugin("/plugins/mysql/PluginMySQL.txt")
conn = mysql::connect("localhost", 3306, "root", "password", "iot_db")

// Query MySQL data
t = mysql::query(conn, "SELECT * FROM sensor_data WHERE date >= '2024-01-01'")

This code pulls relational data into DolphinDB and is suitable for incremental synchronization and analytical preprocessing.
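
The PostgreSQL path goes through the ODBC plugin in the same spirit. A minimal sketch, assuming an ODBC DSN named PostgreSQL_DSN is configured on the server and that the plugin's connect, query, execute, and append functions are available (plugin path, DSN, and table names are illustrative):

// Load the ODBC plugin (adjust the path to your installation) and connect via a DSN
loadPlugin("/plugins/odbc/PluginODBC.txt")
pgConn = odbc::connect("Dsn=PostgreSQL_DSN")

// Pull query results from PostgreSQL into DolphinDB
pg = odbc::query(pgConn, "SELECT * FROM sensor_data WHERE date >= '2024-01-01'")

// Run arbitrary SQL on the target database, or append a DolphinDB table back into it
odbc::execute(pgConn, "DELETE FROM sensor_data_copy WHERE date = '2024-01-01'")
odbc::append(pgConn, pg, "sensor_data_copy", true)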

Batch imports and scheduled jobs are ideal for automated pipelines

When data sources deliver files by directory or land data hourly, you can combine directory scanning with scheduled jobs to build ETL automation. The main benefit is reduced manual intervention and more stable synchronization.
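
The directory-scanning half can be built from the files function plus the bulk loaders shown earlier. A minimal sketch, assuming hourly CSV drops land under /data/incoming and target the partitioned table from the earlier example (paths and names are illustrative):

// Enumerate CSV files in the landing directory and ingest each one into the partitioned table
db = database("dfs://iot_data")
csvFiles = exec filename from files("/data/incoming") where filename like "%.csv"
for (f in csvFiles) {
    loadTextEx(db, "sensor_data", `timestamp, "/data/incoming/" + f)
}

The scheduling half is handled by scheduleJob; the job below wraps a database pull in an hourly task.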

// Execute the sync job once per hour
def syncData() {
    // Re-create the MySQL connection inside the function: scheduled job functions
    // cannot reference variables defined in the interactive session
    conn = mysql::connect("localhost", 3306, "root", "password", "iot_db")
    source = mysql::loadTable(conn, "sensor_data")               // Pull data from the source database
    loadTable("dfs://iot_data", "sensor_data").append!(source)   // Write into the target table
}

scheduleJob("sync_sensor", "Sensor data synchronization", syncData, 00:00m, 2024.01.01, 2030.12.31, 'H')

This code builds a recurring data synchronization task suitable for production ingestion workflows.

Performance optimization depends on format selection, column pruning, and early cleansing

Import and export speed depends not only on the function itself, but also on data format and processing strategy. CSV is universal but storage-heavy. JSON is flexible but slower to parse. In analytical scenarios, Parquet is often the best overall choice.

In engineering practice, prioritize the following strategies: use bulk ingestion instead of row-by-row writes, use parallel loading instead of serial reads, apply column pruning to avoid unnecessary fields, and push cleansing logic upstream to reduce the downstream cost of dirty data.
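
As one concrete way to push cleansing upstream, loadTextEx accepts a transform function that runs on each batch before it reaches the partitioned table. A minimal sketch, with illustrative value ranges and the table from the earlier examples:

// Drop out-of-range readings and rows outside the expected time window during ingestion
def cleanSensorData(t) {
    return select * from t where temperature between -40.0:125.0, timestamp >= 2024.01.01T00:00:00
}

db = database("dfs://iot_data")
loadTextEx(db, "sensor_data", `timestamp, "/data/sensor_data.csv", transform=cleanSensorData)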

Common optimization strategies at a glance

Optimization Area | Recommendation
Large file import | Prefer ploadText or loadTextEx
Analytical file export | Prefer Parquet with compression enabled
Query performance | Read only required columns
Data quality | Parse timestamps and apply range filters during ingestion
Automation | Use scheduleJob for recurring synchronization

FAQ

Q: How should I choose between CSV, JSON, and Parquet?

A: Choose CSV if you want universal exchange and human readability. Choose JSON if you need to represent nested structures or API objects. Choose Parquet first if you care about compression ratio, query efficiency, and analytical performance.

Q: What is the most important optimization for importing very large CSV files into DolphinDB?

A: The key is to avoid full serial loading on a single node. Prefer ploadText for parallel reads, or use loadTextEx to write directly into distributed tables. Combine that with proper partition design to reduce memory pressure.

Q: How can I implement continuous synchronization between DolphinDB and MySQL/PostgreSQL?

A: Establish database connections through plugins, use query functions to pull incremental data, and combine the sync logic with scheduleJob for periodic execution. For production-grade reliability, add checkpoint recovery, idempotent writes, and retry handling.
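
A minimal sketch of the checkpoint idea, using the newest timestamp already stored in DolphinDB as the incremental watermark (connection details, table names, and the mysql plugin call follow the earlier examples and are illustrative):

def incrementalSync() {
    // Scheduled functions cannot see session variables, so connect inside the job
    conn = mysql::connect("localhost", 3306, "root", "password", "iot_db")
    target = loadTable("dfs://iot_data", "sensor_data")
    // The newest stored timestamp acts as the checkpoint, keeping re-runs idempotent
    lastTs = exec max(timestamp) from target
    sql = "SELECT * FROM sensor_data WHERE timestamp > '" + temporalFormat(lastTs, "yyyy-MM-dd HH:mm:ss") + "'"
    target.append!(mysql::query(conn, sql))
}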


Summary

This article systematically explains DolphinDB data import and export capabilities across CSV, JSON, Parquet, MySQL, and PostgreSQL synchronization, as well as batch processing, format conversion, and performance optimization. It helps developers build efficient and reliable data exchange pipelines.