SelectDB search() in Practice: Unifying Log Search and Analytics with a Single SQL Query
SelectDB embeds full-text search directly into SQL through search(), solving the long-standing split between search and analytics in log systems. It is compatible with Elasticsearch query_string, supports BM25 scoring, multi-field search, nested retrieval, and unified query-tree evaluation, making it well suited for AI inference logs, Agent traces, and evaluation-data analysis. Keywords: SelectDB, log analytics, full-text search.
Technical Specifications Snapshot
| Parameter | Description |
|---|---|
| Core Product | SelectDB |
| Engine Foundation | Commercial distribution based on Apache Doris |
| Primary Language / Interface | SQL |
| Search Protocol / Syntax | Compatible with Elasticsearch query_string / Lucene-style syntax |
| Core Capabilities | Full-text search, OLAP aggregation, BM25 scoring, nested search |
| Typical Use Cases | AI log retrieval, incident troubleshooting, A/B analysis, Agent call-chain tracing |
| Core Dependencies | Inverted index, MPP execution engine, VARIANT type |
AI-era log systems are demanding unified search and analytics
AI inference services continuously generate logs across request routing, model scheduling, GPU memory allocation, KV cache hits, and output evaluation. On platforms handling tens of millions of requests per day, TB-scale log volume is common.
The challenge is not just storage. Two requirements exist at the same time: one is precise incident investigation, and the other is aggregate analysis. Traditional architectures usually assign search to Elasticsearch and analytics to OLAP systems such as ClickHouse, which results in two systems, two copies of data, and one synchronization pipeline.
Dual-system architectures introduce several major pain points
- They expand the operational surface area with more components and longer upgrade paths.
- Data synchronization introduces latency, so search and analytics results can diverge.
- They increase costs because indexes, replicas, and analytical storage duplicate the same data.

SelectDB collapses both workloads into a single statement:
```sql
SELECT request_id, model_name, error_msg, latency_ms
FROM inference_logs
WHERE search('level:ERROR AND error_msg:"CUDA out of memory" AND model_name:gpt*') -- Execute full-text search directly in SQL
  AND log_time > NOW() - INTERVAL 1 HOUR -- Add a structured time filter
ORDER BY latency_ms DESC
LIMIT 100;
```
This SQL statement shows the core value of search(): it turns text retrieval into a normal WHERE predicate.
search() gives SQL native search-engine capabilities
search() accepts a DSL string whose syntax closely resembles Elasticsearch query_string. That means teams with existing Elasticsearch experience can migrate with almost no need to relearn a query language.
More importantly, it supports far more than simple keyword matching. It also supports Lucene mode, multi-field search, Boolean combinations, regular expressions, prefixes, and relevance-based ranking. This upgrades log retrieval from “searchable” to “precisely expressible.”
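As a minimal illustration of multi-field matching (the table and column names follow the examples later in this article), one predicate can check the same term across several indexed text fields:

```sql
-- Match one term in either of two indexed text columns; field names are illustrative
SELECT request_id, error_msg
FROM inference_logs
WHERE search('error_msg:timeout OR context:timeout');
```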
Lucene mode fits complex Boolean queries
```sql
SELECT *
FROM inference_logs
WHERE search(
    'level:ERROR AND (msg:"timeout" OR msg:"connection refused")', -- Parenthesize the OR so non-ERROR rows cannot match
    '{"mode":"lucene", "default_operator":"and"}' -- Enable Lucene mode and set the default Boolean operator
);
```
This style works especially well for log platforms migrating from Elasticsearch, where most existing queries can simply be wrapped inside search().
Multiple query operators make incident investigation more direct
In complex troubleshooting scenarios, engineers often need to constrain error level, keywords, latency, host, and exclusions at the same time. The value of search() is not that it rebuilds another MATCH function, but that it expresses multi-condition query trees in a unified way.
Composite queries can narrow the failure scope in one step
```sql
SELECT *
FROM inference_logs
WHERE search(
    'level:ERROR AND error_msg:"connection refused" AND latency_ms:[500 TO *] AND NOT module:healthcheck'
    -- Combine severity, phrase matching, range filtering, and exclusion conditions
);
```
This query compiles multiple conditions into a single query tree for unified evaluation, reducing the intermediate overhead of executing predicates separately.
Wildcards, regex, and multi-field search improve recall
```sql
SELECT request_id, error_msg
FROM inference_logs
WHERE search('error_msg:/CUDA.*error/ AND level:ERROR'); -- Use regex to cover multiple CUDA error variants
```
This approach is useful when production error messages are unstable and the wording of the same class of exception varies widely.
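Prefix and wildcard terms follow the same query_string conventions. A short sketch, reusing the field names from the earlier examples:

```sql
-- Prefix match on the model family plus a wildcard on the error text (illustrative terms)
SELECT request_id, model_name
FROM inference_logs
WHERE search('model_name:gpt* AND error_msg:alloc*');
```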
BM25 and nested search fill in advanced retrieval capabilities
Log search is not only about matching. Relevance also matters. search() includes BM25 scoring and exposes scores through score(), so you can rank logs by relevance instead of simply scanning massive result sets in reverse chronological order.
Score-based ranking works better for noisy log datasets
```sql
SELECT request_id, error_msg, score() AS score
FROM inference_logs
WHERE search('error_msg:"memory allocation failed" OR error_msg:"CUDA error"') -- Match multiple GPU memory error phrasings
ORDER BY score DESC
LIMIT 20;
```
This changes the troubleshooting entry point from “scan logs first” to “inspect the most relevant logs first.”
VARIANT with NESTED can search through JSON structures
AI Agent, tool invocation, and inference pipeline logs often contain nested arrays. SelectDB’s VARIANT type can store this data natively, and the NESTED operator can directly search internal objects.
```sql
SELECT *
FROM agent_trace_logs
WHERE search('NESTED(steps, status:error AND tool:code_exec)'); -- Locate failed tool invocations inside the steps array
```
This avoids the extra complexity of flattening JSON into tables, running ETL, and joining data back together later.
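The storage side of this pattern can be sketched as follows. The exact DDL is an assumption based on the VARIANT type described above, not a verbatim schema from the source:

```sql
-- Hypothetical table for Agent traces: the steps array lives in a VARIANT column
CREATE TABLE agent_trace_logs (
    log_time   DATETIME,
    session_id VARCHAR(64),
    steps      VARIANT  -- nested step objects, e.g. [{"tool": "code_exec", "status": "error"}, ...]
) ENGINE=OLAP
DUPLICATE KEY(log_time);
```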
The performance gains of search() come from unified evaluation, not simpler syntax
On the surface, multiple MATCH_* predicates and a single search() both filter text. In practice, they execute differently. The former often evaluates each condition independently and then intersects bitmaps. The latter compiles the full condition set into a query tree and advances document by document.
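For contrast, the scattered-predicate style looks roughly like this. This is a sketch using Doris-style MATCH operators; each text condition may hit the index independently before the results are intersected:

```sql
-- Each MATCH predicate can produce its own candidate set, which the engine then intersects
SELECT *
FROM logs
WHERE level = 'ERROR'
  AND module MATCH_ANY 'inference'
  AND error_msg MATCH_PHRASE 'CUDA out of memory';
```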
A unified query tree reduces intermediate materialization
- It supports AND short-circuiting and can skip non-matching candidates early.
- It shares the IndexReader, reducing repeated index opening and scanning overhead.
- It uses the full DSL string as the cache granularity, which fits repeated interactive queries better.
```sql
SELECT *
FROM logs
-- Hand all multi-field conditions to search() for unified execution
-- instead of scattering them across multiple MATCH predicates
WHERE search('level:ERROR AND module:inference AND error_msg:"CUDA out of memory" AND context:gpu');
```
The more conditions you add, and the more skewed the data becomes, the more obvious the advantage of unified evaluation becomes.
Unified SQL is especially effective for three AI log scenarios
The first is model inference incident diagnosis. You can search for OOM errors first, then immediately aggregate by model, prompt length, and P99 latency. The second is model evaluation and A/B analysis, where one query can bucket by prompt length and compare quality, latency, and cost. The third is Agent call-chain tracing, where you reconstruct failed steps by session.
Search and aggregation can run in the same SQL query
```sql
SELECT
    model_name,
    COUNT(*) AS error_count,
    AVG(prompt_tokens) AS avg_prompt_tokens,
    PERCENTILE_APPROX(latency_ms, 0.99) AS p99_latency
FROM inference_logs
WHERE search('level:ERROR AND error_msg:"CUDA out of memory"') -- Use the inverted index first to narrow the incident scope
  AND log_time > NOW() - INTERVAL 1 HOUR
GROUP BY model_name
ORDER BY error_count DESC;
```
This SQL statement captures the value of a unified engine: retrieval filtering and analytical aggregation connect seamlessly.
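The second scenario, A/B-style evaluation analysis, follows the same shape. The bucketing columns below (prompt_tokens, eval_score, cost_usd) are illustrative assumptions, not columns named in the source:

```sql
-- Bucket requests by prompt length, then compare quality, latency, and cost per bucket
SELECT
    CASE WHEN prompt_tokens < 512  THEN 'short'
         WHEN prompt_tokens < 2048 THEN 'medium'
         ELSE 'long' END AS prompt_bucket,
    AVG(eval_score) AS avg_quality,
    PERCENTILE_APPROX(latency_ms, 0.99) AS p99_latency,
    SUM(cost_usd) AS total_cost
FROM inference_logs
WHERE search('model_name:gpt*')
GROUP BY prompt_bucket
ORDER BY prompt_bucket;
```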
Migrating from Elasticsearch to SelectDB is relatively manageable
Migration challenges usually center on query compatibility, index definitions, and storage cost. According to the source material, search() is compatible with Elasticsearch query_string, so most queries only need to be rewritten from REST API calls into SQL WHERE conditions. At the index layer, inverted indexes are defined with USING INVERTED.
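A hypothetical before-and-after makes the rewrite concrete; the Elasticsearch request is shown as a comment for comparison, and the field names are assumed to carry over unchanged:

```sql
-- Before (Elasticsearch REST, for comparison):
--   GET /inference_logs/_search
--   { "query": { "query_string": { "query": "level:ERROR AND error_msg:timeout" } } }
-- After (SelectDB SQL):
SELECT *
FROM inference_logs
WHERE search('level:ERROR AND error_msg:timeout');
```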
Table creation follows analytical database conventions rather than search-cluster operations
```sql
CREATE TABLE inference_logs (
    log_time DATETIME,
    request_id VARCHAR(64),
    model_name VARCHAR(32),
    level VARCHAR(16),
    error_msg TEXT,
    context TEXT,
    latency_ms INT,
    INDEX idx_level(level) USING INVERTED,
    INDEX idx_error(error_msg) USING INVERTED PROPERTIES(
        "parser" = "unicode", "support_phrase" = "true" -- Enable Unicode-friendly parsing and phrase search
    )
) ENGINE=OLAP
DUPLICATE KEY(log_time)
PROPERTIES (
    "inverted_index_storage_format" = "V3" -- Use the V3 inverted-index storage format
);
```
The key benefit is that it removes one Elasticsearch cluster and one data synchronization pipeline, lowering duplicate storage overhead and operational complexity.
Log platforms are moving from dual engines to integrated architectures
The value of SelectDB search() is not just that it adds a full-text search function to SQL. It elevates search into a native OLAP capability. For AI log systems, that means the same dataset can simultaneously support incident investigation, trend analysis, A/B evaluation, and call-chain tracing.
When search and analytics fit into a single SQL query, data no longer needs to move between systems, and the architecture shifts from synchronization coupling to single-engine collaboration. That is the most important direction for the next generation of log platforms.
FAQ
Q1: Can SelectDB search() directly replace Elasticsearch?
A1: In integrated log search and analytics scenarios, search() already covers many core requirements, including Boolean queries, phrases, wildcards, regex, BM25, multi-field search, and nested search. However, if your business depends heavily on specific Elasticsearch ecosystem plugins, you should evaluate those dependencies case by case.
Q2: Why is search() faster than using multiple MATCH predicates?
A2: The core reason is not the function name, but the execution model. search() compiles conditions into a unified query tree, supports short-circuit evaluation, shares readers, and uses DSL-level caching, which reduces bitmap materialization and repeated scans.
Q3: Which scenarios are the best fit for migration to SelectDB?
A3: The best candidates are AI inference logs, Agent call chains, online evaluation logs, and large-scale operational logs. These scenarios require both full-text search and aggregate analytics, which is exactly where dual-system architectures usually incur the highest synchronization and operational cost.
Core Summary: This article reconstructs and explains SelectDB’s search() capability, showing how it uses SQL to unify log search and OLAP analytics, replacing the dual-system architecture of Elasticsearch plus a separate analytical database. It covers Lucene-compatible syntax, BM25, multi-field and nested search, performance mechanics, and migration considerations.