ClickHouse MCP for Dataservice Trace Analysis

Executive summary

The current ClickHouse MCP ecosystem is real, but still early.

  • There is now an official ClickHouse MCP server (ClickHouse/mcp-clickhouse), plus at least one more feature-rich vendor implementation from Altinity and a few smaller community projects.
  • Most current ClickHouse MCP servers are still basically safe-ish SQL access layers for LLMs: list databases, list tables, inspect schema, run read-only queries, sometimes expose prompts/resources, and sometimes add transport/auth features.
  • The more interesting direction is not “LLM writes arbitrary SQL against raw span tables,” but LLM + curated semantic layer: views, parameterized tools, saved queries, pre-aggregations, and trace-specific helper tools.
  • In observability, the market has already validated the broader pattern: ClickHouse is widely used as the backend for logs/traces/metrics systems such as HyperDX/ClickStack, SigNoz, and Langfuse. In practice, people are already using ClickHouse to power trace and telemetry analysis at scale; MCP is the missing UX/control layer for agentic workflows.
  • For a ClickHouse-based trace analysis product like Dataservice, MCP looks most promising as a way to enable:
    • natural-language investigation over trace data,
    • guided drilldowns over traces/spans/errors/latency outliers,
    • cross-signal correlation (trace + logs + metrics),
    • and operator/copilot experiences for incident triage.
  • The main caveat: today’s MCP servers are generally database-generic, not observability-native. They do not inherently understand traces, span trees, service graphs, cardinality traps, tenant isolation, or investigation workflows. Those capabilities would still need to be added through schema design, tool design, guardrails, and UX.

Bottom line: the ecosystem is mature enough to prototype now, but not mature enough to treat as a finished observability product layer. The winning pattern is likely MCP on top of curated ClickHouse trace abstractions, not raw unrestricted SQL over OTel tables.


What exists today

1) Official ClickHouse MCP server

The main official project is ClickHouse/mcp-clickhouse.

Common capabilities from the README:

  • run_query for executing SQL against ClickHouse
  • list_databases
  • list_tables with pagination and optional column metadata
  • optional read-only-by-default behavior (CLICKHOUSE_ALLOW_WRITE_ACCESS=false)
  • support for HTTP/SSE transports in addition to stdio
  • auth token support for network transports
  • a /health endpoint
  • an extra run_chdb_select_query tool using chDB for embedded/local ClickHouse-style querying
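For orientation, client setup is typically just a launch command plus connection environment variables. The entry below is one plausible Claude Desktop-style configuration; the exact command, arguments, and variable names should be verified against the mcp-clickhouse README for the version in use:

```json
{
  "mcpServers": {
    "clickhouse": {
      "command": "uv",
      "args": ["run", "--with", "mcp-clickhouse", "mcp-clickhouse"],
      "env": {
        "CLICKHOUSE_HOST": "your-clickhouse-host",
        "CLICKHOUSE_USER": "readonly_user",
        "CLICKHOUSE_PASSWORD": "…",
        "CLICKHOUSE_SECURE": "true"
      }
    }
  }
}
```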

The official ClickHouse docs now include an MCP section that explicitly says ClickHouse has an MCP server, with guides for integrating it with multiple agent frameworks.

Observed ecosystem signal:

  • GitHub repo created: 2024-12-25
  • ~711 stars at time of research

Implication: this is no longer a fringe experiment; it is now part of ClickHouse’s official AI story.

2) Altinity MCP server

Altinity/altinity-mcp is a more feature-rich ClickHouse MCP implementation.

Capabilities called out in its README/docs:

  • stdio, HTTP, and SSE transports
  • optional JWE-based authentication and TLS
  • built-in tools for listing tables, describing schemas, and executing queries
  • dynamic tools generated from ClickHouse views
  • resource templates for database/table discovery
  • query prompts for AI-assisted query building and optimization
  • config via files/env/CLI flags
  • hot reload
  • Docker and Helm deployment paths
  • integration docs for Claude web, ChatGPT GPTs, Claude Desktop, Claude Code, Cursor, and Windsurf

The dynamic-tools feature is especially notable for observability use cases: it can turn parameterized ClickHouse views into callable MCP tools, using view comments as descriptions and mapping ClickHouse types into JSON schema for tool arguments.
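A minimal sketch of what that view-to-tool mapping could look like. This is an illustration of the idea, not Altinity's actual implementation; the function name, type table, and view are all assumptions:

```python
# Map ClickHouse parameter types onto JSON-schema types (partial, illustrative).
CH_TO_JSON = {
    "String": "string",
    "DateTime": "string",
    "UInt64": "integer",
    "Float64": "number",
}

def view_to_tool(view_name: str, comment: str, params: dict[str, str]) -> dict:
    """Build a JSON-schema-style MCP tool definition from a parameterized view.

    The view's comment becomes the tool description; each view parameter
    becomes a typed, required tool argument.
    """
    return {
        "name": view_name,
        "description": comment,
        "inputSchema": {
            "type": "object",
            "properties": {
                name: {"type": CH_TO_JSON.get(ch_type, "string")}
                for name, ch_type in params.items()
            },
            "required": list(params),
        },
    }

# Hypothetical view: spans slower than a threshold for one service.
tool = view_to_tool(
    "slow_requests",
    "Spans slower than a threshold within a time window",
    {"min_duration_ms": "UInt64", "service": "String"},
)
```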

Observed ecosystem signal:

  • GitHub repo created: 2025-06-06
  • ~23 stars at time of research

Implication: Altinity is exploring a more opinionated “semantic tool layer over ClickHouse” approach, which is closer to what observability copilots actually need.

3) Smaller/community ClickHouse MCP projects

Example: izaitsevfb/clickhouse-mcp.

Interesting capabilities from its README:

  • schema reading
  • query explain support
  • semantic search over ClickHouse documentation
  • Claude Code-oriented setup

Observed ecosystem signal:

  • GitHub repo created: 2025-03-14
  • low adoption so far (~2 stars)

Implication: there is experimentation around adding more domain-aware helper functions on top of the basic query/list pattern, but the ecosystem is still fragmented.

4) ClickHouse documentation and ecosystem guidance

ClickHouse’s docs now position MCP as a first-class integration path and include multiple tutorials for building agents around the ClickHouse MCP server (LangChain, LlamaIndex, PydanticAI, Streamlit, Claude Agent SDK, Slackbot, etc.).

That matters because it shows the current momentum is not just “Claude Desktop connector” usage; it is moving toward embedded agent applications.

5) The surrounding observability ecosystem is already ClickHouse-heavy

Even if trace-specific MCP servers are still immature, the underlying observability substrate is already proven:

  • HyperDX / ClickStack: HyperDX describes itself as a system for searching and visualizing logs and traces on top of ClickHouse; ClickStack docs describe a ClickHouse + HyperDX + OpenTelemetry collector stack.
  • SigNoz: open-source observability platform for logs, metrics, and traces; exposes ClickHouse-backed trace querying and even documents direct ClickHouse trace queries.
  • Langfuse: ClickHouse docs describe Langfuse as an LLM engineering/observability platform that relies on ClickHouse as its scalable observability backend.
  • The OpenTelemetry Collector contrib distribution ships a ClickHouse exporter for logs/traces/metrics.

Implication: the observability world has already validated ClickHouse as the backend. MCP is layering AI interaction on top of an already-established operational data plane.


What people are doing now

1) Using MCP mainly as a natural-language SQL bridge

The dominant pattern today is simple:

  1. connect an AI assistant to ClickHouse through MCP,
  2. let it inspect schema,
  3. let it generate and run read-only SQL,
  4. iterate conversationally.

This is useful, but still fairly thin. It resembles a “database copilot” more than a full observability investigator.

2) Pairing ClickHouse with OpenTelemetry for traces, then using other UIs for actual investigation

In practice, many teams are not using MCP as the primary trace UI yet. They are more commonly doing:

  • OpenTelemetry SDKs/collectors for data collection,
  • ClickHouse for storage,
  • Grafana / HyperDX / SigNoz / custom UI for investigation.

The ClickHouse observability docs explicitly recommend OpenTelemetry for telemetry collection, and the ClickHouse exporter README shows example queries for logs and traces. ClickHouse’s traces blog frames traces as essentially another analytical workload, stored one row per span.

So the reality today is: ClickHouse for telemetry storage/querying is mainstream; MCP for telemetry investigation is emerging.

3) Building schema-aware or semantic layers on top of raw ClickHouse

A key practical trend is adding semantic structure above raw tables:

  • generated tools from views (Altinity)
  • domain-specific prompts/resources
  • saved dashboards and query builders (e.g. SigNoz)
  • search-oriented UX (HyperDX)

This is a strong signal that raw SQL alone is not enough for good operator workflows.

4) Querying trace tables directly in ClickHouse

SigNoz’s docs are especially useful here because they show what real trace data in ClickHouse looks like operationally:

  • a primary trace index table with >30 columns,
  • map/json columns for attributes,
  • denormalized helper columns for common OTel attributes,
  • separate structures for resource filtering.

That reflects what practitioners are really doing: storing OTLP-derived traces in wide span tables optimized for filtering and aggregations, then layering query tools on top.

5) Concern about MCP safety is rising

The MCP ecosystem has also started to accumulate security research. A Datadog Security Labs write-up showed SQL injection issues in a Postgres MCP server that bypassed a read-only restriction. Tinybird’s review also cites broader MCP vulnerability concerns.

This is relevant because current database MCP patterns often assume read-only mode is sufficient, but for production observability data, query safety, prompt injection resistance, tool argument validation, tenant isolation, and cost controls matter a lot.


Applicability to ClickHouse trace analysis

For a product like Dataservice, the strongest near-term applicability is not generic BI chat. It is investigation acceleration.

Good fit areas

A) Natural-language trace triage

An MCP-connected assistant could help answer questions like:

  • “What changed in the checkout service in the last 30 minutes?”
  • “Show me the slowest traces for tenant X after deploy Y.”
  • “Which downstream dependency is dominating p95 latency for this endpoint?”
  • “Group failed traces by exception class / RPC target / region.”

This is a strong fit for ClickHouse because trace analysis is mostly:

  • filtering,
  • grouping,
  • percentile calculations,
  • joins/correlations,
  • time-window comparisons,
  • top-k/outlier finding,
  • and summarization over large volumes of events.

Those are exactly the kinds of analytical queries ClickHouse is good at.
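As a concrete illustration, most of the triage questions above compile down to queries of roughly this shape. The table and column names (otel_traces, ServiceName, Duration, Timestamp) follow the OpenTelemetry ClickHouse exporter's defaults but should be treated as assumptions; the `{name:Type}` placeholders use ClickHouse's native query-parameter syntax rather than string interpolation:

```python
# p95 latency per service over a recent window, top-k by slowest.
TOP_SLOW_SERVICES = """
SELECT
    ServiceName,
    quantile(0.95)(Duration) AS p95_ns,
    count() AS span_count
FROM otel_traces
WHERE Timestamp >= now() - INTERVAL {minutes:UInt32} MINUTE
GROUP BY ServiceName
ORDER BY p95_ns DESC
LIMIT {top_k:UInt16}
"""
```

Filtering, grouping, a percentile aggregate, a time window, and a top-k cutoff: exactly the workload profile listed above.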

B) Guided drilldown over wide OTel/span schemas

Trace schemas are usually hard for humans and LLMs:

  • many columns,
  • semi-structured maps/json,
  • semantic conventions that vary by language/instrumentation,
  • inconsistent attribute population.

MCP can help by hiding some of that complexity behind tools like:

  • find_slow_services(time_range, env, tenant)
  • get_trace(trace_id)
  • find_error_clusters(service, window)
  • compare_latency_before_after(deploy_id)
  • top_span_attributes(service, endpoint, filter)

That is much more reliable than asking an LLM to rediscover the schema from scratch every time.
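A hedged sketch of one such tool: validate arguments up front, then emit a parameterized query. The column names (TenantId, Env), limits, and allowed values are illustrative assumptions, not a real Dataservice schema:

```python
ALLOWED_ENVS = {"prod", "staging", "dev"}

def find_slow_services(time_range_minutes: int, env: str, tenant: str):
    """Validate tool arguments, then return (sql, params) for execution.

    Values are bound server-side as ClickHouse query parameters,
    never interpolated into the SQL string.
    """
    if not 1 <= time_range_minutes <= 1440:
        raise ValueError("time_range_minutes must be between 1 and 1440")
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown env: {env!r}")
    sql = (
        "SELECT ServiceName, quantile(0.95)(Duration) AS p95_ns "
        "FROM otel_traces "
        "WHERE TenantId = {tenant:String} "
        "  AND Env = {env:String} "
        "  AND Timestamp >= now() - INTERVAL {minutes:UInt32} MINUTE "
        "GROUP BY ServiceName ORDER BY p95_ns DESC LIMIT 20"
    )
    return sql, {"tenant": tenant, "env": env, "minutes": time_range_minutes}
```

The model never sees raw SQL generation for this path; it only chooses arguments, which the tool can reject.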

C) Cross-signal correlation

Observability is rarely just traces. ClickStack/HyperDX and SigNoz both emphasize unified logs/traces/metrics workflows. A ClickHouse-backed MCP layer could support investigations such as:

  • trace IDs with corresponding logs,
  • slow traces correlated with infra metrics,
  • deployment windows correlated with error spikes,
  • service dependency changes inferred from span relationships.

ClickHouse is well-suited because these signals can live in the same analytical backend and be joined or correlated efficiently.

D) Incident copilot workflows

An MCP-based assistant could be particularly useful for:

  • generating an initial incident summary,
  • identifying likely blast radius,
  • surfacing top regressions by service/route/version,
  • suggesting next queries/drilldowns,
  • producing a trace-based postmortem draft.

This is likely more valuable than open-ended chat because it maps to concrete operations workflows.

Best-fit pattern conceptually

From the current ecosystem, the most promising pattern appears to be:

  1. ClickHouse stores traces/spans/logs/metrics
  2. OpenTelemetry defines ingestion/semantics
  3. MCP exposes curated investigation tools, not only raw SQL
  4. LLM handles explanation, iteration, summarization, and query planning
  5. UI still matters for flamegraphs, span trees, service maps, and timeline visualization

In other words: MCP is excellent for the copilot layer, but not a replacement for all observability UX.


Limitations / caveats

1) Most ClickHouse MCP servers are not trace-native

They know how to query tables; they generally do not know how to:

  • reconstruct a trace tree cleanly,
  • understand parent/child span semantics,
  • distinguish service/resource/span attribute scopes,
  • compute dependency graphs,
  • detect causal patterns,
  • or present flamegraph/timeline views.

Those behaviors need domain logic above the basic MCP server.
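As one example of that missing layer: even the basic step of rebuilding a span tree from flat rows is logic no generic MCP server provides. A minimal sketch, assuming spans arrive as dicts with illustrative field names:

```python
def build_trace_tree(spans: list[dict]) -> list[dict]:
    """Rebuild parent/child structure from flat span rows.

    Each span is a dict with 'span_id', 'parent_span_id', and 'name';
    returns the root spans with nested 'children' lists.
    """
    by_id = {s["span_id"]: {**s, "children": []} for s in spans}
    roots = []
    for node in by_id.values():
        parent = by_id.get(node["parent_span_id"])
        if parent is not None:
            parent["children"].append(node)
        else:
            # Missing or empty parent id: treat as a root
            # (orphaned spans are common in real trace data).
            roots.append(node)
    return roots
```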

2) Raw prompt-to-SQL over telemetry can be noisy and expensive

Telemetry schemas are wide and high-cardinality. Left unconstrained, LLM-generated SQL can easily become:

  • slow,
  • expensive,
  • over-broad,
  • or semantically wrong.

Typical failure modes:

  • scanning too much data,
  • using the wrong attribute columns,
  • confusing resource vs span attributes,
  • bad percentile logic,
  • wrong grouping keys,
  • weak time filters.
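Several of these failure modes can be caught mechanically before a query ever reaches ClickHouse. A rough sketch of such a pre-flight check; the heuristics and the Timestamp column name are assumptions that would need tuning to the actual schema:

```python
import re

def lint_generated_sql(sql: str) -> list[str]:
    """Cheap pre-flight checks on LLM-generated SQL; heuristic, not a parser."""
    problems = []
    if not re.search(r"\bWHERE\b.*\bTimestamp\b", sql, re.IGNORECASE | re.DOTALL):
        problems.append("no time filter: may scan the whole span table")
    if not re.search(r"\bLIMIT\s+\d+", sql, re.IGNORECASE):
        problems.append("no LIMIT: result set may be unbounded")
    if re.search(r"\bSELECT\s+\*", sql, re.IGNORECASE):
        problems.append("SELECT *: expensive over wide span tables")
    return problems
```

A real implementation would want a proper SQL parser, but even string-level checks like these catch the most common scan-the-world queries.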

3) Read-only is not the same as safe

Even read-only database access can still be risky in production observability contexts:

  • data exfiltration across tenants/projects,
  • unexpectedly expensive queries,
  • schema discovery leakage,
  • prompt injection through data values,
  • SQL injection or validation bypasses in tool implementations.

The Datadog Security Labs Postgres MCP case is a warning sign for all DB-oriented MCP servers, not just Postgres.
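One concrete mitigation is defense in depth at the query level: even with a read-only database user, per-query ClickHouse settings can cap cost and enforce read-only behavior again. The setting names below are real ClickHouse settings; the values are illustrative, not recommendations:

```python
# Cost caps passed alongside every MCP-issued query.
GUARDRAIL_SETTINGS = {
    "readonly": 1,                        # reject any write/DDL statement
    "max_execution_time": 30,             # seconds before the query is killed
    "max_result_rows": 10_000,            # hard cap on rows returned
    "max_bytes_to_read": 50_000_000_000,  # abort scans past ~50 GB
}
# e.g. with the clickhouse-connect client:
#   client.query(sql, settings=GUARDRAIL_SETTINGS)
```

None of this addresses prompt injection or cross-tenant exfiltration on its own, but it bounds the blast radius of a single bad query.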

4) Token/context limits still matter

Trace investigations often involve:

  • many spans per trace,
  • many attributes per span,
  • long time-series comparisons,
  • large top-k result sets.

Dumping raw span rows into the model is usually the wrong shape. Good systems will need summarization and compact intermediate representations.
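A minimal sketch of that compaction step: collapse thousands of raw durations into a handful of numbers before anything reaches the model. The function and field names are illustrative:

```python
from statistics import quantiles

def summarize_durations(durations_ms: list[float]) -> dict:
    """Collapse raw span durations into a compact summary for the model."""
    cuts = quantiles(sorted(durations_ms), n=100)  # 99 percentile cut points
    return {
        "count": len(durations_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "max_ms": max(durations_ms),
    }
```

Five numbers instead of thousands of rows: a few tokens for the model, and usually enough to decide the next drilldown.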

5) Visualization remains a separate problem

Some observability tasks are naturally visual:

  • flamegraphs,
  • waterfall timelines,
  • service maps,
  • diff views across deploys,
  • temporal heatmaps.

MCP helps with data access and reasoning, but not by itself with high-bandwidth visual investigation.

6) Ecosystem churn is still high

The MCP ecosystem is moving fast. Interfaces, transports, auth patterns, and client support are still evolving. A design that overfits today’s MCP client quirks may age poorly.


Interesting ideas worth exploring

1) Parameterized trace-investigation tools instead of open SQL

Borrow the Altinity dynamic-tools idea, but make it observability-specific:

  • slow_endpoints(window, service, env)
  • error_budget_burn_candidates(window)
  • trace_exemplar_for_route(route, percentile, window)
  • dependency_regressions(window_a, window_b)
  • tenant_hotspots(window)

These are easier to secure, easier for models to use correctly, and closer to how humans investigate.

2) Build a semantic layer over OTel conventions

Because raw OTel schemas are messy, a useful MCP layer could normalize concepts like:

  • service
  • operation/route
  • deployment/version
  • tenant/customer
  • error class
  • downstream dependency
  • latency bucket / SLO class

That would let the LLM reason in product/ops terms instead of raw column names.

3) Expose “explain the anomaly” workflows

Instead of just query tools, expose structured workflows:

  • find the anomaly,
  • narrow to changed dimensions,
  • retrieve exemplar traces,
  • compare before/after distributions,
  • summarize likely causes.

This is where MCP becomes more than SQL chat.

4) Treat traces as graph + event analytics

ClickHouse stores spans as rows, but investigations often care about graph structure:

  • which services call which,
  • which edge got slower,
  • which parent-child pattern changed,
  • where retries/fanout/cascading failures appear.

Interesting direction: materialize dependency edges or path summaries in ClickHouse and expose those as MCP tools.
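A toy version of that materialization: derive service-to-service edges from parent/child span rows. In production this would more likely be a ClickHouse materialized view over the span table; the tuple layout here is an assumption for illustration:

```python
from collections import Counter

def dependency_edges(spans: list[tuple[str, str, str]]) -> Counter:
    """Count caller->callee service edges from flat span rows.

    Each span is (span_id, parent_span_id, service_name); an edge exists
    wherever a span's parent belongs to a different service.
    """
    service_of = {span_id: svc for span_id, _, svc in spans}
    edges = Counter()
    for _, parent_id, svc in spans:
        parent_svc = service_of.get(parent_id)
        if parent_svc and parent_svc != svc:
            edges[(parent_svc, svc)] += 1
    return edges
```

Exposed as an MCP tool over two time windows, the same edge counts answer "which edge got slower or chattier" directly, without the model reasoning over raw spans.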

5) Use exemplars and summarization aggressively

Instead of returning 500 span rows to the LLM, return:

  • one or a few representative traces,
  • distribution summaries,
  • top changed dimensions,
  • outlier exemplars,
  • concise derived stats.

This is probably necessary for useful trace copilots.

6) Blend MCP with existing observability UI

The practical product pattern may be:

  • assistant proposes and runs investigations via MCP
  • UI renders span trees, flamegraphs, tables, and charts
  • human stays in control

That mirrors what the current ecosystem suggests: HyperDX/SigNoz/ClickStack-style UIs are still valuable even if an AI copilot exists.

7) Add trace-aware retrieval over documentation/runbooks/incidents

The smaller community ClickHouse MCP project that includes semantic doc search hints at a useful extension: combine telemetry querying with retrieval over:

  • runbooks,
  • prior incidents,
  • service ownership metadata,
  • deploy history,
  • known failure patterns.

That would likely be more useful in incident response than SQL access alone.


Sources

ClickHouse MCP ecosystem

ClickHouse + observability / traces

ClickHouse-backed observability products / patterns

Security / caveats