Distributed Denial of Service in CockroachDB
How Distributed Denial of Service Manifests in CockroachDB
A distributed denial‑of‑service (DDoS) attack against CockroachDB typically aims to exhaust the cluster’s resources — connections, CPU, memory, or disk I/O — so that legitimate traffic is slowed or blocked. Because CockroachDB is a distributed SQL database, the attack surface includes the SQL gateway, the KV layer, and the replication subsystem.
- Connection exhaustion. An attacker opens thousands of short-lived TCP connections to the SQL port (default 26257). Each connection consumes a goroutine and a memory buffer; when the per-node limit `server.max_connections_per_gateway` is reached, new connections are rejected with `sorry, too many clients already`, locking out legitimate clients along with the attacker.
- Query-CPU flood. The attacker sends expensive, unindexed scans or complex joins that force each node to spend a large fraction of its CPU on execution. CockroachDB distributes the query, but if the scan touches many ranges, aggregate CPU usage spikes across the cluster, increasing latency for all users.
- Hot‑range write storm. By repeatedly inserting or updating rows that map to the same range (e.g., a monotonically increasing primary key), the attacker creates a write hotspot. Raft replicas for that range become a bottleneck, causing increased latency and possible transaction aborts due to contention.
- Disk-I/O saturation. Issuing large `COPY` or `IMPORT` statements, or repeatedly reading large blobs, can saturate SSD bandwidth and cause background compaction queues to grow, slowing both reads and writes.
These patterns are distinct from generic network‑layer DDoS because they exploit CockroachDB’s internal resource quotas and data distribution mechanisms.
CockroachDB‑Specific Detection
Detecting a DDoS condition in CockroachDB relies on observing metrics that exceed baseline thresholds. Key signals include:
- A rapid rise in the `sql.conns` gauge (visible in the DB Console or via the Prometheus endpoint at `/_status/vars` on the HTTP port, default 8080).
- A sudden increase in `sql.query.count` or the `sql.service.latency` histogram, especially for `SELECT` statements that read many rows.
- Elevated CPU (`sys.cpu.combined.percent-normalized`) on multiple nodes, often accompanied by high KV read and write byte rates.
- Growth in `rebalancing.lease.transfers`, indicating the allocator is moving leases away from hot ranges.
- A rising transaction abort rate (`txn.aborts`) due to serialization conflicts on hot ranges.
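These signals can be pulled over SQL as well as from the DB Console. A minimal sketch, assuming the `crdb_internal.node_metrics` virtual table available in modern CockroachDB versions (exact metric names vary by release):

```sql
-- Read DDoS-relevant gauges and counters for the node you are connected to.
SELECT name, value
FROM crdb_internal.node_metrics
WHERE name IN ('sql.conns', 'sql.query.count', 'txn.aborts')
ORDER BY name;
```

Sampling this periodically and diffing the counters gives a crude per-interval rate when a full Prometheus pipeline is not available.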
middleBrick includes a rate‑limiting check as one of its 12 parallel security scans. When you submit a CockroachDB endpoint URL, middleBrick probes the unauthenticated surface and reports if the endpoint lacks effective connection‑throttling or request‑rate limits. It does not block traffic; it simply flags the missing protection and provides remediation guidance.
To correlate middleBrick’s findings with internal metrics, you can run a quick health check (these queries use `crdb_internal` virtual tables, whose schemas vary somewhat by version):
-- SQL: current open sessions per node
SELECT node_id, count(*) AS session_count
FROM crdb_internal.cluster_sessions
GROUP BY node_id;
-- SQL: statements with the highest average service latency on this node
SELECT key AS statement, service_lat_avg
FROM crdb_internal.node_statement_statistics
ORDER BY service_lat_avg DESC
LIMIT 10;
If middleBrick reports a missing rate‑limit and the above queries show session counts approaching `server.max_connections_per_gateway` or latency spikes, you have strong evidence of a DDoS‑type condition.
CockroachDB‑Specific Remediation
Mitigation focuses on configuring CockroachDB’s built‑in limits and shaping the workload to avoid hotspots. All changes are made via SQL statements; no external agents are required.
1. Connection throttling
Set a conservative maximum number of client connections per node. This prevents connection‑exhaustion attacks.
SET CLUSTER SETTING server.max_connections_per_gateway = 200;
Adjust the value to your node’s RAM and expected concurrent workload. Connections by admin users are exempt from this limit, so operators can still reach the cluster during an attack.
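During an active flood, it also helps to see who is holding the connections. A quick triage query, assuming the `crdb_internal.cluster_sessions` virtual table:

```sql
-- Client addresses holding the most sessions. A single address holding
-- hundreds of sessions is a strong sign of connection flooding.
SELECT client_address, count(*) AS sessions
FROM crdb_internal.cluster_sessions
GROUP BY client_address
ORDER BY sessions DESC
LIMIT 10;
```

Offending sessions can then be evicted with `CANCEL SESSIONS` and the source address blocked upstream.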
2. Statement timeouts
Limit how long any single statement can run, curbing CPU‑intensive flood queries.
SET CLUSTER SETTING sql.defaults.statement_timeout = '30s';
Queries exceeding the timeout are cancelled with a query execution canceled due to statement timeout error (SQLSTATE 57014).
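The same limit can be scoped per session or per role instead of cluster-wide, which avoids breaking legitimately long-running jobs. A sketch (the `ALTER ROLE ALL SET` form requires a recent CockroachDB release):

```sql
-- Per-session: applies only to the current connection
SET statement_timeout = '10s';

-- Default for future sessions of all users
ALTER ROLE ALL SET statement_timeout = '30s';
```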
3. Memory limits for SQL execution
Prevent SQL execution from consuming excessive RAM, which could trigger OOM kills. The node-wide SQL memory budget is a startup flag, not a cluster setting:
cockroach start --max-sql-memory=.25 ...
Within that budget, the `sql.distsql.temp_storage.workmem` cluster setting (default 64MiB) caps how much memory a single operator may use before spilling to disk.
4. Workload shaping to avoid hot ranges
Avoid monotonically increasing primary keys (sequences, timestamps): they funnel every write into the same range. Use UUID primary keys to distribute writes evenly, and if you must index a sequential value, use a hash-sharded index to spread it across ranges. (Interleaved tables have been removed from modern CockroachDB versions, so hash sharding is the supported approach.)
-- Example: hash-sharded secondary index on a sequential ID
CREATE TABLE orders (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  seq_id BIGINT NOT NULL,
  customer_id UUID,
  total DECIMAL,
  created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX orders_seq_idx ON orders (seq_id) USING HASH STORING (customer_id, total);
-- USING HASH prepends a computed shard column to the index key, so
-- sequential seq_id values are spread across many ranges instead of one.
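To confirm the index is actually sharded and that inserts fan out across ranges, inspect the schema and the range split points (assuming the index is named `orders_seq_idx`; output columns vary by version):

```sql
-- The hidden computed shard column (crdb_internal_..._shard_...) should
-- appear in the index definition
SHOW CREATE TABLE orders;

-- With hash sharding, sequential seq_id inserts land in several ranges
-- instead of one hot range
SHOW RANGES FROM INDEX orders@orders_seq_idx;
```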
5. Zone configurations for load spreading
If you have multiple nodes across availability zones, ensure data is evenly distributed.
ALTER TABLE orders CONFIGURE ZONE USING
  num_replicas = 3,
  constraints = '{"+region=us-east1": 1, "+region=us-west2": 1}';
The per-replica constraint form pins one replica to each listed region and lets the allocator place the third wherever load is lowest.
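To verify what a zone change actually did, CockroachDB can show both the effective configuration and where replicas landed:

```sql
-- Effective zone configuration for the table
SHOW ZONE CONFIGURATION FROM TABLE orders;

-- Replica and leaseholder placement per range
SHOW RANGES FROM TABLE orders;
```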
6. Client‑side back‑off and pooling
Even with server limits, clients should reuse connections and implement exponential back‑off on retryable errors.
// Go example using pgxpool (pgx v5)
import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

func NewPool(ctx context.Context) (*pgxpool.Pool, error) {
	cfg, err := pgxpool.ParseConfig("postgres://user@host:26257/db?sslmode=verify-full")
	if err != nil {
		return nil, err
	}
	cfg.MaxConns = 20 // stay below the server-side connection cap
	cfg.MinConns = 5
	cfg.HealthCheckPeriod = time.Minute
	cfg.AfterConnect = func(ctx context.Context, conn *pgx.Conn) error {
		// enforce a statement timeout on every pooled connection
		_, err := conn.Exec(ctx, "SET statement_timeout = '30s'")
		return err
	}
	return pgxpool.NewWithConfig(ctx, cfg)
}
After applying these settings, monitor the same metrics mentioned in the Detection section. middleBrick will continue to report on missing rate‑limit protections, but the cluster will now resist connection‑exhaustion and CPU‑flood attempts.
Frequently Asked Questions
Does middleBrick stop a DDoS attack against my CockroachDB instance?
No. middleBrick detects and reports the condition: it flags endpoints that lack connection throttling or request-rate limits and supplies remediation guidance, but it does not sit in the traffic path and cannot absorb or block an attack. Mitigation is done through the CockroachDB settings and client-side practices described above.
Which CockroachDB setting should I tune first to defend against connection‑exhaustion DDoS?
`server.max_connections_per_gateway`. Capping connections at a value your nodes can actually serve (e.g., 200 per node) stops an attacker from exhausting memory and goroutines by opening thousands of sessions; admin users are exempt from the limit, so operators keep access during an incident.