Denial of Service in Cassandra

Severity: HIGH

How Denial of Service Manifests in Cassandra

Cassandra is designed for high write throughput and linear scalability, but certain query patterns can exhaust resources and lead to a denial‑of‑service (DoS) condition. The most common vectors are:

  • Unbounded IN clauses – A query like SELECT * FROM users WHERE user_id IN (?, ?, …) with thousands of placeholder values forces the coordinator to fetch many partitions simultaneously, heap‑allocating large result sets and triggering long garbage‑collection pauses.
  • Large range scans without paging – A request such as SELECT * FROM sensor_data WHERE timestamp > ? ALLOW FILTERING can cause the node to stream millions of rows, filling network buffers and causing back‑pressure that stalls other operations.
  • Massive unlogged batches – Submitting a BATCH containing hundreds of INSERT/UPDATE statements makes the coordinator serialize all mutations into a single mutation object, increasing memory usage and potentially exceeding the native transport frame size.
  • Tombstone storms – Repeatedly deleting wide rows (e.g., DELETE FROM logs WHERE day < ?) creates many tombstones; subsequent reads must scan all tombstones before returning live data, dramatically increasing read latency and CPU usage.

These patterns map to API4:2023 Unrestricted Resource Consumption in the OWASP API Security Top 10 (2023). In practice, a specially crafted CQL query with an enormous IN list can exhaust a coordinator's heap and leave the node unresponsive to all other clients.

The following Java driver snippet shows a dangerous pattern that can trigger a DoS:

// Dangerous: binding an unbounded, user-controlled IN list at runtime
List<Integer> ids = getIdsFromUserInput(); // could be thousands of values
PreparedStatement prepared = session.prepare(
    "SELECT * FROM users WHERE user_id IN :ids");
BoundStatement stmt = prepared.bind()
    .setList("ids", ids, Integer.class);
session.execute(stmt); // coordinator fans out to every listed partition at once

When the list size exceeds practical limits, the coordinator must allocate a large internal buffer, leading to heap pressure and eventual node stall.
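
The simplest defense is to refuse to build such statements in the first place. A minimal sketch of an input guard at the API boundary (the 100-value cap and the validateIds helper are illustrative assumptions, not Cassandra limits):

import java.util.List;

public class InListGuard {
    // Assumed application policy, not a Cassandra limit
    private static final int MAX_IN_VALUES = 100;

    static List<Integer> validateIds(List<Integer> ids) {
        if (ids.size() > MAX_IN_VALUES) {
            throw new IllegalArgumentException(
                "Too many ids: " + ids.size() + " (max " + MAX_IN_VALUES + ")");
        }
        return ids;
    }
}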

Cassandra‑Specific Detection

Detecting a DoS‑prone configuration involves both runtime observation and active testing of the exposed API surface.

Runtime indicators:

  • Elevated GC pauses (longer than one second), visible via nodetool gcstats or the JMX GarbageCollector MXBeans.
  • Rising ReadTimeoutException rates in application logs, often accompanied by OverloadedException from the driver.
  • A growing compaction backlog (nodetool compactionstats) or high MemtableFlushWriter queue length, indicating that write pressure is blocking reads.
  • Large pending-task counts for MutationStage or ReadStage, visible in nodetool tpstats or (on Cassandra 4.0+) the system_views.thread_pools virtual table.
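
These indicators can also be polled programmatically. A minimal sketch that samples the GC MXBeans over JMX (Cassandra listens for JMX on port 7199 by default; adjust host, port, and authentication for your cluster):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcPausePoll {
    public static void main(String[] args) throws Exception {
        // Cassandra exposes JMX on port 7199 by default
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getPlatformMXBeans(mbsc, GarbageCollectorMXBean.class)) {
                // Counters are cumulative since JVM start; sample twice
                // and diff to estimate pause time per interval
                System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }
}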

Active testing with middleBrick:

middleBrick’s unauthenticated black-box scan can be pointed at any HTTP or gRPC gateway that fronts Cassandra (e.g., the DataStax Astra HTTP API, Stargate, or a custom REST wrapper). The scanner attempts:

  • Requests with increasingly large IN clause payloads to observe response time growth and error codes.
  • Unpaginated range scans with wide row ranges to detect uncontrolled data streaming.
  • Large batch submissions to measure coordinator memory usage via side‑channel timing.
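
The IN-list probe is easy to reproduce in isolation. A minimal sketch against a hypothetical REST gateway (the endpoint shape and the ids query parameter are assumptions; this illustrates the technique rather than middleBrick’s actual implementation):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class InListProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Grow the IN-list payload by an order of magnitude each round.
        // Very large payloads may need a POST body, depending on the gateway.
        for (int size = 10; size <= 10_000; size *= 10) {
            String ids = IntStream.range(0, size)
                .mapToObj(String::valueOf)
                .collect(Collectors.joining(","));
            HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/cassandra/v1/keyspace/myks/table/myTbl?ids=" + ids))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
            long start = System.nanoTime();
            HttpResponse<Void> resp = client.send(req, HttpResponse.BodyHandlers.discarding());
            System.out.printf("in-list size %5d -> HTTP %d in %d ms%n",
                size, resp.statusCode(), (System.nanoTime() - start) / 1_000_000);
        }
    }
}

Response times that grow superlinearly with payload size, or server errors and timeouts at modest sizes, point to uncontrolled fan-out on the coordinator.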

Example CLI invocation:

middlebrick scan https://api.example.com/cassandra/v1/keyspace/myks/table/myTbl

The resulting report includes a Denial of Service finding with severity, the specific CQL pattern tested, and remediation guidance (see next section). Because middleBrick works without agents or credentials, it can be run against staging or production endpoints as part of a CI pipeline.

Cassandra‑Specific Remediation

Mitigations focus on limiting the amount of work a single request can force the cluster to perform, and on enabling built‑in throttling mechanisms.

Application‑level fixes:

  • Cap the size of IN lists (e.g., at most 100 values) and split larger sets across multiple queries (see the sketch after this list).
  • Always page range scans: set a page size (Statement.setPageSize(1000) in driver 4.x, setFetchSize in 3.x) and iterate the ResultSet, which fetches subsequent pages transparently.
  • Avoid unlogged batches with more than a handful of statements; use logged batches only when atomicity is required, and keep batch size under the configured batch_size_fail_threshold_in_kb (default 50 KB).
  • Prefer token-aware routing so that requests are sent directly to a replica owning the partition, reducing coordinator load.
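
A minimal sketch of the first two fixes, assuming DataStax Java driver 4.x (the chunk size and the queryInChunks/process names are illustrative):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.List;

public class ChunkedReads {
    private static final int MAX_IN_SIZE = 100; // assumed application cap, not a Cassandra limit

    // prepared = session.prepare("SELECT * FROM users WHERE user_id IN :ids")
    static void queryInChunks(CqlSession session, PreparedStatement prepared, List<Integer> ids) {
        for (int i = 0; i < ids.size(); i += MAX_IN_SIZE) {
            List<Integer> chunk = ids.subList(i, Math.min(i + MAX_IN_SIZE, ids.size()));
            BoundStatement stmt = prepared.bind()
                .setList("ids", chunk, Integer.class)
                .setPageSize(1000); // each response carries at most one page of rows
            for (Row row : session.execute(stmt)) {
                process(row); // the driver fetches subsequent pages transparently
            }
        }
    }

    static void process(Row row) { /* application-specific handling */ }
}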

Configuration‑level fixes (cassandra.yaml):

# Throttle client requests at the native transport layer (Cassandra 4.1+)
native_transport_rate_limiting_enabled: true
# Adjust based on node capacity
native_transport_max_requests_per_second: 5000

# Cap native protocol frame size (256 MB is the default; lower it if
# clients never legitimately need large frames)
native_transport_max_frame_size_in_mb: 256

# Limit batch size
batch_size_fail_threshold_in_kb: 50
batch_size_warn_threshold_in_kb: 10

# Throttle compaction to avoid CPU starvation during spikes
compaction_throughput_mb_per_sec: 16

# Control concurrent operations
concurrent_reads: 32
concurrent_writes: 32

Most of these settings take effect only after a node restart, though a few can be adjusted at runtime (for example, compaction throughput via nodetool setcompactionthroughput). Verify effectiveness with nodetool netstats (incoming/outgoing traffic) and nodetool tpstats (thread pool utilization), as shown below.
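
For example, all three commands ship with the standard Cassandra distribution:

nodetool setcompactionthroughput 16   # runtime-adjustable throttle, in MB/s
nodetool tpstats                      # per-stage pending/blocked task counts
nodetool netstats                     # streaming and connection activity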

Verification:

Rescan the endpoint with middleBrick; the Denial of Service finding should downgrade from high to low or disappear, confirming that the request size limits and rate limiting are active.

Related CWEs

CWE ID     Name                                                     Severity
CWE-400    Uncontrolled Resource Consumption                        HIGH
CWE-770    Allocation of Resources Without Limits or Throttling     MEDIUM
CWE-799    Improper Control of Interaction Frequency                MEDIUM
CWE-835    Loop with Unreachable Exit Condition ('Infinite Loop')   HIGH
CWE-1050   Excessive Platform Resource Consumption within a Loop    MEDIUM

Frequently Asked Questions

Does middleBrick need any credentials or agents to test my Cassandra endpoint?
No. middleBrick performs a black‑box, unauthenticated scan by simply submitting the URL of your API gateway. No agents, API keys, or internal access are required.

How can I tell whether my cluster is already at risk of a resource-exhaustion DoS?
Look for rising GC pause times, increased ReadTimeout/Overloaded exceptions, and growing pending tasks in the MutationStage. middleBrick’s scan will actively probe with expanding IN lists and report if response times degrade sharply, indicating a resource-exhaustion risk.