LLM Data Leakage in Cassandra
How LLM Data Leakage Manifests in Cassandra
Large language model (LLM) applications often interact with databases to store conversation history, user preferences, or generated content. When Cassandra is used as the backend, several code paths can unintentionally expose data through the LLM interface.
- Unsafe query construction: If the LLM‑driven service builds CQL statements by concatenating user‑supplied text, an attacker can inject additional clauses that dump tables or read sensitive columns. For example, a chatbot that appends the user's message directly into a `SELECT` statement without parameterization.
- Error‑message leakage: Cassandra drivers return detailed error messages when a query fails. If the service bubbles these errors back to the LLM (and thus to the user), an attacker can learn table names, column types, or even parts of the clustering key.
- Insufficient access controls on stored LLM data: Prompts, completions, or user‑supplied artifacts may be written to a Cassandra table that is readable by any authenticated user. If the API exposing this table does not enforce row‑level permissions, the LLM can inadvertently reveal private prompts or proprietary model outputs.
- Batch or paging abuse: A malicious prompt could cause the service to execute a `BATCH` statement with many partitions or to request large page sizes, causing the LLM to stream back large volumes of data that exceed intended limits.
The following Node.js snippet illustrates a vulnerable pattern where user input is directly interpolated into a CQL query:
```javascript
const express = require('express');
const cassandra = require('cassandra-driver');

const app = express();
app.use(express.json());

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'chat'
});

app.post('/ask', async (req, res) => {
  const userInput = req.body.message; // attacker-controlled
  // VULNERABLE: user input is interpolated directly into the CQL string
  const query = `SELECT response FROM prompts WHERE user_id = '${userInput}' ALLOW FILTERING;`;
  try {
    const result = await client.execute(query);
    res.json({ answer: result.rows[0]?.response });
  } catch (err) {
    // Detailed Cassandra error sent back to the user
    res.status(500).json({ error: err.message });
  }
});
```
In this example, an attacker who controls the input can trivially read other users' data or probe the schema. Note that CQL is less expressive than SQL (it has no `OR` operator), so the classic `' OR 1=1` payload does not apply here; instead, the attacker can simply supply another user's identifier to read that user's stored response, or send input containing a stray quote (e.g., `x';`) to trigger a parse error whose verbose message reveals table and column details.
Cassandra‑Specific Detection
Detecting LLM‑induced data leakage in Cassandra requires looking for both classic database exposure signs and LLM‑specific behaviors. Manual review should focus on:
- Any endpoint that builds CQL strings with raw user input.
- Responses that include Cassandra error messages (e.g., `Invalid query`, `Undefined column name`).
- Tables that store LLM prompts or outputs without encryption or row‑level access control.
- Unusually large result sets returned in a single API call (possible batch/paging abuse).
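As a lightweight complement to manual review, outgoing API responses can be scanned for Cassandra error signatures before they reach users. The sketch below is illustrative (the signature list and function name are assumptions, not part of any tool mentioned here):

```javascript
// Known substrings from Cassandra driver error messages (illustrative, not exhaustive)
const CASSANDRA_ERROR_SIGNATURES = [
  'Invalid query',
  'Undefined column name',
  'unconfigured table',
  'SyntaxException'
];

// Returns the signatures found in an outgoing response body, if any
function findLeakedErrorDetails(responseBody) {
  return CASSANDRA_ERROR_SIGNATURES.filter(sig => responseBody.includes(sig));
}

// Example: a verbose error that should never reach a client
const leaked = findLeakedErrorDetails(
  'ResponseError: Undefined column name secret_col'
);
// leaked contains ['Undefined column name']
```

A check like this can run in middleware or in integration tests, flagging any endpoint that echoes driver errors verbatim.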
middleBrick automates part of this discovery. When you scan an API URL, the platform runs:
- Data Exposure checks – it probes for verbose error messages and attempts to extract schema information via controlled injections.
- LLM/AI Security checks – it performs active prompt‑injection probes (system‑prompt extraction, instruction override, DAN jailbreak, data exfiltration, cost exploitation) and scans LLM responses for PII, API keys, or executable code.
- Unauthenticated LLM endpoint detection – if the LLM service does not require authentication, middleBrick flags it as a potential vector for data leakage.
Example CLI usage:
```shell
middlebrick scan https://api.example.com/llm-chat
```
Sample JSON excerpt from the report (truncated for brevity):
```json
{
  "findings": [
    {
      "id": "DATA-EXP-01",
      "title": "Verbose Cassandra error messages exposed",
      "severity": "high",
      "description": "The endpoint returns full CQL error details when malformed input is supplied, aiding attackers in schema enumeration.",
      "remediation": "Catch exceptions and return generic error messages to clients."
    },
    {
      "id": "LLM-INJ-03",
      "title": "Prompt injection enables data exfiltration",
      "severity": "critical",
      "description": "A sequential probe successfully extracted stored prompts from the Cassandra-backed prompt table.",
      "remediation": "Use prepared statements and enforce strict input validation; store prompts encrypted and restrict read access."
    }
  ]
}
```
These findings give developers concrete, actionable clues about where Cassandra‑related LLM leakage may be occurring.
Cassandra‑Specific Remediation
Fixing LLM data leakage in Cassandra involves applying database‑level safeguards and ensuring the LLM‑facing layer does not amplify those risks.
- Use prepared statements (parameterized queries): Never concatenate user input into CQL. The DataStax drivers support prepared statements that automatically handle escaping.
- Generic error handling: Catch driver exceptions and return a user‑friendly message (e.g., "Something went wrong") instead of propagating the Cassandra error.
- Encrypt data at rest and in transit: Enable Cassandra’s built‑in encryption (client‑to‑node and node‑to‑node) and consider transparent data encryption (TDE) for tables that hold prompts or LLM outputs.
- Implement role‑based access control (RBAC): Create a dedicated Cassandra role for the LLM service with only the permissions it needs (e.g., INSERT into a prompts table, SELECT from a cached responses table). Revoke unnecessary permissions such as `DESCRIBE` or `SELECT` on system tables.
- Limit result set size: Apply a default page size at the driver level (e.g., the `fetchSize` query option in the Node.js driver) and reject requests that ask for excessively large pages.
- Mask or redact PII: Before storing LLM‑generated content, run a detection filter (regex or ML‑based) to replace emails, phone numbers, API keys, etc., with placeholders.
- Store prompts/outputs encrypted: Use a column‑level encryption UDF or encrypt the value in the application layer before writing to Cassandra.
- Validate and sanitize LLM prompts: Apply an allow‑list of expected characters or length, and reject prompts that contain CQL keywords (`SELECT`, `INSERT`, `BATCH`, `ALTER`, etc.) when they are not intended to be part of a query.
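The least‑privilege role from the RBAC step can be provisioned with CQL along these lines. The role name, keyspace, and table names below are illustrative:

```sql
-- Dedicated, minimally privileged role for the LLM service (names are illustrative)
CREATE ROLE IF NOT EXISTS llm_service WITH PASSWORD = 'change-me' AND LOGIN = true;

-- Grant only the operations the service actually performs
GRANT MODIFY ON TABLE chat.prompts TO llm_service;          -- covers INSERT/UPDATE/DELETE
GRANT SELECT ON TABLE chat.cached_responses TO llm_service; -- read-only cache access

-- Deliberately omitted: keyspace-wide grants and SELECT on system keyspaces
```

Granting per‑table rather than per‑keyspace keeps a compromised LLM service from enumerating or reading tables it has no business touching.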
Here is the same Node.js endpoint rewritten with a prepared statement and generic error handling:
```javascript
const express = require('express');
const cassandra = require('cassandra-driver');

const app = express();
app.use(express.json());

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'chat'
});

// With { prepare: true }, the driver prepares the statement on first use
// and caches it for subsequent executions
const GET_PROMPT = 'SELECT response FROM prompts WHERE user_id = ?';

app.post('/ask', async (req, res) => {
  const userId = req.body.message; // still user-supplied, but now bound as a parameter
  try {
    const result = await client.execute(GET_PROMPT, [userId], { prepare: true });
    if (result.rowLength === 0) {
      return res.status(404).json({ error: 'Prompt not found' });
    }
    res.json({ answer: result.rows[0].response });
  } catch (err) {
    // Log the full error internally, but return a generic message
    console.error('Cassandra error:', err);
    res.status(500).json({ error: 'Internal server error' });
  }
});
```
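The PII‑masking step from the remediation list can be implemented as a filter that runs before any write to Cassandra. This is a minimal regex‑based sketch; the patterns and function name are illustrative, and regexes alone will miss some PII, so treat this as one layer among several:

```javascript
// Illustrative redaction rules; a production filter would be broader
const REDACTION_RULES = [
  { name: 'email',  pattern: /[\w.+-]+@[\w-]+\.[\w.]+/g, placeholder: '[EMAIL]' },
  { name: 'apiKey', pattern: /\bsk-[A-Za-z0-9]{16,}\b/g, placeholder: '[API_KEY]' },
  { name: 'phone',  pattern: /\+?\d[\d\s().-]{8,}\d/g,   placeholder: '[PHONE]' }
];

// Replace detected PII with placeholders before the value is persisted
function redactPii(text) {
  return REDACTION_RULES.reduce(
    (out, rule) => out.replace(rule.pattern, rule.placeholder),
    text
  );
}
```

Calling `redactPii` on prompts and completions just before the `INSERT` keeps raw identifiers out of the table entirely, so a later leak exposes only placeholders.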
And a Java example using the DataStax driver with a prepared statement and explicit role‑based permissions:
```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DriverException;
import com.datastax.oss.driver.api.core.cql.*;
import java.util.NoSuchElementException;

public class PromptService {
  private final CqlSession session;
  private final PreparedStatement getPrompt;

  public PromptService(CqlSession session) {
    this.session = session;
    // Prepare once; the driver caches prepared statements per session
    this.getPrompt = session.prepare(
        "SELECT response FROM prompts WHERE user_id = ?");
  }

  public String getResponse(String userId) {
    BoundStatement bound = getPrompt.bind(userId);
    try {
      // ResultSet is not AutoCloseable in the 4.x driver,
      // so no try-with-resources is needed here
      ResultSet rs = session.execute(bound);
      Row row = rs.one();
      if (row == null) {
        throw new NoSuchElementException("Prompt not found");
      }
      return row.getString("response");
    } catch (DriverException e) {
      // Log e internally; surface only a generic failure to callers
      throw new RuntimeException("Failed to retrieve prompt", e);
    }
  }
}
```
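The prompt‑validation advice from the remediation list can be sketched as a guard that runs before any query is issued. The keyword list and limits below are illustrative and should be tuned per application; note this is defense in depth alongside prepared statements, not a substitute for them:

```javascript
// CQL keywords that should not appear in identifier-style input (illustrative)
const FORBIDDEN_CQL_KEYWORDS = ['SELECT', 'INSERT', 'BATCH', 'ALTER', 'DROP', 'TRUNCATE'];
const MAX_PROMPT_LENGTH = 2000;

// Returns true only when the input passes length, keyword, and allow-list checks
function isPromptSafe(input) {
  if (typeof input !== 'string' || input.length === 0 || input.length > MAX_PROMPT_LENGTH) {
    return false;
  }
  const upper = input.toUpperCase();
  if (FORBIDDEN_CQL_KEYWORDS.some(kw => upper.includes(kw))) {
    return false;
  }
  // Allow-list: letters, digits, whitespace, and basic punctuation only
  return /^[\w\s.,!?'"-]+$/.test(input);
}
```

Requests failing `isPromptSafe` can be rejected with a generic 400 before any database work happens, which also starves attackers of error-message feedback.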
By combining these coding practices with proper Cassandra configuration (encryption, authentication, least‑privilege roles), the attack surface for LLM‑driven data leakage is substantially reduced.
Related CWEs
| CWE ID | Name | Severity |
|---|---|---|
| CWE-754 | Improper Check for Unusual or Exceptional Conditions | MEDIUM |