Severity: HIGH

LLM Data Leakage in Django with DynamoDB

LLM Data Leakage in Django with DynamoDB — how this specific combination creates or exposes the vulnerability

LLM data leakage occurs when an application unintentionally exposes sensitive information through language model interactions. In a Django application using Amazon DynamoDB as the backend, the risk arises from how data is retrieved, formatted, and passed to LLM endpoints. If DynamoDB items contain personal data, credentials, or operational details and are forwarded to an LLM without careful filtering, that data can be exposed in prompts, tool calls, or LLM responses.

DynamoDB’s schema-less design means items can contain nested and unstructured data, which may inadvertently include fields such as api_key, session_token, or debug_log. When a Django view queries DynamoDB (for example, using boto3) and then sends the retrieved records to an LLM for summarization or analysis, the LLM may leak this data through its outputs or via tool-use behavior. For instance, if a debugging or analytics workflow sends entire DynamoDB items to an LLM, the prompt context or generated content might reflect sensitive values.
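
To make the risky pattern concrete, here is a minimal sketch of the kind of view described above; the table name, prompt wording, and llm_client are illustrative assumptions, not code from a real application.

import json

import boto3

def summarize_user_activity_unsafe(user_id: str) -> str:
    # Anti-pattern: fetch the full item, including whatever attributes it happens to hold
    table = boto3.resource('dynamodb').Table('user_profiles')  # illustrative table name
    item = table.get_item(Key={'user_id': user_id}).get('Item', {})

    # Anti-pattern: serialize the entire item into the prompt, so fields such as
    # api_key, session_token, or debug_log become part of the LLM context
    prompt = f"Summarize this user's recent activity:\n{json.dumps(item, default=str)}"
    # llm_client.complete(prompt)  # hypothetical LLM client call
    return prompt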

Additionally, the integration pattern in Django often involves serializers that transform DynamoDB items into Python dictionaries. If these serializers include sensitive fields and those dictionaries are later used in LLM prompts, the data becomes part of the prompt context. LLM-specific checks in middleBrick detect this class of risk by scanning for system prompt leakage patterns and by testing whether unauthenticated endpoints can cause an LLM to reveal training data or private context. In this architecture, the lack of strict field-level authorization between DynamoDB and the LLM increases the chance of leaking credentials or PII through chat completions or function call outputs.

The combination of Django’s ORM-like query patterns, DynamoDB’s flexible item structure, and the stateless nature of many LLM integrations means that developers may not explicitly realize that rich data is being forwarded. middleBrick’s LLM/AI Security checks, including active prompt injection testing and output scanning for API keys and PII, are designed to surface these exposures. Without controls such as field filtering, redaction, or strict authorization before data reaches the LLM, a Django app using DynamoDB can unintentionally expose sensitive information through LLM interactions.

DynamoDB-Specific Remediation in Django — concrete code fixes

To reduce LLM data leakage when using DynamoDB in Django, apply field-level filtering and strict schema governance. Only project the attributes required for the immediate operation and exclude known sensitive fields before any data is sent to an LLM. Use DynamoDB’s ProjectionExpression to limit which attributes are returned and FilterExpression to limit which items are returned, and validate item contents in Python before constructing prompts.

Example: retrieve only necessary fields and redact sensitive keys before sending data to an LLM endpoint.

import boto3
from django.conf import settings

def get_user_profile_safe(user_id: str):
    dynamodb = boto3.resource('dynamodb', region_name=settings.AWS_REGION)
    table = dynamodb.Table(settings.DYNAMODB_PROFILES_TABLE)
    response = table.get_item(
        Key={'user_id': user_id},
        # Project only the attributes the LLM workflow actually needs
        ProjectionExpression='user_id, email, display_name, updated_at'
    )
    item = response.get('Item', {})
    # Defense in depth: explicitly remove any residual sensitive fields
    item.pop('api_key', None)
    item.pop('session_token', None)
    return item

def make_llm_request_safe(profile_item: dict):
    # Only pass approved fields to the LLM
    context = {
        'user_id': profile_item.get('user_id'),
        'email': profile_item.get('email'),
        'display_name': profile_item.get('display_name')
    }
    # Construct prompt using safe context
    prompt = f"Summarize preferences for user {context['display_name']} (ID: {context['user_id']})."
    # Here you would call your LLM client with the controlled prompt
    # llm_response = llm_client.complete(prompt)
    return prompt
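
Because DynamoDB items can nest maps and lists, projection and a flat pop() may not reach every sensitive attribute. A recursive redaction pass adds a second layer of defense; a minimal sketch, where the SENSITIVE_KEYS denylist is an assumption that should be adapted to your own schema.

SENSITIVE_KEYS = {'api_key', 'session_token', 'debug_log', 'password'}

def redact_item(value):
    # Recursively drop denylisted keys from nested maps and lists
    if isinstance(value, dict):
        return {k: redact_item(v) for k, v in value.items() if k not in SENSITIVE_KEYS}
    if isinstance(value, list):
        return [redact_item(v) for v in value]
    return value

Applying redact_item(item) immediately after retrieval means a newly added sensitive attribute is stripped even when the projection list has not caught up.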

Example: define a Pydantic model to enforce schema and exclude sensitive fields during serialization.

from pydantic import BaseModel, ConfigDict
from typing import Optional

class SafeProfile(BaseModel):
    # Pydantic v2 configuration; from_attributes permits validating
    # from objects that expose fields as attributes
    model_config = ConfigDict(from_attributes=True)

    user_id: str
    email: str
    display_name: Optional[str] = None

def serialize_profile_dynamodb(item: dict) -> SafeProfile:
    # Ensure only expected fields are used
    return SafeProfile(
        user_id=item['user_id'],
        email=item['email'],
        display_name=item.get('display_name')
    )
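
Putting the pieces together, a view can retrieve, validate, and only then build a prompt; a minimal sketch that reuses the functions defined above.

def build_profile_prompt(user_id: str) -> str:
    item = get_user_profile_safe(user_id)         # projected and explicitly stripped
    profile = serialize_profile_dynamodb(item)    # schema enforced by SafeProfile
    safe = profile.model_dump(exclude_none=True)  # only fields declared on the model
    return f"Summarize preferences for user {safe.get('display_name', safe['user_id'])}."

Because SafeProfile declares the full set of allowed fields, an attribute added to the table later never reaches the prompt unless it is deliberately added to the model.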

In GitHub Actions and other CI/CD workflows, you can add API security checks that fail builds when risk scores indicate potential LLM leakage. The MCP server lets you scan APIs directly from your AI coding assistant, helping catch these issues during development. The dashboard can track findings over time and map them to frameworks such as the OWASP API Security Top 10 and GDPR.

Related CWEs

CWE ID     Name                                                   Severity
CWE-754    Improper Check for Unusual or Exceptional Conditions   MEDIUM

Frequently Asked Questions

How can I verify that sensitive DynamoDB fields are excluded before LLM calls?
Instrument your Django views to log the keys present in items sent to the LLM and run automated tests that assert sensitive fields like api_key or session_token are absent. Use middleBrick’s output scanning and prompt injection tests to validate that LLM responses do not contain sensitive data.
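
As a sketch of such a test, assuming the helper functions above live in a module named profiles (a hypothetical path):

from profiles import serialize_profile_dynamodb

SENSITIVE_KEYS = {'api_key', 'session_token', 'debug_log'}

def test_serialized_profile_excludes_sensitive_fields():
    # Simulate a DynamoDB item that accidentally carries a credential
    item = {
        'user_id': 'u-123',
        'email': 'user@example.com',
        'display_name': 'Alice',
        'api_key': 'AKIA-SHOULD-NEVER-LEAK',
    }
    serialized = serialize_profile_dynamodb(item).model_dump()
    assert SENSITIVE_KEYS.isdisjoint(serialized.keys())
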
Does DynamoDB’s flexible schema increase LLM data leakage risk compared to relational stores?
Yes. Because DynamoDB can store nested and unstructured attributes, sensitive fields can more easily be included in query results inadvertently. Mitigate this by explicitly projecting required attributes, enforcing a strict serialization model, and filtering fields both at retrieval and before prompt construction.