Severity: HIGH

LLM Data Leakage in Chi with MongoDB

LLM Data Leakage in Chi with MongoDB — how this specific combination creates or exposes the vulnerability

LLM data leakage in a Chi application that uses MongoDB as the primary data store occurs when language model interactions inadvertently expose or infer sensitive information stored in your MongoDB collections. This risk arises from two intersecting factors: how data is structured and retrieved for LLM consumption, and how the LLM itself may reveal information during interaction.

Chi is a lightweight, idiomatic Go router often used to build APIs and services. When such services integrate with MongoDB, commonly via the official Go driver (go.mongodb.org/mongo-driver), they query or stream documents to transform and present data. If those documents contain sensitive fields (e.g., email, ssn, api_key) and are passed into prompts or exposed through endpoints used by an LLM, the LLM may output or log that data in ways that violate privacy or compliance expectations.

For example, a Chi route that fetches a user document from MongoDB and includes it directly in a system prompt can lead to system prompt leakage. An attacker might use prompt injection techniques to coerce the LLM into repeating or encoding that sensitive document data in its responses. Because MongoDB documents often contain nested fields and arrays, it is easy to inadvertently include more data than intended when constructing prompts or logging LLM outputs.
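To make that anti-pattern concrete, here is a minimal sketch (the function name is hypothetical, and map[string]any stands in for a decoded BSON document) of a handler helper that serializes a whole MongoDB document into the prompt:

```go
package main

import "fmt"

// ANTI-PATTERN: the entire decoded document is serialized into the prompt,
// so every nested sensitive field becomes visible to the LLM and to anyone
// who can coax the model into repeating or encoding it in a response.
func buildUnsafePrompt(doc map[string]any) string {
	return fmt.Sprintf("System: You are a helpful assistant. User record: %v", doc)
}
```

Anything the route fetched, including fields the feature never needed, is now part of the model's context.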

The LLM/AI Security checks in middleBrick specifically test for this scenario by probing endpoints that interact with MongoDB-backed data sources. They check for unauthenticated LLM endpoints and perform active prompt injection tests, such as system prompt extraction and data exfiltration, to see whether MongoDB-derived content appears in LLM outputs. The scanner also reviews output for PII, API keys, and executable code that may have originated from MongoDB documents. Because Chi services often serve as APIs for single-page applications or mobile clients, improper handling of MongoDB data in LLM workflows can expose information that should remain on the server or within secure contexts.
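A simplified illustration of that kind of output screening (the patterns below are stand-ins for this example, not middleBrick's actual rule set):

```go
package main

import "regexp"

// Illustrative detectors for PII and secrets that may have leaked from
// MongoDB documents into LLM output. Real scanners use far richer rules.
var leakPatterns = map[string]*regexp.Regexp{
	"email":   regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`),
	"api_key": regexp.MustCompile(`(?i)\b(sk|api[_-]?key)[-_][A-Za-z0-9]{16,}\b`),
	"ssn":     regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),
}

// scanLLMOutput returns the name of every pattern found in the output.
func scanLLMOutput(output string) []string {
	var hits []string
	for name, re := range leakPatterns {
		if re.MatchString(output) {
			hits = append(hits, name)
		}
	}
	return hits
}
```

Running such checks on every model response gives a cheap signal that stored data is escaping, even before a human reviews the finding.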

Compliance mappings are relevant here as well. Findings from such leakage scenarios typically map to OWASP API Top 10 (API1:2023 Broken Object Level Authorization when data is over-exposed), GDPR data minimization and purpose limitation, and SOC2 controls around information disclosure. middleBrick’s per-category breakdowns help identify whether a specific MongoDB query or endpoint is contributing to LLM data leakage, providing prioritized findings with severity and remediation guidance.

MongoDB-Specific Remediation in Chi — concrete code fixes

To mitigate LLM data leakage in a Chi application using MongoDB, focus on ensuring that only necessary, sanitized data is ever presented to the LLM and that sensitive fields are never logged or echoed in responses. Below are concrete patterns and code examples tailored for Chi and MongoDB.

1. Project only required fields from MongoDB documents

When querying MongoDB, use projection to return only the fields needed for business logic and LLM interaction. Avoid returning entire documents.

import (
    "context"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

// Expose only the fields the handler and the LLM actually need.
type PublicUser struct {
    ID   string `bson:"_id" json:"id"`
    Name string `bson:"name" json:"name"`
}

// Called from a Chi handler: fetch only public fields via a projection,
// so sensitive fields (email, ssn, api_key) never leave MongoDB.
func findPublicUsers(ctx context.Context, coll *mongo.Collection) ([]PublicUser, error) {
    safeProjection := bson.M{"_id": 1, "name": 1}
    cur, err := coll.Find(ctx, bson.M{}, options.Find().SetProjection(safeProjection))
    if err != nil {
        return nil, err
    }
    defer cur.Close(ctx)

    var users []PublicUser
    err = cur.All(ctx, &users)
    return users, err
}

2. Sanitize data before LLM consumption

Never pass raw MongoDB documents into LLM prompts. Build prompt contexts explicitly and remove or mask sensitive keys.

import (
    "fmt"
    "log/slog"
)

// Build the prompt context explicitly; only whitelisted fields appear.
func buildPrompt(user PublicUser, context string) string {
    return fmt.Sprintf("System: You are a helpful assistant.\nUser: id=%s, name=%s, context=%s",
        user.ID, user.Name, context)
}

// Example of removing sensitive fields before any logging
slog.Info("llm_request",
    "user_id", user.ID,
    "context_length", len(context),
) // Do not include PII in logs
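Masking can also be enforced generically. The sketch below (the key list is an assumption; extend it to match your schema, and map[string]any stands in for a decoded BSON document) redacts known-sensitive keys from a document before it can reach a prompt or a log line:

```go
package main

// Keys treated as sensitive; extend this set to match your collections.
var sensitiveKeys = map[string]bool{"email": true, "ssn": true, "api_key": true}

// redactSensitive copies a decoded document, masking sensitive keys and
// recursing into nested documents (MongoDB documents are often nested).
func redactSensitive(doc map[string]any) map[string]any {
	out := make(map[string]any, len(doc))
	for k, v := range doc {
		switch {
		case sensitiveKeys[k]:
			out[k] = "[REDACTED]"
		default:
			if nested, ok := v.(map[string]any); ok {
				out[k] = redactSensitive(nested)
			} else {
				out[k] = v
			}
		}
	}
	return out
}
```

Because the helper recurses, a sensitive key buried in an embedded document or profile sub-object is masked just like a top-level one.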

3. Validate and escape outputs that may contain stored data

If your LLM is expected to reference data from MongoDB (e.g., summarizing stored records), validate and escape outputs to prevent injection of unintended content.

import cats.data.Validated._
import cats.data.NonEmptyList

import (
    "errors"
    "net/http"
    "strings"
)

// Reject LLM output containing suspicious patterns before returning it.
func validateLLMOutput(output string) (string, error) {
    if strings.Contains(output, "--") || strings.Contains(output, `{"`) {
        return "", errors.New("output contains suspicious patterns")
    }
    return output, nil
}

// Use validated output in Chi routes safely
r.Get("/summary", func(w http.ResponseWriter, req *http.Request) {
    out, err := validateLLMOutput(llmResponse)
    if err != nil {
        http.Error(w, "response blocked", http.StatusBadGateway)
        return
    }
    w.Write([]byte(out))
})
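Escaping complements validation when LLM output is rendered into another context; for HTML responses, a minimal sketch using only the standard library:

```go
package main

import "html"

// escapeForHTML neutralizes markup in LLM output before it is embedded in
// an HTML page, so leaked document fragments cannot become active content.
func escapeForHTML(output string) string {
	return html.EscapeString(output)
}
```

The same principle applies to other sinks: escape for the context the output lands in (HTML, shell, SQL), never trust the model to do it.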

4. Avoid logging full LLM responses that may echo MongoDB content

Configure structured logging to exclude fields that may contain sensitive data originating from MongoDB.

// Redact sensitive keys with log/slog before anything is written.
redactKeys := map[string]bool{"email": true, "ssn": true, "api_key": true}

logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
        if redactKeys[a.Key] {
            a.Value = slog.StringValue("[REDACTED]")
        }
        return a
    },
}))

// Accidental PII attributes are masked before they reach the log:
logger.Info("llm_response", "user_id", "u123", "email", "jane@example.com")

5. Use middleware to enforce data boundaries

Implement request/response middleware in Chi to strip or redact sensitive MongoDB fields before they reach the LLM or are returned to the client.

import (
    "bytes"
    "net/http"
    "regexp"
)

// Example: mask values of keys named "email" or "apiKey" in JSON responses.
var sensitiveField = regexp.MustCompile(`"(email|apiKey)"\s*:\s*"[^"]*"`)

// Attach with r.Use(redactMongoFields) before routes that return MongoDB data.
func redactMongoFields(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        buf := &bufferedWriter{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(buf, r) // the handler writes into the buffer
        w.WriteHeader(buf.status)
        w.Write(sensitiveField.ReplaceAll(buf.body.Bytes(), []byte(`"$1": "[REDACTED]"`)))
    })
}

type bufferedWriter struct {
    http.ResponseWriter
    body   bytes.Buffer
    status int
}

func (b *bufferedWriter) Write(p []byte) (int, error) { return b.body.Write(p) }
func (b *bufferedWriter) WriteHeader(code int)        { b.status = code }

Related CWEs (LLM Security)

CWE ID    Name                                                   Severity
CWE-754   Improper Check for Unusual or Exceptional Conditions   MEDIUM

Frequently Asked Questions

How does middleBrick detect LLM data leakage involving MongoDB in Chi services?
middleBrick scans the unauthenticated attack surface of your Chi endpoints, tests for prompt injection and system prompt leakage, and analyzes LLM outputs for PII or sensitive patterns that may originate from MongoDB documents. It correlates findings with your OpenAPI spec to identify which endpoints expose MongoDB-derived data to the LLM.
Can middleBrick integrate with a Chi app’s CI/CD pipeline to catch MongoDB-related LLM leakage before deployment?
Yes. With the Pro plan, you can use the GitHub Action to add API security checks to your CI/CD pipeline, fail builds if the security score drops below your threshold, and scan staging APIs before deploy. This helps catch LLM data leakage risks early when MongoDB interactions are changed.