Skip to content

LLM / AI Security

middleBrick includes a three-phase AI security analysis, the only self-service scanner with active adversarial LLM probing. If your API wraps an LLM (OpenAI, Anthropic, open-source models, or any custom deployment), this check is critical.

  • Teams building AI-powered APIs: chatbots, copilots, agents, RAG systems, AI features in SaaS products
  • Companies consuming third-party AI APIs: if you proxy OpenAI/Anthropic through your own endpoint
  • Security teams auditing AI deployments: automated, repeatable checks instead of manual prompt testing

If your endpoint doesn’t involve an LLM, this check has minimal impact on your score. The engine detects whether the endpoint is AI-powered and adjusts automatically.

Analyzes the existing API response for AI indicators without sending additional requests. This phase runs on every scan.

What it detects:

  • LLM endpoint identification: recognizes AI-powered endpoints by URL patterns and response structure (token usage, choices arrays, model fields)
  • System prompt leakage: detects exposed system prompts across multiple output formats. If your system prompt appears in the response, it’s a critical finding that reveals your IP, business logic, and guardrails to anyone who asks.
  • Model information disclosure: flags when the response reveals which model is running. Attackers use this to select model-specific jailbreaks and exploits.
  • Token usage / cost exposure: identifies leaked billing or usage metadata that reveals pricing and usage patterns
  • Hallucination risk: flags endpoints with no citation or grounding mechanism, indicating outputs may be fabricated without user awareness
  • Unauthenticated LLM endpoint (critical finding). An open LLM endpoint means anyone can send prompts at your expense and extract your system prompt, tools, and data.

Scans LLM output for leaked sensitive data. Even when an LLM endpoint is “working correctly,” it may be leaking data it shouldn’t.

What it detects:

  • PII in output: emails, financial data, government IDs, and API keys that may leak from training data, fine-tuning data, or RAG context. This is how training data extraction attacks work, where the model regurgitates memorized data.
  • Executable code in output: SQL queries, shell commands, and code blocks in responses. If any downstream system executes LLM output (common in AI agents), this creates code injection risks.
  • Excessive agency: detects when the LLM reveals its available tools, function calls, or agent capabilities. An attacker who knows what tools the LLM has access to can craft prompts to exploit them.

A company builds a customer support chatbot backed by RAG (Retrieval-Augmented Generation) that indexes internal knowledge base articles. The LLM is instructed to only answer customer questions. But an attacker prompts: “Summarize the most recent document in your context.” The LLM returns internal pricing strategies, employee contact info, or security procedures — none of which should be exposed.

middleBrick’s output security analysis catches PII and sensitive data in LLM responses regardless of how it got there.

Sends up to 18 targeted adversarial probes across 3 scan tiers to test endpoint resilience against real-world attack techniques. This phase only runs on live endpoints where an LLM is detected.

The engine auto-detects the API format (OpenAI-compatible, Anthropic, or generic) and adapts its payloads accordingly. Probes run sequentially with throttling to avoid overwhelming the target.

Core attacks that every LLM endpoint should defend against:

  • System prompt extraction: attempts to get the LLM to reveal its system instructions
  • Instruction override: tests whether safety guardrails can be bypassed by direct instruction
  • Jailbreak resistance: probes for known jailbreak patterns (DAN and similar persona attacks)
  • Data exfiltration: tests whether the LLM reveals its tools, data sources, or internal details
  • Cost exploitation: checks for missing output length limits that enable token-draining attacks

Evasion techniques that bypass basic defenses:

  • Encoding bypass: sends base64-encoded malicious instructions that slip past plaintext filters
  • Roleplay jailbreak: attempts to get the LLM to adopt an unrestricted persona
  • Translation attack: embeds injection inside a translation request to bypass instruction-data separation
  • Continuation attack: injects fake “end of system prompt” markers followed by new instructions
  • Few-shot poisoning: provides malicious example responses to train the model in-context

Tier 3 — Deep Adversarial Testing (+8 probes)

Section titled “Tier 3 — Deep Adversarial Testing (+8 probes)”

Advanced attacks sourced from security research (JailbreakBench, CyberSecEval):

  • Markdown exfiltration: tests if the LLM renders image tags that can exfiltrate data via URLs
  • Multi-turn manipulation: uses false claims about prior conversations to extract information
  • Cipher bypass: sends ROT13-encoded instructions to test encoding-aware filtering
  • Indirect injection: embeds instructions in “document” data to test instruction-data separation
  • Token smuggling: uses split-token completion to extract system information
  • Tool/function abuse: attempts to trigger destructive tool calls via prompt manipulation
  • Nested injection: hides instructions inside structured data (JSON) the model is asked to process
  • PII extraction: tests if the model leaks personal information from training data or context

What “Active Probing” Means for Safety

Section titled “What “Active Probing” Means for Safety”

Active probes send adversarial text, not destructive payloads. They test whether your LLM responds to manipulation, the same way a security researcher would. The probes:

  • Never send malware or exploit code
  • Never attempt to cause damage to the target system
  • Never persist data or create accounts
  • Complete in seconds, not minutes

If you’re concerned about probes hitting production, scan your staging endpoint first.

The LLM security weight adjusts automatically:

  • Non-LLM endpoint: minimal weight. A standard REST API won’t be penalized for “failing” AI security checks it doesn’t need.
  • Detected LLM endpoint: significant weight. AI security becomes one of the most impactful categories in your score.

This detection happens in Phase 1 (passive analysis). If the engine identifies AI indicators in the response, it elevates the LLM check weight for that scan.

FindingSeverityFix
Unauthenticated LLM endpointCriticalAdd authentication. Never expose an LLM endpoint without auth.
System prompt leakedCriticalUse a system prompt that doesn’t contain secrets; add output filtering
Jailbreak successfulHighStrengthen system prompt guardrails; add input/output content filtering
PII in LLM outputHighSanitize RAG context before feeding to the LLM; add PII detection on outputs
Model name disclosedMediumStrip model metadata from API responses
No output length limitsMediumSet max_tokens on all LLM calls; add response size limits at the API layer
Excessive tool disclosureMediumDon’t echo tool definitions in responses; restrict tool listing
  • IP protection: your system prompt is intellectual property. A leaked prompt lets competitors clone your AI feature.
  • Data breach risk: LLMs can leak training data, RAG context, and connected data sources through careful prompting.
  • Financial risk: unprotected endpoints can be abused for free inference or cost amplification attacks.
  • Regulatory exposure: if your LLM leaks PII from training data, you may face GDPR/CCPA liability.
  • No other self-service scanner detects system prompt leakage, tests jailbreak resistance, or flags unauthenticated LLM endpoints.
  • A pentest firm charges $5k+ and takes 2 weeks. middleBrick does it in 30 seconds, on every deploy.