Model information disclosure check

What middleBrick covers

  • 18 adversarial probes across Quick, Standard, and Deep scan tiers
  • OpenAPI 3.0/3.1 and Swagger 2.0 parsing with $ref resolution
  • Detection of prompt injection, jailbreak, and encoding bypass techniques
  • Identification of PII leakage and system prompt exposure
  • Read-only scanning with no destructive payloads
  • Authenticated scanning with domain verification gate

What is a model information disclosure check

A model information disclosure check examines endpoints that expose or interact with AI models to identify ways sensitive model internals, training data artifacts, or system prompts may be revealed. This includes probes that attempt to extract instructions, override guardrails, or coax the model into reproducing memorized content. The goal is to understand what an attacker could learn about the model behavior and constraints through legitimate interface channels.

What teams get wrong when they skip this check

Without a structured check, teams underestimate conversational risks such as prompt injection, jailbreak techniques, and encoding-based bypasses. Adversarial probes like base64 or ROT13 encoding, translation-embedded injection, and multi-turn manipulation can leak system instructions or PII. Treating LLM interfaces as read-only logs underestimates token smuggling and tool-abuse paths that expose internal routing or training details.

A good workflow for model disclosure testing

Start with a low-intensity scan to map endpoints and supported content types, then progress to deeper adversarial tiers only where justified by risk. Use read-only methods such as GET and HEAD, and restrict text-only POST to LLM probes. Map findings against the API inventory, verify ownership via domain checks, and review results in the context of the model architecture and data sensitivity. Iterate with focused scenarios rather than broad brute force.

Coverage provided by middleBrick

middleBrick runs a black-box scan that includes 18 adversarial probes across three scan tiers: Quick, Standard, and Deep. It detects system prompt extraction, instruction override attempts, DAN and roleplay jailbreaks, data exfiltration patterns, cost exploitation, encoding bypasses, translation-embedded injection, few-shot poisoning, markdown injection, multi-turn manipulation, indirect prompt injection, token smuggling, tool-abuse, nested instruction injection, and PII extraction. The scanner parses OpenAPI 3.0, 3.1, and Swagger 2.0 with recursive $ref resolution and cross-references spec definitions against runtime observations to highlight undefined security schemes and deprecated operations.

Authentication and scanning policies

Authenticated scanning is available from the Starter tier and supports Bearer, API key, Basic auth, and Cookie credentials. Domain verification requires a DNS TXT record or an HTTP well-known file to ensure only the domain owner can scan with credentials. Forwarded headers are limited to Authorization, X-API-Key, Cookie, and X-Custom-* to reduce noise. Scan data is deletable on demand and purged within 30 days of cancellation, and it is never used for model training.

Frequently Asked Questions

Does this replace a human red team for LLM interfaces?
No. The scanner identifies common adversarial patterns and encoding-based leakage but cannot replicate human reasoning required for business logic or custom model defenses.
What methods are used during a scan?
The scanner uses read-only methods (GET and HEAD) and text-only POST for LLM probes. Destructive payloads are never sent, and infrastructure-level attacks such as active SQL injection or command injection are outside scope.
Can I scan APIs behind authentication with middleBrick?
Yes, authenticated scanning is supported with Bearer, API key, Basic auth, and Cookie credentials, provided domain ownership is verified.
How are findings mapped to compliance frameworks?
Findings map directly to OWASP API Top 10 (2023). For other frameworks, the scanner helps you prepare for and supports audit evidence relevant to described controls.