Model information disclosure check
What middleBrick covers
- 18 adversarial probes across Quick, Standard, and Deep scan tiers
- OpenAPI 3.0/3.1 and Swagger 2.0 parsing with $ref resolution
- Detection of prompt injection, jailbreak, and encoding bypass techniques
- Identification of PII leakage and system prompt exposure
- Read-only scanning with no destructive payloads
- Authenticated scanning with domain verification gate
What is a model information disclosure check
A model information disclosure check examines endpoints that expose or interact with AI models to identify ways sensitive model internals, training data artifacts, or system prompts may be revealed. This includes probes that attempt to extract instructions, override guardrails, or coax the model into reproducing memorized content. The goal is to understand what an attacker could learn about the model behavior and constraints through legitimate interface channels.
What teams get wrong when they skip this check
Without a structured check, teams underestimate conversational risks such as prompt injection, jailbreak techniques, and encoding-based bypasses. Adversarial probes like base64 or ROT13 encoding, translation-embedded injection, and multi-turn manipulation can leak system instructions or PII. Treating LLM interfaces as read-only logs underestimates token smuggling and tool-abuse paths that expose internal routing or training details.
A good workflow for model disclosure testing
Start with a low-intensity scan to map endpoints and supported content types, then progress to deeper adversarial tiers only where justified by risk. Use read-only methods such as GET and HEAD, and restrict text-only POST to LLM probes. Map findings against the API inventory, verify ownership via domain checks, and review results in the context of the model architecture and data sensitivity. Iterate with focused scenarios rather than broad brute force.
Coverage provided by middleBrick
middleBrick runs a black-box scan that includes 18 adversarial probes across three scan tiers: Quick, Standard, and Deep. It detects system prompt extraction, instruction override attempts, DAN and roleplay jailbreaks, data exfiltration patterns, cost exploitation, encoding bypasses, translation-embedded injection, few-shot poisoning, markdown injection, multi-turn manipulation, indirect prompt injection, token smuggling, tool-abuse, nested instruction injection, and PII extraction. The scanner parses OpenAPI 3.0, 3.1, and Swagger 2.0 with recursive $ref resolution and cross-references spec definitions against runtime observations to highlight undefined security schemes and deprecated operations.
Authentication and scanning policies
Authenticated scanning is available from the Starter tier and supports Bearer, API key, Basic auth, and Cookie credentials. Domain verification requires a DNS TXT record or an HTTP well-known file to ensure only the domain owner can scan with credentials. Forwarded headers are limited to Authorization, X-API-Key, Cookie, and X-Custom-* to reduce noise. Scan data is deletable on demand and purged within 30 days of cancellation, and it is never used for model training.