42Crunch for Jailbreak resistance probe battery

What middleBrick covers

  • Executes 18 jailbreak adversarial probes across three scan tiers
  • Validates system prompt extraction and instruction override attempts
  • Tests encoding bypass, token smuggling, and multi-turn manipulation
  • Provides risk score grades and prioritized remediation guidance
  • Integrates via CLI, dashboard, API client, and CI/CD gates
  • Maps findings to OWASP API Top 10 (2023) for review alignment

Jailbreak resistance probe battery overview

A jailbreak resistance probe battery tests how well a model ignores or bypasses system instructions and unsafe content rules. middleBrick surfaces these probes as an LLM security category aligned to OWASP API Top 10, covering adversarial techniques across three scan tiers.

Coverage of adversarial jailbreak techniques

The scanner executes 18 adversarial probes across Quick, Standard, and Deep tiers. Techniques include system prompt extraction, instruction override, DAN and roleplay jailbreaks, data exfiltration, cost exploitation, base64 and ROT13 encoding bypass, translation-embedded injection, few-shot poisoning, markdown injection, multi-turn manipulation, indirect prompt injection, token smuggling, tool-abuse, nested instruction injection, and PII extraction.

Each probe validates whether the model resists manipulation attempts that try to reveal system instructions or produce disallowed outputs. Results highlight which attack vectors succeed and where model guardrails weaken.

Integration with API scanning and constraints

middleBrick operates as a black-box scanner, requiring no agents or code access. It supports URL-based endpoints that accept text payloads, including text-only POST bodies used for LLM probes.

Scan time is under a minute per endpoint. The tool does not perform active SQL injection or command injection, and it does not attempt to fix or remediate findings. It exposes findings with remediation guidance so you can adjust prompts, harden guardrails, or modify model configurations.

Mapping to compliance and limitations

Findings map to OWASP API Top 10 (2023), which helps you prepare for security reviews that reference jailbreak resistance as part of LLM-related controls. middleBrick is a scanner, not an auditor, and it does not certify compliance with any framework.

The scanner does not detect blind SSRF or business logic vulnerabilities that require deep domain understanding. High-stakes audits still require human pentesters to validate jailbreak resistance in the context of your application and data flows.

Workflow integration and output

Use the CLI with middlebrick scan <url> to run a Quick battery and receive a risk score and prioritized findings. The Web Dashboard groups results by probe type, shows score trends, and allows export of branded compliance PDFs.

Programmatic access returns structured data you can integrate into CI/CD gates or monitoring pipelines. Note that authenticated scanning for this workflow requires domain verification and a Starter tier or higher, with only approved headers forwarded to the endpoint.

Frequently Asked Questions

Does middleBrick test actual model behavior or only the hosting API?
It probes the endpoint you provide, validating how the model responds to jailbreak attempts at the API boundary. It does not inspect internal model weights or training data.
Can it detect all jailbreak techniques?
It covers a broad set of known adversarial patterns, but it cannot detect every possible jailbreak, especially novel techniques or context-specific bypasses that require domain knowledge.
Does this replace a human red team for LLM security?
No. The scanner identifies common probe patterns; a human pentester is still necessary to evaluate business logic, data sensitivity, and real-world attack scenarios.
How are results presented?
Each finding includes a risk score grade, a description of the probe, observed behavior, and remediation guidance to adjust prompts or model policies.