Few-shot poisoning test

What middleBrick covers

18 adversarial LLM probes across Quick, Standard, and Deep tiers
System prompt extraction and instruction override detection
Few-shot poisoning and indirect prompt injection analysis
Encoding bypass detection including base64 and ROT13
OpenAPI 3.x and Swagger 2.0 parsing with $ref resolution
Authenticated scanning with header allowlisting

What is a few-shot poisoning test

A few-shot poisoning test evaluates whether an LLM-based service can be compromised by injecting subtle, targeted examples into prompts or tool instructions. The test adds small, carefully crafted perturbations to training or inference data to shift model behavior in favor of an attacker. Typical objectives include extracting system instructions, overriding safety constraints, or inducing data leakage through the model API.

Common mistakes when skipping this test

Teams that skip few-shot poisoning testing assume model guardrails are sufficient against low-volume, carefully placed examples. In practice, attackers can use encoding, translation, or nested instructions to bypass surface-level defenses. Without measurement, you cannot know whether a model will comply with malicious instructions, expose prompts, or propagate tainted outputs to downstream systems.

Workflow for conducting few-shot poisoning tests

Start with a baseline assessment of the model behavior using benign prompts. Then introduce adversarial examples at the prompt or instruction layer, varying encoding, placement, and context. Measure changes in model outputs across multiple tiers, focusing on jailbreak success, data exfiltration indicators, and cost anomalies. Record token counts and response consistency to detect low-and-slow poisoning attempts.

Example workflow using the middleBrick CLI:

middlebrick scan https://api.example.com/openapi.json --llm-scan-tier deep --output json

Use the JSON output to map detected jailbreaks and data leakage indicators to specific prompt templates, then refine detection rules iteratively.

What middleBrick covers out of the box

middleBrick performs 18 adversarial probes across three scan tiers (Quick, Standard, Deep). The LLM security checks cover system prompt extraction, instruction override, DAN and roleplay jailbreaks, data exfiltration, cost exploitation, encoding bypasses (base64, ROT13), translation-embedded injection, few-shot poisoning, markdown injection, multi-turn manipulation, indirect prompt injection, token smuggling, tool-abuse, nested instruction injection, and PII extraction.

The scanner parses OpenAPI 3.0, 3.1, and Swagger 2.0 definitions with recursive $ref resolution and cross-references spec definitions against runtime findings. This surfaces missing security schemes, deprecated operations, and over-exposed fields that may amplify poisoning impact.

Integration into your security program

Use the Web Dashboard to track scan score trends over time and download branded compliance evidence. Add the GitHub Action to CI/CD to fail builds when the LLM security score drops below your threshold. For automated pipelines, call the API client to integrate scanning into existing workflows. Schedule regular rescans with the Pro tier to detect introduced weaknesses after model updates or prompt changes.

Authenticated scanning supports Bearer, API key, Basic auth, and cookies, with domain verification to ensure only your organization runs scans against protected endpoints. Header allowlisting limits forwarded headers to Authorization, X-API-Key, Cookie, and X-Custom-*.

Frequently Asked Questions

What does a few-shot poisoning test actually measure?

It measures whether small, injected examples in prompts or instructions can cause the model to ignore safety constraints, reveal system details, or exfiltrate data.

Can this replace a red team engagement for LLM security?

No. The scanner detects known adversarial patterns and surface-level issues, but business logic and high-stakes audits still require human expertise.

Does testing affect production model behavior?

Scans are read-only and use non-destructive inputs. No training or fine-tuning occurs during a scan.

How are findings mapped to compliance frameworks?

Findings map to OWASP API Top 10 (2023) and align with security controls described in SOC 2 Type II and PCI-DSS 4.0.

What happens to scan data after account cancellation?

Customer data is deletable on demand and purged within 30 days of cancellation. It is never sold or used for model training.