Training Data Extraction in Axum
Training data extraction occurs when an API unintentionally reveals sensitive information used to train machine learning models or proprietary datasets. In Axum, this typically happens through endpoints that expose internal metadata, training logs, or configuration files that reference dataset paths, model versions, or training parameters. Unlike generic frameworks, Axum's routing and middleware system makes certain patterns more visible, especially when developers use debug handlers or development-stage logging.
Common manifestations include:

- Endpoints that return directory listings of training data directories (e.g., `/training-data` or `/model-checkpoints`)
- Debug routes that render environment variables or configuration objects containing dataset identifiers
- Error responses that echo stack traces containing paths like `/var/lib/training/data/v2`
- OpenAPI specifications that document responses with fields like `training_dataset_path` or `model_source_repo` without proper sanitization
These patterns are particularly relevant in Axum because the framework encourages explicit routing and does not automatically obscure internal paths in error messages. Attackers can probe for these endpoints using common directory traversal payloads or by enumerating common training data locations.
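The enumeration step can be sketched as a simple wordlist expansion against a target base URL (the path list below is illustrative, not an exhaustive scanner wordlist):

```rust
// Hypothetical wordlist of common training-data locations to probe.
const CANDIDATE_PATHS: &[&str] = &[
    "/training-data",
    "/model-checkpoints",
    "/debug/vars",
];

/// Build the full URLs a scanner would request against a target base URL.
fn probe_urls(base: &str) -> Vec<String> {
    CANDIDATE_PATHS
        .iter()
        .map(|p| format!("{}{}", base.trim_end_matches('/'), p))
        .collect()
}

fn main() {
    for url in probe_urls("http://localhost:3000/") {
        println!("{url}");
    }
}
```

A real scanner would issue requests to each URL and inspect status codes and bodies; only the URL construction is shown here.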
```rust
// Example of an unsafe debug endpoint in Axum
use axum::http::StatusCode;
use axum::response::{Html, IntoResponse};
use axum::routing::get;
use axum::{Json, Router};

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/debug/vars", get(debug_handler))
        .route("/training-data", get(training_data_handler));

    // axum 0.7 style: bind a TcpListener and serve the router.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

async fn debug_handler() -> impl IntoResponse {
    // Unsafe: directly exposing environment variables
    let mut response = serde_json::Map::new();
    response.insert(
        "training_dataset_path".to_string(),
        serde_json::Value::String(std::env::var("TRAINING_DATA_PATH").unwrap_or_default()),
    );
    response.insert(
        "model_version".to_string(),
        serde_json::Value::String(std::env::var("MODEL_VERSION").unwrap_or_default()),
    );
    (StatusCode::OK, Json(response))
}

async fn training_data_handler() -> impl IntoResponse {
    // Vulnerable: directory listing of training data
    let paths = ["/training-data/v1", "/training-data/v2", "/model-checkpoints/latest"];
    (StatusCode::OK, Html(paths.join(" ")))
}
```
These code patterns are frequently observed in development-stage deployments where security hardening is overlooked. The presence of such endpoints creates direct avenues for attackers to enumerate training datasets, potentially exposing proprietary model architectures or sensitive data sources.
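Where a debug endpoint is genuinely needed during development, one mitigation is to allowlist the keys it may return and redact everything else. The following is a minimal sketch of that idea; the `ALLOWED_DEBUG_KEYS` list and `sanitize_debug_vars` helper are hypothetical names, not part of Axum:

```rust
use std::collections::HashSet;

// Hypothetical allowlist: only these keys may appear with values in a debug response.
const ALLOWED_DEBUG_KEYS: &[&str] = &["service_name", "build_commit"];

/// Filter config key/value pairs: allowlisted keys pass through,
/// everything else (dataset paths, model versions, etc.) is redacted.
fn sanitize_debug_vars(vars: &[(&str, &str)]) -> Vec<(String, String)> {
    let allowed: HashSet<&str> = ALLOWED_DEBUG_KEYS.iter().copied().collect();
    vars.iter()
        .map(|(k, v)| {
            if allowed.contains(k) {
                (k.to_string(), v.to_string())
            } else {
                // Keep the key visible for debugging, but never its value.
                (k.to_string(), "[redacted]".to_string())
            }
        })
        .collect()
}

fn main() {
    let raw = [
        ("service_name", "inference-api"),
        ("training_dataset_path", "/var/lib/training/data/v2"),
    ];
    for (k, v) in sanitize_debug_vars(&raw) {
        println!("{k}={v}");
    }
}
```

An Axum handler could call such a helper before serializing its response, so even a forgotten debug route never leaks raw environment values.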
Detection of training data exposure requires scanning both the API surface and the underlying infrastructure. middleBrick identifies these risks through its Input Validation and Data Exposure checks, which analyze response bodies for patterns like "training_dataset", "model_checkpoint", or "dataset_source" in JSON fields, and detect directory-listing responses that contain multiple sequential file references. The scanner also cross-references OpenAPI specifications for documentation of training-related fields that lack proper sanitization.
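The body-scanning step can be sketched as a case-insensitive substring check over a response body. The marker list here is illustrative only, not middleBrick's actual rule set:

```rust
// Hypothetical markers suggesting training-data exposure in a response body.
const EXPOSURE_MARKERS: &[&str] = &["training_dataset", "model_checkpoint", "dataset_source"];

/// Return the markers found in a response body, if any.
fn find_exposure_markers(body: &str) -> Vec<&'static str> {
    let lower = body.to_lowercase();
    EXPOSURE_MARKERS
        .iter()
        .copied()
        .filter(|m| lower.contains(*m))
        .collect()
}

fn main() {
    let body = r#"{"training_dataset_path": "/var/lib/training/data/v2"}"#;
    let hits = find_exposure_markers(body);
    println!("flagged markers: {hits:?}");
}
```

A production scanner would combine such markers with structural signals (e.g., directory-listing heuristics) to reduce false positives; this sketch shows only the marker-matching core.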