HIGH | Axum | Rust | Training Data Extraction

Training Data Extraction in Axum (Rust)

Training Data Extraction in Axum with Rust — how this specific combination creates or exposes the vulnerability

Training data extraction in an Axum service written in Rust occurs when an application inadvertently exposes datasets, model artifacts, or intermediate data used during model training. This typically happens through debug endpoints, verbose error messages, or overly permissive file serving that allows path traversal into directories containing training corpora, preprocessing scripts, or labeled examples.

In Axum, routing and handler composition can expose sensitive training data if response serialization is too permissive. For example, a handler that returns a struct containing raw training samples will serialize every field of that struct, because #[derive(Serialize)] ignores Rust's field visibility: private fields are written out just like public ones unless they are marked #[serde(skip)] or #[serde(skip_serializing)]. Rust's strong type system reduces some risks, but it does not stop a handler from returning a training-related struct wholesale, which creates an accidental data leak.

Another vector arises from how Axum integrates with middleware and extractors. If an extractor like State holds a reference to a training dataset in memory, and a debug route exposes that state via an unrestricted endpoint, an attacker can enumerate or dump training data through crafted requests. This is especially relevant when combined with OpenAPI/Swagger introspection that reveals endpoint behavior, giving an attacker insight into how training data flows through the application.

Because middleBrick tests unauthenticated attack surfaces, it can detect endpoints that return training data by analyzing response content for patterns indicative of datasets, such as repeated token sequences or labeled examples. Findings from such scans map to the OWASP API Security Top 10 categories 'Broken Object Level Authorization' and 'Excessive Data Exposure' (API3 in the 2019 list, folded into 'Broken Object Property Level Authorization' in the 2023 list), highlighting insecure direct object references or missing authorization on data-rich endpoints. These issues are also relevant to compliance frameworks like PCI-DSS and SOC 2, where exposure of training data can reveal sensitive patterns or personally identifiable information embedded in corpora.

Using middleBrick’s LLM/AI Security checks, this unauthenticated probing can additionally detect whether model outputs or debug traces leak training data through generated text, such as memorized strings or code snippets. This is critical for Rust services where training data pipelines might feed into LLM applications, as leaked data can lead to model inversion or membership inference attacks.
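One way to reduce the risk of memorized strings reaching a client is to compare model output against known training samples before returning it. The following is a minimal sketch of that idea, not middleBrick's detection logic: the function name `leaks_verbatim_run` and the word-window threshold are assumptions made for illustration, and the comparison is a naive quadratic scan that only suits small corpora.

```rust
/// Returns true if `output` contains a run of `min_words` consecutive
/// whitespace-separated words that also appears verbatim in any corpus sample.
/// Hypothetical helper: a real service would likely use an index or hashing.
pub fn leaks_verbatim_run(output: &str, corpus: &[&str], min_words: usize) -> bool {
    // `windows(0)` panics, so treat a zero threshold as "nothing to check".
    if min_words == 0 {
        return false;
    }
    let out_words: Vec<&str> = output.split_whitespace().collect();
    corpus.iter().any(|sample| {
        let sample_words: Vec<&str> = sample.split_whitespace().collect();
        // Compare every window of the output with every window of the sample.
        out_words.windows(min_words).any(|run| {
            sample_words
                .windows(min_words)
                .any(|candidate| candidate == run)
        })
    })
}
```

A handler could call this on generated text and return a generic refusal when it yields true. The right threshold depends on the corpus, so the value used in any deployment would need tuning.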

Rust-Specific Remediation in Axum — concrete code fixes

To prevent training data exposure in Axum services written in Rust, apply strict serialization controls, endpoint hygiene, and data compartmentalization. The following examples demonstrate secure patterns.

1. Controlled Serialization with Serde

Ensure that any struct exposed through API responses explicitly controls which fields are serialized. Use #[serde(skip_serializing)] for sensitive training metadata.

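use axum::Json;
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub struct PublicResponse {
    pub prediction: String,
    // Omitted from every serialized response; still read when deserializing.
    #[serde(skip_serializing)]
    pub training_sample_id: String,
    #[serde(skip_serializing)]
    pub raw_training_data: Vec<f32>,
}

// Handler that safely returns only non-sensitive fields.
// The struct is wrapped in Json because a bare struct does not
// implement IntoResponse and would not compile as a handler return type.
async fn get_prediction() -> Json<PublicResponse> {
    Json(PublicResponse {
        prediction: "class_a".to_string(),
        training_sample_id: "internal_id_123".to_string(),
        raw_training_data: vec![],
    })
}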
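A stricter alternative to skipping fields is to give handlers a response type that never contains the sensitive fields at all, so a forgotten attribute cannot leak them. The sketch below uses only the standard library to show the conversion. The names `InferenceRecord` and `PredictionView` are hypothetical, and in an Axum service the view type would presumably also derive Serialize and be returned inside Json.

```rust
/// Internal record: holds training-related fields and never leaves the service.
pub struct InferenceRecord {
    pub prediction: String,
    pub training_sample_id: String,
    pub raw_training_data: Vec<f32>,
}

/// Public view: the only shape a handler is meant to return.
#[derive(Debug, PartialEq)]
pub struct PredictionView {
    pub prediction: String,
}

impl From<InferenceRecord> for PredictionView {
    fn from(record: InferenceRecord) -> Self {
        // Only the prediction is moved across; the sample ID and raw data
        // are dropped when `record` goes out of scope.
        PredictionView {
            prediction: record.prediction,
        }
    }
}
```

Because the view has no field for the sample ID or raw data, adding a new sensitive field to the internal record later does not change what the API returns.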

2. Isolate Training Data State

Keep training datasets in application state that is not exposed via debug or introspection routes. Use Axum’s State to hold data but avoid creating handlers that dump it.

use axum::{routing::get, Router};
use std::sync::Arc;

struct AppState {
    // Training data kept private, not cloned or exposed
    training_corpus: Arc<Vec<String>>,
}

async fn health_check() -> String {
    "OK".to_string()
}

async fn get_model_output() -> String {
    "prediction".to_string()
}

fn build_router() -> Router {
    let state = Arc::new(AppState {
        training_corpus: Arc::new(vec![]), // loaded securely elsewhere
    });

    Router::new()
        .route("/health", get(health_check))
        .route("/predict", get(get_model_output))
        .with_state(state)
}
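State that is never returned by a handler can still leak through logging if it derives Debug and ends up in a trace line. A minimal sketch of one mitigation, assuming a hypothetical `TrainingCorpus` wrapper type, is to implement Debug by hand so that formatting reveals only a sample count.

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical wrapper around the in-memory corpus. It deliberately does
/// not derive Debug, Clone of the inner data, or any serialization trait.
pub struct TrainingCorpus(Arc<Vec<String>>);

impl TrainingCorpus {
    pub fn new(samples: Vec<String>) -> Self {
        TrainingCorpus(Arc::new(samples))
    }

    /// Number of samples held; safe to expose in metrics or logs.
    pub fn sample_count(&self) -> usize {
        self.0.len()
    }
}

impl fmt::Debug for TrainingCorpus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the size only, never the sample contents.
        write!(f, "TrainingCorpus(<redacted; {} samples>)", self.0.len())
    }
}
```

With this in place, a struct such as AppState can derive Debug for convenience while the corpus field still prints as a redacted placeholder.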

3. Disable Debug Routes in Production

If using tracing or debug middleware, ensure production builds exclude verbose output that could reveal data paths. Configure logging levels to suppress payload details.

// In production configuration, avoid including debug extractors
// or middleware that log or return full request/response bodies.
// Share state through axum::extract::State behind an Arc, so each
// request clones a pointer rather than the dataset itself.
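The decision to register debug routes can be made explicit and testable instead of being left to convention. The sketch below is one possible shape under stated assumptions: the route path `/debug/state` and the opt-in value "1" are hypothetical, and the functions return plain route paths so the logic can be checked without a running server.

```rust
/// Debug routes are allowed only in a debug build AND with an explicit opt-in.
/// `opt_in` would typically come from an environment variable (assumed here).
pub fn debug_routes_allowed(is_debug_build: bool, opt_in: Option<&str>) -> bool {
    is_debug_build && opt_in == Some("1")
}

/// Returns the route paths the service should register.
pub fn enabled_routes(expose_debug: bool) -> Vec<&'static str> {
    let mut routes = vec!["/health", "/predict"];
    if expose_debug {
        // Hypothetical debug endpoint; never reachable in release builds
        // when gated through `debug_routes_allowed`.
        routes.push("/debug/state");
    }
    routes
}
```

A caller might pass `cfg!(debug_assertions)` as the first argument and the result of `std::env::var("EXPOSE_DEBUG_ROUTES").ok().as_deref()` as the second, where the variable name is an assumption. A release build then cannot expose the debug route even if the variable is set by mistake.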

4. Validate and Restrict File Serving

If serving static files, disable directory listing and restrict paths to prevent traversal into training directories.

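use axum::{extract::Path, http::StatusCode, routing::get, Router};

// Only files under this directory may be served.
const PUBLIC_ROOT: &str = "/safe/public";

// Assumes the tokio runtime that Axum already depends on.
async fn safe_file_service(Path(name): Path<String>) -> Result<Vec<u8>, StatusCode> {
    let root = tokio::fs::canonicalize(PUBLIC_ROOT)
        .await
        .map_err(|_| StatusCode::INTERNAL_SERVER_ERROR)?;
    // Canonicalize resolves `..` segments and symlinks. A plain
    // starts_with check on the joined path would accept
    // "/safe/public/../training" because it compares components lexically.
    let resolved = tokio::fs::canonicalize(root.join(&name))
        .await
        .map_err(|_| StatusCode::NOT_FOUND)?;
    if !resolved.starts_with(&root) {
        return Err(StatusCode::NOT_FOUND);
    }
    tokio::fs::read(resolved).await.map_err(|_| StatusCode::NOT_FOUND)
}

fn file_router() -> Router {
    // Axum 0.7 path syntax; axum 0.8 and later write this as "/files/{name}".
    Router::new().route("/files/:name", get(safe_file_service))
}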
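Before touching the filesystem, a handler can also reject suspicious names lexically. The following sketch, using only the standard library, accepts a requested path only if every component is a plain file or directory name. The function name `is_plain_relative_path` is hypothetical, and the check does not detect symlinks, so it complements a resolved-path containment check instead of replacing it.

```rust
use std::path::{Component, Path};

/// Returns true only for non-empty relative paths made up entirely of
/// normal components: no `..`, no leading `.`, no root, no drive prefix.
pub fn is_plain_relative_path(requested: &str) -> bool {
    let mut saw_component = false;
    for component in Path::new(requested).components() {
        match component {
            Component::Normal(_) => saw_component = true,
            // ParentDir, CurDir, RootDir and Prefix are all rejected.
            _ => return false,
        }
    }
    saw_component
}
```

Rejecting early keeps traversal attempts out of filesystem calls entirely and makes them easy to log as a distinct event.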

Frequently Asked Questions

Can middleBrick detect training data leaks in Axum services without authentication?
Yes, middleBrick scans the unauthenticated attack surface and can identify endpoints that expose training data patterns by analyzing response content, even in Rust-based Axum services.
Does middleBrick provide automatic fixes for training data exposure in Axum?
No, middleBrick detects and reports findings with remediation guidance. It does not fix, patch, or modify code. Developers must apply Rust-specific serialization and routing controls to remediate.