Training Data Extraction in Actix

How Training Data Extraction Manifests in Actix

Training data extraction attacks target endpoints that inadvertently expose sensitive information from machine learning model training datasets. In Actix web applications, this commonly occurs through misconfigured debug endpoints, overly permissive file serving routes, or improper handling of environment variables that leak paths to training data artifacts.

One specific pattern involves Actix's static file serving when combined with development-mode configuration. For example, serving the target/debug or target/release directories via actix-files can expose .pt, .pth, .npy, or checkpoint files containing model weights or training data shards. Another vector is application state shared through Actix's web::Data extractor: when that state holds references to training data paths or database connections used during model training, an introspection endpoint that echoes the state can expose them.

Consider an Actix service that mounts a debug router in development:

use actix_web::{web, App, HttpServer};
use actix_files::Files;

async fn debug_info() -> impl actix_web::Responder {
    // Accidentally exposes training data path
    format!("Training data located at: {}", std::env::var("TRAINING_DATA_PATH").unwrap_or_default())
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| {
        App::new()
            .service(web::scope("/debug").route("/info", web::get().to(debug_info)))
            .service(Files::new("/assets", "./target/debug").show_files_listing()) // Risky: lists and serves everything in the build directory
    })
    .bind(("127.0.0.1", 8080))?
    .run()
    .await
}

Here, the /debug/info endpoint leaks the value of the TRAINING_DATA_PATH environment variable, and serving ./target/debug as static files risks exposing files such as model.safetensors or dataset.csv if those artifacts are written to or copied into the build directory. Because show_files_listing() is enabled, attackers can browse the directory index and download whatever it contains, which may include PII, proprietary labels, or sensitive source material used in model development.
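One way to keep a diagnostic endpoint useful without echoing the path is to report only whether the variable is set. The following is a minimal sketch using only the standard library; describe_setting is a hypothetical helper, not part of Actix, and its String result is meant as something a handler like debug_info could return.

```rust
use std::env;

/// Reports whether a setting is present without revealing its value.
/// `describe_setting` is a hypothetical helper used for illustration.
fn describe_setting(name: &str, value: Option<&str>) -> String {
    match value {
        // A non-empty value exists, but only that fact is reported.
        Some(v) if !v.is_empty() => format!("{}: configured", name),
        // Missing or empty values are reported the same way.
        _ => format!("{}: not set", name),
    }
}

fn main() {
    // Read the variable, but pass only a borrowed view to the formatter.
    let raw = env::var("TRAINING_DATA_PATH").ok();
    println!("{}", describe_setting("TRAINING_DATA_PATH", raw.as_deref()));
}
```

The response still tells an operator whether configuration was loaded, but it no longer gives an attacker a file system location to target.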

Actix-Specific Detection

Detecting training data exposure in Actix requires scanning for both information disclosure vectors and file system access patterns. middleBrick identifies these risks through unauthenticated black-box checks that probe for common leakage points without needing source code or configuration.

The scanner tests for:

  • Environment variable exposure via debug endpoints (e.g., /debug/env, /config, /info)
  • Static file serving of sensitive directories (./target, ./data, ./models)
  • Directory listing enabled on routes serving build artifacts
  • File download endpoints lacking proper path validation (potential path traversal to ../../training_data)
  • Responses containing file paths, checksums, or metadata indicative of ML artifacts (e.g., strings like .ckpt, epoch_, optimizer)
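To illustrate what "proper path validation" in the list above might look like on the application side, here is a minimal standard-library sketch. The function name resolve_under and the ./public base directory are assumptions for illustration. The check is purely lexical: it assumes the requested path has already been URL-decoded, and it does not follow symlinks.

```rust
use std::path::{Component, Path, PathBuf};

/// Joins `requested` onto `base`, returning None when the request is
/// absolute or contains `..`, so the result cannot point outside `base`.
/// `resolve_under` is a hypothetical helper used for illustration.
fn resolve_under(base: &Path, requested: &str) -> Option<PathBuf> {
    let mut resolved = base.to_path_buf();
    for component in Path::new(requested).components() {
        match component {
            // Ordinary path segments are appended as-is.
            Component::Normal(part) => resolved.push(part),
            // A leading `.` refers to the base itself and is skipped.
            Component::CurDir => {}
            // ParentDir, RootDir, and Prefix could all escape `base`.
            _ => return None,
        }
    }
    Some(resolved)
}

fn main() {
    let base = Path::new("./public");
    println!("{:?}", resolve_under(base, "css/site.css"));
    println!("{:?}", resolve_under(base, "../../training_data/shard0.npy"));
}
```

Rejecting every `..` component is stricter than necessary, since a request like a/../b stays inside the base, but it keeps the rule easy to audit.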

For instance, if an Actix app serves ./target/debug via actix-files with show_files_listing() enabled, middleBrick will detect a 200 OK response listing files like model-epoch-10.pt or training_log.json and flag it as a data exposure finding. Similarly, if a GET /debug/config endpoint returns JSON with "training_data_bucket": "s3://my-company/ml-datasets", it triggers a finding under the Data Exposure check.
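The kind of response inspection described here can be approximated with a simple substring scan. This sketch is not middleBrick's implementation; the indicator list and the find_indicators function are assumptions chosen to match the examples in this section, and a real scanner would need more context to avoid false positives.

```rust
/// Substrings that hint at ML artifacts in a response body.
/// The list is illustrative, drawn from the examples in this section.
const INDICATORS: [&str; 6] = [
    ".ckpt",
    ".pt",
    ".npy",
    "epoch_",
    "optimizer",
    "training_data",
];

/// Returns every indicator found in `body`, compared case-insensitively.
/// `find_indicators` is a hypothetical helper used for illustration.
fn find_indicators(body: &str) -> Vec<&'static str> {
    let lower = body.to_lowercase();
    INDICATORS
        .iter()
        .copied()
        .filter(|needle| lower.contains(*needle))
        .collect()
}

fn main() {
    let listing = r#"<a href="model-epoch-10.pt">model-epoch-10.pt</a>"#;
    let config = r#"{"training_data_bucket": "s3://my-company/ml-datasets"}"#;
    println!("{:?}", find_indicators(listing));
    println!("{:?}", find_indicators(config));
}
```

A plain substring match is deliberately crude: .pt also matches .pth, and a word like optimizer can appear in unrelated pages, so hits are better treated as leads for review than as confirmed findings.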

These findings are presented in the middleBrick dashboard with severity, location, and remediation guidance, such as disabling file listings in production, moving static assets outside build directories, and auditing debug routes for environment variable leaks.

Teams can integrate this detection into CI using the middleBrick GitHub Action: