Training Data Extraction in Axum
Training data extraction occurs when an API unintentionally reveals sensitive information used to train machine learning models or proprietary datasets. In Axum, this typically happens through endpoints that expose internal metadata, training logs, or configuration files that reference dataset paths, model versions, or training parameters. Unlike generic frameworks, Axum's routing and middleware system makes certain patterns more visible, especially when developers use debug handlers or development-stage logging.
Common manifestations include:

- Endpoints that return directory listings of training data directories (e.g., `/training-data` or `/model-checkpoints`)
- Debug routes that render environment variables or configuration objects containing dataset identifiers
- Error responses that echo stack traces containing paths like `/var/lib/training/data/v2`
- OpenAPI specifications that document responses with fields like `training_dataset_path` or `model_source_repo` without proper sanitization
These patterns are particularly relevant in Axum because the framework encourages explicit routing and does not automatically obscure internal paths in error messages. Attackers can probe for these endpoints using common directory traversal payloads or by enumerating common training data locations.
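The enumeration step can be sketched as a simple wordlist expansion against a target base URL (the path list below is illustrative, not an exhaustive scanner wordlist):

```rust
// Hypothetical wordlist of common training-data locations to probe.
const CANDIDATE_PATHS: &[&str] = &[
    "/training-data",
    "/model-checkpoints",
    "/debug/vars",
];

/// Build the full URLs a scanner would request against a target base URL.
fn probe_urls(base: &str) -> Vec<String> {
    CANDIDATE_PATHS
        .iter()
        .map(|p| format!("{}{}", base.trim_end_matches('/'), p))
        .collect()
}

fn main() {
    for url in probe_urls("http://localhost:3000/") {
        println!("{url}");
    }
}
```

A real scanner would issue requests to each URL and inspect status codes and bodies; only the URL construction is shown here.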
```rust
// Example of an unsafe debug endpoint in Axum
use axum::http::StatusCode;
use axum::response::{Html, IntoResponse};
use axum::routing::get;
use axum::{Json, Router};

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/debug/vars", get(debug_handler))
        .route("/training-data", get(training_data_handler));

    // axum 0.7 style: bind a TcpListener and serve the router.
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

async fn debug_handler() -> impl IntoResponse {
    // Unsafe: directly exposing environment variables
    let mut response = serde_json::Map::new();
    response.insert(
        "training_dataset_path".to_string(),
        serde_json::Value::String(std::env::var("TRAINING_DATA_PATH").unwrap_or_default()),
    );
    response.insert(
        "model_version".to_string(),
        serde_json::Value::String(std::env::var("MODEL_VERSION").unwrap_or_default()),
    );
    (StatusCode::OK, Json(response))
}

async fn training_data_handler() -> impl IntoResponse {
    // Vulnerable: directory listing of training data
    let paths = ["/training-data/v1", "/training-data/v2", "/model-checkpoints/latest"];
    (StatusCode::OK, Html(paths.join(" ")))
}
```
These code patterns are frequently observed in development-stage deployments where security hardening is overlooked. The presence of such endpoints creates direct avenues for attackers to enumerate training datasets, potentially exposing proprietary model architectures or sensitive data sources.
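Where a debug endpoint is genuinely needed during development, one mitigation is to allowlist the keys it may return and redact everything else. The following is a minimal sketch of that idea; the `ALLOWED_DEBUG_KEYS` list and `sanitize_debug_vars` helper are hypothetical names, not part of Axum:

```rust
use std::collections::HashSet;

// Hypothetical allowlist: only these keys may appear with values in a debug response.
const ALLOWED_DEBUG_KEYS: &[&str] = &["service_name", "build_commit"];

/// Filter config key/value pairs: allowlisted keys pass through,
/// everything else (dataset paths, model versions, etc.) is redacted.
fn sanitize_debug_vars(vars: &[(&str, &str)]) -> Vec<(String, String)> {
    let allowed: HashSet<&str> = ALLOWED_DEBUG_KEYS.iter().copied().collect();
    vars.iter()
        .map(|(k, v)| {
            if allowed.contains(k) {
                (k.to_string(), v.to_string())
            } else {
                // Keep the key visible for debugging, but never its value.
                (k.to_string(), "[redacted]".to_string())
            }
        })
        .collect()
}

fn main() {
    let raw = [
        ("service_name", "inference-api"),
        ("training_dataset_path", "/var/lib/training/data/v2"),
    ];
    for (k, v) in sanitize_debug_vars(&raw) {
        println!("{k}={v}");
    }
}
```

An Axum handler could call such a helper before serializing its response, so even a forgotten debug route never leaks raw environment values.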
Detection of training data exposure requires scanning both the API surface and the underlying infrastructure. middleBrick identifies these risks through its Input Validation and Data Exposure checks, which analyze response bodies for patterns like "training_dataset", "model_checkpoint", or "dataset_source" in JSON fields, and detect directory-listing responses that contain multiple sequential file references. The scanner also cross-references OpenAPI specifications for documentation of training-related fields that lack proper sanitization.
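The body-scanning step can be sketched as a case-insensitive substring check over a response body. The marker list here is illustrative only, not middleBrick's actual rule set:

```rust
// Hypothetical markers suggesting training-data exposure in a response body.
const EXPOSURE_MARKERS: &[&str] = &["training_dataset", "model_checkpoint", "dataset_source"];

/// Return the markers found in a response body, if any.
fn find_exposure_markers(body: &str) -> Vec<&'static str> {
    let lower = body.to_lowercase();
    EXPOSURE_MARKERS
        .iter()
        .copied()
        .filter(|m| lower.contains(*m))
        .collect()
}

fn main() {
    let body = r#"{"training_dataset_path": "/var/lib/training/data/v2"}"#;
    let hits = find_exposure_markers(body);
    println!("flagged markers: {hits:?}");
}
```

A production scanner would combine such markers with structural signals (e.g., directory-listing heuristics) to reduce false positives; this sketch shows only the marker-matching core.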