Unicode Normalization in FastAPI with JWT Tokens
How this specific combination creates or exposes the vulnerability
Unicode normalization becomes a security concern in FastAPI when JWT tokens are used for authentication and the framework or underlying libraries do not canonicalize input before processing. An attacker can exploit normalization differences to bypass authentication or elevate privileges by submitting semantically equivalent but byte-distinct identifiers.
Consider a FastAPI endpoint that validates a JWT and uses a claim such as sub or a username to look up a user. If the application compares the decoded subject without normalizing to a standard form (e.g., NFC), an attacker can register or authenticate with a crafted identity that appears identical to the application but has a different byte representation. For example, the character é can be represented as a single code point U+00E9 or as a combination of e (U+0065) followed by a combining acute accent U+0301. Without normalization, these two forms are treated as different strings, enabling token confusion attacks where a token issued for one identity is accepted as another.
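The two representations above can be checked directly with Python's standard unicodedata module:

```python
import unicodedata

nfc = "\u00e9"    # é as the single code point U+00E9
nfd = "e\u0301"   # e (U+0065) followed by the combining acute accent (U+0301)

# The strings render identically but are different byte sequences
print(nfc == nfd)                                # False
print(nfc.encode(), nfd.encode())                # b'\xc3\xa9' b'e\xcc\x81'

# After normalizing both to NFC, they compare equal
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```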
In the context of JWTs, this typically surfaces in two places: token validation and user identity resolution. During validation, some libraries compare claims such as iss or aud as raw strings before any normalization, so a claim that is canonically equivalent to, but byte-distinct from, the expected value can slip through mixed-normalization checks. In FastAPI, route dependencies that decode the token and retrieve a user from a database or an in-memory store may compare usernames or IDs without normalization, enabling a BOLA/IDOR-like bypass when the lookup logic is normalization-sensitive.
An example attack scenario: a user registers with the username álice, stored in NFC as \u00e1lice. An attacker registers a second account, or supplies a username claim, using the NFD form a\u0301lice (a plain a followed by the combining acute accent U+0301), which normalizes to the same string. If one code path normalizes while another compares raw bytes (for example, registration accepts the NFD form as a distinct account but user lookup normalizes to NFC), the attacker's valid token resolves to álice's account despite the attacker never having registered that identity. Similar issues can occur with email addresses, scopes, or custom roles encoded in the token payload.
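A minimal sketch of the mismatch, assuming a hypothetical in-memory store keyed by NFC usernames while registration never rejected canonically equivalent duplicates:

```python
import unicodedata

# Hypothetical store: usernames were saved in NFC at registration time
USERS = {"\u00e1lice": {"roles": ["admin"]}}  # "álice" in NFC

def resolve_user(claim: str):
    # Lookup normalizes to NFC, but nothing upstream guaranteed that
    # canonically equivalent identities map to distinct accounts
    return unicodedata.normalize("NFC", claim), USERS.get(unicodedata.normalize("NFC", claim))

# A token claim in NFD ("a" + U+0301 + "lice") resolves to the
# pre-existing NFC account even though the raw strings differ:
normalized, user = resolve_user("a\u0301lice")
print(user)  # {'roles': ['admin']}
```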
This combination also intersects with input validation and property authorization checks. If authorization logic relies on string equality or prefix checks on unnormalized identifiers, attackers can exploit canonicalization mismatches to access resources they should not reach. Because JWTs are often self-contained, developers may assume claims are trustworthy and skip normalization, increasing the risk of subtle bypasses.
To detect this class of issue, scans should test identity resolution paths and token validation logic using canonically equivalent but differently encoded inputs, verifying that access controls and authentication decisions remain consistent regardless of representation.
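One way to generate such probes, sketched here with the standard library only, is to emit every canonical normalization form of a test identity and verify that the application treats them all consistently:

```python
import unicodedata

def canonical_variants(value: str) -> set[str]:
    # Canonically equivalent encodings of the same identifier;
    # any pair in this set should be accepted or rejected identically
    return {unicodedata.normalize(form, value) for form in ("NFC", "NFD")}

probes = canonical_variants("alic\u00e9")  # "alicé"
print(len(probes))  # 2: two distinct strings that render identically
```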
JWT-Specific Remediation in FastAPI — concrete code fixes
Remediation centers on canonicalizing identifiers before comparison and ensuring JWT handling code treats normalized forms as the source of truth. In FastAPI, this means normalizing usernames, emails, and other identity claims before lookup and validation, and avoiding raw string comparisons.
Below is a concrete example of secure JWT authentication in FastAPI with Unicode normalization applied. The approach uses the unicodedata module to normalize strings to NFC and integrates with PyJWT for token decoding. Dependencies are expressed in requirements.txt for reproducibility.
```
# requirements.txt
fastapi==0.110.0
uvicorn==0.27.0
pyjwt==2.8.0
```

```python
import unicodedata

import jwt
from fastapi import FastAPI, Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

app = FastAPI()
security_scheme = HTTPBearer()

SECRET_KEY = "your-secret-key"
ALGORITHM = "HS256"

# Normalize identifiers to NFC before storage and comparison
def normalize_identity(value: str) -> str:
    return unicodedata.normalize("NFC", value.strip())

# Example user store (in practice, use a database)
USERS = {
    "alice": {"username": "alice", "roles": ["read"]},
    "bob": {"username": "bob", "roles": ["read", "write"]},
}

def decode_token(credentials: HTTPAuthorizationCredentials = Depends(security_scheme)) -> dict:
    try:
        payload = jwt.decode(credentials.credentials, SECRET_KEY, algorithms=[ALGORITHM])
        return payload
    except jwt.ExpiredSignatureError:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Token expired",
            headers={"WWW-Authenticate": "Bearer"},
        )
    except jwt.InvalidTokenError:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid token",
            headers={"WWW-Authenticate": "Bearer"},
        )

def get_current_user(payload: dict = Depends(decode_token)) -> dict:
    # Normalize the identity claim from the token before any lookup
    username_raw = payload.get("username")
    if not isinstance(username_raw, str):
        raise HTTPException(status_code=401, detail="Invalid username claim")
    username = normalize_identity(username_raw)
    user = USERS.get(username)
    if user is None:
        raise HTTPException(status_code=401, detail="User not found")
    return user

@app.get("/profile")
def read_profile(user: dict = Depends(get_current_user)):
    return {"message": f"Hello {user['username']}", "roles": user["roles"]}
```
Key points in this remediation:
- Normalization is applied to the incoming identity claim (username) before any lookup.
- The normalization function uses NFC, a widely recommended form for consistency.
- Storage and comparison use normalized values, preventing mismatch between token payload and stored identifiers.
- Token decoding remains separate and does not skip validation; normalization is integrated into the user resolution step.
For email-based flows, apply the same pattern to the email claim. If your authorization logic depends on scopes or roles, ensure those values are also normalized if they contain Unicode characters. When integrating with an existing codebase, audit all identity comparisons to confirm they operate on normalized data.
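For the email case, the same pattern can be extended with case folding. The normalize_email helper below is a hypothetical sketch; whether to fold case on the local part is a policy decision, since RFC 5321 technically allows case-sensitive local parts:

```python
import unicodedata

def normalize_email(value: str) -> str:
    # NFC for canonical equivalence, casefold for case-insensitive matching;
    # hypothetical helper: adjust the case policy to your requirements
    return unicodedata.normalize("NFC", value.strip()).casefold()

print(normalize_email("  Andre\u0301@Example.COM "))  # andré@example.com
```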