
Unicode Normalization in MongoDB

How Unicode Normalization Manifests in MongoDB

MongoDB stores strings as UTF‑8, but the database does not automatically apply Unicode normalization when indexing or comparing values. This means that two strings that are canonically equivalent (e.g., the letter “e” followed by a combining acute accent versus the pre‑composed “é”) are treated as different keys unless the application normalizes them before storage or query. Attackers can exploit this in several ways:

  • Authentication bypass: A login endpoint that checks username against a unique index may accept admin (U+0061 U+0064 U+006D U+0069 U+006E) but treat the fullwidth ａｄｍｉｎ (U+FF41 U+FF44 U+FF4D U+FF49 U+FF4E) as a different key if the index was built on the raw bytes. By supplying the fullwidth variant, an attacker can register a visually identical duplicate account that later confuses authorization logic once the application normalizes the value elsewhere.
  • Data‑integrity subversion: An application that relies on a unique index on email to prevent duplicate accounts can be tricked into storing two records that render identically but differ in normalization form (e.g., usë[email protected] stored once with the precomposed ë, U+00EB, and once as e followed by the combining diaeresis, U+0065 U+0308). This can lead to account takeover or privilege escalation when the application later normalizes the value for lookup or display.
  • Input‑validation evasion: Blocklist-style validation that rejects specific dangerous ASCII characters (e.g., /['"!;$]/) misses fullwidth equivalents and other Unicode characters that later normalize to the same ASCII glyph. An attacker can submit admin%EF%BC%81 (fullwidth "！", U+FF01) to slip past a filter that only looks for the ASCII exclamation mark; if the application later applies NFKC, the payload becomes admin!.
  • Query manipulation: MongoDB’s $regex operator matches against the stored string without applying any normalization. A pattern like /^admin$/i will not match the fullwidth ａｄｍｉｎ, allowing an attacker to craft a payload that evades a regex‑based WAF or application‑level blocklist while still resolving to the intended username once the application normalizes it for lookup.
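The core mismatch these attacks exploit can be shown in a few lines of Python with the standard library’s unicodedata module: canonically equivalent strings compare unequal byte-for-byte (which is what MongoDB’s index and $eq see), and fullwidth forms survive NFC but are folded by NFKC.

```python
import unicodedata

# Two canonically equivalent spellings of "café": precomposed vs. combining.
precomposed = "caf\u00e9"      # 'café' with precomposed é (U+00E9)
decomposed = "cafe\u0301"      # 'cafe' + combining acute accent (U+0301)

# Raw comparison — what an un-normalized index key sees — treats them as different.
print(precomposed == decomposed)   # False

# After NFC normalization the two collapse to the same key.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True

# Fullwidth characters are a *compatibility* difference, not a canonical one:
# NFC leaves them alone; only NFKC folds them back to ASCII.
fullwidth = "\uff41\uff44\uff4d\uff49\uff4e"   # ａｄｍｉｎ
print(unicodedata.normalize("NFC", fullwidth) == "admin")    # False
print(unicodedata.normalize("NFKC", fullwidth) == "admin")   # True
```

This is why the choice between NFC and NFKC matters: NFC alone closes the precomposed/decomposed gap but still lets fullwidth look-alikes through.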

These patterns fall under the OWASP Top 10 2021 category A03:2021‑Injection, where Unicode normalization mismatches are a well‑known bypass technique for input validation and authentication mechanisms.

MongoDB‑Specific Detection

Detecting Unicode normalization issues requires checking both the application logic and the database schema. middleBrick’s Input Validation scan includes a set of tests that send payloads with various Unicode normalization forms (NFC, NFD, fullwidth ASCII, homoglyphs) to each endpoint and observes whether the response deviates from the expected behavior. If the scanner receives a successful login or data retrieval when a normalized‑variant payload is used, it flags the endpoint as vulnerable.

In addition, middleBrick examines the MongoDB schema when an OpenAPI/Swagger spec is supplied. It looks for:

  • Unique indexes whose collation strength is not low enough to treat case and diacritic variants as equal (e.g., strength: 1; the default strength of 3 distinguishes both).
  • Fields used for authentication, authorization, or as keys in application logic that lack explicit validation rules.
  • Endpoints that perform direct string comparison ($eq) on user‑supplied values without applying a normalization step.
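As a sketch of the payload set such a scan might generate, the standard normalization forms plus a fullwidth mapping can be produced with the standard library alone. The helper below is hypothetical, for illustration only, and is not middleBrick’s actual implementation.

```python
import unicodedata


def normalization_variants(value: str) -> set[str]:
    """Generate Unicode variants of a payload for black-box testing.

    Hypothetical helper: returns the four standard normalization forms
    plus a fullwidth-ASCII rendering of the input.
    """
    variants = {
        unicodedata.normalize("NFC", value),
        unicodedata.normalize("NFD", value),
        unicodedata.normalize("NFKC", value),
        unicodedata.normalize("NFKD", value),
    }
    # Printable ASCII (U+0021-U+007E) maps to the fullwidth block
    # (U+FF01-U+FF5E) by adding the fixed offset 0xFEE0.
    fullwidth = "".join(
        chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c for c in value
    )
    variants.add(fullwidth)
    return variants


# For pure-ASCII input the four normalization forms coincide, so the
# interesting variant is the fullwidth one.
print(sorted(normalization_variants("admin")))
```

Each variant is sent to the endpoint in turn; any variant that produces the same successful response as the canonical value indicates a normalization gap.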

Example of a finding that middleBrick might return:

{
  "endpoint": "/api/v1/login",
  "method": "POST",
  "finding": "Unicode normalization bypass possible",
  "severity": "medium",
  "details": "The endpoint accepts the fullwidth username ‘ａｄｍｉｎ’ (U+FF41 U+FF44 U+FF4D U+FF49 U+FF4E) and authenticates as the user ‘admin’. No normalization is applied before querying the users collection.",
  "remediation": "Normalize usernames to NFC (or NFKC, to also fold fullwidth forms) before querying, or create a unique index with collation { locale: 'en', strength: 1 }."
}

Because the scan is unauthenticated and black‑box, it does not require any credentials or agents; it simply sends the crafted payloads and analyses the responses.

MongoDB‑Specific Remediation

Fixing Unicode normalization issues in MongoDB involves two complementary steps: normalizing data at the application layer and, where appropriate, configuring the database to treat canonically equivalent strings as equal.

1. Application‑level normalization Before storing or using any user‑provided string that participates in security decisions (usernames, emails, tokens, etc.), convert it to a Unicode Normalization Form. The most common choice is NFC (Canonical Composition) or NFKC (Compatibility Composition) if you also want to fold compatibility characters.

Example in Node.js using the built‑in String.prototype.normalize (the unorm package predates this API and exports nfc/nfd helpers rather than a normalize function, so the built‑in is preferable):

const { MongoClient } = require('mongodb');

// Normalize to NFC; pass 'NFKC' instead to also fold compatibility
// characters such as fullwidth ASCII.
const normalize = (value) => value.normalize('NFC');

async function registerUser(rawUsername, rawEmail) {
  const username = normalize(rawUsername);
  const email    = normalize(rawEmail);

  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    const db = client.db('app');
    await db.collection('users').insertOne({ username, email, createdAt: new Date() });
  } finally {
    await client.close();
  }
}

// Usage
registerUser('admin', 'Usé[email protected]').catch(console.error);

Example in Python with pymongo and the built‑in unicodedata module:

import unicodedata
from pymongo import MongoClient

def normalize(value):
    return unicodedata.normalize('NFC', value)

def register_user(raw_username, raw_email):
    username = normalize(raw_username)
    email    = normalize(raw_email)
    client = MongoClient('mongodb://localhost:27017')
    db = client.get_database('app')
    db.users.insert_one({'username': username, 'email': email})
    client.close()

# Usage
register_user('admin', 'Usé[email protected]')

2. Database‑level collation If you prefer to keep the raw strings in the database but want queries and unique indexes to treat equivalent strings as identical, create the index with a collation at an appropriate strength. Strength 1 (primary) compares base characters only, ignoring both diacritics and case; strength 2 (secondary) adds diacritic distinctions but still ignores case; strength 3 (tertiary, the MongoDB default) distinguishes diacritics and case.

Example using the MongoDB shell:

db.users.createIndex(
  { username: 1 },
  {
    unique: true,
    collation: { locale: 'en', strength: 1 }   // primary strength: e, é, and E all compare equal
  }
);

With a primary‑strength (strength: 1) collation on the unique index, an insert of admin followed by an insert of admín fails with a duplicate‑key error (E11000), preventing the bypass described earlier.

3. Validation and testing Add unit tests that attempt to register or authenticate with NFC, NFD, fullwidth, and homoglyph variants of legitimate values. Ensure the application either rejects the input (if it does not conform to the allowed character set) or treats all variants as the same account.
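A minimal sketch of such a test, assuming a hypothetical canonicalize_username helper that NFKC‑folds the input and then enforces an ASCII allow‑list. The allow‑list matters because homoglyphs such as Cyrillic ‘а’ (U+0430) survive NFKC unchanged, so normalization alone does not reject them.

```python
import unicodedata


def canonicalize_username(raw: str) -> str:
    """Hypothetical canonicalization step: NFKC-fold, then enforce an
    ASCII alphanumeric allow-list so homoglyphs cannot slip through."""
    folded = unicodedata.normalize("NFKC", raw)
    if not folded.isascii() or not folded.isalnum():
        raise ValueError("username contains disallowed characters")
    return folded


# NFKC folds the fullwidth variant back to plain ASCII...
assert canonicalize_username("\uff41\uff44\uff4d\uff49\uff4e") == "admin"

# ...while a Cyrillic homoglyph ('а', U+0430) survives NFKC and is
# rejected by the ASCII check instead.
try:
    canonicalize_username("\u0430dmin")
    raise AssertionError("homoglyph was accepted")
except ValueError:
    pass
```

The same pattern extends to NFD and mixed-form inputs: every variant of a legitimate value should either canonicalize to the same account key or be rejected outright.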

By combining application‑level normalization with, where needed, a collation‑aware unique index, you eliminate the attack surface that Unicode normalization introduces in MongoDB.

Frequently Asked Questions

Does middleBrick modify my MongoDB data to fix Unicode normalization issues?
No. middleBrick only scans the exposed API surface and reports findings. It does not alter data, indexes, or application code. Remediation must be performed by you based on the guidance provided.

Can I rely solely on a MongoDB collation to solve Unicode normalization problems?
A collation can make queries treat canonically equivalent strings as equal, but it does not prevent the storage of multiple distinct byte sequences that appear identical to users. For defense‑in‑depth, normalize the input before storage and use a collation‑backed unique index as a secondary safeguard.