Unicode Normalization in Flask with CockroachDB
Unicode Normalization in Flask with CockroachDB: how this specific combination creates or exposes the vulnerability
Unicode normalization inconsistencies between Flask request handling and CockroachDB string comparison can create authentication bypass and data exposure vulnerabilities. Flask hands request data to the application as Python strings without normalizing them, so unless the application normalizes explicitly, the raw code point sequence reaches the SQL query unchanged. CockroachDB does not normalize text either: it stores the code points it receives and, by default, compares strings by their encoded bytes, so two canonically equivalent strings are treated as different values.
For example, the character 'é' can be represented as the single code point U+00E9 or as the decomposed sequence 'e' + U+0301. If Flask does not normalize incoming usernames or passwords, the same visible string can match or miss a stored value depending on which representation the client sends. An attacker can exploit this wherever one layer (for example, a uniqueness or authorization check in Python) treats the two forms as equal and the database lookup does not, or the reverse. This becomes an IDOR-related issue when user-controlled identifiers such as usernames or API keys are involved.
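The two representations of 'é' described above can be checked directly in Python. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # 'é' as the single code point U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The strings render identically but are different code point sequences,
# so a plain equality check (and a byte-wise database comparison) fails.
raw_equal = composed == decomposed
print(raw_equal)                         # False
print(len(composed), len(decomposed))    # 1 2
print(composed.encode("utf-8"))          # b'\xc3\xa9'
print(decomposed.encode("utf-8"))        # b'e\xcc\x81'

# Normalizing both sides to the same form (NFC here) makes them equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)                         # True
```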
The combination is risky because:
- Flask may pass raw Unicode strings to CockroachDB via SQL queries or ORM layers without normalization.
- CockroachDB does not normalize on its side; default string comparison is exact, so a value the application checked in one form may not match what is stored in another, leading to mismatches between what the application expects and what the database returns.
- When visually identical values coexist in a table, search and filtering operations may return the wrong record, enabling privilege escalation or data leakage.
In security testing, this pattern is observable in the BOLA/IDOR and Input Validation checks. An unauthenticated attacker could enumerate users by supplying canonically equivalent but non-identical Unicode strings and observing whether the application behaves differently depending on how CockroachDB matches them.
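A test for this behavior needs canonically equivalent spellings of the same probe value. The sketch below generates them with the standard library; `equivalent_variants` is a hypothetical helper name, and sending each variant to the endpoint under test and comparing the responses is left to the test harness.

```python
import unicodedata
from typing import List


def equivalent_variants(value: str) -> List[str]:
    """Return the distinct NFC and NFD spellings of value, in that order."""
    variants: List[str] = []
    for form in ("NFC", "NFD"):
        candidate = unicodedata.normalize(form, value)
        if candidate not in variants:
            variants.append(candidate)
    return variants


# A name containing an accented character yields two distinct spellings;
# a pure-ASCII name yields only one, so it is not a useful probe.
probes = equivalent_variants("jos\u00e9")
ascii_probes = equivalent_variants("alice")
print(len(probes), len(ascii_probes))
```

If the application returns different status codes or bodies for the variants of one probe, input is reaching the database without consistent normalization.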
CockroachDB-Specific Remediation in Flask: concrete code fixes
Remediation focuses on applying one Unicode normalization form consistently before any string is sent to CockroachDB, and on validating input against the expected canonical form. Use Python's unicodedata module to normalize incoming data, and apply the same normalization to any string literals used in SQL statements.
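Validation against the canonical form can either normalize silently or reject input that is not already normalized. The following sketch shows both modes; `canonicalize` and its `strict` flag are hypothetical names, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def canonicalize(raw: object, *, strict: bool = False) -> str:
    """Return raw in NFC form.

    With strict=True, input that is not already NFC is rejected instead of
    being rewritten, which makes unexpected client encodings visible.
    """
    if not isinstance(raw, str):
        raise TypeError("expected a string")
    if strict and not unicodedata.is_normalized("NFC", raw):
        raise ValueError("input is not in NFC form")
    return unicodedata.normalize("NFC", raw)


# Lenient mode rewrites the decomposed form to the composed one.
lenient = canonicalize("e\u0301")

# Strict mode refuses the same input.
try:
    canonicalize("e\u0301", strict=True)
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

Lenient mode suits login and lookup paths, where users should not be locked out by their keyboard or client; strict mode may fit machine-generated identifiers such as API keys, where a non-canonical value is more likely to indicate tampering.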
Example: Normalizing user input before database operations
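import unicodedata

import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)


def normalize_unicode(value: str) -> str:
    """Normalize to NFC form, recommended for consistent storage and comparison."""
    return unicodedata.normalize('NFC', value)


@app.route('/login', methods=['POST'])
def login():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        return jsonify({'status': 'invalid request'}), 400

    raw_username = data.get('username', '')
    raw_password = data.get('password', '')
    # unicodedata.normalize only accepts str, so reject other JSON types first
    if not isinstance(raw_username, str) or not isinstance(raw_password, str):
        return jsonify({'status': 'invalid credentials'}), 401

    username = normalize_unicode(raw_username)
    password = normalize_unicode(raw_password)

    conn = psycopg2.connect(
        host='your-cockroachdb-host',
        port=26257,
        dbname='yourdb',
        user='youruser',
        password='yourpassword'
    )
    try:
        with conn.cursor() as cur:
            # Use parameterized queries to avoid SQL injection.
            # Assumes the cluster provides a pgcrypto-compatible crypt() function.
            cur.execute(
                'SELECT id, username FROM users WHERE username = %s AND password_hash = crypt(%s, password_hash)',
                (username, password)
            )
            user = cur.fetchone()
    finally:
        # Close the connection even if the query raises
        conn.close()

    if user:
        return jsonify({'status': 'ok', 'user_id': user[0]})
    return jsonify({'status': 'invalid credentials'}), 401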
Example: Normalizing identifiers in API endpoints
When user-supplied identifiers such as tenant IDs or API keys are used in CockroachDB queries, normalize them before constructing the query:
import unicodedata

import psycopg2
from flask import Flask, g, jsonify, request

app = Flask(__name__)


def normalize_identifier(value: str) -> str:
    """Normalize an identifier to NFC so it matches the form stored in the database."""
    return unicodedata.normalize('NFC', value)


@app.before_request
def resolve_tenant():
    # Normalize once per request and share the result through flask.g
    raw_tenant_id = request.headers.get('X-Tenant-ID', '')
    g.tenant_id = normalize_identifier(raw_tenant_id)


@app.route('/data')
def get_tenant_data():
    conn = psycopg2.connect(
        host='your-cockroachdb-host',
        port=26257,
        dbname='yourdb',
        user='youruser',
        password='yourpassword'
    )
    try:
        with conn.cursor() as cur:
            # Parameterized query: the normalized tenant ID is passed as data, not SQL
            cur.execute(
                'SELECT sensitive_info FROM tenant_data WHERE tenant_id = %s',
                (g.tenant_id,)
            )
            result = cur.fetchone()
    finally:
        # Close the connection even if the query raises
        conn.close()

    if result:
        return jsonify({'data': result[0]})
    return jsonify({'error': 'not found'}), 404
Database-side considerations
CockroachDB stores text exactly as the code point sequence provided at insertion and does not normalize it. A query that compares normalized input against non-normalized stored data will therefore fail to match, even when the two strings look identical. To avoid this, ensure that:
- All incoming strings are normalized to a consistent form (typically NFC) in the application layer before any database operation.
- Any search or comparison involving user-controlled strings applies the same normalization to both sides of the comparison.
- If you rely on ORM behavior, verify that the ORM does not alter Unicode representation before sending queries to CockroachDB.
These steps reduce the risk of authentication bypass and IDOR, and help avoid inconsistent authorization checks that depend on string equality in CockroachDB.
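Rows written before normalization was introduced may still hold non-NFC values, and some of them may collide once normalized. The sketch below audits a list of stored values for such collisions before they are rewritten; `find_normalization_collisions` is a hypothetical helper, and the sample list stands in for the result of a query such as `SELECT username FROM users`.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def find_normalization_collisions(values: Iterable[str]) -> Dict[str, List[str]]:
    """Group values by NFC form and return groups with more than one raw spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[unicodedata.normalize("NFC", value)].add(value)
    return {nfc: sorted(raw) for nfc, raw in groups.items() if len(raw) > 1}


# Sample data standing in for stored usernames: two spellings of the same
# visible name plus one unaffected ASCII name.
stored = ["jos\u00e9", "jose\u0301", "alice"]
collisions = find_normalization_collisions(stored)
print(len(collisions))
```

Any group reported here needs a manual decision (merge, rename, or disable one account) before the stored values are normalized, since rewriting both rows to NFC would otherwise produce duplicates.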