URL Parser
A component that breaks down a URL into its parts (protocol, domain, path, etc.). Different parsers can interpret the same URL differently, creating security issues.
Short Definition
A URL parser reads a URL and splits it into pieces (scheme, host, path). Problem: different parsers disagree about how to split the same URL, so your security check and actual request might see different destinations.
Full Definition
A URL parser is code that interprets a URL string and extracts its components. Each programming language and library has its own parser, and they don't always agree.
What parsers extract:
```
https://user:pass@example.com:8080/path?query=1#fragment
```
Parser returns:
- Scheme: https
- Username: user
- Password: pass
- Hostname: example.com
- Port: 8080
- Path: /path
- Query: query=1
- Fragment: fragment
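In Python, for instance, the standard library's urllib.parse.urlparse exposes exactly these components:

```python
from urllib.parse import urlparse

# Split the example URL into its components
p = urlparse("https://user:pass@example.com:8080/path?query=1#fragment")

print(p.scheme)    # https
print(p.username)  # user
print(p.password)  # pass
print(p.hostname)  # example.com
print(p.port)      # 8080
print(p.path)      # /path
print(p.query)     # query=1
print(p.fragment)  # fragment
```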
The problem: Ambiguous URLs
```
http://evil.com@good.com
```
- Some parsers: username=evil.com, host=good.com
- Other parsers: host=evil.com, path=@good.com
- Result: Security bypass
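The disagreement is easy to reproduce. A sketch comparing Python's RFC-following urlparse with a naive regex standing in for a lenient parser (the regex is hypothetical, not any particular library):

```python
import re
from urllib.parse import urlparse

url = "http://evil.com@good.com"

# RFC-style parser: everything before the last '@' is userinfo
print(urlparse(url).hostname)  # good.com

# Naive parser (hypothetical): grab the authority after '//',
# then take the part *before* '@' as the host
authority = re.match(r"^\w+://([^/?#]*)", url).group(1)
print(authority.split("@")[0])  # evil.com
```

A validator using the first parser and an HTTP client behaving like the second would approve good.com and connect to evil.com.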
Why It Matters
- Security decisions based on parser output
- Parser inconsistencies = bypass opportunities
- SSRF filters often rely on URL parsing
- Same URL can mean different things to different systems
- Hard to test all parser behaviors
How Attackers Use It
Parser Confusion Attack:
```python
# Security validation
parsed = urlparse(user_input)
if parsed.hostname == "trusted-site.com":
    fetch(user_input)  # Approved

# Attacker provides:
# "http://trusted-site.com@evil.com"

# If the validator's parser stops at '@': hostname = "trusted-site.com" ✓
# If the HTTP library's parser takes the host after '@': connects to "evil.com" ✗
# (which parser sees which depends on the libraries involved)
```
Real attack vectors:
1. @ symbol confusion:
```
http://whitelisted.com@attacker.com
http://whitelisted.com%00@attacker.com
```
2. Backslash vs forward slash:
```
http://whitelisted.com\.attacker.com
http://whitelisted.com/...attacker.com
```
3. Dot manipulation:
```
http://whitelisted.com.attacker.com
http://whitelisted.com。attacker.com   (full-width dot)
```
4. Encoding tricks:
```
http://whitelisted.com%2f%2f@attacker.com
http://whitelisted.com%09@attacker.com
```
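A quick way to see what any single parser makes of these payloads is to run them through it and inspect the reported hostname. Here urlparse is just one example; the point is to diff this output against every other parser in your stack:

```python
from urllib.parse import urlparse

payloads = [
    "http://whitelisted.com@attacker.com",        # '@' confusion
    "http://whitelisted.com.attacker.com",        # dot manipulation
    "http://whitelisted.com%2f%2f@attacker.com",  # encoding trick
]

for url in payloads:
    # urlparse takes the part after the last '@' in the authority
    # as the host, so both '@' payloads report attacker.com
    print(f"{url} -> {urlparse(url).hostname}")
```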
How to Detect or Prevent It
Prevention strategies:
1. Use same parser for validation and request:
```python
# Bad: different parsers
if urllib.parse.urlparse(url).hostname in whitelist:  # Parser A
    requests.get(url)  # Parser B - may disagree!

# Better: same parser
from requests import Request, Session

session = Session()
prepared_url = session.prepare_request(Request('GET', url)).url
# Now validate prepared_url - the URL that will actually be sent
```
2. Resolve and validate IP:
```python
import socket
from urllib.parse import urlparse

hostname = urlparse(url).hostname
ip = socket.gethostbyname(hostname)
if is_private_ip(ip) or ip == "169.254.169.254":  # block cloud metadata
    block()
```
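The is_private_ip helper is left undefined above; a minimal sketch using the standard ipaddress module (the function name matches the snippet above, the implementation is an assumption):

```python
import ipaddress

def is_private_ip(ip_str: str) -> bool:
    """True for addresses an SSRF filter should normally block."""
    ip = ipaddress.ip_address(ip_str)
    # is_private covers the RFC 1918 ranges; is_link_local covers
    # 169.254.0.0/16, where cloud metadata (169.254.169.254) lives
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

Even with this check, resolve-then-connect is subject to DNS rebinding: the IP you validated may not be the IP the HTTP library later connects to.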
3. Use strict parsing modes:
```python
# urllib has no true strict mode; the closest option is to
# disable features you don't need, e.g. fragment splitting
parsed = urllib.parse.urlparse(url, allow_fragments=False)
```
4. Canonical form checking:
```python
# Reject URLs with suspicious patterns
suspicious = ['@', '\\', '%00', '\t', '\n']
if any(token in url for token in suspicious):
    reject()
```
Detection:
- Log URLs before and after parsing
- Compare parser outputs across libraries
- Alert on URLs with unusual characters
- Test with parser confusion payloads
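The last two points can be combined into a simple pre-filter that flags URLs before any parser touches them; a sketch (the token list is an assumption, tune it for your traffic):

```python
import re

# Tokens rarely seen in legitimate outbound URLs but common in
# parser-confusion payloads; extend for your environment
UNUSUAL = re.compile(r"[@\\\t\r\n]|%00|%09|%2f", re.IGNORECASE)

def should_alert(url: str) -> bool:
    """Flag a URL for review before it reaches any parser."""
    return bool(UNUSUAL.search(url))
```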
Common Misconceptions
- "URL parsing is standardized" - Two competing standards (RFC 3986 and WHATWG URL) and many divergent implementations
- "My library handles it correctly" - Test to be sure
- "Validation + encoding fixes it" - Parser runs first
- "Modern parsers are safe" - New bypasses found regularly
- "Just check the domain" - Domain extraction is the vulnerable part
Real-World Example
Orange Tsai's Research (2017)
Discovered parser inconsistencies in:
- Python (urllib, urllib2, requests)
- PHP (parse_url)
- Java (java.net.URL)
- JavaScript (URL API)
Python SSRF Bypass:
```python
# Validator uses urlparse
url = "http://localhost%09@good.com"
urlparse(url).hostname  # Returns: "good.com"

# requests library connects to localhost
# (the tab character %09 confuses its parser)
```
Shopify SSRF (2018)
```ruby
# Ruby URI.parse
url = "http://0x7f.1:22/"
URI.parse(url).host  # Returns: "0x7f.1"

# But hex "0x7f.1" resolves to 127.0.0.1
# Bypassed localhost block
```
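The same trick reproduces in Python: the C-library inet_aton routine, which many network stacks eventually hit, accepts hexadecimal and partial dotted forms, so a host that string-comparison filters never match still lands on loopback. A sketch (assumes a glibc-style inet_aton, as on Linux):

```python
import socket

# "0x7f.1" is not the string "127.0.0.1", so a naive
# blocklist comparison passes it...
host = "0x7f.1"

# ...but inet_aton parses hex/partial dotted-quad forms:
# 0x7f -> first octet 127, trailing 1 fills the rest
packed = socket.inet_aton(host)
print(socket.inet_ntoa(packed))  # 127.0.0.1
```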
Safari URL Parsing CVE
```
file:////evil.com/path
```
- Safari parsed as: file protocol, local path
- Actually fetched from: evil.com
- Allowed arbitrary file reads
Related Terms
URL, SSRF, Bypass, Domain, HTTP Request