URL Parser
A component that breaks down a URL into its parts (protocol, domain, path, etc.). Different parsers can interpret the same URL differently, creating security issues.
Short Definition
A URL parser reads a URL and splits it into pieces (scheme, host, path). Problem: different parsers disagree about how to split the same URL, so your security check and actual request might see different destinations.
Full Definition
A URL parser is code that interprets a URL string and extracts its components. Each programming language and library has its own parser, and they don't always agree.
What parsers extract:
```
https://user:pass@example.com:8080/path?query=1#fragment
```
Parser returns:
- Scheme: https
- Username: user
- Password: pass
- Hostname: example.com
- Port: 8080
- Path: /path
- Query: query=1
- Fragment: fragment
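In Python, for instance, the standard library's urllib.parse.urlparse exposes exactly these components:

```python
from urllib.parse import urlparse

# Split the example URL into its components
p = urlparse("https://user:pass@example.com:8080/path?query=1#fragment")

print(p.scheme)    # https
print(p.username)  # user
print(p.password)  # pass
print(p.hostname)  # example.com
print(p.port)      # 8080
print(p.path)      # /path
print(p.query)     # query=1
print(p.fragment)  # fragment
```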
The problem: Ambiguous URLs
```
http://evil.com@good.com
```
- Some parsers: username=evil.com, host=good.com
- Other parsers: host=evil.com, path=@good.com
- Result: Security bypass
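The disagreement is easy to reproduce. A sketch comparing Python's RFC-following urlparse with a naive regex standing in for a lenient parser (the regex is hypothetical, not any particular library):

```python
import re
from urllib.parse import urlparse

url = "http://evil.com@good.com"

# RFC-style parser: everything before the last '@' is userinfo
print(urlparse(url).hostname)  # good.com

# Naive parser (hypothetical): grab the authority after '//',
# then take the part *before* '@' as the host
authority = re.match(r"^\w+://([^/?#]*)", url).group(1)
print(authority.split("@")[0])  # evil.com
```

A validator using the first parser and an HTTP client behaving like the second would approve good.com and connect to evil.com.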
Why It Matters
- Security decisions based on parser output
- Parser inconsistencies = bypass opportunities
- SSRF filters often rely on URL parsing
- Same URL can mean different things to different systems
- Hard to test all parser behaviors
How Attackers Use It
Parser Confusion Attack:
```python
# Security validation
parsed = urlparse(user_input)
if parsed.hostname == "trusted-site.com":
    fetch(user_input)  # Approved

# Attacker provides:
# "http://trusted-site.com@evil.com"

# If the validator's parser stops at '@': hostname = "trusted-site.com" ✓
# If the HTTP library's parser takes the host after '@': connects to "evil.com" ✗
# (which parser sees which depends on the libraries involved)
```
Real attack vectors:
1. @ symbol confusion:
```
http://whitelisted.com@attacker.com
http://whitelisted.com%00@attacker.com
```
2. Backslash vs forward slash:
```
http://whitelisted.com\.attacker.com
http://whitelisted.com/...attacker.com
```
3. Dot manipulation:
```
http://whitelisted.com.attacker.com
http://whitelisted.com。attacker.com   (full-width dot)
```
4. Encoding tricks:
```
http://whitelisted.com%2f%2f@attacker.com
http://whitelisted.com%09@attacker.com
```
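A quick way to see what any single parser makes of these payloads is to run them through it and inspect the reported hostname. Here urlparse is just one example; the point is to diff this output against every other parser in your stack:

```python
from urllib.parse import urlparse

payloads = [
    "http://whitelisted.com@attacker.com",        # '@' confusion
    "http://whitelisted.com.attacker.com",        # dot manipulation
    "http://whitelisted.com%2f%2f@attacker.com",  # encoding trick
]

for url in payloads:
    # urlparse takes the part after the last '@' in the authority
    # as the host, so both '@' payloads report attacker.com
    print(f"{url} -> {urlparse(url).hostname}")
```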
How to Detect or Prevent It
Prevention strategies:
1. Use same parser for validation and request:
```python
# Bad: different parsers
if urllib.parse.urlparse(url).hostname in whitelist:  # Parser A
    requests.get(url)  # Parser B - may disagree!

# Better: same parser
from requests import Request, Session

session = Session()
prepared_url = session.prepare_request(Request('GET', url)).url
# Now validate prepared_url - the URL that will actually be sent
```
2. Resolve and validate IP:
```python
import socket
from urllib.parse import urlparse

hostname = urlparse(url).hostname
ip = socket.gethostbyname(hostname)
if is_private_ip(ip) or ip == "169.254.169.254":  # block cloud metadata
    block()
```
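The is_private_ip helper is left undefined above; a minimal sketch using the standard ipaddress module (the function name matches the snippet above, the implementation is an assumption):

```python
import ipaddress

def is_private_ip(ip_str: str) -> bool:
    """True for addresses an SSRF filter should normally block."""
    ip = ipaddress.ip_address(ip_str)
    # is_private covers the RFC 1918 ranges; is_link_local covers
    # 169.254.0.0/16, where cloud metadata (169.254.169.254) lives
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

Even with this check, resolve-then-connect is subject to DNS rebinding: the IP you validated may not be the IP the HTTP library later connects to.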
3. Use strict parsing modes:
```python
# urllib has no true strict mode; the closest option is to
# disable features you don't need, e.g. fragment splitting
parsed = urllib.parse.urlparse(url, allow_fragments=False)
```
4. Canonical form checking:
```python
# Reject URLs with suspicious patterns
suspicious = ['@', '\\', '%00', '\t', '\n']
if any(token in url for token in suspicious):
    reject()
```
Detection:
- Log URLs before and after parsing
- Compare parser outputs across libraries
- Alert on URLs with unusual characters
- Test with parser confusion payloads
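The last two points can be combined into a simple pre-filter that flags URLs before any parser touches them; a sketch (the token list is an assumption, tune it for your traffic):

```python
import re

# Tokens rarely seen in legitimate outbound URLs but common in
# parser-confusion payloads; extend for your environment
UNUSUAL = re.compile(r"[@\\\t\r\n]|%00|%09|%2f", re.IGNORECASE)

def should_alert(url: str) -> bool:
    """Flag a URL for review before it reaches any parser."""
    return bool(UNUSUAL.search(url))
```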
Common Misconceptions
- "URL parsing is standardized" - Two competing standards (RFC 3986 and WHATWG URL) and many divergent implementations
- "My library handles it correctly" - Test to be sure
- "Validation + encoding fixes it" - Parser runs first
- "Modern parsers are safe" - New bypasses found regularly
- "Just check the domain" - Domain extraction is the vulnerable part
Real-World Example
Orange Tsai's Research (2017)
Discovered parser inconsistencies in:
- Python (urllib, urllib2, requests)
- PHP (parse_url)
- Java (java.net.URL)
- JavaScript (URL API)
Python SSRF Bypass:
```python
# Validator uses urlparse
url = "http://localhost%09@good.com"
urlparse(url).hostname  # Returns: "good.com"

# requests library connects to localhost
# (the tab character %09 confuses its parser)
```
Shopify SSRF (2018)
```ruby
# Ruby URI.parse
url = "http://0x7f.1:22/"
URI.parse(url).host  # Returns: "0x7f.1"

# But hex "0x7f.1" resolves to 127.0.0.1
# Bypassed localhost block
```
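The same trick reproduces in Python: the C-library inet_aton routine, which many network stacks eventually hit, accepts hexadecimal and partial dotted forms, so a host that string-comparison filters never match still lands on loopback. A sketch (assumes a glibc-style inet_aton, as on Linux):

```python
import socket

# "0x7f.1" is not the string "127.0.0.1", so a naive
# blocklist comparison passes it...
host = "0x7f.1"

# ...but inet_aton parses hex/partial dotted-quad forms:
# 0x7f -> first octet 127, trailing 1 fills the rest
packed = socket.inet_aton(host)
print(socket.inet_ntoa(packed))  # 127.0.0.1
```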
Safari URL Parsing CVE
```
file:////evil.com/path
```
- Safari parsed as: file protocol, local path
- Actually fetched from: evil.com
- Allowed arbitrary file reads
Related Terms
URL, SSRF, Bypass, Domain, HTTP Request