[arachne] Spaces in URLs

Arachne at FreeLists---The Arachne Fan Club!

Glenn,

First and foremost, I appreciate your fix. Unfortunately, you got a bit 
overexuberant, I fear. :(

As per RFC 2396, spaces MUST be escaped as %20 to be valid in a URI. Section 
2.4.3 explains why. Your test page kinda demonstrates why, without realizing 
it... ;) RFC 2616 provides no further restrictions on spaces in URIs. The only 
mention of spaces in URIs in HTML attributes is in HTML 4.01, section 6.2, as 
was quoted to you by Joe. The stripping in that case is optional, but is 
common. Else <a href=" http://www.mysite.com/ "> would always break since 
you're unlikely to have a resource at "%20http://www.mysite.com/%20". :) And, 
of course, if you ever strip whitespace, you should do it consistently, not 
just "before any http link" or by some other weird rule.

So, based on the information from the specs, *all* your fix should do is to 
remove leading and trailing whitespace from the attribute values. It should not 
remove spaces embedded in URLs. It's perfectly valid to have a directory named 
" Files" and use an href like this, <link href="%20Files">, pointing to it. 
However, <link href=" Files"> MAY (standards-definition) link to "Files", per 
the HTML specs. (And probably will in practically any browser.) Of course, it's 
perfectly valid to send someone off to " Files/" even though most browsers 
won't do that.
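
To make the rule concrete, here is a minimal sketch of "trim the ends, leave the middle alone". It's in Python for brevity and is not Arachne's actual code; the function name and the exact set of white-space characters are assumptions made for illustration.

```python
# Hypothetical helper, for illustration only (not taken from Arachne).
# Assumed white-space set: space, tab, line feed, carriage return, form feed.
HTML_WHITESPACE = " \t\n\r\f"


def trim_attribute_value(value: str) -> str:
    """Drop leading and trailing white space from an attribute value.

    Spaces embedded inside the value are left untouched, and an
    already-escaped %20 is ordinary data, so it is never removed.
    """
    return value.strip(HTML_WHITESPACE)
```

With this, " http://www.mysite.com/ " becomes "http://www.mysite.com/" and " Files" becomes "Files", while "%20Files" and "my files/a b.htm" come back unchanged.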

So, as long as the white-space token to be removed is at the beginning or end 
of a CDATA value, there's nothing wrong with it (from the view of the standards 
or common practice). However, stripping spaces from within URIs rather than 
escaping them isn't justifiable from any specs I'm familiar with.
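
A rough sketch of the difference between escaping and stripping, again in Python rather than Arachne's own code. Both function names are made up for this example, and only the space character is handled; a real implementation would have to escape the other excluded characters too.

```python
# Hypothetical helpers, for illustration only (not taken from Arachne).

def escape_embedded_spaces(url: str) -> str:
    """Escape each space as %20, so the URL still names the same resource."""
    return url.replace(" ", "%20")


def strip_embedded_spaces(url: str) -> str:
    """Delete spaces outright. Shown only as the behaviour to avoid:
    the result names a different resource than the one intended."""
    return url.replace(" ", "")
```

For "my files/a b.htm", escaping gives "my%20files/a%20b.htm", whereas stripping gives "myfiles/ab.htm", which points somewhere else entirely.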

In any case, it seems that this fix belongs somewhere in the code other than 
the URL parser... But in the case of meta refresh, perhaps the URL parser is 
the correct place. (Since meta refresh is an odd exception to most rules... ;) )
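
For the meta refresh case, something along these lines might do. It's a sketch in Python, not Arachne's code; the function name is hypothetical, and it assumes the common content="5; url=..." form with a whole number of seconds, without handling quoted URLs or other variations.

```python
# Hypothetical helper, for illustration only (not taken from Arachne).
# Assumes content looks like "5; url=http://www.mysite.com/".

def parse_meta_refresh(content: str):
    """Split a meta refresh content value into (delay_seconds, url).

    White space around the URL is trimmed; spaces inside it are kept.
    url is None when no "url=" part is present.
    """
    delay_part, _, rest = content.partition(";")
    delay = int(delay_part.strip())  # raises ValueError on a non-numeric delay
    rest = rest.strip()
    url = None
    if rest[:3].lower() == "url":
        after_keyword = rest[3:].lstrip()
        if after_keyword.startswith("="):
            url = after_keyword[1:].strip()
    return delay, url
```

So "5; url= http://www.mysite.com/ " yields a delay of 5 and the URL "http://www.mysite.com/", with nothing inside the URL touched.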

Hope to have shed some light on this. And thanks for actually taking the time 
to fix the space issue. I recall first noticing it in 1.4 or so (actually, 
probably earlier, but I've no way to be sure anymore) and I've been manually 
fixing offending URLs since. :) So, over the next year, you may save me a full 
ten minutes. Multiply that by everyone who never had it dawn on them that not 
implementing a "MAY" in a spec may be taken as a bug, and there's some man-hours 
which will now be better spent! :)

Now that I've gotten my piece in, I'll head to sleep. :)

--Matt

P.S. I've been inactive in the Arachne community for quite some time, but I've 
been happily following its developments regardless. The problems this 
particular "fix" may cause over time urged me to delurk momentarily. :) But 
feel free to poke me regarding edge cases like this. I've practically memorized 
RFCs 2616 and 2396 as well as the HTML 4.01 spec. :)

P.P.S. If anyone's counting, my last Arachne delurk was December 02, 2005...