URLs in Internet Explorer 7

Internet Explorer 7 includes a new URL handling architecture known internally as CURI.  The new optimized URI functions provide more secure and consistent parsing of URIs to reduce attack surface and mitigate the threat of malicious URIs.

When we designed our security strategy for IE7, malicious URIs were near the top of the list because secure handling of URIs throughout IE is critical to the security of the system. Hence, we made a major architectural investment in CURI for IE7.

Unlike most of the new features in IE7, CURI works “under the hood,” and most end users will never notice it working on their behalf.  For the technical readers in the audience, however, the details behind CURI may be of some interest.

Background

Uniform Resource Locators (URLs) are one of the most important and seemingly simple concepts web users encounter.  Almost everyone recognizes a Uniform Resource Locator as the character string which allows the browser to find a website.

Uniform Resource Identifiers (a superset of URLs) were most recently formally specified in RFC3986, the fourth significant revision in the evolving definition of URIs. To quote the RFC,

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.

Pretty simple stuff, right?  Alas, as usual, the devil is in the details.

Strings vs. Objects

Given the definition of a URI, it seems natural to represent a URI as a character string. And, in fact, simple character strings are how URIs are most often stored and transferred.  For instance, to navigate to a web page, the user types a URI string into the browser’s address bar, or clicks an HTML anchor tag containing an HREF attribute whose value is a string URI.

Unfortunately, there are some downsides to using character strings to store URIs. The biggest problem is that a string is a simple data structure which only holds a sequence of characters, and contains no further information or logic for how that data should be interpreted. Having more information available about a URI is useful for a number of reasons, but security tops that list.

How the browser uses URIs

When you visit a webpage, chances are that there are dozens to hundreds of embedded resources, many of which are addressed via relative URIs. For instance, if you visit https://search.msn.com/default.aspx, the HTML source contains the following image tag:

<img src="/s/hp/bluesky_logo.gif" title="MSN" alt="MSN" height="32" width="81" />

In order for the image to be downloaded, the browser must first combine https://search.msn.com/default.aspx with /s/hp/bluesky_logo.gif to come up with a complete URI which can be downloaded: https://search.msn.com/s/hp/bluesky_logo.gif. In order to combine the base URI with the relative URI, the browser must first crack the base URI and the relative URI to retrieve their respective components.

A URI consists of multiple components, each of which helps the browser and server to retrieve the requested file.

For example, given the URI https://search.msn.com/results.aspx?q=ie7#listings

  • The scheme component is https
  • The hostname component is search.msn.com
  • The path component is /results.aspx
  • The query component is q=ie7
  • The fragment component is listings

Thus, when generating the full URI to the image, IE must combine the scheme and hostname from the base URI (https://search.msn.com) with the path of the relative URI (/s/hp/bluesky_logo.gif). As you might imagine, performing this crack-and-combine process hundreds of times per page is quite inefficient, and introduces the risk of inconsistent parsing or evaluation.
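The crack-and-combine step can be sketched with Python's standard `urllib.parse` module. This is only an illustration of the general technique using the URIs from the example above; it says nothing about how IE implements the operation internally.

```python
from urllib.parse import urlsplit, urljoin

base = "https://search.msn.com/default.aspx"
relative = "/s/hp/bluesky_logo.gif"

# Crack a URI into its components.
parts = urlsplit("https://search.msn.com/results.aspx?q=ie7#listings")
print(parts.scheme)    # https
print(parts.hostname)  # search.msn.com
print(parts.path)      # /results.aspx
print(parts.query)     # q=ie7
print(parts.fragment)  # listings

# Combine the base URI with the relative URI taken from the image tag.
full = urljoin(base, relative)
print(full)  # https://search.msn.com/s/hp/bluesky_logo.gif
```

Every call to `urlsplit` or `urljoin` re-parses its string arguments, which is exactly the repeated work that a parse-once object is meant to avoid.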

Security

All web browsers make security decisions based upon URIs.  Many security features, from Security Zones to the JavaScript same-origin policy, depend on the browser being able to consistently evaluate URIs to determine their components, and to compare them to other URIs.

If a bad guy (or gal!) can get a browser to incorrectly or inconsistently crack or combine a URI, the user’s security may be compromised.  Over the years, a significant percentage of browser patches have been issued to address exploits against URI parsing flaws, for instance the CAN-2005-0054 vulnerability in IE, or Opera’s older %2f bug, to name but two of many.

URI-parsing attacks against Internet Explorer typically attempt to trick a security function (like MapURLToZone) into evaluating an exploit URI incorrectly (for instance, by returning the wrong security zone). If the URI is zoned into a more trusted zone than it deserves, the content at that URI might execute with elevated privileges.  Other common attacks attempt to set or steal cookies from other domains, read content from one domain and send it to another, or spoof the user by displaying the URI incorrectly.
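To make this kind of attack concrete, here is a minimal Python sketch. The host names and both functions are hypothetical and exist only for illustration; they are not the logic behind MapURLToZone. The sketch contrasts a check that treats the URI as an opaque string with one that cracks the URI first.

```python
from urllib.parse import urlsplit

TRUSTED_HOST = "intranet.example"  # hypothetical trusted host


def naive_is_trusted(uri: str) -> bool:
    # Flawed: treats the URI as an opaque string and only looks at its prefix.
    return uri.startswith("https://" + TRUSTED_HOST)


def parsed_is_trusted(uri: str) -> bool:
    # Crack the URI first, then compare the real hostname component.
    return urlsplit(uri).hostname == TRUSTED_HOST


# "intranet.example" is only the userinfo part here; the real host is evil.example.
exploit_userinfo = "https://intranet.example@evil.example/payload.html"
# Here the trusted name is merely a prefix of a longer, attacker-owned host name.
exploit_prefix = "https://intranet.example.evil.example/payload.html"

for uri in (exploit_userinfo, exploit_prefix):
    print(naive_is_trusted(uri), parsed_is_trusted(uri))  # True False
```

Both exploit URIs pass the string-prefix check even though neither points at the trusted host, which is why the decision has to be made on parsed components.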

Difficulties in securely handling string-based URIs are often rooted in the fact that there are an infinite number of possible representations for a single URI, so a simple string comparison isn’t possible. RFC3986 specifies conditions in which a character in a URI is equivalent to a percent-encoded character in the format %HH, where HH is the hexadecimal-formatted integer representation of the character. Equivalence rules vary depending on where a character appears in a URI; for instance, the scheme and hostname are case-insensitive, but paths and queries are not.

The following URIs are all equivalent:

All code paths in the browser must be fully knowledgeable about the rules of URI-parsing in order to correctly evaluate a URI. Any failures could enable an attacker to circumvent security restrictions.
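As a rough illustration of why plain string comparison falls short, the Python sketch below applies a few of the RFC 3986 normalization steps (case normalization of scheme and host, decoding of percent-encoded unreserved characters, and dot-segment removal). The URIs and helper names are made up for this example, and the function is a simplified assumption rather than a complete or authoritative normalizer; for instance, it ignores default ports and userinfo.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def normalize_escapes(text: str) -> str:
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        # Decode escapes of unreserved characters; uppercase the hex of the rest.
        return char if char in UNRESERVED else match.group(0).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def remove_dot_segments(path: str) -> str:
    # Simplified for absolute paths (those starting with "/").
    out = []
    for segment in path.split("/"):
        if segment == "..":
            if len(out) > 1:  # never pop the empty root segment
                out.pop()
        elif segment != ".":
            out.append(segment)
    if path.endswith(("/.", "/..")):
        out.append("")  # keep the trailing slash
    return "/".join(out)


def normalize(uri: str) -> str:
    parts = urlsplit(uri)
    path = remove_dot_segments(normalize_escapes(parts.path)) or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # assumes no case-sensitive userinfo is present
        path,
        normalize_escapes(parts.query),
        parts.fragment,
    ))


variants = [
    "https://www.example.com/a/index.html",
    "HTTPS://WWW.EXAMPLE.COM/a/index.html",
    "https://www.example.com/%61/index.html",
    "https://www.example.com/a/b/../index.html",
    "https://www.example.com/./a/index%2Ehtml",
]
print({normalize(u) for u in variants})
# Paths are case-sensitive, so this one stays distinct from the variants above:
print(normalize("https://www.example.com/A/index.html"))
```

All five variants differ as strings yet collapse to a single normalized form, while the last URI remains distinct because only the scheme and hostname are case-insensitive.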

The Solution: CURI

CURI is a lightweight object which holds a single URI in normal form. If the CURI is constructed from a string URI, that string URI is cracked just once when the object is first constructed. After construction, callers may access any of the URI components using members provided by the object. This ensures that URIs are evaluated consistently throughout both security and feature code paths. We’ve re-plumbed Internet Explorer to accept and use CURI objects internally; most of this work has already shipped in Beta-1.
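The parse-once idea can be sketched in a few lines of Python. The class name, its fields, and the comparison method below are hypothetical stand-ins chosen for illustration; they are not the real CURI class or its API.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParsedUri:
    """Hypothetical stand-in for a CURI-like object; not the real IE API."""

    scheme: str
    host: str
    path: str
    query: str
    fragment: str

    @classmethod
    def from_string(cls, uri: str) -> "ParsedUri":
        # The string is cracked exactly once, at construction time.
        parts = urlsplit(uri)
        return cls(
            scheme=parts.scheme.lower(),
            host=(parts.hostname or "").lower(),
            path=parts.path or "/",
            query=parts.query,
            fragment=parts.fragment,
        )

    def same_scheme_and_host(self, other: "ParsedUri") -> bool:
        # Compare stored components; no string is re-parsed here.
        # Simplification: a real origin check would also consider the port.
        return (self.scheme, self.host) == (other.scheme, other.host)


page = ParsedUri.from_string("https://search.msn.com/results.aspx?q=ie7#listings")
image = ParsedUri.from_string("HTTPS://SEARCH.MSN.COM/s/hp/bluesky_logo.gif")
print(page.host, page.path, page.query)  # search.msn.com /results.aspx q=ie7
print(page.same_scheme_and_host(image))  # True
```

Because the object is immutable and every caller reads the same stored components, security checks and feature code cannot disagree about what the URI means.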

The CURI object is available for consumption by external callers like ActiveX controls and Browser Helper Objects; documentation will be provided on MSDN as the CURI class is finalized. It’s worth noting that even external code that does not directly consume CURI objects will benefit from the change, because Unicode strings serialized out of CURI objects will be consistently normalized, decreasing the likelihood of incorrect parsing even outside of IE.

Future Directions: International URIs

One advantage of the centralization provided by the CURI object is that it enables future URI-handling enhancements. In particular, working with international URIs is a key scenario for Internet Explorer 7, and the fully-Unicode CURI object is the keystone for our worldwide support. International URIs are critical to the future of the web as ever more international sites come online and more of the world’s diverse languages appear on the web.

I’m not quite ready to talk about IE7’s support for International Domain Names (IDN) yet, but expect to hear more as Beta-2 approaches. In particular, we’ll be talking about how IDNs work within existing network infrastructure, and how IE7 will mitigate the threat of Unicode homograph attacks.

- EricLaw

Comments

  • Anonymous
    January 01, 2003
    Good to know that the URI parsing/handling will be so extensible, so next step could be data URIs.
  • Anonymous
    January 01, 2003
    I hope for the new optimized URI handling architecture.
  • Anonymous
    January 01, 2003
    I find it quite shocking that only with version seven of IE has someone at MS finally figured this out! I hope this "new-found-thought" extends to other common parts of MS too!
  • Anonymous
    January 01, 2003
    Sorry if I may have missed this in your post (or previous posts) but …

    A few years back, when I was first starting out in HTML and web site building, I recall that the book I was learning from stated that when using hyperlinks or referring to images (or any address for that matter) that a relative path should always be used when the resource is hosted under the same domain, as the web browser can match them up and request the resource more effectively, but after reading your post, and talk of combining and cracking each URL makes me think differently.

    So could I ask if I should in fact be using absolute URLs for each resource within any given web page for the current fleet of IEs, or if I have just misread part of your post?

    Thanks
  • Anonymous
    January 01, 2003
    IDN support AT LAST!

    Only when MSIE has native support for it we can start to use IDN.
  • Anonymous
    January 01, 2003
    And here I was thinking, is there still somebody that doesn't do this already..
  • Anonymous
    January 01, 2003
    It is really great to hear that IE7 will support IDN, and I'm interested to hear how you're addressing the homograph problem.

    Also, will IE7 support the data: URI scheme?
  • Anonymous
    January 01, 2003
    Urgh. Does this really mean IE6 deals with URI only in string form ?!? Scary.
  • Anonymous
    January 01, 2003
    hmm, are there URLs not in string form? I'm confused…
  • Anonymous
    January 01, 2003
    I tried to find a wishlist for Internet Explorer but couldn't find one. I was wondering if the new Internet Explorer will have tooltips over links like Opera.
    This is a truly remarkable thing, you point your mouse over a link and in a tooltip you have the real address to where that link points to. With this you can hardly make the mistake of clicking a link that will take you anywhere else than the text suggests.
  • Anonymous
    January 01, 2003
    All IE7 is a knock-off version of FireFox, Netscape (owner of Mozilla, which created FireFox) has always been a better internet Browser
  • Anonymous
    January 01, 2003
    Dear Eric,

    thanks for your reaction on mine and other posts regarding the IDN-technology.

    It is so good to hear the first official MS-statement that confirms IDN will be supported by IE 7.

    I'm very curious which solutions you've chosen.

    Thanks again,

    Jean Pascal
  • Anonymous
    January 01, 2003
    Great! To-the-point talk, with references to RFCs instead of marketing material :)

    Don't forget Data URI scheme! It's very useful sometimes.
  • Anonymous
    January 01, 2003
    Thanks for the IDN-support!
  • Anonymous
    January 01, 2003
    Good work! I'm glad its coming out this way. I think you guys should open up a forum to discuss that tabbed browsing when you get a chance, or perhaps a chat session. Just my 2c.
  • Anonymous
    January 01, 2003
    Thanks for listening to us. IDN support is a great add-on.
  • Anonymous
    January 01, 2003
    Would you please fix your RSS feeds to include the proper character encoding? I'm tired of looking at question marks and blocks in my feed readers.

    It isn't hard, to the top just add:

    <?xml version="1.0" encoding="windows-1252"?>

    I've requested this a couple of times on your contact page.
  • Anonymous
    January 01, 2003
    CURI = ?

    Centralized Uniform Resource Identifiers ?
  • Anonymous
    January 01, 2003
    Please support data: URIs.

    For very little development investment, it would add some useful flexibility.
  • Anonymous
    January 01, 2003
    How much of this surfing information is retrievable by Microsoft - will it be used in the New MSN search ALGOs?



    ["The CURI object is available for consumption by external callers like ActiveX controls and Browser Helper Objects;." ]
  • Anonymous
    January 01, 2003
    Netquik asked: "CURI = ?

    Centralized Uniform Resource Identifiers ?"

    My guess is that CURI is the class name of the object itself. ("C" is a standard prefix for classes.) In other words, it is the class that implements the IUri interface that is documented at: http://msdn.microsoft.com/library/default.asp?url=/workshop/networking/moniker/reference/ifaces/IUri/IUri.asp
  • Anonymous
    January 01, 2003
    Give me a break Mr Alberto.
    All the other browsers are the ones we need to blame?
    I think you aren't being fair!
    What about the w3c Standards? The ones that IE DOESN'T follow?
    What about security issues?

    I'm Not a professional web developer, but since I discover and learn about the standards, I can see where MS did the wrong job, not following the standards and blaming other browsers, just like you did right now!

    So don't spread FUD and think after talk

    PS 2 plus 2 is five (when you count wrong)
    Goood Bye
  • Anonymous
    January 01, 2003
    I too am rather shocked and appalled that MS has just realized, in 2005, that OOP is the way to go. No wonder IE has had so many security problems. Do I dare ask what other common code is currently implemented dozens of times?
  • Anonymous
    January 01, 2003
    So, the 508 character URL limit is fixed now? (for Bookmarks)

    If so, please post the new limit...

    If not, please go back to your developers and give em a good kick.

    Rick
  • Anonymous
    January 01, 2003
    <<EricLaw, even given unlikely ../ directories in absolute addresses, it is still not infinite if there is a practical limit to the length of URLs.>>

    Hehe. I stand corrected. :-) I'll refine my statement to be: There are somewhere in the neighborhood of ~40 ^ 2000 possible equivalent representations of a given URL.

    (I misspelled minuscule, above. Oops.)
  • Anonymous
    January 01, 2003
    The Internet Explorer team talks about URLs in IE7, and what they're doing to prevent spoofing as much...
  • Anonymous
    January 01, 2003
    Guys, could you add a setting to turn off SOUNDS in IE? It drives me insane that for the last however many years I have to listen to pages loading!!!


    Whats wrong with the one that has been there for years?

    In the same place every single other setting on windows is, strange enough - they put settings all in the control panel.

    Start -> Control Panel -> Sounds and Audio -> Second Tab -> Either Edit the individual Sound on the bottom, or as I do, select the No Sound Scheme from the top drop down -> Click off on the stupid save/whatever alert that comes up after you hit apply -> Ok out


    Done
  • Anonymous
    January 01, 2003
    I don't understand why this align=justify does not work for text in IE. It works perfectly well in firefox 1.0 and above. IE developers, please try to see that this option works....

    Pratap.
  • Anonymous
    January 01, 2003
    I wonder if it would be worth exploring the TinyURL paradigm and creating a local database of frequent and/or common URI's and tagging them as a shortcut. This could work in conjunction with IDN's.
  • Anonymous
    January 01, 2003
    Its been a while since my last post here. I guess Beta 2 work has been taking most of my time. But I...
  • Anonymous
    January 01, 2003
    It'd be nice if I enter "blah.com:81" I don't need to enter http://.
  • Anonymous
    January 01, 2003
    Please start fixing the URL length limit of 2083 (= 2048 + the length of "http://www.microsoft.com/index.html" ?), and support for data urls like data:image/jpg;base64 as well...
  • Anonymous
    January 01, 2003
    I've been round the blogverse and there are a lot of cool things being linked to. Seems a shame to have...
  • Anonymous
    January 01, 2003
    I recently found out that you're not supposed to be able to use back slashes (\) in URLs, yet IE supports the use of them.

    This was causing problems with a certain website I was making because the programmer had linked to everything with back slashes, and of course the links didn't work in Firefox.

    Can you please rectify this problem so people don't assume it's ok to use backslashes for their websites?
  • Anonymous
    January 01, 2003
    "Can you please rectify this problem so people don't assume it's ok to use backslashes for their websites?"

    I completely disagree. There is an old rule of programming when implementing specifications, be conservative in what you send, and liberal in what you accept. I believe IE is following that rule. I'd much rather bad code work in IE than not work because of standards conformance. IE is correct probably 99% of the time when it assumes the user meant to use a /, so why add needless difficulty? Instead, I'd be asking the Firefox people to add this feature!
  • Anonymous
    January 01, 2003
    So, will the classes controlling CURL's be added to the .NET Framework?
  • Anonymous
    January 01, 2003
    Codemastr,

    I believe that's the sort of attitude that makes IE such a pain for designers and developers today.

    There would be no need for rules and standards if they were just ignored all the time. By ignoring specifications you're just making things more confusing; because how would you know if what you're doing is correct?

    I would rather have code that doesn't work at all rather than code that sometimes works. If something didn't work at all, at least you could fix it before it becomes a habit.

    If a person repeatedly pronounced your name incorrectly would you not correct them?

    Specifications are there for a reason, follow them!
  • Anonymous
    January 01, 2003
    Could you make no limitation to URL length. Because URL length limitation to 2048 characters my Website is only compatible with Firefox. Thanks.
  • Anonymous
    January 01, 2003
    this is a really welcome enhancement. Link: IEBlog : URLs in Internet Explorer 7. Internet Explorer 7 includes a new URL handling architecture known internally as CURI. The new optimized URI functions provide more secure and consistent parsing of URIs
  • Anonymous
    January 01, 2003

    'I’m not quite ready to talk about IE7’s support for International Domain Names (IDN) yet'

    whats soo secret??

    (and why wasn't there support in first beta)

  • Anonymous
    January 01, 2003
    <<<Could you make no limitation to URL length. Because URL length limitation to 2048 characters my Website is only compatible with Firefox.>>>

    10 billion webpages get along just fine with URLs under 2048 characters. The RFC calls for 1024. What the heck are you doing? URLs this long are terrible for performance.

    <<<There is an old rule of programming when implementing specifications, be conservative in what you send, and liberal in what you accept>>>

    Alas, that's from the good old days, back before security bugs took advantage of liberal interpretations. IE should lock this down.
  • Anonymous
    January 01, 2003
    >> All IE7 is a knock-off version of FireFox, Netscape (owner of Mozilla, which created FireFox) has always been a better internet Browser.

    Firefox is a knock-off of Opera, which has had tabbed browsing, popup blocking, CSS compliance, skinned browsing, a search bar, the ability to turn off styles, etc. etc. etc. for years, long before Firefox even existed. Firefox has copied Opera in so many ways it's ridiculous. So, stop harping on Microsoft for copying Firefox, because Firefox is only an Opera wannabe. By the way, incorporating a feature that is not currently in your program to make it better, even though a competitor has already done that, is not a bad thing. It shows you are keeping up with the times. Should we harangue Microsoft because they are bettering their CSS support, saying "Microsoft is adding to their css ... they're copying Firefox's css support"? It would be ludicrous to say that, and so is harping on Microsoft because they are adding tabbed browsing. Firefox didn't invent it and it isn't the best browser out there. Get over it.
  • Anonymous
    January 01, 2003
    "codemastr: But then when to tell if I actually means https://blah.com:81"

    You're right. However, when I just type in "blah.com" it defaults to http. I didn't specify a WKP, it just assumes I want http. I think that should always be the case. Unless I specify a WKP or explicitly specify a protocol, it should default to http.

    Actually, to make it even better and to address both mine and M's requests, what about making it configurable? IE apparently has some rules to determine "is this ftp?" "is this http?" what about making them user configurable. It could be like an Outlook rule.

    Where the PORT is 81 use HTTP

    Where the HOSTNAME begins with ftpsearch use HTTP

    That'd be pretty cool if you ask me.
  • Anonymous
    January 01, 2003
    Cheong: Running HTTPS on port other than 443 isn't a good thing, because you can't connect there via proxy.
  • Anonymous
    January 01, 2003
    codemastr: But then when to tell if I actually means https://blah.com:81

    I think it's right to assume protocols only on well known ports.
  • Anonymous
    January 01, 2003
    I noticed with IE7's new history handling, using Back and Forward on fragment links no longer works.

    Clicking a fragment link in a page no longer adds an item to the history, breaking that functionality.

    Is it planned to re-add history for fragment links?
  • Anonymous
    January 01, 2003
    Dear mr. Alberto, thank you for calling me and many other readers here very very experienced and very very sophisticated. I feel honored.

    But please, you too can educate yourself. Simply look at microsofts list of bugfixes they are now going to release with IE7 (http://blogs.msdn.com/ie/archive/2005/07/29/445242.aspx) for confirmation and learn more about its deeper, current mysteries at http://www.positioniseverything.net/ie-primer.html and http://www.satzansatz.de/cssd/onhavinglayout.html.
    Read up and come and join us at our side of the fence ;)
  • Anonymous
    January 01, 2003
    EricLaw, even given unlikely ../ directories in absolute addresses, it is still not infinite if there is a practical limit to the length of URLs.
  • Anonymous
    January 01, 2003
    1. ChrisH: No need to convert your relative hyperlinks to absolute; the efficiency savings would be miniscule, you'd have to transfer more bytes over the wire, and it would cause problems if you ever wanted to change your root domain.

    2. Brianiac: Actually, the equivalent set is infinite. http://www.example.com/1/../ == http://www.example.com/2/../, etc.

    3. Peter: You can turn off sounds using the Control Panel.