Sunday, June 22, 2014

Empty href and src and there effects on Application

This is a problem I’ve come across frequently, and since it has come up again recently, I thought I’d explore this issue in the hope that it will save others some trouble. There are so many problems that this one issue can lead to that it’s baffling browsers still behave this way. The issue? An HTML image, either via <img> tag or JavaScript Image object, that has its src set to “” (an empty string).

The offending code

There are basically two patterns to identify. The first pattern is just straight HTML:
<img src="" >
The second pattern is JavaScript and involves the dynamic setting of the src property on either a newly created image or an existing one:

var img = new Image();
img.src = "";

You’ll note that Opera and Firefox aren’t mentioned at all. Opera behaves as you might expect: it doesn’t do anything when an empty image src is encountered; the attribute is ignored. Firefox 3 and earlier behave the same as Safari and Chrome, but Firefox 3.5 addressed this issue and no longer sends a request (related bug).
Both cases, of course, are problematic because it’s an image making a request for a document. You can easily see this behavior using an HTTP debugging proxy (I highly recommendFiddler).

The problems

There are two basic problems that this browser behavior causes. The first is a traffic spike.  Imagine that have <img src=""> on the page at http://www.example.com/. The big problem is that each instance of <img src=""> makes a request to / in all browsers, which is the homepage of the domain. Congratulations, you’ve effectively doubled your traffic to the homepage.
For small sites, this may not be that big of a deal; jumping from 10,000 to 20,000 page views probably isn’t going to raise any flags for you or your host. If you’re a page that gets millions of page views per day, and probably have a lot of machines to handle that load, doubling or tripling traffic can be crippling. You can very easily run out of capacity.
Another issue with the traffic increase is the computing power needed to generate that homepage. If the page is personalizable or is updated with some regular frequency, you could be wasting computing cycles creating a page that will never be viewed by anyone.
The second problem is user state corruption. If you’re tracking state in the request, either by cookies or in another way, you have the possibility of destroying data. Even though the image request doesn’t return an image, all of the headers are read and accepted by the browser, including all cookies. While the rest of the response is thrown away, the damage may already be done.

How does this code happen?

The first time I encountered this problem, I naively thought that it was a bad developer writing crappy code. Had this been 2000 or earlier, I probably would have been right. In today’s web development world, however, I’m mostly wrong. Today, there are so many templating engines and content management systems responsible for constructing pages on-the-fly that it’s quite possible for good developers to end up producing pages with this code. All it takes is something as simple as this PHP:

<img src="$imageUrl" >

If some other part of the code is responsible for filling in $imageUrl, and that code fails, then the offending code gets output to the browser.
In today’s web development world, we’re all doing something along these lines, whether we know it or not. Download a new WordPress theme? Make sure you’ll filled in all default arguments. Using a CMS at work? Make sure all your image URL fields are validated. It’s frightening easy to end up with this bad code on your page.

Other tags with problems

Before getting too angry at browser vendors, I think it’s fair to take a look at the HTML 4 specification, specifically the part defining images. Even though the specification indicates that the src attribute should contain a URI, it fails to define the behavior when src doesn’t contain a URI. Of course, images aren’t the only tags that reference an external resource, and so it should come as no surprise that there are other tags with the same problem.
As it turns out, Internet Explorer is the most sane browser out there. It’s problems are thankfully limited to images with an empty src attribute. It does make for this by making it a pain to detect, but that will be discussed later.
For other browsers, there are two additional problem scenarios: <script src=""> and <link href="">. Chrome, Safari, and Firefox all initiate another request.
Thankfully, no browser has a problem with <iframe src="">, as all correctly do not make another request.

What can be done?

Of course, the best thing to do is eliminate the offending code from your pages whenever possible. That’s fixing the problem at the source. If you can’t do that, though, your next best option is to attempt to detect it on the server and abort any further execution.
For browsers other than IE, it’s not too difficult to detect what’s going on from the server side. Since the request comes back to the exact same location that contains the offending code, there are two things you can do. First, you can check the request’s referrer. A request resulting from this issue coming from http://www.example.com/dir/mypage.htm will have a referrer ofhttp://www.example.com/dir/mypage.htm. Assuming that there are no valid situations under which your page links to itself, this is a fairly safe way to detect these requests on the server-side.
Internet Explorer throws a wrench into the works by sending the request to the directory of the page instead of the page itself. If you’re only using path URLs (i.e., nothing with a file extension), then the effect is the same and you can use the same referrer detect. Some sample code for use with PHP:

<?php
    //Works for IE only when using path URLs and not file URLs

    //get the referrer
    $referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    //current URL (assuming HTTP and default port)
    $url = "http://" . $_SERVER['HTTP_HOST']  . $_SERVER['REQUEST_URI'];

    //make sure they're not the same
    if ($referrer == $url){
        exit;
    }
?>

The goal here is to detect that the page refers to itself and then exit immediately to prevent the server from doing anything additional. Another option, and probably a good idea, is to log that this has happened so it shows up on a dashboard for evaluation.
Another way to attempt to detect this type of request on the server is by looking at the HTTPAccept header. All browsers except IE send different HTTP Accept headers for image requests than they do for HTML requests. As an example, Chrome sends the following Accept header for an HTML request:

Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5

Compare this to the Accept header that is sent for an image, script, or style sheet request:
Accept: */*
Firefox, Safari, and Opera all send roughly the same Accept header for HTML requests, meaning that you can check for an individual part, such as “text/html”, to determine if the request is an HTML request or something else. Unfortunately, IE only sends the latter Acceptheader for all requests, so there is no way to differentiate this on the server. For browsers other than IE, you can use something like the following:

<?php
    //Warning: Doesn't work for IE!

    //make sure the Accept header has 'text/htmnl' in it
    if (strpos($_SERVER['HTTP_ACCEPT'], 'text/html') === false){
        exit;
    }
?>

This check is a little safer than the previous, but its big downside is that it doesn’t work in IE.
Why does this happen?
The real problem is the way that URI resolution is performed in browsers. This behavior is defined in RFC 3986 – Uniform Resource Identifiers. When an empty string is encountered as a URI, it’s considered a relative URI and is resolved according to the algorithm defined insection 5.2. This specific example, an empty string, is listed in section 5.4. Firefox, Safari, and Chrome are all resolving an empty string correctly per the specification, while Internet Explorer is resolving it incorrectly, apparently in line with an earlier version of the specification, RFC 2396 – Uniform Resource Identifiers (this was obsoleted by RFC 3986). So technically, the browsers are doing what they’re supposed to do to resolve relative URIs. The problem is that in this context, the empty string is clearly unintentional.

It’s time to fix this

This is a serious flaw in browsers, and I’m not sure you can look at it in any way where it’s not considered a bug. The inconsistent behavior, from Opera completely ignoring all invalid external references, to IE falling victim only for <img> tags while others do the same for<script> and <link> as well, seem to indicate a bug in browsers. Though browsers seem to be following correct URI resolution (except IE), I think this is a case where common sense must win over the letter of the specification. There is no way that an image can possibly render an HTML page, and the same goes for <script> and <link>. This bug has cost web developers hundreds of lost hours and has potentially brought down sites, pushing servers over capacity. Enough is enough. It’s time for the browser vendors to fix this bug. I’ve taken the liberty of filing or locating bugs:
Please show support for fixing these bugs, as I don’t see any reason why we should still be dealing with this browser behavior. And if anyone can get the note to Microsoft so they can address IE, we’d all greatly appreciate it.

HTML5 to the rescue

HTML5 adds to the description of the <img> tag’s src attribute to instruct browsers not to make an additional request in section 4.8.2:

The src attribute must be present, and must contain a valid URL referencing a non-interactive, optionally animated, image resource that is neither paged nor scripted. If the base URI of the element is the same as the document’s address, then the src attribute’s value must not be the empty string.

Hopefully, browsers won’t have this problem in the future. Unfortunately, there is no such clause for <script src=""> and <link href="">. Maybe there’s still time to make that adjustment to ensure browsers don’t accidentally implement this behavior.

Avoid Empty Image src

tag: server
Image with empty string src attribute occurs more than one will expect. It appears in two form:
  1. straight HTML
    <img src="">
  2. JavaScript
    var img = new Image();
    img.src = "";
Both forms cause the same effect: browser makes another request to your server.
  • Internet Explorer makes a request to the directory in which the page is located.
  • Safari and Chrome make a request to the actual page itself.
  • Firefox 3 and earlier versions behave the same as Safari and Chrome, but version 3.5 addressed this issue[bug 444931] and no longer sends a request.
  • Opera does not do anything when an empty image src is encountered.

Why is this behavior bad?
  1. Cripple your servers by sending a large amount of unexpected traffic, especially for pages that get millions of page views per day.
  2. Waste server computing cycles generating a page that will never be viewed.
  3. Possibly corrupt user data. If you are tracking state in the request, either by cookies or in another way, you have the possibility of destroying data. Even though the image request does not return an image, all of the headers are read and accepted by the browser, including all cookies. While the rest of the response is thrown away, the damage may already be done.

The root cause of this behavior is the way that URI resolution is performed in browsers. This behavior is defined in RFC 3986 - Uniform Resource Identifiers. When an empty string is encountered as a URI, it is considered a relative URI and is resolved according to the algorithm defined in section 5.2. This specific example, an empty string, is listed in section 5.4. Firefox, Safari, and Chrome are all resolving an empty string correctly per the specification, while Internet Explorer is resolving it incorrectly, apparently in line with an earlier version of the specification, RFC 2396 - Uniform Resource Identifiers (this was obsoleted by RFC 3986). So technically, the browsers are doing what they are supposed to do to resolve relative URIs. The problem is that in this context, the empty string is clearly unintentional.
HTML5 adds to the description of the  tag's src attribute to instruct browsers not to make an additional request in section 4.8.2:
The src attribute must be present, and must contain a valid URL referencing a non-interactive, optionally animated, image resource that is neither paged nor scripted. If the base URI of the element is the same as the document's address, then the src attribute's value must not be the empty string.
Hopefully, browsers will not have this problem in the future. Unfortunately, there is no such clause for <script src=""> and <link href="">. Maybe there is still time to make that adjustment to ensure browsers don't accidentally implement this behavior.

No comments:

Post a Comment