UrlEncode vs. HtmlEncode

While adding TrackBack support to my blog I ran into a weird issue. I was quite confused, and after going through MSDN I was even more confused. Posting a text-only TrackBack worked fine. But as soon as the "excerpt" (the notification text itself) contained an HTML tag, a quote, etc., the excerpt was cut off right at the offending character.

TrackBack Background

When you post a TrackBack, you need to provide 4 textual values:

  • title - the title of the entry
  • excerpt - notification text itself
  • url - a permanent link to your blog entry
  • blog_name - the name of your blog where you posted an entry

Encoding Values Of Form Fields

These four parameters are assembled into a string and sent to a server that accepts TrackBacks. Here's a sample from movabletype.org:

POST http://www.foo.com/mt-tb.cgi/5
Content-Type: application/x-www-form-urlencoded

title=Foo+Bar&url=http://www.bar.com/&
excerpt=My+Excerpt&blog_name=Foo

It goes without saying that you need to encode the parameters before you concatenate them. The natural choice seems to be HttpUtility.HtmlEncode. After all, MSDN describes it as follows:

HTML-encodes a string and returns the encoded string...

URL encoding ensures that all browsers will correctly transmit text in URL strings. Characters such as ?, &, /, and spaces may be truncated or corrupted by some browsers so those characters must be encoded in <A> tags or in query strings where the strings may be re-sent by a browser in a request string.

It also provides an example:

string TestString = "This is a <Test String>.";
string EncodedString = Server.HtmlEncode(TestString);

which is supposed to yield "This+is+a+%3cTest+String%3e.". Well, if you run this code in the debugger it yields "This is a &lt;Test String&gt;." instead! The resulting string is safer to embed in markup, but it's not good enough to be POSTed. Every "special character" is replaced by an entity that starts with an ampersand (&lt; for <, &quot; for quotes, etc.), and the Request.Form collection splits the string on those ampersands. This is exactly what I observed while sending myself test TrackBacks: instead of those 4 parameters I would end up with 6 or more.
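To make the failure concrete, here is a minimal sketch of what I believe was happening. It rests on a few assumptions: HttpUtility.ParseQueryString stands in for the Request.Form collection (both split form data on ampersands), and the BuildBodyWithHtmlEncode helper and the sample excerpt are made up for illustration.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

public static class EncodingComparison
{
    // Hypothetical helper: assembles a form body the WRONG way, with HtmlEncode.
    public static string BuildBodyWithHtmlEncode(string excerpt)
    {
        return "title=Foo&excerpt=" + HttpUtility.HtmlEncode(excerpt) + "&blog_name=Foo";
    }

    public static void Main()
    {
        string test = "This is a <Test String>.";

        // HtmlEncode produces entities, not URL escapes.
        Console.WriteLine(HttpUtility.HtmlEncode(test)); // This is a &lt;Test String&gt;.

        // UrlEncode is the one that produces the '+' and '%xx' form.
        Console.WriteLine(HttpUtility.UrlEncode(test));

        // Each entity starts with '&', which the form parser treats as a
        // field separator, so the excerpt stops at the first '<'.
        string body = BuildBodyWithHtmlEncode("See <b>this</b>");
        NameValueCollection form = HttpUtility.ParseQueryString(body);
        Console.WriteLine("[" + form["excerpt"] + "]"); // [See ]
    }
}
```

The last line prints only the text before the first tag, which matches the truncation I saw in my test TrackBacks.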

This is where HttpUtility.UrlEncode comes to the rescue. Suspiciously enough, it has almost the exact same wording and even the same code sample. A string encoded with UrlEncode can be safely POSTed to another page.
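As a rough sketch of the fix, the four TrackBack values can be URL-encoded one by one before they are joined. The TrackBackBody class, its Build method and the sample values are my own illustration and not part of any TrackBack library; ParseQueryString again plays the role of the receiving form parser.

```csharp
using System;
using System.Collections.Specialized;
using System.Text;
using System.Web;

public static class TrackBackBody
{
    // Hypothetical helper: URL-encodes every value, then joins the pairs with '&'.
    public static string Build(string title, string excerpt, string url, string blogName)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("title=").Append(HttpUtility.UrlEncode(title));
        sb.Append("&excerpt=").Append(HttpUtility.UrlEncode(excerpt));
        sb.Append("&url=").Append(HttpUtility.UrlEncode(url));
        sb.Append("&blog_name=").Append(HttpUtility.UrlEncode(blogName));
        return sb.ToString();
    }

    public static void Main()
    {
        string excerpt = "He said \"<b>hi</b>\" & left";
        string body = Build("Foo Bar", excerpt, "http://www.bar.com/", "Foo");
        Console.WriteLine(body);

        // The only literal ampersands left are the three field separators,
        // so the receiver sees exactly four fields.
        NameValueCollection form = HttpUtility.ParseQueryString(body);
        Console.WriteLine(form.Count);                 // 4
        Console.WriteLine(form["excerpt"] == excerpt); // True
    }
}
```

Encoding each value separately, rather than the assembled string, is what keeps the separators themselves intact.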

On the receiving end you need to decode the string and, again, oddly enough, both HtmlDecode and UrlDecode produced the same result.
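One plausible explanation for the identical results, assuming the receiving page reads its values through a form collection that already URL-decodes them: by then there is nothing left for either decoder to change. The sketch below uses HttpUtility.ParseQueryString as a stand-in for that collection, and the DecodeCheck class is hypothetical.

```csharp
using System;
using System.Web;

public static class DecodeCheck
{
    // Hypothetical helper: the form parser URL-decodes each value as it parses.
    public static string ReadExcerpt(string body)
    {
        return HttpUtility.ParseQueryString(body)["excerpt"];
    }

    public static void Main()
    {
        string raw = HttpUtility.UrlEncode("This is a <Test String>.");
        string excerpt = ReadExcerpt("excerpt=" + raw);

        // The parsed value has no '+', '%xx' or entities left,
        // so both decoders return it unchanged.
        Console.WriteLine(excerpt == HttpUtility.UrlDecode(excerpt));  // True
        Console.WriteLine(excerpt == HttpUtility.HtmlDecode(excerpt)); // True

        // On the raw POSTed text the two decoders do differ:
        Console.WriteLine(HttpUtility.UrlDecode(raw));         // This is a <Test String>.
        Console.WriteLine(HttpUtility.HtmlDecode(raw) == raw); // True, nothing was decoded
    }
}
```

If the value is taken from the raw request body instead, only UrlDecode restores the original text.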

The Lesson I Learned

Clearly, MSDN is lying about HtmlEncode: there's no sign of the promised conversion. I think it does make a string safer for embedding in XML, since < and > are converted to entities and the result can go inside an XML tag. I also looked at its code in Reflector and got confirmation that MSDN is lying. For those who are curious, here's the code of this method:

// Decompiled with Reflector; needs System.IO and System.Globalization
public static void HtmlEncode(string s, TextWriter output)
{
  char ch1;
  char ch2;
  int num3;

  if (s == null)
    return;

  int num1 = s.Length;
  int num2 = 0;
  while (num2 < num1) {
    ch1 = s[num2];
    ch2 = ch1;
    if (ch2 != '"') {
      if (ch2 == '&') goto Label_0064;

      // '<', '=' and '>' are consecutive code points, hence the offsets 0, 1, 2
      switch (ch2 - '<') {
        case 0: output.Write("&lt;"); goto Label_00AE;
        case 1: goto Label_0071;
        case 2: output.Write("&gt;"); goto Label_00AE;
      }
      goto Label_0071;
    }

    output.Write("&quot;");
    goto Label_00AE;

    Label_0064:
    output.Write("&amp;");
    goto Label_00AE;

    Label_0071:
    // Characters 160 through 255 are written as numeric entities
    if ((ch1 >= '\u00a0') && (ch1 < '\u0100'))
    {
      num3 = ch1;
      output.Write(string.Concat("&#",
          num3.ToString(NumberFormatInfo.InvariantInfo), ";"));
    }
    else
      output.Write(ch1);

    Label_00AE:
    num2 += 1;
  }
}

No trace of converting a string "the URL way".

I'd Like To Hear From You

Please feel free to share your opinions on which situations HtmlEncode and UrlEncode are each best suited for.

Comments

Comment permalink 1 Shannon J Hager |
URL encoding and HTML encoding are not the same thing. If you want to encode for use in a URL, you use URL encoding. If you want to encode for display on an HTML page (converting angle brackets to "& lt ;" for example), you HTML encode it. The docs are wrong: the verbiage for UrlEncode is used in the HtmlEncode documentation you link to above.
Comment permalink 2 Kiliman |
I agree with Shannon. I'll just add one other "rule of thumb".

The reason you "encode" data is to prevent certain characters in your data from being misinterpreted by the receiver.

HtmlEncode converts the angle brackets, quotes, ampersands, etc. to the entity values to prevent the HTML parser from confusing it with markup.

UrlEncode converts spaces to "+" and non-alphanumeric characters to their hex-encoded values. Again, this is to prevent the URL parser from misinterpreting an embedded ?, & or other values.

If you're wondering which one you should use in an HTTP POST, well just think of POST data as an extremely long query string. So naturally you will need to use UrlEncode.

Kiliman
Comment permalink 3 Kiliman |
I was curious if Microsoft had fixed that documentation error.

I went to the Longhorn SDK site, and it is still showing the wrong information.

HtmlEncode doc

I don't know where you send documentation bug reports, but you should let them know.
Comment permalink 4 CraigD |
This post helped me with my own ExtendedHtmlUtility.

So far I've found it useful in two situations: (1) resolving HTML entities in pages in a search engine spider and (2) outputting entire Chinese, Korean and Japanese pages using entities to represent all 'double byte' characters within the iso-8859-1 charset. Why? A shared Apache server had been set-up to ONLY send the HTTP Content-Type: iso-8859-1 - meaning that browsers could not successfully display the page without the user manually selecting the encoding...

I haven't touched on UrlEncoding (or decoding) but I guess it'd follow the same pattern, with a different encoded form...
Comment permalink 5 Ryan Walters |
There seems to be some confusion over the difference between POST and GET. POST submits a form without appending parameters to the URL. GET is the method that appends the form fields to the URL.
Comment permalink 6 Danny |
Hello,

'xmsdnbug@microsoft.com' is an alias you can use to report MSDN bugs. I came across your site today and reported the bug via an internal Microsoft alias, so there is no need for you to report it at this time.

I'd imagine it will take time as it probably has a few layers of approval to go through as well as localization into different languages.

Please note that while I work at Microsoft I have no direct ties to MSDN; I'm in a different product group entirely. :)
Comment permalink 7 Milan Negovan |
Thank you, Danny. I noticed that these days it's always quicker to find a product group blog to contact people within Microsoft directly.
