While adding TrackBack support to my blog I ran into a weird issue. I was quite confused, and going through MSDN confused me even more. Posting a text-only TrackBack worked fine. But as soon as the "excerpt" (the notification text itself) contained an HTML tag, quotes, etc., the excerpt would get cut off right at the offending character.
TrackBack Background
When you post a TrackBack, you need to provide 4 textual values:
- title - the title of the entry
- excerpt - the notification text itself
- url - a permanent link to your blog entry
- blog_name - the name of the blog where you posted the entry
Encoding Values Of Form Fields
These four parameters are assembled into a string and sent to a server that accepts TrackBacks. Here's a sample from movabletype.org:
POST http://www.foo.com/mt-tb.cgi/5
Content-Type: application/x-www-form-urlencoded
title=Foo+Bar&url=http://www.bar.com/&
excerpt=My+Excerpt&blog_name=Foo
It goes without saying that you need to encode the parameters before you concatenate them. The natural choice seems to be HttpUtility.HtmlEncode. After all, MSDN describes it as follows:
HTML-encodes a string and returns the encoded string...
URL encoding ensures that all browsers will correctly transmit text in URL strings. Characters such as ?, &,/, and spaces may be truncated or corrupted by some browsers so those characters must be encoded in <A> tags or in query strings where the strings may be re-sent by a browser in a request string.
It also provides an example:
string TestString = "This is a <Test String>.";
string EncodedString = Server.HtmlEncode(TestString);
which is supposed to yield "This+is+a+%3cTest+String%3e.". Well, if you run this code in the debugger it yields "This is a &lt;Test String&gt;." instead! The resulting string is safe to embed in HTML, but it's not good enough to be POSTed. Every "special character" is replaced by an entity that starts with an ampersand (&lt; for <, &quot; for quotes, etc.), and the Request.Form collection splits the string on the ampersands. This is exactly what I observed while sending myself test TrackBacks. Instead of those 4 parameters I would end up with 6 or more.
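To make the failure concrete, here is a minimal sketch of what happens when an HtmlEncode'd excerpt is dropped into a form body. The class and helper names (HtmlEncodeSplitDemo, BuildBrokenBody, SplitFields) are made up for illustration, and splitting on '&' is only a rough stand-in for what a form parser does; it is not the actual Request.Form code.

```csharp
using System;
using System.Web;

static class HtmlEncodeSplitDemo
{
    // Hypothetical helper: builds a three-field body the WRONG way, with HtmlEncode.
    public static string BuildBrokenBody(string excerpt)
    {
        return "title=Foo&excerpt=" + HttpUtility.HtmlEncode(excerpt) + "&blog_name=Foo";
    }

    // Hypothetical helper: mimics a form parser by cutting the body at every '&'.
    public static string[] SplitFields(string body)
    {
        return body.Split('&');
    }

    static void Main()
    {
        string body = BuildBrokenBody("See <b>this</b> post");
        Console.WriteLine(body);

        // Each entity (&lt;, &gt;) contributes an extra '&', so the three
        // intended fields fall apart into seven pieces.
        foreach (string field in SplitFields(body))
            Console.WriteLine(field);
    }
}
```

The second piece comes out as "excerpt=See ", which is the same truncation at the first offending character that I saw in my test TrackBacks.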
This is where HttpUtility.UrlEncode comes to the rescue. Suspiciously enough, its MSDN description has almost the exact same wording and even the same code sample. A string encoded with UrlEncode can be safely POSTed to another page.
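A sketch of assembling the four TrackBack values with UrlEncode might look like this. TrackBackBody.Build is a hypothetical helper of my own naming, not part of any TrackBack library, and the sample values are invented.

```csharp
using System;
using System.Web;

static class TrackBackBody
{
    // Hypothetical helper: builds an application/x-www-form-urlencoded body.
    // Every value is URL-encoded on its own BEFORE the pieces are joined with '&'.
    public static string Build(string title, string excerpt, string url, string blogName)
    {
        return "title=" + HttpUtility.UrlEncode(title)
             + "&excerpt=" + HttpUtility.UrlEncode(excerpt)
             + "&url=" + HttpUtility.UrlEncode(url)
             + "&blog_name=" + HttpUtility.UrlEncode(blogName);
    }

    static void Main()
    {
        string body = Build("Foo Bar", "Tom & \"Jerry\" <b>rock</b>", "http://www.bar.com/", "Foo");
        Console.WriteLine(body);

        // The only raw '&' characters left are the three field separators.
        Console.WriteLine(body.Split('&').Length); // 4
    }
}
```

Because UrlEncode turns '&' into %26 and '<' into %3c, nothing inside a value can be mistaken for a field separator.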
On the receiving end you need to decode the string and, again, oddly enough, both HtmlDecode and UrlDecode produced the same result.
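For the receiving side, here is a small round-trip sketch. It assumes a framework version that has HttpUtility.ParseQueryString (it was added after .NET 1.1), and TrackBackReceiver.Parse is a hypothetical wrapper name. Inside an ASP.NET page you would normally read Request.Form instead, which hands back values that are already URL-decoded; that may be why running either decoder over them looked identical in my tests.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

static class TrackBackReceiver
{
    // Hypothetical wrapper: splits a form body on '&' and '=' and URL-decodes
    // each name and value ('+' becomes a space, %xx becomes the character).
    public static NameValueCollection Parse(string body)
    {
        return HttpUtility.ParseQueryString(body);
    }

    static void Main()
    {
        string excerpt = "Tom & \"Jerry\" <b>rock</b>";
        string body = "title=Foo+Bar&excerpt=" + HttpUtility.UrlEncode(excerpt);

        NameValueCollection form = Parse(body);
        Console.WriteLine(form["title"]);              // Foo Bar
        Console.WriteLine(form["excerpt"] == excerpt); // True
    }
}
```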
The Lesson I Learned
Clearly, MSDN is lying about HtmlEncode. There's no sign of the promised conversion. I think it does make the string safer for embedding in XML: < and > are converted to &lt; and &gt;, so the result is good for embedding in an XML tag. I also looked at its code in Reflector, which confirmed that MSDN is lying. For those who are curious, here's the code of this method:
public static void HtmlEncode(string s, TextWriter output)
{
    char ch1;
    char ch2;
    int num3;
    if (s == null)
        return;
    int num1 = s.Length;
    int num2 = 0;
    while (num2 < num1) {
        ch1 = s[num2];
        ch2 = ch1;
        if (ch2 != '\"') {
            if (ch2 == '&') goto Label_0064;
            // '<', '=' and '>' are adjacent code points, hence the offset switch
            switch (ch2 - '<') {
                case 0: output.Write("&lt;"); goto Label_00AE;
                case 1: goto Label_0071;
                case 2: output.Write("&gt;"); goto Label_00AE;
            }
            goto Label_0071;
        }
        output.Write("&quot;");
        goto Label_00AE;
    Label_0064:
        output.Write("&amp;");
        goto Label_00AE;
    Label_0071:
        // Characters U+00A0..U+00FF become numeric entities; the rest pass through
        if ((ch1 >= '\u00a0') && (ch1 < '\u0100'))
        {
            num3 = ch1;
            output.Write(string.Concat("&#",
                num3.ToString(NumberFormatInfo.InvariantInfo), ";"));
        }
        else
            output.Write(ch1);
    Label_00AE:
        num2 += 1;
    }
}
No trace of converting a string "the URL way".
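Since decompiled gotos are hard to follow, here is my own readable restatement of what I believe the listing does. It is a sketch and not the framework's source: the class name ReadableHtmlEncode is made up, and I am assuming the lower bound of the numeric-entity range is the non-breaking space (U+00A0).

```csharp
using System;
using System.Globalization;
using System.IO;

static class ReadableHtmlEncode
{
    public static void HtmlEncode(string s, TextWriter output)
    {
        if (s == null)
            return;

        foreach (char ch in s)
        {
            switch (ch)
            {
                case '"': output.Write("&quot;"); break;
                case '&': output.Write("&amp;"); break;
                case '<': output.Write("&lt;"); break;
                case '>': output.Write("&gt;"); break;
                default:
                    // Assumption: U+00A0..U+00FF are written as numeric entities.
                    if (ch >= '\u00a0' && ch < '\u0100')
                        output.Write("&#" + ((int)ch).ToString(NumberFormatInfo.InvariantInfo) + ";");
                    else
                        output.Write(ch); // spaces, letters, '?', '/' pass through untouched
                    break;
            }
        }
    }

    static void Main()
    {
        StringWriter writer = new StringWriter();
        HtmlEncode("This is a <Test String>.", writer);
        Console.WriteLine(writer.ToString()); // This is a &lt;Test String&gt;.
    }
}
```

Spaces are never turned into '+' and nothing is percent-escaped, which is why the output cannot serve as form data.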
I'd Like To Hear From You
Please feel free to share your opinions on which situations HtmlEncode and UrlEncode are each best suited for.