Producing XHTML-Compliant Pages With Response Filters

Comment permalink 1 SomeNewKid |
An interesting article, Milan.

I do have a question, however. (As I'm only a newbie developer, it *is* only a question ... not a challenge.)

Isn't running the output of *every* page through THREE RegEx processes extremely poor in terms of performance and scalability?

Why not just use custom Page and Form classes, as per the following article:
http://www.liquid-internet.co.uk/content/dynamic/pages/series1article1.aspx

Thanks again for an interesting article.
Comment permalink 2 Milan |
Yep, I've seen this article. It's an interesting approach, but... ASP.NET is a pretty complex framework. There's a lot of plumbing in place. I really wouldn't want to maintain an alternative server-side form, its viewstate, etc. It seems like too much hassle. I find it easier and more efficient to tap into a generated response and tweak it. Also, it's been promised that ASP.NET 2.0 will produce web standards compliant code. We'll see about that in due time. If they keep the promise, my filter won't be necessary, which I'm OK with.

As to manipulating text: yes, it may create overhead. Since strings in .NET are immutable, you end up allocating a new chunk of memory each time you assign a string a new value. You can compensate for this with page caching or page fragment caching. The benefit of caching is tremendous.
Comment permalink 3 Paul |
Nice article; good to see more and more people catching on to standards. I really like this, as I can apply this filter on a page-by-page basis or through a base page class in template scenarios. I'd like to see an implementation that changes the bytes directly as they are read through...

Anyone interested in standardized markup with asp.net may be interested in a new product on the markup ( not mine, not a shameless plug ): www.xhtmlwebcontrols.net
Comment permalink 4 Bill |
Interesting article. I've just started doing C# with XHTML (with some XML thrown in for fun). I'm still a ways off, but I've cut my validation errors way down - by half so far. I'm going to have to dig into your article a bit more to see what I'm missing yet. Thanks for getting this information out there.
Comment permalink 5 Lon Palmer |
Oh my.

I have a project to do for a client that is running MS 2003 servers. Naturally I thought "I'll use ASP.NET, it's native to his server!" My next thought was "I'll make it standards compliant. I'll use XHTML 1.0 Strict!"

Now, after reading this, I want to install Tomcat and go back to Java.

Most of my PC coding, I do in C# WinForms. My server coding was always in Java (J2EE) and I'll probably move back to it.

I don't like a lot of JavaScript in my pages (none, if I can help it). Nor do I like the idea of not being in control of the output of my page. What were the ASP.NET developers thinking when they took so much control? What if their code breaks a browser that I need to target, like a web-enabled phone or a PDA? Seems they've painted themselves into a "Web from a PC only" box.

Is ASP.NET really viable in the enterprise? Really?
Comment permalink 6 Lon Palmer |
Ok, One kudo to ASP.NET. It seems VERY fast.

nuff said.
Comment permalink 7 Jason |
Won't response.Filter overwrite any existing filters in place? Forgive me if this is a stupid question, but I am completely new to the HttpModule framework...
Comment permalink 8 Milan Negovan |
You need to make sure you don't just overwrite the old filter, but "chain" it. The code passes the existing response.Filter to the constructor of my filter and thus preserves it (look up InstallResponseFilter in the HttpModule). Make sense?
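
The chaining idea can be sketched without ASP.NET at all. What follows is a minimal illustration, not the article's PageFilter: `TaggingFilter`, its `tag` parameter, and `ChainDemo` are hypothetical names, and a `MemoryStream` stands in for the original `response.Filter`. Each filter keeps the stream it was constructed with and forwards to it, so installing a second filter wraps the first instead of replacing it.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical filter for illustration only: it prefixes each chunk it receives
// with a tag, then forwards the result to the stream it wraps.
public class TaggingFilter : Stream
{
    private readonly Stream inner;   // previously installed filter, or the final sink
    private readonly byte[] tag;

    public TaggingFilter(Stream inner, string tag)
    {
        if (inner == null) throw new ArgumentNullException("inner");
        this.inner = inner;
        this.tag = Encoding.UTF8.GetBytes(tag);
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Build tag + chunk, then pass it down the chain in a single write.
        byte[] combined = new byte[tag.Length + count];
        Buffer.BlockCopy(tag, 0, combined, 0, tag.Length);
        Buffer.BlockCopy(buffer, offset, combined, tag.Length, count);
        inner.Write(combined, 0, combined.Length);
    }

    public override void Flush() { inner.Flush(); }

    // Members a write-only filter does not support.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}

public static class ChainDemo
{
    // Installs two filters the way "response.Filter = new X(response.Filter)" would,
    // writes one chunk through the outermost filter, and returns what reached the sink.
    public static string Run(string text)
    {
        MemoryStream sink = new MemoryStream();
        Stream filter = sink;
        filter = new TaggingFilter(filter, "[first]");
        filter = new TaggingFilter(filter, "[second]");   // wraps the first, does not overwrite it

        byte[] data = Encoding.UTF8.GetBytes(text);
        filter.Write(data, 0, data.Length);
        filter.Flush();
        return Encoding.UTF8.GetString(sink.ToArray());
    }
}
```

Calling `ChainDemo.Run("hi")` shows that both filters ran: the chunk passes through the second filter, then the first, before it reaches the sink.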
Comment permalink 9 Nick |
Milan,

How efficient is that HTTP filter in terms of performance and scalability under high load (hundreds of thousands of users per month, for example)?

It seems to be the easiest way to make my aspx pages XHTML compliant before ASP.NET 2.0 comes out in 2005; at least it's easier than XHTMLWebControls, which requires modifying the core of my web app.
Comment permalink 10 Milan Negovan |
I'd be careful with it---under heavy loads it might not perform that well because of all the RegEx processing. Caching cleaned up pages and compressing them will surely mitigate the initial performance hit.

The upside of this approach is its ease of use. You simply plug it into the pipeline.
Comment permalink 11 David |
Good article. I've implemented it on my site and it worked well until .NET SP1 was installed. With SP1, the script block for the __doPostBack function has both the language and the type attribute; your code replaces the language attribute with a type attribute, which results in duplicate attributes.
I also wrapped the __EVENTTARGET and __EVENTARGUMENT hidden fields in div tags, as this was required.
I added an extra bit to remove whitespace too:
// Remove whitespace
if (bool.Parse(ConfigurationSettings.AppSettings["RemoveWhitespace"]))
{
    finalHtml = Regex.Replace(finalHtml, "\t", string.Empty);
    finalHtml = Regex.Replace(finalHtml, "\n", string.Empty);
    finalHtml = Regex.Replace(finalHtml, "\r", string.Empty);
    finalHtml = Regex.Replace(finalHtml, "", "// --> \n");
}
It is controlled by a web.config setting and knocks about 10% off the page size.
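
The three separate passes over tab, line feed, and carriage return could probably be collapsed into one. The sketch below assumes the same goal (drop those characters entirely); `WhitespaceStripper` and `Strip` are hypothetical names, not part of the article's filter. Bear in mind that removing line breaks glues the lines of inline scripts together, so a `//` comment would swallow whatever code follows it on the joined line.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceStripper
{
    // One character class covers tab, carriage return and line feed, so the
    // string is scanned (and re-allocated) once instead of three times.
    private static readonly Regex LineNoise = new Regex("[\t\r\n]+", RegexOptions.Compiled);

    public static string Strip(string html)
    {
        return LineNoise.Replace(html, string.Empty);
    }
}
```

For example, `WhitespaceStripper.Strip("<p>\r\n\tHello\n</p>")` yields `<p>Hello</p>`; ordinary spaces are left alone.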
Comment permalink 12 Milan Negovan |
Duplicates? Hmm.... I'll look into that. As to replacing white space characters---this is what I wanted to work on next as an addition to the filter. ;)
Comment permalink 13 David |
I suppose you could remove the addition of the type attribute. I think .NET Framework 1.1 SP1 adds this now; I'll try it later.

Here's a link to the regex to clean up whitespace, this form seemed to remove a line from my code above.

http://dotnetjunkies.com/WebLog/donnymack/archive/2003/09/08/1468.aspx
Comment permalink 14 David |
Yep: for .NET 1.1 SP1, remove the regular expression that replaces the language attribute with the type attribute. Just make sure any SCRIPT tags in your own code have the type attribute.
Comment permalink 15 Bruce |
Hi,

I am getting the following error when compiling the HttpModule class: 'MyHttpFilter.xhtmlFilter' does not implement interface member 'System.Web.IHttpModule.Dispose()'.

I have added the function and it works now, should this be done or is my configuration wrong?

Great fix for xhtml BTW.
Comment permalink 16 Thomas |
I have the same problem. ('MyHttpFilter.xhtmlFilter' does not implement interface member 'System.Web.IHttpModule.Dispose()'.)
As I am not so familiar with c# it would be great if anyone could tell me how to fix it.

TIA
Thomas
Comment permalink 17 Milan Negovan |
Thomas and Bruce, if you've created a new VB.NET project, make sure you clear out the Root namespace.
Comment permalink 18 David Rhodes |
Milan, could you point out the code section needed to re-write the action property of the form when using url re-writing, I can't seem to find it in this article
Comment permalink 19 Milan Negovan |
David, I took it out of this HttpModule because something wasn't quite working out with rewriting the action attribute. As I indicated in a blog post I moved it into another module. I'm still trying to remember where it went. :) I'll post it here as soon as I find it.
Comment permalink 20 Basic Date Picker |
Milan, nice work on your XHTML filter. We too had a problem with the filter rendering double ‘type’ attributes in the script blocks.

Inside the Write() method we changed the following:

This...

// Replace language="javascript" with script type="text/javascript"
re = new Regex ("(?<=script\\s*)(language=\"javascript\")", RegexOptions.IgnoreCase);
finalHtml = re.Replace (finalHtml, new MatchEvaluator (JavaScriptMatch));


Became this...

// Replace language="javascript" with script type="text/javascript"
// This will match language="javascript", language="javascript1.1", etc.
string regexJSLanguage = "(?]*?)(language=\"javascript[^>]*?\")";
re = new Regex (regexJSLanguage,RegexOptions.Multiline | RegexOptions.IgnoreCase);
finalHtml = re.Replace(finalHtml, new MatchEvaluator (JavaScriptMatch));


// Check for blocks that have double "type="text/javascript"" attributes and strip to only one.
string regexDoubleJSType = "(?]*?)(?type=\"[^\"]*?\"\\s?)(?[^>]*?)?(?type=\"[^\"]*?\"\\s?)(?[^>]*?>)";
re = new Regex (regexDoubleJSType,RegexOptions.Multiline | RegexOptions.IgnoreCase);
finalHtml = re.Replace(finalHtml, new MatchEvaluator(DoubleJSTypeMatch));


We basically pull apart the block into its parts and glue it back together using only one type attribute instead of two. I'm sure there is a way both those javascript replace methods could be combined, but it is what it is at the moment. I'm no regex expert and the above fix does not 'appear' to take any performance hit, although we have not run through any load stressing to confirm.


The following Match was added to handle the doubleJS...

private static string DoubleJSTypeMatch (Match m)
{
    return m.Result("${startTag}${miscAttributes}${secondTypeMatch}${endTag}");
}


Keep up the excellent work Milan.

Geoff - http://www.basicdatepicker.com
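
The patterns in this comment appear to have lost their angle-bracketed parts to the comment form, so here is a hedged reconstruction of the duplicate-type cleanup. The group names startTag, miscAttributes, secondTypeMatch, and endTag come from the m.Result call above; the name `firstTypeMatch`, the `<script` prefix inside startTag, and the `ScriptTagCleaner` class are assumptions made for illustration.

```csharp
using System.Text.RegularExpressions;

public static class ScriptTagCleaner
{
    // Assumed reconstruction: a <script ...> start tag carrying two type="..." attributes.
    private static readonly Regex DoubleType = new Regex(
        "(?<startTag><script[^>]*?)" +
        "(?<firstTypeMatch>type=\"[^\"]*?\"\\s?)" +
        "(?<miscAttributes>[^>]*?)?" +
        "(?<secondTypeMatch>type=\"[^\"]*?\"\\s?)" +
        "(?<endTag>[^>]*?>)",
        RegexOptions.IgnoreCase);

    // Rebuilds the tag from every captured part except the first type attribute.
    public static string RemoveDuplicateType(string html)
    {
        return DoubleType.Replace(html, "${startTag}${miscAttributes}${secondTypeMatch}${endTag}");
    }
}
```

A start tag with a single type attribute does not match the pattern and is returned unchanged.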
Comment permalink 21 Nicholas Berardi |
This is a wonderful article. I have a simple question: why did you choose ReleaseRequestState instead of PreSendRequestContent? Is there a difference in where they execute, and is that why you can't use filters in PreSendRequestContent?
Comment permalink 22 Milan Negovan |
PreSendRequestContent is a non-deterministic event. Its timing of invocation is not completely random, but timing is important. Also, I wanted to make sure everyone in the pipeline had a chance to contribute to the response, which is why I wire my filter so far down the pipeline.
Comment permalink 23 Franck Quintana |
First of all this article is great! Thank you Milan :)
I think I have a small optimization:

// The title has an id="..." which we need to get rid of
re = new Regex ("", RegexOptions.IgnoreCase);
finalHtml = re.Replace (finalHtml, new MatchEvaluator (TitleMatch));
-----------------------------
if you replace it by:

re = new Regex ("", RegexOptions.IgnoreCase);
if(re.IsMatch(finalHtml)) {
finalHtml = re.Replace (finalHtml, new MatchEvaluator (TitleMatch));
}


and the same for the others re.Replace...

Testing IsMatch before each Replace avoids extra memory consumption, because strings are immutable.

Hope this helps!
Franck.
Comment permalink 24 Pragati |
Thanks for the valuable information. The article provides sufficient inputs to start with web accessibility in .net for me.
Comment permalink 25 Vadra Rowley |
Thank you for addressing this issue, but I was disappointed after I had read half and scanned the rest of the article. At the very beginning, you stated the article had a two-fold purpose. I was waiting for you to address the first mentioned... the importance of following xhtml standards. Could you or anyone comment on this? I need to convince a few people who don't convince easily.
Comment permalink 26 Milan Negovan |
Point them to this Web Standards Primer and The way forward with web standards.
Comment permalink 27 Tim |
Many thanks for an overview of Accessibility - this has formed a positive start. I may wait until .NET 2.0 comes out instead of venturing into overcoming some of the accessibility issues associated with .NET 1.0.
Comment permalink 28 Bjorn |
I just moved from Java to .NET. I'm shocked by the appalling state of web standards compliance in ASP.NET. Sigh.

Good to see people like you putting focus on it, though.
Comment permalink 29 Jeremy |
Is there a reason for not including the option to compile the regex statements for reuse? This would slow down the first call but speed up the successive ones.

Also, this block of code is flawed:
// If __doPostBack is registered, replace the whole function
if (finalHtml.IndexOf ("__doPostBack") > -1)
{
    try
    {
        int pos1 = finalHtml.IndexOf ("var theform = document.getElementById ('');
        theform.__EVENTTARGET", pos1);
        string methodText = finalHtml.Substring (pos1, pos2-pos1);
        string formID = Regex.Match (methodText,
            "document.forms\\[\"(.*?)\"\\];",
            RegexOptions.IgnoreCase).
            Groups[1].Value.Replace (":", "_");

        finalHtml = finalHtml.Replace (methodText,
            @"var theform = document.getElementById ('" + formID + "');");
    }
    catch {}
}

Exception handling is expensive, so it should NEVER be used to control code flow. Nitpicky? Maybe... but we're dealing with code here that potentially runs for every text/html response on a site. Every little bit adds up.
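
One way to avoid relying on the catch block is to test what IndexOf returns before calling Substring. This is only a sketch: `TextSlicer.TryExtractBetween` is a hypothetical helper, and the markers passed to it are placeholders rather than the strings the article's filter searches for.

```csharp
using System;

public static class TextSlicer
{
    // Returns true and sets 'result' to the text from startMarker up to (not including)
    // endMarker; returns false, without throwing, when either marker is missing.
    public static bool TryExtractBetween(string text, string startMarker, string endMarker, out string result)
    {
        result = null;

        int start = text.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return false;

        // Look for the end marker only after the start marker.
        int end = text.IndexOf(endMarker, start + startMarker.Length, StringComparison.Ordinal);
        if (end < 0) return false;

        result = text.Substring(start, end - start);
        return true;
    }
}
```

The caller can then skip the replacement when the method returns false, and a try/catch can still surround the whole step as a last-resort safety net.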
Comment permalink 30 Milan Negovan |
Jeremy, good point about the regex. I guess it could benefit from a compilation flag.

The part where the form is replaced is quite touchy in the sense that "it's subject to change without notice" and not handling an exception there would bring down every web page, which would render the site useless.
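
The compilation flag could look roughly like this. The pattern is the original language-attribute expression quoted earlier in this thread; the literal replacement string and the `FilterPatterns` class are assumptions, since the article's JavaScriptMatch evaluator is not shown here.

```csharp
using System.Text.RegularExpressions;

public static class FilterPatterns
{
    // Built once and reused for every request; RegexOptions.Compiled trades a
    // slower first use for faster matching on later calls.
    public static readonly Regex ScriptLanguage = new Regex(
        "(?<=script\\s*)(language=\"javascript\")",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Assumed replacement text: swap the language attribute for a type attribute.
    public static string FixScriptLanguage(string html)
    {
        return ScriptLanguage.Replace(html, "type=\"text/javascript\"");
    }
}
```

Keeping the Regex in a static field matters as much as the flag itself: constructing a compiled Regex inside Write() would pay the compilation cost on every call.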
Comment permalink 31 Dan |
Hi All, good article ... :)

Does anyone have the code in VB.NET?

Thanks
Comment permalink 32 Fordiy |
I added this module to my existing C# project and got the same problem when compiling the code:

('MyHttpFilter.xhtmlFilter' does not implement interface member 'System.Web.IHttpModule.Dispose()'.)

Can you post the detailed procedure for plugging it into an existing project?

Thanks
Comment permalink 33 vbguy |
Is it possible for someone to post this filter using vb .net code?
Comment permalink 34 JfK |
I would like to mention one thing that is missing here: naming consistency. Generally speaking, your article is OK, Milan, but something in it causes confusion: the article's title is "... with response filters", yet in the middle there is a section titled "Installing the Request Filter". Hmm... Even more interesting, this section about the "request filter" contains the line "response.Filter = new PageFilter (response.Filter);". Obviously something's mixed up here.

I found your page through Google while searching for clues on writing a _request_ filter. Oops, it's not this page :-) Everybody writes _response_ filters, but _request_ filters are plainly less popular. I had to write one because of viewstate errors caused by mad and old mobile browsers, which sometimes URL-encode the viewstate _twice_ (!!!) or forget that the '+' sign has to be URL-encoded as well. No solution on the web. No solution anywhere.

The .NET core classes are another story. Try to change some behavior there. Good luck :-) If only I could erase the keywords private and internal from the brains of the M$ core developers. I lost 8 hours looking for any way of hooking into the viewstate before it is mangled, so that I'd have an opportunity to fix it. No no no, M$ tells me it's no good. If anybody wants to look into the devil's eyes, I advise switching Reflector on and looking at the MobilePage class. There is a private field _requestValueCollection. Now travel to the base class Page and... there is another copy of _requestValueCollection! Hooray! Who wrote this, I would like to ask, but I don't think anybody can answer. This small shitty thing keeps you from any serious viewstate manipulation in MobilePage.

Enough of this; sorry for the bloat, but I had to get it off my chest :-) In conclusion: Milan, please fix that section's title, because googling asp+request+filtering leads to your article, and you don't filter requests, do you? Regards!
Comment permalink 35 Milan Negovan |
The "request" in the context of this article is the whole chain of events: starting with a request from the client, down the ASP.NET pipeline, and the subsequent response. In this sense I do filter requests.

The issue is that we often allude to classes that handle the entire request, such as HttpRequest, HttpResponse, HttpApplication, etc. Their naming might confuse the issue of request processing.

As far as mobile controls go, I've heard from way too many people how raw those controls are. Not good.
Comment permalink 36 DOC Holiday |
I'm having trouble using your methods using VB.NET - can anyone post examples on how to do this in VB.NET?
Comment permalink 37 Euan |
Yeah vb.net would be nice
Comment permalink 38 Derek |
Thank you very much Milan, good job! Below is the VB.NET "translation"

Public Class PageFilter
Inherits Stream
Private responseStream As Stream
Private _position As Long
Private responseHtml As StringBuilder

Public Sub New(ByVal inputStream As Stream)
Me.responseStream = inputStream
Me.responseHtml = New StringBuilder
End Sub

Public Overrides ReadOnly Property CanRead() As Boolean
Get
Return True
End Get
End Property

Public Overrides ReadOnly Property CanSeek() As Boolean
Get
Return True
End Get
End Property

Public Overrides ReadOnly Property CanWrite() As Boolean
Get
Return True
End Get
End Property

Public Overrides Sub Flush()
Me.responseStream.Flush()
End Sub

Public Overrides ReadOnly Property Length() As Long
Get
Return 0
End Get
End Property

Public Overrides Property Position() As Long
Get
Return Me._position
End Get

Set(ByVal Value As Long)
Me._position = Value
End Set
End Property

Public Overrides Function Read(ByVal buffer() As Byte, ByVal offset As Integer, ByVal count As Integer) As Integer
Return Me.responseStream.Read(buffer, offset, count)
End Function

Public Overrides Function Seek(ByVal offset As Long, ByVal origin As System.IO.SeekOrigin) As Long
Return Me.responseStream.Seek(offset, origin)
End Function

Public Overrides Sub SetLength(ByVal value As Long)
Me.responseStream.SetLength(value)
End Sub

Public Overrides Sub Write(ByVal buffer() As Byte, ByVal offset As Integer, ByVal count As Integer)
Dim strBuffer As String = System.Text.UTF8Encoding.UTF8.GetString(buffer, offset, count)
Dim eof As New Regex("", RegexOptions.IgnoreCase)

If Not eof.IsMatch(strBuffer) Then
responseHtml.Append(strBuffer)

Else
responseHtml.Append(strBuffer)
Dim finalHtml As String = responseHtml.ToString()
Dim data As Byte() = System.Text.UTF8Encoding.UTF8.GetBytes(finalHtml)
Me.responseStream.Write(data, 0, data.Length)
End If
End Sub

Public Overrides Sub Close()
Me.responseStream.Close()
End Sub
End Class
Comment permalink 39 Milan Negovan |
Many thanks, Derek!
Comment permalink 40 Kieran |
Hi,

I think the error:

"'MyHttpFilter.xhtmlFilter' does not implement interface member 'System.Web.IHttpModule.Dispose()'."

can be fixed with the following:

public void Dispose () {}

K
Comment permalink 41 Sigurd |
I used a similar technique to add content to pages produced by a third party. Essentially it injects a "standard" header and footer into the html output.

I ran into some trouble regarding concurrent requests. It appears that the buffer parameter sent to the filter's Write() method includes more than just the output from a single request.

Limiting the "work area" to what was specified by the Write() method's offset and count resolved the issue.

-S
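
The point about offset and count can be shown in a few lines. In this sketch `ChunkDecoder` and both method names are hypothetical; the idea is simply that the array passed to Write() may be larger than the chunk being written, so only `count` bytes starting at `offset` belong to the current write.

```csharp
using System.Text;

public static class ChunkDecoder
{
    // Decodes only the slice the caller declared valid; bytes outside
    // [offset, offset + count) may be leftovers that are not part of this write.
    public static string Decode(byte[] buffer, int offset, int count, Encoding encoding)
    {
        return encoding.GetString(buffer, offset, count);
    }

    // The mistake described above: treating the whole array as the chunk.
    public static string DecodeWholeBuffer(byte[] buffer, Encoding encoding)
    {
        return encoding.GetString(buffer);
    }
}
```

With a buffer holding the bytes of "OLDDATA-hello-STALE", `Decode(buffer, 8, 5, Encoding.UTF8)` returns only "hello", while `DecodeWholeBuffer` returns the stale text as well.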
Comment permalink 42 Jeff Sargent |
Milan,

Love the article - I'm using this technique on the company website to throw an intermediate page before any outbound links. I don't like the practice, cause it's annoying, but we apparently have a lot of complaints about "pages being broken", and we find that the user never realized they followed an outbound link that we don't control.

Anyhow, onto the tech - the Regex I'm using matches links with http:// and https://, processes them a bit and compares them to an XML file of "internal" domains that we don't actually want to be flagged as external to our main website. If it doesn't find the domain in the list of "internal" domains, it prepends "outbound.aspx?link=" to the link.

Here's my problem - I want to exclude "outbound.aspx" from being processed under this httpmodule - when I grab the link out of the querystring (link=) and put it inside the page as a "Continue to..." link, the httpmodule processes that link also, causing a recurring link to "outbound.aspx?link=http://www.the.original.link/". How do I exclude just one page?

Thanks!
Comment permalink 43 Milan Negovan |
Jeff, I'm not sure, but this looks like an exercise in regular expressions. Would you like to send me a code snip so I can see what's causing a link loop?
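
Independently of the regular expression, one possible approach is to decide per request whether the filter gets installed at all. The sketch below assumes the module can see the request's virtual path (in an HttpModule this would presumably come from the current request, e.g. HttpRequest.Path); `FilterPolicy` and `ShouldFilter` are hypothetical names, and only "outbound.aspx" comes from the question.

```csharp
using System;

public static class FilterPolicy
{
    // Page that must not have its links rewritten (taken from the question above).
    private const string ExcludedPage = "/outbound.aspx";

    // Returns false for the excluded page and true for everything else.
    // 'requestPath' is the virtual path without the query string, e.g. "/site/outbound.aspx".
    public static bool ShouldFilter(string requestPath)
    {
        if (string.IsNullOrEmpty(requestPath)) return true;
        return !requestPath.EndsWith(ExcludedPage, StringComparison.OrdinalIgnoreCase);
    }
}
```

The module would then install the response filter only when `ShouldFilter` returns true, so the "Continue to..." link on outbound.aspx is never rewritten.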
Comment permalink 44 Jeff Huck |
Thanks for the great article. Does anyone know if this technique still results in the correct Content-length header or why that may not matter?
Comment permalink 45 Milan Negovan |
The Content-length header is correct. It's the length of the compressed content.
Comment permalink 46 PavelBure |
This code is vulnerable to a simple attack. If a malicious user could enter a closing /html tag somewhere in your site (e.g. in forums), this would cause the content to be written twice.

A regex check for the end of the document can be avoided like this:
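using System;
using System.IO;
using System.Text;
using System.Web;

public class HttpFilter : Stream
{
    private readonly Stream m_objSink;
    private readonly StringBuilder m_objResponseHtml = new StringBuilder();

    public HttpFilter(Stream sink) { m_objSink = sink; }

    // Only buffer here; nothing is written until Flush, so no end-of-document check is needed.
    public override void Write(byte[] buffer, int offset, int count)
    {
        string strBuffer = HttpContext.Current.Response.ContentEncoding.GetString(buffer, offset, count);
        m_objResponseHtml.Append(strBuffer);
    }

    public override void Flush()
    {
        string strHtmlOutput = m_objResponseHtml.ToString();
        m_objResponseHtml.Length = 0; // don't re-send the same content if Flush is called again
        //here we can change the content
        byte[] data = HttpContext.Current.Response.ContentEncoding.GetBytes(strHtmlOutput);
        m_objSink.Write(data, 0, data.Length);
        m_objSink.Flush();
    }

    // Remaining abstract Stream members; this filter is write-only.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}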

This seems to work at my site.
Comment permalink 47 Jeff Magill |
Thanks for the great article Milan. This is exactly the functionality I had been looking for though my situation is a bit different. I do have a couple issues though.

I'm having trouble understanding why you are using a HTTPModule for this. I know you said you wanted to filter the HTML early on in the request, however, it seems to me that a HTTPModule is a lot of work for something that can be accomplished quite easily elsewhere.

I also tried to test out the loss of functionality when Response.Redirect() is used. Perhaps I misunderstood your point, but in my testing, my filters remained intact despite the use of Response.Redirect().
Comment permalink 48 Milan Negovan |
It's only a matter of personal preference, I think. I like the modular design of HTTP modules. They give me a lot of flexibility in coding and maintenance.

I believe you can tap into the page filtering from the Page class itself, though.
Comment permalink 49 Marcus |
I have been using this method to get XHTML-compliant pages in ASP.NET 1.1, but now it's time to move on to ASP.NET 2.0. The output seems to be pretty nice and the code validates on http://www.w3.org/, but when I run the HTML validator "tidy" I receive one error: "ID "__VIEWSTATE" uses XML ID syntax". Does anyone have any idea how to fix this error?
Comment permalink 50 007dad |
I am having same concerns as 49 Marcus
Comment permalink 51 Ben Strackany |
Yep, ASP.NET 2.0 is much more compliant than 1.1.

You're getting a validation error in your viewstate because it has an id of __VIEWSTATE, and the HTML 4.01 spec says ids must start with a letter, not underscores.

Setting your page doctype to XHTML should resolve the issue, and/or telling Tidy (or whatever validator you're using) to validate the page as XHTML, not HTML. However, it could be that certain validators (like Tidy, perhaps) are going to complain about the underscores no matter what. If that's the case, you can disable viewstate, ignore the validation errors, or hack into the Page class & rename the ViewState to something else.
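
The HTML 4.01 rule is easy to check mechanically. The sketch below encodes the spec's definition of an ID token (a letter, followed by any number of letters, digits, hyphens, underscores, colons, and periods); `IdSyntax` and `IsValidHtml401Id` are hypothetical names used only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class IdSyntax
{
    // HTML 4.01 ID token: one letter, then letters, digits, '-', '_', ':' or '.'.
    // \z anchors at the very end so a trailing newline is not accepted.
    private static readonly Regex Html401Id = new Regex(@"^[A-Za-z][A-Za-z0-9\-_:.]*\z");

    public static bool IsValidHtml401Id(string id)
    {
        return id != null && Html401Id.IsMatch(id);
    }
}
```

Under this rule `__VIEWSTATE` is rejected because it starts with an underscore, which is what the validator message is pointing at; XML names, by contrast, are allowed to start with an underscore.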
Comment permalink 52 Ben Strackany |
You might find some tips on renaming or getting rid of ViewState here

http://www.codeproject.com/aspnet/ServerViewState.asp

and here

http://www.eggheadcafe.com/articles/20040613.asp

e.g. in the SavePageStateToPersistenceMedium and LoadPageStateFromPersistenceMedium overrides.
Comment permalink 53 Arul |
Can anyone tell me the regular expression syntax for replacing an empty div with a properly closed div?

should be replaced with .
Please tell me the syntax using Regex.Replace.
Comment permalink 54 Chris |
It is a good thing to move to more standards-compliant XHTML, but the resulting changes don't do anything except please purist web-standards types (I am one of them :) ).

In my opinion, ASP.NET folks moving toward web standards should check out the CSS Friendly Adapters (http://www.codeplex.com/cssfriendly). This project takes care of creating more accessible, semantic XHTML. The CSS Friendly Adapters, for example, change the RadioButtonList control from table markup to unordered list markup.

That said, creating true perfect shiny XHTML, CSS and JS is next to impossible using 'classic ASP.NET'. My hope is that the ASP.NET MVC architecture will get out of the way!
Comment permalink 55 Qurban Ali |
If you click Switzerland on this page, you will see a diamond-like character. Can we remove it somehow by the method you described?
Comment permalink 56 Milan Negovan |
Let's take this offline. Please see my email.
