I got so sick of messing with full-text search on this site that I decided to give the MSN Search web service a try. I’d imagine the last thing hosting companies want is for customers to create full-text catalogs: they are not going to run SQL Server Agent for everyone just to keep catalogs fresh and allow incremental updates. Besides, full-text search results are nowhere close in accuracy to those of MSN and Google.
Getting the MSN web service to run is very easy (see Getting Started with MSN Search Web Services). The SDK is concise and offers good examples and nice diagrams of the class hierarchy. Look for a .chm help file in the SDK directory.
Working with search results took me much longer, so I want to share some gotchas in case you choose this route.
Repetitive Links
Sometimes search results contain near-duplicate links, e.g. I’d get a link to a blog post as well as a couple of auxiliary pages with the same title. To filter the auxiliaries out I wrote a regular expression along the lines of:
\b(?:subscribe|unsubscribe|email_comments|<other pages>)\b
I apply this regex to keep unwanted pages out:
foreach (Result searchResult in searchResults)
{
    // Some auxiliary pages are excluded from SERPs
    if (reSearchExclusions.IsMatch (searchResult.Url))
        continue;
    …
}
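To make the filter easier to try outside the search page, here is a minimal, self-contained sketch of the same idea. It works on plain URL strings instead of the SDK's Result objects, uses only the three page names spelled out in the pattern above, and adds RegexOptions.IgnoreCase as my own assumption. The SearchFilter class and KeepContentUrls method are hypothetical names, not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SearchFilter
{
    // Only the page names given in the pattern above; extend the alternation
    // with your own auxiliary pages. IgnoreCase is an assumption on my part.
    private static readonly Regex reSearchExclusions = new Regex(
        @"\b(?:subscribe|unsubscribe|email_comments)\b",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns only the URLs that do not match the exclusions.
    public static List<string> KeepContentUrls(IEnumerable<string> urls)
    {
        var kept = new List<string>();
        foreach (string url in urls)
        {
            // Skip auxiliary pages, keep everything else.
            if (reSearchExclusions.IsMatch(url))
                continue;
            kept.Add(url);
        }
        return kept;
    }
}
```

The \b word boundaries matter here: a URL such as http://example.com/subscribers-guide.aspx is kept, because "subscribe" is followed by another word character, while http://example.com/subscribe.aspx is dropped.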
Wordy Descriptions
Search engine bots grab everything on a page, and it’s hard to expect them to provide meaningful page descriptions. They are the same blurbs you see when searching at the MSN site. There’s pretty much nothing you can do about it, unless you want to write some aggressive regular expressions and clean up descriptions.
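If you do want to tidy descriptions a little, a mild version could look like the sketch below. It is not something the SDK provides: DescriptionCleaner and Clean are hypothetical names, and the approach (collapse whitespace, then cut at a word boundary) is just one assumption about what "clean up" might mean.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionCleaner
{
    // Matches any run of whitespace (spaces, tabs, line breaks).
    private static readonly Regex reWhitespace =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Hypothetical helper: collapses whitespace and shortens the text to
    // roughly maxLength characters without cutting a word in half.
    public static string Clean(string description, int maxLength)
    {
        string text = reWhitespace.Replace(description ?? string.Empty, " ").Trim();
        if (text.Length <= maxLength)
            return text;

        // Look backwards from the limit for the last space.
        int cut = text.LastIndexOf(' ', maxLength);
        if (cut <= 0)
            cut = maxLength; // no space found: fall back to a hard cut
        return text.Substring(0, cut) + "...";
    }
}
```

For example, Clean("one two three four", 9) returns "one two...". If highlighting is turned on, keep in mind that a cut like this can fall between a start marker and its end marker, so truncate with care.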
Highlights
You can have MSN highlight search terms in titles and descriptions. MSN encloses search terms in a pair of Unicode private-use characters:
The UTF-8 characters used to mark the query words are 0xEE8080 at the beginning of the word or words and 0xEE8081 at the end (Unicode characters 0xE000 and 0xE001, respectively).
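The two notations describe the same characters, which is easy to confirm in code. The sketch below encodes the two markers as UTF-8 and prints the bytes as hex. HighlightMarkers and Utf8Hex are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class HighlightMarkers
{
    // The markers as .NET sees them in a string (UTF-16 code units).
    public const char Start = '\uE000';
    public const char End = '\uE001';

    // Returns the UTF-8 bytes of a character as dash-separated hex.
    public static string Utf8Hex(char c)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(new[] { c }));
    }
}
```

HighlightMarkers.Utf8Hex(HighlightMarkers.Start) returns "EE-80-80" and the End marker gives "EE-80-81". Since .NET strings are UTF-16, a regex should look for \uE000 and \uE001 rather than the three-byte UTF-8 sequences.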
The best way to display highlights on a web page is to wrap found query words in <span>s:
Regex reSearchHighlight = new Regex (
    @"\uE000([\s\S]+?)\uE001",
    RegexOptions.Compiled | RegexOptions.Multiline);
…
string description = reSearchHighlight.Replace (
    HttpUtility.HtmlEncode (searchResult.Description),
    @"<span class='highlight'>$1</span>");
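The regex above matches the text between each pair of said Unicode characters and wraps it in <span class='highlight'>, dropping the marker characters themselves. The description is HTML-encoded first so that the inserted <span> tags are not escaped. Remember to define the highlight class (or whatever you name it) in a style sheet: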
.highlight {
    background: yellow;
}
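For experimenting with the highlight regex outside a web project, here is a self-contained sketch of the same replacement. It swaps HttpUtility.HtmlEncode for WebUtility.HtmlEncode from System.Net so that no reference to System.Web is needed, and it takes a plain string instead of a search result. Highlighter and ToHtml are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Captures the text between the start (U+E000) and end (U+E001) markers.
    private static readonly Regex reSearchHighlight = new Regex(
        @"\uE000([\s\S]+?)\uE001",
        RegexOptions.Compiled);

    // Hypothetical helper: HTML-encodes the text, then turns each marked
    // term into a <span>. Encoding first keeps the added tags unescaped.
    public static string ToHtml(string text)
    {
        string encoded = WebUtility.HtmlEncode(text);
        return reSearchHighlight.Replace(
            encoded, "<span class='highlight'>$1</span>");
    }
}
```

Given "Tips for \uE000full-text\uE001 search & <more>", ToHtml returns "Tips for <span class='highlight'>full-text</span> search &amp; &lt;more&gt;". The marker characters pass through the encoder untouched, which is why encoding before the replacement works.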
Conclusion
The MSN Search web service offers a lot of customization options. It seems MSN will allow searches for sponsored links in the future, which promises to be interesting.