Nov 12 2008

Deep-web crawling with .NET: Getting Started

Category: DeepCrawler.NET | Matt @ 10:40

Thanks go out to Sol over at FederatedSearchBlog.com for giving me some suggestions on things to watch out for.  If you want more background information on federated search or information retrieval, go check out that site.

In the last post, I introduced the idea of creating a deep-web crawler.  I laid out the basic requirements that I've given myself, and I touched on some of the barriers to meeting those requirements.  In this post, I'm going to introduce DeepCrawler.NET, my .NET-based (prototype-stage) crawler.

DeepCrawler.NET is written in C# for Microsoft .NET 3.5.  While there is intelligence behind it, at its core it does nothing more than automate Internet Explorer.  The crawler's "brain" examines a page in IE, then tells IE what to do, such as populate a form field with a value, click a link or button, or navigate to a new URL.  To facilitate this automation, I'm currently using the open-source WatiN API.  WatiN is actually designed for creating unit tests for web interfaces, but it's proving to be a fairly nice abstraction over the alternative method of automating IE from C# (that is, using the raw COM APIs).
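As a rough sketch of what that kind of automation looks like, here is a small program in the style of WatiN's getting-started example.  It is not DeepCrawler.NET's code: the URL, the field name "q", and the button name "go" are placeholders made up for illustration, and the calls should be checked against the WatiN version you have installed.

```csharp
using System;
using WatiN.Core;

public static class WatiNSketch
{
    [STAThread]  // WatiN drives IE through COM, which expects a single-threaded apartment
    public static void Main()
    {
        // Placeholder URL; opening an IE instance navigates to it and waits for the page to load.
        using (var ie = new IE("http://www.example.com/search"))
        {
            // Populate a form field with a value (the name "q" is an assumption).
            ie.TextField(Find.ByName("q")).TypeText("deep web");

            // Click a button (the name "go" is an assumption).
            ie.Button(Find.ByName("go")).Click();

            // Enumerate the links on the page that comes back.
            foreach (Link link in ie.Links)
            {
                Console.WriteLine(link.Url);
            }

            // Navigate to a new URL.
            ie.GoTo("http://www.example.com/");
        }
    }
}
```

This needs a Windows machine with Internet Explorer and a reference to the WatiN assembly, so treat it as an outline of the three basic actions rather than something to paste in unchanged.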

The main class in WatiN is "IE", which represents an instance of the Internet Explorer browser.  There are all sorts of options you can adjust to control how WatiN "wraps" the browser, but for the most part, the defaults appear to be fine.  Now, though WatiN is designed to facilitate testing of a web form, its API is flexible enough to enable exploratory analysis of a web page.  You can easily enumerate forms, links, buttons, or anything else in the DOM tree.  Since the first task of a deep-web crawler is just to submit a query through a search form, our task is straightforward, assuming access to a magic black box that can help us make certain decisions.  First, enumerate the forms (some pages may contain multiple forms), and use the black box to select the one that is most likely the search form.  Next, enumerate the fields in the form, and use the black box to determine which fields correspond to which available query criteria (the crawler's pool of queries may be simple keywords, or they could be keywords augmented with date ranges or other values).  Finally, enumerate the buttons and links, and use the black box to determine which one to use to submit the form and begin the search.  From there, it's a simple matter of paging through the results and grabbing all the links.

By simple, I mean very NOT simple.  First, not all the links on the page are going to be for results: some may link back to the search form, some may go elsewhere on the site, some may be ads, and some are (hopefully) the links to page through the results.  Which brings up another issue: how do you determine how to page through the results?  These are open questions that I'm currently working to address and will hopefully discuss in a future post.

Ignoring those issues for now and focusing just on how you submit a form, you can see that I've skipped all the hard parts by using this magical black box.  Such a box doesn't exist, so we have to implement one.  What issues do we have to deal with?  How do we decide which form is the search form?  Once we've done that, how do we determine where to place our query criteria?  There's *nothing* that says people have to give their form fields meaningful names or IDs, so "q" could just as easily be a box for a "query" as it could be a box for entering your username.  Finally, even if we find the form and figure out how to populate it with the query we want to execute, how do we submit the form?  Some forms may use JavaScript, some may use buttons, some may use images... what can we do?  How do we determine which one to use?
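To make the black box a little more concrete, here is one way its three decisions could be written down as a C# interface, along with a deliberately naive implementation that guesses from names alone.  Everything in it is hypothetical: the types CandidateForm, FormField, IBlackBox, and NameHintBlackBox, the hint words, and the scoring weights are assumptions for illustration, not DeepCrawler.NET's actual design.  The form data is kept in plain classes so the sketch does not depend on a browser.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for what the crawler would read out of the DOM.
public class FormField
{
    public string Name { get; set; }
    public string Type { get; set; }   // e.g. "text", "password"
}

public class CandidateForm
{
    public string Id { get; set; }
    public string Action { get; set; }
    public List<FormField> Fields { get; set; }
    public List<string> SubmitControls { get; set; }  // ids of buttons, images, or script links
}

// The "black box": the three decisions needed to submit a query.
public interface IBlackBox
{
    CandidateForm SelectSearchForm(IList<CandidateForm> forms);
    FormField SelectQueryField(CandidateForm form);
    string SelectSubmitControl(CandidateForm form);
}

// A naive guesser based purely on name hints (an assumption, not a proven heuristic).
public class NameHintBlackBox : IBlackBox
{
    private static readonly string[] Hints = { "search", "query", "keyword", "find" };

    private static bool LooksLikeSearch(string text)
    {
        string t = (text ?? "").ToLowerInvariant();
        return t == "q" || Hints.Any(h => t.Contains(h));
    }

    // Higher means "more likely a search form". The weights are arbitrary.
    public static int Score(CandidateForm form)
    {
        int score = 0;
        if (LooksLikeSearch(form.Id) || LooksLikeSearch(form.Action)) score += 2;
        score += form.Fields.Count(f => f.Type == "text" && LooksLikeSearch(f.Name));
        if (form.Fields.Any(f => f.Type == "password")) score -= 3;  // probably a login form
        return score;
    }

    public CandidateForm SelectSearchForm(IList<CandidateForm> forms)
    {
        // OrderByDescending is a stable sort, so ties keep their page order.
        return forms.OrderByDescending(f => Score(f)).FirstOrDefault();
    }

    public FormField SelectQueryField(CandidateForm form)
    {
        var textFields = form.Fields.Where(f => f.Type == "text").ToList();
        // Prefer a field with a search-like name; otherwise fall back to the first text field.
        return textFields.FirstOrDefault(f => LooksLikeSearch(f.Name)) ?? textFields.FirstOrDefault();
    }

    public string SelectSubmitControl(CandidateForm form)
    {
        // No real decision here: just take the first control found.
        return form.SubmitControls.FirstOrDefault();
    }
}
```

The weakness shows up right away.  A login form whose username box happens to be named "q" is only ruled out because it also has a password field, and a form with no helpful names at all gets a score of zero no matter what it does.  Name hints can be one input to the black box, but they can't be the whole thing.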

In the next post, we'll start diving into some of these issues.  And remember, these are just the issues with meeting the *first* requirement.  After that, we still have to figure out how to do these things efficiently and intelligently.  It's a long road ahead; I wish I had more than four weeks left in the semester. :D
