Try-Catch-FAIL

Failure is inevitable.

Crawling results in DeepCrawler.NET

clock November 17, 2008 02:45 by author Matt

In the last post, I laid out DeepCrawler.NET's (primitive) strategy for finding search forms, populating them, and submitting their contents using WatiN and a heuristic search mechanism.  As I mentioned at the end of the previous post though, submitting a query is only the first step in a complicated process.  Assuming nothing goes wrong, submitting a query will get us back one or more pages of results.  The problem now becomes parsing and crawling the results. 

The Current State

First, let's look at what a typical result page consists of, using Google as an example.  There are some useless links that don't really help us: links to other pages on Google (we'll call these navigational links), sponsored links, etc.  Then there are links to individual results along with "supporting" links for each result (such as links to the cached copy, links to similar pages, etc).  Finally, there are links to subsequent pages in the result set.  Ideally, we'd like to skip the first set of links and focus the crawl just on the second two sets of links (the actual results and the subsequent result pages).  Unfortunately, this is very difficult to do.  Without encoding some site-specific knowledge into the process, how do we determine the difference between a result link and a link that we don't care about? 

For now, I've decided to punt on the issue: DeepCrawler.NET just crawls all the available links.  Fortunately, with the right crawl depth, this actually has fairly good recall (meaning we grab a large number of the actual results) at the cost of low precision (we also grab a lot of pages that we don't care about).  The high recall is achieved because the initial result page contains links to the subsequent result pages, so the crawler recursively navigates to the result pages, pulling in their results, and so on.  The low precision is due again to this recursive property: DeepCrawler.NET pulls in the junk pages, follows their links, and so-on.  It does work, but it's not perfect.

A Better Approach?

The approach I'm currently investigating is to "learn" where the good results are by executing probe queries against a source.  Assuming that the source is semi-consistent in how it presents results (which is a safe assumption, I think), the result pages for different queries should be largely the same, varying only in the actual result links, descriptions, etc.  Using this assumption, hopefully we can quickly identify navigational links and exclude them from the crawl.  Next, we can limit the crawler so that it only travels one link away from the original source's domain.  Doing this will enable us to crawl subsequent result pages by increasing the crawl depth without diminishing precision due to the crawler jumping further and further away from the results. 

The above approach isn't perfect.  First, it still doesn't filter ads, but many of the sources that I envision DeepCrawler.NET being used for won't have ads, and there's always a chance that an ad is actually relevant to what you're looking for, so maybe that isn't a terrible thing.  Second, the approach for limiting crawl depth may not work well for search sources that only list links from their own sites.  For this type of source, the crawler could still navigate too broadly away from the results.  A better solution is needed, one that can precisely identify the result links and the controls for paging through a set of results.

What's Next?

Well, I obviously still have some work to do on crawling the results.  The current approach works, but it's flawed.  I plan to review recent research literature to see if there's an existing method I can build on top of, but I suspect I'm going into muddy waters here where there's not an easy solution.  Another thing that I need to do is add richer support for handling different file types.  Right now, DeepCrawler.NET is only capable of saving HTML documents (it can "visit" any file type, though), but this can easily be solved by leveraging available functionality in WatiN. 

One thing that's completely missing from DeepCrawler.NET is solid support for sources that utilize sessions.  For such sources, the crawl must be performed immediately after executing a query.  Ideally, the crawl should be interruptible and distributable, enabling a much more flexible crawl process.  My current plan for facilitating this is to have the crawler record the "path" of actions it took to arrive at any given page in the crawl.  With that path information, the crawler should (in theory) be able to replay the actions to return to any point in the crawl at any point in time.  This would allow a "path" to be discovered by one instance of the crawler, then followed further by a completely separate instance at a different point in time. 

Questions or comments?  Suggestions or ideas?  Please leave me a comment.

Share or Bookmark this post…
  • del.icio.us
  • DotNetKicks
  • Digg
  • msdn Social
  • Reddit
  • StumbleUpon

Currently rated 5.0 by 1 people

  • Currently 5/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


DeepCrawler.NET: Alive and Kicking

clock November 14, 2008 02:10 by author Matt

Much to my surprise, getting DeepCrawler.NET up and working with basic functionality was surprisingly easy.  It's far from finished, and I haven't exhaustively tested it, but it does work.  In this post, I'll describe the current implementation with respect to how I've addressed some of the barriers raised in my last post.

How do we decide which form contains a search form?

Some sites (in particular, FedBizOpps) contain multiple forms.  As in this case, sites can contain login forms along side query forms, and we definitely don't want our crawler to try to log in with our various keywords. 

To address this barrier, DeepCrawler.NET employs a heuristic search of the forms on the page.  It calculates a probability for each form on the page by examining the form's contents (text fields, labels, buttons, etc) as well as attributes on the form, such as the form's name and ID.  The properties of each of these controls are compared to a very short list of words (query, search, keywords) that correlate well with search forms.  Right now, scores are basically binary: the heuristic either considers something a potential match or not a potential match.  Future work will add some more intelligent scoring to the process.

Another issue not fully addressed yet are forms with useless ID, name, and value descriptors for fields within the form.  Right now, DeepCrawler.NET can't do anything, but that's going to change (hopefully this weekend).  I've prototyped a search mechanism that looks for text lables (not label elements, which are much more useful and already handled by DeepCrawler.NET) by "visually" searching a grid around a form element of interest.  The search is primitive now, but early tests indicate that it will successfully locate text labels corresponding to form fields in some cases.

How do we determine where to place our query criteria?

This is somewhat addressed by the solution to the previous issue.  Within each candidate form, DeepCrawler.NET applies a heuristic test to each text box (input elements of type text) to determine if the text box is where the query should go.  The "best" text box is retained and used as our query box.  Again, this is a somewhat naive approach, but it works well enough for the sites I've evaluated DeepCrawler.NET against.  Future work will add more intelligence to the heuristics.

In its current state DeepCrawler.NET doesn't handle anything but text boxes.  That still leaves drop-down lists, multi-selects, text areas, radio buttons, and checkboxes.  Technically, there could also be hidden fields, but since Internet Explorer is serving as the crawler's "window", I'm assuming that any hidden fields will be correctly populated by the page itself.  I plan to address the remaining field types in the near future, but probably not for the first "finished" version of DeepCrawler.NET.

How do we submit the form?

So, at this point, we have a form, and we know where the query should go, but how do we submit it?  Again, this problem is addressed by the heuristic search that finds the search form.  While calculating the probability that each form is a search form, it examines buttons and images, and the best one is retained to submit the form.  Unfortunately, this does not work for forms that use JavaScript with hyperlinks to submit the form, but I'm already working on code to address that limitation.  There are other issues, too, like working with an image button that lacks useful attributes, but I probably won't address those issues in this first version of the crawler.

Moving Forward

I've described how DeepCrawler.NET finds a form, populates it, and submits it, but that's only part of the battle.  Next up is crawling the search results.  My approach is somewhat primitive, but it actually works quite well in the limited testing I've done so far.  I'll do a write-up on that at some point next week.  I also plan to release the full source code to DeepCrawler.NET after the semester ends, but if you have any questions on how I'm accomplishing anything specific (remember that I'm using WatiN right now), feel free to ask in the comments.

Share or Bookmark this post…
  • del.icio.us
  • DotNetKicks
  • Digg
  • msdn Social
  • Reddit
  • StumbleUpon

Currently rated 5.0 by 1 people

  • Currently 5/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Deep-web crawling with .NET: Getting Started

clock November 12, 2008 10:40 by author Matt

Thanks go out to Sol over at FederatedSearchBlog.com for giving me some suggestions on things to watch out for.  If you want more background information on federated search or information retrieval, go check it out that site.

In the last post, I introduced the idea of creating a deep-web crawler.  I laid out the basic requirements that I've given myself, and I touched on some of the barriers to meeting those requirements.  In this post, I'm going to introduce DeepCrawler.NET, my .NET-based (prototype-stage) crawler.

DeepCrawler.NET is written in C# for Microsoft .NET 3.5.  While there is intelligence behind it, at it's core it is doing nothing more than by automating Internet Explorer.  The crawler's "brain" examines a page in IE, then tells IE what to do, such as populate a form field with a value, click a link or button, or navigate to a new URL.  To facilitate this automation, I'm currently using the open-source WatiN API. WatiN is actually designed for creating unit tests for web interfaces, but it's proving to be a fairly nice abstraction over the alternative method of automating IE from C# (that is using the raw COM APIs). 

The main class in WatiN is "IE", which represents an instance of the Internet Explorer browser.  There's all sorts of options you can adjust to control how WatiN "wraps" the browser, but for the most part, the defaults appear to be fine.  Now, though WatiN is designed to facilitate testing of a web form, its API is flexible enough to enable exploratory analysis of a web page.  You can easily enumerate forms, links, buttons, or anything else in the DOM tree.  Since the first task of a deep-web crawler is just to submit a query through a search form, our task straightforward assuming access to a magic black box that can help you make certain decision.  First, enumerate the forms (some pages may contain multiple forms), and use the black box to select the form that most-likely contains the search form.  Next, enumerate the fields in the form, and use the black box to determine which fields correspond to which available query criteria (the crawler's pool of queries may be simple keywords, or they could be keywords augmented with date ranges or other values).  Finally, enumerate the buttons and links, and use the black box to determine which one to use to submit the form and begin the search.  From there, it's a simple matter of paging through the results and grabbing all the links.

By simple, I mean very NOT simple.  First, not all the links on the page are going to be for results, some may link back to the search form, some may go elsewhere on the site, some may be ads, and some are (hopefully) the links to page through the results.  Which brings up another issue: how do you determine how to page through the results?  These are open questions that I'm currently working to address and will hopefully discuss in a future post.

Ignoring those issues for now and focusing just on how you submit a form, you can see that I've skipped all the hard parts by using this magical black box.  Such a box doesn't exist, so we have to implement one.  What issues do we have to deal with?  How do we decide which form contains a search form?  Once we've done that, how do we determine where to place our query criteria?  There's *nothing* that says people have to give their form fields meaningful names or IDs, so "q" could just as easily be a box for a "query" as it could be a box for entering your username.  Finally, even if we find the form and figure out how to populate it with the query we want to execute, how do we submit the form?  Some forms may use JavaScript, some may use buttons, some may use images... what can we do?  How do we determine which one to use?

In the next post, we'll start diving in to some of these issues.  And remember, these are just the issues with meeting the *first* requirement.  After that, we still have to figure out how to do these things efficiently and intelligently.  It's a long road ahead, I wish I had more than four weeks left in the semester. :D

Share or Bookmark this post…
  • del.icio.us
  • DotNetKicks
  • Digg
  • msdn Social
  • Reddit
  • StumbleUpon

Currently rated 5.0 by 2 people

  • Currently 5/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


Creating a deep-web crawler with .NET: Background

clock November 10, 2008 10:15 by author Matt

For one of my graduate courses, I've decided to tackle the task of creating an intelligent agent for deep-web (AKA hidden-web) crawling.  Unlike traditional crawlers, which work by following hyperlinks from a set of seed pages, a deep-web crawler digs through search interfaces and forms on the Internet to access the data underneath, data that is typically not directly accessible via a hyperlink (and would therefore be missed by a traditional crawler).  It's a tough problem to crack.  No two search forms on the web are the same (example 1 and example 2), and there's no standard that's useful here.  Worse, many now employ client-side scripting, SSL, and other technologies that make interaction very difficult even when you focus on a single site.  When you try to scale to the myriad of possibilities that exist on the web, the problem becomes (seemingly) intractable. 

There has been quite a bit of research in this area over the last few years.  Even Google has gotten in to the game with their own deep-web crawler that is now tearing up the (hidden) web.  Still, most of the documented approaches have limitations that I don't like: some only support GET-based requests (meaning that any form that uses POST is unsupported); many don't work with client-side scripting; some try to spider an entire source (digging too deeply for a focused task), while others barely scratch the surface, missing a lot of valuable information.  So, my goal is to create a flexible deep-web crawler that meets these basic requirements:

  1. If you can browse it, you can crawl it.  The crawler should work with client-side scripting, work over SSL, and support POST requests.
  2. You give it "topics", and it does the rest.  You tell it keywords (or phrases) that you're interested in, and it will crawl a source to harvest as much information about those keywords as it can.
  3. It should be efficient. The crawler should intelligently determine which queries to submit to achieve high recall and high precision. 

For now, let's ignore requirements 2 and 3, and focus just on requirement 1.  There's a few ways we could go about this.  First, we could use low-level API classes to simulate GET, POST, etc. requests, basically  implementing an HTTP-compliant crawler.  We could parse the HTML received in response to a request, then 'figure out' what to do next (more on that in a future post).  But what happens if the HTML contains JavaScript?  What if the form requires cookies?  What if the traffic is encrypted over SSL?  Well, we could implement support for all of that eventually, but it would be very, very painful.  We would basically be implementing an automated browser.  Instead of going that route, why not just automate an existing browser?  It turns out that this option is actually fairly easy to do.  Microsoft has a COM API that can be used to manipulate Internet Explorer programmatically.  This API, though a bit rough around the edges, can be leveraged to create a crawler, which is the route I've chosen to go.  Doing so will allow me to very quickly bypass painful things like cookies, JavaScript, etc, and focus on implementing the logic to find and interact with web forms.

Time's up for today.  In the next post, I'll introduce the API I'm using for my crawler, and I'll discuss some of the barriers to meeting the first requirement.

Share or Bookmark this post…
  • del.icio.us
  • DotNetKicks
  • Digg
  • msdn Social
  • Reddit
  • StumbleUpon

Currently rated 5.0 by 2 people

  • Currently 5/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5


About Matt

I am an overworked (and apparently overpaid) software developer with aspirations of acquiring a PhD in Computer Science. I started off coding in C over a decade ago.  Since then, I've migrated from C to C++ and branched out to C#, PHP, VB.NET, JavaScript, and worked with a wide assortment of other languages that I hope to never deal with again (I'm looking at you, COBOL). Oh, and yes, I've written some Java.  Does that make me a bad person?

Disclaimer

The opinions expressed herein are my own personal opinions and do not represent my employer's view in  anyway.

© Copyright 2009

Sign in