webseer HTML Analysis Library

WebSeer is a tool to perform detailed feature extraction on web pages to assist in classification/survey tasks.

What It Is Not:

It is not a search engine, though it was designed to plug into one. It does not implement classification algorithms itself, but it provides convenient wrappers around the Weka classification API.

What It Is:

It's a collection of analysis packages that give very detailed statistics on many aspects of a web page, including HTML structure, style and script usage, text, positional and segmentation information.
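To give a rough feel for the simplest kind of HTML structure statistic, here is a minimal sketch that counts opening tags by element name. It is not webseer code and does not use the webseer API: the class TagStats and the method countTags are hypothetical names made up for this illustration, and the regular expression is only a crude stand-in for a real HTML parser.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: TagStats is a hypothetical class, not part of webseer.
class TagStats {
    // Matches an opening tag such as <p> or <a href="..."> and captures its name.
    // A regex is a rough approximation; a real analysis would use an HTML parser.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    // Counts how many times each element name is opened in the given HTML.
    // Names are lower-cased so that <P> and <p> are counted together.
    static Map<String, Integer> countTags(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String html =
                "<html><body><p>One</p><p>Two <a href=\"#\">link</a></p></body></html>";
        // Prints {a=1, body=1, html=1, p=2}
        System.out.println(countTags(html));
    }
}
```

Counts like these are only one small example of a page statistic; the style, script, positional and segmentation information mentioned above would need a full parser and layout data, which this sketch does not attempt.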

Library History

WebSeer is primarily used for the PhD research of Ryan Levering, which includes web surveys and document classification. It has also been used in Information Retrieval classes, where it removes much of the overhead of working with web pages.

News:

January 25, 2008 - First Release Approaching: The first official release of webseer is coming very shortly, in conjunction with the wrapping up of my PhD research. I'm currently finalizing the tools that will make the libraries much easier to work with.

Publications:

  • 2008. HICSS 41 Paper - Using Visual Features for Fine-Grained Genre Classification of Web Pages [PAPER]
  • 2007. HyperText 2007 Poster - Using Visual Features in Genre Classification of HTML [POSTER | PAPER]
  • 2006. Document Engineering 2006 Paper - The portrait of a common HTML web page [PAPER]
  • 2004. Master's Thesis - Multi-stage Modeling of HTML Documents

Thanks To:

YourKit is kindly supporting open source projects with its full-featured Java Profiler.
YourKit, LLC is creator of innovative and intelligent tools for profiling Java and .NET applications.
Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.