Chapter 4. Installation and Setup

webseer doesn't yet have a graphical frontend that would give users more control over how they extract information. One is in the works, but for now you'll have to use webseer as either an API or a command-line tool.

For starters, download a zip file from http://webseer.sourceforge.net. There should be two distributions: one with the source, if you're going to extend webseer, and one without, if you just want to run it. The source distribution has exactly the same structure as the code repository, so you can pull it from there as well. The binary distribution differs in that it includes the pre-compiled plugins and core library jars and leaves out the src folder they were built from. You can regenerate the plugins and the core library from the source by running the ant target "prepare" in the main build.xml file in the root directory. You should probably do that right after you download the source distribution, just to be able to run the examples.

The structure of the whole project is as follows (this is also in the README file):

bin\ - the command line tools to make executing webseer easier
conf\ - the main configuration files for webseer
docs\ - the website, this guide, and the papers that were written based on webseer
lib\ - the core dependencies and the core library after it is built by the ant build
src\java - the source code for the core of webseer

src\plugin - a directory for each plugin that is part of webseer
src\plugin\*\ - a plugin.xml file and a build.xml file that are used to create/register the plugins with webseer
src\plugin\*\src - the plugin source

nutch-plugins\ - contains a directory for each required Nutch plugin (don't touch this unless you know what you're doing)
conf\nutch - configuration for Nutch (don't touch this unless you know what you're doing)
            

You don't have to understand the structure completely, but if you're the sort that likes to poke around, it should be a good guide. In the next section we'll go into the general architecture of webseer, but for now all you really need to know is that it's very plugin-dependent. Only the bare minimum of skeleton classes is loaded when webseer first runs; the plugin jars and libraries are loaded later, as they are needed. This is part of the secret sauce that makes webseer easy to extend (though it is by no means unique or novel).
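To make the load-on-demand idea concrete, here is a minimal sketch of how a plugin registry could defer opening a plugin jar until the plugin is first requested. It is not webseer's actual code: the class name LazyPluginRegistry, its methods, and the example jar path are all hypothetical, and the real plugin loader may work quite differently.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry (not part of webseer): it remembers where each plugin
// jar lives, but only creates a class loader for a plugin on first use.
class LazyPluginRegistry {
    private final Map<String, Path> jarsById = new HashMap<>();
    private final Map<String, URLClassLoader> loadersById = new HashMap<>();

    // Cheap registration: only the path is recorded, no jar is opened.
    synchronized void register(String pluginId, Path jar) {
        jarsById.put(pluginId, jar);
    }

    // True once a class loader has been created for the plugin.
    synchronized boolean isLoaded(String pluginId) {
        return loadersById.containsKey(pluginId);
    }

    // Creates the class loader on the first request and reuses it afterwards.
    synchronized ClassLoader loaderFor(String pluginId) {
        URLClassLoader loader = loadersById.get(pluginId);
        if (loader == null) {
            Path jar = jarsById.get(pluginId);
            if (jar == null) {
                throw new IllegalArgumentException("Unknown plugin: " + pluginId);
            }
            try {
                URL[] urls = { jar.toUri().toURL() };
                loader = new URLClassLoader(urls, LazyPluginRegistry.class.getClassLoader());
            } catch (MalformedURLException e) {
                throw new IllegalStateException("Bad plugin path: " + jar, e);
            }
            loadersById.put(pluginId, loader);
        }
        return loader;
    }

    public static void main(String[] args) {
        LazyPluginRegistry registry = new LazyPluginRegistry();
        // The jar path below is made up for illustration.
        registry.register("example", Paths.get("plugins", "example", "example.jar"));
        System.out.println("loaded before use: " + registry.isLoaded("example"));
        registry.loaderFor("example");
        System.out.println("loaded after use: " + registry.isLoaded("example"));
    }
}
```

The point of the design is that startup cost stays small no matter how many plugins are installed, because registering a plugin touches nothing on disk.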

If you've downloaded the binary distribution or prepared the source, you should now be able to run the example from the first section. Woohoo, that was pretty easy!

Examine the bin\extract.bat file to understand what is really required to run webseer. All you need to do is put three things on the classpath: 1) all the libraries in \lib, 2) the configuration directories \conf and \conf\nutch, and 3) the parent directory of your \plugins directory. Then you can run webseer through the java command, like "java name.levering.ryan.webseer.WebSeer putArgumentsHere".
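If you would rather assemble that classpath programmatically than edit the batch file, the three ingredients can be sketched as follows. This is an illustration only, not a copy of bin\extract.bat: the class WebseerClasspath and its methods are hypothetical, and it assumes that the install root is the directory that contains lib, conf, and plugins.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of webseer) that builds the launch command.
class WebseerClasspath {
    // Main class name as quoted in the text.
    static final String MAIN_CLASS = "name.levering.ryan.webseer.WebSeer";

    // Classpath entries for an install rooted at `root`.
    // Assumption: `root` contains the lib, conf, and plugins directories.
    static List<String> entries(Path root) {
        List<String> jars = new ArrayList<>();
        Path lib = root.resolve("lib");
        if (Files.isDirectory(lib)) {
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(lib, "*.jar")) {
                for (Path jar : stream) {
                    jars.add(jar.toString());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        Collections.sort(jars); // directory order is unspecified, so sort for a stable result

        List<String> entries = new ArrayList<>(jars);                   // 1) libraries in lib
        entries.add(root.resolve("conf").toString());                   // 2) configuration directories
        entries.add(root.resolve("conf").resolve("nutch").toString());
        entries.add(root.toString());                                   // 3) parent of the plugins directory
        return entries;
    }

    // Full command line: java -cp <classpath> <main class> <arguments...>
    static List<String> command(Path root, List<String> webseerArgs) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(String.join(File.pathSeparator, entries(root)));
        cmd.add(MAIN_CLASS);
        cmd.addAll(webseerArgs);
        return cmd;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> cmd = command(root, Collections.singletonList("putArgumentsHere"));
        System.out.println(String.join(" ", cmd));
    }
}
```

Running it with the install directory as its only argument prints the command rather than executing it, so you can compare the result against what bin\extract.bat does before relying on it.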