In order to be the glue or backbone of a bunch of web tools, it's necessary to have a generalized way for different pieces of webseer to talk to each other. I chose to implement this backbone as a model transformation metaphor. In a very general way, webseer is responsible for taking a URI that represents a data abstraction, generating an initial model for that data, transforming that model into other models, and then doing something with those models. For example, one thing webseer was built to do (an application, not an architectural requirement) was perform web surveys. It takes lists of URLs of HTML documents, generates HTML DOM models, converts them to other models (like standard text models), and then generates a set of numerical features that represents each web page.
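The four steps above can be sketched as a chain of plain functions. This is only an illustration under stated assumptions: every class and method name below is hypothetical, the fetch step is stubbed so no network is needed, and a String stands in for a real DOM model. It is not webseer's API.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from webseer itself.
final class PipelineSketch {

    // Step 1: resolve a URI to raw bytes (stubbed so the sketch needs no network).
    static byte[] fetch(URI uri) {
        return ("<html><body>Hello from " + uri.getHost() + "</body></html>")
                .getBytes(StandardCharsets.UTF_8);
    }

    // Step 2: generate an initial model from the bytes (here, the markup as a String).
    static String generate(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Step 3: transform that model into another model (markup to plain text).
    static String toText(String html) {
        return html.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " ");
    }

    // Step 4: do something with the model (derive a few numerical features).
    static Map<String, Double> features(String text) {
        Map<String, Double> record = new LinkedHashMap<>();
        record.put("charCount", (double) text.length());
        record.put("wordCount", text.isEmpty() ? 0.0 : (double) text.split(" ").length);
        return record;
    }

    // Runs all four steps in order.
    static Map<String, Double> run(URI uri) {
        byte[] bytes = fetch(uri);
        String html = generate(bytes);
        String text = toText(html);
        return features(text);
    }

    public static void main(String[] args) {
        System.out.println(run(URI.create("http://example.com/")));
    }
}
```

Each step only depends on the output of the one before it, which is why any prefix of the chain can be used on its own.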
To use webseer, it's not necessary to embrace all these steps. For instance, if you just want a really easy way to get a DOM model from a URL, you can just ignore the extra transformation steps that are possible.
The main extension points or constructs in webseer represent each of the general steps described above: the fetcher, the generator, the transformer, and the feature extractor.
The fetcher is responsible for the IO in webseer. Concretely, this means it can take a data URI and generate bytes that are then used to generate a model. In webseer's current incarnation, it wraps and extends the Apache Nutch framework. After spending much time worrying about IO issues by implementing my own IO plugins, I realized that Nutch had a very extensible plugin architecture of its own, so I embraced this architecture and extended it.
The fetcher can be used standalone in a very simple way:
FetcherUtil fetcher = new FetcherUtil(conf);
Content content = fetcher.fetch(url);
If you know Nutch at all, you can see that this format steals directly from its syntax and style. Even the Content class is a Nutch class. There are plans to abstract this out a bit, but for now it serves its purpose.
The generator is the webseer construct responsible for turning raw bytes into a model. Technically speaking, a byte array is a model of its own, so this is really just a special case of a transformation that happens to be the first. You can either request a particular generator to be used on the content or let webseer try to figure one out for you from the content. Generally speaking, if you use a generator directly, you want your output in a particular format, so here is an example of turning the previous content into an HTML DOM model:
GeneratorUtil generator = new GeneratorUtil(conf);
HTMLDocument document = (HTMLDocument) generator.generate(content, "HTML");
Pretty easy, isn't it? webseer tries to be really powerful if you desire it, but really easy if you just want to trust its magic.
Sometimes the model that the content is naturally in is not really how you want it. One basic example would be if you want to see the HTML page as a bunch of text. Why would you want to write your own text stripper when hundreds of people have already written them before? webseer comes with one, so you can transform an HTML model into a text model:
TransformerUtil transformer = new TransformerUtil(conf);
TextSource text = (TextSource) transformer.transform(document, "HTML", "Text");
This step is a little more magical than even the last one. You can see that we had to tell webseer what type of model the input was in. That's because webseer is very anti-typing. Models don't carry with them information about what they are. This is a very powerful concept in theory because it allows us to generalize a transformation process to a bunch of different forms of input.
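One way to picture this is a registry that looks transformers up by model names instead of by the Java type of the object. The sketch below is an assumption about how such dispatch could work, not webseer's actual implementation. The class name, the "from->to" key format, and the use of a plain String as the model are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of name-keyed dispatch; not webseer's actual implementation.
final class TransformerRegistrySketch {

    // Transformers are found by "from->to" names, never by inspecting the model object.
    private final Map<String, Function<Object, Object>> transformers = new HashMap<>();

    void register(String from, String to, Function<Object, Object> transformer) {
        transformers.put(from + "->" + to, transformer);
    }

    // The caller states what the model is, because the model itself does not say.
    Object transform(Object model, String from, String to) {
        Function<Object, Object> transformer = transformers.get(from + "->" + to);
        if (transformer == null) {
            throw new IllegalArgumentException("No transformer from " + from + " to " + to);
        }
        return transformer.apply(model);
    }

    public static void main(String[] args) {
        TransformerRegistrySketch registry = new TransformerRegistrySketch();
        // The tag-stripping lambda is a stand-in for a real HTML-to-text transformer.
        registry.register("HTML", "Text", m -> ((String) m).replaceAll("<[^>]*>", ""));
        System.out.println(registry.transform("<p>hi</p>", "HTML", "Text"));
    }
}
```

The trade-off is visible in the casts: the same String could be registered under several model names, but nothing stops a caller from naming the wrong one.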
Another thing that is glossed over here is that transformations are actually a many-to-many relationship. A single model can be transformed into multiple models, or many models can be consolidated into a single model. This makes transformation an almost too powerful step. Many operations can be turned into model transformations.
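A minimal sketch of what many-to-many could look like in code is a transformation that takes a list of models and returns a list of models. The interface and the two example transformations below are hypothetical and use Strings as stand-in text models. webseer's real transformer interface may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the interface and names are illustrative only.
final class ManyToManySketch {

    // A transformation maps a list of input models to a list of output models.
    interface Transformation {
        List<Object> apply(List<Object> inputs);
    }

    // One-to-many: split each text model into one model per paragraph.
    static final Transformation SPLIT_PARAGRAPHS = inputs -> {
        List<Object> out = new ArrayList<>();
        for (Object input : inputs) {
            out.addAll(Arrays.asList(((String) input).split("\n\n")));
        }
        return out;
    };

    // Many-to-one: consolidate several text models into a single model.
    static final Transformation MERGE = inputs -> {
        StringBuilder merged = new StringBuilder();
        for (Object input : inputs) {
            if (merged.length() > 0) {
                merged.append("\n\n");
            }
            merged.append((String) input);
        }
        List<Object> out = new ArrayList<>();
        out.add(merged.toString());
        return out;
    };

    public static void main(String[] args) {
        List<Object> parts = SPLIT_PARAGRAPHS.apply(Arrays.<Object>asList("first\n\nsecond"));
        System.out.println(parts.size() + " paragraph models");
        System.out.println(MERGE.apply(parts).size() + " merged model");
    }
}
```

The ordinary one-to-one case is then just a transformation whose input and output lists each hold a single model.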
This may be a bit more than most people need, but several applications of webseer rely on generating features from a data source. A set of features is just a map of feature names to feature values, which are often numerical measurements. Features can be generated from any model similarly to how a model is transformed (in a sense, it's just a special case of model transformation):
FeatureExtractorUtil extractor = new FeatureExtractorUtil(conf);
FeatureRecord record = extractor.extract(text, "Text");
This would return to us every measurement that webseer can generate about that particular text model.
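To make the "map of feature names to feature values" idea concrete, here is a small sketch in which every registered measurement is run against a text model and the results are collected into one record. The class, the registration method, and the feature names are all hypothetical. webseer's FeatureRecord and its built-in measurements are not shown in this document, so a plain Map stands in for them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch; a plain Map stands in for webseer's FeatureRecord.
final class FeatureSketch {

    // Each measurement has a name and computes one number from a text model.
    private final Map<String, ToDoubleFunction<String>> measurements = new LinkedHashMap<>();

    void register(String name, ToDoubleFunction<String> measurement) {
        measurements.put(name, measurement);
    }

    // Runs every registered measurement and returns feature name -> feature value.
    Map<String, Double> extract(String text) {
        Map<String, Double> record = new LinkedHashMap<>();
        for (Map.Entry<String, ToDoubleFunction<String>> entry : measurements.entrySet()) {
            record.put(entry.getKey(), entry.getValue().applyAsDouble(text));
        }
        return record;
    }

    public static void main(String[] args) {
        FeatureSketch extractor = new FeatureSketch();
        extractor.register("charCount", t -> t.length());
        extractor.register("wordCount", t -> t.trim().isEmpty() ? 0 : t.trim().split("\\s+").length);
        System.out.println(extractor.extract("web seer rocks"));
    }
}
```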