So you want to use webseer in your own web exploration to make things easier on you, eh? Maybe you got fed up with Java's URLConnection class or maybe you just want to download text documents from HTML sources and don't want to deal with that part yourself.
For example, let's say our goal is to get all the text that is bolded in an HTML document. That code would look like:
    import java.util.ArrayList;
    import java.util.List;

    // myURL is the address of the page to fetch.
    Configuration conf = WebseerConfiguration.create();
    FetcherUtil fetcher = new FetcherUtil(conf);
    Content content = fetcher.fetch(myURL);
    GeneratorUtil generator = new GeneratorUtil(conf);
    HTMLDocument document = (HTMLDocument) generator.generate(content, "HTML");
    final List<String> boldStrings = new ArrayList<String>();
    document.accept(new SuperVisitor() {
        public void visit(HTMLTag tag) {
            if (HTMLUtils.isBoldTag(tag)) {
                boldStrings.add(((HTMLText) tag.getFirstChild()).getText());
            }
        }
    });
Let's go over that in detail:
Configuration conf = WebseerConfiguration.create();
This creates the configuration object that is passed to each piece of the system as it is created. The approach is borrowed from Nutch (which I imagine didn't invent the concept itself) and allows flexible runtime configuration without depending on static or singleton state. The object loads its configuration from the files webseer-default.xml, nutch-site.xml, and nutch-default.xml on the classpath, which should be in the webseer library JARs.
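To make the design choice concrete, here is a minimal sketch of handing a configuration object to components through their constructors instead of reading a static singleton. SimpleConfig, TimeoutFetcher, and the "http.timeout" key are hypothetical names made up for this illustration; they are not webseer or Nutch classes, and the layering uses only java.util.Properties.

```java
import java.util.Properties;

// Hypothetical stand-in for a configuration object; not webseer's real class.
final class SimpleConfig {
    private final Properties props;

    private SimpleConfig(Properties props) {
        this.props = props;
    }

    // Site values override defaults, similar in spirit to layering
    // a "site" file over a "default" file.
    static SimpleConfig create(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults); // defaults act as fallback
        merged.putAll(site);
        return new SimpleConfig(merged);
    }

    int getInt(String key, int fallback) {
        String value = props.getProperty(key);
        return value == null ? fallback : Integer.parseInt(value.trim());
    }
}

// Hypothetical component that receives its configuration explicitly.
final class TimeoutFetcher {
    private final int timeoutMillis;

    TimeoutFetcher(SimpleConfig conf) {
        this.timeoutMillis = conf.getInt("http.timeout", 10000);
    }

    int timeoutMillis() {
        return timeoutMillis;
    }
}

final class ConfigDemo {
    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("http.timeout", "10000");
        Properties site = new Properties();
        site.setProperty("http.timeout", "2500");

        // Two differently configured components coexist in one JVM,
        // which a single static configuration would not allow.
        TimeoutFetcher slow = new TimeoutFetcher(SimpleConfig.create(defaults, new Properties()));
        TimeoutFetcher fast = new TimeoutFetcher(SimpleConfig.create(defaults, site));
        System.out.println(slow.timeoutMillis() + " " + fast.timeoutMillis()); // prints "10000 2500"
    }
}
```

Because nothing here is static, a test or a second crawler can build its own configuration without disturbing anyone else's.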
    FetcherUtil fetcher = new FetcherUtil(conf);
    Content content = fetcher.fetch(myURL);
These lines first create a fetcher using the configuration object we initialized, then fetch a given URL and return its Content object. This particular method in FetcherUtil chooses how to fetch based on the protocol in the URL you specify, so if you pass it an http:// address it will use an HTTP fetcher. It follows HTTP redirects automatically and even has a special handler for HTML meta redirects (which is somewhat of a cheat on our pristine fetcher separation). The Content object that is returned holds some meta information about the fetch as well as the actual bytes that the fetch returned.
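The protocol lookup described above can be sketched as a map from URI scheme to handler. This is only an illustration of the idea, not webseer's implementation: SchemeFetcher and SchemeDispatcher are hypothetical names, and the registered fetchers are stubs that return canned bytes rather than touching the network.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical fetcher interface; webseer's real types may differ.
interface SchemeFetcher {
    byte[] fetch(URI uri);
}

// Picks a fetcher by looking at the scheme (protocol) part of the URI.
final class SchemeDispatcher {
    private final Map<String, SchemeFetcher> fetchers = new HashMap<>();

    void register(String scheme, SchemeFetcher fetcher) {
        fetchers.put(scheme.toLowerCase(Locale.ROOT), fetcher);
    }

    byte[] fetch(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("URI has no scheme: " + uri);
        }
        // Schemes are case-insensitive, so normalize before the lookup.
        SchemeFetcher fetcher = fetchers.get(scheme.toLowerCase(Locale.ROOT));
        if (fetcher == null) {
            throw new IllegalArgumentException("No fetcher registered for scheme: " + scheme);
        }
        return fetcher.fetch(uri);
    }
}

final class DispatchDemo {
    public static void main(String[] args) {
        SchemeDispatcher dispatcher = new SchemeDispatcher();
        // Stub fetchers standing in for real protocol implementations.
        dispatcher.register("http", uri -> "via http".getBytes(StandardCharsets.UTF_8));
        dispatcher.register("file", uri -> "via file".getBytes(StandardCharsets.UTF_8));

        byte[] body = dispatcher.fetch(URI.create("HTTP://example.com/index.html"));
        System.out.println(new String(body, StandardCharsets.UTF_8)); // prints "via http"
    }
}
```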
    GeneratorUtil generator = new GeneratorUtil(conf);
    HTMLDocument document = (HTMLDocument) generator.generate(content, "HTML");
This initializes a model generator object using the same configuration object, then explicitly requests an HTML document from the content that we fetched. We could also make the request implicitly:
HTMLDocument document = (HTMLDocument) generator.generate(content);
However, this isn't really guaranteed to return an HTMLDocument; it merely returns whatever model webseer currently generates by default for the text/html MIME type. If you want to guarantee a certain model, request it explicitly. Behind the scenes, the generator looks up which parsers it has that can generate an HTML model and uses the first one it finds to parse the bytes. At the time of writing, it was using JTidy to generate the DOM models.
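The difference between the explicit and implicit request can be sketched with a small registry. Everything in this block is an assumption made for illustration: ModelRegistry, HtmlModel, PlainTextModel, and the "TEXT" model name are hypothetical, and the sketch says nothing about how GeneratorUtil is actually built. It only shows why a cast after an implicit request can fail.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model types standing in for real document models.
final class HtmlModel {
    final String source;
    HtmlModel(String source) { this.source = source; }
}

final class PlainTextModel {
    final String text;
    PlainTextModel(String text) { this.text = text; }
}

// Hypothetical registry; the names are illustrative, not webseer's API.
final class ModelRegistry {
    // Model name -> parser that turns raw bytes into a model object.
    private final Map<String, Function<byte[], Object>> parsers = new HashMap<>();
    // MIME type -> model name used when the caller does not ask for one.
    private final Map<String, String> defaultModelByMime = new HashMap<>();

    void registerParser(String model, Function<byte[], Object> parser) {
        parsers.putIfAbsent(model, parser); // the first parser registered wins
    }

    void setDefault(String mimeType, String model) {
        defaultModelByMime.put(mimeType, model);
    }

    // Explicit request: the caller names the model it wants.
    Object generate(byte[] bytes, String model) {
        Function<byte[], Object> parser = parsers.get(model);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for model: " + model);
        }
        return parser.apply(bytes);
    }

    // Implicit request: the result depends on the current default for the MIME type.
    Object generateDefault(byte[] bytes, String mimeType) {
        String model = defaultModelByMime.get(mimeType);
        if (model == null) {
            throw new IllegalArgumentException("No default model for: " + mimeType);
        }
        return generate(bytes, model);
    }
}

final class ModelDemo {
    static ModelRegistry sampleRegistry() {
        ModelRegistry registry = new ModelRegistry();
        registry.registerParser("HTML", b -> new HtmlModel(new String(b, StandardCharsets.UTF_8)));
        registry.registerParser("TEXT", b -> new PlainTextModel(new String(b, StandardCharsets.UTF_8)));
        // Suppose the default for text/html is changed to the plain-text model.
        registry.setDefault("text/html", "TEXT");
        return registry;
    }

    public static void main(String[] args) {
        ModelRegistry registry = sampleRegistry();
        byte[] bytes = "<b>hi</b>".getBytes(StandardCharsets.UTF_8);
        Object explicit = registry.generate(bytes, "HTML");
        Object implicit = registry.generateDefault(bytes, "text/html");
        System.out.println(explicit instanceof HtmlModel); // prints "true"
        System.out.println(implicit instanceof HtmlModel); // prints "false"
    }
}
```

If you do rely on the implicit form, an instanceof check before the cast avoids a ClassCastException when the default changes.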
The document that you get back is very much like the W3C DOM model, if you are familiar with that, and has similar accessor methods for traversing the document (e.g. getRoot(), getChildNodes(), getTagName(), etc.). In fact, it's a problem that it doesn't implement the W3C interfaces, which should be fixed shortly. It wasn't really my goal to reduce compatibility.
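For readers who haven't met the W3C DOM, here is a small sketch of a recursive traversal using the JDK's own org.w3c.dom classes. It does not use webseer at all; the W3cDomWalk class and its methods are names invented for this example, and the input is well-formed XML rather than real-world HTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class W3cDomWalk {
    // Collects element names in document order by recursing through child nodes.
    static void collectTagNames(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectTagNames(children.item(i), out);
        }
    }

    // Parses a well-formed XML string and returns its element names.
    static List<String> tagNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> names = new ArrayList<>();
            collectTagNames(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // prints "[html, body, p, b]"
        System.out.println(tagNames("<html><body><p>Hi <b>there</b></p></body></html>"));
    }
}
```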
    final List<String> boldStrings = new ArrayList<String>();
    document.accept(new SuperVisitor() {
        public void visit(HTMLTag tag) {
            if (HTMLUtils.isBoldTag(tag)) {
                boldStrings.add(((HTMLText) tag.getFirstChild()).getText());
            }
        }
    });
Now we get a bit tricky. We could use a standard recursive technique to walk down the DOM tree and look for HTMLTags that are bold. However, if we really aren't interested in context and just want to get the bold strings regardless of where they appear, a visitor is much easier. Visitors are a design pattern that is way beyond the scope of this manual, but luckily webseer has super-magical visitors that make the concept a bit easier to understand. By subclassing SuperVisitor with a class that has a method "public void visit(XXX object)" and then calling Model.accept(myVisitor), the model will automatically take care of the recursion for you and call visit(XXX) whenever an object of type XXX is encountered. This use of the visitor pattern is crucial to the way webseer works, so it's good to have a small introduction to it.
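One way such a "magic" visitor could work is reflection: the base class walks the tree and looks for a public visit method whose parameter type matches each node. The sketch below is a guess at the general technique, not SuperVisitor's actual code. TreeNode, TagNode, TextNode, and ReflectiveVisitor are all hypothetical names, and the traversal here is driven by the visitor's walk method rather than by Model.accept.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical node types for the sketch.
class TreeNode {
    final List<TreeNode> children = new ArrayList<>();
    TreeNode add(TreeNode child) { children.add(child); return this; }
}

class TagNode extends TreeNode {
    final String name;
    TagNode(String name) { this.name = name; }
}

class TextNode extends TreeNode {
    final String text;
    TextNode(String text) { this.text = text; }
}

// Walks the tree and reflectively calls visit(X) on the subclass
// for the most specific X that matches each node's class.
abstract class ReflectiveVisitor {
    final void walk(TreeNode node) {
        dispatch(node);
        for (TreeNode child : node.children) {
            walk(child);
        }
    }

    private void dispatch(TreeNode node) {
        for (Class<?> c = node.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method m = getClass().getMethod("visit", c);
                m.setAccessible(true); // the visitor may be an anonymous class
                m.invoke(this, node);
                return;
            } catch (NoSuchMethodException e) {
                // No visit method for this type; try its superclass next.
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}

final class VisitorDemo {
    static List<String> boldStrings(TreeNode root) {
        final List<String> found = new ArrayList<>();
        new ReflectiveVisitor() {
            public void visit(TagNode tag) {
                if (tag.name.equalsIgnoreCase("b") && !tag.children.isEmpty()
                        && tag.children.get(0) instanceof TextNode) {
                    found.add(((TextNode) tag.children.get(0)).text);
                }
            }
        }.walk(root);
        return found;
    }

    public static void main(String[] args) {
        TreeNode root = new TagNode("html")
                .add(new TagNode("b").add(new TextNode("bold one")))
                .add(new TagNode("p").add(new TextNode("plain")))
                .add(new TagNode("B").add(new TextNode("bold two")));
        System.out.println(boldStrings(root)); // prints "[bold one, bold two]"
    }
}
```

Note that the visitor never mentions TextNode or recursion; it only declares the one type it cares about.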
Inside the visit method, we use an HTMLUtils method to check whether it's a bold tag (generally it's better to do this than to call tag.getTagName().equals("b"), both to avoid case-sensitivity problems and because it's good to avoid inline strings when possible). Then we call some DOM methods on the tag to get its contained text. Voila!