Generalization by the above constructs makes webseer dynamic and extensible, and takes care of much of the work of a web analysis tool. However, it still doesn't address the comparability/trackability problem raised earlier. Saying that two results were both obtained from DOM models generated from HTML is not enough to guarantee we're comparing apples to apples when working with an imprecise language like HTML. When we layer several complex transformations on top of one another, models become even less trackable. The general problem is that while the structure of individual models may be identical, the algorithms that transform and/or generate the contents of those models often vary.
This is probably best shown by an example: TODO
So as an alternative to identifying models by their type, webseer identifies models by the sequence of steps that produced them. This becomes clearer in the identification of features: a feature is identified by the series of steps that led to the model from which the feature is drawn, followed by an identifier for the feature itself. For example, the number of words in an HTML document:
http://ryan.levering.name/webseer/fetcher/nutch-http/1.0
http://ryan.levering.name/webseer/generator/html-tidy/1.0
http://ryan.levering.name/webseer/transformer/html-text/1.0
http://ryan.levering.name/webseer/extractor/simple-text/1.0#wordCount
Obviously, this full identifier wouldn't be useful in a display and would probably be shortened to wordCount, but the point is that every feature value is backed by an explicit definition of the process that generated it. By assigning a URI to a static implementation of a particular algorithm, as long as those algorithms are defined explicitly somewhere, someone else can figure out exactly how we generated the content. This is especially important on the web, where we are constantly fighting the volatile nature of content.
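The idea of composing a feature identifier from its provenance chain, and deriving a short display name from it, can be sketched as follows. This is a minimal illustration, not webseer's actual implementation; the function names and the space-separated joining convention are assumptions based on the example above.

```python
# Step URIs taken from the word-count example above.
STEPS = [
    "http://ryan.levering.name/webseer/fetcher/nutch-http/1.0",
    "http://ryan.levering.name/webseer/generator/html-tidy/1.0",
    "http://ryan.levering.name/webseer/transformer/html-text/1.0",
]
EXTRACTOR = "http://ryan.levering.name/webseer/extractor/simple-text/1.0"

def feature_id(steps, extractor, feature):
    """Compose the full identifier: the chain of step URIs, then the
    extractor URI with a fragment naming the feature itself.
    (Hypothetical helper; the joining convention is an assumption.)"""
    return " ".join(steps + [extractor + "#" + feature])

def display_name(full_id):
    """Short display form: just the fragment of the final extractor URI."""
    return full_id.rsplit("#", 1)[-1]

fid = feature_id(STEPS, EXTRACTOR, "wordCount")
print(display_name(fid))  # wordCount
```

Two feature values are then comparable exactly when their full identifiers match: equality of the provenance chain, not merely of the model type, is what licenses the comparison.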