- The value for TimeOut is not in seconds, as written, but in milliseconds.
There are some inconsistencies in this page:
- In Process you write "Policy: there are five types of policies offered on how to deal with robots.txt rules" but list only four types. What is the fifth one?
- For CrawlingModel you name two available models, MaxBreadth or MaxDepth. In the multiple website configuration example you're using
<CrawlingModel Type="MaxIterations" Value="20"/>.
What about this model? Is MaxBreadth supported at all (the SAXParseException says no ...)? Could you add a sentence that describes these models, or the difference between them, in a bit more detail?
Could you add a bit more about CrawlScope? Which options are allowed? As you can see on the mailing list (http://dev.eclipse.org/mhonarc/lists/smila-user/msg00108.html), this question comes up for several users.
What I am missing in the whole description of the crawlers is a word about content: what does the content actually look like (Daniel said "as is", meaning the output of the source, e.g. the complete markup from a web server), and how can I configure what the content looks like (e.g. with the HTML2Text pipelet)? Maybe you can add a word about that.
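
To make the content question concrete, here is a minimal sketch in Python of the difference I mean between content "as is" and content after a text-extraction step. It is only an illustration: it is not SMILA code and does not claim to show how the HTML2Text pipelet works, and the names `TextExtractor`, `html_to_text`, `raw_content` and `plain_content` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops tags, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(markup):
    """Hypothetical helper: reduce raw markup to its plain text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Content "as is": the complete markup as a web server would deliver it.
raw_content = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Crawled <b>page</b> text.</p></body></html>"
)

# Content after a text-extraction step.
plain_content = html_to_text(raw_content)
print(plain_content)
```

If the crawler really delivers something like `raw_content`, it would help if the page said so and pointed to the step that turns it into something like `plain_content`.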