
Talk:SMILA/Documentation/Web Crawler

Revision as of 11:33, 4 January 2011


Minor errors:

- the value for TimeOut is not, as written, in seconds but in milliseconds.

There are some inconsistencies in this page:

- in Process you write "Policy: there are five types of policies offered on how to deal with robots.txt rules" but then list only four types. What is the fifth one?

- for CrawlingModel you name two available models, MaxBreadth and MaxDepth. In the multiple-website configuration example you're using

 <CrawlingModel Type="MaxIterations" Value="20"/>. 

What about this model? Is MaxBreadth supported at all (the SAXParseException suggests it is not)? Could you maybe add a sentence describing those models in a bit more detail (or the difference between them)?
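For comparison, here is a sketch of what a depth-limited model configuration would presumably look like, by analogy with the MaxIterations example quoted above. The Type and Value attribute names are copied from that example; the model name MaxDepth is taken from the documentation page, and the value 5 is an arbitrary illustration, not a documented default:

```xml
<!-- Hedged sketch: depth-limited crawling model, by analogy with the
     MaxIterations example above. Value 5 is illustrative only. -->
<CrawlingModel Type="MaxDepth" Value="5"/>
```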

Could you add a bit more about CrawlScope? What options are allowed? As you can see on the mailing list, this question comes up for several users.

What I am missing in the whole crawler description is a word about content: what does content actually look like (Daniel said "as is", meaning the output of the source, e.g. the complete markup from a web server)? And how can I configure the shape of the content (e.g. with the HTML2Text pipelet)? Maybe you can add a word about that.
