
Talk:SMILA/Documentation/Web Crawler

There are some inconsistencies on this page:

- in Process you write "Policy: there are five types of policies offered on how to deal with robots.txt rules" but then list only four types. What is the fifth one?

- for CrawlingModel you name two available models, MaxBreadth and MaxDepth. In the multiple-website configuration example you are using

 <CrawlingModel Type="MaxIterations" Value="20"/>. 

What about this model? Is MaxBreadth supported at all (a SAXParseException says no ...)? Could you maybe add a sentence describing those models in a bit more detail (or the difference between them)?
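
To make the question concrete, here is a sketch of the three variants side by side; only the MaxIterations line is taken from the example on this page, the other two are my assumptions based on the model names the documentation mentions:

 <!-- taken from the multiple-website example on this page: -->
 <CrawlingModel Type="MaxIterations" Value="20"/>
 <!-- assumed from the documented model name MaxDepth: -->
 <CrawlingModel Type="MaxDepth" Value="5"/>
 <!-- assumed from the documented model name MaxBreadth (this one fails with a SAXParseException for me): -->
 <CrawlingModel Type="MaxBreadth" Value="100"/>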

Could you add a bit more about CrawlScope? What options are allowed? As you can see on the mailing list (http://dev.eclipse.org/mhonarc/lists/smila-user/msg00108.html), this question comes up for several users.
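
To illustrate what such a paragraph could clarify, a hypothetical configuration line; the Type value is purely an assumption on my part, since the set of allowed values is exactly what is missing from the page:

 <!-- hypothetical; the allowed Type values are the open question: -->
 <CrawlScope Type="Host"/>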

What I am missing in the whole crawler description is a word about content: what does content actually look like (Daniel said "as is", meaning the output of the source, e.g. the complete markup from a webserver), and how can I configure the format of the content (e.g. via the pipelet HTML2Text)? Maybe you can add a word about that.
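
To make the request concrete, a small before/after sketch of my understanding; the HTML input is invented, and the assumption is that HTML2Text would strip the markup down to plain text:

 <!-- content "as is": the complete markup as delivered by the webserver -->
 <html><head><title>Example</title></head><body><p>Hello world</p></body></html>
 <!-- content after a pipelet such as HTML2Text (assumption): -->
 Hello world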
