Difference between revisions of "Talk:SMILA/Documentation/Web Crawler"
Revision as of 04:06, 17 August 2010
There are some inconsistencies in this page:
- In Process you write "Policy: there are five types of policies offered on how to deal with robots.txt rules" but list only four types. What is the fifth one?
- For CrawlingModel you name two available models, MaxBreadth and MaxDepth. In the multiple website configuration example you're using
<CrawlingModel Type="MaxIterations" Value="20"/>.
What about this model? Is MaxBreadth supported at all (the SAXParseException says no ...)? Could you add a sentence that describes these models, or the differences between them, in a bit more detail?
Could you add a bit more about CrawlScope? What options are allowed? As you can see on the mailing list (http://dev.eclipse.org/mhonarc/lists/smila-user/msg00108.html), this question comes up for several users.
What I am missing in the whole description for crawlers is a word about content. What does the content actually look like (Daniel said "as is", meaning the output of the source, e.g. the complete markup from a webserver), and how can I configure the form of the content (e.g. with the HTML2Text pipelet)? Maybe you can add a word about that.