SMILA/Documentation/Web Crawler
Revision as of 06:08, 14 August 2008 by Dhazin.gmail.com
XML Index Order
The following is an example of a web crawler index order:
<?xml version="1.0" encoding="UTF-8"?>
<IndexOrderConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <DataSourceID>Web_TEST</DataSourceID>
  <SchemaID>org.eclipse.eilf.connectivity.framework.crawler.web</SchemaID>
  <DataConnectionID>
    <Crawler>MyWebCrawler</Crawler>
  </DataConnectionID>
  <CompoundHandling>No</CompoundHandling>
  <Attributes>
    <Attribute Type="String" Name="Url" KeyAttribute="true">
      <FieldAttribute>Url</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="Title">
      <FieldAttribute>Title</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="Content" HashAttribute="true" Attachment="true" MimeTypeAttribute="Content">
      <FieldAttribute>Content</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="MetaData" Attachment="false">
      <MetaAttribute Type="MetaData"/>
    </Attribute>
    <Attribute Type="String" Name="ResponseHeader" Attachment="false">
      <MetaAttribute Type="ResponseHeader">
        <MetaName>Date</MetaName>
        <MetaName>Server</MetaName>
      </MetaAttribute>
    </Attribute>
    <Attribute Type="String" Name="MetaDataWithResponseHeaderFallBack" Attachment="false">
      <MetaAttribute Type="MetaDataWithResponseHeaderFallBack"/>
    </Attribute>
  </Attributes>
  <Process>
    <WebSite ProjectName="Example Crawler Configuration" Header="Accept-Encoding: gzip,deflate; Via: myProxy" Referer="http://myReferer">
      <UserAgent Name="Crawler" Version="1.0" Description="Test crawler" Url="http://www.softaria.com" Email="crawler@example.com"/>
      <CrawlingModel Type="MaxIterations" Value="20"/>
      <CrawlScope Type="Broad">
        <Filters>
          <Filter Type="BeginningPath" WorkType="Select" Value="/test.html"/>
        </Filters>
      </CrawlScope>
      <CrawlLimits>
        <SizeLimits MaxBytesDownload="0" MaxDocumentDownload="10" MaxTimeSec="3600" MaxLengthBytes="1000000"/>
        <TimeoutLimits Timeout="10000"/>
        <WaitLimits Wait="0" RandomWait="false" MaxRetries="8" WaitRetry="0"/>
      </CrawlLimits>
      <Seeds FollowLinks="NoFollow">
        <Seed>http://www.brox.de</Seed>
      </Seeds>
      <Filters>
        <Filter Type="BeginningPath" WorkType="Unselect" Value="/something/">
          <Refinements>
            <TimeOfDay From="09:00:00" To="23:00:00"/>
            <Port Number="80"/>
          </Refinements>
        </Filter>
        <Filter Type="RegExp" WorkType="Unselect" Value="news"/>
        <Filter Type="ContentType" WorkType="Unselect" Value="image/jpeg"/>
      </Filters>
      <MetaTagFilters>
        <MetaTagFilter Type="Name" Name="author" Content="Blocked Author" WorkType="Unselect"/>
      </MetaTagFilters>
    </WebSite>
  </Process>
</IndexOrderConfiguration>
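
To check an index order before handing it to the crawler, it can help to list what it declares. The following is a minimal sketch in Python, not part of SMILA: the helper name `summarize` and the trimmed `INDEX_ORDER` string are assumptions made for illustration, and the code reads only elements and attributes that appear in the example above. It makes no claim about how the crawler itself interprets them.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example index order (hypothetical test input);
# only elements that appear in the example on this page are included.
INDEX_ORDER = """
<IndexOrderConfiguration>
  <DataSourceID>Web_TEST</DataSourceID>
  <Attributes>
    <Attribute Type="String" Name="Url" KeyAttribute="true">
      <FieldAttribute>Url</FieldAttribute>
    </Attribute>
    <Attribute Type="String" Name="Content" HashAttribute="true"
               Attachment="true" MimeTypeAttribute="Content">
      <FieldAttribute>Content</FieldAttribute>
    </Attribute>
  </Attributes>
  <Process>
    <WebSite ProjectName="Example Crawler Configuration">
      <CrawlScope Type="Broad">
        <Filters>
          <Filter Type="BeginningPath" WorkType="Select" Value="/test.html"/>
        </Filters>
      </CrawlScope>
      <Seeds FollowLinks="NoFollow">
        <Seed>http://www.brox.de</Seed>
      </Seeds>
      <Filters>
        <Filter Type="RegExp" WorkType="Unselect" Value="news"/>
        <Filter Type="ContentType" WorkType="Unselect" Value="image/jpeg"/>
      </Filters>
    </WebSite>
  </Process>
</IndexOrderConfiguration>
"""


def filter_tuple(element):
    """Return the (Type, WorkType, Value) attributes of a Filter element."""
    return (element.get("Type"), element.get("WorkType"), element.get("Value"))


def summarize(xml_text):
    """Collect the data source, key/attachment attributes, seeds and filters."""
    root = ET.fromstring(xml_text)
    attributes = root.findall("Attributes/Attribute")
    site = root.find("Process/WebSite")
    return {
        "data_source": root.findtext("DataSourceID"),
        "key_attributes": [a.get("Name") for a in attributes
                           if a.get("KeyAttribute") == "true"],
        "attachments": [a.get("Name") for a in attributes
                        if a.get("Attachment") == "true"],
        "seeds": [s.text for s in site.findall("Seeds/Seed")],
        # Filters nested inside CrawlScope
        "scope_filters": [filter_tuple(f)
                          for f in site.findall("CrawlScope/Filters/Filter")],
        # Filters placed directly under WebSite
        "site_filters": [filter_tuple(f)
                         for f in site.findall("Filters/Filter")],
    }


if __name__ == "__main__":
    for key, value in summarize(INDEX_ORDER).items():
        print(f"{key}: {value}")
```

The sketch keeps the `Filters` block inside `CrawlScope` separate from the one directly under `WebSite`, because the example configuration uses both positions.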