WORK IN PROGRESS
We are currently implementing this, so the documentation may already describe features that are not yet part of the code.
Using JobManager to run imports
The idea is to apply the jobmanagement framework for doing crawl jobs, too. The advantages are:
- we don't need a separate execution framework for crawling anymore
- integrators can use the same programming model for creating crawler components as for processing workers
- same control and monitoring APIs for crawling and processing
- better performance through inherent asynchronicity
- better error tolerance through inherent failsafety
- the crawling process can be parallelized
We can reach this goal by splitting up the crawl process into several workers. Basically, a crawling workflow always looks like this:
Workers with names starting with "(DS)" are specific to the crawled data source type. For example, crawling a file system obviously needs a different crawler worker than crawling a web server. Not every component is necessary for every data source type, and components can be adapted to add or remove functionality.
The crawling job is separated from the processing (e.g. indexing) workflow. A final worker in the crawl workflow pushes all records to the other workflow. This makes it possible to crawl several data sources into a single index. Also, in update crawl runs it is easier to detect when the actual crawling is done, so that it can be determined which records have to be deleted because they were not visited in this run.
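The delete detection described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `DeltaStore`, `mark_visited`, `find_stale`, `remove` and `finish_crawl_run` are hypothetical names for an in-memory stand-in, not the actual DeltaService API. Entries are keyed by data source, so that one source's update run does not delete records that another source contributed to the same index.

```python
from dataclasses import dataclass, field


@dataclass
class DeltaStore:
    """Hypothetical in-memory stand-in for a delta service (not the real API)."""

    # Maps (source, record_id) -> id of the crawl run that last visited the record.
    last_visited: dict = field(default_factory=dict)

    def mark_visited(self, source, record_id, run_id):
        """Remember that the given run has seen this record."""
        self.last_visited[(source, record_id)] = run_id

    def find_stale(self, source, run_id):
        """Return the ids of this source's records that the given run did not visit."""
        return sorted(
            rid
            for (src, rid), seen in self.last_visited.items()
            if src == source and seen != run_id
        )

    def remove(self, source, record_ids):
        """Forget records that have been deleted."""
        for rid in record_ids:
            del self.last_visited[(source, rid)]


def finish_crawl_run(store, source, run_id):
    """Call only after the crawl workflow is done; returns the record ids to delete."""
    stale = store.find_stale(source, run_id)
    store.remove(source, stale)
    return stale
```

Because the crawl workflow is a job of its own, "the crawl is done" is a well-defined point in time, and a function like `finish_crawl_run` can safely be called then.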
The components are:
- Crawler: (data source specific)
- two output slots:
- one for crawled resources (files, web pages)
- one for resources to crawl (outgoing links, sub-directories). This one leads to follow-up tasks for the same worker.
- The worker can create multiple output bulks for each slot per task so that the following workers can parallelize better.
- In general, it does not get the content of a resource, but only the path or URL (or whatever identifies it) and the metadata of the resource. This minimizes IO load, especially during "update runs" where most of the resources have not changed and therefore need not be fetched.
- If it has to fetch the content anyway (e.g. a web crawler has to parse HTML to find follow-up links), it may add the content to the crawled records to avoid additional fetching.
- Delta checker:
- Checks with the DeltaService whether the resource has to be added or updated, or is already up-to-date, depending on some of the metadata produced by the crawler (e.g. the modification date from the file system or from HTTP headers).
- If the resource is up-to-date, this is marked in the DeltaService and the record is not written to the output bulk.
- Otherwise the record is written to the output bulk (an additional attribute describes whether it is a new record or one to update) and must be pushed to the processing workflow in the end.
- Fetcher: (data source specific)
- Gets the content of the resource if the record does not yet contain it.
- Detects compounds (archive files such as zip or tgz) and does not fetch them, but just copies them to a compound output bulk for later extraction, as we do not want to put extremely large compound objects into bulks.
- Compound extractor: (data source specific)
- Handles compounds: fetches the compound data to a local temporary file system, extracts it and adds the resulting records to output bulks, just like the ones written by the fetcher.
- Extraction is generic, but fetching the compound file to a local temporary directory is data source specific. The compound is not fetched by the fetcher, because we do not want to add extremely large files to record bulks.
- Update Pusher:
- Pushes the resulting records to the BulkBuilder and marks them as updated in the DeltaService.
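To make the interplay of the components more concrete, here is a rough sketch of the data flow. It is an assumption-laden toy, not the planned implementation: all names (`Record`, `crawl`, `check_delta`, `fetch`, `push_updates`, `run_crawl`) are hypothetical, the data source is a plain dictionary, the DeltaService is replaced by a dict plus a set, and the steps run sequentially instead of as asynchronous tasks connected by bulks. The compound extractor is left out; compounds are only collected.

```python
from dataclasses import dataclass
from typing import Optional

COMPOUND_SUFFIXES = (".zip", ".tgz")  # archive types named in the text


@dataclass
class Record:
    path: str
    modified: int                    # metadata produced by the crawler
    content: Optional[bytes] = None  # usually left empty by the crawler
    state: str = ""                  # "new" or "update", set by the delta check


def crawl(directory, tree):
    """Crawler: returns (crawled records, sub-directories still to crawl).

    `tree` is a toy data source: directory -> {"dirs": [...], "files": {path: mtime}}.
    """
    node = tree[directory]
    records = [Record(path, mtime) for path, mtime in node["files"].items()]
    return records, list(node["dirs"])


def check_delta(records, delta, visited):
    """Delta check: drop up-to-date records, tag the others as new or update.

    `delta` maps path -> modification time stored by the last push.
    """
    changed = []
    for rec in records:
        known = delta.get(rec.path)
        if known == rec.modified:
            visited.add(rec.path)  # up to date: marked, but not passed on
            continue
        rec.state = "new" if known is None else "update"
        changed.append(rec)
    return changed


def fetch(records, read_content):
    """Fetcher: returns (records with content, compounds left for extraction)."""
    fetched, compounds = [], []
    for rec in records:
        if rec.path.endswith(COMPOUND_SUFFIXES):
            compounds.append(rec)  # not fetched, only routed onward
            continue
        if rec.content is None:    # the crawler may already have added it
            rec.content = read_content(rec.path)
        fetched.append(rec)
    return fetched, compounds


def push_updates(records, delta, visited, sink):
    """Update pusher: hand records to the processing side, mark them as updated."""
    for rec in records:
        sink.append(rec)
        delta[rec.path] = rec.modified
        visited.add(rec.path)


def run_crawl(root, tree, read_content, delta, visited, sink):
    """Drive all steps for one crawl run; returns the compounds that were found."""
    pending, compounds = [root], []
    while pending:
        records, subdirs = crawl(pending.pop(), tree)
        pending.extend(subdirs)  # follow-up tasks for the same worker
        changed = check_delta(records, delta, visited)
        fetched, found = fetch(changed, read_content)
        compounds.extend(found)
        push_updates(fetched, delta, visited, sink)
    return compounds
```

In this sketch `sink` stands in for the BulkBuilder of the processing workflow. In the real setting each function would be a worker of its own, and the lists passed between them would be output bulks that following workers can consume in parallel.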