API-Problems

Current Implementation

 /**
  * Returns an array of MObject objects. The size of the returned array may vary from call to call. The maximum size of
  * the array is determined by configuration or by the implementation class.
  * 
  * @return an array of MObject objects, or null if no more MObjects exist
  * @throws CrawlerException
  *           if any error occurs
  */
 MObject[] getNextDeltaIndexingData() throws CrawlerException, CrawlerCriticalException;

 /**
  * Returns a Record object. The parameter pos refers to the position of the MObject within the MObject[] returned by
  * getNextDeltaIndexingData().
  * 
  * @param pos
  *          the position within the MObject[]
  * @return a Record object
  * @throws CrawlerException
  *           if any error occurs
  */
 Record getRecord(int pos) throws CrawlerException, CrawlerCriticalException;

Workflow:

getNextDeltaIndexingData returns the attributes needed to generate the ID and the HASH for each entry (these attributes are flagged in the IndexOrderConfiguration).
The CrawlerController then generates the ID and the HASH.
The CrawlerController passes the ID and the HASH to the DeltaIndexingModule.
The DeltaIndexingModule reports whether the entry has changed.
For changed entries, the CrawlerController requests the Record from the Crawler.

The Crawler always returns an array (its size can be defined by the crawler). Tests have shown that this workflow improves communication performance, but the crawler developer has to implement more code and the API is a little more complicated.
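
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the SMILA implementation: DiAttributes, SketchCrawler, InMemoryCrawler and WorkflowSketch are hypothetical stand-ins, a String stands in for Record, a Map stands in for the DeltaIndexingModule, and the hash is a toy value derived from String.hashCode().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for an MObject: only the attributes flagged for ID and hash. */
final class DiAttributes {
  final String idSource;
  final String hashSource;

  DiAttributes(String idSource, String hashSource) {
    this.idSource = idSource;
    this.hashSource = hashSource;
  }
}

/** Hypothetical, reduced crawler interface; the real one returns MObject[] and Record. */
interface SketchCrawler {
  DiAttributes[] getNextDeltaIndexingData(); // null when no more entries exist

  String getRecord(int pos); // pos indexes into the frame returned last
}

/** In-memory crawler over rows of {path, lastModified, content}, for illustration only. */
final class InMemoryCrawler implements SketchCrawler {
  private final String[][] rows;
  private final int frameSize;
  private int nextRow = 0;
  private int frameStart = 0;

  InMemoryCrawler(String[][] rows, int frameSize) {
    this.rows = rows;
    this.frameSize = frameSize;
  }

  @Override
  public DiAttributes[] getNextDeltaIndexingData() {
    if (nextRow >= rows.length) {
      return null;
    }
    frameStart = nextRow;
    int end = Math.min(nextRow + frameSize, rows.length);
    DiAttributes[] frame = new DiAttributes[end - frameStart];
    for (int i = frameStart; i < end; i++) {
      frame[i - frameStart] = new DiAttributes(rows[i][0], rows[i][1]);
    }
    nextRow = end;
    return frame;
  }

  @Override
  public String getRecord(int pos) {
    return rows[frameStart + pos][2]; // the "expensive" content is read only here
  }
}

final class WorkflowSketch {
  /** Returns the records of all entries whose hash differs from the stored state. */
  static List<String> crawl(SketchCrawler crawler, Map<String, String> diState) {
    List<String> changed = new ArrayList<>();
    DiAttributes[] frame;
    while ((frame = crawler.getNextDeltaIndexingData()) != null) {
      for (int pos = 0; pos < frame.length; pos++) {
        String id = frame[pos].idSource; // ID generation, reduced to the raw attribute
        String hash = Integer.toHexString(frame[pos].hashSource.hashCode()); // toy hash
        String previous = diState.put(id, hash); // stand-in for the DeltaIndexingModule
        if (!hash.equals(previous)) {
          changed.add(crawler.getRecord(pos)); // fetch the full record only when needed
        }
      }
    }
    return changed;
  }
}
```

In this sketch the frame bookkeeping (frameStart, nextRow) sits entirely in the crawler, which illustrates the extra code the frame-based API asks of crawler developers.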

Current Problems

Crawler developers have to handle frames for getNextDeltaIndexingData and getRecord.
Attachments (attributes that are flagged as Attachment in the IndexOrder) cannot be returned with the MObject (from getNextDeltaIndexingData), because an MObject can contain only Literals, and Literals are only simple data types.
Crawlers should usually not return attachments for hashing, because this defeats the intended workflow. "Expensive" (time-consuming) operations, such as getting the content of the entry, should only be executed in getRecord().
In the current implementation, attachments (the content) are returned in the MObject as a string and then also as an attachment in the Record (probably also in the Record's MObject). That means the content is transferred three times.
Crawler developers have to understand the Record/MObject structure.
Exception handling: how should an exception be handled while calling getNextDeltaIndexingData? At the moment, the call is retried several times before crawling is stopped.
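
As a reference point for the exception-handling question, the "retry several times, then stop" behaviour could look roughly like the sketch below. It is an assumption-laden illustration, not the existing code: SketchCrawlerException, FrameSource, RetrySketch and the maxAttempts parameter are hypothetical names, and the real CrawlerException / CrawlerCriticalException split is not modelled.

```java
/** Hypothetical stand-in for a recoverable crawler error. */
final class SketchCrawlerException extends Exception {
  SketchCrawlerException(String message) {
    super(message);
  }
}

/** Hypothetical abstraction of a call such as getNextDeltaIndexingData(). */
interface FrameSource<T> {
  T next() throws SketchCrawlerException;
}

final class RetrySketch {
  /**
   * Calls the source up to maxAttempts times and returns the first successful result.
   * If every attempt fails, a SketchCrawlerException wrapping the last failure is thrown,
   * which the caller can treat as the signal to stop crawling.
   */
  static <T> T callWithRetry(FrameSource<T> source, int maxAttempts)
      throws SketchCrawlerException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    SketchCrawlerException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return source.next();
      } catch (SketchCrawlerException e) {
        last = e; // remember the failure and try again
      }
    }
    SketchCrawlerException failure =
        new SketchCrawlerException("giving up after " + maxAttempts + " attempts");
    failure.initCause(last);
    throw failure;
  }
}
```

Whether a bounded retry is the right policy, and which exception types should bypass it, is exactly the open question raised above.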

Alternatives

1) getNextDeltaIndexing returns a new class (e.g. DIEntry).
The class contains attributes with name and value; each value is stored with its object type, so every attribute and attachment can be returned.
getRecord returns only an Object[], which contains only the attributes not previously transferred.
The CrawlerController creates the Records (based on the information in the IndexOrder).

2) getNextDeltaIndexing returns a Record (containing only the DI-information attributes and attachments).
getRecord also returns a Record, which contains only the information not previously transferred.
The CrawlerController can "merge" both entries.

3) HASH/ID generation is executed in the Crawler process.
At the moment, the Crawler is based on an abstract class that should implement the communication (e.g. Tuscany). The hash/ID creation classes can be moved into the crawler-side classes, so getNextDeltaIndexing will return a prepared ID and hash.
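
To make the DIEntry alternative more concrete, here is a minimal sketch of what such a class and the controller-side record creation might look like. Only the name DIEntry comes from the text above; its methods, the RecordMergeSketch helper and the use of a Map as a stand-in for Record are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical DIEntry: named attributes whose values keep their Java object type. */
final class DIEntry {
  private final Map<String, Object> attributes = new LinkedHashMap<>();

  DIEntry put(String name, Object value) {
    attributes.put(name, value);
    return this;
  }

  Map<String, Object> attributes() {
    return attributes;
  }
}

final class RecordMergeSketch {
  /**
   * Builds a record (here simply a Map) from the DIEntry plus the values that
   * getRecord would deliver later as an Object[]. The names of the remaining
   * attributes are assumed to come from the IndexOrder.
   */
  static Map<String, Object> merge(
      DIEntry entry, String[] remainingNames, Object[] remainingValues) {
    if (remainingNames.length != remainingValues.length) {
      throw new IllegalArgumentException("names and values must have the same length");
    }
    Map<String, Object> record = new LinkedHashMap<>(entry.attributes());
    for (int i = 0; i < remainingNames.length; i++) {
      record.put(remainingNames[i], remainingValues[i]);
    }
    return record;
  }
}
```

A crawler would then fill an entry with plain Java values, for example `new DIEntry().put("path", "a.txt").put("size", 42L)`, and never touch Record or MObject itself.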

Discussion

S.Voigt: To minimize problems with the underlying communication technology and to simplify crawler development, I would prefer 1). Crawler developers only have to understand the IndexOrderConfiguration, and they can return the "attributes" with simple Java data types. There is no advantage for us if the crawler developer has to implement hashing/ID components (this only increases development complexity) and has to fill Records and MObjects.

Separation between Crawler Implementation and Communication Implementation

How can we separate the communication technology from the crawler implementation? The goal is to switch easily between, e.g., Tuscany and in-process communication without changing the crawler code.
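
One possible shape for such a separation is sketched below: the crawler implements a plain interface, and the controller talks only to a transport abstraction. All names here (SimpleCrawler, CrawlerTransport, InProcessTransport, SerializingTransport, ListCrawler, ControllerSketch) are hypothetical. No Tuscany API is used; the second transport merely simulates a remote hop by round-tripping each frame through Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

/** What a crawler developer implements; it knows nothing about communication. */
interface SimpleCrawler {
  String[] nextFrame(); // null when no more entries exist
}

/** What the controller sees; one implementation per communication technology. */
interface CrawlerTransport {
  String[] fetchFrame();
}

/** Direct method call within the same process. */
final class InProcessTransport implements CrawlerTransport {
  private final SimpleCrawler crawler;

  InProcessTransport(SimpleCrawler crawler) {
    this.crawler = crawler;
  }

  @Override
  public String[] fetchFrame() {
    return crawler.nextFrame();
  }
}

/** Simulated remote transport: every frame is serialized and deserialized again. */
final class SerializingTransport implements CrawlerTransport {
  private final SimpleCrawler crawler;

  SerializingTransport(SimpleCrawler crawler) {
    this.crawler = crawler;
  }

  @Override
  public String[] fetchFrame() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(crawler.nextFrame());
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (String[]) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("marshalling failed", e);
    }
  }
}

/** Trivial crawler that hands out predefined frames, for illustration only. */
final class ListCrawler implements SimpleCrawler {
  private final String[][] frames;
  private int next = 0;

  ListCrawler(String[]... frames) {
    this.frames = frames;
  }

  @Override
  public String[] nextFrame() {
    return next < frames.length ? frames[next++] : null;
  }
}

final class ControllerSketch {
  /** Counts all entries delivered through the given transport. */
  static int countEntries(CrawlerTransport transport) {
    int count = 0;
    String[] frame;
    while ((frame = transport.fetchFrame()) != null) {
      count += frame.length;
    }
    return count;
  }
}
```

In this arrangement, switching the communication technology means constructing a different CrawlerTransport; ListCrawler stays unchanged.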

How big should the crawler framework be (i.e. the classes that are necessary to start the Crawler process)?

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

SMILA/Specifications/CrawlerAPIDiscussion09

