SMILA/Project Concepts/IRM Improvements

Description

This page collects proposals for improving IRM: changes to the Data Model and ways to simplify the development process.

Discussion

abasalaev: Sometimes a crawler has the record data (attribute values) only at the moment it creates the MObject; later, during record creation, it has no data other than what is stored in the MObject. The developer therefore has to store the record information manually in temporary storage and fetch it again during record creation. As I understand it, these improvements solve this problem because the DataExtractor processes the full record data at once. Is that right?

Daniel Stucky:

3.1: We already discussed the problems regarding DeltaIndexing last week. The "problem" is that the HASH token is created in the CrawlerController. Therefore, all attributes needed to create the HASH must be returned. At design time, we did not expect that a HASH would be built from binary data. If this is required, we have to change the returned object from MObject to Record so that we can also include attachments. Another option (the one I suggested from the beginning) is to generate the HASH inside the Crawler and return a data structure (to be defined) that contains only the ID and the generated HASH. A side benefit is better performance, as no big binary data is exchanged. The HASH creation could be done in the abstract base class AbstractCrawler. Of course, Crawlers not implemented in Java would have to provide their own HASH creation implementations. The same applies to the creation of the ID. At the moment it is created in the CrawlerController. This could also be done in the Crawler itself.
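The second option in 3.1 could be sketched roughly as follows. This is only an illustration under assumptions: the `IdAndHash` type, the `createIdAndHash` helper and the choice of SHA-256 are hypothetical and are not part of the existing AbstractCrawler or CrawlerController.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class CrawlerHashSketch {

    /** Hypothetical lightweight result: only the ID and the HASH are returned. */
    static final class IdAndHash {
        final String id;
        final String hash;

        IdAndHash(String id, String hash) {
            this.id = id;
            this.hash = hash;
        }
    }

    /** Hypothetical helper, as it might live in an abstract crawler base class. */
    static IdAndHash createIdAndHash(String id, Map<String, String> attributes, byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Sort by attribute name so the hash does not depend on map iteration order.
            for (Map.Entry<String, String> e : new TreeMap<>(attributes).entrySet()) {
                digest.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between name and value
                digest.update(e.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator between entries
            }
            if (content != null) {
                digest.update(content); // binary data is hashed locally and not returned
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return new IdAndHash(id, hex.toString());
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 not available", ex);
        }
    }
}
```

With such a helper, the binary content never leaves the Crawler; only the short ID and HASH strings would be exchanged for delta indexing.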

3.2: I don't think that much SCA, DS and OSGi knowledge is necessary, and we should not hide important aspects of the architecture. A developer has to be aware that a Crawler is a remotable component! However, our own interfaces need to be known and understood in detail by developers. I agree that it may be helpful to provide utility classes that help create all the Records, MObjects, Attributes and so on. Perhaps we can include some of them in AbstractCrawler. I don't know if implementing another interface is the best way. I fear that it gets too restrictive (e.g. for performance-optimized implementations).

Technical proposal

(I1) Improvements of SMILA/Project_Concepts/Data_Model_and_XML_representation

The Data Model currently contains two main objects, Record and MObject, and four second-level objects: Attribute, Attachment, Annotation and Literal.

Record
    MObject
        Annotation(s)
            Named Values
            Anonymous Values
        Attribute(s)
            MObject(s) ----->
            Annotation(s) ----->
    Attachment(s)

I think this structure is very complicated for Agent/Crawler developers to use (especially for 3rd parties). Sometimes it is hard to tell an attribute from an attachment.

There is also a problem for delta indexing: sometimes a byte[] "content" is required for calculating the HASH, but at this step we only have the MObject returned from the Crawler and no Record with attachments.

On the other hand, setting a byte[] is prohibited for Literal Attribute values.

The suggestion is to avoid Attachments (and maybe also Annotations) and to set all data as Attributes. If some binary data has to be stored separately, we may implement dynamic linking of the value with some storage inside the AttributeImpl setter/getter methods.

This may make the structure simpler, for example:

Record
    Attribute(s)
        Attribute(s)
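The idea of linking an attribute value to a storage inside the setter/getter could look roughly like the sketch below. All names here (`BinaryStorage`, `InMemoryBinaryStorage`, `StorageLinkedAttribute`) are hypothetical stand-ins for illustration and do not reflect the actual AttributeImpl; the in-memory storage only keeps the example runnable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical stand-in for an external binary storage. */
interface BinaryStorage {
    void store(String key, byte[] data);
    byte[] load(String key);
}

/** In-memory storage, used only to keep the sketch self-contained. */
class InMemoryBinaryStorage implements BinaryStorage {
    private final Map<String, byte[]> data = new HashMap<>();

    public void store(String key, byte[] bytes) {
        data.put(key, bytes.clone());
    }

    public byte[] load(String key) {
        byte[] bytes = data.get(key);
        return bytes == null ? null : bytes.clone();
    }
}

/** Hypothetical attribute: small values stay inline, byte[] values go to the storage. */
class StorageLinkedAttribute {
    private final String name;
    private final BinaryStorage storage;
    private Object inlineValue;
    private String storageKey;

    StorageLinkedAttribute(String name, BinaryStorage storage) {
        this.name = name;
        this.storage = storage;
    }

    void setValue(Object value) {
        if (value instanceof byte[]) {
            // Binary data is written to the storage; only the key is kept in the attribute.
            storageKey = name + "-" + UUID.randomUUID();
            storage.store(storageKey, (byte[]) value);
            inlineValue = null;
        } else {
            inlineValue = value;
            storageKey = null;
        }
    }

    Object getValue() {
        // The caller does not see whether the value was kept inline or loaded from storage.
        return storageKey != null ? storage.load(storageKey) : inlineValue;
    }

    boolean isStoredExternally() {
        return storageKey != null;
    }
}
```

In this sketch a crawler developer would only call `setValue` and `getValue`; deciding between attribute and attachment is no longer the developer's concern. Cleanup of overwritten storage entries is left out.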

Note: The discussion of this should take place in the comments of SMILA/Project_Concepts/Data_Model_and_XML_representation.

(I2) Improvements of Crawler development process (SMILA/Documentation/How_to_implement_a_Crawler)

I think that implementing the Crawler interface is too hard for 3rd-party developers. It could be greatly simplified, which may also solve additional problems, for example calculating the HASH. At the moment too much knowledge about technologies is required (SCA, Declarative Services, OSGi, our interfaces...).

I suggest the following interface:

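interface DataExtractor {
 // Called once before crawling starts, with the index order configuration.
 void start(IndexOrderConfiguration config);
 // Advances to the next record; returns false when no records are left.
 boolean moveNext();
 // Reads the named attribute of the current record.
 Object readAttribute(String name);
 // Called once when crawling is finished.
 void finish();
}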

A single wrapper class, "DefaultCrawler", would then be written that implements the Crawler interface and uses the user's "DataExtractor" object for crawling (creating Records, reading attributes when required, calculating the HASH, creating arrays of objects for sending remotely, and so on). For example, this wrapper may be used in a "DefaultCrawler" bundle which accepts multiple Eclipse plug-ins providing 3rd-party "DataExtractor" objects, or it may be used manually.
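To make the division of work concrete, here is a minimal sketch of a 3rd-party extractor and a wrapper loop. It is an assumption-laden illustration, not the real Crawler interface: `SimpleDataExtractor` is a simplified copy of the proposed interface with the configuration reduced to a plain map, records are represented as plain maps, and `ListDataExtractor`, `DefaultCrawlerSketch` and the `_HASH` key are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified copy of the proposed interface; the config is a plain map in this sketch. */
interface SimpleDataExtractor {
    void start(Map<String, String> config);
    boolean moveNext();
    Object readAttribute(String name);
    void finish();
}

/** Hypothetical 3rd-party extractor that walks an in-memory list of rows. */
class ListDataExtractor implements SimpleDataExtractor {
    private final List<Map<String, Object>> rows;
    private int position = -1;

    ListDataExtractor(List<Map<String, Object>> rows) {
        this.rows = rows;
    }

    public void start(Map<String, String> config) { position = -1; }
    public boolean moveNext() { return ++position < rows.size(); }
    public Object readAttribute(String name) { return rows.get(position).get(name); }
    public void finish() { position = rows.size(); }
}

/** Hypothetical wrapper in the spirit of "DefaultCrawler"; records are plain maps here. */
class DefaultCrawlerSketch {

    static List<Map<String, Object>> crawl(SimpleDataExtractor extractor, List<String> attributeNames) {
        List<Map<String, Object>> records = new ArrayList<>();
        extractor.start(new LinkedHashMap<>());
        try {
            while (extractor.moveNext()) {
                Map<String, Object> record = new LinkedHashMap<>();
                StringBuilder hashInput = new StringBuilder();
                for (String name : attributeNames) {
                    Object value = extractor.readAttribute(name);
                    record.put(name, value);
                    // Assumes values have a stable toString(); byte[] would need separate handling.
                    hashInput.append(name).append('=').append(value).append('\n');
                }
                // The wrapper, not the extractor author, computes the HASH.
                record.put("_HASH", sha256Hex(hashInput.toString()));
                records.add(record);
            }
        } finally {
            extractor.finish();
        }
        return records;
    }

    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The extractor author only implements the four methods; record assembly and HASH calculation stay in the wrapper.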

Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.