


Current Implementation

 /**
  * Returns an array of MObject objects. The size of the returned array may vary from call to call. The maximum size of
  * the array is determined by configuration or by the implementation class.
  * @return an array of MObject objects or null, if no more MObjects exist
  * @throws CrawlerException
  *           if any error occurs
  */
 MObject[] getNextDeltaIndexingData() throws CrawlerException, CrawlerCriticalException;

 /**
  * Returns a Record object. The parameter pos refers to the position of the MObject in the MObject[] returned by
  * getNextDeltaIndexingData().
  * @param pos
  *          the position referring to a MObject[]
  * @return a Record object
  * @throws CrawlerException
  *           if any error occurs
  */
 Record getRecord(int pos) throws CrawlerException, CrawlerCriticalException;


  1. getNextDeltaIndexingData should return attributes that are needed to generate the ID and the HASH for the entry
    (they are flagged in the IndexOrderConfiguration)
  2. The CrawlerController then generates the ID and the HASH
  3. Communication with DeltaIndexingModule (ID and HASH needed)
  4. The DeltaIndexingModule returns the information whether the entry has changed or not
  5. For changed entries the CrawlerController queries the Record from the Crawler

The Crawler always returns an array (the size can be defined by the crawler). Tests have shown that this workflow improves communication performance, but the crawler developer has to implement more code and the API is a little more complicated.
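The five steps above can be sketched as a controller-side loop. The following is a self-contained simulation, not the real API: `MObject`, the `DeltaIndexingModule` state, and all method names besides those quoted above are simplified stand-ins.

```java
// Self-contained sketch of the delta-indexing workflow described above.
// All types here are simplified stand-ins for the real SMILA classes.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeltaIndexingWorkflow {

    // Stand-in DeltaIndexingModule state: last known HASH per ID (steps 3-4).
    static final Map<String, String> knownHashes = new HashMap<>();

    /** Returns true if the entry is new or its hash differs from the stored one. */
    static boolean hasChanged(String id, String hash) {
        return !hash.equals(knownHashes.put(id, hash));
    }

    public static void main(String[] args) {
        // Step 1: the crawler returned ID/HASH attributes for a batch of entries.
        String[][] batch = { {"doc-1", "hashA"}, {"doc-2", "hashB"} };
        knownHashes.put("doc-2", "hashB"); // doc-2 is already indexed, unchanged

        List<String> toFetch = new ArrayList<>();
        for (int pos = 0; pos < batch.length; pos++) {
            String id = batch[pos][0];    // step 2: controller derives the ID ...
            String hash = batch[pos][1];  // ... and the HASH from the attributes
            if (hasChanged(id, hash)) {   // steps 3-4: ask the DeltaIndexingModule
                toFetch.add(id);          // step 5: call getRecord(pos) only for these
            }
        }
        System.out.println(toFetch);      // only doc-1 needs a full Record
    }
}
```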

Current Problems

  • Crawler developers have to handle frames for getNextDeltaIndexing and getRecord.
  • Attachments (attributes that are flagged as Attachment in the IndexOrder) cannot be returned with the MObject (via getNextDeltaIndexing), because an MObject can contain only Literals, and Literals are only simple data types.
  • A Crawler should usually not return attachments for hashing, because that destroys the intended workflow: "expensive" (time-consuming) operations like getting the content of the entry should only be executed with getRecord().
  • In the current implementation the attachment (the content) is returned in the MObject as a string and then returned again as an attachment in the record (probably it is also returned in the record as an MObject). That means the content is transferred three times.
  • Crawler developers have to understand the Record/MObject structure.
  • Exception handling: how should an exception be handled while calling getNextDeltaIndexing? At the moment it retries several times before stopping the crawl.


  1. getNextDeltaIndexing returns a new class (e.g. DIEntry)
    the class contains attributes with name and value; the value is stored as type Object, therefore every attribute and attachment can be returned
    getRecord returns only Object[], containing only the attributes not previously transferred
    the CrawlerController creates the Records (based on the information in the IndexOrder)
  2. getNextDeltaIndexing returns a Record (containing only the DI-information attributes and attachments)
    getRecord also returns a Record, containing only the information not previously transferred
    the CrawlerController can "merge" both entries
  3. HASH/ID generation is executed in the Crawler process.
    At the moment the Crawler is based on an abstract class that should implement the communication implementation (like Tuscany). Hash/ID creation classes can be moved into the crawler site classes. Thus getNextDeltaIndexing will return the prepared ID and Hash.
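Proposal 1 can be illustrated with a minimal sketch. Only the class name DIEntry comes from the proposal; the method names and internals below are assumptions.

```java
// Hypothetical sketch of proposal 1: a DIEntry holds attributes by name with
// Object values, so both simple types and attachments (byte[]) fit in one
// container. Everything except the name DIEntry is an assumption.
import java.util.LinkedHashMap;
import java.util.Map;

public class DIEntry {
    private final Map<String, Object> attributes = new LinkedHashMap<>();

    /** Stores any attribute, simple value or attachment alike. */
    public void setAttribute(String name, Object value) {
        attributes.put(name, value);
    }

    public Object getAttribute(String name) {
        return attributes.get(name);
    }

    public static void main(String[] args) {
        DIEntry entry = new DIEntry();
        entry.setAttribute("Url", "http://example.org/a.pdf"); // simple type
        entry.setAttribute("Content", new byte[] {1, 2, 3});   // attachment
        System.out.println(entry.getAttribute("Url"));
    }
}
```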


Sebastian Voigt: To minimize problems with the underlying communication technology and to simplify crawler development, I would prefer 1): crawler developers only have to understand the IndexOrderConfiguration and can return the "attributes" as simple Java data types. There is no advantage for us in having the crawler developer implement hashing/ID components (it only increases development complexity) and fill Records and MObjects.

Daniel Stucky: Personally, I prefer to let the Crawler generate ID and HASH. It is beneficial for performance, as less data has to be transferred between Crawlers and the CrawlerController. I don't see additional complexity: not every Crawler has to implement its own methods to create ID/HASH, it only has to use them. Such methods can be made available by utility classes or an abstract base class. If someone desperately wants to implement these things on his own, he's free to do it and has to bear the consequences.

Concerning the return types, I think that getNextDeltaIndexing() should return an array of a new data type DIInfo that contains only the ID (Id) and the HASH (String). As there are 2 concrete data types (Id and String), there is no need to use MObjects or Records; it is still possible, though. For the return type of getRecord() one could simply use a Map<String, Object> and create the Record objects on the CrawlerController. In this way a Crawler may provide data that is not convertible into a Record (at least not automatically/generically), and we would have fewer dependencies on other bundles. On the other hand, a Record object has more constraints and allows a Crawler to provide additional information about the data using annotations (sadly I currently don't have an example for a use case). Another issue could be semantics: at the moment it is totally unclear how semantics are added to/associated with Records, and using the same objects throughout the system may make things easier. I do agree that creation of Records, MObjects and Literals is cumbersome, so we should adapt those APIs or add utility methods to make creation easier, regardless of whether they are used in Crawlers or in the CrawlerController.
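Daniel's proposed DIInfo type can be sketched as follows. The field content (an ID and a hash String) follows his description; since the real Id class is not shown here, it is stubbed as a plain String, and the Map-based getRecord() result is likewise only illustrated.

```java
// Sketch of the DIInfo data type described above: it carries only the ID and
// the HASH, so getNextDeltaIndexing() could return DIInfo[] without MObjects
// or Records. The Id type is stubbed as String; names beyond DIInfo/getRecord
// are assumptions.
import java.util.LinkedHashMap;
import java.util.Map;

public class DIInfo {
    private final String id;   // stand-in for the concrete Id type
    private final String hash;

    public DIInfo(String id, String hash) {
        this.id = id;
        this.hash = hash;
    }

    public String getId() { return id; }
    public String getHash() { return hash; }

    public static void main(String[] args) {
        DIInfo info = new DIInfo("doc-1", "someHashValue");
        // getRecord() result as suggested: a plain Map, converted into a
        // Record object on the CrawlerController side.
        Map<String, Object> recordData = new LinkedHashMap<>();
        recordData.put("Title", "Example");
        recordData.put("Content", new byte[] {1, 2, 3});
        System.out.println(info.getId() + " -> " + recordData.keySet());
    }
}
```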

Separation between Crawler Implementation and Communication Implementation

How can we separate the communication technology from the Crawler implementation? The goal is to switch easily between e.g. Tuscany and in-process communication without changing the code of the crawlers.

Daniel Stucky: Actually Tuscany (SCA) is the technology that allows separation of communication technology and business logic. The wiring of components allows us for example to let the CrawlerController communicate with Crawlers in Process, via RMI, webservice, etc. by configuration. I think your question is "Is it possible to NOT use Tuscany for in process communication without changing code for crawlers?". There are several issues:

  • in-process communication without Tuscany may be a valid request, as it leads to better performance. Even when using binding.sca, Tuscany generates proxy objects that slow down communication. Perhaps we should do some tests (see Performance Evaluation on page SMILA/Project Concepts/IRM)
  • most of the Tuscany features do not need actual coding (e.g. implementation of interfaces) but are enabled by code annotations. These annotations do not interfere with the crawler code if Tuscany is not used at runtime (for compilation Tuscany annotation classes are needed of course)
  • the concept was designed with Tuscany/SCA functionality in mind. So there are several features that come automatically with Tuscany (like handling of conversations/sessions, or using the ComponentContext to determine the Crawler ID). This allows a Crawler to crawl multiple DataSources in parallel by automatically providing multiple instances. If Tuscany is not used, this feature has to be reimplemented by each Crawler; and if it is reimplemented, then it makes no sense to use its Tuscany counterpart when using Tuscany. The ComponentContext is used to get the Crawler's ID from the component description; it is used for Crawler detection by the CrawlerController

So what is the gain for a Crawler developer? I don't see any benefits regarding simplification. In contrast, the developer has to take care of multithreading and session handling. If you see any problems with the technology in the Crawler area, then we should discuss whether CrawlerController and Crawler should run in the same VM and NOT use Tuscany in any case. If Crawlers in non-Java technologies are needed, integration is done in traditional ways (e.g. JNI, Corba, etc.) using a Java proxy. And is Tuscany then still a valid technology for distributing the ConnectivityManager and the BPEL services?

How big should the Crawler Framework be (the classes that are necessary to start the Crawler process)?

Daniel Stucky: I think we should try to keep the Crawler Framework as small as possible. So I guess we have to provide separate bundles for interfaces and implementations, as is already done in org.eclipse.smila.connectivity and org.eclipse.smila.connectivity.impl. Also a restructuring of the utility classes may be necessary.

Alternate opinion

Ivan Churkin: I have an opinion alternate to Daniel's. But before presenting it, I want to summarize.

The main goal of the framework is to offer a convenient API to 3rd-party crawler developers. To satisfy this goal, it has to possess the following characteristics, in my opinion.

  • Simplicity.
  • Independence (from 3rd-party technologies, like SCA).
  • Effectiveness (a finished crawler should interact with the framework efficiently).

Unfortunately, the current crawler API does not possess even one characteristic from this list!

  • It's hard to implement.
  • It depends on SCA.
  • It interacts inefficiently with the framework, for example when the HASH has to be calculated from the CONTENT, as for a web crawler. As a result, the crawler sends the CONTENT as an additional attribute to the CrawlerController only for calculating the HASH. Moreover, it's impossible to use the web crawler for downloading binary content, because DIInfo is based on string Literals.

In my opinion this is absolutely unacceptable.

The problem is that this API was designed specifically for SCA. It's not user-friendly. Additionally, it offers only one simplification of development: common HASH calculation on the CrawlerController side. This simplification breaks effectiveness and creates additional issues like the "content- or binary-based HASH" problem.
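The "content-based HASH" problem disappears if the crawler hashes locally and ships only the digest. A minimal sketch using the standard java.security.MessageDigest API (the method name hashContent is an assumption, not part of any SMILA interface):

```java
// Sketch: compute a content-based HASH inside the crawler, so only the short
// hex digest (not the content itself) has to cross to the CrawlerController.
// Works for binary content too, since it hashes raw bytes.
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentHash {
    static String hashContent(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] pageContent = "binary or text content".getBytes();
        // 64 hex characters cross the wire instead of the whole content
        System.out.println(hashContent(pageContent));
    }
}
```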

I think the solution is to split the crawler API and the communication API. The Crawler interface should be very simple, something like the following:

interface Crawler {
 void start(IndexOrderConfiguration config);
 boolean next();
 Object getAttribute(String name);
 byte[] getAttachment(String name);
 void finish();
}

Or, maybe, even better:

interface DataSourceReference {
 Object getAttribute(String name);
 byte[] getAttachment(String name);
}

interface Crawler {
 void start(IndexOrderConfiguration config);
 DataSourceReference next();
 void finish();
}

The communication interface will depend on the communication technology used; for SCA it will be similar to the currently used Crawler interface. The main benefit is that a reference implementation (RI) of the communication interface will be added to the framework. This gets the ball rolling: crawler developers will mainly implement the very simple interface and just use the ready communication RI. On the other side, it will still be possible to write and use one's own implementation of the communication interface if the RI does not fit (not sure that this is really required).

I see many benefits.

  • All the hard and unclear work is moved into the communication RI, written once. All crawler developers will be happy ;)
  • It's more flexible regarding transport protocols. For example, if the transport is changed (from SCA to something else), we have to change only one class in the framework, and we do not have to fix all (3rd-party) crawlers; they remain the same.
  • Problems like the "content-based HASH" disappear.
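The proposed split can be sketched as follows. Only the simple Crawler interface (first variant, with start(config) omitted here for self-containment) comes from the text above; the adapter name CommunicationAdapter and its batching behavior are assumptions about what a communication RI could look like.

```java
// Hypothetical sketch of the proposed split: crawler developers implement only
// the simple Crawler interface, while a written-once communication RI batches
// entries for the transport layer (e.g. SCA). Names other than Crawler and its
// methods are assumptions.
import java.util.ArrayList;
import java.util.List;

interface Crawler {
    boolean next();
    Object getAttribute(String name);
    byte[] getAttachment(String name);
    void finish();
}

// The communication RI: pulls entries one by one from the simple Crawler and
// groups them into batches, as an SCA-style array-based interface requires.
class CommunicationAdapter {
    private final Crawler crawler;
    private final int batchSize;

    CommunicationAdapter(Crawler crawler, int batchSize) {
        this.crawler = crawler;
        this.batchSize = batchSize;
    }

    /** Returns up to batchSize values of one attribute; empty when exhausted. */
    List<Object> nextBatch(String attributeName) {
        List<Object> batch = new ArrayList<>();
        while (batch.size() < batchSize && crawler.next()) {
            batch.add(crawler.getAttribute(attributeName));
        }
        return batch;
    }
}

public class CommunicationRiSketch {
    // Minimal in-memory Crawler used to demonstrate the adapter.
    static Crawler stubCrawler(String... ids) {
        return new Crawler() {
            int pos = -1;
            public boolean next() { return ++pos < ids.length; }
            public Object getAttribute(String name) { return ids[pos]; }
            public byte[] getAttachment(String name) { return new byte[0]; }
            public void finish() { }
        };
    }

    public static void main(String[] args) {
        CommunicationAdapter ri =
            new CommunicationAdapter(stubCrawler("a", "b", "c"), 2);
        System.out.println(ri.nextBatch("Id")); // first batch of two entries
        System.out.println(ri.nextBatch("Id")); // remaining entry
    }
}
```

If the transport changes, only CommunicationAdapter has to change; every 3rd-party Crawler implementation remains untouched.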