SMILA/Specifications/CrawlerAPIDiscussion09
Revision as of 05:52, 21 August 2008

API Problems

Current Implementation

 /**
  * Returns an array of MObject objects. The size of the returned array may vary from call to call. The maximum size of
  * the array is determined by configuration or by the implementation class.
  * 
  * @return an array of MObject objects or null, if no more MObjects exist
  * @throws CrawlerException
  *           if any error occurs
  * @throws CrawlerCriticalException
  *           if any critical error occurs
  */
 MObject[] getNextDeltaIndexingData() throws CrawlerException, CrawlerCriticalException;


 /**
  * Returns a Record object. The parameter pos refers to the position of the MObject in the MObject[] returned by
  * getNextDeltaIndexingData().
  * 
  * @param pos
  *          the position referring to the MObject[]
  * @return a Record object
  * @throws CrawlerException
  *           if any error occurs
  * @throws CrawlerCriticalException
  *           if any critical error occurs
  */
 Record getRecord(int pos) throws CrawlerException, CrawlerCriticalException;


Workflow:

  1. getNextDeltaIndexingData should return the attributes that are needed to generate the ID and the HASH for the entry
    (they are flagged in the IndexOrderConfiguration)
  2. The CrawlerController then generates the ID and the HASH
  3. Communication with the DeltaIndexingModule (ID and HASH needed)
  4. The DeltaIndexingModule returns the information whether the entry has changed or not
  5. For changed entries the CrawlerController queries the Record from the Crawler

The Crawler always returns an array (its size can be defined by the crawler). Tests have shown that this workflow increases communication performance, but the crawler developer has to implement more code and the API is a little more complicated.
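The five steps above can be sketched as a controller loop. This is a minimal sketch only: MObject, Record and the DeltaIndexingModule are reduced to illustrative stand-ins, and the hash shown is a placeholder, not SMILA's actual ID/HASH generation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the five-step delta-indexing workflow described above.
public class ControllerLoopSketch {

    // Reduced stand-ins for the real MObject/Record types (illustrative only).
    record MObject(String idAttribute, String hashRelevantData) {}
    record Record(String id, String content) {}

    interface Crawler {
        MObject[] getNextDeltaIndexingData(); // step 1: batch of DI attributes
        Record getRecord(int pos);            // step 5: full record on demand
    }

    interface DeltaIndexingModule {           // steps 3+4: changed or not?
        boolean hasChanged(String id, String hash);
    }

    static List<Record> crawlOnce(Crawler crawler, DeltaIndexingModule di) {
        List<Record> changed = new ArrayList<>();
        MObject[] batch;
        // step 1: pull batches until the crawler signals the end with null
        while ((batch = crawler.getNextDeltaIndexingData()) != null) {
            for (int pos = 0; pos < batch.length; pos++) {
                // step 2: the CrawlerController generates ID and HASH
                String id = batch[pos].idAttribute();
                String hash = Integer.toHexString(batch[pos].hashRelevantData().hashCode());
                // steps 3+4: ask the delta-indexing module whether the entry changed
                if (di.hasChanged(id, hash)) {
                    changed.add(crawler.getRecord(pos)); // step 5: only for changed entries
                }
            }
        }
        return changed;
    }
}
```

Note how getRecord(pos) is only called for entries the DeltaIndexingModule reports as changed, which is where the communication-performance gain of the batched workflow comes from.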

Current Problems

  • Crawler developers have to handle frames for getNextDeltaIndexing and getRecord
  • Attachments (attributes that are flagged as Attachment in the IndexOrder) cannot be returned with the MObject (with getNextDeltaIndexing), because an MObject can contain only Literals and Literals are only simple data types
  • A Crawler should usually not return attachments for hashing, because that destroys the intended workflow: "expensive" (time-consuming) operations like getting the content of the entry should only be executed in getRecord()
  • In the current implementation the attachment (the content) is returned in the MObject as a String and then it is returned again as an attachment in the Record (probably it is also returned in the Record as an MObject). That means the content is transferred three times
  • Crawler developers have to understand the Record/MObject structure
  • Exception handling: how should an exception be handled while calling getNextDeltaIndexing? At the moment it retries several times before stopping the crawl.


Alternatives

  1. getNextDeltaIndexing returns a new class (e.g. DIEntry)
    the class contains attributes with name and value; the value is stored as an Object, therefore every attribute and attachment can be returned
    getRecord returns only Object[], containing only the attributes not previously transferred
    the CrawlerController creates the Records (based on the information in the IndexOrder)
  2. getNextDeltaIndexing returns a Record (containing only the delta-indexing attributes and attachments)
    getRecord also returns a Record, containing only the information not previously transferred
    the CrawlerController can "merge" both entries
  3. HASH/ID generation is executed in the Crawler process.
    At the moment the Crawler is based on an abstract class that should implement the communication implementation (like Tuscany). The Hash/ID creation classes can be moved into the crawler-side classes. Thus getNextDeltaIndexing will return the prepared ID and HASH
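Alternative 1 could look roughly like this. This is a hypothetical sketch: DIEntry exists only in the proposal above, and the name/value map shown here is just one possible shape for it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of alternative 1: the crawler returns plain name/value
// pairs instead of MObjects. Because values are typed as Object, attachments
// (e.g. byte[] content) can be returned alongside simple attributes.
public class DIEntry {

    private final Map<String, Object> attributes = new LinkedHashMap<>();

    public void setAttribute(String name, Object value) {
        attributes.put(name, value);
    }

    public Object getAttribute(String name) {
        return attributes.get(name);
    }

    // The CrawlerController would read this map and build the actual Record
    // based on the information in the IndexOrder.
    public Map<String, Object> getAttributes() {
        return attributes;
    }
}
```

The crawler developer then only deals with plain Java types; the Record/MObject/Literal construction stays entirely on the CrawlerController side.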


Discussion

Sebastian Voigt: To minimize problems with the underlying communication technology and to simplify crawler development I would prefer 1). Crawler developers only have to understand the IndexOrderConfiguration and they can return the "Attributes" with simple Java data types. There is no advantage for us if the crawler developer has to implement hashing/ID components (it only increases development complexity) and has to fill Records and MObjects.

Daniel Stucky: Personally I prefer to let the Crawler generate ID and HASH. It is beneficial for performance, as less data has to be transferred between Crawlers and the CrawlerController. I don't see additional complexity: not every Crawler has to implement its own methods to create ID/HASH, it only has to use them. Such methods can be made available by utility classes or an abstract base class. If someone desperately wants to implement these things on his own, he is free to do so and has to bear the consequences.

Concerning the return types, I think that getNextDeltaIndexing() should return an array of a new data type DIInfo that contains only the ID (Id) and the HASH (String). As there are 2 concrete data types (Id and String) there is no need to use MObjects or Records. It is still possible, though.

For the return type of getRecord() one could simply use a Map<String,Object> and create the Record objects on the CrawlerController. In this way a Crawler may provide data that is not convertible into a Record (at least not automatically/generically). On the other hand, we would have fewer dependencies towards other bundles. A Record object has more constraints and allows a Crawler to provide additional information about the data using annotations (sadly I currently don't have an example for a use case). Another issue could be semantics. At the moment it is totally unclear how semantics are added to/associated with Records. Using the same objects throughout the system may make things easier.

I do agree that the creation of Records, MObjects and Literals is cumbersome. So we should adapt those APIs or add utility methods to make creation easier, regardless of whether this is used in Crawlers or in the CrawlerController.
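The proposal of crawler-side ID/HASH creation, with the utility provided by an abstract base class and getNextDeltaIndexing() returning DIInfo objects, might be sketched as follows. DIInfo is taken from the comment above, while the SHA-256 hashing is an assumption (any stable hash would do) and SMILA's real Id class is simplified to a String here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical sketch: DIInfo carries only ID and HASH, and the abstract base
// class provides the hash utility so a concrete Crawler only uses it instead
// of implementing it.
public abstract class AbstractCrawlerSketch {

    // SMILA's Id type is simplified to a String for this sketch.
    public record DIInfo(String id, String hash) {}

    // Utility provided by the base class; crawlers call this rather than
    // reimplementing it (SHA-256 is an assumption, not SMILA's actual choice).
    protected static String hashOf(String hashRelevantData) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(
                md.digest(hashRelevantData.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Each concrete Crawler returns ready-made ID/HASH pairs, so only these
    // small objects cross the Crawler/CrawlerController boundary.
    public abstract DIInfo[] getNextDeltaIndexing();
}
```

Since only the (id, hash) pairs cross the process boundary, the transferred data per entry shrinks to two small strings, which is the performance argument made above.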


Separation between Crawler Implementation and Communication Implementation

How can we separate the communication technology from the Crawler implementation? The goal is to switch simply between e.g. Tuscany and in-process communication without changing the code for crawlers.

Daniel Stucky: Actually Tuscany (SCA) is the technology that allows the separation of communication technology and business logic. The wiring of components allows us, for example, to let the CrawlerController communicate with Crawlers in-process, via RMI, web service, etc. by configuration. I think your question is "Is it possible to NOT use Tuscany for in-process communication without changing code for crawlers?". There are several issues:

  • in-process communication without Tuscany may be a valid request, as it leads to better performance. Even when using binding.sca, Tuscany generates proxy objects that will slow down communication. Perhaps we should do some tests (see Performance Evaluation on page SMILA/Project Concepts/IRM)
  • most of the Tuscany features do not need actual coding (e.g. implementation of interfaces) but are enabled by code annotations. These annotations do not interfere with the crawler code if Tuscany is not used at runtime (for compilation the Tuscany annotation classes are needed, of course)
  • the concept was designed with Tuscany/SCA functionality in mind. So there are several features that come automatically with Tuscany (like the handling of conversations/sessions, or using the ComponentContext to determine the CrawlerID). This allows a Crawler to crawl multiple DataSources in parallel by automatically providing multiple instances. If Tuscany is not used this feature has to be reimplemented by each Crawler, and if it is reimplemented, it makes no sense to use its Tuscany counterpart when using Tuscany. The ComponentContext is used to get the Crawler's ID from the component description; it is used for Crawler detection by the CrawlerController

So what is the gain for a Crawler developer? I don't see any benefits regarding simplification. In contrast, the developer has to take care of multithreading and session handling. If you see any problems with the technology in the Crawler area, then we should discuss whether CrawlerController and Crawler should run in the same VM and NOT use Tuscany. If Crawlers in non-Java technologies are needed, integration is done in traditional ways (e.g. JNI, CORBA, etc.) using a Java proxy. And is Tuscany a valid technology for distributing the ConnectivityManager and the BPEL services?


How big should be the Crawler Framework (classes that are necessary for the start of the Crawler Process?)

Daniel Stucky: I think we should try to keep the Crawler Framework as small as possible. So I guess we have to provide separate bundles for interfaces and implementations, as is already done in org.eclipse.smila.connectivity and org.eclipse.smila.connectivity.impl. Also a restructuring of the utility classes may be necessary.
