
SMILA/Project Concepts/IRMDiscussion


Latest revision as of 10:23, 24 January 2012


Benefit of SCA for the Integration of External Systems

Georg Schmidt: Where is the direct advantage of using SCA for the integration of external systems? In which context should it be used, and which advantages are we gaining from it?

  • Daniel Stucky: As Agents/Crawlers may be implemented in different programming languages we need a technology to communicate between Agents/Crawlers and their Controllers. SCA provides this functionality, but it does not offer any advantages (except that we will make use of SCA in other parts of SMILA, so it would be more homogeneous). What other possibilities do we have for such communication?
    • Georg Schmidt: Just plain in-process communication, e.g. calling one OSGi bundle from another.

Entry Barrier for Integration Developers

Georg Schmidt: Is there already an idea of which entry barrier exists for integration developers? How do we handle build integration? How do we do unit tests? Which technologies must the developer know? Which tools could the developer use to perform the development? Which interfaces must the developer implement at minimum to get the easiest integration done? Is there a concept for managing configurations?

  • Daniel Stucky: Developers should only have to implement Agents/Crawlers. The only must-have technology would be SCA. I think we should postpone these questions until after the concept is stable.

Naming of the modules/components

Sebastian Voigt:

  • Modules/components in the SCA component view should have Manager/Module/Component in their names, e.g. "Delta Indexing Manager" sounds better than just "Incremental Import State".
  • The Agent and Crawler components should be merged into a superior module with a name (something like IRM, but probably better: Connector?). But both should still exist as components, so a developer can decide to implement both or only one of them for his IRM/Connector module; this module has to be developed, and the other components are delivered as framework (see Discussion 2.2).
  • Daniel Stucky: I don't think we need a "superior" module in terms of code, but in terms of packaging to allow code reuse for an implementation supporting both the Agents and Crawler interfaces.
  • Sebastian Voigt: That was my intention. We could create a package that has to be developed and that contains the crawler and/or agent. We need a name for it. At the moment there is no border between the IRM framework/interface and the IRM itself (if we should call it IRM).
  • Sebastian Voigt: Probably we could change the architecture overview (the figure) so that the components are more separated (task-based!). I see three packages at the moment:
    • Agent/Crawler (has to be developed for a new data source)
    • THE (IRM) interface: contains the Agent/Crawler Controller, the compound management and the configuration manager (at this point the configuration is only for the indexing job).
    • Connectivity Module/Manager (as it is called at the moment?)

These are the three main parts; perhaps we can draw small (dashed-border) boxes around them. From the view of an IRM/agent/crawler developer, the IRM interface (agent/crawler controller, compound management, configuration manager) and the connectivity part are contained in ONE package/box. This is the connectivity "box". From the point of view of nodes, there is one node that contains the IRM interface and the Connectivity Module. The Agent/Crawler can probably run on the host where the data source runs (another node), and the queue runs distributed over more than one host/node. Probably there is a misunderstanding, and the IRM interface parts should run on the host where the agent/crawler is installed. If not, I vote for a super package (super module) that contains the existing IRM interface and the connectivity part. We have to create figures that fit into the super architecture overview figures. In the super architecture overview there are agents/crawlers and the Connectivity Module. In our overview these "packages" should appear or be reused to show the reader the hierarchy.

Agent/Controller conflict problem?

Sebastian Voigt: Agent and crawler should not be allowed to access (send to the Connectivity Manager) the same data/object/information at the same time --> mutual exclusion / synchronization is needed

  • Daniel Stucky: this functionality has to be addressed more generally in the Connectivity Manager, as Agents/Crawlers know nothing about each other.

Sebastian Voigt: That is correct, synchronization issues have to be managed by the Connectivity Manager.
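The mutual exclusion discussed above could be sketched as follows. This is only an illustrative idea, not the actual SMILA implementation: the Connectivity Manager keeps a set of record IDs that are currently in flight, so a concurrent Agent and Crawler cannot hand over the same object at the same time. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the Connectivity Manager tracks record IDs that are
// currently being processed, so an Agent and a Crawler sending the same
// object at the same time cannot interleave.
public class ConnectivitySync {
    private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller may process the record,
     *  false if another Agent/Crawler currently holds it. */
    public boolean tryAcquire(String recordId) {
        return inFlight.add(recordId); // add() is atomic on this set
    }

    /** Releases the record after the add/update/delete has been handed over. */
    public void release(String recordId) {
        inFlight.remove(recordId);
    }
}
```

A caller that fails to acquire the record would either retry later or drop its (by then outdated) event.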

Process Component Logic:

Sebastian Voigt: The process component logic should/can be its own module/component, because it handles a lot of work. Thus it can be reused and improved more easily.

  • Daniel Stucky: Yes, I agree. At design time I focused only on container objects (like ZIP), but there are other compounds that must be handled. This is called "Splitter" in the architecture overview and also covers page-based indexing of large files. So we could also need a separate framework here.

Definition of Interfaces for the Components (Agent/Crawler Controller & Connectivity):

Sebastian Voigt: The Data format for the Agent/Crawler Controller should be defined:

  • In which format should the retrieved information be returned (Agent/Crawler -> Agent/Crawler Controller)?

Simple example for the definition (the same way it is used with IRMs and the AFE engine): an agent/crawler is described by an XSD schema. It contains all information fields that the agent/crawler can return as XML. The Agent/Crawler Controller can check the XML against the given XSD. The mentioned "data unifier" can then be used to convert information fields like dates to a uniform format. A concept for this is described at Index Order Configuration Schema.

Probably the XML format from ECS-67 can be used for this.

  • For the Connectivity Module a definition of the data format is needed as well (Agent/Crawler Controller --> Connectivity).
  • Daniel Stucky: I like the idea of using XML Schema. But I think we should not allow an IRM to introduce new data types. Is this possible?
  • Sebastian Voigt: This Confluence page, Index Order Configuration Schema, contains a more detailed description of the IRM configuration
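The XSD-based check described above can be sketched with the standard JAXP validation API. The schema here is a hypothetical minimal stand-in for what a crawler would declare; the class name is illustrative, not part of SMILA.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class CrawlerOutputValidator {
    // Hypothetical minimal schema: the crawler declares that it returns a
    // <record> element containing a single <date> field.
    static final String XSD =
        "<?xml version='1.0'?>" +
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "  <xs:element name='record'>" +
        "    <xs:complexType><xs:sequence>" +
        "      <xs:element name='date' type='xs:date'/>" +
        "    </xs:sequence></xs:complexType>" +
        "  </xs:element>" +
        "</xs:schema>";

    /** Returns true if the XML produced by the crawler matches its schema. */
    public static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) { // SAXException on invalid content
            return false;
        }
    }
}
```

The Agent/Crawler Controller would run such a check before handing the data on to the "data unifier".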

Configuration Management: Information retrieval

Sebastian Voigt: An Agent or a Crawler needs to know which information should be retrieved. Thus the Agent/Crawler can retrieve only the necessary information (lazy initialization: long retrieval operations should only execute if they are necessary). Should this information be stored within the config? (BTW: the binding of information to index fields is done in another part of the framework? That means this configuration file would also be used in other parts of the framework. Should it also be used for the configuration of index fields and their parameters for the search configuration (AND/OR/wildcards... search-dependent parameters)?) The same idea from 2.6 with the XML/XSD definition of a self-built IRM/Connector could be used. The configuration has a special part where the information to be retrieved is described with the same XML tags used for the information transport between agent/crawler and Agent/Crawler Controller.

These two parts of the config could also be verified by the Agent/Crawler Controller with the XSD of the agent/crawler (IRM/Connector). See Index Order Configuration Schema for more information. 2.6 and 2.7 should be discussed, and if the idea is good we can create an exact definition for it (the definition is now here: Index Order Configuration Schema).

  • Daniel Stucky: We should separate the configuration of what information an agent/crawler provides from what information is indexed and how this information is searched. There may be simple cases where the configurations are equal, but the processing of the crawled information usually leads to additional index fields. Also, not every piece of information may be used for indexing/searching; it may instead be stored in the XML Data Storage for other use cases (e.g. mashups).
  • Sebastian Voigt: Ok.

Dealing/Handling with special information like permissions:

Sebastian Voigt: A lot of data sources use external security management. That means the assignment of a user to a security group (or similar) is not stored within the data source; an example is the use of LDAP or a Windows domain for security management. Such data sources store only the permissions of users and groups. Therefore the IRM has no knowledge during indexing of which users are assigned to a group.

Thus existing IRMs for the AF engine return permissions for an information object/entry unchanged, and the search implementation uses a module that gathers the information about the assignment of users to groups. This module then translates the search query so that the search only returns entries that the searching user is allowed to see. Should we keep this workflow for SMILA? Daniel Stucky:

  • I agree that we need functionality to resolve security information (members of a group, groups of a user).
  • your approach is good as
    • less data is stored than when the Crawler would resolve the information (groups can have lots of members)
    • changes in user->group assignment can be applied without the need for reindexing
  • how does it affect the search performance? During indexing the time spent to resolve this information is not so critical.
  • I suggest that we just provide the functionality in the framework but do not dictate how security information is handled. There may be scenarios where the security information is only accessible via Agents/Crawlers. Your approach should be emphasized in "Best Practices".

Igor Novakovic:

  • We used the same approach in our product called e:SLS. We stored only group information bound to a document (this binding has a rather static nature) in the index. Before a search query is executed, the groups in which the user (who fired that search request) is a member are resolved and used as a filter criterion for the search. Additionally, the retrieved documents are checked against the actual current access rights of that user in order to make sure that he/she can really read those documents. (The group-access information stored in the index may be out of date if the document has not been reindexed after the access rights on that document changed.)
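The query-translation step described in both approaches could look roughly like this. Everything here is illustrative: the static map stands in for an LDAP/domain lookup, and the filter syntax is a generic query language, not a specific SMILA or e:SLS API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the query rewriting described above: the groups of
// the searching user are resolved and attached as a filter, so only documents
// whose indexed ACL groups intersect the user's groups are returned.
public class SecurityFilter {
    // Stand-in for an LDAP/Windows-domain lookup; names are illustrative only.
    static final Map<String, List<String>> USER_GROUPS = Map.of(
        "alice", List.of("hr", "staff"),
        "bob", List.of("staff"));

    /** Rewrites a query so that it only matches documents readable by user. */
    public static String rewrite(String query, String user) {
        String groupFilter = USER_GROUPS.getOrDefault(user, List.of()).stream()
            .map(g -> "groups:" + g)
            .collect(Collectors.joining(" OR "));
        // An unknown user yields an empty filter; a real implementation
        // would reject the query instead.
        return "(" + query + ") AND (" + groupFilter + ")";
    }
}
```

The second safeguard Igor mentions (re-checking live access rights on the result documents) would happen after this rewritten query returns its hits.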

Here are some ideas/discussions about interfaces

Agent Controller

interface AgentController
    void add(Record) // triggers the add process
    void update(Record) // triggers the update process
    void delete(Record) // triggers the delete process, the Record most likely will only contain the ID and no data
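To illustrate how an Agent would drive this interface, here is a minimal sketch. Record and AgentController are reduced to stand-ins (the real types carry IDs, attributes and attachments); the LoggingController is purely hypothetical and only shows which process each call triggers.

```java
import java.util.ArrayList;
import java.util.List;

public class AgentExample {
    /** Minimal stand-in for the real Record type. */
    static class Record {
        final String id;
        Record(String id) { this.id = id; }
    }

    /** The AgentController interface from the text, with the stand-in Record. */
    interface AgentController {
        void add(Record r);
        void update(Record r);
        void delete(Record r); // Record most likely carries only the ID
    }

    /** Trivial controller that records which process was triggered. */
    static class LoggingController implements AgentController {
        final List<String> log = new ArrayList<>();
        public void add(Record r)    { log.add("add:" + r.id); }
        public void update(Record r) { log.add("update:" + r.id); }
        public void delete(Record r) { log.add("delete:" + r.id); }
    }
}
```

An Agent observing its data source would simply call add/update/delete on the controller as change events arrive.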


interface Crawler
    Iterator<Record> crawl(Config)

Starts a crawl process (as separate thread(s)) with the given configuration and returns an Iterator over the crawled data. In this case the Iterator has to be a service that is created on demand. The Iterator's hasNext() method should not return a boolean, but an IncImportData object (e.g. a hash token) if it has a next element, or NULL if no more elements exist. The IncImportData (probably a hash) is needed in the Crawler Controller to determine if this data needs to be processed. The Iterator also needs a method skip() to move the iterator to the next element without getting the current element. {note:title=Technical Note}My idea was that the Crawler Controller initiates a new crawl process by calling method crawl() on the Crawler, which returns an Iterator over the data to the Crawler Controller. Therefore I made some tests with Tuscany using Conversations to simulate this interaction. In general it works, but Tuscany seems to have a bug when returning ServiceReferences: initiated Conversations are not reused. I created an issue in the Tuscany JIRA to address this limitation. This bug is fixed in Tuscany 1.1.{note}

interface Iterator
        Checks if more data objects are available.
        @return a Record containing data for delta indexing (ID and hash) or null if no more data objects exist.
    Record hasNext();
        Moves the iterator to the next data object after accessing and returning the current data object as a Record
        @return a Record containing the complete data
    Record next();
        Moves the iterator to the next element without accessing and returning the data object
    void skip();
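The delta-indexing loop a Crawler Controller would run over this Iterator could be sketched as follows. The Record fields and the hash comparison are illustrative stand-ins; the point is the hasNext()/skip()/next() protocol, where unchanged entries are skipped without ever fetching their full data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DeltaIndexingLoop {
    /** Minimal stand-in for a Record carrying an ID, a hash and data. */
    static class Record {
        final String id, hash, data;
        Record(String id, String hash, String data) {
            this.id = id; this.hash = hash; this.data = data;
        }
    }

    /** The Crawler's Iterator interface described above. */
    interface RecordIterator {
        Record hasNext(); // ID + hash only, or null when exhausted
        Record next();    // full data; advances the iterator
        void skip();      // advances without fetching the data
    }

    /** Processes only records whose hash changed since the last crawl. */
    static List<String> processChanged(RecordIterator it,
                                       Map<String, String> knownHashes) {
        List<String> processed = new ArrayList<>();
        Record delta;
        while ((delta = it.hasNext()) != null) {
            if (delta.hash.equals(knownHashes.get(delta.id))) {
                it.skip();                     // unchanged: cheap skip
            } else {
                processed.add(it.next().data); // changed/new: fetch full data
            }
        }
        return processed;
    }
}
```

Note that hasNext() doubles as the delta-indexing probe, so each data-source entry is accessed at most once.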

{info:title=Alternative Interface Design} We could also provide the following interface. It seems to be more flexible than the initial one and distributes the implementation logic between the Crawler and the Iterator, whereas in the initial approach the main logic is provided by the Iterator. A second benefit is that it allows direct access to a selected Record, which may be needed in BPEL during search:

interface Crawler
        Returns an  Iterator with Records on delta indexing information
    Iterator<Record> crawl(Config)
        Gets a Record with all data by ID
    Record getRecord(ID)
interface Iterator
        Checks if more data exists and returns true if one or more data exists, false otherwise
    boolean hasNext();
        Returns one or more Records containing delta indexing information.
    List<Record> next();

With this interface, the Iterator iterates only over delta indexing information. It does not access all of the object's data and does not return this data in any way. Access to the complete data is provided by the Crawler interface, using the Record ID. Note that iteration and access of data are asynchronous. This may be difficult or even impossible to implement for certain data sources, or maybe the size of the List has to be reduced to one (compare the empolis Exchange Connector). Perhaps we should introduce a data type DeltaIndexingRecord to separate Records with complete data from those with delta indexing data.
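For comparison, the controller-side loop for this alternative design could be sketched like this. For brevity, java.util.Iterator stands in for the batch iterator; Record fields and class names are illustrative. Note how every changed record costs a second access via getRecord(), which is the performance concern raised below.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class AlternativeCrawlLoop {
    /** Minimal stand-in for a Record; delta batches carry id + hash only. */
    static class Record {
        final String id, hash, data;
        Record(String id, String hash, String data) {
            this.id = id; this.hash = hash; this.data = data;
        }
    }

    /** The alternative Crawler interface: full data is fetched by ID,
     *  independently of the delta iteration. */
    interface Crawler {
        Record getRecord(String id);
    }

    /** Iterates batches of delta info, fetching full data only for
     *  records whose hash changed since the last crawl. */
    static List<String> crawlChanged(Iterator<List<Record>> deltaBatches,
                                     Crawler crawler,
                                     Map<String, String> knownHashes) {
        List<String> processed = new ArrayList<>();
        while (deltaBatches.hasNext()) {
            for (Record delta : deltaBatches.next()) {
                if (!delta.hash.equals(knownHashes.get(delta.id))) {
                    // Second access to the data source for each changed record.
                    processed.add(crawler.getRecord(delta.id).data);
                }
            }
        }
        return processed;
    }
}
```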

Sebastian Voigt: I would prefer the first Crawler interface. The alternate interface doesn't fit the workflow of a crawler from my point of view. The crawler will crawl each item step by step; the getRecord mechanism forces the Crawler to cache the Record information for each entry to return it when the Controller asks for it, or it has to access the entry in the data source twice, which means lower performance. The Record hasNext(), Record next() iteration is better; it is an easier workflow and therefore easier for the Crawler developer. {info}
