SMILA/Project Concepts/Index Order Configuration Schema

Description

This page describes a concept for configuring indexing jobs.

Main goals:

- The configuration manager needs a configuration concept.
- Agents/Crawlers should follow a specification (enforced by definition).
- Agents/Crawlers need the ability to define their own configuration mechanism (each agent/crawler needs specific configuration).
- Type safety for the data that the agent/crawler returns.

Technical proposal

Indexing Job Configuration Schema:

The Agent/Controller Framework defines a schema that contains the main necessary parameters that every Indexing Job needs.

The schema contains the DataSourceID that is used by the ID concept (see SMILA/Project_Concepts/ID_Concept).

Furthermore, the indexing job specifies which Agent/Crawler should be used. The Agent/Crawler tag contains the unique identifier of the Agent/Crawler OSGi bundle that should be executed. (Therefore, every Agent/Crawler bundle has to expose its unique ID through an API.)

CompoundHandling activates or deactivates compound handling for the indexing task (Yes/No).

The configuration schema contains two parts, called Attributes and Process.

Each Agent/Crawler has to implement these parts by redefining these XML tags in its own schema.
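
The structure described above can be sketched as the following configuration skeleton. Element names follow the framework schema listed under Sample Files; the values shown (ExampleDataSource, example.crawler.bundle.id, ExampleName) are placeholders for illustration, not real identifiers.

```xml
<IndexOrderConfiguration>
  <!-- Identifies the data source; used by the ID concept -->
  <DataSourceID>ExampleDataSource</DataSourceID>
  <!-- Exactly one of Agent or Crawler, holding the unique ID of the bundle (placeholder value) -->
  <DataConnectionID>
    <Crawler>example.crawler.bundle.id</Crawler>
  </DataConnectionID>
  <!-- Yes or No -->
  <CompoundHandling>No</CompoundHandling>
  <!-- Part 1: attributes to return; child content is defined by the Agent/Crawler schema -->
  <Attributes>
    <Attribute Name="ExampleName" Type="String"/>
  </Attributes>
  <!-- Part 2: selection parameters; content is defined by the Agent/Crawler schema -->
  <Process/>
</IndexOrderConfiguration>
```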

Access to Object Attributes

The first part describes which attributes of an object stored in the accessed source should be returned by the Agent/Crawler. Attributes are pieces of information about an entry in the data source. Thus, objects are entries in the data source, such as files in a file system, and attributes are the properties of those objects, such as the file name or the file size.

An Agent/Crawler has to define with an XSD which XML tags may be used for this part. Thus each agent/crawler has a "description" (the XML tags) of which information it can return.

The Agent/Crawler may use sub-elements or categories so that the Agent/Crawler developer can offer an easy configuration for the indexing job.

Examples for a File System:

In the example the filesystem crawler defines two categories. The first is used for standard information attributes:

Name of the file (FileName)
Date of the file (e.g. last modified: FileDate)
Size of the file (FileSize)
Path to the file (FilePath)

The Permission category allows selecting permission information for

Users
Groups
Others (as in a Unix file system, for example)
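
As a minimal sketch, the two categories could appear in a configuration as follows. The category elements (FileAttributes, Permissions) and their values are taken from the sample FileSystem schema under Sample Files, which uses Name and LastModifiedDate rather than the FileName and FileDate labels used above; the selection of attributes is only illustrative.

```xml
<Attributes>
  <!-- Standard information category -->
  <Attribute Type="String" Name="Filename">
    <FileAttributes>Name</FileAttributes>
  </Attribute>
  <Attribute Type="Date" Name="Date">
    <FileAttributes>LastModifiedDate</FileAttributes>
  </Attribute>
  <!-- Permission category -->
  <Attribute Type="StringCollection" Name="PermissionGroup">
    <Permissions>Group</Permissions>
  </Attribute>
</Attributes>
```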

Attribute Information

The following XML attributes can be used on Attribute elements.

Name & Type:

Each Attribute should have a name. This Name can be used to access the attribute (probably necessary, depending on the technique used for the data objects). For type safety, a return type can be configured for each attribute. The agent/crawler defines a list of allowed return types such as String, Integer, Date, etc. Each agent/crawler must return only values of the type configured for the attribute. Therefore each Attribute tag has to be assigned exactly one return type. The agent/crawler uses this return type when it delivers the information, so the Agent/Crawler controller can work with this type.

KeyAttribute, HashAttribute & MimeTypeAttribute:

Each Attribute can be used to create the "delta indexing" hash. For this purpose the Agent/Crawler Framework defines an XML attribute. When this XML attribute is set to true, the Attribute is used to generate the hash value needed by delta indexing.

Furthermore, each Attribute can be marked as a key attribute. Such an attribute will/should be used by the IRM to create the ID (see SMILA/Project_Concepts/ID_Concept).

In summary:

An Attribute marked with KeyAttribute means: use this information to create a key for this object (see SMILA/Project_Concepts/ID_Concept).
An Attribute marked with HashAttribute means: use this information to create the hash for this object (see SMILA/Project_Concepts/Connectivity#Delta Indexing Manager).
An Attribute marked with MimeTypeAttribute means: the MimeType detection should use this Attribute to detect the MimeType.

Possible values for MimeTypeAttribute:

Content: an attribute that is the content of the object
FileExtension: an attribute that contains the file extension
MimeType: an attribute that contains the MIME type description
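
The markers above could be combined as in the following sketch, which uses the element and attribute names of the sample schemas under Sample Files. Using the modification date as hash input is an assumption made for illustration; the sample configuration on this page hashes the content instead.

```xml
<Attributes>
  <!-- Name and path together form the key used to create the ID -->
  <Attribute Type="String" Name="Filename" KeyAttribute="true">
    <FileAttributes>Name</FileAttributes>
  </Attribute>
  <Attribute Type="String" Name="Path" KeyAttribute="true">
    <FileAttributes>Path</FileAttributes>
  </Attribute>
  <!-- Assumption: the modification date feeds the delta-indexing hash -->
  <Attribute Type="Date" Name="Date" HashAttribute="true">
    <FileAttributes>LastModifiedDate</FileAttributes>
  </Attribute>
  <!-- MimeType detection should use the file extension -->
  <Attribute Type="String" Name="Extension" MimeTypeAttribute="FileExtension">
    <FileAttributes>FileExtension</FileAttributes>
  </Attribute>
</Attributes>
```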

Attachment:

Describes whether the Crawler/Agent should return this Attribute as an Attachment in the Record (should be used for binary content).
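
For example, binary file content could be requested as an attachment like this. The fragment is a sketch modelled on the sample configuration under Sample Files, wrapped in an Attributes element so that it is well-formed on its own.

```xml
<Attributes>
  <!-- The content is delivered as an Attachment of the Record and is also used for MimeType detection -->
  <Attribute Type="String" Name="Content" MimeTypeAttribute="Content" Attachment="true">
    <FileAttributes>Content</FileAttributes>
  </Attribute>
</Attributes>
```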

Note (type safety): Type safety should be considered for all SMILA components. Never work with undefined types. This simplifies data conversion, and the user and the framework have a specific data type for each value in every state.

Warning (open issue: Which Attributes should be delivered to the Connectivity?): All defined Attributes should be returned by a Crawler/Agent. It is currently unresolved whether all attributes, including the hash/ID attributes, should be delivered within the SMILA Record to the Connectivity and later stored in the index.

Selection of Information

The second part of a configuration (called the Process part) contains further information for the indexing job. It contains parameters such as the selection of only specific objects from the source (e.g. a query for an SQL server, or the starting folder/URL for a filesystem/web crawler).

The agent/crawler can define specific XML-attributes that fit its configuration.

Example (for a filesystem crawler):
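
A Process part for a filesystem crawler could look like the following sketch. BaseDir, Filter, Include, and Exclude are defined by the sample FileSystem schema under Sample Files; the directory and the file patterns are illustrative values, and the Exclude entry is an addition that does not appear in the sample configuration.

```xml
<Process>
  <!-- Starting folder for the crawl -->
  <BaseDir>c:\test</BaseDir>
  <!-- Recursive is required; CaseSensitive defaults to true in the sample schema -->
  <Filter Recursive="true" CaseSensitive="false">
    <Include Name="*.txt"/>
    <!-- Illustrative exclusion pattern -->
    <Exclude Name="*.tmp"/>
  </Filter>
</Process>
```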

Notes:

Probably an agent/crawler can define more than one return type for an attribute (a String that contains all entries, or a Collection of Strings).
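
The note above could translate into a configuration like the following sketch, in which the same source information is requested once per return type. The type names String and StringCollection appear in the sample configuration under Sample Files; whether a given agent/crawler actually supports both types for one attribute is an assumption, and the Name values are hypothetical.

```xml
<Attributes>
  <!-- All group entries delivered as one String -->
  <Attribute Type="String" Name="PermissionGroupText">
    <Permissions>Group</Permissions>
  </Attribute>
  <!-- The same entries delivered as a collection of Strings -->
  <Attribute Type="StringCollection" Name="PermissionGroupList">
    <Permissions>Group</Permissions>
  </Attribute>
</Attributes>
```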

XML-Configuration Definition:

IRM Interface Framework:

Defines the XML configuration body (for an agent/crawler: Attributes & Process)

Agent/Crawler Schema:

Is based on the IRM Interface Framework (has to use the return types and the configuration body)
Defines the agent/crawler-specific configuration for each attribute (but each leaf has to use a return type)
Defines the agent/crawler-specific Process configuration settings
Defines the allowed return types
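
The layering can be sketched with xs:redefine, the same mechanism the sample FileSystem schema under Sample Files uses. The crawler shown here is purely hypothetical: StartUrl, PageAttributes, PageAttributesType, and its values are invented for illustration and are not part of SMILA. Only RootIndexOrderConfiguration.xsd and the type names Process and Attribute come from the samples.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Build on the framework schema and extend its two open parts -->
  <xs:redefine schemaLocation="RootIndexOrderConfiguration.xsd">
    <!-- Crawler-specific Process settings (hypothetical) -->
    <xs:complexType name="Process">
      <xs:complexContent>
        <xs:extension base="Process">
          <xs:sequence>
            <xs:element name="StartUrl" type="xs:anyURI"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
    <!-- Crawler-specific attribute selection (hypothetical) -->
    <xs:complexType name="Attribute">
      <xs:complexContent>
        <xs:extension base="Attribute">
          <xs:sequence>
            <xs:element name="PageAttributes" type="PageAttributesType"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
  <!-- Information this hypothetical crawler can return -->
  <xs:simpleType name="PageAttributesType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Title"/>
      <xs:enumeration value="Url"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```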

Configuration XML (file):

Is based on one agent/crawler schema

Sample Files:

Schema of the IRM/Agent/Crawler Framework

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" jxb:extensionBindingPrefixes="ext" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:ext="http://xml.w-wins.com/xjc-plugins/interfaces">
  <!-- simple types -->
  <xs:simpleType name="YesNoType">
    <xs:annotation>
      <xs:appinfo>
        <jxb:class ref="org.eclipse.smila.connectivity.framework.indexorder.messages.YesNoType"/>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:pattern value="Yes"/>
      <xs:pattern value="No"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="MimeTypeAttributeType">
    <xs:annotation>
      <xs:appinfo>
        <jxb:class ref="org.eclipse.smila.connectivity.framework.indexorder.messages.MimeTypeAttributeType"/>
      </xs:appinfo>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:enumeration value="FileExtension"/>
      <xs:enumeration value="Content"/>
      <xs:enumeration value="MimeType"/>
    </xs:restriction>
  </xs:simpleType>
  <!-- complex types -->
  <xs:complexType name="Attribute">
    <xs:annotation>
      <xs:appinfo>
        <ext:interface>org.eclipse.smila.connectivity.framework.indexorder.messages.interfaces.IAttribute</ext:interface>
      </xs:appinfo>
    </xs:annotation>
    <xs:attribute name="KeyAttribute" type="xs:boolean" use="optional" default="false"/>
    <xs:attribute name="HashAttribute" type="xs:boolean" use="optional" default="false"/>
    <xs:attribute name="Name" type="xs:string" use="required"/>
    <xs:attribute name="Type" type="xs:string" use="required"/>
    <xs:attribute name="MimeTypeAttribute" type="MimeTypeAttributeType" use="optional"/>
    <xs:attribute name="Attachment" type="xs:boolean" use="optional" default="false"/>
  </xs:complexType>
  <xs:complexType name="Process">
    <xs:annotation>
      <xs:documentation>Process Specification</xs:documentation>
      <xs:appinfo>
        <ext:interface>org.eclipse.smila.connectivity.framework.indexorder.messages.interfaces.IProcess</ext:interface>
      </xs:appinfo>
    </xs:annotation>
  </xs:complexType>
  <xs:element name="IndexOrderConfiguration">
    <xs:annotation>
      <xs:appinfo>
        <jxb:class ref="org.eclipse.smila.connectivity.framework.indexorder.messages.IndexOrderConfiguration"/>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="DataSourceID" type="xs:string"/>
        <xs:element name="DataConnectionID">
          <xs:complexType>
            <xs:choice>
              <xs:element name="Agent" type="xs:string"/>
              <xs:element name="Crawler" type="xs:string"/>
            </xs:choice>
          </xs:complexType>
        </xs:element>
        <xs:element name="CompoundHandling" type="YesNoType"/>
        <xs:element name="Attributes">
          <xs:complexType>
            <xs:sequence maxOccurs="unbounded">
              <xs:element name="Attribute" type="Attribute"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="Process" type="Process"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

FileSystem Configuration Schema (based on the Agent/Crawler Framework schema)

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:redefine schemaLocation="RootIndexOrderConfiguration.xsd">
    <xs:complexType name="Process">
      <xs:annotation>
        <xs:documentation>Process Specification</xs:documentation>
      </xs:annotation>
      <xs:complexContent>
        <xs:extension base="Process">
          <xs:sequence maxOccurs="unbounded">
            <xs:element name="BaseDir" type="xs:string"/>
            <xs:element name="Filter" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Include" minOccurs="0" maxOccurs="unbounded">
                    <xs:complexType>
                      <xs:attribute name="Name" type="xs:string" use="required"/>
                      <xs:attribute name="DateFrom" type="xs:dateTime" use="optional"/>
                      <xs:attribute name="DateTo" type="xs:dateTime" use="optional"/>
                      <xs:attribute name="Period" type="xs:normalizedString" use="optional"/>
                    </xs:complexType>
                  </xs:element>
                  <xs:element name="Exclude" minOccurs="0" maxOccurs="unbounded">
                    <xs:complexType>
                      <xs:attribute name="Name" type="xs:string" use="required"/>
                    </xs:complexType>
                  </xs:element>
                </xs:sequence>
                <xs:attribute name="CaseSensitive" type="xs:boolean" use="optional" default="true"/>
                <xs:attribute name="Recursive" type="xs:boolean" use="required"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
    <xs:complexType name="Attribute">
      <xs:complexContent>
        <xs:extension base="Attribute">
          <xs:choice>
            <xs:element name="FileAttributes" type="FileAttributesType"/>
            <xs:element name="Permissions" type="PermissionsType"/>
          </xs:choice>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>
  <!-- simple types -->
  <xs:simpleType name="FileAttributesType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Name"/>
      <xs:enumeration value="Path"/>
      <xs:enumeration value="Size"/>
      <xs:enumeration value="LastModifiedDate"/>
      <xs:enumeration value="Content"/>
      <xs:enumeration value="FileExtension"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="PermissionsType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="User"/>
      <xs:enumeration value="Group"/>
      <xs:enumeration value="Others"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>

Example of FileSystem Crawler Configuration

<?xml version="1.0" encoding="UTF-8"?>
<IndexOrderConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="FileSystemIndexOrder.xsd" >
  <DataSourceID>FileSystem_C_TEST</DataSourceID>
  <DataConnectionID>
    <Crawler>org.eclipse.smila.connectivity.framework.indexorder.schema.sample</Crawler>
  </DataConnectionID>
  <CompoundHandling>Yes</CompoundHandling>
  <Attributes>
    <Attribute Type="Date" Name="Date">
      <FileAttributes>LastModifiedDate</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Filename" KeyAttribute="true">
      <FileAttributes>Name</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Path" KeyAttribute="true">
      <FileAttributes>Path</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="PermissionUsers">
      <Permissions>User</Permissions>
    </Attribute>
    <Attribute Type="StringCollection" Name="PermissionGroup">
      <Permissions>Group</Permissions>
    </Attribute>
    <Attribute Type="String" Name="Content" HashAttribute="true" MimeTypeAttribute="Content" Attachment="true">
      <FileAttributes>Content</FileAttributes>
    </Attribute>
    <Attribute Type="String" Name="Extension" MimeTypeAttribute="FileExtension">
      <FileAttributes>FileExtension</FileAttributes>
    </Attribute>
  </Attributes>
  <Process>
    <BaseDir>c:\test</BaseDir>
    <Filter Recursive="true" CaseSensitive="false">
      <Include Name="*.txt"/>
    </Filter>
  </Process>
</IndexOrderConfiguration>

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
