SMILA/Documentation/Management

''Latest revision as of 02:12, 5 July 2012''

{{note|This is deprecated for SMILA 1.0: the JMX management framework is still functional, but it is planned to replace it with management and monitoring HTTP REST APIs.}}
 
SMILA is a framework with a lot of functionality. Most of it is invoked automatically by internal operations. Nevertheless, the user has to configure and start an initial operation. All functions a user can execute are accessible from the JMX Management Agent. On the following pages you will learn how to use SMILA with the aid of Java's built-in JConsole and how to handle the JMXClient, which offers access to SMILA commands via batch files.
 
== Management with the aid of jconsole ==

The jconsole is a small monitoring tool for Java applications that ships with the JDK. Over a JMX connection it is possible to attach jconsole's Swing UI to a running application. If you start the SMILA engine and open jconsole, you can connect it to SMILA immediately.

After connecting you will find the SMILA operations on the MBeans tab, in the tree on the left side.

=== Smila manageable Components ===
 
Most components are now controlled via the HTTP REST API. The remaining JMX-controlled components are:
  
* [[SMILA/Documentation/SesameOntologyManager]]
* [[SMILA/Documentation/Solr]]
  
Also, some third-party components embedded in SMILA (e.g. Zookeeper) offer monitoring tools via JMX.
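Under the hood, both jconsole and the JMX client address such components through an MBean server by domain and key and then invoke operations on them. The following self-contained sketch registers a toy stand-in MBean and invokes an operation on it the same way; the interface, ObjectName and status value are illustrative assumptions, not SMILA's real ones:

```java
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class JmxInvokeDemo {

    // Toy management interface; a real SMILA component exposes its own operations.
    public interface DemoControllerMBean {
        String getStatus(String dataSourceId);
    }

    public static class DemoController implements DemoControllerMBean {
        public String getStatus(String dataSourceId) {
            return "FINISHED"; // a real component would report its actual state
        }
    }

    /** Registers the MBean and invokes an operation on it, as jconsole would. */
    public static String invokeDemo() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Domain and key, mirroring the domain/key addressing used throughout JMX.
        ObjectName name = new ObjectName("SMILA:name=DemoController");
        if (!server.isRegistered(name)) {
            server.registerMBean(new DemoController(), name);
        }
        return (String) server.invoke(name, "getStatus",
                new Object[] { "file" }, new String[] { "java.lang.String" });
    }

    public static void main(String[] args) throws Exception {
        System.out.println(invokeDemo());
    }
}
```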
 
 
== PerformanceCounter ==
 
 
A [[SMILA/Project_Concepts/Performance_counters_API|PerformanceCounter]] monitors the activity of a component. In SMILA two kinds of PerformanceCounters are currently available, one for [[SMILA/Documentation/Crawler|Crawlers]] and another for processing within the Data Flow Process. With the aid of jconsole you can inspect the interesting counters of SMILA; a number of views provide information about different situations.
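What jconsole displays for such a counter is simply a JMX attribute read. As an illustration, this sketch reads an attribute from the JVM's own built-in <tt>Runtime</tt> MBean, used here only as a stand-in for a SMILA counter MBean, since no running SMILA process is assumed:

```java
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class JmxAttributeDemo {

    /** Reads one attribute, as jconsole does when you select it in the MBeans tab. */
    public static long readUptime() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // The JVM's built-in Runtime MBean; a SMILA counter would be addressed
        // the same way, only with a different domain and key.
        ObjectName runtime = new ObjectName("java.lang:type=Runtime");
        return (Long) server.getAttribute(runtime, "Uptime");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("JVM uptime (ms): " + readUptime());
    }
}
```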
 
=== Processing performance counters ===
 
As soon as the Router puts records into the message queue, the Listener pushes them into the Data Flow Process. At that moment, a new section with the following hierarchy appears in the MBeans-tree (only an example, because the PerformanceCounters vary according to your personal usage of SMILA):

* <tt>Pipeline</tt>: lists all invoked pipelines.
** <tt>AddPipeline</tt>
** <tt>DeletePipeline</tt>
* <tt>Processing Service</tt>: lists all processing services which were invoked, sorted by pipelines
** <tt>AddPipeline</tt>
*** <tt>SimpleMimeTypeIdentifier</tt>
** <tt>DeletePipeline</tt>
*** <tt>SolrIndexPipelet</tt>
* <tt>Simple Pipelet</tt>: lists all pipelets which were used, sorted by pipelines
** <tt>AddPipeline</tt>
*** <tt>HtmlToTextPipelet</tt>
*** <tt>SolrIndexPipelet</tt>
  
 
== JMX Client ==

The JMX Client is a lightweight, easy-to-use, command-line driven component for accessing most JMX management operations. It works without jconsole and provides only a few commands. If you want full control over the SMILA framework, you have to use jconsole as described in the chapter above. Furthermore, you can extend the functionality of the JMX Client: it is highly configurable with one single configuration file.
  
 
=== Pre-defined commands (batch-files) ===
 
* <tt>clearOntology</tt>: removes all statements from the "native" ontology.
* <tt>importRDF</tt>: imports an RDF file into the "native" ontology. The first argument is the path to the RDF file (different formats are supported if the suffix is correct, see [[SMILA/Documentation/SesameOntologyManager#JMX Management Agent]]), the second argument is the baseURI for all "relative" resources defined in the file. The value is irrelevant if the file contains only "absolute" URIs.
* <tt>exportRDF</tt>: exports all statements from the "native" ontology to the file "export.rdf" in RDF/XML format.
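For illustration, a command of this kind could be declared in the JMX client configuration roughly as follows. This is only a sketch following the client's cmd/operation schema: the MBean key and operation name below are hypothetical assumptions, not taken from the actual SMILA configuration.

<source lang="xml">
<!-- Sketch only: key and operation name are hypothetical. -->
<cmd id="importRDF" echo="Import an RDF file into the ontology">
  <operation
    domain="SMILA"
    key="org.eclipse.smila.ontology.SesameOntologyManagerAgent"
    name="importRDF"
    echo="Importing RDF file [%1] with base URI [%2]"
  >
    <parameter echo="path to RDF file"/>
    <parameter echo="base URI"/>
  </operation>
</cmd>
</source>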
  
 
=== Usage ===
 
If you open a command window in the folder <tt>SMILA/jmxclient</tt> and execute <tt>run.bat</tt>, you will get a very useful help text.

The JMX Client can be used to simplify JMX management by using batch files for the most important functions. But that is not all: with the aid of the JMX Client you can use SMILA completely from your console, or write your own batch files which could, for example, invoke one method after another. The client works with commands, which are managed in a single configuration file. In addition to the pre-defined commands you are able to create your own commands. You only need to know the fully qualified class name and the method name of the function you want to invoke. To execute a command, simply use this pattern: <tt>run.bat commandName commandParameters</tt>. The JMX client is able to execute any JMX operation, to read any JMX attribute, and to combine them in one batch, reusing previous results.
 
=== Configuration ===
 
There is a schema file located at <tt>org.eclipse.smila.management.jmx.client/schemas/jmxclient.xsd</tt> (Source) and <tt>jmxclient/schemas/jmxclient.xsd</tt> (Build). The default configuration file can be found at <tt>org.eclipse.smila.management.jmx.client/config.xml</tt> (Source) and <tt>jmxclient/config.xml</tt> (Build).
  
  
 
==== Configuration explanation ====
 
===== To use commands which interact with JMX, a connection to the JMX port of SMILA is needed =====
  
 
<source lang="xml"><connection id="local" host="localhost" port="9004"/></source>
 
  
===== Existing commands =====

The JMX client commands for SMILA are defined in the file <tt>config.xml</tt> of the package <tt>org.eclipse.smila.management.jmx.client</tt>. The schema for the commands is defined in the folder <tt>schemas</tt> of the same package.

===== To create your own commands, use the cmd tag following the schema defined in the above folder =====
  
 
* cmd:
 
** id: the name of the command.
** echo: information to display on the console when the command is executed.
*** operation
**** domain: the JMX property root. If not defined, it defaults to "SMILA".
**** key: the class containing the method.
**** name: the name of the method to invoke.
**** echo: information to display on the console when the method is invoked.
***** parameter: one tag for each parameter.
****** echo: description of the parameter.
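Putting these attributes together, a minimal custom command might look like the following sketch. It reuses the <tt>getStatus</tt> operation of the <tt>CrawlerController</tt> that appears in the wait example below; the command id <tt>myCrawlStatus</tt> is a hypothetical name. It would be invoked as <tt>run myCrawlStatus file</tt>, with <tt>file</tt> bound to the placeholder <tt>%1</tt>.

<source lang="xml">
<cmd id="myCrawlStatus" echo="Get crawler status by datasource id">
  <operation
    key="CrawlerController"
    name="getStatus"
    echo="Crawl [%1] status"
  >
    <parameter echo="data source id"/>
  </operation>
</cmd>
</source>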
  
 
===== To keep the console open and stay informed about the current status, you can use the wait tag =====
 
* '''STEP 1''':
 
<source lang="xml">
<cmd id="crawlW" echo="Starting crawler by datasource id and wait for finished">
   <operation
     key="CrawlerController"
     name="startCrawling"
     echo="Starting crawl [%1]">
    <parameter echo="data source id"/>
    <parameter echo="job name"/>
   </operation>
  ...
</source>

:: For the MBean of DOMAIN "SMILA" with KEY "CrawlerController", the operation "startCrawling" with two input parameters (of type String, the default) is executed. JMX will return a result to the client, e.g. ''"Crawler with the dataSourceId 'file' pushing to job 'indexUpdateJob' successfully started! (import run id: 595826)"''
  
 
* '''STEP 2''':
 
:: we need to extract the hash code (which is the import run id) from the crawler's feedback in order to track its activities. This can be done by the following regexp tag, which would return "595826" for the above example.
 
<source lang="xml">
<regexp pattern="^.*\(\D*(\d+)\).*$" group="1" echo="Extracting crawler hash code"/>
</source>
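The extraction that the <tt>regexp</tt> tag performs can be reproduced in plain Java; this sketch applies the pattern to the example feedback message:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractRunId {

    /** Applies the regexp tag's pattern and returns capture group 1, or null. */
    public static String extract(String feedback) {
        Matcher m = Pattern.compile("^.*\\(\\D*(\\d+)\\).*$").matcher(feedback);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String msg = "Crawler with the dataSourceId 'file' pushing to job "
                + "'indexUpdateJob' successfully started! (import run id: 595826)";
        System.out.println(extract(msg)); // prints 595826
    }
}
```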
  
* '''STEP 3''' is an unconditional simple wait task. We have to wait for the JMX counters to be created before we can access them in the next step.

<source lang="xml">
<wait echo="Waiting for jmx counters" pause="1000" />
</source>

* '''STEP 4''' is a wait task - the most complex task - we will wait until the crawl is finished. This wait tag is defined by two subnodes:

<source lang="xml">
<wait echo="Waiting while crawl ends" pause="1000">
   <in>
 
     <cmd id="-" echo="Getting crawler status by datasource id">
       <operation
         key="CrawlerController"
         name="getStatus"
         echo="Crawl [%1] status">
        <!--  value="%1" -->
        <parameter echo="data source id"/>
       </operation>
     </cmd>
     <const value="Finished" echo="Crawling finished status"/>
     <const value="Stopped" echo="Crawling stopped status"/>
     <const value="Aborted" echo="Crawling aborted status"/>
   </in>
   <cmd id="-" echo="Reading crawler performance counters">
     <attribute
       key="Crawlers/%3/Total"
       name="Records"
       echo="Total records"/>
      ...
   </cmd>
</wait>
</source>
:: The first subnode (here the logical IN) is a condition defining when to exit from the WAIT task. The second subnode is a command to execute in each iteration of the wait loop.
:: If the condition does not evaluate to true, the wait task pauses for the given number of milliseconds before entering the next iteration. So every 1000 ms the following is executed:
 
::* three performance counters defined in <tt>cmd</tt> with <tt>id="-"</tt> will be printed.
 
::* it will read the number of records, ask for the crawler's status, and check whether the status is "Finished", "Stopped" or "Aborted".
::* if it is, the crawling has finished and the loop exits; otherwise the next iteration of the loop is started.
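The semantics of the wait tag can be summarized in plain Java. This is only a sketch of the loop's behavior with a faked status source, not the JMX client's actual implementation:

```java
import java.util.List;
import java.util.function.Supplier;

public class WaitLoopSketch {

    /** Polls the status until it is one of the terminal values, pausing in between. */
    public static String waitUntil(Supplier<String> status, List<String> terminal,
                                   long pauseMillis) throws InterruptedException {
        while (true) {
            String current = status.get();      // inner cmd: the getStatus operation
            if (terminal.contains(current)) {   // the <in>/<const> exit condition
                return current;
            }
            // Here the real client would also run the second subnode (the counters cmd).
            Thread.sleep(pauseMillis);          // pause="1000" in the configuration
        }
    }

    public static void main(String[] args) throws Exception {
        // Faked status source standing in for the JMX getStatus call.
        String[] states = { "Running", "Running", "Finished" };
        int[] calls = { 0 };
        Supplier<String> fake = () -> states[Math.min(calls[0]++, states.length - 1)];
        System.out.println(waitUntil(fake, List.of("Finished", "Stopped", "Aborted"), 10));
    }
}
```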
== JMX Client in OSGi console ==
The JMX client is also available in the Equinox OSGi console as a command provider. Thus you can invoke the same configured actions from the OSGi console without having to open a separate window. The command name is <tt>smila</tt>, followed by the same arguments used with the <tt>run</tt> script in <tt>SMILA/jmxclient</tt>. Use <tt>help</tt> to get a description of the supported commands; a lot of help output for the standard Equinox commands usually follows, so you may need to scroll back to find the description of the <tt>smila</tt> command.
  
 
== External links ==
 
* [http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/ Java Management Extensions (JMX)]
* [http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html Using JConsole to Monitor Applications]
 
[[Category:SMILA]]
 
[[Category:SMILA]]

Latest revision as of 02:12, 5 July 2012

Note.png
This is deprecated for SMILA 1.0, the JMX management framework is still functional but it's planned to replace it with management and monitoring HTTP ReST APIs.


SMILA is a framework with a lot of functionality. Most is invoke automatically by internal operations. Nevertheless, the user has to configure and start an initial operation. All functions a user can execute are accessible from the JMX Management Agent. On the following pages you will learn how to use SMILA with the aid of Java's built in JConsole and to handle the JMXClient which features access to SMILA commands via batch files.

Management with the aid of jconsole

The jconsole is a little tool for monitoring java applications nested in the JDK. Over a JMX connection it’s possible to connect an application with the swing UI of jconsole. If you start up SMILA engine and open jconsole you can connect the Jconsole to SMILA immediately.

jconsole

After connecting you can find SMILA operation on MBeans tab in the Tree on the left site.

Smila manageable Components

Most components are now controlled via the HTTP REST API. Remaining JMX controlled components are:

Also, some third-party components embedded in SMILA (e.g. Zookeeper) offer monitoring tools via JMX.

PerformanceCounter

A PerformanceCounter monitors the activity of a component. In SMILA currently two kinds of PerformanceCounters are available, one for Crawlers and another for Processing within the Data Flow Process. With the aid of jconsole you have the possibility to look at interesting counters of SMILA. There exist a lot of views that allow you to get information about different situations.

Processing performance counters

As soon as Router puts Records into MQ the Listener pushes them into Data Flow Process. This time a new section with the following hierarchy (only an example, because PerformanceCounters vary according to your personal usage of SMILA) appears in MBeans-tree:

  • Pipeline: lists all invoked pipelines.
    • AddPipeline
    • DeletePipeline
  • Processing Service: lists all processing services which were invoked, sorted by pipelines
    • AddPipeline
      • SimpleMimeTypeIdentifier
    • DeletePipeline
      • SolrIndexPipelet
  • Simple Pipelet: lists all pipelets which were used, sorted by pipelines
    • AddPipeline
      • HtmlToTextPipelet
      • SolrIndexPipelet

JMX Client

The JMX Client is a lightweight, easy-to-use, command-line driven component that provides access to the most important JMX management operations. It works without JConsole but offers only a few commands; if you want full control over the SMILA framework, use JConsole as described in the chapter above. Furthermore, you can extend the functionality of the JMX Client yourself: it is highly configurable with a single configuration file.

Pre-defined commands (batch-files)

  • clearOntology: removes all statements from the "native" ontology.
  • importRDF: imports an RDF file into the "native" ontology. The first argument is the path to the RDF file (different formats are supported if the file suffix is correct, see SMILA/Documentation/SesameOntologyManager#JMX Management Agent); the second argument is the baseURI for all "relative" resources defined in the file. The value is irrelevant if the file contains only "absolute" URIs.
  • exportRDF: exports all statements from the "native" ontology to the file "export.rdf" in RDF/XML format.

Usage

If you open a command window in the folder SMILA/jmxclient and execute run.bat without arguments, you will get a helpful usage description.

JMX Client

The JMX Client can be used to simplify JMX management by providing batch files for the most important functions. But that is not all: with the aid of the JMX Client you can use SMILA completely from your console, or write your own batch files that invoke, for example, one method after another. The client works with commands, which are managed in a single configuration file. In addition to the pre-defined commands, you can create your own; you only need to know the fully qualified class name and the method name of the function you want to invoke. To execute a command, simply use this pattern: run.bat commandName commandParameters. The JMX Client can execute any JMX operation, read any JMX attribute, and combine several of these in one batch, reusing previous results.

Configuration

The configuration schema is located at org.eclipse.smila.management.jmx.client/schemas/jmxclient.xsd (source) and jmxclient/schemas/jmxclient.xsd (build). The default configuration file can be found at org.eclipse.smila.management.jmx.client/config.xml (source) and jmxclient/config.xml (build).


Configuration explanation

To use commands that interact with JMX, a connection to the JMX port of SMILA is needed:
<connection id="local" host="localhost" port="9004"/>
Existing commands

The JMX Client commands for SMILA are defined in the file config.xml of the package org.eclipse.smila.management.jmx.client. The schema for the commands is defined in the schemas folder of the same package.

To create your own commands, use the cmd element, following the schema defined in the folder mentioned above:
  • cmd:
    • id: the name of the command.
    • echo: information to display on the console when the command is executed.
    • operation:
      • domain: the JMX property root. If not defined, it defaults to "SMILA".
      • key: the class containing the method.
      • name: the name of the method to invoke.
      • echo: information to display on the console when the method is invoked.
      • parameter: one tag for each parameter.
        • echo: description of the parameter.
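Putting these elements together, a custom command definition might look like the following sketch. It only reuses element names and the getStatus operation that appear elsewhere on this page; the command id "status" is a hypothetical example.

```xml
<!-- Hypothetical custom command: prints the status of a crawl.
     key and name are taken from the getStatus example further down this page. -->
<cmd id="status" echo="Getting crawler status by datasource id">
  <operation
    key="CrawlerController"
    name="getStatus"
    echo="Crawl [%1] status">
    <parameter echo="data source id"/>
  </operation>
</cmd>
```

It would then be invoked as run.bat status file, where file is the data source id passed as %1.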
To keep the console open and stay informed about the current status, you can use the wait tag, as in the following crawlW example:
  • STEP 1:
<cmd id="crawlW" echo="Starting crawler by datasource id and wait for finished">
  <operation
    key="CrawlerController"
    name="startCrawling"
    echo="Starting crawl [%1]">
    <parameter echo="data source id"/>
    <parameter echo="job name"/>
  </operation>
  ...
For the MBean of domain "SMILA" with key "CrawlerController", the operation "startCrawling" is executed with two input parameters (of type String, the default). JMX returns a result to the client, e.g. "Crawler with the dataSourceId 'file' pushing to job 'indexUpdateJob' successfully started! (import run id: 595826)".
  • STEP 2:
We need to extract the hash code (which is the import run id) from the crawler's feedback in order to track its activities. This can be done with the following regexp tag, which returns "595826" for the above example:
<regexp pattern="^.*\(\D*(\d+)\).*$" group="1" echo="Extracting crawler hash code"/>
  • STEP 3 is a simple, unconditional wait task: we have to wait for the JMX counters to be created before we can access them in the next step.
<wait echo="Waiting for jmx counters" pause="1000" />
  • STEP 4, the most complex one, is a wait task: we wait until the crawl is finished. This wait tag is defined using two subnodes:
<wait echo="Waiting while crawl ends" pause="1000">
  <in>
    <cmd id="-" echo="Getting crawler status by datasource id">
      <operation
        key="CrawlerController"
        name="getStatus"
        echo="Crawl [%1] status">
        <!--  value="%1" -->
        <parameter echo="data source id"/>
      </operation>
    </cmd>
    <const value="Finished" echo="Crawling finished status"/>
    <const value="Stopped" echo="Crawling stopped status"/>
    <const value="Aborted" echo="Crawling aborted status"/>
  </in>
  <cmd id="-" echo="Reading crawler performance counters">
    <attribute
      key="Crawlers/%3/Total"
      name="Records"
      echo="Total records"/>
      ...
  </cmd>
</wait>
The first subnode (here the logical in) is a condition defining when to exit the wait task. The second subnode is a command to execute in each iteration of the wait loop.
If the condition does not evaluate to true, the wait task pauses for the given number of milliseconds before entering the next iteration. So every 1000 ms the following is executed:
  • the performance counters defined in the cmd with id="-" are printed (here: the total number of records);
  • the crawler's status is requested and checked against "Finished", "Stopped", and "Aborted";
  • if the status matches one of these, the crawl has ended and the loop exits; otherwise the next iteration starts.
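For orientation, the four steps above can be assembled into one command roughly like this. This is a sketch based only on the fragments shown on this page; the performance counters elided with "..." in the original are omitted.

```xml
<cmd id="crawlW" echo="Starting crawler by datasource id and wait for finished">
  <!-- STEP 1: start the crawl -->
  <operation
    key="CrawlerController"
    name="startCrawling"
    echo="Starting crawl [%1]">
    <parameter echo="data source id"/>
    <parameter echo="job name"/>
  </operation>
  <!-- STEP 2: extract the import run id from the crawler's feedback -->
  <regexp pattern="^.*\(\D*(\d+)\).*$" group="1" echo="Extracting crawler hash code"/>
  <!-- STEP 3: give JMX time to create the counters -->
  <wait echo="Waiting for jmx counters" pause="1000"/>
  <!-- STEP 4: poll status and counters until the crawl ends -->
  <wait echo="Waiting while crawl ends" pause="1000">
    <in>
      <cmd id="-" echo="Getting crawler status by datasource id">
        <operation key="CrawlerController" name="getStatus" echo="Crawl [%1] status">
          <parameter echo="data source id"/>
        </operation>
      </cmd>
      <const value="Finished" echo="Crawling finished status"/>
      <const value="Stopped" echo="Crawling stopped status"/>
      <const value="Aborted" echo="Crawling aborted status"/>
    </in>
    <cmd id="-" echo="Reading crawler performance counters">
      <attribute key="Crawlers/%3/Total" name="Records" echo="Total records"/>
    </cmd>
  </wait>
</cmd>
```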

JMX Client in OSGi console

The JMX Client is also available in the Equinox OSGi console as a command provider, so you can invoke the same configured actions from the OSGi console without having to open a separate window. The command name is smila, followed by the same arguments used with the run script in SMILA/jmxclient. Use help to get a description of the supported commands; a lot of help output for the standard Equinox commands follows, so you may need to scroll back to find the description of the smila command. The commands work exactly as with the run script; only the command name is smila instead of run.

External links