SMILA/Documentation/Search (Eclipsepedia, revision of 2014-11-28 by Marco.strack.empolis.com)
<hr />
<div>This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests internally, and the sample servlet used to create a simple web-based GUI for search. <br />
<br />
=== Introduction ===<br />
<br />
Let's start right at the top: Provided that you have installed SMILA and created an index by starting a crawler as described in [[SMILA/Documentation for 5 Minutes to Success|5 Minutes to Success]], you can use your web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] and search the index: <br />
<br />
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]] <br />
<br />
When you enter a query string and submit the form, a servlet on the server side creates a SMILA record from the HTTP parameters, uses the search service to process the search with this record, receives an enriched version of the query record together with a list of result records in XML form, and uses an XSLT stylesheet to create a result page in HTML format. <br />
<br />
The processing inside the search service is done by calling a simple, built-in JavaScript file that does the work. <br />
<br />
By clicking the ''Advanced'' link at the top of the search page (or by entering the URL <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>), you can switch to a more detailed search form page, which allows you to construct more specific search queries: <br />
<br />
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]] <br />
<br />
If you want to use the default search servlet for your own search page, you are encouraged to use the two XSLT files that create these HTML pages as a reference or basis when building your own pages. <br />
<br />
=== Search Processing ===<br />
<br />
As mentioned before, the internal processing is done via the search service, which uses JavaScript to keep a high level of flexibility. Alternatively, BPEL workflows can be used to steer the query through the system. If you want to dig deeper, more thorough documentation of both concepts can be found at:<br />
* [[SMILA/Documentation/Scripting]] and<br />
* [[SMILA/Documentation/BPEL Workflow Processor]]<br />
<br />
Since both chapters mainly address the indexing side of these concepts, the following two paragraphs explain how they are used in query scenarios. <br />
<br />
==== Search Scripts ====<br />
<br />
Search scripts follow the same record-in, record-out semantics you may already know from indexing. The difference is that the input record contains the query and possibly some additional parameters, while the output record holds the search result. The best explanation is the actual script used in the demo search described above: you'll find the file '''search.js''' in '''configuration/org.eclipse.smila.scripting'''. <br />
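The record-in, record-out contract can be illustrated with a small sketch. The real implementation is the JavaScript file mentioned above; this Python version is purely illustrative, and the function body is invented: <br />

```python
# Illustrative sketch of the record-in, record-out contract of a search
# script. The real script is JavaScript; this body is a stand-in for the
# actual call to the search engine.
def process(record):
    query = record.get("query", "")
    # ... the actual script would run the search on the index here ...
    record["records"] = []  # results are added under the key "records"
    record["count"] = 0
    return record
```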
<br />
==== Search Pipelines ====<br />
<br />
Search workflows (or pipelines) look just like indexing pipelines, they are only used a bit differently: Instead of pushing lists of records corresponding to data source objects through them, they are invoked with a single record representing the search request. This record contains the values of the parameters which were defined by the Search API (see below). The request object can be analyzed and enriched with additional information during the workflow before the actual search on the index takes place. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do further processing based on the request data and the result record list (e.g. highlighting). Finally, the request record including the search results is returned to the client and can be presented. <br />
<br />
Pipelet invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample). <br />
<br />
=== Search Service API ===<br />
<br />
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a SMILA query record and the name of a search script or workflow as input, execute the script/workflow with the record, and return the result in different formats: <br />
<br />
*<tt>Record searchWithScript(String scriptAndFunction, Record query) throws ProcessingException</tt>: This is the basic method of the search service, returning the result records as native SMILA data structures. The other methods call this method for the actual search execution, too, and then just convert the result. <br />
*<tt>org.w3c.dom.Document searchAsXmlWithScript(String scriptAndFunction, Record query) throws ProcessingException</tt>: Returns the search result as an XML DOM document. See below for the schema of the result. <br />
*<tt>String searchAsXmlStringWithScript(String scriptAndFunction, Record query) throws ProcessingException</tt>: Returns the search result as an XML string. See below for the schema of the result.<br />
<br />
The corresponding calls for BPEL are (the same description applies):<br />
<br />
*<tt>Record search(String workflowName, Record query) throws ProcessingException</tt> <br />
*<tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt> <br />
*<tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt><br />
<br />
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition): <br />
<br />
<source lang="xml"><br />
<element name="SearchResult"><br />
<complexType><br />
<sequence minOccurs="1" maxOccurs="1"><br />
<choice><br />
<element name="Workflow" type="string" minOccurs="1" maxOccurs="1" /><br />
<element name="Script" type="string" minOccurs="1" maxOccurs="1" /><br />
</choice><br />
<element ref="rec:Record" minOccurs="0" maxOccurs="1" /><br />
</sequence><br />
</complexType><br />
</element><br />
</source> <br />
<br />
You can view the result XML when using the sample SMILA search page at <tt>http://localhost:8080/SMILA/search</tt> if you enable the ''Show XML result'' option before submitting the query. <br />
<br />
The content of the query record depends largely on the search services used. However, the Search API also includes a recommendation on where to put some basic, commonly used search parameters, which all index integrations should honor (of course, they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations. <br />
<br />
=== Query Parameters ===<br />
<br />
The query record mainly consists of parameters. The Search API defines the names of these parameters, the allowed values, and default values for a set of commonly used parameters. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose; however, certain parameters may be unsupported if they are not feasible with the underlying technology. All parameters are single-valued unless otherwise specified. <br />
<br />
*''query'': Either a search string using a query syntax or a query record describing the query by setting values for attributes (aka fielded search). The implementer for a specific underlying technology may define a query syntax to be able to build complex search criteria in a single string. However, SMILA currently does not define its own query syntax and passes the string as-is to its default search engine [[SMILA/Documentation/Solr|Solr]] (see there for handling and interpretation).<br />
**Example using a query string:<br />
<br />
<source lang="xml"><br />
<Record><br />
<Val key="query">meaning of life</Val><br />
</Record><br />
</source> <br />
<br />
*Example using a query object (fielded search):<br />
<br />
<source lang="xml"><br />
<Record><br />
<Map key="query"><br />
<Val key="author">shakespeare</Val><br />
<Val key="title">hamlet</Val><br />
</Map><br />
</Record><br />
</source> <br />
<br />
*''maxcount'': The maximum number of records which should be returned to the search client. Default value is 10. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
</source> <br />
<br />
*''offset'': The number of hits which, starting from the top, should be skipped in the search result. Default value is 0. Use this parameter to implement result list paging and to provide the user a means to navigate through the result pages: If maxcount=10, the "next page" queries can be identical to the initial query, but with offset=10, 20, ... Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
<Val key="offset" type="long">3</Val><br />
</source> <br />
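The paging scheme above can be sketched as a small helper that builds the parameters for a given result page (a hypothetical helper, using the parameter names defined by the Search API): <br />

```python
def page_parameters(query, page, page_size=10):
    """Build the query parameters for result page `page` (0-based):
    keep maxcount fixed and advance offset by the page size."""
    return {
        "query": query,
        "maxcount": page_size,
        "offset": page * page_size,
    }
```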
<br />
*''threshold'': The minimum relevance score that a result must have in order to be returned to the search client. Default is 0.0.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="threshold" type="double">0.5</Val><br />
</source> <br />
<br />
*''language'': The natural language of the query. No default value. This parameter may be required by language-specific pipelets/services that need to know the language in which the user expresses the query in order to deliver reasonable results. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">sinn des lebens</Val><br />
<Val key="language">de</Val><br />
</source> <br />
<br />
*''indexname'': Some index services (like Solr) can manage multiple indices at once. When doing so, they can use this parameter to select the index which is to be searched with the current request. However, when using such a scenario, it is recommended to configure a default index name, so that search requests will succeed without having this parameter set explicitly. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="indexname">wikipedia</Val><br />
</source> <br />
<br />
*''resultAttributes'': A multi-valued parameter containing the names of the attributes which the search engine should add to the result records. Since including too many attributes will decrease performance, the list should contain only those attributes that are needed by some pipelets for further processing after the search has taken place or for displaying the results in the end. Omitting the parameter results in getting all available attributes. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="resultAttributes"><br />
<Val>author</Val><br />
<Val>title</Val><br />
</Seq><br />
</source> <br />
<br />
*''highlight'': A sequence of string values specifying the attribute names for which highlighting should be returned. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="highlight"><br />
<Val>content</Val><br />
</Seq><br />
</source> <br />
<br />
*''sortby'': A sequence of maps, each containing the ''key'' "attribute" (any string) and the ''key'' "order" ("ascending" | "descending"), specifying that the search result should be sorted by the named attributes in the given order. Omitting this parameter results in the search result being sorted by descending relevance (score, similarity, ranking, ...). Multiple maps can be added and should be evaluated in the order of their appearance. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="sortby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="order">descending</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Val key="order">ascending</Val><br />
</Map><br />
</Seq><br />
</source> <br />
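The semantics of such a multi-level sort can be sketched in Python (an illustrative sketch, not SMILA code): applying a stable sort from the least significant criterion to the most significant one yields the cascading order described above. <br />

```python
def sort_records(records, sortby):
    """Sort result records by a list of {"attribute": ..., "order": ...} maps."""
    # Apply the criteria in reverse: Python's sort is stable, so criteria
    # listed earlier in `sortby` end up dominating the final order.
    for criterion in reversed(sortby):
        records.sort(key=lambda r: r[criterion["attribute"]],
                     reverse=(criterion["order"] == "descending"))
    return records
```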
<br />
*''facetby'': A sequence of maps, each containing the ''key'' "attribute" (any string) and the ''key'' "maxcount" (long). This causes facets for the specified attributes to be returned with the search result, with up to "maxcount" values per attribute. Optionally, each facetby map may contain a map with key "sortby", with keys "order" ("ascending" | "descending") and "criterion" (any string, e.g. "count" or "value"), specifying in which order to return the values (e.g. "count" orders by the number of hits per facet value, "value" by the attribute value itself). Example:<br />
<br />
{{Note|since 1.0|prior to 1.0 this parameter was named ''groupby''; it has merely been renamed (see [http://dev.eclipse.org/mhonarc/lists/smila-dev/msg00998.html mail thread])}}<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="facetby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="maxcount" type="long">10</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Map key="sortby"><br />
<Val key="criterion">value</Val><br />
<Val key="order">ascending</Val> <br />
</Map><br />
<Val key="maxcount" type="long">5</Val><br />
</Map><br />
</Seq><br />
</source> <br />
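Conceptually, faceting counts attribute values over the matching records and returns the most frequent ones. A plain Python sketch of that idea (not the actual engine implementation): <br />

```python
from collections import Counter

def compute_facet(records, attribute, maxcount):
    """Return up to `maxcount` values of `attribute` as
    {"value": ..., "count": ...} maps, ordered by descending count."""
    counts = Counter(r[attribute] for r in records if attribute in r)
    return [{"value": v, "count": c} for v, c in counts.most_common(maxcount)]
```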
<br />
*''filter'': A sequence of maps describing, for certain attributes, which values they must have in valid result records. Each of the maps contains a ''key'' "attribute" and one or more value descriptions: <br />
**"oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values. <br />
**"atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="filter"><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Seq key="oneOf"><br />
<Val>pratchett</Val><br />
<Val>adams</Val><br />
</Seq><br />
</Map><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="atLeast">1990</Val><br />
<Val key="lessThan">2000</Val><br />
</Map><br />
</Seq><br />
</source> <br />
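The intended semantics of these filter maps can be sketched as follows (a hypothetical helper for illustration, not SMILA code): <br />

```python
def record_matches(record, filters):
    """Check one result record against a list of filter maps.
    ('allOf' for multi-valued attributes is omitted for brevity.)"""
    for f in filters:
        if f["attribute"] not in record:
            return False  # record lacks the filtered attribute
        value = record[f["attribute"]]
        if "oneOf" in f and value not in f["oneOf"]:
            return False
        if "noneOf" in f and value in f["noneOf"]:
            return False
        if "atLeast" in f and value < f["atLeast"]:       # inclusive bound
            return False
        if "atMost" in f and value > f["atMost"]:         # inclusive bound
            return False
        if "greaterThan" in f and value <= f["greaterThan"]:  # exclusive
            return False
        if "lessThan" in f and value >= f["lessThan"]:        # exclusive
            return False
    return True
```

With the example filter above, a record by "adams" from 1995 matches, while one by "shakespeare" or one from 2000 does not. <br />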
<br />
*''ranking'': A configuration defining how to rank the search results. This is highly dependent on the search engine used, so we do not specify it further in SMILA.<br />
<br />
=== Result Annotations ===<br />
<br />
The search result is usually the request record, enriched with result data. <br />
<br />
*''records'': A sequence of maps describing the actual search result, i.e. the records retrieved from the index. Each record should have an additional attribute "_weight" describing the relevance score of this record with respect to the query. The size of the "records" sequence is limited by the "maxcount" parameter.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
...<br />
</Map><br />
<Map><br />
<Val key="_weight" type="double">0.90</Val><br />
<Val key="_recordid">file:hitchhiker</Val><br />
<Val key="title">Hitchhiker's Guide to the Galaxy</Val><br />
<Val key="author">Adams</Val><br />
...<br />
</Map><br />
</Seq><br />
</source> {{Note|return binary content|<br />
There is no nice way to return binary content anymore, as attachments may only be top-level children of a record. Two solutions are possible:<br />
# Add an attachment to the search record with a name following this pattern: <resultItem-record.Id>.<resultItem.attachmentName><br />
# Convert the byte[] into a string (e.g. base64-encoded, so it is serializable) and return it in the AnyMap<br />
}} <br />
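The second workaround is straightforward: any base64 codec will do, as long as both sides agree on it. A Python sketch of the round trip (helper names are invented): <br />

```python
import base64

def encode_for_anymap(data: bytes) -> str:
    """Encode binary content as a base64 string so it can be stored
    as a plain string value in the result record's AnyMap."""
    return base64.b64encode(data).decode("ascii")

def decode_from_anymap(text: str) -> bytes:
    """Reverse the encoding on the client side."""
    return base64.b64decode(text.encode("ascii"))
```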
<br />
*''count'': The total number of records in the index that have any relevance to the query. Example see ''runtime''. <br />
*''indexSize'' (optional): The total number of records in the searched index. Example see ''runtime''. <br />
*''runtime'': The execution time of the request in milliseconds, added by the search service. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="count" type="long">123456</Val><br />
<Val key="indexSize" type="long">987654321</Val><br />
<Val key="runtime" type="long">42</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<!-- contains returned records --><br />
</Seq><br />
</source> <br />
<br />
*''facets'': The faceting results as requested by the ''facetby'' parameters. This Map contains a nested Seq for each requested facet and its values.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Map key="facets"><br />
<Seq key="year"><br />
<Map><br />
<Val key="value">2000</Val><br />
<Val key="count" type="long">42</Val><br />
</Map><br />
<Map><br />
<Val key="value">2001</Val><br />
<Val key="count" type="long">21</Val><br />
</Map><br />
...<br />
</Seq><br />
<Seq key="author"><br />
<Map><br />
<Val key="value">adams</Val><br />
<Val key="count" type="long">13</Val><br />
</Map><br />
<Map><br />
<Val key="value">shakespeare</Val><br />
<Val key="count" type="long">17</Val><br />
</Map><br />
...<br />
</Seq><br />
</Map><br />
</source> <br />
<br />
*''_highlight'': The annotation of the result record, usually used to highlight relevant sections of the result documents so that the user can see at a glance whether a result suits what he or she was looking for. What exactly is returned here depends on the search engine used. For example, the Solr integration in SMILA returns the raw form of the text and information about the matching parts to be highlighted. Example:<br />
<br />
<source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
...<br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To be or not to be ...</Val><br />
<Seq key="positions"><br />
<Map><br />
<Val key="start" type="long">7</Val><br />
<Val key="end" type="long">9</Val><br />
<Val key="quality" type="long">100</Val><br />
</Map><br />
<Map><br />
<Val key="start" type="long">20</Val><br />
<Val key="end" type="long">22</Val><br />
<Val key="quality" type="long">95</Val><br />
</Map><br />
</Seq><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</source> Using the HighlightingPipelet this can be transformed into a highlighted text fragment (here using * as the highlight tag): <source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To *be* or not to *be* ...</Val><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</source><br />
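The transformation performed by the HighlightingPipelet can be sketched in Python: insert the highlight tag around each <tt>[start, end)</tt> span, working from the last position backwards so that earlier offsets stay valid. This is a simplified sketch that ignores the ''quality'' values: <br />

```python
def apply_highlight(text, positions, tag="*"):
    """Wrap each [start, end) span of `positions` in `tag` characters.
    Spans are processed back to front so offsets remain valid."""
    for pos in sorted(positions, key=lambda p: p["start"], reverse=True):
        start, end = pos["start"], pos["end"]
        text = text[:start] + tag + text[start:end] + tag + text[end:]
    return text
```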
<br />
=== Helper Classes ===<br />
<br />
There are some classes that help a client to create query records with their annotations and to read result records and their annotation. You can find them in package <tt>org.eclipse.smila.search.api.helper</tt>: <br />
<br />
*<tt>QueryBuilder</tt>: A helper class for building queries and sending them to the search service. Returns the result wrapped in a: <br />
*<tt>ResultAccessor</tt>: A wrapper for the complete search result. Provides methods to access the basic top-level result annotations and to access each search result record, wrapped in a: <br />
*<tt>ResultRecordAccessor</tt>: Defines methods for accessing some of the result record annotations.<br />
<br />
See the source code or JavaDocs for more details on the provided methods. <br />
<br />
=== SMILA Search Servlet ===<br />
<br />
In addition to the "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos only, not for productive use. It is usually deployed in the Jetty instance that comes with SMILA at <tt>/SMILA/search</tt>. On first invocation, it currently creates an almost empty query record (it sets some default parameters like ''maxcount'' etc.) and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page. <br />
<br />
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of the indices available in Solr so that the search page can display them for selection on the left side: <br />
<br />
<source lang="xml"><br />
<SearchResult xmlns="http://www.eclipse.org/smila/search"><br />
<Script>search.process</Script><br />
<!-- 'Workflow' instead of 'Script' is also possible --><br />
<Record xmlns="http://www.eclipse.org/smila/record"><br />
<!-- effective query and embedded result records --><br />
</Record><br />
<!-- part added by SearchServlet --><br />
<IndexNames><br />
<IndexName>test_index</IndexName><br />
</IndexNames><br />
</SearchResult><br />
</source> <br />
<br />
You can use the same mechanism to add other information to the XML that is necessary for displaying purposes in the search form but not contained in the search service result: You just have to implement your own servlet or extend the default servlet. Please refer to the source code for details. <br />
<br />
==== XSLT Stylesheets for SMILA search and result pages ====<br />
<br />
The stylesheets are loaded from the configuration directory <tt>org.eclipse.smila.search.servlet</tt> and are selected by adding the HTTP parameter "style" to the URL. The value of this parameter must be the filename of the desired stylesheet without the suffix; the file's extension must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set. <br />
<br />
In the default application, three stylesheets are available: <br />
<br />
*SMILASearchDefault: The default search page. Use this as a reference on how to describe simple queries and present result lists, including paging through bigger results. <br />
*SMILASearchAdvanced: Same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters. <br />
*SMILASearchTest: Primitive layout without paging but demonstrates the setting of even more query features.<br />
<br />
To start with a stylesheet other than the default, you can add a ''style'' parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>. <br />
<br />
In the following, we describe how to set query record features using the servlet. Please have a look at the sample stylesheets for complete examples of how to apply them; we will not present a full tutorial here (-; <br />
<br />
==== Setting parameters ====<br />
<br />
To set a parameter, just use the parameter name as the HTTP parameter name. All values of this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the ''maxcount'' parameter to 7 using a hidden HTML input field, use: <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="maxcount" value="7" /><br />
</source> <br />
<br />
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules. <br />
<br />
==== Setting attributes ====<br />
<br />
You can add literal string values to attributes using "A.&lt;AttributeName&gt;" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use: <br />
<br />
<source lang="xml"><br />
<input type="text" name="A.Title" /><br />
</source> <br />
<br />
==== Setting other parameters ====<br />
<br />
To add a "sortby" parameter for an attribute, use "sortby.&lt;AttributeName&gt;=&lt;order&gt;", e.g. <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="sortby.FileSize" value="descending" /><br />
</source> <br />
<br />
To create a filter for an attribute, use HTTP params: <br />
<br />
*"F.val.&lt;AttributeName&gt;" to add filter values to an "oneOf" filter. <br />
*"F.min.&lt;AttributeName&gt;" and "F.max.&lt;AttributeName&gt;" to set the lower/upper bounds of an "atLeast"/"atMost" filter.<br />
<br />
If both "F.val" and "F.min"/"F.max" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. What happens in this case depends on the search engine integration used. E.g. <br />
<br />
*To set a filter for attribute ''MimeType'' restricting the result to HTML documents, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.val.MimeType" value="text/html" /><br />
</source> <br />
<br />
*To set a filter for attribute ''FileSize'' restricting the result to document sizes between 1000 and 10000 bytes, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.min.FileSize" value="1000" /><br />
<input type="hidden" name="F.max.FileSize" value="10000" /><br />
</source> <br />
<br />
To set a value in the ranking parameter for the complete record or for an attribute, use "R[.&lt;AttributeName&gt;].&lt;ValueName&gt;". E.g., the following input field adds a parameter "Operator=OR" to attribute "Content": <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="R.Content.Operator" value="OR" /><br />
</source><br />
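For quick testing outside an HTML form, the servlet's naming rules can be combined into a plain GET URL. A sketch using Python's standard library (the endpoint and the parameter names follow the rules described above; the values are sample data): <br />

```python
from urllib.parse import urlencode

# Parameter names follow the servlet's naming rules described above.
params = [
    ("query", "meaning of life"),
    ("A.Title", "hamlet"),              # attribute literal
    ("sortby.FileSize", "descending"),  # sort criterion
    ("F.min.FileSize", "1000"),         # range filter lower bound
    ("F.val.MimeType", "text/html"),    # enumeration filter value
]
url = "http://localhost:8080/SMILA/search?" + urlencode(params)
```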
<br />
==== Adding attachments ====<br />
<br />
Attachments can be added to the query record by adding file upload fields to the search form, for example:<br />
<br />
<source lang="xml"><br />
<input type="file" name="Content"/> <br />
</source><br />
<br />
If the user selects a file for this field, it will be uploaded to SMILA and added as attachment "Content". Of course, there must be pipelets in your search pipeline that can process this attachment. Note that in a default SMILA configuration the attachments are kept in memory, so they should not be too large.<br />
<br />
=== Record Search Servlet ===<br />
<br />
In addition, there is a very basic Record Search Servlet available at {{Path|/SMILA/recordsearch}}. <br />
<br />
You can do a POST or GET request on this URL with a SMILA search record in XML representation as the request body. The servlet then parses the given XML and calls the Search Service. By default it uses the SearchPipeline, but you can select any other pipeline by adding the {{code|_workflow}} annotation with the respective pipeline name to the search record.<br />
<br />
The servlet returns the XML representation of the record returned by the Search Service as-is; in it you can find the search results (see above).<br />
<br />
<br />
[[Category:SMILA]]</div>
<hr />
<div></div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=File:SMILA-search-page-default.png&diff=374339File:SMILA-search-page-default.png2014-11-28T10:21:14Z<p>Marco.strack.empolis.com: Marco.strack.empolis.com uploaded a new version of &quot;File:SMILA-search-page-default.png&quot;</p>
<hr />
<div></div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=File:SMILA-search-page-default.png&diff=374338File:SMILA-search-page-default.png2014-11-28T10:18:03Z<p>Marco.strack.empolis.com: Marco.strack.empolis.com uploaded a new version of &quot;File:SMILA-search-page-default.png&quot;</p>
<hr />
<div></div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Search&diff=374337SMILA/Documentation/Search2014-11-28T10:09:10Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests internally, and the sample servlet used to create a simple web-based GUI for search. <br />
<br />
=== Introduction ===<br />
<br />
Let's start right at the top: Provided that you installed SMILA and created an index by starting a crawler as described in [[SMILA/Documentation for 5 Minutes to Success|5 Minutes to Success]], you can use you web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] and search on the index: <br />
<br />
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]] <br />
<br />
What happens behind the scenes when you enter a query string and submit the form, is that, on the server side, a servlet creates a SMILA record from the HTTP parameters, uses the search service to process the search with this record, receives an enriched version of the query record and also a list of result records in XML form, and uses an XSLT stylesheet to create a result page in HTML format. <br />
<br />
The processing inside the search service is done by calling a simple (builtin) javascript file to do the work . <br />
<br />
By clicking the ''Advanced'' link at the top of the search page (or by entering the URL <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>), you can switch to a more detailed search form page, which allows you to construct more specific search queries: <br />
<br />
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]] <br />
<br />
If you want to use the default search servlet for your own search page, you are encouraged to use the two XSLT files creating these HTML pages as a reference or basis when building your pages. <br />
<br />
=== Search Processing ===<br />
<br />
As mentioned before, the internal processing is done via the search service which makes use of javascript for keeping an high level of flexbility. There is also the possibility to use BPEL workflows to steer the query throughout the system. If you want to dig deeper, a more thorough documentation of both concepts can be found at:<br />
* [[SMILA/Documentation/Scripting]] and<br />
* [[SMILA/Documentation/BPEL Workflow Processor]]<br />
<br />
Since both chapters rather address the indexing side of both concepts, the following two paragraphs provide an explanation of how they're to be used in query scenarios. <br />
<br />
==== Search Scripts ====<br />
<br />
In the scripting environment, this follows the same record-in, record-out semantics you may already know. Difference is, that the input record contains the query and maybe some additional parameters while the output record holds the search result. The best explanation is given by looking at the actual JS which has been used in the demo search described above. You'll find the file '''search.js''' in '''configuration/org.eclipse.smila.scripting'''. <br />
<br />
==== Search Pipelines ====<br />
<br />
Search workflows (or pipelines) look just like indexing pipelines; they are only used a bit differently: instead of pushing lists of records corresponding to data source objects through them, they are invoked with a single record representing the search request. This record contains the values of the parameters defined by the Search API (see below). The request object can be analyzed and enriched with additional information during the workflow before the actual search on the index takes place. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do additional processing based on the request data and the result record list (e.g. highlighting). Finally, the request record including the search results is returned to the client and can be presented. <br />
<br />
Pipelet invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample). <br />
<br />
=== Search Service API ===<br />
<br />
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a SMILA query record and the name of a search script or workflow as input, execute the script/workflow with the record, and return the result in different formats: <br />
<br />
*<tt>Record searchWithScript(String scriptAndFunction, Record query) throws ProcessingException</tt>: This is the basic method of the search service, returning the result records as native SMILA data structures. The other methods call this method for the actual search execution, too, and then just convert the result. <br />
*<tt>org.w3c.dom.Document searchAsXmlWithScript(String scriptAndFunction, Record query) throws ProcessingException</tt>: Returns the search result as an XML DOM document. See below for the schema of the result. <br />
*<tt>String searchAsXmlStringWithScript(String scriptAndFunction, Record query) throws ProcessingException</tt>: Returns the search result as an XML string. See below for the schema of the result.<br />
<br />
The corresponding calls for BPEL are (the same description applies):<br />
<br />
*<tt>Record search(String workflowName, Record query) throws ProcessingException</tt> <br />
*<tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt> <br />
*<tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt><br />
<br />
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition): <br />
<br />
<source lang="xml"><br />
<element name="SearchResult"><br />
<complexType><br />
<sequence minOccurs="1" maxOccurs="1"><br />
<choice><br />
<element name="Workflow" type="string" minOccurs="1" maxOccurs="1" /><br />
<element name="Script" type="string" minOccurs="1" maxOccurs="1" /><br />
</choice><br />
<element ref="rec:Record" minOccurs="0" maxOccurs="1" /><br />
</sequence><br />
</complexType><br />
</element><br />
</source> <br />
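For illustration, a minimal instance of this schema could look like the following sketch (the element content depends on the executed workflow or script and on the query; the workflow name is taken from the servlet sample further below): <br />
<br />
<source lang="xml"><br />
<SearchResult xmlns="http://www.eclipse.org/smila/search"><br />
  <Workflow>searchpipeline</Workflow><br />
  <Record xmlns="http://www.eclipse.org/smila/record"><br />
    <Val key="query">meaning of life</Val><br />
    <!-- enriched result data, see "Result Annotations" below --><br />
  </Record><br />
</SearchResult><br />
</source> <br />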
<br />
You can view the result XML when using the sample SMILA search page at <tt>http://localhost:8080/SMILA/search</tt> if you enable the ''Show XML result'' option before submitting the query. <br />
<br />
The content of the query record largely depends on the search services used. However, the Search API also includes recommendations on where to put some commonly used search parameters, which all index integrations should honor (of course, they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations. <br />
<br />
=== Query Parameters ===<br />
<br />
The query record mainly consists of parameters. For a set of commonly used parameters, the Search API defines their names, allowed values, and, for some of them, default values. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose; however, certain parameters may not be supported if it is not feasible with the underlying technology. All parameters are single-valued unless otherwise specified. <br />
<br />
*''query'': Either a search string using a query syntax or a query record describing the query by setting values for attributes (aka fielded search). The implementer for a specific underlying technology may define a query syntax to be able to build complex search criteria in a single string. However, SMILA currently does not define its own query syntax and passes the string as-is to its default search engine [[SMILA/Documentation/Solr|Solr]] (see there for handling and interpretation).<br />
**Example using a query string:<br />
<br />
<source lang="xml"><br />
<Record><br />
<Val key="query">meaning of life</Val><br />
</Record><br />
</source> <br />
<br />
*Example using a query object (fielded search):<br />
<br />
<source lang="xml"><br />
<Record><br />
<Map key="query"><br />
<Val key="author">shakespeare</Val><br />
<Val key="title">hamlet</Val><br />
</Map><br />
</Record><br />
</source> <br />
<br />
*''maxcount'': The maximum number of records which should be returned to the search client. Default value is 10. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
</source> <br />
<br />
*''offset'': The number of hits which, starting from the top, should be skipped in the search result. Default value is 0. Use this parameter to implement result list paging and to provide the user a means to navigate through the result pages: if maxcount=10, the "next page" queries can be identical to the initial query, but with offset=10, 20, and so on. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
<Val key="offset" type="long">3</Val><br />
</source> <br />
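The paging arithmetic described above can be sketched as follows (a generic helper for illustration only, not part of the SMILA API):

```java
public class Paging {
    // Offset to request for a 1-based page number, given the page size
    // (the "maxcount" parameter): page 1 -> 0, page 2 -> maxcount, ...
    public static long offsetForPage(long page, long maxcount) {
        return (page - 1) * maxcount;
    }

    public static void main(String[] args) {
        // "next page" queries repeat the original query with a new offset
        System.out.println(offsetForPage(2, 10)); // prints 10
        System.out.println(offsetForPage(4, 10)); // prints 30
    }
}
```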
<br />
*''threshold'': The minimal relevance score that a result must have to be returned to the search client. Default is 0.0. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="threshold" type="double">0.5</Val><br />
</source> <br />
<br />
*''language'': The natural language of the query. No default value. This parameter may be required for language-specific pipelets/services that need to know in which language the user is expressing his or her query to be able to deliver feasible results. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">sinn des lebens</Val><br />
<Val key="language">de</Val><br />
</source> <br />
<br />
*''indexname'': Some index services (like Solr) can manage multiple indices at once. When doing so, they can use this parameter to select the index which is to be searched with the current request. However, when using such a scenario, it is recommended to configure a default index name, so that search requests will succeed without having this parameter set explicitly. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="indexname">wikipedia</Val><br />
</source> <br />
<br />
*''resultAttributes'': A multi-valued parameter containing the names of the attributes which the search engine should add to the result records. Since including too many attributes will decrease performance, the list should contain only those attributes that are needed by some pipelets for further processing after the search has taken place or for displaying the results in the end. Omitting the parameter results in getting all available attributes. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="resultAttributes"><br />
<Val>author</Val><br />
<Val>title</Val><br />
</Seq><br />
</source> <br />
<br />
*''highlight'': A sequence of string values specifying the attribute names for which highlighting should be returned. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="highlight"><br />
<Val>content</Val><br />
</Seq><br />
</source> <br />
<br />
*''sortby'': A sequence of maps, each containing the ''key'' "attribute" (any string) and the ''key'' "order" ("ascending" | "descending"), specifying that the search result should be sorted by the named attributes in the given order. Omitting this parameter results in the search result being sorted by descending relevance (score, similarity, ranking, ...). Multiple maps can be added and should be evaluated in the order of their appearance. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="sortby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="order">descending</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Val key="order">ascending</Val><br />
</Map><br />
</Seq><br />
</source> <br />
<br />
*''facetby'': A sequence of maps, each containing the ''key'' "attribute" (any string) and the ''key'' "maxcount" (long). This causes facets to be returned with the search result for the specified attributes, with "maxcount" values per attribute. Optionally, each facetby map may contain a map with key "sortby" containing the keys "order" ("ascending" | "descending") and "criterion" (any string, e.g. "count" or "value"), specifying in which order to return the values (e.g. "count" to sort by the number of hits per facet value, or "value" to sort by the facet value itself). Example:<br />
<br />
{{Note|since 1.0|prior to 1.0 this parameter was named ''groupby'' and has merely been renamed (see [http://dev.eclipse.org/mhonarc/lists/smila-dev/msg00998.html mail thread])}}<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="facetby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="maxcount" type="long">10</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Map key="sortby"><br />
<Val key="criterion">value</Val><br />
<Val key="order">ascending</Val> <br />
</Map><br />
<Val key="maxcount" type="long">5</Val><br />
</Map><br />
</Seq><br />
</source> <br />
<br />
*''filter'': A sequence of maps describing for certain attributes which values they must have in valid result records. Each of the maps contains a ''key'' "attribute" and one or more value descriptions: <br />
**"oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values. <br />
**"atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="filter"><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Seq key="oneOf"><br />
<Val>pratchett</Val><br />
<Val>adams</Val><br />
</Seq><br />
</Map><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="atLeast">1990</Val><br />
<Val key="lessThan">2000</Val><br />
</Map><br />
</Seq><br />
</source> <br />
<br />
*''ranking'': A configuration defining how to rank the search results. This is highly dependent on the search engine used, so SMILA does not specify it further.<br />
<br />
=== Result Annotations ===<br />
<br />
The search result is usually the request record, enriched with result data. <br />
<br />
*''records'': A sequence of maps describing the actual search result, i.e. the records retrieved from the index. Each record should have an additional attribute "_weight" describing the relevance score of this record with respect to the query. The size of the "records" sequence is limited by the "maxcount" parameter.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
...<br />
</Map><br />
<Map><br />
<Val key="_weight" type="double">0.90</Val><br />
<Val key="_recordid">file:hitchhiker</Val><br />
<Val key="title">Hitchhiker's Guide to the Galaxy</Val><br />
<Val key="author">Adams</Val><br />
...<br />
</Map><br />
</Seq><br />
</source> {{Note|return binary content|<br />
There is no convenient way to return binary content anymore, as attachments may only be top-level children of a record. Two solutions are possible:<br />
# add an attachment to the search record with a name following this pattern: <resultItem-record.Id>.<resultItem.attachmentName><br />
# convert the byte[] into a string (e.g. via Base64 encoding, so it is serializable) and return it in the AnyMap<br />
}} <br />
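For the second workaround, standard Java facilities can be used to make the bytes string-serializable, e.g. (a minimal sketch using java.util.Base64; this is not SMILA-specific API):

```java
import java.util.Base64;

public class AttachmentEncoding {
    // Encode binary attachment content as a Base64 string so it can be
    // stored as a plain string value in the result record's AnyMap.
    public static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    // Client side: decode the string back to the original bytes.
    public static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```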
<br />
*''count'': The total number of records in the index that have any relevance to the query. For an example, see ''runtime''. <br />
*''indexSize'' (optional): The total number of records in the searched index. For an example, see ''runtime''. <br />
*''runtime'': The execution time of the request in milliseconds, added by the search service. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="count" type="long">123456</Val><br />
<Val key="indexSize" type="long">987654321</Val><br />
<Val key="runtime" type="long">42</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<!-- contains returned records --><br />
</Seq><br />
</source> <br />
<br />
*''facets'': The faceting results as requested by the ''facetby'' parameters. This Map contains a nested Seq for each requested facet and its values.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Map key="facets"><br />
<Seq key="year"><br />
<Map><br />
<Val key="value">2000</Val><br />
<Val key="count" type="long">42</Val><br />
</Map><br />
<Map><br />
<Val key="value">2001</Val><br />
<Val key="count" type="long">21</Val><br />
</Map><br />
...<br />
</Seq><br />
<Seq key="author"><br />
<Map><br />
<Val key="value">adams</Val><br />
<Val key="count" type="long">13</Val><br />
</Map><br />
<Map><br />
<Val key="value">shakespear</Val><br />
<Val key="count" type="long">17</Val><br />
</Map><br />
...<br />
</Seq><br />
</Map><br />
</source> <br />
<br />
*''_highlight'': The annotation of the result record, usually used to highlight relevant sections of the result documents, allowing the user to see at a glance whether a result matches what he or she was looking for. What exactly is returned here depends on the search engine used. For example, the Solr integration in SMILA returns the raw form of the text and information about the matching parts to be highlighted. Example:<br />
<br />
<source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
...<br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To be or not to be ...</Val><br />
<Seq key="positions"><br />
<Map><br />
<Val key="start" type="long">7</Val><br />
<Val key="end" type="long">9</Val><br />
<Val key="quality" type="long">100</Val><br />
</Map><br />
<Map><br />
<Val key="start" type="long">20</Val><br />
<Val key="end" type="long">22</Val><br />
<Val key="quality" type="long">95</Val><br />
</Map><br />
</Seq><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</source> Using the HighlightingPipelet this can be transformed into a highlighted text fragment (here using * as the highlight tag): <source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To *be* or not to *be* ...</Val><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</source><br />
<br />
=== Helper Classes ===<br />
<br />
There are some classes that help a client to create query records with their annotations and to read result records and their annotations. You can find them in package <tt>org.eclipse.smila.search.api.helper</tt>: <br />
<br />
*<tt>QueryBuilder</tt>: A helper class for building queries and sending them to the search service. Returns the result wrapped in the next class: <br />
*<tt>ResultAccessor</tt>: A wrapper for the complete search result. Provides methods to access the basic top-level result annotations and each search result record, which in turn is wrapped by a: <br />
*<tt>ResultRecordAccessor</tt>: Defines methods for accessing some of the result record annotations.<br />
<br />
See the source code or JavaDocs for more details on the provided methods. <br />
<br />
=== SMILA Search Servlet ===<br />
<br />
In addition to the "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos only, not for productive use. It is usually deployed in the Jetty instance that comes with SMILA at <tt>/SMILA/search</tt>. On first invocation, it currently creates a mostly empty query record (setting only some default parameters like ''maxcount'') and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page. <br />
<br />
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of the indices available in Solr so that the search page can display them for selection on the left side: <br />
<br />
<source lang="xml"><br />
<SearchResult xmlns="http://www.eclipse.org/smila/search"><br />
<Workflow>searchpipeline</Workflow><br />
<Record xmlns="http://www.eclipse.org/smila/record"><br />
<!-- effective query and embedded result records --><br />
</Record><br />
<!-- part added by SearchServlet --><br />
<IndexNames><br />
<IndexName>test_index</IndexName><br />
</IndexNames><br />
</SearchResult><br />
</source> <br />
<br />
You can use the same mechanism to add other information to the XML that is necessary for display purposes in the search form but not contained in the search service result: just implement your own servlet or extend the default servlet. Please refer to the source code for details. <br />
<br />
==== XSLT Stylesheets for SMILA search and result pages ====<br />
<br />
The stylesheets are loaded from the configuration directory <tt>org.eclipse.smila.search.servlet</tt> and are selected by adding the HTTP parameter "style" to the URL. The value of this parameter must be the filename of the desired stylesheet without the suffix; the file's extension must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value was set. <br />
<br />
In the default application, three stylesheets are available: <br />
<br />
*SMILASearchDefault: The default search page. Use this as a reference on how to describe simple queries and present result lists, including paging through bigger results. <br />
*SMILASearchAdvanced: Same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters. <br />
*SMILASearchTest: Primitive layout without paging but demonstrates the setting of even more query features.<br />
<br />
To start with a stylesheet other than the default, you can add a ''style'' parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>. <br />
<br />
In the following, we describe how to set query record features using the servlet. Please have a look at the sample stylesheets for complete examples of how to apply them, as we will not present a full tutorial here (-; <br />
<br />
==== Setting parameters ====<br />
<br />
To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the ''maxcount'' parameter to 7 using a hidden HTML input field, use: <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="maxcount" value="7" /><br />
</source> <br />
<br />
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules. <br />
<br />
==== Setting attributes ====<br />
<br />
You can add literal string values to attributes using "A.&lt;AttributeName&gt;" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use: <br />
<br />
<source lang="xml"><br />
<input type="text" name="A.Title" /><br />
</source> <br />
<br />
==== Setting other parameters ====<br />
<br />
To add a "sortby" parameter for an attribute, use "sortby.&lt;AttributeName&gt;=&lt;order&gt;", e.g. <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="sortby.FileSize" value="descending" /><br />
</source> <br />
<br />
To create a filter for an attribute, use HTTP params: <br />
<br />
*"F.val.&lt;AttributeName&gt;" to add filter values to an "oneOf" filter. <br />
*"F.min.&lt;AttributeName&gt;" and "F.max.&lt;AttributeName&gt;" to set the lower/upper bounds of an "atLeast"/"atMost" filter.<br />
<br />
If both "F.val" and "F.min"/"F.max" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. What happens in this case depends on the search engine integration used. E.g. <br />
<br />
*To set a filter for attribute ''MimeType'' restricting the result to HTML documents, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.val.MimeType" value="text/html" /><br />
</source> <br />
<br />
*To set a filter for attribute ''FileSize'' restricting the result to document sizes between 1000 and 10000 bytes, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.min.FileSize" value="1000" /><br />
<input type="hidden" name="F.max.FileSize" value="10000" /><br />
</source> <br />
<br />
To set a value in the ranking parameter for the complete record or an attribute, use "R[.&lt;AttributeName&gt;].&lt;ValueName&gt;". E.g., the following input field adds a parameter "Operator=OR" to attribute "Content": <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="R.Operator.Content" value="OR" /><br />
</source><br />
<br />
==== Adding attachments ====<br />
<br />
Attachments can be added to the query record by adding file upload fields to the search form, for example:<br />
<br />
<source lang="xml"><br />
<input type="file" name="Content"/> <br />
</source><br />
<br />
If the user selects a file for this field, it will be uploaded to SMILA and added as attachment "Content". Of course, there must be pipelets in your search pipeline that can process this attachment. Note that in a default SMILA configuration the attachments are kept in memory, so they should not be too large.<br />
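Note that for the file upload to actually reach the servlet, the surrounding HTML form has to use multipart encoding. A minimal sketch of such a form (the field names are illustrative, following the conventions described above) could look like this: <br />
<br />
<source lang="xml"><br />
<form action="/SMILA/search" method="post" enctype="multipart/form-data"><br />
  <input type="text" name="query" /><br />
  <input type="file" name="Content" /><br />
  <input type="submit" value="Search" /><br />
</form><br />
</source> <br />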
<br />
=== Record Search Servlet ===<br />
<br />
In addition, there is a very basic Record Search Servlet available at {{Path|/SMILA/recordsearch}}. <br />
<br />
You can send a POST or GET request to this URL with a SMILA search record in XML representation as the request body. The servlet then parses the given XML and calls the Search Service. The default is to use the SearchPipeline, but you can select any other pipeline by adding the {{code|_workflow}} annotation with the respective pipeline name to the search record.<br />
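A request body for this servlet might look like the following sketch (using the record format shown above; the pipeline name is illustrative, and representing the {{code|_workflow}} annotation as a plain Val is an assumption): <br />
<br />
<source lang="xml"><br />
<Record xmlns="http://www.eclipse.org/smila/record"><br />
  <Val key="query">meaning of life</Val><br />
  <Val key="maxcount" type="long">5</Val><br />
  <Val key="_workflow">MySearchPipeline</Val><br />
</Record><br />
</source> <br />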
<br />
The servlet returns the XML representation of the record returned by the Search Service as is, in which you can find the search results (see above).<br />
<br />
<br />
[[Category:SMILA]]</div>
<hr />
<div>This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests internally, and the sample servlet used to create a simple web-based GUI for search. <br />
<br />
=== Introduction ===<br />
<br />
Let's start right at the top: Provided that you installed SMILA and created an index by starting a crawler as described in [[SMILA/Documentation for 5 Minutes to Success|5 Minutes to Success]], you can use you web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] and search on the index: <br />
<br />
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]] <br />
<br />
What happens behind the scenes when you enter a query string and submit the form, is that, on the server side, a servlet creates a SMILA record from the HTTP parameters, uses the search service to process the search with this record, receives an enriched version of the query record and also a list of result records in XML form, and uses an XSLT stylesheet to create a result page in HTML format. <br />
<br />
The processing inside the search service is done by calling a simple (builtin) javascript file to do the work . <br />
<br />
By clicking the ''Advanced'' link at the top of the search page (or by entering the URL <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>), you can switch to a more detailed search form page, which allows you to construct more specific search queries: <br />
<br />
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]] <br />
<br />
If you want to use the default search servlet for your own search page, you are encouraged to use the two XSLT files creating these HTML pages as a reference or basis when building your pages. <br />
<br />
=== Search Processing ===<br />
<br />
As mentioned before, the internal processing is done via the search service which makes use of javascript for keeping an high level of flexbility. There is also the possibility to use BPEL workflows to steer the query throughout the system. If you want to dig deeper, a more thorough documentation of both concepts can be found at:<br />
* [[SMILA/Documentation/SMILA/Documentation/Scripting]] and<br />
* [[SMILA/Documentation/BPEL Workflow Processor]]<br />
<br />
Since both chapters rather address the indexing side of both concepts, the following two paragraphs provide an explanation of how they're to be used in query scenarios. <br />
<br />
==== Search Scripts =====<br />
<br />
In the scripting environment, this follows the same record-in, record-out semantics you may already know. Difference is, that the input record contains the query and maybe some additional parameters while the output record holds the search result. The best explanation is given by looking at the actual JS which has been used in the demo search described above. You'll find the file '''search.js''' in '''configuration/org.eclipse.smila.scripting'''. <br />
<br />
==== Search Pipelines ====<br />
<br />
Search workflows (or pipelines) look just like indexing pipelines, they are only used a bit differently: Instead of pushing lists of records corresponding to data source objects through them, they are invoked with a single record representing the search request. This record contains the values of the parameters which were defined by the Search API (see below). The request object can be analyzed and enriched with additional information during the workflow before the actual search on the index takes place. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do further processing based on the request data and the result record list (e.g. highlighting). Finally, the request record including the search results is returned to the client and can be presented. <br />
<br />
Pipelet invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample). <br />
<br />
=== Search Service API ===<br />
<br />
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a SMILA query record and the name of a search script or workflow as input, execute the script/workflow with the record, and return the result in different formats: <br />
<br />
*<tt>Record search(String workflowName, Record query) throws ProcessingException</tt>: This is the basic method of the search service, returning the result records as native SMILA data structures. The other methods call this method for the actual search execution, too, and then just convert the result. <br />
*<tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt>: Returns the search result as an XML DOM document. See below for the schema of the result. <br />
*<tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt>: Returns the search result as an XML string. See below for the schema of the result.<br />
<br />
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition): <br />
<br />
<source lang="xml"><br />
<element name="SearchResult"><br />
<complexType><br />
<sequence minOccurs="1" maxOccurs="1"><br />
<choice><br />
<element name="Workflow" type="string" minOccurs="1" maxOccurs="1" /><br />
<element name="Script" type="string" minOccurs="1" maxOccurs="1" /><br />
</choice><br />
<element ref="rec:Record" minOccurs="0" maxOccurs="1" /><br />
</sequence><br />
</complexType><br />
</element><br />
</source> <br />
<br />
You can view the result XML when using the sample SMILA search page at <tt>http://localhost:8080/SMILA/search</tt> if you enable the ''Show XML result'' option before submitting the query. <br />
<br />
The content of the query record basically depends a lot on the used search services. However, the Search API also includes a recommendation where to put some basic commonly used search parameters which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations. <br />
<br />
=== Query Parameters ===<br />
<br />
The query record mainly consists of parameters. The Search API defines the names of these parameters, the allowed values as well as the default values for a set of commonly used parameters. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose, but it may be possible that certain parameters are not supported because it is not feasible with the underlying technology. For some parameters we also defined default values. All parameters are single-valued unless otherwise specified. <br />
<br />
*''query'': Either a search string using a query syntax or a query record describing the query by setting values for attributes (a.k.a. fielded search). The implementer for a specific underlying technology may define a query syntax to allow complex search criteria in a single string. However, SMILA currently does not define a query syntax of its own and passes the string as-is to its default search engine [[SMILA/Documentation/Solr|Solr]] (see there for handling and interpretation).<br />
**Example using a query string:<br />
<br />
<source lang="xml"><br />
<Record><br />
<Val key="query">meaning of life</Val><br />
</Record><br />
</source> <br />
<br />
*Example using a query object (fielded search):<br />
<br />
<source lang="xml"><br />
<Record><br />
<Map key="query"><br />
<Val key="author">shakespeare</Val><br />
<Val key="title">hamlet</Val><br />
</Map><br />
</Record><br />
</source> <br />
<br />
*''maxcount'': The maximum number of records which should be returned to the search client. Default value is 10. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
</source> <br />
<br />
*''offset'': The number of hits which, starting from the top, should be skipped in the search result. Default value is 0. Use this parameter to implement result list paging and to provide the user with a means to navigate through the result pages: If maxcount=10, the "next page" queries can be identical to the initial query, but with offset=10, 20, ... Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
<Val key="offset" type="long">3</Val><br />
</source> <br />
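The paging arithmetic is simple enough to sketch in plain Java (the helper name is hypothetical, not part of SMILA):

<source lang="java">
public class Paging {

  /** Offset for a 1-based page number and a page size ("maxcount"). */
  static long offsetForPage(int page, int maxcount) {
    return (long) (page - 1) * maxcount;
  }

  public static void main(String[] args) {
    System.out.println(offsetForPage(1, 10)); // first page: offset 0
    System.out.println(offsetForPage(3, 10)); // third page: offset 20
  }
}
</source>

For a page size of 10, pages 1, 2, 3, ... thus map to offsets 0, 10, 20, ...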
<br />
*''threshold'': The minimal value of the relevance score that a result must have to be returned to the search client. Default is 0.0.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="threshold" type="double">0.5</Val><br />
</source> <br />
<br />
*''language'': The natural language of the query. No default value. This parameter may be required for language-specific pipelets/services that need to know in which language the user is expressing his or her query to be able to deliver feasible results. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">sinn des lebens</Val><br />
<Val key="language">de</Val><br />
</source> <br />
<br />
*''indexname'': Some index services (like Solr) can manage multiple indices at once. When doing so, they can use this parameter to select the index which is to be searched with the current request. However, when using such a scenario, it is recommended to configure a default index name, so that search requests will succeed without having this parameter set explicitly. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="indexname">wikipedia</Val><br />
</source> <br />
<br />
*''resultAttributes'': A multi-valued parameter containing the names of the attributes which the search engine should add to the result records. Since including too many attributes will decrease performance, the list should contain only those attributes that are needed by some pipelets for further processing after the search has taken place or for displaying the results in the end. Omitting the parameter results in getting all available attributes. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="resultAttributes"><br />
<Val>author</Val><br />
<Val>title</Val><br />
</Seq><br />
</source> <br />
<br />
*''highlight'': A sequence of string values specifying the attribute names for which highlighting should be returned. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="highlight"><br />
<Val>content</Val><br />
</Seq><br />
</source> <br />
<br />
*''sortby'': A sequence of maps each containing the ''key'' "attribute" (any string) and the ''key'' "order" ("ascending" | "descending"), specifying that the search result should be sorted by the named attributes in the given order. Omitting this parameter results in the search result being sorted by descending relevance (score, similarity, ranking, ...). Multiple maps can be added and should be evaluated in the order of their appearance. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="sortby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="order">descending</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Val key="order">ascending</Val><br />
</Map><br />
</Seq><br />
</source> <br />
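For illustration, the sort order requested above (year descending, then author ascending) corresponds to the following client-side comparator sketch in plain Java; the actual sorting is of course done by the search engine:

<source lang="java">
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SortBy {

  // Sorts by year descending, then by author ascending, mirroring the
  // order of the two "sortby" maps in the example above.
  static List<Map<String, Object>> sort(List<Map<String, Object>> records) {
    List<Map<String, Object>> sorted = new ArrayList<>(records);
    sorted.sort(Comparator
        .comparing((Map<String, Object> r) -> (Long) r.get("year"), Comparator.reverseOrder())
        .thenComparing(r -> (String) r.get("author")));
    return sorted;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> records = List.of(
        Map.of("author", "b", "year", 2000L),
        Map.of("author", "a", "year", 2001L),
        Map.of("author", "a", "year", 2000L));
    for (Map<String, Object> r : sort(records)) {
      System.out.println(r.get("year") + " " + r.get("author"));
    }
  }
}
</source>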
<br />
*''facetby'': A sequence of maps each containing the ''key'' "attribute" (any string) and the ''key'' "maxcount" (long). This causes facets to be returned with the search results for the specified attributes, with up to "maxcount" values per attribute. Optionally, each facetby map may contain a map with key "sortby" with keys "order" ("ascending" | "descending") and "criterion" (any string, e.g. "count" or "value") specifying in which order to return the values (e.g. "count" for sorting by the number of hits per facet value, or "value" for sorting by the attribute value itself). Example:<br />
<br />
{{Note|since 1.0|prior to 1.0 this parameter was named ''groupby'' and has merely been renamed (see [http://dev.eclipse.org/mhonarc/lists/smila-dev/msg00998.html mail thread])}}<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="facetby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="maxcount" type="long">10</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Map key="sortby"><br />
<Val key="criterion">value</Val><br />
<Val key="order">ascending</Val> <br />
</Map><br />
<Val key="maxcount" type="long">5</Val><br />
</Map><br />
</Seq><br />
</source> <br />
<br />
*''filter'': A sequence of maps describing, for certain attributes, which values they must have in valid result records. Each of the maps contains a ''key'' "attribute" and one or more value descriptions: <br />
**"oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values. <br />
**"atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="filter"><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Seq key="oneOf"><br />
<Val>pratchett</Val><br />
<Val>adams</Val><br />
</Seq><br />
</Map><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="atLeast">1990</Val><br />
<Val key="lessThan">2000</Val><br />
</Map><br />
</Seq><br />
</source> <br />
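The example filter above accepts exactly those records whose author is one of "pratchett" or "adams" and whose year lies in [1990, 2000). A plain-Java sketch of this predicate (not SMILA code, just an illustration of the semantics):

<source lang="java">
import java.util.Map;
import java.util.Set;

public class FilterCheck {

  // Mirrors the filter example above: author must be "oneOf"
  // {pratchett, adams}; year must be "atLeast" 1990 (inclusive)
  // and "lessThan" 2000 (exclusive).
  static boolean matches(Map<String, Object> record) {
    Set<String> oneOf = Set.of("pratchett", "adams");
    long year = (Long) record.get("year");
    return oneOf.contains(record.get("author")) && year >= 1990 && year < 2000;
  }

  public static void main(String[] args) {
    System.out.println(matches(Map.of("author", "adams", "year", 1979L)));     // false
    System.out.println(matches(Map.of("author", "pratchett", "year", 1991L))); // true
  }
}
</source>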
<br />
*''ranking'': A configuration defining how to rank the search results. This depends highly on the search engine used, so it is not specified further in SMILA.<br />
<br />
=== Result Annotations ===<br />
<br />
The search result is usually the request record, enriched with result data. <br />
<br />
*''records'': A sequence of maps describing the actual search result, i.e. the records retrieved from the index. Each record should have an additional attribute "_weight" describing the relevance score of this record with respect to the query. The size of the "records" sequence is limited by the "maxcount" parameter.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
...<br />
</Map><br />
<Map><br />
<Val key="_weight" type="double">0.90</Val><br />
<Val key="_recordid">file:hitchhiker</Val><br />
<Val key="title">Hitchhiker's Guide to the Galaxy</Val><br />
<Val key="author">Adams</Val><br />
...<br />
</Map><br />
</Seq><br />
</source> {{Note|return binary content|<br />
There is no convenient way to return binary content anymore, as attachments may only be top-level children of a record. Two solutions are possible:<br />
# Add an attachment to the search record with a name following this pattern: <resultItem-record.Id>.<resultItem.attachmentName><br />
# Convert the byte[] into a string (e.g. using Base64 encoding, so it is serializable) and return it in the AnyMap.<br />
}} <br />
<br />
*''count'': The total number of records in the index that have any relevance to the query. For an example, see ''runtime''. <br />
*''indexSize'' (optional): The total number of records in the searched index. For an example, see ''runtime''. <br />
*''runtime'': The execution time of the request in milliseconds, added by the search service. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="count" type="long">123456</Val><br />
<Val key="indexSize" type="long">987654321</Val><br />
<Val key="runtime" type="long">42</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<!-- contains returned records --><br />
</Seq><br />
</source> <br />
<br />
*''facets'': The faceting results as requested by the ''facetby'' parameters. This Map contains a nested Seq for each requested facet and its values.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Map key="facets"><br />
<Seq key="year"><br />
<Map><br />
<Val key="value">2000</Val><br />
<Val key="count" type="long">42</Val><br />
</Map><br />
<Map><br />
<Val key="value">2001</Val><br />
<Val key="count" type="long">21</Val><br />
</Map><br />
...<br />
</Seq><br />
<Seq key="author"><br />
<Map><br />
<Val key="value">adams</Val><br />
<Val key="count" type="long">13</Val><br />
</Map><br />
<Map><br />
<Val key="value">shakespeare</Val><br />
<Val key="count" type="long">17</Val><br />
</Map><br />
...<br />
</Seq><br />
</Map><br />
</source> <br />
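Conceptually, each facet value entry is just the number of matching records per distinct attribute value. A plain-Java sketch (not SMILA code) of this counting:

<source lang="java">
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Facets {

  // Counts matching records per distinct value of the given attribute,
  // which is what a facet entry in the result expresses.
  static Map<String, Long> countBy(List<Map<String, Object>> records, String attribute) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Map<String, Object> record : records) {
      counts.merge((String) record.get(attribute), 1L, Long::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Map<String, Object>> records = List.of(
        Map.of("author", "adams"),
        Map.of("author", "shakespeare"),
        Map.of("author", "adams"));
    System.out.println(countBy(records, "author")); // {adams=2, shakespeare=1}
  }
}
</source>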
<br />
*''_highlight'': An annotation of the result record, usually used to highlight relevant sections of the result documents so that users can see at a glance whether a hit suits what they were looking for. What exactly is returned here depends on the search engine used. For example, the Solr integration in SMILA returns the raw form of the text and information about the matching parts to be highlighted. Example:<br />
<br />
<source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
...<br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To be or not to be ...</Val><br />
<Seq key="positions"><br />
<Map><br />
<Val key="start" type="long">7</Val><br />
<Val key="end" type="long">9</Val><br />
<Val key="quality" type="long">100</Val><br />
</Map><br />
<Map><br />
<Val key="start" type="long">20</Val><br />
<Val key="end" type="long">22</Val><br />
<Val key="quality" type="long">95</Val><br />
</Map><br />
</Seq><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</Map><br />
</Map><br />
</Seq><br />
</source> Using the HighlightingPipelet this can be transformed into a highlighted text fragment (here using * as the highlight tag): <source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To *be* or not to *be* ...</Val><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</Map><br />
</Map><br />
</Seq><br />
</source><br />
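A sketch of such a position-based transformation in plain Java (not the actual HighlightingPipelet; it assumes that "start" is inclusive and "end" is exclusive, which matches the sample offsets above, and that the positions are sorted ascending and non-overlapping):

<source lang="java">
import java.util.List;

public class Highlight {

  // Wraps each [start, end) span of the text in '*' markers;
  // inserting back-to-front keeps the earlier offsets valid.
  static String highlight(String text, List<int[]> positions) {
    StringBuilder sb = new StringBuilder(text);
    for (int i = positions.size() - 1; i >= 0; i--) {
      sb.insert(positions.get(i)[1], '*');
      sb.insert(positions.get(i)[0], '*');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String text = "... To be or not to be ...";
    System.out.println(highlight(text, List.of(new int[]{7, 9}, new int[]{20, 22})));
    // prints: ... To *be* or not to *be* ...
  }
}
</source>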
<br />
=== Helper Classes ===<br />
<br />
There are some classes that help a client create query records with their annotations and read result records and their annotations. You can find them in the package <tt>org.eclipse.smila.search.api.helper</tt>: <br />
<br />
*<tt>QueryBuilder</tt>: A helper class for building queries and sending them to the search service. Returns the result wrapped in a: <br />
*<tt>ResultAccessor</tt>: A wrapper for the complete search result. Provides methods to access the basic top-level result annotations and to access each search result record, wrapped by a: <br />
*<tt>ResultRecordAccessor</tt>: Defines methods for accessing some of the result record annotations.<br />
<br />
See the source code or JavaDocs for more details on the provided methods. <br />
<br />
=== SMILA Search Servlet ===<br />
<br />
In addition to the "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos only, not for productive use. It is usually deployed in the Jetty instance that comes with SMILA at <tt>/SMILA/search</tt>. On first invocation, it currently creates a nearly empty query record (it only sets some default parameters like ''maxcount'') and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list rather than an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page. <br />
<br />
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of the indices available in Solr so that the search page can display them for selection on the left side: <br />
<br />
<source lang="xml"><br />
<SearchResult xmlns="http://www.eclipse.org/smila/search"><br />
<Workflow>searchpipeline</Workflow><br />
<Record xmlns="http://www.eclipse.org/smila/record"><br />
<!-- effective query and embedded result records --><br />
</Record><br />
<!-- part added by SearchServlet --><br />
<IndexNames><br />
<IndexName>test_index</IndexName><br />
</IndexNames><br />
</SearchResult><br />
</source> <br />
<br />
You can use the same mechanism to add other information to the XML that is needed for display purposes in the search form but is not contained in the search service result: just implement your own servlet or extend the default one. Please refer to the source code for details. <br />
<br />
==== XSLT Stylesheets for SMILA search and result pages ====<br />
<br />
The stylesheets are loaded from the configuration directory <tt>org.eclipse.smila.search.servlet</tt> and are selected by adding the HTTP parameter "style" to the URL. The value of this parameter must be the filename of the desired stylesheet without the suffix. The file's extension must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set. <br />
<br />
In the default application, three stylesheets are available: <br />
<br />
*SMILASearchDefault: The default search page. Use this as a reference on how to describe simple queries and present result lists, including paging through bigger results. <br />
*SMILASearchAdvanced: Same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters. <br />
*SMILASearchTest: Primitive layout without paging but demonstrates the setting of even more query features.<br />
<br />
To start with another than the default stylesheet, you can add a ''style'' parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>. <br />
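Since all query features are passed as plain HTTP parameters, such URLs can also be assembled programmatically. A plain-Java sketch (the helper is hypothetical, not part of SMILA):

<source lang="java">
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class SearchUrl {

  // Builds a GET URL for the sample search servlet from a parameter map,
  // URL-encoding each key and value.
  static String buildUrl(String base, Map<String, String> params) {
    StringBuilder url = new StringBuilder(base);
    char separator = '?';
    for (Map.Entry<String, String> param : params.entrySet()) {
      url.append(separator)
         .append(URLEncoder.encode(param.getKey(), StandardCharsets.UTF_8))
         .append('=')
         .append(URLEncoder.encode(param.getValue(), StandardCharsets.UTF_8));
      separator = '&';
    }
    return url.toString();
  }

  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("style", "SMILASearchAdvanced");
    params.put("A.Title", "hamlet");
    System.out.println(buildUrl("http://localhost:8080/SMILA/search", params));
  }
}
</source>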
<br />
In the following we will describe how to set query record features using the servlet. Please have a look at those sample stylesheets for complete examples on how to apply them, as we will not present something like a full tutorial here (-; <br />
<br />
==== Setting parameters ====<br />
<br />
To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the ''resultSize'' parameter to 7 using a hidden HTML input field, use: <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="resultSize" value="7" /><br />
</source> <br />
<br />
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules. <br />
<br />
==== Setting attributes ====<br />
<br />
You can add literal string values to attributes using "A.&lt;AttributeName&gt;" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use: <br />
<br />
<source lang="xml"><br />
<input type="text" name="A.Title" /><br />
</source> <br />
<br />
==== Setting other parameters ====<br />
<br />
To add a "sortby" parameter for an attribute, use "sortby.&lt;AttributeName&gt;=&lt;order&gt;", e.g. <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="sortby.FileSize" value="descending" /><br />
</source> <br />
<br />
To create a filter for an attribute, use HTTP params: <br />
<br />
*"F.val.&lt;AttributeName&gt;" to add filter values to an "oneOf" filter. <br />
*"F.min.&lt;AttributeName&gt;" and "F.max.&lt;AttributeName&gt;" to set the lower/upper bounds of an "atLeast"/"atMost" filter.<br />
<br />
If both "F.val" and "F.min/F.max" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. What happens in this case depends on the search engine integration used. E.g. <br />
<br />
*To set a filter for attribute ''MimeType'' restricting the result to HTML documents, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.val.MimeType" value="text/html" /><br />
</source> <br />
<br />
*To set a filter for attribute ''FileSize'' restricting the result to document sizes between 1000 and 10000 bytes, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.min.FileSize" value="1000" /><br />
<input type="hidden" name="F.max.FileSize" value="10000" /><br />
</source> <br />
<br />
To set a value in the ranking parameter for the complete record or an attribute, use "R[.&lt;AttributeName&gt;].&lt;ValueName&gt;". E.g., the following input field adds a parameter "Operator=OR" to attribute "Content": <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="R.Content.Operator" value="OR" /><br />
</source><br />
<br />
==== Adding attachments ====<br />
<br />
Attachments can be added to the query record by adding file upload fields to the search form, for example:<br />
<br />
<source lang="xml"><br />
<input type="file" name="Content"/> <br />
</source><br />
<br />
If the user selects a file for this field, it will be uploaded to SMILA and added as the attachment "Content". Of course, there must be pipelets in your search pipeline that can process this attachment. Note that attachments are kept in memory in the default SMILA configuration, so they should not be too large.<br />
<br />
=== Record Search Servlet ===<br />
<br />
In addition, there is a very basic Record Search Servlet available at {{Path|/SMILA/recordsearch}}. <br />
<br />
You can send a POST or GET request to this URL with a SMILA search record in XML representation as the request body. The servlet then parses the given XML and calls the Search Service. By default, the "SearchPipeline" is used, but you can select any other pipeline by adding the {{code|_workflow}} annotation with the respective pipeline name to the search record.<br />
<br />
The servlet returns the XML representation of the record returned by the Search Service as is, in which you can find the search results (see above).<br />
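For illustration, a minimal request body could look like this (the pipeline name is hypothetical; we assume the ''_workflow'' annotation is set as a top-level value of the record):

<source lang="xml">
<Record xmlns="http://www.eclipse.org/smila/record">
<Val key="query">meaning of life</Val>
<Val key="_workflow">MySearchPipeline</Val>
</Record>
</source>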
<br />
<br />
[[Category:SMILA]]</div>
<hr />
<div>This page describes the search service and related parts of SMILA. This includes the query and result helpers, the processing of search requests internally, and the sample servlet used to create a simple web-based GUI for search. <br />
<br />
=== Introduction ===<br />
<br />
Let's start right at the top: Provided that you installed SMILA and created an index by starting a crawler as described in [[SMILA/Documentation for 5 Minutes to Success|5 Minutes to Success]], you can use you web browser to go to [http://localhost:8080/SMILA/search http://localhost:8080/SMILA/search] and search on the index: <br />
<br />
[[Image:SMILA-search-page-default.png|500px|SMILA's sample search page]] <br />
<br />
What happens behind the scenes when you enter a query string and submit the form, is that, on the server side, a servlet creates a SMILA record from the HTTP parameters, uses the search service to process the search with this record, receives an enriched version of the query record and also a list of result records in XML form, and uses an XSLT stylesheet to create a result page in HTML format. <br />
<br />
The processing inside the search service is done by calling a simple (builtin) javascript file to do the work . <br />
<br />
By clicking the ''Advanced'' link at the top of the search page (or by entering the URL <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>), you can switch to a more detailed search form page, which allows you to construct more specific search queries: <br />
<br />
[[Image:SMILA-search-page-advanced.png|500px|SMILA's advanced sample search page]] <br />
<br />
If you want to use the default search servlet for your own search page, you are encouraged to use the two XSLT files creating these HTML pages as a reference or basis when building your pages. <br />
<br />
=== Search Processing ===<br />
<br />
As mentioned before, the internal processing is done via the search service which makes use of javascript for keeping an high level of flexbility. There is also the possibility to use BPEL workflows to steer the query throughout the system. If you want to dig deeper, a more thorough documentation of both concepts can be found at:<br />
* [[SMILA/Documentation/SMILA/Documentation/Scripting]] and<br />
* [[SMILA/Documentation/BPEL Workflow Processor]]<br />
<br />
Since both chapters rather address the indexing side of both concepts, the following two paragraphs provide an explanation of how they're to be used in query scenarios. <br />
<br />
==== Search Scripts =====<br />
<br />
In the scripting environment, this follows the same record-in, record-out semantics you may already know. Difference is, that the input record contains the query and maybe some additional parameters while the output record holds the search result. The best explanation is given by looking at the actual JS which has been used in the demo search described above. You'll find the file '''search.js''' in '''configuration/org.eclipse.smila.scripting'''. <br />
<br />
==== Search Pipelines ====<br />
<br />
Search workflows (or pipelines) look just like indexing pipelines, they are only used a bit differently: Instead of pushing lists of records corresponding to data source objects through them, they are invoked with a single record representing the search request. This record contains the values of the parameters which were defined by the Search API (see below). The request object can be analyzed and enriched with additional information during the workflow before the actual search on the index takes place. The results of this search are not added to the blackboard as records of their own, but are added to the request record under the key "records". Further pipelets may then do further processing based on the request data and the result record list (e.g. highlighting). Finally, the request record including the search results is returned to the client and can be presented. <br />
<br />
Pipelet invocations look the same as in indexing pipelines. See <tt>SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/searchpipeline.bpel</tt> for a complete example search pipeline (the one used in the above sample). <br />
<br />
=== Search Service API ===<br />
<br />
The actual Search API is quite simple: SMILA registers an OSGi service with the interface <tt>org.eclipse.smila.search.api.SearchService</tt>. It provides a few methods that take a SMILA query record and the name of a search workflow as input, execute the workflow on the record, and return the result in different formats: <br />
<br />
*<tt>Record search(String workflowName, Record query) throws ProcessingException</tt>: This is the basic method of the search service, returning the result records as native SMILA data structures. The other methods call this method for the actual search execution, too, and then just convert the result. <br />
*<tt>org.w3c.dom.Document searchAsXml(String workflowName, Record query) throws ProcessingException</tt>: Returns the search result as an XML DOM document. See below for the schema of the result. <br />
*<tt>String searchAsXmlString(String workflowName, Record query) throws ProcessingException</tt>: Returns the search result as an XML string. See below for the schema of the result.<br />
<br />
The schema of XML search results is basically as follows (target namespace is <tt>http://www.eclipse.org/smila/search</tt>, see <tt>org.eclipse.smila.search.api/xml/search.xsd</tt> for the full definition): <br />
<br />
<source lang="xml"><br />
<element name="SearchResult"><br />
<complexType><br />
<sequence minOccurs="1" maxOccurs="1"><br />
<element name="Workflow" type="string" minOccurs="1" maxOccurs="1" /><br />
<element ref="rec:Record" minOccurs="0" maxOccurs="1" /><br />
</sequence><br />
</complexType><br />
</element><br />
</source> <br />
<br />
You can view the result XML when using the sample SMILA search page at <tt>http://localhost:8080/SMILA/search</tt> if you enable the ''Show XML result'' option before submitting the query. <br />
<br />
The content of the query record basically depends a lot on the used search services. However, the Search API also includes a recommendation where to put some basic commonly used search parameters which all index integrations should honor (of course they may specify extensions that are not covered by the generic Search API). The following sections describe these recommendations. <br />
<br />
=== Query Parameters ===<br />
<br />
The query record mainly consists of parameters. The Search API defines the names of these parameters, the allowed values as well as the default values for a set of commonly used parameters. All implementations should use these properties if possible, i.e. they should not introduce additional parameters for the same purpose, but it may be possible that certain parameters are not supported because it is not feasible with the underlying technology. For some parameters we also defined default values. All parameters are single-valued unless otherwise specified. <br />
<br />
*''query'': Either a search string using a query syntax or a query record describing the query by setting values for attributes (aka fielded search). The implementer for a specific underlying technology may define a query syntax to be able to build complex search criteria in a single string. However, SMILA currently does not define an own query syntax and passes the string as is to its default search engine [[SMILA/Documentation/Solr|Solr]] (see there for handling and interpretation).<br />
**Example using a query string:<br />
<br />
<source lang="xml"><br />
<Record><br />
<Val key="query">meaning of life</Val><br />
</Record><br />
</source> <br />
<br />
*Example using a query object (fielded search):<br />
<br />
<source lang="xml"><br />
<Record><br />
<Map key="query"><br />
<Val key="author">shakespeare</Val><br />
<Val key="title">hamlet</Val><br />
</Map><br />
</Record><br />
</source> <br />
<br />
*''maxcount'': The maximum number of records which should be returned to the search client. Default value is 10. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
</source> <br />
<br />
*''offset'': The number of hits which, starting from the top, should be skipped in the search result. Default value is 0. Use this parameter to implement result list paging and to provide the user a means to navigate through the result pages: If resultSize=10, the "next page" queries can be identical to the initial query, but with resultOffset=10, 20, ... Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="maxcount" type="long">3</Val><br />
<Val key="offset" type="long">3</Val><br />
</source> <br />
<br />
*''threshold'': The minimal value of the relevance score that a result must have to be returned to the search client. Default is 0.0.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="threshold" type="double">0.5</Val><br />
</source> <br />
<br />
*''language'': The natural language of the query. No default value. This parameter may be required for language-specific pipelets/services that need to know in which language the user is expressing his or her query to be able to deliver feasible results. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">sinn des lebens</Val><br />
<Val key="language">de</Val><br />
</source> <br />
<br />
*''indexname'': Some index services (like Solr) can manage multiple indices at once. When doing so, they can use this parameter to select the index which is to be searched with the current request. However, when using such a scenario, it is recommended to configure a default index name, so that search requests will succeed without having this parameter set explicitly. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="indexname">wikipedia</Val><br />
</source> <br />
<br />
*''resultAttributes'': A multi-valued parameter containing the names of the attributes which the search engine should add to the result records. Since including too many attributes will decrease performance, the list should contain only those attributes that are needed by some pipelets for further processing after the search has taken place or for displaying the results in the end. Omitting the parameter results in getting all available attributes. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="resultAttributes"><br />
<Val>author</Val><br />
<Val>title</Val><br />
</Seq><br />
</source> <br />
<br />
*''highlight'': A sequence of string values specifying the attribute names for which highlighting should be returned. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="highlight"><br />
<Val>content</Val><br />
</Seq><br />
</source> <br />
<br />
*''sortby'': A sequence of maps each containing the ''key'' "attribute" (any string) and the ''key'' "order" ("ascending" | "descending") specifying that the search result should be sorted by the named attributes in the given order. Omitting this parameter results in a search result sorting by descending relevance (score, similarity, ranking, ....). Multiple maps can be added and should be evaluated in the order of their appearance. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="sortby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="order">descending</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Val key="order">ascending</Val><br />
</Map><br />
</Seq><br />
</source> <br />
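The multi-attribute sort semantics described above (earlier entries take precedence, ties fall through to later entries) can be sketched in plain JavaScript. This is an illustration only, not SMILA code:<br />

```javascript
// Illustrative sketch of the "sortby" semantics: build a comparator that
// applies each {attribute, order} entry in turn; earlier entries take
// precedence, later entries only break ties.
function compareBySortby(sortby) {
  return function (a, b) {
    for (var i = 0; i < sortby.length; i++) {
      var attr = sortby[i].attribute;
      var sign = sortby[i].order === "descending" ? -1 : 1;
      if (a[attr] < b[attr]) return -sign;
      if (a[attr] > b[attr]) return sign;
    }
    return 0; // equal on all sort attributes
  };
}

var hits = [
  { year: 1990, author: "b" },
  { year: 2000, author: "c" },
  { year: 2000, author: "a" }
];
hits.sort(compareBySortby([
  { attribute: "year", order: "descending" },
  { attribute: "author", order: "ascending" }
]));
// hits is now ordered: (2000, "a"), (2000, "c"), (1990, "b")
```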
<br />
*''facetby'': A sequence of maps, each containing the ''key'' "attribute" (any string) and the ''key'' "maxcount" (long). This causes facets to be returned with the search results for the specified attributes, with up to "maxcount" values per attribute. Optionally, each facetby map may contain a map with key "sortby" with keys "order" ("ascending" | "descending") and "criterion" (any string, e.g. "count" or "value") specifying in which order to return the values (e.g. "count" to sort by the number of hits per facet value, or "value" to sort by the attribute value name). Example:<br />
<br />
{{Note|since 1.0|prior to 1.0 this parameter was named ''groupby'' and has merely been renamed (see [http://dev.eclipse.org/mhonarc/lists/smila-dev/msg00998.html mail thread])}}<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="facetby"><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="maxcount" type="long">10</Val><br />
</Map><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Map key="sortby"><br />
<Val key="criterion">value</Val><br />
<Val key="order">ascending</Val> <br />
</Map><br />
<Val key="maxcount" type="long">5</Val><br />
</Map><br />
</Seq><br />
</source> <br />
<br />
*''filter'': A sequence of maps describing for certain attributes which values they must have in valid result records. Each of the maps contains a ''key'' "attribute" and one or more value descriptions: <br />
**"oneOf", "allOf", "noneOf": sequences of values describing required or forbidden attribute values. <br />
**"atLeast", "atMost", "greaterThan", "lessThan": single values describing lower and upper bounds (including or excluding the bound values) for the attribute value. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Seq key="filter"><br />
<Map><br />
<Val key="attribute">author</Val><br />
<Seq key="oneOf"><br />
<Val>pratchett</Val><br />
<Val>adams</Val><br />
</Seq><br />
</Map><br />
<Map><br />
<Val key="attribute">year</Val><br />
<Val key="atLeast">1990</Val><br />
<Val key="lessThan">2000</Val><br />
</Map><br />
</Seq><br />
</source> <br />
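To make the filter semantics concrete, here is a plain-JavaScript sketch of the matching rules. It is an illustration only, not SMILA code; "allOf"/"noneOf" and multi-valued attributes are omitted for brevity:<br />

```javascript
// Illustrative sketch: a record passes when, for every filter map, its
// attribute value is contained in "oneOf" (if given) and lies within the
// given "atLeast"/"atMost"/"greaterThan"/"lessThan" bounds.
function matchesFilter(record, filters) {
  return filters.every(function (f) {
    var v = record[f.attribute];
    if (f.oneOf !== undefined && f.oneOf.indexOf(v) < 0) return false;
    if (f.atLeast !== undefined && v < f.atLeast) return false;          // inclusive lower bound
    if (f.atMost !== undefined && v > f.atMost) return false;            // inclusive upper bound
    if (f.greaterThan !== undefined && v <= f.greaterThan) return false; // exclusive lower bound
    if (f.lessThan !== undefined && v >= f.lessThan) return false;       // exclusive upper bound
    return true;
  });
}

var filters = [
  { attribute: "author", oneOf: ["pratchett", "adams"] },
  { attribute: "year", atLeast: 1990, lessThan: 2000 }
];
// matchesFilter({ author: "adams", year: 1995 }, filters) → true
// matchesFilter({ author: "adams", year: 2000 }, filters) → false ("lessThan" is exclusive)
```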
<br />
*''ranking'': A configuration defining how to rank the search results. This is highly dependent on the search engine used, so SMILA does not specify it further.<br />
<br />
=== Result Annotations ===<br />
<br />
The search result is usually the request record, enriched with result data. <br />
<br />
*''records'': A sequence of maps describing the actual search result, i.e. the records retrieved from the index. Each record should have an additional attribute "_weight" describing the relevance score of this record with respect to the query. The size of the "records" sequence is limited by the "maxcount" parameter.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
...<br />
</Map><br />
<Map><br />
<Val key="_weight" type="double">0.90</Val><br />
<Val key="_recordid">file:hitchhiker</Val><br />
<Val key="title">Hitchhiker's Guide to the Galaxy</Val><br />
<Val key="author">Adams</Val><br />
...<br />
</Map><br />
</Seq><br />
</source> {{Note|return binary content|<br />
There is no nice way to return binary content anymore, as attachments may only be top-level children of a record. These two solutions are possible:<br />
# add an attachment to the search record with a name following this pattern: <resultItem-record.Id>.<resultItem.attachmentName><br />
# convert the byte[] into a string (e.g. base64 encoding, so it is serializable) and return it in the AnyMap<br />
}} <br />
<br />
*''count'': The total number of records in the index that have any relevance to the query. Example see ''runtime''. <br />
*''indexSize'' (optional): The total number of records in the searched index. Example see ''runtime''. <br />
*''runtime'': The execution time of the request in milliseconds, added by the search service. Example:<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Val key="count" type="long">123456</Val><br />
<Val key="indexSize" type="long">987654321</Val><br />
<Val key="runtime" type="long">42</Val><br />
<!-- other query parameters --><br />
<Seq key="records"><br />
<!-- contains returned records --><br />
</Seq><br />
</source> <br />
<br />
*''facets'': The faceting results as requested by the ''facetby'' parameters. This Map contains a nested Seq for each requested facet and its values.<br />
<br />
<source lang="xml"><br />
<Val key="query">meaning of life</Val><br />
<Map key="facets"><br />
<Seq key="year"><br />
<Map><br />
<Val key="value">2000</Val><br />
<Val key="count" type="long">42</Val><br />
</Map><br />
<Map><br />
<Val key="value">2001</Val><br />
<Val key="count" type="long">21</Val><br />
</Map><br />
...<br />
</Seq><br />
<Seq key="author"><br />
<Map><br />
<Val key="value">adams</Val><br />
<Val key="count" type="long">13</Val><br />
</Map><br />
<Map><br />
<Val key="value">shakespeare</Val><br />
<Val key="count" type="long">17</Val><br />
</Map><br />
...<br />
</Seq><br />
</Map><br />
</source> <br />
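The facet computation described by ''facetby'' and ''facets'' can be sketched as follows. This is an illustration only, not the actual search engine code; sorting here is by descending count:<br />

```javascript
// Illustrative sketch: count how often each value of "attribute" occurs in
// the result records and return the "maxcount" most frequent values in the
// {value, count} shape shown in the XML example above.
function computeFacet(records, attribute, maxcount) {
  var counts = {};
  records.forEach(function (r) {
    var v = r[attribute];
    if (v !== undefined) {
      counts[v] = (counts[v] || 0) + 1;
    }
  });
  return Object.keys(counts)
    .map(function (v) { return { value: v, count: counts[v] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, maxcount);
}
```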
<br />
*''_highlight'': The annotation of the result record, usually used to highlight relevant sections of the result documents so that users can see at a glance whether a result matches what they were looking for. What exactly is returned here depends on the search engine used. For example, the Solr integration in SMILA returns the raw form of the text and information about the matching parts to be highlighted. Example:<br />
<br />
<source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
...<br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To be or not to be ...</Val><br />
<Seq key="positions"><br />
<Map><br />
<Val key="start" type="long">7</Val><br />
<Val key="end" type="long">9</Val><br />
<Val key="quality" type="long">100</Val><br />
</Map><br />
<Map><br />
<Val key="start" type="long">20</Val><br />
<Val key="end" type="long">22</Val><br />
<Val key="quality" type="long">95</Val><br />
</Map><br />
</Seq><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</source> Using the HighlightingPipelet this can be transformed into a highlighted text fragment (here using * as the highlight tag): <source lang="xml"><br />
<Seq key="records"><br />
<Map><br />
<Val key="_weight" type="double">0.95</Val><br />
<Val key="_recordid">file:hamlet</Val><br />
<Val key="title">Hamlet</Val><br />
<Val key="author">Shakespeare</Val><br />
<Map key="_highlight"><br />
<Map key="content"><br />
<Val key="text">... To *be* or not to *be* ...</Val><br />
</Map><br />
<Map><br />
...<br />
</Map><br />
</source><br />
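The transformation shown above can be reproduced with a few lines of JavaScript. This is an illustration only, not the HighlightingPipelet itself; it assumes the "start"/"end" positions are character offsets with an exclusive end, which matches the example above:<br />

```javascript
// Illustrative sketch: wrap each highlighted range of the raw text in "*"
// markers, assuming the positions are sorted and non-overlapping.
function applyHighlight(text, positions) {
  var result = "";
  var last = 0;
  positions.forEach(function (p) {
    result += text.slice(last, p.start) + "*" + text.slice(p.start, p.end) + "*";
    last = p.end;
  });
  return result + text.slice(last);
}

applyHighlight("... To be or not to be ...",
    [{ start: 7, end: 9 }, { start: 20, end: 22 }]);
// → "... To *be* or not to *be* ..."
```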
<br />
=== Helper Classes ===<br />
<br />
There are some classes that help a client to create query records with their annotations and to read result records and their annotation. You can find them in package <tt>org.eclipse.smila.search.api.helper</tt>: <br />
<br />
*<tt>QueryBuilder</tt>: A helper class for building queries and sending them to the search service. Returns the result wrapped in the next class: <br />
*<tt>ResultAccessor</tt>: A wrapper for the complete search result. Provides methods to access the basic top-level result annotations and each search result record, which is in turn wrapped by a: <br />
*<tt>ResultRecordAccessor</tt>: Defines methods for accessing some of the result record annotations.<br />
<br />
See the source code or JavaDocs for more details on the provided methods. <br />
<br />
=== SMILA Search Servlet ===<br />
<br />
In addition to the "search backend", SMILA contains a simple servlet that creates a query record from HTTP parameters and displays the result as an HTML page by converting the XML search result using an XSLT stylesheet. This servlet is intended for quick demos only, not for productive use. It is usually deployed in the Jetty instance that comes with SMILA at <tt>/SMILA/search</tt>. On first invocation, it currently creates a nearly empty query record (it sets some default parameters like ''maxcount'' etc.) and processes it with the default pipeline "SearchPipeline". The pipeline should be able to process such a query and return an empty result list, not an error. The XML representation of this empty result is then transformed using the default stylesheet ("SMILASearchDefault") to present an initial search page. <br />
<br />
Note that the servlet actually enriches the XML search result a bit, so the input for the XSLT stylesheet does not completely conform to the defined XML schema. Currently, it adds a section containing the names of the indices available in Solr so that the search page can display them for selection on the left side: <br />
<br />
<source lang="xml"><br />
<SearchResult xmlns="http://www.eclipse.org/smila/search"><br />
<Workflow>searchpipeline</Workflow><br />
<Record xmlns="http://www.eclipse.org/smila/record"><br />
<!-- effective query and embedded result records --><br />
</Record><br />
<!-- part added by SearchServlet --><br />
<IndexNames><br />
<IndexName>test_index</IndexName><br />
</IndexNames><br />
</SearchResult><br />
</source> <br />
<br />
You can use the same mechanism to add other information to the XML that is necessary for displaying purposes in the search form but not contained in the search service result: You just have to implement your own servlet or extend the default servlet. Please refer to the source code for details. <br />
<br />
==== XSLT Stylesheets for SMILA search and result pages ====<br />
<br />
The stylesheets are loaded from the configuration directory <tt>org.eclipse.smila.search.servlet</tt> and are selected by adding the HTTP parameter "style" to the URL. The value of this parameter must be the filename of the desired stylesheet without the suffix. The file extension must be <tt>.xsl</tt>. The servlet currently uses the hardcoded default name "SMILASearchDefault" if no other value is set. <br />
<br />
In the default application, three stylesheets are available: <br />
<br />
*SMILASearchDefault: The default search page. Use this as a reference on how to describe simple queries and present result lists, including paging through bigger results. <br />
*SMILASearchAdvanced: Same layout for the result list, but demonstrates how to create more complex query records with attribute values and filters. <br />
*SMILASearchTest: Primitive layout without paging but demonstrates the setting of even more query features.<br />
<br />
To start with a stylesheet other than the default, you can add a ''style'' parameter to the initial URL. E.g., to start with the "advanced" stylesheet, use: <tt>http://localhost:8080/SMILA/search?style=SMILASearchAdvanced</tt>. <br />
<br />
In the following, we will describe how to set query record features using the servlet. Please have a look at the sample stylesheets for complete examples of how to apply these features, as we will not present a full tutorial here (-; <br />
<br />
==== Setting parameters ====<br />
<br />
To set a parameter, just use the parameter name as the HTTP parameter name. All values for this HTTP parameter are added to the "parameters" annotation of the query record. E.g., to set the ''resultSize'' parameter to 7 using a hidden HTML input field, use: <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="resultSize" value="7" /><br />
</source> <br />
<br />
See below for naming rules for the HTTP parameter names to set attribute literals and annotations. Note that you cannot set a parameter with a name that matches one of these rules. <br />
<br />
==== Setting attributes ====<br />
<br />
You can add literal string values to attributes using "A.&lt;AttributeName&gt;" as the HTTP parameter name. E.g., to set a value from an HTML text input field as a literal in attribute "Title", use: <br />
<br />
<source lang="xml"><br />
<input type="text" name="A.Title" /><br />
</source> <br />
<br />
==== Setting other parameters ====<br />
<br />
To add a "sortby" parameter for an attribute, use "sortby.&lt;AttributeName&gt;=&lt;order&gt;", e.g. <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="sortby.FileSize" value="descending" /><br />
</source> <br />
<br />
To create a filter for an attribute, use HTTP params: <br />
<br />
*"F.val.&lt;AttributeName&gt;" to add filter values to an "oneOf" filter. <br />
*"F.min.&lt;AttributeName&gt;" and "F.max.&lt;AttributeName&gt;" to set the lower/upper bounds of an "atLeast"/"atMost" filter.<br />
<br />
If both "F.val" and "F.min"/"F.max" parameters are set, the servlet will create both an enumeration filter and a range filter with the same filter mode. What happens in this case depends on the search engine integration used. E.g. <br />
<br />
*To set a filter for attribute ''MimeType'' restricting the result to HTML documents, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.val.MimeType" value="text/html" /><br />
</source> <br />
<br />
*To set a filter for attribute ''FileSize'' restricting the result to document sizes between 1000 and 10000 bytes, use:<br />
<br />
<source lang="xml"><br />
<input type="hidden" name="F.min.FileSize" value="1000" /><br />
<input type="hidden" name="F.max.FileSize" value="10000" /><br />
</source> <br />
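How such HTTP parameters could map to the ''filter'' structure of the query record can be sketched like this. This is an illustration of the mapping rules only, not the servlet's actual Java code, and it assumes a simple key-to-value parameter map:<br />

```javascript
// Illustrative sketch: turn "F.val.<Attr>", "F.min.<Attr>" and "F.max.<Attr>"
// parameters into the filter maps described earlier on this page.
function buildFilters(params) {
  var byAttr = {};
  Object.keys(params).forEach(function (key) {
    var m = key.match(/^F\.(val|min|max)\.(.+)$/);
    if (!m) return; // not a filter parameter
    var attr = m[2];
    var f = byAttr[attr] || (byAttr[attr] = { attribute: attr });
    if (m[1] === "val") (f.oneOf = f.oneOf || []).push(params[key]);
    if (m[1] === "min") f.atLeast = params[key];
    if (m[1] === "max") f.atMost = params[key];
  });
  return Object.keys(byAttr).map(function (attr) { return byAttr[attr]; });
}
```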
<br />
To set a value in the ranking parameter for the complete record or an attribute, use "R[.&lt;AttributeName&gt;].&lt;ValueName&gt;". E.g., the following input field adds a parameter "Operator=OR" to attribute "Content": <br />
<br />
<source lang="xml"><br />
<input type="hidden" name="R.Content.Operator" value="OR" /><br />
</source><br />
<br />
==== Adding attachments ====<br />
<br />
Attachments can be added to the query record by adding file upload fields to the search form, for example:<br />
<br />
<source lang="xml"><br />
<input type="file" name="Content"/> <br />
</source><br />
<br />
If the user selects a file for this field, it will be uploaded to SMILA and added as attachment "Content". Of course, there must be pipelets in your search pipeline that can process this attachment. Note that the attachments are kept in memory in a default SMILA configuration, so they should not be too large.<br />
<br />
=== Record Search Servlet ===<br />
<br />
In addition, there is a very basic Record Search Servlet available at {{Path|/SMILA/recordsearch}}. <br />
<br />
You can do a POST or GET request on this URL with a SMILA search record in XML representation as the request body. The servlet then parses the given XML and calls the Search Service. The default is to use the SearchPipeline, but you can select any other pipeline by adding the {{code|_workflow}} annotation with the respective pipeline name to the search record.<br />
<br />
The servlet returns the XML representation of the record returned by the Search Service as is, in which you can find the search results (see above).<br />
<br />
<br />
[[Category:SMILA]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Scripting&diff=373109SMILA/Documentation/Scripting2014-11-03T11:09:54Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>== Scripting SMILA using JavaScript ==<br />
<br />
'''Work In Progress'''<br />
<br />
=== Service Description ===<br />
<br />
* Bundles: <tt>org.eclipse.smila.scripting(.test)</tt><br />
* OSGi service interface: <tt>org.eclipse.smila.scripting.ScriptingEngine</tt><br />
* Service implementation: <tt>org.eclipse.smila.scripting.internal.JavascriptEngine</tt><br />
<br />
The ScriptingEngine provides an alternative way of describing "synchronous workflows" by using JavaScript functions instead of BPEL processes. This approach is easier, more flexible, and more maintainable (e.g. debuggable), so one day the BPEL approach might be removed completely.<br />
<br />
==== Script Basics ====<br />
<br />
A JavaScript function for SMILA scripting takes one record (including attachments) as an argument and can return one record (other return types are supported as well and are wrapped in a record automatically). For example, a file <tt>helloWorld.js</tt> (the suffix must be ".js") could look like this:<br />
<br />
<pre><br />
function greetings(record) {<br />
record.greetings = "Hello " + record.name + "!";<br />
return record;<br />
}<br />
</pre><br />
<br />
Script files to execute are added by default to <tt>SMILA/configuration/org.eclipse.smila.scripting/js</tt>. They are currently loaded "on demand" and not cached by the service for reuse, so changes to the files will take effect on the next execution.<br />
<br />
A script is invoked using the <tt>ScriptingEngine.callScript()</tt> methods. The first argument of both methods is a <tt>scriptName</tt> string in format <tt><file>.<function></tt> where the <tt><file></tt> part is the name of the script file (without path and ".js" suffix) and the <tt><function></tt> part is the name of a function defined in this file.<br />
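The <tt>scriptName</tt> format can be illustrated with a small parsing sketch. This is only an illustration, not SMILA's actual code; it assumes that the last dot separates file and function, since the file part contains no ".js" suffix:<br />

```javascript
// Illustrative sketch: split "<file>.<function>" at the last dot.
function parseScriptName(scriptName) {
  var dot = scriptName.lastIndexOf(".");
  return {
    file: scriptName.substring(0, dot), // script file without path and ".js"
    func: scriptName.substring(dot + 1) // function defined in that file
  };
}

parseScriptName("helloWorld.greetings");
// → { file: "helloWorld", func: "greetings" }
```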
<br />
==== Exposing Script Functions ====<br />
<br />
The script directory can contain "script catalog" files. They can be used to expose and describe available scripts in the ReST API so that a client can detect available scripts. Such a file must be named <tt><prefix>ScriptCatalog.js</tt>, e.g. <tt>smilaScriptCatalog.js</tt>, and must have this format:<br />
<br />
<pre><br />
[<br />
{<br />
name: "helloWorld.greetings",<br />
description: "Get a Hello from SMILA!"<br />
},<br />
// ... more function descriptions<br />
]<br />
</pre><br />
<br />
A catalog file does not define functions, it just produces an array of script function descriptions. A description object must contain a <tt>name</tt> property, we also recommend including a <tt>description</tt> property. Other properties can be added as you like (e.g. a structured description of expected parameters in the passed record).<br />
<br />
The <tt>ScriptingEngine.listScripts()</tt> method merges the arrays produced by all catalog scripts into one array (elements that are not objects or do not have a <tt>name</tt> property are ignored) and sorts them by name.<br />
<br />
The <tt>name</tt> property must be in format <tt><file>.<function></tt>, as described above for the <tt>scriptName</tt> parameter of the <tt>callScripts()</tt> functions.<br />
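The merge behavior of <tt>listScripts()</tt> described above can be sketched as follows. This is an illustration only, not the actual service implementation:<br />

```javascript
// Illustrative sketch: merge the arrays produced by all catalog scripts,
// drop entries that are not objects or have no "name" property, and sort
// the remaining script descriptions by name.
function mergeCatalogs(catalogs) {
  var all = [];
  catalogs.forEach(function (entries) {
    entries.forEach(function (entry) {
      if (entry !== null && typeof entry === "object" && typeof entry.name === "string") {
        all.push(entry);
      }
    });
  });
  return all.sort(function (a, b) {
    return a.name < b.name ? -1 : (a.name > b.name ? 1 : 0);
  });
}
```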
<br />
==== Configuration ====<br />
<br />
The script directory can be changed on startup using a system property: <tt>SMILA -Dsmila.scripting.dir=/home/smila/js ...</tt>. The system property can also be added to <tt>SMILA.ini</tt>, of course.<br />
<br />
=== Scripting Features ("SDK") ===<br />
<br />
See the [https://developer.mozilla.org/en-US/docs/Rhino_documentation Rhino Documentation] for special JavaScript features available in Rhino; they should work in SMILA as well. In particular, the predefined functions available in the [https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino/Shell#Predefined_Properties Rhino Shell] should also work in SMILA (where they are useful). For example, you can use <tt>print(...)</tt> to write something to the console:<br />
<br />
<pre><br />
print("Hello World!");<br />
</pre><br />
<br />
(However, the <tt>quit()</tt> function will do nothing ;-)<br />
<br />
==== Working with Records ====<br />
<br />
The record passed to the script can be accessed just like a native JavaScript object. The record attributes are just treated as object properties:<br />
<br />
<pre><br />
record.string = "a string";<br />
record["integer"] = 42;<br />
record.double = 3.14;<br />
record.boolean = true;<br />
record.map = {<br />
key : "value"<br />
};<br />
record.sequence = [ "Hello", record.string, record.integer, record.double ];<br />
<br />
delete record.name;<br />
</pre><br />
<br />
Iterating over maps and sequences is possible, too:<br />
<br />
<pre><br />
for ( var key in record.map) {<br />
print("map " + key + " to " + record.map[key]);<br />
}<br />
<br />
for ( var index in record.sequence) {<br />
print("element " + index + ": " + record.sequence[index]);<br />
} <br />
</pre><br />
<br />
To add elements to sequences, set the value of index <tt>sequence.length</tt>:<br />
<br />
<pre><br />
record.sequence[record.sequence.length] = "added element";<br />
</pre><br />
<br />
The methods of JavaScript Array objects (e.g. push(), concat(), ...) are currently not supported by SMILA sequence objects.<br />
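If you miss <tt>push()</tt>, a small helper using only index assignment and the <tt>length</tt> property (which SMILA sequences do support, as shown above) is easy to write. This is a sketch, not part of the SMILA API:<br />

```javascript
// Illustrative helper: append values to a sequence using only index
// assignment and the length property.
function pushAll(sequence, values) {
  for (var i = 0; i < values.length; i++) {
    sequence[sequence.length] = values[i];
  }
  return sequence;
}

// pushAll(record.sequence, ["a", "b"]) appends "a" and "b" to record.sequence
```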
<br />
The record object has the following three special properties whose names start with a dollar sign ($):<br />
<br />
* <tt>$id</tt>: The string value of the attribute <tt>_recordid</tt>. This is just a convenience property. It can be used to read and write the record ID:<br />
<pre><br />
var recordId = record.$id;<br />
record.$id = "changed-id";<br />
</pre><br />
* <tt>$metadata</tt>: In some cases it is necessary to use the actual <tt>AnyMap</tt> object containing the record metadata, for example if you want to call a Java method that defines a parameter of type <tt>Any</tt> or <tt>AnyMap</tt>:<br />
<pre><br />
var writer = new org.eclipse.smila.datamodel.ipc.IpcAnyWriter(true);<br />
var recordAsJson = writer.writeJsonObject(record.$metadata);<br />
</pre><br />
* <tt>$attachments</tt>: Contains an object that provides access to the record's attachments. Its properties correspond to attachment names and can be used to get and set attachment contents of the record.<br /> When reading an attachment, an actual <tt>org.eclipse.smila.datamodel.Attachment</tt> object is returned that can be accessed using its Java methods and passed to other Java objects:<br />
<pre><br />
var attachment = record.$attachments.Content;<br />
var contentLength = attachment.size();<br />
var contentAsByteArray = attachment.getAsBytes();<br />
var contentAsStream = attachment.getAsStream();<br />
<br />
var contentAsString = new java.lang.String(contentAsByteArray, "utf-8");<br />
</pre><br />
<br />
To set an attachment, several types of objects are supported to provide the content:<br />
* Java byte Arrays, of course:<br />
<pre><br />
record.$attachments.fromBytes = contentAsByteArray;<br />
</pre><br />
* String (more exactly, <tt>java.lang.CharSequence</tt>) objects are converted to byte arrays using UTF-8 encoding:<br />
<pre><br />
record.$attachments.fromString = "string attached";<br />
</pre><br />
* <tt>java.io.InputStream</tt> objects are read into a byte array and set as an attachment. The stream will be closed after the operation:<br />
<pre><br />
var stream = new java.io.FileInputStream(filename);<br />
record.$attachments.fromStream = stream;<br />
</pre><br />
* An <tt>org.eclipse.smila.datamodel.Attachment</tt> can be used, too. If the names match, the actual Attachment object will just be attached to the record. Otherwise, the implementation will fetch the content from the source attachment and create a new <tt>Attachment</tt> object from it (with the current implementation of attachments in SMILA this will NOT result in copying the actual <tt>byte[]</tt>). If getting the content does not work, an error will be thrown (however, this cannot happen currently).<br />
<pre><br />
record.$attachments.copyAttachment = record.$attachments.originalAttachment<br />
</pre><br />
To delete an attachment, use the <tt>delete</tt> operator:<br />
<pre><br />
delete record.$attachments.Content;<br />
</pre><br />
<br />
<tt>record.$attachments</tt> and <tt>record.$metadata</tt> cannot be used for write access themselves. The <tt>delete</tt> operator will not work on any of the special properties.<br />
<br />
==== Accessing OSGi services ====<br />
<br />
Any active OSGi services in the SMILA VM can be easily accessed from within a script. Just use the globally registered <tt>services</tt> object. For example:<br />
<br />
* Use LanguageIdentifier service:<br />
<pre><br />
var languageId = services.find("org.eclipse.smila.common.language.LanguageIdentifyService");<br />
record.language = languageId.identify(record.Content).getIsoLanguage();<br />
</pre><br />
<br />
* Write record to ObjectStore:<br />
<pre><br />
var objectstore = services.find("org.eclipse.smila.objectstore.ObjectStoreService");<br />
objectstore.ensureStore("store-created-by-script");<br />
<br />
var bonWriter = new org.eclipse.smila.datamodel.ipc.IpcAnyWriter(true);<br />
var bonObject = bonWriter.writeBinaryObject(record.$metadata);<br />
<br />
objectstore.putObject("store-created-by-script", "bon-object", bonObject);<br />
</pre><br />
<br />
See the service documentations for details on how to use them.<br />
<br />
==== Using Pipelets ====<br />
<br />
It is also possible to use pipelets. You must create a pipelet instance first using the global <tt>pipelets.create</tt> function and a configuration object, then you can invoke the created pipelet instance using the <tt>process</tt> function of the instance:<br />
<br />
<pre><br />
function processTika(record) {<br />
  var tikaConfig = {<br />
    "inputType" : "ATTRIBUTE",<br />
    "outputType" : "ATTRIBUTE",<br />
    "inputName" : "Content",<br />
    "outputName" : "PlainContent",<br />
    "contentTypeAttribute" : "MimeType",<br />
    "exportAsHtml" : false,<br />
    "maxLength" : "-1",<br />
    "extractProperties" : [ {<br />
      "metadataName" : "title",<br />
      "targetAttribute" : "Title",<br />
      "singleResult" : true<br />
    } ]<br />
  };<br />
  var tika = pipelets.create("org.eclipse.smila.tika.TikaPipelet", tikaConfig);<br />
  tika.process(record);<br />
  return record;<br />
}<br />
</pre><br />
<br />
The <tt>process()</tt> function accepts single records and arrays of records, as well as single JavaScript objects or arrays of JavaScript objects that can be converted to <tt>AnyMap</tt> objects. Arrays of records or objects will be processed in a single pipelet invocation.<br />
<br />
The <tt>process</tt> function always returns an array of records, even if only one record was given as input. That's because some pipelets create new records or split the input record into multiple output records.<br />
<br />
So the signature of the <tt>process</tt> function looks like this:<br />
<pre><br />
Record[] process(Record)<br />
Record[] process(Record[])<br />
Record[] process(AnyMap)<br />
Record[] process(AnyMap[])<br />
Record[] process(<Javascript-Map>)<br />
Record[] process(<Javascript-Map>[])<br />
</pre><br />
<br />
The result of a pipelet invocation can be given to another pipelet for further processing or returned as the final function result.<br />
<br />
'''Using Pipelets - best practice: '''<br />
<br />
In the normal case, pipelets just work on (i.e. modify) the given input records and do not create new records.<br />
In this case, don't use the result of a pipelet for further script processing, but keep working with the input record. That way you don't have to care about the <tt>process</tt> function always returning an array. <br />
<br />
Example-1: Best practice<br />
<pre><br />
function processRecord(record) {<br />
  ... <br />
  my1stPipelet.process(record);<br />
  record.greetings = "Hello world";<br />
  my2ndPipelet.process(record);<br />
  ...<br />
  return record;<br />
}<br />
</pre><br />
<br />
Example-2: When working with the pipelet result, you'll have to deal with arrays:<br />
<pre><br />
function processRecord(record) {<br />
  ... <br />
  var result1 = my1stPipelet.process(record);<br />
  result1[0].greetings = "Hello world";<br />
  var result2 = my2ndPipelet.process(result1);<br />
  ...<br />
  return result2[0];<br />
}<br />
</pre><br />
<br />
==== Using other scripts: <tt>require</tt> ====<br />
<br />
This is basically an implementation of the [http://wiki.commonjs.org/wiki/Modules/1.1 CommonJS Module Specification], so you may want to refer to details there.<br />
<br />
Scripts can use functions and objects from other scripts (aka "modules") using the global <tt>require</tt> function. The argument to <tt>require</tt> is the path to the imported script without the ".js" suffix, relative to the SMILA script directory.<br />
<br />
The prerequisite for using an object from another script is that it has been made available via registration in the "exports" object. The result of <tt>require</tt> is this "exports" object, so the exported functions can be accessed in the importing script via this object.<br />
<br />
Within one script execution, multiple <tt>require</tt> calls for one module (even from different scripts) cause the module to be loaded only once and return the same "exports" object. The scope into which the module was loaded is thus shared by all importers; local variables in this context are the same regardless of where the module is required.<br />
<br />
'''Example:'''<br />
We call a function in script <tt>helloWorld.js</tt> which uses a function from script <tt>utils/myUtils.js</tt>:<br />
<br />
helloWorld.js<br />
<pre><br />
// required scripts<br />
var myUtils = require("utils/myUtils");<br />
<br />
function greetings(record) {<br />
var normalizedName = myUtils.normalize(record.name)<br />
record.greetings = "Hello " + normalizedName + "!";<br />
return record;<br />
}<br />
</pre><br />
<br />
utils/myUtils.js<br />
<pre><br />
// objects used in other scripts<br />
exports.normalize = normalize<br />
<br />
function normalize(str) { <br />
return str.toUpperCase()<br />
}<br />
</pre><br />
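The single-load semantics described above can be sketched in plain JavaScript. This is an illustration only, not SMILA's actual <tt>require</tt> implementation (<tt>requireOnce</tt> and the <tt>loaders</tt> map are hypothetical names):<br />

```javascript
// Illustrative sketch: each module body runs at most once per execution;
// subsequent calls return the cached "exports" object.
var moduleCache = {};
function requireOnce(name, loaders) {
  if (!(name in moduleCache)) {
    var exports = {};
    loaders[name](exports); // run the module body exactly once
    moduleCache[name] = exports;
  }
  return moduleCache[name];
}
```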
<br />
'''Conventions''':<br />
* Exported functions should be exported under their original function name.<br />
* The object created with require() should be named like the required script.<br />
* If a script contains requires and/or exports, they should be listed at the beginning of the script, starting with the requires.<br />
<pre><br />
// required scripts<br />
var myUtils = require("utils/myUtils");<br />
var myCommons = require("commons/myCommons");<br />
...<br />
<br />
// objects used in other scripts<br />
exports.myFunction1 = myFunction1;<br />
exports.myFunction2 = myFunction2;<br />
...<br />
<br />
function myFunction1() {<br />
...<br />
}<br />
<br />
function myFunction2() {<br />
...<br />
}<br />
...<br />
</pre><br />
<br />
==== Logging ====<br />
<br />
To output log messages from within your scripting environment, SMILA provides two options. One is to use the built-in default logger, which is accessible via the '''log''' object. This object is already provided within your scope. <br />
<pre><br />
function logDemo() {<br />
log.info("This msg will be logged on level >>info<<");<br />
log.error("This msg will be logged on level >>error<<");<br />
log.warn("This msg will be logged on level >>warn<<");<br />
}<br />
</pre><br />
<br />
The messages can then be found in the smila.log file. Available levels are ''trace'', ''debug'', ''info'', ''warn'', ''error'' and ''fatal''. Depending on the levels set in SMILA's log4j.properties, your log message may be discarded. <br />
<br />
Whether a certain level is enabled can be checked with:<br />
<pre><br />
boolean log.isTraceEnabled();<br />
boolean log.isDebugEnabled();<br />
boolean log.isInfoEnabled();<br />
boolean log.isWarnEnabled();<br />
boolean log.isErrorEnabled();<br />
boolean log.isFatalEnabled();<br />
</pre> <br />
<br />
The other way is to create a custom logger. To do so, the environment exposes a vanilla ''org.apache.commons.logging.LogFactory'' to the script context.<br />
<pre><br />
function logFactoryDemo() {<br />
var myLogger = LogFactory.getLog("myLogger");<br />
myLogger.info("This is a log message"); <br />
}<br />
</pre> <br />
<br />
==== Debugging ====<br />
<br />
Yes, it is possible. See [[SMILA/Documentation/Scripting/Debugging]].<br />
<br />
==== ScriptProcessorWorker ====<br />
<br />
The [[SMILA/Documentation/Worker/ScriptProcessorWorker | ScriptProcessorWorker]] is a [[SMILA/Glossary#W|worker]] designed to process (synchronous) script calls inside an [[SMILA/Glossary#W|asynchronous workflow]].<br />
<br />
=== HTTP REST API ===<br />
<br />
Scripts handled via ReST API must be located on top level of the configured script folder, i.e. scripts in subfolders can currently not be called via ReST API.<br />
<br />
==== Managing Scripts ====<br />
<br />
'''URL'''<br />
<br />
<tt><nowiki>http://<hostname>:8080/smila/script</nowiki></tt><br />
<br />
'''Methods'''<br />
<br />
GET: list exposed scripts - show the result of <tt>ScriptingEngine.listScripts()</tt>.<br />
<br />
'''Request'''<br />
<br />
No parameters, no request body. <br />
<br />
'''Response'''<br />
<br />
Result of <tt>ScriptingEngine.listScripts()</tt>, wrapped as a JSON object with property "scripts" containing the array of script descriptions:<br />
<br />
<pre><br />
{<br />
"scripts": [<br />
{<br />
"name": "helloWorld.greetings",<br />
"description": "Get a Hello from SMILA!",<br />
"url": "http://localhost:8080/smila/script/helloWorld.greetings/"<br />
}<br />
]<br />
}<br />
</pre><br />
<br />
'''URL'''<br />
<br />
<tt><nowiki>http://<hostname>:8080/smila/script/<scriptfile>.<function></nowiki></tt><br />
<br />
'''Methods'''<br />
<br />
GET: show script description.<br />
<br />
'''Request'''<br />
<br />
No parameters, no request body.<br />
<br />
'''Response'''<br />
<br />
Description object from the above list with the matching name.<br />
<br />
'''Response Codes'''<br />
<br />
* 200 OK: Success<br />
* 404 Not Found: Function is not exposed in any ScriptCatalog file.<br />
<br />
'''Example Request'''<br />
<br />
<pre><br />
GET /smila/script/helloWorld.greetings/<br />
</pre><br />
<br />
'''Example Response'''<br />
<br />
<pre><br />
{<br />
"name": "helloWorld.greetings",<br />
"description": "Get a Hello from SMILA!",<br />
"url": "http://localhost:8080/smila/script/helloWorld.greetings/"<br />
}<br />
</pre><br />
<br />
==== Executing a script ====<br />
<br />
'''URL'''<br />
<br />
<tt><nowiki>http://<hostname>:8080/smila/script/<script-file>.<function></nowiki></tt><br />
<br />
'''Methods'''<br />
<br />
POST: execute script with record in request body. Attachments are supported, too.<br />
<br />
'''Response'''<br />
<br />
Metadata part of result of <tt>ScriptingEngine.callScript("<script-file>.<function>", requestRecord)</tt>. If the result contains attachments they are not returned via the ReST API.<br />
<br />
'''Response-Codes'''<br />
<br />
* 200 OK: Script executed successfully.<br />
* 400 Bad Request: Last URL part does not have <script-file>.<function> format, or error in Script execution<br />
* 404 Not Found: Script file does not exist or does not contain the function<br />
<br />
'''Example Request'''<br />
<br />
<pre><br />
POST http://localhost:8080/smila/script/helloWorld.greetings<br />
{<br />
"name": "Juergen"<br />
}<br />
</pre><br />
<br />
'''Example Response'''<br />
<br />
<pre><br />
{<br />
"name": "Juergen",<br />
"greetings": "Hello Juergen!"<br />
}<br />
</pre><br />
<br />
=== Examples ===<br />
Simple example using SMILA's default scripts for indexing and search.<br />
<br />
Add some records to index:<br />
<pre><br />
POST http://localhost:8080/smila/script/add.process<br />
{<br />
"_recordid": "id1",<br />
"Title": "Scripting rules!",<br />
"Content": "yet another SMILA document",<br />
"MimeType": "text/plain"<br />
}<br />
</pre><br />
<br />
Search index:<br />
<pre><br />
POST http://localhost:8080/smila/script/search.process<br />
{<br />
"query": "SMILA",<br />
"resultAttributes": ["Title", "Content"]<br />
}<br />
</pre><br />
<br />
Delete first record from index:<br />
<pre><br />
POST http://localhost:8080/smila/script/delete.process<br />
{<br />
"_recordid": "id1"<br />
}<br />
</pre></div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Scripting&diff=372983SMILA/Documentation/Scripting2014-10-30T15:37:50Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>== Scripting SMILA using Javascript ==<br />
<br />
'''Work In Progress'''<br />
<br />
=== Service Description ===<br />
<br />
* Bundles: org.eclipse.smila.scripting(.test)<br />
* OSGi service interface: org.eclipse.smila.scripting.ScriptingEngine<br />
* Service implementation: org.eclipse.smila.scripting.internal.JavascriptEngine<br />
<br />
The ScriptingEngine provides an alternative way to describe "synchronous workflows": Javascript functions instead of BPEL processes. This approach is easier, more flexible, and more maintainable (e.g. debuggable), so one day the BPEL approach might be removed completely.<br />
<br />
==== Scripts Basics ====<br />
<br />
A Javascript function for SMILA scripting takes one record (including attachments) as an argument and can return one record (other return types are supported, too, and are wrapped in a record automatically). For example, a file <tt>helloWorld.js</tt> (the suffix must be ".js") could look like this:<br />
<br />
<pre><br />
function greetings(record) {<br />
record.greetings = "Hello " + record.name + "!";<br />
return record;<br />
}<br />
</pre><br />
<br />
Script files to execute are placed by default in <tt>SMILA/configuration/org.eclipse.smila.scripting/js</tt>. They are currently loaded "on demand" and not cached in the service for reuse, so changes to the files become effective with the next execution.<br />
<br />
A script is invoked using the <tt>ScriptingEngine.callScript()</tt> methods. The first argument of both methods is a "scriptName" string in format "<file>.<function>" where the <file> part is the name of the script file (without path and ".js" suffix) and the <function> part is the name of a function defined in this file.<br />
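The naming convention can be illustrated with a small plain-Javascript sketch (the <tt>parseScriptName</tt> helper below is illustrative only, not part of the SMILA API):<br />

```javascript
// Illustrative helper (not SMILA API): split a scriptName such as
// "helloWorld.greetings" into its script-file and function parts,
// mirroring what callScript() does with its first argument.
function parseScriptName(scriptName) {
  var dot = scriptName.lastIndexOf(".");
  if (dot < 1 || dot === scriptName.length - 1) {
    throw new Error('scriptName must have the format "<file>.<function>"');
  }
  return {
    file: scriptName.substring(0, dot),  // script file, e.g. "helloWorld" -> helloWorld.js
    func: scriptName.substring(dot + 1)  // function name, e.g. "greetings"
  };
}

var parsed = parseScriptName("helloWorld.greetings");
```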
<br />
==== Exposing script functions ====<br />
<br />
The script directory can contain "script catalog" files. They can be used to expose and describe available scripts in the ReST API so that a client can detect available scripts. Such a file must be named <tt><prefix>ScriptCatalog.js</tt>, e.g. <tt>smilaScriptCatalog.js</tt> and must have this format:<br />
<br />
<pre><br />
[<br />
{<br />
name: "helloWorld.greetings",<br />
description: "Get a Hello from SMILA!"<br />
},<br />
// ... more function descriptions<br />
]<br />
</pre><br />
<br />
A catalog file does not define functions; it just produces an array of script function descriptions. A description object must contain a "name" property; we also recommend including a "description" property. Other properties can be added as you like (e.g. a structured description of the parameters expected in the passed record).<br />
<br />
The <tt>ScriptingEngine.listScripts()</tt> method merges the arrays produced by all catalog scripts into one array (elements that are not objects or do not have a "name" property are ignored) and sorts them by name.<br />
<br />
The name property must be in the format <file>.<function>, as described above for the scriptName parameter of the <tt>callScript()</tt> methods. <br />
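The merge-and-sort behavior of <tt>listScripts()</tt> can be sketched in plain Javascript (a simplified stand-in for the actual implementation, assuming the catalog arrays have already been loaded):<br />

```javascript
// Simplified stand-in for ScriptingEngine.listScripts(): merges the arrays
// produced by all catalog files, ignores elements that are not objects or
// have no "name" property, and sorts the remaining descriptions by name.
function mergeCatalogs(catalogs) {
  var merged = [];
  catalogs.forEach(function (catalog) {
    catalog.forEach(function (entry) {
      if (entry && typeof entry === "object" && typeof entry.name === "string") {
        merged.push(entry);
      }
    });
  });
  merged.sort(function (a, b) {
    return a.name < b.name ? -1 : a.name > b.name ? 1 : 0;
  });
  return merged;
}

var listed = mergeCatalogs([
  [{ name: "zeta.run" }, "not an object", { description: "no name property" }],
  [{ name: "helloWorld.greetings", description: "Get a Hello from SMILA!" }]
]);
```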
<br />
==== Configuration ====<br />
<br />
* The script directory can be changed on startup using a system property: <tt>SMILA -Dsmila.scripting.dir=/home/smila/js ...</tt>. The system property can also be added to SMILA.ini, of course.<br />
<br />
<br />
=== Scripting Features ("SDK") ===<br />
<br />
See the [https://developer.mozilla.org/en-US/docs/Rhino_documentation Rhino Documentation] for the special Javascript features available in Rhino; they should work in SMILA, too. In particular, the predefined functions of the [https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino/Shell#Predefined_Properties Rhino Shell] should work in SMILA as well (where they are useful). For example, you can use <tt>print(...)</tt> to write something to the console:<br />
<br />
<pre><br />
print("Hello World!");<br />
</pre><br />
<br />
(However, the <tt>quit()</tt> function will do nothing ;-)<br />
<br />
==== Working with Records ====<br />
<br />
The record passed to the script can be accessed just like a native Javascript object. The record attributes are just treated as object properties:<br />
<br />
<pre><br />
record.string = "a string";<br />
record["integer"] = 42;<br />
record.double = 3.14;<br />
record.boolean = true;<br />
record.map = {<br />
key : "value"<br />
};<br />
record.sequence = [ "Hello", record.string, record.integer, record.double ];<br />
<br />
delete record.name;<br />
</pre><br />
<br />
Iterating over maps and sequences is possible, too:<br />
<br />
<pre><br />
for ( var key in record.map) {<br />
print("map " + key + " to " + record.map[key]);<br />
}<br />
<br />
for ( var index in record.sequence) {<br />
print("element " + index + ": " + record.sequence[index]);<br />
} <br />
</pre><br />
<br />
The record object has three special properties, whose names start with a dollar sign ($):<br />
* <tt>$id</tt>: The string value of attribute <tt>_recordid</tt>. This is just a convenience property. It can be used to read and write the record ID:<br />
<pre><br />
var recordId = record.$id;<br />
record.$id = "changed-id";<br />
</pre><br />
* <tt>$metadata</tt>: in some cases it is necessary to use the actual AnyMap object containing the record metadata, for example if you want to call a Java method that defines a parameter of type Any or AnyMap:<br />
<pre><br />
var writer = new org.eclipse.smila.datamodel.ipc.IpcAnyWriter(true);<br />
var recordAsJson = writer.writeJsonObject(record.$metadata);<br />
</pre><br />
* <tt>$attachments</tt>: contains an object that provides access to the record attachments. Its properties correspond to attachment names and can be used to get and set attachment contents of the record<br />
** When reading an attachment, an actual <tt>org.eclipse.smila.datamodel.Attachment</tt> object is returned that can be accessed by using its Java methods and passed to other Java objects:<br />
<pre><br />
var attachment = record.$attachments.Content;<br />
var contentLength = attachment.size();<br />
var contentAsByteArray = attachment.getAsBytes();<br />
var contentAsStream = attachment.getAsStream();<br />
<br />
var contentAsString = new java.lang.String(contentAsByteArray, "utf-8");<br />
</pre><br />
** To set an attachment, several types of objects are supported to provide the content:<br />
*** Java byte Arrays, of course:<br />
<pre><br />
record.$attachments.fromBytes = contentAsByteArray;<br />
</pre><br />
*** String (more exactly, java.lang.CharSequence) objects are converted to byte arrays using UTF-8 encoding:<br />
<pre><br />
record.$attachments.fromString = "string attached";<br />
</pre><br />
*** java.io.InputStream objects are read into a byte array and set as an attachment. The stream is closed after the operation:<br />
<pre><br />
var stream = new java.io.FileInputStream(filename);<br />
record.$attachments.fromStream = stream<br />
</pre><br />
*** An <tt>org.eclipse.smila.datamodel.Attachment</tt> can be used, too. If the names match, the actual Attachment object will just be attached to the record. Otherwise, the implementation fetches the content from the source attachment and creates a new Attachment object from it (with the current implementation of attachments in SMILA this will NOT copy the actual byte[]). If getting the content does not work, an error is thrown (however, this cannot happen currently).<br />
<pre><br />
record.$attachments.copyAttachment = record.$attachments.originalAttachment<br />
</pre><br />
** To delete an attachment, use the <tt>delete</tt> operator:<br />
<pre><br />
delete record.$attachments.Content;<br />
</pre><br />
<br />
<tt>record.$attachments</tt> and <tt>record.$metadata</tt> cannot be used for write-access themselves. The <tt>delete</tt> operator will not work on any of the special properties.<br />
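The UTF-8 conversion behind string attachments can be sketched in plain Javascript, here using Node's <tt>Buffer</tt> as a stand-in for the Java byte arrays SMILA actually uses (illustrative only):<br />

```javascript
// Stand-in demo: strings assigned to $attachments are stored as UTF-8 bytes;
// reading the bytes back as UTF-8 reverses the encoding.
function toAttachmentBytes(str) {
  return Buffer.from(str, "utf-8");   // string -> bytes (UTF-8 encoded)
}
function attachmentBytesToString(bytes) {
  return bytes.toString("utf-8");     // bytes -> string (UTF-8 decoded)
}

var bytes = toAttachmentBytes("string attached");
var roundTrip = attachmentBytesToString(bytes);
```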
<br />
==== Accessing OSGi services ====<br />
<br />
Any active OSGi service in the SMILA VM can easily be accessed from within a script. Just use the globally registered <tt>services</tt> object. For example:<br />
<br />
* Use LanguageIdentifier service:<br />
<pre><br />
var languageId = services.find("org.eclipse.smila.common.language.LanguageIdentifyService");<br />
record.language = languageId.identify(record.Content).getIsoLanguage();<br />
</pre><br />
<br />
* Write record to ObjectStore:<br />
<pre><br />
var objectstore = services.find("org.eclipse.smila.objectstore.ObjectStoreService");<br />
objectstore.ensureStore("store-created-by-script");<br />
<br />
var bonWriter = new org.eclipse.smila.datamodel.ipc.IpcAnyWriter(true);<br />
var bonObject = bonWriter.writeBinaryObject(record.$metadata);<br />
<br />
objectstore.putObject("store-created-by-script", "bon-object", bonObject);<br />
</pre><br />
<br />
See the respective service documentation for details on how to use them.<br />
<br />
==== Using Pipelets ====<br />
<br />
It is also possible to use pipelets. You must first create a pipelet instance using the global <tt>pipelets.create</tt> function and a configuration object; then you can invoke the created instance using its <tt>process</tt> function:<br />
<br />
<pre><br />
function processTika(record) {<br />
  var tikaConfig = {<br />
    "inputType" : "ATTRIBUTE",<br />
    "outputType" : "ATTRIBUTE",<br />
    "inputName" : "Content",<br />
    "outputName" : "PlainContent",<br />
    "contentTypeAttribute" : "MimeType",<br />
    "exportAsHtml" : false,<br />
    "maxLength" : "-1",<br />
    "extractProperties" : [ {<br />
      "metadataName" : "title",<br />
      "targetAttribute" : "Title",<br />
      "singleResult" : true<br />
    } ]<br />
  };<br />
  var tika = pipelets.create("org.eclipse.smila.tika.TikaPipelet", tikaConfig);<br />
  tika.process(record);<br />
  return record;<br />
}<br />
</pre><br />
<br />
The <tt>process()</tt> function accepts single records and arrays of records, as well as single Javascript objects or arrays of Javascript objects that can be converted to AnyMap objects. Arrays of records or objects are processed in a single pipelet invocation.<br />
<br />
The <tt>process</tt> function always returns an array of records, even if only one record was given as input. This is because some pipelets create new records or split the input record into multiple output records.<br />
<br />
So the supported signatures of the <tt>process</tt> function look like this:<br />
<pre><br />
Record[] process(Record)<br />
Record[] process(Record[])<br />
Record[] process(AnyMap)<br />
Record[] process(AnyMap[])<br />
Record[] process(<Javascript-Map>)<br />
Record[] process(<Javascript-Map>[])<br />
</pre><br />
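The always-an-array contract can be mimicked with a mock pipelet in plain Javascript (a stand-in, not the SMILA implementation):<br />

```javascript
// Mock pipelet illustrating the process() contract:
// the input may be a single record or an array, the output is always an array.
function mockProcess(input) {
  var records = Array.isArray(input) ? input : [input];
  return records.map(function (record) {
    record.processed = true;  // a real pipelet would do its actual work here
    return record;
  });
}

var single = mockProcess({ Title: "one record" });   // array with one element
var several = mockProcess([{ a: 1 }, { b: 2 }]);     // array with two elements
```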
<br />
The result of a pipelet invocation can be given to another pipelet for further processing or returned as the final function result.<br />
<br />
'''Using Pipelets - best practice: '''<br />
<br />
In the normal case, pipelets just work on (i.e. modify) the given input records and do not create new records.<br />
In this case, don't use the result of a pipelet for further script processing; just keep working with the input record. That way you don't have to care about the <tt>process</tt> function always returning an array. <br />
<br />
Example-1: Best practice<br />
<pre><br />
function processRecord(record) {<br />
  ... <br />
  my1stPipelet.process(record);<br />
  record.greetings = "Hello world";<br />
  my2ndPipelet.process(record);<br />
  ...<br />
  return record;<br />
}<br />
</pre><br />
<br />
Example-2: When working with the pipelet result, you'd have to deal with arrays:<br />
<pre><br />
function processRecord(record) {<br />
  ... <br />
  var result1 = my1stPipelet.process(record);<br />
  result1[0].greetings = "Hello world";<br />
  var result2 = my2ndPipelet.process(result1);<br />
  ...<br />
  return result2[0];<br />
}<br />
</pre><br />
<br />
==== Using other scripts: <tt>require</tt> ====<br />
<br />
This is basically an implementation of the [http://wiki.commonjs.org/wiki/Modules/1.1 CommonJS Module Specification], so you may want to refer to details there.<br />
<br />
Scripts can use functions and objects from other scripts (aka "modules") using the global <tt>require</tt> function. The argument to <tt>require</tt> is the path to the imported script without the ".js" suffix, relative to the SMILA script directory.<br />
<br />
The prerequisite for using an object from another script is that it has been made available via registration in the "exports" object. The result of <tt>require</tt> is this "exports" object, so the exported functions can be accessed in the importing script via this object.<br />
<br />
Within one script execution, multiple <tt>require</tt> calls for one module (even from different scripts) cause the module to be loaded only once and return the same "exports" object. So the scope into which the module was loaded is shared by all importers; local variables in this scope are the same regardless of where the module is required.<br />
<br />
'''Example:'''<br />
We call a function in script <tt>helloWorld.js</tt> which uses a function from script <tt>utils/myUtils.js</tt>:<br />
<br />
helloWorld.js<br />
<pre><br />
// required scripts<br />
var myUtils = require("utils/myUtils");<br />
<br />
function greetings(record) {<br />
var normalizedName = myUtils.normalize(record.name)<br />
record.greetings = "Hello " + normalizedName + "!";<br />
return record;<br />
}<br />
</pre><br />
<br />
utils/myUtils.js<br />
<pre><br />
// objects used in other scripts<br />
exports.normalize = normalize<br />
<br />
function normalize(str) { <br />
return str.toUpperCase()<br />
}<br />
</pre><br />
<br />
'''Conventions''':<br />
* Exported functions should be exported under their original function name.<br />
* The object created with require() should be named like the required script.<br />
* If a script contains requires and/or exports, they should be listed at the beginning of the script, starting with the requires.<br />
<pre><br />
// required scripts<br />
var myUtils = require("utils/myUtils");<br />
var myCommons = require("commons/myCommons");<br />
...<br />
<br />
// objects used in other scripts<br />
exports.myFunction1 = myFunction1;<br />
exports.myFunction2 = myFunction2;<br />
...<br />
<br />
function myFunction1() {<br />
...<br />
}<br />
<br />
function myFunction2() {<br />
...<br />
}<br />
...<br />
</pre><br />
<br />
==== Logging ====<br />
<br />
SMILA provides two ways to output log messages from within your scripts. One way is to use the built-in default logger, which is accessible via the '''log''' object. This object is already provided in your scope. <br />
<pre><br />
function logDemo() {<br />
log.info("This msg will be logged on level >>info<<");<br />
log.error("This msg will be logged on level >>error<<");<br />
log.warn("This msg will be logged on level >>warn<<");<br />
}<br />
</pre><br />
<br />
The messages can then be found in the <tt>smila.log</tt> file. Available levels are ''trace'', ''debug'', ''info'', ''warn'', ''error'' and ''fatal''. Depending on the levels set in SMILA's <tt>log4j.properties</tt>, your log message may be discarded. <br />
<br />
Whether a certain level is enabled can be checked with:<br />
<pre><br />
boolean log.isTraceEnabled();<br />
boolean log.isDebugEnabled();<br />
boolean log.isInfoEnabled();<br />
boolean log.isWarnEnabled();<br />
boolean log.isErrorEnabled();<br />
boolean log.isFatalEnabled();<br />
</pre> <br />
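These checks are typically used to avoid building expensive log messages when the level is disabled; a sketch with a stand-in <tt>log</tt> object (in SMILA, <tt>log</tt> is provided by the engine):<br />

```javascript
// Stand-in log object; in SMILA the "log" object is provided in the scope.
var log = {
  debugEnabled: false,
  messages: [],
  isDebugEnabled: function () { return this.debugEnabled; },
  debug: function (msg) { this.messages.push(msg); }
};

var dumpsBuilt = 0;
function expensiveDump(record) {  // pretend this is costly to compute
  dumpsBuilt++;
  return JSON.stringify(record);
}

// Guard: the expensive message is only built when debug logging is enabled.
if (log.isDebugEnabled()) {
  log.debug("record dump: " + expensiveDump({ Title: "Scripting rules!" }));
}
```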
<br />
The other way is to create a custom logger. To do so, the environment exposes a vanilla ''org.apache.commons.logging.LogFactory'' to the script context.<br />
<pre><br />
function logFactoryDemo() {<br />
var myLogger = LogFactory.getLog("myLogger");<br />
myLogger.info("This is a log message"); <br />
}<br />
</pre> <br />
<br />
==== Debugging ====<br />
<br />
Yes, it is possible. See [[SMILA/Documentation/Scripting/Debugging]].<br />
<br />
==== ScriptProcessorWorker ====<br />
<br />
The [[SMILA/Documentation/Worker/ScriptProcessorWorker | ScriptProcessorWorker]] is a [[SMILA/Glossary#W|worker]] designed to process (synchronous) script calls inside an [[SMILA/Glossary#W|asynchronous workflow]].<br />
<br />
=== HTTP REST API ===<br />
<br />
Scripts handled via the ReST API must be located at the top level of the configured script folder, i.e. scripts in subfolders cannot currently be called via the ReST API.<br />
<br />
==== Managing Scripts ====<br />
<br />
'''URL'''<br />
<br />
<tt><nowiki>http://<hostname>:8080/smila/script</nowiki></tt><br />
<br />
'''Methods'''<br />
<br />
GET: list exposed scripts - show the result of <tt>ScriptingEngine.listScripts()</tt>.<br />
<br />
'''Request'''<br />
<br />
No parameters, no request body. <br />
<br />
'''Response'''<br />
<br />
Result of <tt>ScriptingEngine.listScripts()</tt>, wrapped as a JSON object with property "scripts" containing the array of script descriptions:<br />
<br />
<pre><br />
{<br />
  "scripts": [<br />
    {<br />
      "name": "helloWorld.greetings",<br />
      "description": "Get a Hello from SMILA!",<br />
      "url": "http://localhost:8080/smila/script/helloWorld.greetings/"<br />
    }<br />
  ]<br />
}<br />
</pre><br />
<br />
'''URL'''<br />
<br />
<tt><nowiki>http://<hostname>:8080/smila/script/<scriptfile>.<function></nowiki></tt><br />
<br />
'''Methods'''<br />
<br />
GET: show script description.<br />
<br />
'''Request'''<br />
<br />
No parameters, no request body.<br />
<br />
'''Response'''<br />
<br />
Description object from the above list with the matching name.<br />
<br />
'''Response Codes'''<br />
<br />
* 200 OK: Success<br />
* 404 Not Found: Function is not exposed in any ScriptCatalog file.<br />
<br />
'''Example Request'''<br />
<br />
<pre><br />
GET /smila/script/helloWorld.greetings/<br />
</pre><br />
<br />
'''Example Response'''<br />
<br />
<pre><br />
{<br />
  "name": "helloWorld.greetings",<br />
  "description": "Get a Hello from SMILA!",<br />
  "url": "http://localhost:8080/smila/script/helloWorld.greetings/"<br />
}<br />
</pre><br />
<br />
==== Executing a script ====<br />
<br />
'''URL'''<br />
<br />
<tt><nowiki>http://<hostname>:8080/smila/script/<script-file>.<function></nowiki></tt><br />
<br />
'''Methods'''<br />
<br />
POST: execute the script with the record in the request body. Attachments are supported, too.<br />
<br />
'''Response'''<br />
<br />
Metadata part of the result of <tt>ScriptingEngine.callScript("<script-file>.<function>", requestRecord)</tt>. If the result contains attachments, they are not returned via the ReST API.<br />
<br />
'''Response Codes'''<br />
<br />
* 200 OK: Script executed successfully.<br />
* 400 Bad Request: Last URL part does not have <script-file>.<function> format, or an error occurred during script execution.<br />
* 404 Not Found: Script file does not exist or does not contain the function.<br />
<br />
'''Example Request'''<br />
<br />
<pre><br />
POST http://localhost:8080/smila/script/helloWorld.greetings<br />
{<br />
  "name": "Juergen"<br />
}<br />
</pre><br />
<br />
'''Example Response'''<br />
<br />
<pre><br />
{<br />
  "name": "Juergen",<br />
  "greetings": "Hello Juergen!"<br />
}<br />
</pre><br />
<br />
=== Examples ===<br />
Simple example using SMILA's default scripts for indexing and search.<br />
<br />
Add some records to the index:<br />
<pre><br />
POST http://localhost:8080/smila/script/add.process<br />
{<br />
"_recordid": "id1",<br />
"Title": "Scripting rules!",<br />
"Content": "yet another SMILA document",<br />
"MimeType": "text/plain"<br />
}<br />
</pre><br />
<br />
Search the index:<br />
<pre><br />
POST http://localhost:8080/smila/script/search.process<br />
{<br />
"query": "SMILA",<br />
"resultAttributes": ["Title", "Content"]<br />
}<br />
</pre><br />
<br />
Delete the first record from the index:<br />
<pre><br />
POST http://localhost:8080/smila/script/delete.process<br />
{<br />
"_recordid": "id1"<br />
}<br />
</pre></div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Scripting&diff=372970SMILA/Documentation/Scripting2014-10-30T13:30:35Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>== Scripting SMILA using Javascript ==<br />
<br />
'''Work In Progress'''<br />
<br />
=== Service Description ===<br />
<br />
* Bundles: org.eclipse.smila.scripting(.test)<br />
* OSGi service interface: org.eclipse.smila.scripting.ScriptingEngine<br />
* Service implementation: org.eclipse.smila.scripting.internal.JavascriptEngine<br />
<br />
The ScriptingEngine provides an alternative for describing "synchronous workflows" by using Javascript functions instead of BPEL processes. This approach is easier, more flexible and more maintainable (e.g. debugable), so one day the BPEL approach might be removed completely.<br />
<br />
==== Scripts Basics ====<br />
<br />
A javascript function for SMILA scripting takes one record (including attachments) as an argument, and can return one record (other return types are supported, too and wrapped in a record automatically). For example, a file <tt>helloWorld.js</tt> (the suffix must be ".js") could look like this:<br />
<br />
<pre><br />
function greetings(record) {<br />
record.greetings = "Hello " + record.name + "!";<br />
return record;<br />
}<br />
</pre><br />
<br />
Script files to execute are added by default to <tt>SMILA/configuration/org.eclipse.smila.scripting/js</tt>. They are currently loaded "on-demand" and not stored in the service for reuse, so changes in the files will be effective for the next execution.<br />
<br />
A script is invoked using the <tt>ScriptingEngine.callScript()</tt> methods. The first argument of both methods is a "scriptName" string in format "<file>.<function>" where the <file> part is the name of the script file (without path and ".js" suffix) and the <function> part is the name of a function defined in this file.<br />
<br />
==== Exposing script functions ====<br />
<br />
The script directory can contain "script catalog" files. They can be used to expose and describe available scripts in the ReST API so that a client can detect available scripts. Such a file must be named <tt><prefix>ScriptCatalog.js</tt>, e.g. <tt>smilaScriptCatalog.js</tt> and must have this format:<br />
<br />
<pre><br />
[<br />
{<br />
name: "helloWorld.greetings",<br />
description: "Get a Hello from SMILA!"<br />
},<br />
// ... more function descriptions<br />
]<br />
</pre><br />
<br />
A catalog file does not define functions, it just produces an array of script function descriptions. A description object must contain a "name" property, we recommend to include a "description" property. Other properties can be added as you like (e.g. a structured description of expected parameters in the passed record).<br />
<br />
The <tt>ScriptingEngine.listScripts()</tt> method merges the arrays produced by all catalog scripts into one array (elements that are not objects or do not have a "name" property are ignored) and sorts them by name.<br />
<br />
The name property must be in format <file>.<function>, as described above for the scriptName parameter of the callScripts() functions. <br />
<br />
==== Configuration ====<br />
<br />
* The script directory can be changed on startup using a system property: <tt>SMILA -Dsmila.scripting.dir=/home/smila/js ...</tt>. The system property can also be added to SMILA.ini, of course.<br />
<br />
<br />
=== Scripting Features ("SDK") ===<br />
<br />
See the [https://developer.mozilla.org/en-US/docs/Rhino_documentation Rhino Documentation] for special Javascript features available in Rhino. They should work in SMILA, too. Especially the predefined functions available in [https://developer.mozilla.org/en-US/docs/Mozilla/Projects/Rhino/Shell#Predefined_Properties Rhino Shell] should work in SMILA, too (if they are useful). For example, you can use <tt>print(...)</tt> to write something to the console:<br />
<br />
<pre><br />
print("Hello World!");<br />
</pre><br />
<br />
(However, the <tt>quit()</tt> function will do nothing ;-)<br />
<br />
==== Working with Records ====<br />
<br />
The record passed to the script can be accessed just like a native Javascript object. The record attributes are just treated as object properties:<br />
<br />
<pre><br />
record.string = "a string";<br />
record["integer"] = 42;<br />
record.double = 3.14;<br />
record.boolean = true;<br />
record.map = {<br />
key : "value"<br />
};<br />
record.sequence = [ "Hello", record.string, record.integer, record.double ];<br />
<br />
delete record.name;<br />
</pre><br />
<br />
Iterating over maps and sequences is possible, too:<br />
<br />
<pre><br />
for ( var key in record.map) {<br />
print("map " + key + " to " + record.map.key);<br />
}<br />
<br />
for ( var index in record.sequence) {<br />
print("element " + index + ": " + record.sequence[index]);<br />
} <br />
</pre><br />
<br />
The record object has three special properties, whose names start with a dollar sign ($):<br />
* <tt>$id</tt>: The string value of attribute <tt>_recordid</tt>. This is just a convenience property. It can be used to read and write the record ID:<br />
<pre><br />
var recordId = record.$id;<br />
record.$id = "changed-id";<br />
</pre><br />
* <tt>$metadata</tt>: in some cases it is necessary to use the actual AnyMap object containing the record metadata, for example if you want to call a Java method that defines a parameter of type Any or AnyMap:<br />
<pre><br />
var writer = new org.eclipse.smila.datamodel.ipc.IpcAnyWriter(true);<br />
var recordAsJson = writer.writeJsonObject(record.$metadata);<br />
</pre><br />
* <tt>$attachments</tt>: contains an object that provides access to the record attachments. Its properties correspond to attachment names and can be used to get and set attachment contents of the record<br />
** When reading an attachment, an actual <tt>org.eclipse.smila.datamodel.Attachment</tt> object is returned that can be access by using the Java methods and passed to other Java objects:<br />
<pre><br />
var attachment = record.$attachments.Content;<br />
var contentLength = attachment.size();<br />
var contentAsByteArray = attachment.getAsBytes();<br />
var contentAsStream = attachment.getAsStream();<br />
<br />
var contentAsString = new java.lang.String(contentAsByteArray, "utf-8");<br />
</pre><br />
** To set an attachment, several types of objects are supported to provide the content:<br />
*** Java byte Arrays, of course:<br />
<pre><br />
record.$attachment.fromBytes = contentAsByteArray;<br />
</pre><br />
*** String (more exactly, java.lang.CharSequence) objects are converted to byte arrays using UTF-8 encoding:<br />
<pre><br />
record.$attachments.fromString = "string attached";<br />
</pre><br />
*** java.io.InputStream objects are read into an byte array and set as an attachment. The stream will be closed after the operation:<br />
<pre><br />
var stream = new FileInputStream(filename);<br />
record.$attachments.fromStream = stream<br />
</pre><br />
*** An <tt>org.eclipse.smila.datamodel.Attachment</tt> can be used, too. If the names match, the actual Attachment object will just be attached to the record. Else the implementation will fetch the content from the source attachment and create a new Attachment object from it (with the current implementation of Attachments in SMILA this will NOT result in copying the actual byte[]). If getting the content does not work, an error will be thrown (however, this cannot happen currently).<br />
<pre><br />
record.$attachments.copyAttachment = record.$attachments.originalAttachment<br />
</pre><br />
** To delete an attachment, use the <tt>delete</tt> operator:<br />
<pre><br />
delete record.$attachments.Content;<br />
</pre><br />
<br />
<tt>record.$attachments</tt> and <tt>record.$metadata</tt> cannot be used for write-access themselves. The <tt>delete</tt> operator will not work on any of the special properties.<br />
<br />
==== Accessing OSGi services ====<br />
<br />
Any active OSGi services in the SMILA VM can be easily accessed from within a script. Just use the globally registered <tt>services</tt> object. For example:<br />
<br />
* Use LanguageIdentifier service:<br />
<pre><br />
var languageId = services.find("org.eclipse.smila.common.language.LanguageIdentifyService");<br />
record.language = languageId.identify(record.Content).getIsoLanguage();<br />
</pre><br />
<br />
* Write record to ObjectStore:<br />
<pre><br />
var objectstore = services.find("org.eclipse.smila.objectstore.ObjectStoreService");<br />
objectstore.ensureStore("store-created-by-script");<br />
<br />
var bonWriter = new org.eclipse.smila.datamodel.ipc.IpcAnyWriter(true);<br />
var bonObject = bonWriter.writeBinaryObject(record.$metadata);<br />
<br />
objectstore.putObject("store-created-by-script", "bon-object", bonObject);<br />
</pre><br />
<br />
See the service documentations for details on how to use them.<br />
<br />
==== Using Pipelets ====<br />
<br />
It is also possible to use pipelets. You must create a pipelet instance first using the global <tt>pipelets.create</tt> function and a configuration object, then you can invoke the created pipelet instance using the <tt>process</tt> function of the instance:<br />
<br />
<pre><br />
function processTika(record) {<br />
var tikaConfig = {<br />
"inputType" : "ATTRIBUTE",<br />
"outputType" : "ATTRIBUTE",<br />
"inputName" : "Content",<br />
"outputName" : "PlainContent",<br />
"contentTypeAttribute" : "MimeType",<br />
"exportAsHtml" : false,<br />
"maxLength" : "-1",<br />
"extractProperties" : [ {<br />
"metadataName" : "title",<br />
"targetAttribute" : "Title",<br />
"singleResult" : true<br />
} ]<br />
};<br />
var tika = pipelets.create("org.eclipse.smila.tika.TikaPipelet", tikaConfig);<br />
tika.process(record);<br />
return record;<br />
}<br />
</pre><br />
<br />
The <tt>process()</tt> function accepts single records and arrays of records, as well as single Javascript objects or arrays of Javascript objects that can be converted to AnyMap objects. Arrays of records or objects will be processed in a single pipelet invocation.<br />
<br />
The <tt>process</tt> function always returns an array of records, even if only one record was given as input. This is because some pipelets create new records or split the input record into multiple output records.<br />
<br />
So the signature of the <tt>process</tt> function looks like this:<br />
<pre><br />
Record[] process(Record)<br />
Record[] process(Record[])<br />
Record[] process(AnyMap)<br />
Record[] process(AnyMap[])<br />
Record[] process(<Javascript-Map>)<br />
Record[] process(<Javascript-Map>[])<br />
</pre><br />
<br />
The result of a pipelet invocation can be given to another pipelet for further processing or returned as the final function result.<br />
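Since the real pipelet API is not needed to see why this matters, the contract can be illustrated with a plain-JavaScript mock (the "splitter" behavior and all names here are invented for illustration, not part of SMILA):<br />

```javascript
// Plain-JavaScript mock of the array-returning process() contract
// (illustration only -- not the real SMILA pipelet API).
// A "splitting" pipelet may turn one input record into several
// output records, which is why process() always returns an array.
function mockSplitterProcess(input) {
  var records = Array.isArray(input) ? input : [input];
  var out = [];
  records.forEach(function (record) {
    record.Content.split("\n").forEach(function (part, i) {
      out.push({ _recordid: record._recordid + "#" + i, Content: part });
    });
  });
  return out; // always an array, even for a single input record
}

var result = mockSplitterProcess({ _recordid: "id1", Content: "a\nb" });
console.log(result.length);       // 2
console.log(result[0]._recordid); // "id1#0"
```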
<br />
'''Using Pipelets - best practice: '''<br />
<br />
In the normal case, pipelets just work on (and modify) the given input records, but do not create new records.<br />
In this case, don't use the result of a pipelet for further script processing but just work with the input record. Then you don't have to care about the <tt>process</tt> function always returning an array as result. <br />
<br />
Example-1: Best practice<br />
<pre><br />
function processTika(record) {<br />
... <br />
my1stPipelet.process(record);<br />
record.greetings = "Hello world";<br />
my2ndPipelet.process(record);<br />
...<br />
return record;<br />
}<br />
</pre><br />
<br />
Example-2: When working with the pipelet result, you'd have to deal with arrays:<br />
<pre><br />
function processTika(record) {<br />
... <br />
var result1 = my1stPipelet.process(record);<br />
result1[0].greetings = "Hello world";<br />
var result2 = my2ndPipelet.process(result1);<br />
...<br />
return result2[0];<br />
}<br />
</pre><br />
<br />
==== Using other scripts: <tt>require</tt> ====<br />
<br />
This is basically an implementation of the [http://wiki.commonjs.org/wiki/Modules/1.1 CommonJS Module Specification], so you may want to refer to details there.<br />
<br />
Scripts can use functions and objects from other scripts (aka "modules") using the global <tt>require</tt> function. The argument to <tt>require</tt> is the path to the imported script without the ".js" suffix, relative to the SMILA script directory.<br />
<br />
The prerequisite for using an object from another script is that it has been made available via registration in the "exports" object. The result of <tt>require</tt> is this "exports" object, so the exported functions can be accessed in the importing script via this object.<br />
<br />
Within one script execution, multiple <tt>require</tt> calls for one module (even from different scripts) cause the module to be loaded only once and return the same "exports" object. So the scope into which the module was loaded is shared by all importers; local variables in this context are the same regardless of where the module is called from.<br />
<br />
'''Example:'''<br />
We call a function in script <tt>helloWorld.js</tt> which uses a function from script <tt>utils/myUtils.js</tt>:<br />
<br />
helloWorld.js<br />
<pre><br />
// required scripts<br />
var myUtils = require("utils/myUtils");<br />
<br />
function greetings(record) {<br />
var normalizedName = myUtils.normalize(record.name)<br />
record.greetings = "Hello " + normalizedName + "!";<br />
return record;<br />
}<br />
</pre><br />
<br />
utils/myUtils.js<br />
<pre><br />
// objects used in other scripts<br />
exports.normalize = normalize<br />
<br />
function normalize(str) { <br />
return str.toUpperCase()<br />
}<br />
</pre><br />
<br />
'''Conventions''':<br />
* Exported functions should be exported under their original function name.<br />
* The object created with require() should be named after the required script.<br />
* If a script contains requires and/or exports, they should be listed at the beginning of the script, starting with the requires.<br />
<pre><br />
// required scripts<br />
var myUtils = require("utils/myUtils");<br />
var myCommons = require("commons/myCommons");<br />
...<br />
<br />
// objects used in other scripts<br />
exports.myFunction1 = myFunction1;<br />
exports.myFunction2 = myFunction2;<br />
...<br />
<br />
function myFunction1() {<br />
...<br />
}<br />
<br />
function myFunction2() {<br />
...<br />
}<br />
...<br />
</pre><br />
<br />
==== Logging ====<br />
<br />
SMILA provides two ways to output log messages from within your scripting environment. One way is to use the built-in default logger, which is accessible via the '''log''' object. This object is already provided within your scope. <br />
<pre><br />
function logDemo() {<br />
log.info("This msg will be logged on level >>info<<");<br />
log.error("This msg will be logged on level >>error<<");<br />
log.warn("This msg will be logged on level >>warn<<");<br />
}<br />
</pre><br />
<br />
The messages can then be found in the smila.log file. Available levels are ''trace'', ''debug'', ''info'', ''warn'', ''error'' and ''fatal''. Depending on the levels set in SMILA's log4j.properties, your log message may be discarded. <br />
<br />
The other way is to create a custom logger. To do so, the environment exposes a vanilla ''org.apache.commons.logging.LogFactory'' to the script context.<br />
<pre><br />
function logFactoryDemo() {<br />
var myLogger = LogFactory.getLog("myLogger");<br />
myLogger.info("This is a log message"); <br />
}<br />
</pre> <br />
<br />
==== Debugging ====<br />
<br />
'''TODO'''<br />
<br />
==== ScriptProcessorWorker ====<br />
<br />
The [[SMILA/Documentation/Worker/ScriptProcessorWorker | ScriptProcessorWorker]] is a [[SMILA/Glossary#W|worker]] designed to process (synchronous) script calls inside an [[SMILA/Glossary#W|asynchronous workflow]].<br />
<br />
=== HTTP REST API ===<br />
<br />
Scripts handled via the REST API must be located at the top level of the configured script folder, i.e. scripts in subfolders currently cannot be called via the REST API.<br />
<br />
==== Manage Scripts ====<br />
<br />
'''URL:''' <tt>http://<hostname>:8080/smila/script</tt><br />
<br />
Methods:<br />
* GET: list exposed scripts - show the result of <tt>ScriptingEngine.listScripts()</tt>. No parameters, no request body. <br />
<br />
Response:<br />
* result of <tt>ScriptingEngine.listScripts()</tt>, wrapped as a JSON object with property "scripts" containing the array of script descriptions:<br />
<pre><br />
{<br />
"scripts": [<br />
{<br />
"name": "helloWorld.greetings",<br />
"description": "Get a Hello from SMILA!",<br />
"url": "http://localhost:8080/smila/script/helloWorld.greetings/"<br />
}<br />
]<br />
}<br />
</pre><br />
<br />
<br />
'''URL:''' <tt>http://<hostname>:8080/smila/script/<scriptfile>.<function></tt><br />
<br />
Methods:<br />
* GET: show script description. No parameters, no request body.<br />
<br />
Response:<br />
* the description object from the list above with the matching name.<br />
<br />
Response-Codes<br />
* 200 OK: Success<br />
* 404 Not Found: Function is not exposed in any ScriptCatalog file.<br />
<br />
Example: <tt>GET /smila/script/helloWorld.greetings</tt> yields:<br />
<br />
<pre><br />
{<br />
"name": "helloWorld.greetings",<br />
"description": "Get a Hello from SMILA!",<br />
"url": "http://localhost:8080/smila/script/helloWorld.greetings/"<br />
}<br />
</pre><br />
<br />
==== Execute a script ====<br />
<br />
'''URL:''' <tt>http://<hostname>:8080/smila/script/<script-file>.<function></tt><br />
<br />
Methods<br />
* POST: execute script with record in request body. Attachments are supported, too.<br />
<br />
Response:<br />
* Metadata part of result of <tt>ScriptingEngine.callScript("<script-file>.<function>", requestRecord)</tt>. If the result contains attachments, they are not returned via the REST API.<br />
<br />
Response-Codes<br />
* 200 OK: Script executed successfully.<br />
* 400 Bad Request: Last URL part does not have <script-file>.<function> format, or an error occurred during script execution<br />
* 404 Not Found: Script file does not exist or does not contain the function<br />
<br />
Example request:<br />
<pre><br />
POST http://localhost:8080/smila/script/helloWorld.greetings<br />
{<br />
"name": "Juergen"<br />
}<br />
</pre><br />
<br />
Response:<br />
<pre><br />
{<br />
"name": "Juergen",<br />
"greetings": "Hello Juergen!"<br />
}<br />
</pre><br />
<br />
=== Examples ===<br />
A simple example using SMILA's default scripts for indexing and search.<br />
<br />
Add some records to the index:<br />
<pre><br />
POST http://localhost:8080/smila/script/add.process<br />
{<br />
"_recordid": "id1",<br />
"Title": "Scripting rules!",<br />
"Content": "yet another SMILA document",<br />
"MimeType": "text/plain"<br />
}<br />
</pre><br />
<br />
Search the index:<br />
<pre><br />
POST http://localhost:8080/smila/script/search.process<br />
{<br />
"query": "SMILA",<br />
"resultAttributes": ["Title", "Content"]<br />
}<br />
</pre><br />
<br />
Delete the first record from the index:<br />
<pre><br />
POST http://localhost:8080/smila/script/delete.process<br />
{<br />
"_recordid": "id1"<br />
}<br />
</pre></div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Importing/Crawler/JDBC&diff=363507SMILA/Documentation/Importing/Crawler/JDBC2014-05-26T09:13:25Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>JDBC Crawler and JDBC Fetcher worker are used for importing data via JDBC from a database. For the big picture and the workers' interaction, have a look at the [[SMILA/Documentation/Importing/Concept | Importing Concept]].<br />
<br />
=== JDBC Crawler ===<br />
<br />
The JDBC Crawler executes an SQL statement and crawls the result set, producing a record for each row of the result set.<br />
<br />
===== Configuration =====<br />
<br />
The JDBC Crawler worker is usually the first worker in a workflow and the job is started in <tt>runOnce</tt> mode.<br />
<br />
* Worker name: <tt>jdbcCrawler</tt><br />
* Parameters: <br />
** <tt>dataSource</tt>: ''(req.)'' value for attribute <tt>_source</tt>, needed e.g. by the delta service<br />
** <tt>dbUrl</tt>: ''(req.)'' database URL to connect to<br />
** <tt>dbProps</tt>: ''(opt.)'' properties used when connecting to the database. <br />
*** <tt>user</tt> ''(opt.)'' user name<br />
*** <tt>password</tt> ''(opt.)'' user password<br />
*** <tt>...</tt> ''(opt.)'' any property supported by the used JDBC driver<br />
** <tt>crawlSql</tt>: ''(req.)'' the SQL statement to execute to get the records<br />
** <tt>splitLimitsSql</tt>: ''(opt.)'' the SQL statement to determine limits for splitting the result set into smaller parts based on an integer column. See [[SMILA/Documentation/Importing/Crawler/JDBC#Splitting|Splitting]] for details.<br />
** <tt>splitIncrement</tt>: ''(opt., >0)'' the increment to use for splitting the integer range determined by <tt>splitLimitsSql</tt>. See [[SMILA/Documentation/Importing/Crawler/JDBC#Splitting|Splitting]] for details.<br />
** <tt>mapping</tt> ''(req.)'' specifies how to map database column names to record attributes or attachments.<br />
*** <tt>COLUMN-1</tt> ''(opt.)'' mapping of the first column to an attribute/attachment<br />
*** <tt>COLUMN-2</tt> ''(opt.)'' mapping of the second column to an attribute/attachment<br />
*** <tt>COLUMN-N</tt> ''(opt.)'' mapping of the last column to an attribute/attachment<br />
** <tt>idColumns</tt> ''(req.)'' a list of database column names used to generate the record ID from<br />
** <tt>deltaColumns</tt> ''(opt.)'' a list of database column names (upper-case) used to generate the value for attribute _deltaHash. If this is not set and some sort of deltaImportStrategy has been selected, the attributes provided in <tt>mapping</tt> will be used to generate the _deltaHash.<br />
** <tt>maxAttachmentSize</tt> ''(opt.)'' maximum accepted size of BLOB/CLOB column values (in bytes/characters) for attachment creation. Default is "1000000000" (1 billion). Larger values are skipped and a warning is written to the log.<br />
** parameters to control size of output bulks, see below for details<br />
*** <tt>maxRecordsPerBulk</tt> ''(opt., >0)'' maximum number of records in one bulk. (default: 1000)<br />
* Task generator: <tt>[[SMILA/Documentation/TaskGenerators#RunOnceTriggerTaskGenerator|runOnceTrigger]]</tt><br />
* Input slots:<br />
** <tt>splitsToCrawl</tt>: Descriptions of split-crawl tasks, see [[SMILA/Documentation/Importing/Crawler/JDBC#Splitting|Splitting]] for details<br />
* Output slots:<br />
** <tt>crawledRecords</tt>: Bulks with crawled records to be processed.<br />
** <tt>splitsToCrawl</tt>: Descriptions for split-crawl tasks, see [[SMILA/Documentation/Importing/Crawler/JDBC#Splitting|Splitting]] for details<br />
<br />
===== Processing =====<br />
<br />
The JDBC Crawler executes the <tt>crawlSql</tt> statement and produces one record per result row in the bucket connected to <tt>crawledRecords</tt>. Please note that internally, database column names are normalized to upper-case; in the configuration, however, any casing can be used. The resulting records contain only the values of the columns configured in the <tt>mapping</tt>. Whether a column is represented as an attribute or as an attachment depends on the type of the database column. Below is a table that summarizes the supported database types, how they are mapped to Java types, and whether they are represented as attributes or attachments:<br />
<br />
{| border=1<br />
|- bgcolor=grey<br />
!Database Type !! Java Type !! represented as<br />
|-<br />
| BIT || Boolean || attribute<br />
|-<br />
| BOOLEAN || Boolean || attribute<br />
|-<br />
| BIGINT || Long || attribute<br />
|-<br />
| INTEGER || Long || attribute<br />
|-<br />
| SMALLINT || Long || attribute<br />
|-<br />
| TINYINT || Long || attribute<br />
|-<br />
| DOUBLE || Double || attribute<br />
|-<br />
| FLOAT || Double || attribute<br />
|-<br />
| REAL || Double || attribute<br />
|-<br />
| DECIMAL || Double(scale>0) or Long || attribute<br />
|-<br />
| NUMERIC || Double(scale>0) or Long || attribute<br />
|-<br />
| DATE || Date || attribute<br />
|-<br />
| TIME || DateTime || attribute<br />
|-<br />
| TIMESTAMP || DateTime || attribute<br />
|-<br />
| CHAR || String || attribute<br />
|-<br />
| VARCHAR || String || attribute<br />
|-<br />
| NCHAR || String || attribute<br />
|-<br />
| NVARCHAR || String || attribute<br />
|-<br />
| ROWID || String || attribute<br />
|-<br />
| BINARY || byte[] || attachment<br />
|-<br />
| VARBINARY || byte[] || attachment<br />
|-<br />
| LONGVARBINARY || byte[] || attachment<br />
|-<br />
| BLOB || byte[] || attachment<br />
|-<br />
| CLOB || byte[] || attachment<br />
|-<br />
| NCLOB || byte[] || attachment<br />
|-<br />
| LONGNVARCHAR || byte[] (UTF-8 encoded) || attachment (because CLOB-like types are often reported by JDBC drivers as this type)<br />
|-<br />
| LONGVARCHAR || byte[] (UTF-8 encoded) || attachment (because CLOB-like types are often reported by JDBC drivers as this type)<br />
|-<br />
| NULL || - || no entry is generated<br />
|} <br />
<br />
The following types are not fully supported: <br />
*ARRAY<br />
*DATALINK<br />
*DISTINCT<br />
*JAVA_OBJECT<br />
*OTHER<br />
*REF<br />
*SQLXML<br />
*STRUCT<br />
The crawler tries to automatically convert any values into attributes of an appropriate data type. If this is not possible, an attachment with the bytes is generated.<br />
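As a quick illustration of the table above, a small lookup can answer whether a given JDBC type ends up as an attachment (a sketch derived from the table only, not SMILA's actual conversion code):<br />

```javascript
// Quick lookup derived from the type table above (illustration
// only): which JDBC column types are represented as attachments
// rather than attributes.
var ATTACHMENT_TYPES = [
  "BINARY", "VARBINARY", "LONGVARBINARY",
  "BLOB", "CLOB", "NCLOB", "LONGNVARCHAR", "LONGVARCHAR"
];

function isAttachment(jdbcTypeName) {
  return ATTACHMENT_TYPES.indexOf(jdbcTypeName.toUpperCase()) >= 0;
}

console.log(isAttachment("VARCHAR")); // false -> attribute
console.log(isAttachment("BLOB"));    // true  -> attachment
```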
<br />
The records are collected in bulks, whose size can be configured via the parameter <tt>maxRecordsPerBulk</tt>:<br />
* <tt>maxRecordsPerBulk</tt> has the same effect in any of the following cases:<br />
** ''not configured:'' a new <tt>crawledRecords</tt> bulk is started after 1000 records.<br />
** ''configured:'' a new <tt>crawledRecords</tt> bulk is started when the configured value is reached.<br />
<br />
Please note that <tt>maxRecordsPerBulk</tt> must be > 0. Otherwise your job will fail.<br />
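The bulk chunking described above amounts to simple slicing; a minimal sketch (illustration only, not SMILA's implementation):<br />

```javascript
// Sketch of the bulk chunking behavior described above
// (illustration only): records are grouped into bulks of at most
// maxRecordsPerBulk entries.
function toBulks(records, maxRecordsPerBulk) {
  if (maxRecordsPerBulk <= 0) {
    throw new Error("maxRecordsPerBulk must be > 0"); // job would fail
  }
  var bulks = [];
  for (var i = 0; i < records.length; i += maxRecordsPerBulk) {
    bulks.push(records.slice(i, i + maxRecordsPerBulk));
  }
  return bulks;
}

var bulks = toBulks(new Array(2500).fill("record"), 1000);
console.log(bulks.length);    // 3
console.log(bulks[2].length); // 500
```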
<br />
''Source'':<br />
The attribute <tt>_source</tt> is set from the task parameter <tt>dataSource</tt> which has no further meaning currently, but it is needed by the delta service.<br />
<br />
===== Splitting =====<br />
<br />
Reading data from very large tables in a single task can be problematic for two reasons:<br />
* Performance: Reading a large table sequentially can take very long, obviously. You will want to parallelize the process.<br />
* Memory: When accessing large result sets, the JDBC driver can cause excessive memory usage. It can be necessary to read the table in smaller portions.<br />
The ''Splitting'' feature of the JDBC crawler worker can be used to do this. Basically, it works like this:<br />
* In the initial crawl task the crawler determines only the size of the table instead of reading the rows. Therefore you need to provide an SQL statement as parameter "splitLimitsSql" that yields a single row with two integer values "min" and "max".<br />
* Then the crawler creates a series of smaller intervals that cover the complete [min,max] range. The bounds of each of these intervals are written to a record bulk of their own in output slot "splitsToCrawl". The size of the intervals is determined by the "splitIncrement" parameter. No "crawledRecords" will be created by this initial task.<br />
* Each "splitsToCrawl" bulk is then processed in a separate follow-up task using the "crawlSql" statement. It must contain two "?" as parameter placeholders, which will be filled by the crawler with the "min" (first "?") and "max" (second "?") value from the input record. All rows produced by the resulting statement will be mapped to records and written to "crawledRecords" just like in normal one-step crawling. <br />
<br />
Of course, you must take care that the "splitLimitsSql" and "crawlSql" statements are consistent and that the "crawlSql" statement really reads each row exactly once if executed repeatedly with the split-interval bounds. <br />
<br />
Both parameters "splitLimitsSql" and "splitIncrement" need to be set to enable splitting, and the crawler's "splitsToCrawl" output slot must be connected via a transient bucket to the input slot of the same name. Invalid parameter values (e.g. "splitIncrement" <= 0) or workflow configurations will cause the crawl job to fail. If you don't need splitting, you can omit the loopback connection from the crawler worker's "splitsToCrawl" output slot to its input slot in your workflow.<br />
<br />
'''Example'''<br />
<br />
As a simple example we want to crawl a very large table that has an INTEGER key column named ID. The "splitLimitsSql" statement gets the minimum and maximum value of this ID column, and the "crawlSql" statement restricts the result set for a task to a given upper and lower bound for this column:<br />
<br />
<pre><br />
{<br />
...<br />
"parameters": {<br />
...<br />
"splitLimitsSql": "SELECT min(ID) as MIN, max(ID) as MAX FROM VERY_LARGE_TABLE",<br />
"crawlSql": "SELECT * FROM VERY_LARGE_TABLE WHERE ID >= ? and ID <= ?",<br />
"splitIncrement": 10000,<br />
...<br />
"mapping": {<br />
// as usual<br />
}<br />
}<br />
}<br />
</pre><br />
<br />
The "splitLimitsSql" must deliver the limits in a single row with column names MIN and MAX. Let's assume it yields ''MIN=1'' and ''MAX=1000000''. Then, using the "splitIncrement" value of 10,000, the first task creates 100 splitsToCrawl records with the following bounds:<br />
* MIN=1, MAX=10,000<br />
* MIN=10,001, MAX=20,000<br />
* ...<br />
* MIN=990,001, MAX=1,000,000<br />
This means: for each split interval, MAX equals ''MIN+splitIncrement-1'', and the MIN of the next split interval is ''MAX+1'' of the previous interval. This is repeated until the MAX value of the record is equal to or greater than the MAX determined by the "splitLimitsSql" statement.<br />
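The interval arithmetic described above can be sketched in a few lines (an illustration of the arithmetic only, not SMILA's actual code):<br />

```javascript
// Sketch of the split-interval computation described above:
// MAX = MIN + splitIncrement - 1, next MIN = previous MAX + 1,
// repeated until MAX reaches or exceeds the overall limit.
function computeSplits(min, max, splitIncrement) {
  var splits = [];
  var lo = min;
  while (true) {
    var hi = lo + splitIncrement - 1;
    splits.push({ MIN: lo, MAX: hi });
    if (hi >= max) {
      break; // stop once MAX reaches or exceeds the overall limit
    }
    lo = hi + 1;
  }
  return splits;
}

var splits = computeSplits(1, 1000000, 10000);
console.log(splits.length);                  // 100
console.log(splits[0].MIN, splits[0].MAX);   // 1 10000
console.log(splits[99].MIN, splits[99].MAX); // 990001 1000000
```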
<br />
Each of these records is then used in a separate task to get a subset of the complete table and map it to "crawledRecords". This is done by using the MIN and MAX values of the input record as values for the first two parameters of a prepared statement created from "crawlSql", i.e. in place of the "?" characters in the "crawlSql" string. These tasks can be executed in parallel as allowed by the scaleUp parameter, which should greatly improve the crawl performance.<br />
<br />
In general, the column used for splitting does not need to be a (primary) key column, i.e. there can be multiple rows with the same value of the split column. For example, the column can have a very limited number of distinct values (e.g. some category key). Then you can use <tt>"splitIncrement": 1</tt> to create a split for each distinct value, so that all rows of the same "category" are fetched in a separate split-crawl task. <br />
<br />
Furthermore, it is of course not necessary that there is a row for each different split column value. However, this all means that the "splitIncrement" value does not necessarily have a relation to the number of rows fetched in this split crawl task.<br />
<br />
Also, the split ranges do not have to be read from "real" columns. You can do whatever is possible in SQL to generate those numbers, as long as you can provide a "crawlSql" statement that fetches the rows you want to have as actual records.<br />
<br />
The configuration of the jdbcFetcher is not affected by splitting.<br />
<br />
=== JDBC Fetcher ===<br />
<br />
For each input record, the JDBC Fetcher executes the <tt>fetchSql</tt> statement and adds the result rows to the record.<br />
<br />
===== Configuration =====<br />
<br />
* Worker name: <tt>jdbcFetcher</tt><br />
* Parameters: <br />
** <tt>dbUrl</tt>: ''(req.)'' database URL to connect to<br />
** <tt>dbProps</tt>: ''(opt.)'' properties used when connecting to the database. <br />
*** <tt>user</tt> ''(opt.)'' user name<br />
*** <tt>password</tt> ''(opt.)'' user password<br />
*** <tt>...</tt> ''(opt.)'' any property supported by the used JDBC driver<br />
** <tt>fetchSql</tt>: ''(opt.)'' the SQL statement executed to fetch the data which is used to enrich the crawled input record. It may contain one or more '?' as parameter placeholders, see [http://docs.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html PreparedStatement]. If the <tt>fetchSql</tt> parameter isn't set, the originally crawled record is written unchanged to the output.<br />
** <tt>fetchParameterAttributes</tt>: ''(opt.)'' a list of record attribute names whose values are used in the given order as parameters (substitutes for the '?') in the <tt>fetchSql</tt> statement. If <tt>fetchParameterAttributes</tt> isn't set, the record attributes which were mapped from the <tt>idColumns</tt> are used as <tt>fetchSql</tt> statement parameter substitutes.<br />
** <tt>mapping</tt> ''(req.)'' specifies how to map database column names to record attributes or attachments. Please note that database column names are normalized to upper-case.<br />
*** <tt>COLUMN-1</tt> ''(opt.)'' mapping of the first column to an attribute/attachment<br />
*** <tt>COLUMN-2</tt> ''(opt.)'' mapping of the second column to an attribute/attachment<br />
*** <tt>COLUMN-N</tt> ''(opt.)'' mapping of the last column to an attribute/attachment<br />
** <tt>idColumns</tt> ''(opt.)'' If <tt>fetchParameterAttributes</tt> isn't set, the record attributes which were mapped from the <tt>idColumns</tt> DB columns are used as parameter (substitutes for the '?') in the <tt>fetchSql</tt> statement. In this case the id columns must also be specified in the mapping.<br />
** <tt>maxAttachmentSize</tt> ''(opt.)'' maximum accepted size of BLOB/CLOB column values (in bytes/characters) for attachment creation. Default is "1000000000" (1 billion). Larger values are skipped and a warning is written to the log.<br />
* Input slots:<br />
** <tt>recordsToFetch</tt><br />
* Output slots:<br />
** <tt>fetchedRecords</tt><br />
<br />
===== Processing =====<br />
<br />
The JDBC Fetcher is used to enrich the input records with the data selected by the <tt>fetchSql</tt> statement. The <tt>fetchSql</tt> is executed as [http://docs.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html PreparedStatement]. So it can have parameters ('?') which will be replaced by either the values of the record attributes specified by the <tt>fetchParameterAttributes</tt> or (per default) by the values of the record attributes which were mapped from the <tt>idColumns</tt>.<br />
<br />
All columns that are selected by the <tt>fetchSql</tt> query and are mapped in the <tt>mapping</tt> section will enrich the crawled record. That means, the mapped attributes are added to the record's metadata, binary data (BLOB, CLOB) will be added as attachments. Have a look at the table above for a description of the mapping from database types to Java types.<br />
<br />
In general, the <tt>fetchSql</tt> query will return exactly one row. If it returns no row at all (i.e. an empty result), the originally crawled record is written unchanged to the output. If it returns more than one row, the values of the rows are merged (e.g. as lists) before being added to the input record.<br />
<br />
If one of the attributes used as fetch parameters is not set or does not have a simple value, but a sequence or a map, it is skipped and no additional data will be fetched for this record. The record will just be written unchanged to the output.<br />
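The merge behavior for multi-row fetch results can be sketched as follows (a plain-JavaScript illustration of the idea only, not SMILA's implementation; the column and attribute names are invented):<br />

```javascript
// Sketch of how multi-row fetch results could be merged into the
// record (illustration only): a single row sets a plain value,
// further rows turn the attribute into a list.
function mergeRows(record, rows, mapping) {
  rows.forEach(function (row) {
    Object.keys(mapping).forEach(function (column) {
      var attribute = mapping[column];
      var value = row[column];
      if (value === undefined) return;
      if (record[attribute] === undefined) {
        record[attribute] = value;     // first row: plain value
      } else if (Array.isArray(record[attribute])) {
        record[attribute].push(value); // further rows: extend the list
      } else {
        record[attribute] = [record[attribute], value];
      }
    });
  });
  return record;
}

var record = { _recordid: "id1" };
mergeRows(record, [{ BODY: "part1" }, { BODY: "part2" }],
          { BODY: "body" });
console.log(record.body); // list of both row values
```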
<br />
=== Sample JDBC crawl job ===<br />
<br />
This example uses the following fictitious tables of database EmailDB:<br />
<br />
{| border=1<br />
|+ Emails<br />
|- bgcolor=grey<br />
! Sender !! Receiver !! Subject !! SendDate !! BodyId<br />
|}<br />
<br />
{| border=1<br />
|+ EmailBodies<br />
|- bgcolor=grey<br />
! BodyId !! Body<br />
|}<br />
<br />
<br />
<pre><br />
{<br />
"name":"crawlJdbcJob",<br />
"workflow":"jdbcCrawling",<br />
"parameters":{<br />
"dataSource":"emails",<br />
"dbUrl":"jdbc:derby:memory:EmailDB",<br />
"dbProps":<br />
{<br />
"user": "admin",<br />
"password": "topsecret"<br />
},<br />
"crawlSql":"SELECT * FROM Emails",<br />
"fetchSql":"SELECT Body FROM EmailBodies WHERE BodyId=?",<br />
"fetchParameterAttributes": "bodyReference",<br />
"idColumns": ["Sender", "Receiver"],<br />
"deltaColumns": "SendDate",<br />
"mapping":{<br />
"Sender":"From",<br />
"Receiver":"To", <br />
"Subject":"Title", <br />
"SendDate":"lastModified", <br />
"BodyId":"bodyReference"<br />
},<br />
"jobToPushTo":"indexUpdateJob",<br />
"tempStore": "temp"<br />
}<br />
}<br />
</pre><br />
<br />
=== Adding JDBC Drivers ===<br />
<br />
By default SMILA includes only JDBC drivers for Derby. If you want to access other databases, you have to provide the appropriate JDBC drivers.<br />
Have a look [[SMILA/Documentation/Adding_JDBC_Drivers|here]] to learn how to add JDBC drivers to SMILA.<br />
<br />
<br />
[[Category:SMILA]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets&diff=342887SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets2013-07-09T07:22:03Z<p>Marco.strack.empolis.com: /* org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet */</p>
<hr />
<div>This page describes the SMILA pipelets provided by bundle <tt>org.eclipse.smila.processing.pipelets</tt>.<br />
<br />
== General ==<br />
<br />
All pipelets in this bundle support the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.<br />
<br />
''' Read Type '''<br />
* ''runtime'': Parameters are read when processing records. Parameter value can be set per Record.<br />
* ''init'': Parameters are read once from Pipelet configuration when initializing the Pipelet. Parameter value can not be overwritten in Record.<br />
<br />
== org.eclipse.smila.processing.pipelets.CommitRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Commits each record in the ''input'' variable on the blackboard to the storages. Can be used to save the records immediately during the workflow instead of only when a workflow has been finished.<br />
<br />
=== Configuration ===<br />
<br />
none.<br />
<br />
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==<br />
<br />
Adds values to an attribute in the processed records. If the attribute does not already contain a sequence, the current value will be wrapped in one before the new values are added.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to add values to<br />
|-<br />
|''valuesToAdd''<br />
|Anything, usually a value or a sequence of values<br />
|runtime<br />
|The values to add<br />
|}<br />
<br />
=== Example ===<br />
<br />
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.<br />
<br />
<source lang="xml"><br />
<proc:invokePipelet name="addValuesToNonExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">out</rec:Val><br />
<rec:Seq key="valuesToAdd"><br />
<rec:Val>value1</rec:Val><br />
<rec:Val>value2</rec:Val><br />
</rec:Seq><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.SetValuePipelet ==<br />
<br />
Sets a value for an attribute in every processed record. If the attribute already exists, it is not changed by default. Useful for initializing required attributes.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to set the value for<br />
|-<br />
|''value''<br />
|anything<br />
|runtime<br />
|The constant value to set for the attribute (a map or sequence is possible, too)<br />
|-<br />
|''overwrite''<br />
|boolean<br />
|runtime<br />
|Indicates to overwrite any value that the attribute contains already (optional, defaults to false)<br />
|}<br />
<br />
=== Example ===<br />
<br />
This sets a map containing two values into attribute1, even if there is already a value in that attribute.<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="setMapForExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">attribute1</rec:Val><br />
<rec:Val key="overwrite" type="boolean">true</rec:Val><br />
<rec:Map key="value"><br />
<rec:Val key="key1">value1</rec:Val><br />
<rec:Val key="key2">value2</rec:Val><br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.RemoveAttributePipelet ==<br />
<br />
Removes an attribute from each record. <br />
<br />
=== Configuration ===<br />
<br />
The configuration property is either read from the <tt>_parameters</tt> attribute of a record or from the pipelet configuration. If not set at all, the record remains unchanged.<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''removeAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to remove<br />
|}<br />
<br />
=== Example === <br />
<br />
To remove the complete structure in attribute <tt>_parameters</tt>, use: <br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="removeParameters"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.RemoveAttributePipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="removeAttribute">_parameters</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FilterPipelet ==<br />
<br />
Copies only those record IDs to the result which match a configurable regular expression in a configurable single-valued attribute. This is useful for conditional processing while pushing multiple records through the pipeline in a single request: Instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable and use this variable as the input variable for the following pipelets. You can still use the original BPEL variable in the BPEL <tt><reply></tt> activity at the end of the pipeline to return all records as the final result.<br />
<br />
=== Configuration ===<br />
The configuration properties are read either from the <tt>_parameters</tt> attribute of each record or from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''filterAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to match<br />
|-<br />
|''filterExpression''<br />
|A string value<br />
|runtime<br />
|The regular expression to match the attribute value against<br />
|}<br />
<br />
=== Example === <br />
<br />
To get only those records in the <tt>textRecords</tt> BPEL variable that have a MimeType starting with <tt>text</tt> something like this could be used:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeFilterPipelet"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" /><br />
<proc:variables input="request" output="textRecords" /><br />
<proc:configuration><br />
<rec:Val key="filterAttribute">MimeType</rec:Val><br />
<rec:Val key="filterExpression">text/.+</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.HtmlToTextPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts plain text and metadata from an HTML document contained in an attribute or attachment of each record and writes the results to configurable attributes or attachments.<br />
<br />
The pipelet uses the CyberNeko HTML parser [http://nekohtml.sourceforge.net/ NekoHTML] to parse HTML documents.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the HTML input is found in an attachment or in an attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the plain text should be stored in an attachment or in an attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|Name of input attachment or path to input attribute (process literals of attribute)<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
|Name of output attachment or path to output attribute for plain text (store result as literals of attribute)<br />
|-<br />
|''defaultEncoding''<br />
|String<br />
|runtime<br />
|Optional, default encoding to apply to documents when not specified in the documents themselves<br />
|-<br />
|''removeContentTags''<br />
|String<br />
|runtime<br />
|Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.<br />
|-<br />
|''meta:<name>''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><META></tt> tag with ''name="<name>"'' (case insensitive) to the attribute named as the value of the property. E.g. a property named ''"meta:author"'' with value "authors" causes the content attributes of <tt><META name="author" content="..."></tt> tags to be stored in the attribute ''authors'' of the respective record.<br />
|-<br />
|''tag:title''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><TITLE></tt> tag to the attribute named as the value of the property.<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration extracts plain text from the HTML document in attachment ''"html"'' and stores the results to the attribute ''"text"''. It removes the complete content of heading tags <tt><nowiki><h1>, ..., <h4></nowiki></tt>. In addition to that, it looks for <tt><meta></tt> tags with names ''"author"'' and ''"keywords"'' and stores their contents in attributes ''"authors"'' and ''"keywords"'', respectively:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeHtml2Txt"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">html</rec:Val><br />
<rec:Val key="outputName">text</rec:Val><br />
<rec:Val key="defaultEncoding">UTF-8</rec:Val><br />
<rec:Val key="meta:author">author</rec:Val><br />
<rec:Val key="meta:keywords">keywords</rec:Val><br />
<rec:Val key="meta:title">title</rec:Val><br />
<rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.CopyPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to copy or move attribute values to other attributes or to copy or move a string value between attributes and/or attachments. It supports two execution modes:<br />
* COPY: copy the value from the input attribute/attachment to the output attribute/attachment <br />
* MOVE: same as COPY, but after that delete the value from the input attribute/attachment<br />
When an attribute is copied to another attribute, the type remains the same. When copying an attachment to an attribute, a string value is created by assuming that the attachment contains text in UTF-8 encoding. When copying an attribute value to an attachment, the attribute must contain a single value, which is interpreted as a string and converted to a byte array using UTF-8 encoding.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if the input is found in an attachment or attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|name of input attachment or input attribute<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
| name of output attachment or output attribute<br />
|-<br />
|''mode''<br />
|String : ''COPY, MOVE''<br />
|runtime<br />
| execution mode. Copy the value or move (copy and delete) the value. Default is COPY.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':<br />
<br />
<source lang="xml"><br />
<!-- copy txt from attachment to attribute --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeCopyContent"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="outputName">TextContent</rec:Val><br />
<rec:Val key="mode">COPY</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts literal values from an attribute that has a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes: <br />
*FIRST: selects only the first literal of the specified attribute<br />
*LAST: selects only the last literal of the specified attribute<br />
*ALL_AS_LIST: selects all literal values of the specified attribute and returns a list<br />
*ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
<b>Note</b>:<br />
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.<br />
<br />
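The path resolution and the first-element rule for sequences can be sketched in plain JavaScript (outside SMILA, with a hypothetical record structure; this is an illustration, not the actual implementation):<br />
<br />
```javascript
// Sketch of how a path such as "Contents/Value" is resolved over nested
// maps; when a step yields a sequence (array), its first element is used.
function resolve(record, path) {
  var current = record;
  var steps = path.split("/");
  for (var i = 0; i < steps.length; i++) {
    if (Array.isArray(current)) {
      current = current[0]; // first element of a sequence is used
    }
    current = current[steps[i]];
  }
  return current;
}

// Hypothetical feed-like record: attribute "Contents" is a sequence of maps.
var record = { Contents: [ { Value: "first" }, { Value: "second" } ] };
var extracted = resolve(record, "Contents/Value");
```
<br />
Here <tt>extracted</tt> is <tt>"first"</tt>, because only the first map of the <tt>Contents</tt> sequence is considered while walking the path.<br />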
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputPath''<br />
|String<br />
|runtime<br />
|the path to the input attribute with Literals<br />
|-<br />
|''outputPath''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently)<br />
|-<br />
|''mode''<br />
|String : ''FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE''<br />
|runtime<br />
| execution mode. See above for details.<br />
|-<br />
|''separator''<br />
|String<br />
|runtime<br />
| the separation string used for mode ALL_AS_ONE. Default is a blank<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values into one:<br />
<br />
<source lang="xml"><br />
<!-- extract content --><br />
<extensionActivity><br />
<proc:invokePipelet name="extract content"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputPath">Contents/Value</rec:Val><br />
<rec:Val key="outputPath">Content</rec:Val><br />
<rec:Val key="mode">ALL_AS_ONE</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ReplacePipelet ==<br />
<br />
=== Description ===<br />
<br />
Searches for one or more patterns in the literal value of an attribute and substitutes the found occurrences by the configured replacements. <br />
<br />
You can choose from different matching types:<br />
<br />
* ''entity'': Every pattern is matched against the whole attribute value (with respect to the ''ignoreCase'' property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.<br />
* ''substring'': All patterns that are part of the attribute value are replaced.<br />
* ''regexp'': Interpret all patterns as [http://en.wikipedia.org/wiki/Regular_expression regular expression], see [http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String) Matcher#replaceAll(String)]<br />
<br />
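The difference between the matching types can be sketched in plain JavaScript (outside SMILA), using a hypothetical mapping of pattern ''"de"'' to replacement ''"German"'' applied to the value ''"demode"'':<br />
<br />
```javascript
var value = "demode";
var pattern = "de";
var replacement = "German";

// "entity": the pattern must match the whole value; here it does not,
// so the value stays unchanged.
var entity = (value === pattern) ? replacement : value;

// "substring": every literal occurrence of the pattern is replaced.
var substring = value.split(pattern).join(replacement);

// "regexp": the pattern is treated as a regular expression and all
// matches are replaced (analogous to Java's Matcher#replaceAll).
var regexp = value.replace(new RegExp(pattern, "g"), replacement);
```
<br />
Here <tt>entity</tt> stays <tt>"demode"</tt>, while both <tt>substring</tt> and <tt>regexp</tt> become <tt>"GermanmoGerman"</tt>.<br />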
This pipelet works only on attributes, not on attachments!<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute that contains the literal to search in<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the result value in (as a string); defaults to the input attribute<br />
|-<br />
|''type''<br />
|String : ''entity'', ''substring'', ''regexp''<br />
|init<br />
|Identifies the type of the pattern, see above for details. Defaults to ''substring''.<br />
|-<br />
|''ignoreCase''<br />
|Boolean<br />
|init<br />
|indicates that the case is ignored when matching patterns, defaults to ''false''.<br />
|-<br />
|''mapping''<br />
|Map<br />
|init<br />
|A mapping of multiple patterns and replacements. Each key is a pattern and its value the replacement.<br />
|-<br />
|''pattern''<br />
|String<br />
|init<br />
|the pattern to apply to the literal value (see above for a description of possible types), required if no mapping is given<br />
|-<br />
|''replacement''<br />
|String<br />
|init<br />
|the substitution string used to replace all occurrences of the pattern, defaults to the empty string<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to map language ids to their label:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="set language label"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">Language</rec:Val><br />
<rec:Val key="outputAttribute">LanguageLabel</rec:Val><br />
<rec:Val key="type">entity</rec:Val><br />
<rec:Val key="ignoreCase" type="boolean">true</rec:Val><br />
<rec:Map key="mapping"><br />
<rec:Val key="de">German</rec:Val><br />
<rec:Val key="en">English</rec:Val><br />
<rec:Val key="es">Spanish</rec:Val><br />
<rec:Val key="fr">French</rec:Val><br />
...<br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
This configuration can be used to cut the time information from a timestamp:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="cut time"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">ModificationTime</rec:Val><br />
<rec:Val key="outputAttribute">ModificationDate</rec:Val><br />
<rec:Val key="type">regexp</rec:Val><br />
<rec:Val key="pattern">[T ].*</rec:Val><br />
<rec:Val key="replacement"></rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ScriptPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes a script for each record. <br />
<br />
Execution is handled by the [http://en.wikipedia.org/wiki/Scripting_for_the_Java_Platform Java Scripting API (JSR 223)], so any compatible scripting engine can be used. JavaScript is available out of the box and is the default script language.<br />
<br />
The context of the script will contain the following variables:<br />
* ''blackboard'': a reference to the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/blackboard/Blackboard.html blackboard]<br />
* ''id'': the ID of the current record<br />
* ''record'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/AnyMap.html metadata] of the current record<br />
* ''results'': a slightly modified version of a [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/util/ResultCollector.html result collector] that provides methods to add a new record id to the list of result ids (''results.addResult('...id...')'') and to drop the current record from the same list (''results.excludeCurrentRecord()'')<br />
* ''parameterAccessor'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/parameters/ParameterAccessor.html ParameterAccessor] instance for access to the configuration (e.g. ''parameterAccessor.getParameterAny("configMap").asMap().getLongValue("longValue")'').<br />
<br />
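The context variables above can be illustrated with a small sketch. Note that the ''record'' and ''results'' objects are mocked here, since the real ones are provided by SMILA at runtime, and the attribute name ''MimeType'' is just an assumption:<br />
<br />
```javascript
// Mocked stand-ins for the context objects that SMILA injects into the
// script; only the methods used below are sketched.
var record = {
  _data: { MimeType: "image/png" },
  get: function (name) { return this._data[name]; }
};
var results = {
  excluded: false,
  excludeCurrentRecord: function () { this.excluded = true; }
};

// Script body: drop every record whose MimeType does not start with "text/".
var mimeType = record.get("MimeType") || "";
if (mimeType.indexOf("text/") !== 0) {
  results.excludeCurrentRecord();
}
```
<br />
Used as an inline <tt>script</tt> (without the mocks), such a snippet would remove non-text records from the pipelet's result list.<br />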
Please be aware that the intention of this pipelet is to write pipelines fast, but not to write fast pipelines - the script is parsed for every record. Don't use it for production environments where performance matters, but use it to develop an algorithm that you can put into [[SMILA/Development_Guidelines/How_to_write_a_Pipelet|your own pipelet]].<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''type''<br />
|String<br />
|init<br />
|the MIME type of the scripting language, defaults to "text/javascript"<br />
|-<br />
|''scriptFile''<br />
|String<br />
|runtime<br />
|the path of the file that contains the script - modifications of this file are observed on every execution of the pipelet<br />
|-<br />
|''script''<br />
|String<br />
|init<br />
|The "inline" script, required unless ''scriptFile'' is specified (ignored in that case)<br />
|-<br />
|''resultAttribute''<br />
|String<br />
|runtime<br />
|The name of an attribute that will receive the result of the script (usually the result of the last expression)<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to concatenate the values of two attributes and save the result into a third one:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="create full name"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="script">record.get("firstName") + " " + record.get("lastName")</rec:Val><br />
<rec:Val key="resultAttribute">fullName</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
This configuration can be used to execute a JavaScript file from $SMILA_PATH$/configuration/example.js:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="execute script"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="scriptFile">configuration/example.js</rec:Val> <br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ExecPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes an external program for each record. <br />
<br />
This pipelet may be used to integrate native programs into the pipeline. <br />
<br />
'''Attention''': This pipelet may lead to security issues! Although the executed command cannot be changed at runtime (this parameter is only evaluated at initialization time), it is possible to change the arguments and input of the command using values in the processed record. Every pipeline developer should ensure that only arguments in the expected value range are processed (especially if the program accepts files from the file system as arguments).<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''command''<br />
|String<br />
|init<br />
|The program to execute (including its path in the file system).<br />
|-<br />
|''directory''<br />
|String<br />
|runtime<br />
|The (optional) working directory for the command. The SMILA directory is used if not given.<br />
|-<br />
|''parameters''<br />
|Sequence of strings<br />
|runtime<br />
|The optional parameters given to the program (ignored if the attribute named by ''parametersAttribute'' exists in the record).<br />
|-<br />
|''parametersAttribute''<br />
|String<br />
|runtime<br />
|The optional name of the attribute that contains the sequence of parameters given to the program.<br />
|-<br />
|''inputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that contains the bytes to send as input for the program.<br />
|-<br />
|''outputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the standard output of the program.<br />
|-<br />
|''exitCodeAttribute''<br />
|String<br />
|runtime<br />
|The name of the attribute that is filled with the exit code of the program.<br />
|-<br />
|''errorAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the error output of the program.<br />
|-<br />
|''failOnError''<br />
|Either a boolean or a sequence of strings<br />
|runtime<br />
|Indicates whether to mark a record as failed if the program returns an error code: either a sequence of exit code ranges, or a boolean where "true" means that every exit code except 0 is an error. Defaults to false.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to execute FFMPEG for transformation of an MP3 input file into a WAV output file:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="ConvertMP3"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ExecPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="command">.../ffmpeg</rec:Val><br />
<rec:Seq key="parameters"><br />
<rec:Val>-i</rec:Val><br />
<rec:Val>.../example.mp3</rec:Val><br />
<rec:Val>-ar</rec:Val><br />
<rec:Val>16000</rec:Val><br />
<rec:Val>.../example.wav</rec:Val><br />
</rec:Seq><br />
<rec:Val key="failOnError" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet is used to identify the MIME type of a document. <br />
It uses an <tt>[[SMILA/Documentation/MimeTypeIdentifier| org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier]]</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.<br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''FileExtensionAttribute''||String||init||Optional||Name of the attribute containing the file extension<br />
|-<br />
|''ContentAttachment''||String||init||Optional||Name of the attachment containing the file content<br />
|-<br />
|''MetaDataAttribute''||String||init||Optional||Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information<br />
|-<br />
|''MimeTypeAttribute''||String||init||Required||Name of the attribute to store the identified MIME type to<br />
|}<br />
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!<br />
<br />
=== Example ===<br />
<br />
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect MimeType"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="FileExtensionAttribute">Extension</rec:Val><br />
<rec:Val key="MetaDataAttribute">MetaData</rec:Val><br />
<rec:Val key="MimeTypeAttribute">MimeType</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<br />
== org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet identifies the language of textual input and stores the returned ISO 639 language code to a target attribute. It uses an <tt>org.eclipse.smila.common.language.LanguageIdentifier</tt> service to perform the actual identification. If the identification does not return a language, the specified <tt>DefaultLanguage</tt> (or <tt>DefaultAlternativeName</tt>) is returned. If no defaults are specified, no value is set.<br />
<br />
The pipelet returns the detected language as an ISO 639 code. If you need special language tags in your application, the pipelet can produce an alternative language code according to a configurable mapping. To define such a mapping, create the file <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt>. The following shows an exemplary mapping:<br />
<br />
<source lang="text"><br />
de=german<br />
en=english<br />
es=spanish<br />
fi=finnish<br />
fr=french<br />
</source><br />
<br />
The pipelet uses [http://tika.apache.org/ Apache Tika] technology for the actual language detection. <br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''ContentAttribute''||String||runtime||Required||Name of the attribute containing the text whose language should be identified<br />
|-<br />
|''LanguageAttribute''||String||runtime||Optional||Name of the attribute to store the code of the identified language to<br />
|-<br />
|''DefaultLanguage''||String||runtime||Optional||Language code to set if no language could be detected. If not set and no language could be identified, the <tt>LanguageAttribute</tt> attribute remains empty.<br />
|-<br />
|''AlternativeNameAttribute''||String||runtime||Optional||Name of the attribute to store the alternative language code of the identified language to. The mapping defining this alternative code must be located in <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt> (see above).<br />
|-<br />
|''DefaultAlternativeName''||String||runtime||Optional||Alternative language code to set if no language could be detected. If not set and no language could be identified, the attribute named in <tt>AlternativeNameAttribute</tt> remains empty. <br />
|-<br />
|''UseCertainLanguagesOnly''||Boolean||runtime||Optional||Boolean flag indicating whether to apply only those languages that were identified with a reasonable certainty (true) or all (false). Default is false.<br />
|}<br />
<br />
<br />
=== Example ===<br />
<br />
The following example could be used to identify the language of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect Language"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="ContentAttribute">Content</rec:Val><br />
<rec:Val key="LanguageAttribute">Language</rec:Val><br />
<rec:Val key="DefaultLanguage">de</rec:Val><br />
<rec:Val key="AlternativeNameAttribute">AltLanguage</rec:Val><br />
<rec:Val key="DefaultAlternativeName">german</rec:Val><br />
<rec:Val key="UseCertainLanguagesOnly">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to read content from a file and add it as an attachment.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the file to read from<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to store the content <br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
<source lang="xml"><br />
<!-- read from file and add attachment --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeReadFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to write the content of an attachment to a file.<br />
<br />
If the attachment does not exist, a warning is logged, but the record is not dropped.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the target file<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to write to the file <br />
|-<br />
|''append''<br />
|Boolean<br />
|runtime<br />
|Indicates to append the attachment to the file (if it exists already), defaults to false<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example saves all bytes of the attachment "content" to the file path that is contained in the attribute "path".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
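<br />
With the configuration above and a hypothetical record, the effect can be sketched as:<br />
<source lang="javascript"><br />
input  : { "path": "/tmp/out.txt" }  // with an attachment "content" holding some bytes<br />
result : the record is unchanged; the attachment bytes are written to the file /tmp/out.txt<br />
</source><br />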
<br />
== org.eclipse.smila.processing.pipelets.PushRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Sends all current records to another (asynchronous) job.<br />
<br />
The records are not removed from the pipeline - thus a following pipelet in the current pipeline will process the records as well.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''job''<br />
|String <br />
|init<br />
|The name of the target job.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example sends all current records to the job "TheOtherJob".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="callJob"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.PushRecordsPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="job">TheOtherJob</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
Fills attributes of the record from a JSON string.<br />
<br />
It is not possible to overwrite the record id of the record, even if a key "_recordid" exists in the JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is found in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|init<br />
|name of the input attachment or input attribute that contains the JSON string<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|init<br />
|the optional name of the attribute in the record into which the generated object is put. If no attribute is specified and the object is a map, all contained attributes are written to the current record.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
The following examples use this input object:<br />
<source lang="javascript"><br />
{ "jsonString": "{\"attribute1\": \"value1\"}" }<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the attribute "jsonObject":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
<rec:Val key="outputAttribute">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"jsonObject": { <br />
"attribute1": "value1"<br />
}<br />
}<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the object itself:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"attribute1": "value1"<br />
}<br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
Writes some or all attributes of the record into a JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttributes''<br />
|String/Sequence of String<br />
|init<br />
|the names of the attributes in the record that contain the objects to write into JSON. If nothing is given, the whole record is used. If only a string is given, the content of that attribute is used.<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is written to an attachment or attribute of the record<br />
|-<br />
|''outputName''<br />
|String<br />
|init<br />
|name of the target attachment or attribute<br />
|-<br />
|''printPretty''<br />
|Boolean<br />
|init<br />
|Whether to format the output for better readability; defaults to true.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This example writes the content of attribute "a1" into the attribute "value" without any whitespace:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttributes">a1</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="outputName">value</rec:Val><br />
<rec:Val key="printPretty" type="boolean">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<source lang="javascript"><br />
input : { "a1": [ 1 ], "a2": 2 }<br />
result : { "a1": [ 1 ], "a2": 2, "value": "[1]" }<br />
</source><br />
<br />
This example appends the whole object to the file "records.log":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONLogEntry"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">jsonLog</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONFileName"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">jsonFile</rec:Val><br />
<rec:Val key="value">records.log</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="appendToJSONLog"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">jsonFile</rec:Val><br />
<rec:Val key="contentAttachment">jsonLog</rec:Val><br />
<rec:Val key="append" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
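<br />
Stepping through the three pipelets for a hypothetical record, the chain works roughly like this:<br />
<source lang="javascript"><br />
input record             : { "a1": 1 }<br />
after createJSONLogEntry : attachment "jsonLog" contains the record serialized as JSON<br />
after createJSONFileName : { "a1": 1, "jsonFile": "records.log" }<br />
after appendToJSONLog    : the content of attachment "jsonLog" is appended to the file records.log<br />
</source><br />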
<br />
== org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet ==<br />
<br />
Splits a single input document into multiple separate output records; think of a book and its pages. The splitter copies the attribute values of the document to each output record. However, if an attribute exists both on the enclosing document and on a sub record, the resulting record keeps its own attribute value instead of the document's. In short: record values beat document values.<br />
Document splitting is only applied if the input record carries the configured partsAttribute (see ''configuration''). If the input record does not carry that attribute, it is passed on unchanged. If the attribute exists but contains no values, it is removed.<br />
<br />
If splitting is applied, each output record receives an additional "_documentId" attribute containing the "_recordid" of the enclosing document. The effective "_recordid" of an output record is built from "_documentId" and the page number, joined by the string "###". Example:<br />
<br />
<source lang="javascript"><br />
{<br />
"_documentId": "book.pdf",<br />
"_recordid": "book.pdf###0",<br />
...<br />
}<br />
</source><br />
<br />
=== Configuration ===<br />
<br />
The configuration property is read from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''partsAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute where the single pages of the document are contained. <br />
|}<br />
<br />
=== Example === <br />
<br />
Imagine this example input to be split by the pipelet:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"subPages":<br />
[<br />
{<br />
"content": "public spending must be pro-cyclical",<br />
"author": "adam smith"<br />
},<br />
{<br />
"content": "public spending must be counter-cyclical"<br />
}<br />
]<br />
}<br />
</source><br />
<br />
The configuration then looks as follows:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="splitDocument"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="partsAttribute">subPages</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The output will be two separate records:<br />
<br />
<source lang="javascript"><br />
<br />
[<br />
{<br />
"_recordid": "document0.pdf###0",<br />
"_documentId": "document0.pdf",<br />
"author": "adam smith",<br />
"content": "public spending must be pro-cyclical"<br />
},<br />
{<br />
"_recordid": "document0.pdf###1",<br />
"_documentId": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"content": "public spending must be counter-cyclical" <br />
}<br />
]<br />
<br />
</source><br />
<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>
<hr />
<div>This page describes the SMILA pipelets provided by bundle <tt>org.eclipse.smila.processing.pipelets</tt>.<br />
<br />
== General ==<br />
<br />
All pipelets in this bundle support the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.<br />
<br />
''' Read Type '''<br />
* ''runtime'': Parameters are read when processing records. Parameter value can be set per Record.<br />
* ''init'': Parameters are read once from Pipelet configuration when initializing the Pipelet. Parameter value can not be overwritten in Record.<br />
<br />
== org.eclipse.smila.processing.pipelets.CommitRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Commits each record in the ''input'' variable on the blackboard to the storages. Can be used to save the records immediately during the workflow instead of only when a workflow has been finished.<br />
<br />
=== Configuration ===<br />
<br />
none.<br />
<br />
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==<br />
<br />
Adds one or more values to an attribute in the processed records. If the attribute does not already contain a sequence, the current value is wrapped in one before the new values are added.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to add values to<br />
|-<br />
|''valuesToAdd''<br />
|Anything, usually a value or a sequence of values<br />
|runtime<br />
|The values to add<br />
|}<br />
<br />
=== Example ===<br />
<br />
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.<br />
<br />
<source lang="xml"><br />
<proc:invokePipelet name="addValuesToNonExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">out</rec:Val><br />
<rec:Seq key="valuesToAdd"><br />
<rec:Val>value1</rec:Val><br />
<rec:Val>value2</rec:Val><br />
</rec:Seq><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</source><br />
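<br />
Assuming the attribute "out" already holds a single value (hypothetical), the wrapping described above yields:<br />
<source lang="javascript"><br />
input  : { "out": "value0" }<br />
result : { "out": [ "value0", "value1", "value2" ] }<br />
</source><br />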
<br />
== org.eclipse.smila.processing.pipelets.SetValuePipelet ==<br />
<br />
Sets a value for an attribute in every processed record. If the attribute already exists, it is not changed by default. Useful for initializing required attributes.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to set the value for<br />
|-<br />
|''value''<br />
|anything<br />
|runtime<br />
|The constant value to set for the attribute (a map or sequence is possible, too)<br />
|-<br />
|''overwrite''<br />
|boolean<br />
|runtime<br />
|Indicates to overwrite any value that the attribute contains already (optional, defaults to false)<br />
|}<br />
<br />
=== Example ===<br />
<br />
This sets a map containing two values into attribute1, even if there is already a value in that attribute.<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="setMapForExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">attribute1</rec:Val><br />
<rec:Val key="overwrite" type="boolean">true</rec:Val><br />
<rec:Map key="value"><br />
<rec:Val key="key1">value1</rec:Val><br />
<rec:Val key="key2">value2</rec:Val><br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
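<br />
For a hypothetical input record, the result would be:<br />
<source lang="javascript"><br />
input  : { "attribute1": "old value" }<br />
result : { "attribute1": { "key1": "value1", "key2": "value2" } }  // replaced because overwrite is true<br />
</source><br />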
<br />
== org.eclipse.smila.processing.pipelets.RemoveAttributePipelet ==<br />
<br />
Removes an attribute from each record. <br />
<br />
=== Configuration ===<br />
<br />
The configuration property is either read from the <tt>_parameters</tt> attribute of a record or from the pipelet configuration. If not set at all, the record remains unchanged.<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''removeAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to remove<br />
|}<br />
<br />
=== Example === <br />
<br />
To remove the complete structure in attribute <tt>_parameters</tt>, use: <br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="removeParameters"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.RemoveAttributePipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="removeAttribute">_parameters</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
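<br />
For a hypothetical input record, the result would be:<br />
<source lang="javascript"><br />
input  : { "_parameters": { "foo": "bar" }, "Title": "some document" }<br />
result : { "Title": "some document" }<br />
</source><br />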
<br />
== org.eclipse.smila.processing.pipelets.FilterPipelet ==<br />
<br />
Copies only those record IDs to the result which match a configurable regular expression in a configurable single-valued attribute. This is useful for conditional processing while pushing multiple records through the pipeline in a single request: instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable and use this variable as the input variable for the next pipelets. You can still use the original BPEL variable in the BPEL <tt><reply></tt> activity at the end of the pipeline to return all records as the final result.<br />
<br />
=== Configuration ===<br />
The configuration properties are read either from the <tt>_parameters</tt> attribute of each record or from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''filterAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to match<br />
|-<br />
|''filterExpression''<br />
|A string value<br />
|runtime<br />
|The regular expression to match the attribute value against<br />
|}<br />
<br />
=== Example === <br />
<br />
To get only those records in the <tt>textRecords</tt> BPEL variable that have a MimeType starting with <tt>text</tt>, something like the following could be used:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeFilterPipelet"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" /><br />
<proc:variables input="request" output="textRecords" /><br />
<proc:configuration><br />
<rec:Val key="filterAttribute">MimeType</rec:Val><br />
<rec:Val key="filterExpression">text/.+</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
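<br />
Given two hypothetical records in the input variable, only the first one is copied to <tt>textRecords</tt>:<br />
<source lang="javascript"><br />
input       : [ { "MimeType": "text/html" }, { "MimeType": "image/png" } ]<br />
textRecords : [ { "MimeType": "text/html" } ]<br />
</source><br />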
<br />
== org.eclipse.smila.processing.pipelets.HtmlToTextPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts plain text and metadata from an HTML document held in an attribute or attachment of each record and writes the results to configurable attributes or attachments.<br />
<br />
The pipelet uses the CyberNeko HTML parser [http://nekohtml.sourceforge.net/ NekoHTML] to parse HTML documents.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the HTML input is found in an attachment or in an attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the plain text should be stored in an attachment or in an attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|Name of input attachment or path to input attribute (process literals of attribute)<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
|Name of output attachment or path to output attribute for plain text (store result as literals of attribute)<br />
|-<br />
|''defaultEncoding''<br />
|String<br />
|runtime<br />
|Optional, default encoding to apply to documents when not specified in the documents themselves<br />
|-<br />
|''removeContentTags''<br />
|String<br />
|runtime<br />
|Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.<br />
|-<br />
|''meta:<name>''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><META></tt> tag with ''name="<name>"'' (case insensitive) to the attribute named as the value of the property. E.g. a property named ''"meta:author"'' with value "authors" causes the content attributes of <tt><META name="author" content="..."></tt> tags to be stored in the attribute ''authors'' of the respective record.<br />
|-<br />
|''tag:title''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><TITLE></tt> tag to the attribute named as the value of the property.<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration extracts plain text from the HTML document in attachment ''"html"'' and stores the result in the attribute ''"text"''. It removes the complete content of the heading tags <tt><nowiki><h1>, ..., <h4></nowiki></tt>. In addition, it looks for <tt><meta></tt> tags named ''"author"'', ''"keywords"'', and ''"title"'' and stores their contents in the attributes ''"author"'', ''"keywords"'', and ''"title"'', respectively:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeHtml2Txt"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">html</rec:Val><br />
<rec:Val key="outputName">text</rec:Val><br />
<rec:Val key="defaultEncoding">UTF-8</rec:Val><br />
<rec:Val key="meta:author">author</rec:Val><br />
<rec:Val key="meta:keywords">keywords</rec:Val><br />
<rec:Val key="meta:title">title</rec:Val><br />
<rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
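<br />
For a small hypothetical HTML attachment, the configuration above would produce roughly the following (exact whitespace handling may differ):<br />
<source lang="javascript"><br />
attachment "html" : <html><head><meta name="title" content="Hello"><meta name="author" content="Jane Doe"></head><br />
                    <body><h1>Heading</h1><p>Some text.</p></body></html><br />
result : { "title": "Hello", "author": "Jane Doe", "text": "Some text." }  // <h1> content removed<br />
</source><br />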
<br />
== org.eclipse.smila.processing.pipelets.CopyPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to copy or move attribute values to other attributes or to copy or move a string value between attributes and/or attachments. It supports two execution modes:<br />
* COPY: copy the value from the input attribute/attachment to the output attribute/attachment <br />
* MOVE: same as COPY, but after that delete the value from the input attribute/attachment<br />
When an attribute is copied to another attribute, the type remains the same. When copying an attachment to an attribute, a string value is created by assuming that the attachment is text in UTF-8 encoding. When copying an attribute value to an attachment, the attribute must be single-valued; the value is interpreted as a string and converted to a byte array using UTF-8 encoding.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if the input is found in an attachment or attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|name of input attachment or input attribute<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
| name of output attachment or output attribute<br />
|-<br />
|''mode''<br />
|String : ''COPY, MOVE''<br />
|runtime<br />
| execution mode. Copy the value or move (copy and delete) the value. Default is COPY.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':<br />
<br />
<source lang="xml"><br />
<!-- copy txt from attachment to attribute --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeCopyContent"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="outputName">TextContent</rec:Val><br />
<rec:Val key="mode">COPY</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
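<br />
For a hypothetical record, the effect can be sketched as:<br />
<source lang="javascript"><br />
input  : record with attachment "Content" holding the UTF-8 bytes of "some text"<br />
result : { "TextContent": "some text" }  // attachment "Content" is kept because mode is COPY<br />
</source><br />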
<br />
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts literal values from an attribute that contains a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes: <br />
*FIRST: selects only the first literal of the specified attribute<br />
*LAST: selects only the last literal of the specified attribute<br />
*ALL_AS_LIST: selects all literal values of the specified attribute and returns a list<br />
*ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
<b>Note</b>:<br />
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputPath''<br />
|String<br />
|runtime<br />
|the path to the input attribute with Literals<br />
|-<br />
|''outputPath''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently)<br />
|-<br />
|''mode''<br />
|String : ''FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE''<br />
|runtime<br />
| execution mode. See above for details.<br />
|-<br />
|''separator''<br />
|String<br />
|runtime<br />
| the separation string used for mode ALL_AS_ONE. Default is a blank<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration can be applied to records provided by the FeedAgent. It shows how to access the sub-attribute 'Value' of attribute 'Contents', concatenating all values into one string:<br />
<br />
<source lang="xml"><br />
<!-- extract content --><br />
<extensionActivity><br />
<proc:invokePipelet name="extract content"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputPath">Contents/Value</rec:Val><br />
<rec:Val key="outputPath">Content</rec:Val><br />
<rec:Val key="mode">ALL_AS_ONE</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
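<br />
Applied to a hypothetical record, mode ALL_AS_ONE joins all literal values of the sub-attribute with a blank:<br />
<source lang="javascript"><br />
input  : { "Contents": { "Type": "text/plain", "Value": [ "first part", "second part" ] } }<br />
result : { "Contents": { ... }, "Content": "first part second part" }<br />
</source><br />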
<br />
== org.eclipse.smila.processing.pipelets.ReplacePipelet ==<br />
<br />
=== Description ===<br />
<br />
Searches for one or more patterns in the literal value of an attribute and replaces the occurrences found with the configured replacements. <br />
<br />
You can choose from different matching types:<br />
<br />
* ''entity'': Every pattern is matched against the whole attribute value (with respect to the ''ignoreCase'' property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.<br />
* ''substring'': All patterns that are part of the attribute value are replaced.<br />
* ''regexp'': Interpret all patterns as [http://en.wikipedia.org/wiki/Regular_expression regular expression], see [http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String) Matcher#replaceAll(String)]<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute that contains the literal to search in<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the result value as string, defaults to the input attribute<br />
|-<br />
|''type''<br />
|String : ''entity'', ''substring'', ''regexp''<br />
|init<br />
|Identifies the type of the pattern, see above for details. Defaults to ''substring''.<br />
|-<br />
|''ignoreCase''<br />
|Boolean<br />
|init<br />
|indicates that the case is ignored when matching patterns, defaults to ''false''.<br />
|-<br />
|''mapping''<br />
|Map<br />
|init<br />
|A mapping of multiple patterns and replacements. Each key is a pattern and its value the replacement.<br />
|-<br />
|''pattern''<br />
|String<br />
|init<br />
|the pattern to apply to the literal value (see above for a description of possible types), required if no mapping is given<br />
|-<br />
|''replacement''<br />
|String<br />
|init<br />
|the substitution string used to replace all occurrences of the pattern, defaults to the empty string<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to map language ids to their label:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="set language label"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">Language</rec:Val><br />
<rec:Val key="outputAttribute">LanguageLabel</rec:Val><br />
<rec:Val key="type">entity</rec:Val><br />
<rec:Val key="ignoreCase" type="boolean">true</rec:Val><br />
<rec:Map key="mapping"><br />
<rec:Val key="de">German</rec:Val><br />
<rec:Val key="en">English</rec:Val><br />
<rec:Val key="es">Spanish</rec:Val><br />
<rec:Val key="fr">French</rec:Val><br />
...<br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
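<br />
For a hypothetical input record, the result would be:<br />
<source lang="javascript"><br />
input  : { "Language": "DE" }<br />
result : { "Language": "DE", "LanguageLabel": "German" }  // matched despite the case because ignoreCase is true<br />
</source><br />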
<br />
This configuration can be used to cut the time information from a timestamp:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="cut time"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">ModificationTime</rec:Val><br />
<rec:Val key="outputAttribute">ModificationDate</rec:Val><br />
<rec:Val key="type">regexp</rec:Val><br />
<rec:Val key="pattern">[T ].*</rec:Val><br />
<rec:Val key="replacement"></rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
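<br />
For a hypothetical timestamp, the regular expression cuts everything from the first "T" or blank onwards:<br />
<source lang="javascript"><br />
input  : { "ModificationTime": "2013-07-09T07:21:39Z" }<br />
result : { "ModificationTime": "2013-07-09T07:21:39Z", "ModificationDate": "2013-07-09" }<br />
</source><br />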
<br />
== org.eclipse.smila.processing.pipelets.ScriptPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes a script for each record. <br />
<br />
Script execution is handled by the [http://en.wikipedia.org/wiki/Scripting_for_the_Java_Platform Java Scripting API (JSR 223)], so any compatible scripting engine can be used. JavaScript is available "out of the box" and is the default script language.<br />
<br />
The context of the script will contain the following variables:<br />
* ''blackboard'': a reference to the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/blackboard/Blackboard.html blackboard]<br />
* ''id'': the ID of the current record<br />
* ''record'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/AnyMap.html metadata] of the current record<br />
* ''results'': a slightly modified version of a [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/util/ResultCollector.html result collector] that provides methods to add a new record id to the list of result ids (''results.addResult('...id...')'') and to drop the current record from the same list (''results.excludeCurrentRecord()'')<br />
* ''parameterAccessor'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/parameters/ParameterAccessor.html ParameterAccessor] instance for access to the configuration (e.g. ''parameterAccessor.getParameterAny("configMap").asMap().getLongValue("longValue")'').<br />
<br />
Please be aware that the intention of this pipelet is to write pipelines fast, not to write fast pipelines: the script is parsed for every record. Don't use it in production environments where performance matters; use it to develop an algorithm that you can then put into [[SMILA/Development_Guidelines/How_to_write_a_Pipelet|your own pipelet]].<br />
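The context variables can be sketched as follows. This is a hedged example of an inline script body, wrapped with tiny stand-ins for the ''record'' and ''results'' context variables so it can be tried outside SMILA; the stand-ins are assumptions, not the SMILA API.<br />

```javascript
// Stand-ins for the script context (assumptions for illustration only):
const record = new Map([["MimeType", "text/plain"], ["Title", "hello"]]);
const results = {
  excluded: false,
  excludeCurrentRecord() { this.excluded = true; }
};

// The script body itself: drop every record that is not a text document.
if (!String(record.get("MimeType")).startsWith("text/")) {
  results.excludeCurrentRecord();
}

console.log(results.excluded); // false for a "text/plain" record
```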
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''type''<br />
|String<br />
|init<br />
|The MIME type of the scripting language; defaults to "text/javascript"<br />
|-<br />
|''scriptFile''<br />
|String<br />
|runtime<br />
|the path of the file that contains the script - modifications of this file are observed on every execution of the pipelet<br />
|-<br />
|''script''<br />
|String<br />
|init<br />
|The "inline" script, required unless ''scriptFile'' is specified (ignored in that case)<br />
|-<br />
|''resultAttribute''<br />
|String<br />
|runtime<br />
|The name of an attribute that will receive the result of the script (usually the result of the last expression)<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to concatenate the values of two attributes and save the result into a third one:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="create full name"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="script">record.get("firstName") + " " + record.get("lastName")</rec:Val><br />
<rec:Val key="resultAttribute">fullName</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
This configuration can be used to execute a JavaScript file from $SMILA_PATH$/configuration/example.js:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="execute script"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="scriptFile">configuration/example.js</rec:Val> <br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ExecPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes an external program for each record. <br />
<br />
This pipelet may be used to integrate native programs into the pipeline. <br />
<br />
'''Attention''': This pipelet may lead to security issues! Although the executed command cannot be changed at runtime (the ''command'' parameter is only evaluated at initialization time), the arguments and input of the command can be changed via values in the processed record. Every pipeline developer should ensure that only arguments in the expected value range are processed (especially if the program accepts files from the file system as arguments).<br />
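The kind of argument validation meant here can be sketched as follows; the function and the pattern are illustrative assumptions, not part of the SMILA API.<br />

```javascript
// Restrict a record value to the expected range before it reaches an
// external command (illustrative sketch, not SMILA code).
function isSafeSampleRate(value) {
  // accept only plain positive integers, e.g. for ffmpeg's -ar parameter
  return /^[1-9][0-9]*$/.test(String(value));
}

console.log(isSafeSampleRate("16000"));      // true
console.log(isSafeSampleRate("; rm -rf /")); // false
```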
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''command''<br />
|String<br />
|init<br />
|The program to execute (including its path in the file system).<br />
|-<br />
|''directory''<br />
|String<br />
|runtime<br />
|The (optional) working directory for the command. The SMILA directory is used if not given.<br />
|-<br />
|''parameters''<br />
|Sequence of strings<br />
|runtime<br />
|The optional parameters given to the program (ignored if the attribute named by ''parametersAttribute'' exists).<br />
|-<br />
|''parametersAttribute''<br />
|String<br />
|runtime<br />
|The optional name of the attribute that contains the sequence of parameters given to the program.<br />
|-<br />
|''inputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that contains the bytes to send as input for the program.<br />
|-<br />
|''outputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the standard output of the program.<br />
|-<br />
|''exitCodeAttribute''<br />
|String<br />
|runtime<br />
|The name of the attribute that is filled with the exit code of the program.<br />
|-<br />
|''errorAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the error output of the program.<br />
|-<br />
|''failOnError''<br />
|Either a boolean or a sequence of strings<br />
|runtime<br />
|Indicates whether to mark a record as failed if the program returns an error code: either a sequence of exit code ranges, or a boolean where "true" means that every exit code except 0 is an error. Defaults to false.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to execute FFMPEG for transformation of an MP3 input file into a WAV output file:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="ConvertMP3"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ExecPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="command">.../ffmpeg</rec:Val><br />
<rec:Seq key="parameters"><br />
<rec:Val>-i</rec:Val><br />
<rec:Val>.../example.mp3</rec:Val><br />
<rec:Val>-ar</rec:Val><br />
<rec:Val>16000</rec:Val><br />
<rec:Val>.../example.wav</rec:Val><br />
</rec:Seq><br />
<rec:Val key="failOnError" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet is used to identify the MIME type of a document. <br />
It uses an <tt>[[SMILA/Documentation/MimeTypeIdentifier| org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier]]</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.<br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''FileExtensionAttribute''||String||init||Optional||Name of the attribute containing the file extension<br />
|-<br />
|''ContentAttachment''||String||init||Optional||Name of the attachment containing the file content<br />
|-<br />
|''MetaDataAttribute''||String||init||Optional||Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information<br />
|-<br />
|''MimeTypeAttribute''||String||init||Required||Name of the attribute to store the identified MIME type to<br />
|}<br />
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!<br />
<br />
=== Example ===<br />
<br />
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect MimeType"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="FileExtensionAttribute">Extension</rec:Val><br />
<rec:Val key="MetaDataAttribute">MetaData</rec:Val><br />
<rec:Val key="MimeTypeAttribute">MimeType</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<br />
== org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet identifies the language of textual input and stores the returned ISO 639 language code to some target attribute. It uses an <tt>org.eclipse.smila.common.language.LanguageIdentifier</tt> service to perform the actual identification. If the identification does not return a language, the specified <tt>DefaultLanguage</tt> (or <tt>DefaultAlternativeName</tt>) is returned. If no defaults are specified, no value is set.<br />
<br />
The pipelet returns the detected language as an ISO 639 code. If your application needs special language tags, the pipelet can produce<br />
an alternative language code according to a configurable mapping. To define such a mapping, create the file <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt>. The following shows an example mapping:<br />
<br />
<source lang="text"><br />
de=german<br />
en=english<br />
es=spanish<br />
fi=finnish<br />
fr=french<br />
</source><br />
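How such a <tt>.properties</tt> mapping resolves an ISO 639 code to its alternative name can be sketched in a few lines; the parsing logic is illustrative, not the SMILA implementation.<br />

```javascript
// Parse key=value lines into a lookup table (illustrative sketch):
const mappingText = "de=german\nen=english\nes=spanish\nfi=finnish\nfr=french";
const mapping = Object.fromEntries(
  mappingText.split("\n").map(line => line.split("=").map(s => s.trim())));

console.log(mapping["en"]); // "english"
console.log(mapping["fi"]); // "finnish"
```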
<br />
The pipelet uses [http://tika.apache.org/ Apache Tika] technology for the actual language detection. <br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''ContentAttribute''||String||runtime||Required||Name of the attribute containing the text whose language should be identified<br />
|-<br />
|''LanguageAttribute''||String||runtime||Optional||Name of the attribute to store the code of the identified language to<br />
|-<br />
|''DefaultLanguage''||String||runtime||Optional||Language code to set if no language could be detected. If not set and no language could be identified, the <tt>LanguageAttribute</tt> attribute remains empty.<br />
|-<br />
|''AlternativeNameAttribute''||String||runtime||Optional||Name of the attribute to store the alternative language code of the identified language to. The mapping defining this alternative code must be located in <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt> (see above).<br />
|-<br />
|''DefaultAlternativeName''||String||runtime||Optional||Alternative language code to set if no language could be detected. If not set and no language could be identified, the attribute named by <tt>AlternativeNameAttribute</tt> remains empty. <br />
|-<br />
|''UseCertainLanguagesOnly''||Boolean||runtime||Optional||Boolean flag indicating whether to apply only those languages that were identified with a reasonable certainty (true) or all (false). Default is false.<br />
|}<br />
<br />
<br />
=== Example ===<br />
<br />
The following example could be used to identify the language of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect Language"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="ContentAttribute">Content</rec:Val><br />
<rec:Val key="LanguageAttribute">Language</rec:Val><br />
<rec:Val key="DefaultLanguage">de</rec:Val><br />
<rec:Val key="AlternativeNameAttribute">AltLanguage</rec:Val><br />
<rec:Val key="DefaultAlternativeName">german</rec:Val><br />
<rec:Val key="UseCertainLanguagesOnly">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to read content from a file and add it as an attachment.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the file to read from<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to store the content <br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
<source lang="xml"><br />
<!-- read from file and add attachment --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeReadFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to write the content of an attachment to a file.<br />
<br />
If the attachment does not exist a warning is logged, but the record will not be dropped.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the target file<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to write to the file <br />
|-<br />
|''append''<br />
|Boolean<br />
|runtime<br />
|Indicates whether to append the attachment to the file (if it already exists); defaults to false<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example saves all bytes of the attachment "content" to the file path that is contained in the attribute "path".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.PushRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Sends all current records to another (asynchronous) job.<br />
<br />
The records are not removed from the pipeline - thus a following pipelet in the current pipeline will process the records as well.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''job''<br />
|String <br />
|init<br />
|The name of the target job.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example sends all current records to the job "TheOtherJob".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="callJob"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.PushRecordsPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="job">TheOtherJob</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
Fills attributes of the record from a JSON string.<br />
<br />
It is not possible to overwrite the record id of the record, even if a key "_recordid" exists in the JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is found in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|init<br />
|name of the input attachment or input attribute that contains the JSON string<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|init<br />
|the optional name of the attribute in the record into which the generated object is put. If no attribute is specified and the object is a map, all contained attributes are written to the current record.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
The following examples use this input object:<br />
<source lang="javascript"><br />
{ "jsonString": "{\"attribute1\": \"value1\"}" }<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the attribute "jsonObject":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
<rec:Val key="outputAttribute">jsonObject</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"jsonObject": { <br />
"attribute1": "value1"<br />
}<br />
}<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the object itself:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"attribute1": "value1"<br />
}<br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
Writes some or all attributes of the record into a JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttributes''<br />
|String/Sequence of String<br />
|init<br />
|the names of the attributes in the record that contain the objects to write into JSON. If nothing is given, the whole record is used. If only a string is given, the content of that attribute is used.<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is written to an attachment or attribute of the record<br />
|-<br />
|''outputName''<br />
|String<br />
|init<br />
|name of the target attachment or attribute<br />
|-<br />
|''printPretty''<br />
|Boolean<br />
|init<br />
|Indicates whether to format the output for better readability; defaults to true.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This example writes the content of attribute "a1" into the attribute "value" without any whitespaces:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttributes">a1</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="outputName">value</rec:Val><br />
<rec:Val key="printPretty" type="boolean">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<source lang="javascript"><br />
input : { "a1": [ 1 ], "a2": 2 }<br />
result : { "a1": [ 1 ], "a2": 2, "value": "[1]" }<br />
</source><br />
<br />
This example appends the whole object to the file "records.log":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONLogEntry"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">jsonLog</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONFileName"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">jsonFile</rec:Val><br />
<rec:Val key="value">records.log</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="appendToJSONLog"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">jsonFile</rec:Val><br />
<rec:Val key="contentAttachment">jsonLog</rec:Val><br />
<rec:Val key="append" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet ==<br />
<br />
Splits a single input document into multiple separate output records; think of a book and its pages as an example. The splitter copies the attribute values of the document to each record. However, if an attribute exists both on the enclosing document and on a sub-record, the resulting record keeps its own attribute value instead of the document's one. In short: record values beat document values.<br />
Splitting is only applied if the input record carries the configured ''partsAttribute'' (see ''Configuration''). If the input record does not carry that attribute, it is passed on unchanged. If the attribute exists but has no values, it is removed.<br />
<br />
If splitting is applied, each output record receives an additional "_documentId" attribute containing the "_recordid" of the enclosing document. The effective "_recordid" of each output record is an aggregate of the "_documentId" and the page number, joined by the string "###". Example:<br />
<br />
<source lang="javascript"><br />
{<br />
"_documentId": "book.pdf",<br />
"_recordid": "book.pdf###0",<br />
...<br />
}<br />
</source><br />
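The splitting and merge rule described above can be sketched as follows; the function is an illustrative assumption, not the SMILA implementation, but it reproduces the documented "record values beat document values" behavior and the "###" id scheme.<br />

```javascript
// Copy document attributes to each part; a part's own value wins on conflict.
function splitDocument(doc, partsAttribute) {
  const { [partsAttribute]: parts, _recordid, ...docAttrs } = doc;
  return (parts || []).map((part, i) => ({
    _documentId: _recordid,
    _recordid: _recordid + "###" + i,
    ...docAttrs,   // document-level values first ...
    ...part        // ... so record-level values overwrite them
  }));
}

const out = splitDocument(
  { _recordid: "book.pdf", author: "doc author",
    pages: [{ content: "p0", author: "page author" }, { content: "p1" }] },
  "pages");

console.log(out[0].author);    // "page author" (record value wins)
console.log(out[1].author);    // "doc author" (inherited from the document)
console.log(out[1]._recordid); // "book.pdf###1"
```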
<br />
<br />
<br />
=== Configuration ===<br />
<br />
The configuration property is read from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''partsAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute where the single pages of the document are contained. <br />
|}<br />
<br />
=== Example === <br />
<br />
Imagine this example input to be split by the pipelet:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"subPages":<br />
[<br />
{<br />
"content": "public spending must be pro-cyclical",<br />
"author": "adam smith"<br />
},<br />
{<br />
"content": "public spending must be counter-cyclical"<br />
}<br />
]<br />
}<br />
</source><br />
<br />
The configuration must be as follows then:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="splitDocument"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="partsAttribute">subPages</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The output will be two separate records:<br />
<br />
<source lang="javascript"><br />
<br />
[<br />
{<br />
"_recordid": "document0.pdf###0",<br />
"_documentId": "document0.pdf",<br />
"author": "adam smith",<br />
"content": "public spending must be pro-cyclical"<br />
},<br />
{<br />
"_recordid": "document0.pdf###1",<br />
"_documentId": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"content": "public spending must be counter-cyclical" <br />
}<br />
]<br />
<br />
</source><br />
<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets2013-07-08T12:38:02Z<p>Marco.strack.empolis.com: /* org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet */</p>
<hr />
<div>This page describes the SMILA pipelets provided by bundle <tt>org.eclipse.smila.processing.pipelets</tt>.<br />
<br />
== General ==<br />
<br />
All pipelets in this bundle support the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.<br />
<br />
''' Read Type '''<br />
* ''runtime'': Parameters are read when processing records. Parameter value can be set per Record.<br />
* ''init'': Parameters are read once from Pipelet configuration when initializing the Pipelet. Parameter value can not be overwritten in Record.<br />
<br />
== org.eclipse.smila.processing.pipelets.CommitRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Commits each record in the ''input'' variable on the blackboard to the storages. Can be used to save the records immediately during the workflow instead of only when a workflow has been finished.<br />
<br />
=== Configuration ===<br />
<br />
none.<br />
<br />
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==<br />
<br />
Adds one or more values to an attribute in the processed records. If the attribute does not already contain a sequence, the current value is wrapped in one before the new values are added.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to add values to<br />
|-<br />
|''valuesToAdd''<br />
|Anything, usually a value or a sequence of values<br />
|runtime<br />
|The values to add<br />
|}<br />
<br />
=== Example ===<br />
<br />
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.<br />
<br />
<source lang="xml"><br />
<proc:invokePipelet name="addValuesToNonExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">out</rec:Val><br />
<rec:Seq key="valuesToAdd"><br />
<rec:Val>value1</rec:Val><br />
<rec:Val>value2</rec:Val><br />
</rec:Seq><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</source><br />
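The wrapping behavior described above can be sketched as follows; the function is an illustrative assumption, not SMILA code.<br />

```javascript
// Wrap a non-sequence value in a sequence before appending the new values.
function addValues(current, valuesToAdd) {
  const seq = current === undefined ? []
            : Array.isArray(current) ? current.slice() : [current];
  return seq.concat(valuesToAdd);
}

console.log(addValues("old", ["value1", "value2"])); // [ 'old', 'value1', 'value2' ]
console.log(addValues(undefined, ["value1"]));       // [ 'value1' ]
```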
<br />
== org.eclipse.smila.processing.pipelets.SetValuePipelet ==<br />
<br />
Sets a value for an attribute in every processed record. If the attribute already exists, it is not changed by default. Useful for initializing required attributes.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to set the value for<br />
|-<br />
|''value''<br />
|anything<br />
|runtime<br />
|The constant value to set for the attribute (a map or sequence is possible, too)<br />
|-<br />
|''overwrite''<br />
|boolean<br />
|runtime<br />
|Indicates to overwrite any value that the attribute contains already (optional, defaults to false)<br />
|}<br />
<br />
=== Example ===<br />
<br />
This sets a map containing two values into attribute1, even if there is already a value in that attribute.<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="setMapForExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">attribute1</rec:Val><br />
<rec:Val key="overwrite" type="boolean">true</rec:Val><br />
<rec:Map key="value"><br />
<rec:Val key="key1">value1</rec:Val><br />
<rec:Val key="key2">value2</rec:Val><br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
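The overwrite semantics described above can be sketched as follows; the function is an illustrative assumption, not SMILA code.<br />

```javascript
// Replace an existing attribute only when 'overwrite' is true.
function setValue(record, attr, value, overwrite = false) {
  if (overwrite || !(attr in record)) {
    record[attr] = value;
  }
  return record;
}

console.log(setValue({ a: 1 }, "a", 2).a);       // 1 (kept, overwrite defaults to false)
console.log(setValue({ a: 1 }, "a", 2, true).a); // 2
```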
<br />
== org.eclipse.smila.processing.pipelets.RemoveAttributePipelet ==<br />
<br />
Removes an attribute from each record. <br />
<br />
=== Configuration ===<br />
<br />
The configuration property is either read from the <tt>_parameters</tt> attribute of a record or from the pipelet configuration. If not set at all, the record remains unchanged.<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''removeAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to remove<br />
|}<br />
<br />
=== Example === <br />
<br />
To remove the complete structure in attribute <tt>_parameters</tt>, use: <br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="removeParameters"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.RemoveAttributePipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="removeAttribute">_parameters</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FilterPipelet ==<br />
<br />
Copies only those record IDs to the result whose configurable single-valued attribute matches a configurable regular expression. This is useful for conditional processing while pushing multiple records through the pipeline in a single request: instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable and use this variable as the input variable for the following pipelets. You can still use the original BPEL variable in the BPEL <tt><reply></tt> activity at the end of the pipeline to return all records as the final result.<br />
<br />
=== Configuration ===<br />
The configuration properties are read either from the <tt>_parameters</tt> attribute of each record or from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''filterAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to match<br />
|-<br />
|''filterExpression''<br />
|A string value<br />
|runtime<br />
|The regular expression to match the attribute value against<br />
|}<br />
<br />
=== Example === <br />
<br />
To get only those records in the <tt>textRecords</tt> BPEL variable that have a MimeType starting with <tt>text</tt>, something like the following could be used:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeFilterPipelet"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" /><br />
<proc:variables input="request" output="textRecords" /><br />
<proc:configuration><br />
<rec:Val key="filterAttribute">MimeType</rec:Val><br />
<rec:Val key="filterExpression">text/.+</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
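The effect of the filter corresponds to a regular-expression match on each record's attribute value. Sketched in plain JavaScript with hypothetical MIME type values (assuming full-match semantics for the expression):<br />

```javascript
// Records whose MimeType value matches text/.+ are kept;
// all others are left out of the textRecords variable.
var mimeTypes = ["text/html", "image/png", "text/plain"];
var kept = mimeTypes.filter(function (m) {
  return /^text\/.+$/.test(m); // full match against text/.+
});
// kept is ["text/html", "text/plain"]
```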
<br />
== org.eclipse.smila.processing.pipelets.HtmlToTextPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts plain text and metadata from an HTML document contained in an attribute or attachment of each record and writes the results to configurable attributes or attachments.<br />
<br />
The pipelet uses the CyberNeko HTML parser [http://nekohtml.sourceforge.net/ NekoHTML] to parse HTML documents.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the HTML input is found in an attachment or in an attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the plain text should be stored in an attachment or in an attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|Name of input attachment or path to input attribute (process literals of attribute)<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
|Name of output attachment or path to output attribute for plain text (store result as literals of attribute)<br />
|-<br />
|''defaultEncoding''<br />
|String<br />
|runtime<br />
|Optional, default encoding to apply to documents when not specified in the documents themselves<br />
|-<br />
|''removeContentTags''<br />
|String<br />
|runtime<br />
|Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.<br />
|-<br />
|''meta:<name>''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><META></tt> tag with ''name="<name>"'' (case insensitive) to the attribute named as the value of the property. E.g. a property named ''"meta:author"'' with value "authors" causes the content attributes of <tt><META name="author" content="..."></tt> tags to be stored in the attribute ''authors'' of the respective record.<br />
|-<br />
|''tag:title''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><TITLE></tt> tag to the attribute named as the value of the property.<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration extracts plain text from the HTML document in attachment ''"html"'' and stores the results to the attribute ''"text"''. It removes the complete content of heading tags <tt><nowiki><h1>, ..., <h4></nowiki></tt>. In addition to that, it looks for <tt><meta></tt> tags with names ''"author"'' and ''"keywords"'' and stores their contents in attributes ''"authors"'' and ''"keywords"'', respectively:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeHtml2Txt"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">html</rec:Val><br />
<rec:Val key="outputName">text</rec:Val><br />
<rec:Val key="defaultEncoding">UTF-8</rec:Val><br />
<rec:Val key="meta:author">author</rec:Val><br />
<rec:Val key="meta:keywords">keywords</rec:Val><br />
<rec:Val key="meta:title">title</rec:Val><br />
<rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.CopyPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to copy or move attribute values to other attributes or to copy or move a string value between attributes and/or attachments. It supports two execution modes:<br />
* COPY: copy the value from the input attribute/attachment to the output attribute/attachment <br />
* MOVE: same as COPY, but after that delete the value from the input attribute/attachment<br />
When an attribute is copied to another attribute, the type remains the same. When copying an attachment to an attribute, a string value is created by assuming that the attachment contains text in UTF-8 encoding. When copying an attribute value to an attachment, the attribute must be single-valued; its value is interpreted as a string and converted to a byte array using UTF-8 encoding.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if the input is found in an attachment or attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|name of input attachment or input attribute<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
| name of output attachment or output attribute<br />
|-<br />
|''mode''<br />
|String : ''COPY, MOVE''<br />
|runtime<br />
| execution mode. Copy the value or move (copy and delete) the value. Default is COPY.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':<br />
<br />
<source lang="xml"><br />
<!-- copy txt from attachment to attribute --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeCopyContent"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="outputName">TextContent</rec:Val><br />
<rec:Val key="mode">COPY</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts literal values from an attribute that contains a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes: <br />
*FIRST: selects only the first literal of the specified attribute<br />
*LAST: selects only the last literal of the specified attribute<br />
*ALL_AS_LIST: selects all literal values of the specified attribute and returns a list<br />
*ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
<b>Note</b>:<br />
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputPath''<br />
|String<br />
|runtime<br />
|the path to the input attribute with Literals<br />
|-<br />
|''outputPath''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently)<br />
|-<br />
|''mode''<br />
|String : ''FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE''<br />
|runtime<br />
| execution mode. See above for details.<br />
|-<br />
|''separator''<br />
|String<br />
|runtime<br />
| the separation string used for mode ALL_AS_ONE. Default is a blank<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values to one:<br />
<br />
<source lang="xml"><br />
<!-- extract content --><br />
<extensionActivity><br />
<proc:invokePipelet name="extract content"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputPath">Contents/Value</rec:Val><br />
<rec:Val key="outputPath">Content</rec:Val><br />
<rec:Val key="mode">ALL_AS_ONE</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
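To illustrate the effect of the configuration above, here is a sketch in plain JavaScript with a hypothetical record (mode ALL_AS_ONE, default separator: a blank):<br />

```javascript
// Hypothetical record before the pipelet runs:
var record = { "Contents": { "Value": ["first paragraph", "second paragraph"] } };

// With inputPath "Contents/Value" and mode ALL_AS_ONE, the pipelet
// concatenates all literal values using the separator and stores the
// result in the outputPath attribute "Content":
record["Content"] = record["Contents"]["Value"].join(" ");
// record["Content"] is now "first paragraph second paragraph"
```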
<br />
== org.eclipse.smila.processing.pipelets.ReplacePipelet ==<br />
<br />
=== Description ===<br />
<br />
Searches for one or more patterns in the literal value of an attribute and replaces the found occurrences with the configured replacements. <br />
<br />
You can choose from different matching types:<br />
<br />
* ''entity'': Every pattern is matched against the whole attribute value (with respect to the ''ignoreCase'' property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.<br />
* ''substring'': All patterns that are part of the attribute value are replaced.<br />
* ''regexp'': Interpret all patterns as [http://en.wikipedia.org/wiki/Regular_expression regular expression], see [http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String) Matcher#replaceAll(String)]<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute that contains the literal to search in<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the result value as string, defaults to the input attribute<br />
|-<br />
|''type''<br />
|String : ''entity'', ''substring'', ''regexp''<br />
|init<br />
|Identifies the type of the pattern, see above for details. Defaults to ''substring''.<br />
|-<br />
|''ignoreCase''<br />
|Boolean<br />
|init<br />
|indicates that the case is ignored when matching patterns, defaults to ''false''.<br />
|-<br />
|''mapping''<br />
|Map<br />
|init<br />
|A mapping of multiple patterns and replacements. Each key is a pattern and its value the replacement.<br />
|-<br />
|''pattern''<br />
|String<br />
|init<br />
|the pattern to apply to the literal value (see above for a description of possible types), required if no mapping is given<br />
|-<br />
|''replacement''<br />
|String<br />
|init<br />
|the substitution string used to replace all occurrences of the pattern, defaults to the empty string<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to map language ids to their label:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="set language label"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">Language</rec:Val><br />
<rec:Val key="outputAttribute">LanguageLabel</rec:Val><br />
<rec:Val key="type">entity</rec:Val><br />
<rec:Val key="ignoreCase" type="boolean">true</rec:Val><br />
<rec:Map key="mapping"><br />
<rec:Val key="de">German</rec:Val><br />
<rec:Val key="en">English</rec:Val><br />
<rec:Val key="es">Spanish</rec:Val><br />
<rec:Val key="fr">French</rec:Val><br />
...<br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
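Since ''substring'' is the default type, a configuration that simply deletes a fixed prefix from an attribute value could look like the following; the attribute name and pattern are made up for illustration:<br />

```xml
<extensionActivity>
  <proc:invokePipelet name="remove draft prefix">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="inputAttribute">Title</rec:Val>
      <rec:Val key="pattern">DRAFT: </rec:Val>
      <rec:Val key="replacement"></rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```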
<br />
This configuration can be used to cut the time information from a timestamp:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="cut time"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">ModificationTime</rec:Val><br />
<rec:Val key="outputAttribute">ModificationDate</rec:Val><br />
<rec:Val key="type">regexp</rec:Val><br />
<rec:Val key="pattern">[T ].*</rec:Val><br />
<rec:Val key="replacement"></rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
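The effect of the ''regexp'' replacement above, sketched in plain JavaScript with a hypothetical timestamp value:<br />

```javascript
// The pattern [T ].* matches everything from the first 'T' or blank onward,
// so replacing the match with the empty string keeps only the date part.
var timestamp = "2014-11-28T10:31:07Z";
var date = timestamp.replace(/[T ].*/, "");
// date is now "2014-11-28"
```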
<br />
== org.eclipse.smila.processing.pipelets.ScriptPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes a script for each record. <br />
<br />
Execution uses the [http://en.wikipedia.org/wiki/Scripting_for_the_Java_Platform Java Scripting API (JSR 223)], so any compatible scripting engine can be used. JavaScript is available "out of the box" and is the default scripting language.<br />
<br />
The context of the script will contain the following variables:<br />
* ''blackboard'': a reference to the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/blackboard/Blackboard.html blackboard]<br />
* ''id'': the ID of the current record<br />
* ''record'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/AnyMap.html metadata] of the current record<br />
* ''results'': a slightly modified version of a [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/util/ResultCollector.html result collector] that provides methods to add a new record id to the list of result ids (''results.addResult('...id...')'') and to drop the current record from the same list (''results.excludeCurrentRecord()'')<br />
* ''parameterAccessor'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/parameters/ParameterAccessor.html ParameterAccessor] instance for access to the configuration (e.g. ''parameterAccessor.getParameterAny("configMap").asMap().getLongValue("longValue")'').<br />
<br />
Please be aware that the intention of this pipelet is to write pipelines fast, but not to write fast pipelines - the script is parsed for every record. Don't use it for production environments where performance matters, but use it to develop an algorithm that you can put into [[SMILA/Development_Guidelines/How_to_write_a_Pipelet|your own pipelet]].<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''type''<br />
|String<br />
|init<br />
|the mime type of the scripting language, defaults to "text/javascript"<br />
|-<br />
|''scriptFile''<br />
|String<br />
|runtime<br />
|the path of the file that contains the script - modifications of this file are observed on every execution of the pipelet<br />
|-<br />
|''script''<br />
|String<br />
|init<br />
|The "inline" script, required unless ''scriptFile'' is specified (ignored in that case)<br />
|-<br />
|''resultAttribute''<br />
|String<br />
|runtime<br />
|The name of an attribute that will receive the result of the script (usually the result of the last expression)<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to concatenate the values of two attributes and save the result into a third one:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="create full name"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="script">record.get("firstName") + " " + record.get("lastName")</rec:Val><br />
<rec:Val key="resultAttribute">fullName</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
This configuration can be used to execute a JavaScript file from $SMILA_PATH$/configuration/example.js:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="execute script"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="scriptFile">configuration/example.js</rec:Val> <br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ExecPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes an external program for each record. <br />
<br />
This pipelet may be used to integrate native programs into the pipeline. <br />
<br />
'''Attention''': This pipelet may lead to security issues! Please be aware that although the executed command cannot be changed at runtime (this parameter is only evaluated at initialization time), it is possible to change the arguments and input of the command using values in the processed record. Every pipeline developer should ensure that only arguments in the expected value range are processed (especially if the program accepts file-system paths as arguments).<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''command''<br />
|String<br />
|init<br />
|The program to execute (including its path in the file system).<br />
|-<br />
|''directory''<br />
|String<br />
|runtime<br />
|The (optional) working directory for the command. The SMILA directory is used if not given.<br />
|-<br />
|''parameters''<br />
|Sequence of strings<br />
|runtime<br />
|The optional parameters given to the program (ignored if the attribute named by ''parametersAttribute'' exists).<br />
|-<br />
|''parametersAttribute''<br />
|String<br />
|runtime<br />
|The optional name of the attribute that contains the sequence of parameters given to the program.<br />
|-<br />
|''inputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that contains the bytes to send as input for the program.<br />
|-<br />
|''outputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the standard output of the program.<br />
|-<br />
|''exitCodeAttribute''<br />
|String<br />
|runtime<br />
|The name of the attribute that is filled with the exit code of the program.<br />
|-<br />
|''errorAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the error output of the program.<br />
|-<br />
|''failOnError''<br />
|Either a boolean or a sequence of strings<br />
|runtime<br />
|Indicates to mark a record as failed if the program returns an error code. Either as a sequence of exit code ranges or as a boolean where "true" means that everything except 0 is an error code. Defaults to false.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to execute FFMPEG for transformation of an MP3 input file into a WAV output file:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="ConvertMP3"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ExecPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="command">.../ffmpeg</rec:Val><br />
<rec:Seq key="parameters"><br />
<rec:Val>-i</rec:Val><br />
<rec:Val>.../example.mp3</rec:Val><br />
<rec:Val>-ar</rec:Val><br />
<rec:Val>16000</rec:Val><br />
<rec:Val>.../example.wav</rec:Val><br />
</rec:Seq><br />
<rec:Val key="failOnError" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
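To pass parameters per record instead of a fixed list, the parameter sequence can be stored in a record attribute and referenced via ''parametersAttribute''; the attribute name below is made up for illustration:<br />

```xml
<extensionActivity>
  <proc:invokePipelet name="ConvertPerRecord">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.ExecPipelet" />
    <proc:variables input="request" />
    <proc:configuration>
      <rec:Val key="command">.../ffmpeg</rec:Val>
      <rec:Val key="parametersAttribute">ffmpegArgs</rec:Val>
      <rec:Val key="failOnError" type="boolean">true</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>
```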
<br />
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet is used to identify the MIME type of a document. <br />
It uses an <tt>[[SMILA/Documentation/MimeTypeIdentifier| org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier]]</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.<br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''FileExtensionAttribute''||String||init||Optional||Name of the attribute containing the file extension<br />
|-<br />
|''ContentAttachment''||String||init||Optional||Name of the attachment containing the file content<br />
|-<br />
|''MetaDataAttribute''||String||init||Optional||Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information<br />
|-<br />
|''MimeTypeAttribute''||String||init||Required||Name of the attribute to store the identified MIME type to<br />
|}<br />
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!<br />
<br />
=== Example ===<br />
<br />
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect MimeType"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="FileExtensionAttribute">Extension</rec:Val><br />
<rec:Val key="MetaDataAttribute">MetaData</rec:Val><br />
<rec:Val key="MimeTypeAttribute">MimeType</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<br />
== org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet identifies the language of textual input and stores the returned ISO 639 language code to some target attribute. It uses an <tt>org.eclipse.smila.common.language.LanguageIdentifier</tt> service to perform the actual identification. If the identification does not return a language, the specified <tt>DefaultLanguage</tt> (or <tt>DefaultAlternativeName</tt>) is returned. If no defaults are specified, no value is set.<br />
<br />
The pipelet returns the detected language as an ISO 639 code. If your application needs different language tags, the pipelet can also produce an alternative language code according to a configurable mapping. To define such a mapping, create the file <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt>. The following shows an exemplary mapping:<br />
<br />
<source lang="text"><br />
de=german<br />
en=english<br />
es=spanish<br />
fi=finnish<br />
fr=french<br />
</source><br />
<br />
The pipelet uses [http://tika.apache.org/ Apache Tika] technology for the actual language detection. <br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''ContentAttribute''||String||runtime||Required||Name of the attribute containing the text whose language should be identified<br />
|-<br />
|''LanguageAttribute''||String||runtime||Optional||Name of the attribute to store the code of the identified language to<br />
|-<br />
|''DefaultLanguage''||String||runtime||Optional||Language code to set if no language could be detected. If not set and no language could be identified, the <tt>LanguageAttribute</tt> attribute remains empty.<br />
|-<br />
|''AlternativeNameAttribute''||String||runtime||Optional||Name of the attribute to store the alternative language code of the identified language to. The mapping defining this alternative code must be located in <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt> (see above).<br />
|-<br />
|''DefaultAlternativeName''||String||runtime||Optional||Alternative language code to set if no language could be detected. If not set and no language could be identified, the <tt>AlternativeNameAttribute</tt> attribute remains empty. <br />
|-<br />
|''UseCertainLanguagesOnly''||Boolean||runtime||Optional||Boolean flag indicating whether to apply only those languages that were identified with a reasonable certainty (true) or all (false). Default is false.<br />
|}<br />
<br />
<br />
=== Example ===<br />
<br />
The following example could be used to identify the language of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect Language"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="ContentAttribute">Content</rec:Val><br />
<rec:Val key="LanguageAttribute">Language</rec:Val><br />
<rec:Val key="DefaultLanguage">de</rec:Val><br />
<rec:Val key="AlternativeNameAttribute">AltLanguage</rec:Val><br />
<rec:Val key="DefaultAlternativeName">german</rec:Val><br />
<rec:Val key="UseCertainLanguagesOnly">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to read content from a file and add it as an attachment.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the file to read from<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to store the content <br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
<source lang="xml"><br />
<!-- read from file and add attachment --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeReadFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to write the content of an attachment to a file.<br />
<br />
If the attachment does not exist, a warning is logged, but the record will not be dropped.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the target file<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to write to the file <br />
|-<br />
|''append''<br />
|Boolean<br />
|runtime<br />
|Indicates to append the attachment to the file (if it exists already), defaults to false<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example saves all bytes of the attachment "content" to the file path that is contained in the attribute "path".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.PushRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Sends all current records to another (asynchronous) job.<br />
<br />
The records are not removed from the pipeline - thus a following pipelet in the current pipeline will process the records as well.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''job''<br />
|String <br />
|init<br />
|The name of the target job.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example sends all current records to the job "TheOtherJob".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="callJob"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.PushRecordsPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="job">TheOtherJob</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
Fills attributes of the record from a JSON string.<br />
<br />
It is not possible to overwrite the record id of the record, even if a key "_recordid" exists in the JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is found in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|init<br />
|name of the input attachment or input attribute that contains the JSON string<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|init<br />
|the optional name of the attribute in the record into which the generated object is put. If no attribute is specified and the object is a map, all contained attributes are written to the current record.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
The following examples use this input object:<br />
<source lang="javascript"><br />
{ "jsonString": "{\"attribute1\": \"value1\"}" }<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the attribute "jsonObject":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
<rec:Val key="outputAttribute">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"jsonObject": { <br />
"attribute1": "value1"<br />
}<br />
}<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the object itself:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"attribute1": "value1"<br />
}<br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
Writes some or all attributes of the record into a JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttributes''<br />
|String/Sequence of String<br />
|init<br />
|the names of the attributes in the record that contain the objects to write into JSON. If nothing is given, the whole record is used. If only a string is given, the content of that attribute is used.<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is written to an attachment or attribute of the record<br />
|-<br />
|''outputName''<br />
|String<br />
|init<br />
|name of the target attachment or attribute<br />
|-<br />
|''printPretty''<br />
|Boolean<br />
|init<br />
|Indicates to format the output for better readability, defaults to true.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This example writes the content of attribute "a1" into the attribute "value" without any whitespaces:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttributes">a1</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="outputName">value</rec:Val><br />
<rec:Val key="printPretty" type="boolean">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<source lang="javascript"><br />
input : { "a1": [ 1 ], "a2": 2 }<br />
result : { "a1": [ 1 ], "a2": 2, "value": "[1]" }<br />
</source><br />
<br />
This example appends the whole object to the file "records.log":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONLogEntry"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">jsonLog</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONFileName"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">jsonFile</rec:Val><br />
<rec:Val key="value">records.log</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="appendToJSONLog"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">jsonFile</rec:Val><br />
<rec:Val key="contentAttachment">jsonLog</rec:Val><br />
<rec:Val key="append" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet ==<br />
<br />
Splits a single input document into multiple separate output records. Think of a book and its pages as an example. The splitter copies the attribute values of the document to each record. However, if an attribute exists both in the enclosing document and in a sub record, the resulting record will carry its own attribute value instead of the document's one. In short: record values beat document values.<br />
Document splitting is only applied if the input record carries the given partsAttribute (see ''configuration''). If the input record does not carry that specific attribute, it is passed on unchanged. If the attribute exists but contains no values, it is removed.<br />
<br />
If splitting is applied, each output record receives an additional "_documentid" attribute containing the "_recordid" of the enclosing document. The effective "_recordid" of each output record is an aggregate of "_documentid" and the page number, joined via the string "###". Example:<br />
<br />
<source lang="javascript"><br />
{<br />
"_documentid": "book.pdf",<br />
"_recordid": "book.pdf###0",<br />
...<br />
}<br />
</source><br />
<br />
<br />
<br />
=== Configuration ===<br />
<br />
The configuration property is read from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''partsAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute where the single pages of the document are contained. <br />
|}<br />
<br />
=== Example === <br />
<br />
Imagine this example input to be split by the pipelet:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"subPages":<br />
[<br />
{<br />
"content": "public spending must be pro-cyclical",<br />
"author": "adam smith"<br />
},<br />
{<br />
"content": "public spending must be counter-cyclical"<br />
}<br />
]<br />
}<br />
</source><br />
<br />
The configuration must be as follows then:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="splitDocument"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="partsAttribute">subPages</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The output will be two separate records:<br />
<br />
<source lang="javascript"><br />
<br />
[<br />
{<br />
"_recordid": "document0.pdf###0",<br />
"_documentid": "document0.pdf",<br />
"author": "adam smith",<br />
"content": "public spending must be pro-cyclical"<br />
},<br />
{<br />
"_recordid": "document0.pdf###1",<br />
"_documentid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"content": "public spending must be counter-cyclical" <br />
}<br />
]<br />
<br />
</source><br />
<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.com
https://wiki.eclipse.org/index.php?title=SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets&diff=342828
SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets
2013-07-08T12:26:25Z
<p>Marco.strack.empolis.com: /* Example */</p>
<hr />
<div>This page describes the SMILA pipelets provided by bundle <tt>org.eclipse.smila.processing.pipelets</tt>.<br />
<br />
== General ==<br />
<br />
All pipelets in this bundle support the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.<br />
<br />
''' Read Type '''<br />
* ''runtime'': Parameters are read when processing records. Parameter value can be set per Record.<br />
* ''init'': Parameters are read once from Pipelet configuration when initializing the Pipelet. Parameter value can not be overwritten in Record.<br />
<br />
== org.eclipse.smila.processing.pipelets.CommitRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Commits each record in the ''input'' variable on the blackboard to the storages. Can be used to save the records immediately during the workflow instead of only when a workflow has been finished.<br />
<br />
=== Configuration ===<br />
<br />
none.<br />
<br />
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==<br />
<br />
Adds one or more values to an attribute in the processed records. If the attribute does not already contain a sequence, the current value is wrapped in one before the new values are added.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to add values to<br />
|-<br />
|''valuesToAdd''<br />
|Anything, usually a value or a sequence of values<br />
|runtime<br />
|The values to add<br />
|}<br />
<br />
=== Example ===<br />
<br />
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.<br />
<br />
<source lang="xml"><br />
<proc:invokePipelet name="addValuesToNonExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">out</rec:Val><br />
<rec:Seq key="valuesToAdd"><br />
<rec:Val>value1</rec:Val><br />
<rec:Val>value2</rec:Val><br />
</rec:Seq><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.SetValuePipelet ==<br />
<br />
Sets a value for an attribute in every processed record. If the attribute already exists, it is not changed by default. Useful for initializing required attributes.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to set the value for<br />
|-<br />
|''value''<br />
|anything<br />
|runtime<br />
|The constant value to set for the attribute (a map or sequence is possible, too)<br />
|-<br />
|''overwrite''<br />
|boolean<br />
|runtime<br />
|Indicates to overwrite any value that the attribute contains already (optional, defaults to false)<br />
|}<br />
<br />
=== Example ===<br />
<br />
This sets a map containing two values into attribute1, even if there is already a value in that attribute.<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="setMapForExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">attribute1</rec:Val><br />
<rec:Val key="overwrite" type="boolean">true</rec:Val><br />
<rec:Map key="value"><br />
<rec:Val key="key1">value1</rec:Val><br />
<rec:Val key="key2">value2</rec:Val><br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.RemoveAttributePipelet ==<br />
<br />
Removes an attribute from each record. <br />
<br />
=== Configuration ===<br />
<br />
The configuration property is either read from the <tt>_parameters</tt> attribute of a record or from the pipelet configuration. If not set at all, the record remains unchanged.<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''removeAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to remove<br />
|}<br />
<br />
=== Example === <br />
<br />
To remove the complete structure in attribute <tt>_parameters</tt>, use: <br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="removeParameters"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.RemoveAttributePipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="removeAttribute">_parameters</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FilterPipelet ==<br />
<br />
Copies only those record IDs to the result which match a configurable regular expression in a configurable single-valued attribute. This is useful for conditional processing while at the same time pushing multiple records through the pipeline in a single request: Instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable and use this variable as the input variable for the next pipelets. You can still use the original BPEL variable in the BPEL <tt><reply></tt> activity at the end of the pipeline to return all records as the final result.<br />
<br />
=== Configuration ===<br />
The configuration properties are read either from the <tt>_parameters</tt> attribute of each record or from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''filterAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to match<br />
|-<br />
|''filterExpression''<br />
|A string value<br />
|runtime<br />
|The regular expression to match the attribute value against<br />
|}<br />
<br />
=== Example === <br />
<br />
To get only those records in the <tt>textRecords</tt> BPEL variable that have a MimeType starting with <tt>text</tt> something like this could be used:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeFilterPipelet"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" /><br />
<proc:variables input="request" output="textRecords" /><br />
<proc:configuration><br />
<rec:Val key="filterAttribute">MimeType</rec:Val><br />
<rec:Val key="filterExpression">text/.+</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.HtmlToTextPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts plain text and metadata from an HTML document found in an attribute or attachment of each record and writes the results to configurable attributes or attachments.<br />
<br />
The pipelet uses the CyberNeko HTML parser [http://nekohtml.sourceforge.net/ NekoHTML] to parse HTML documents.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the HTML input is found in an attachment or in an attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the plain text should be stored in an attachment or in an attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|Name of input attachment or path to input attribute (process literals of attribute)<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
|Name of output attachment or path to output attribute for plain text (store result as literals of attribute)<br />
|-<br />
|''defaultEncoding''<br />
|String<br />
|runtime<br />
|Optional, default encoding to apply to documents when not specified in the documents themselves<br />
|-<br />
|''removeContentTags''<br />
|String<br />
|runtime<br />
|Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.<br />
|-<br />
|''meta:<name>''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><META></tt> tag with ''name="<name>"'' (case insensitive) to the attribute named as the value of the property. E.g. a property named ''"meta:author"'' with value "authors" causes the content attributes of <tt><META name="author" content="..."></tt> tags to be stored in the attribute ''authors'' of the respective record.<br />
|-<br />
|''tag:title''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><TITLE></tt> tag in the attribute named as the value of the property.<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration extracts plain text from the HTML document in attachment ''"html"'' and stores the results to the attribute ''"text"''. It removes the complete content of heading tags <tt><nowiki><h1>, ..., <h4></nowiki></tt>. In addition to that, it looks for <tt><meta></tt> tags with names ''"author"'' and ''"keywords"'' and stores their contents in attributes ''"authors"'' and ''"keywords"'', respectively:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeHtml2Txt"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">html</rec:Val><br />
<rec:Val key="outputName">text</rec:Val><br />
<rec:Val key="defaultEncoding">UTF-8</rec:Val><br />
<rec:Val key="meta:author">author</rec:Val><br />
<rec:Val key="meta:keywords">keywords</rec:Val><br />
<rec:Val key="meta:title">title</rec:Val><br />
<rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.CopyPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to copy or move attribute values to other attributes or to copy or move a string value between attributes and/or attachments. It supports two execution modes:<br />
* COPY: copy the value from the input attribute/attachment to the output attribute/attachment <br />
* MOVE: same as COPY, but after that delete the value from the input attribute/attachment<br />
When an attribute is copied to another attribute, the type remains the same. When copying an attachment to an attribute, a string value is created by assuming that the attachment contains text in UTF-8 encoding. When copying an attribute value to an attachment, the attribute must contain a single value, which is interpreted as a string and converted to a byte array using UTF-8 encoding.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if the input is found in an attachment or attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|name of input attachment or input attribute<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
| name of output attachment or output attribute<br />
|-<br />
|''mode''<br />
|String : ''COPY, MOVE''<br />
|runtime<br />
| execution mode. Copy the value or move (copy and delete) the value. Default is COPY.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':<br />
<br />
<source lang="xml"><br />
<!-- copy txt from attachment to attribute --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeCopyContent"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="outputName">TextContent</rec:Val><br />
<rec:Val key="mode">COPY</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts literal values from an attribute that contains a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes: <br />
*FIRST: selects only the first literal of the specified attribute<br />
*LAST: selects only the last literal of the specified attribute<br />
*ALL_AS_LIST: selects all literal values of the specified attribute and returns a list<br />
*ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
<b>Note</b>:<br />
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputPath''<br />
|String<br />
|runtime<br />
|the path to the input attribute with Literals<br />
|-<br />
|''outputPath''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the extracted value(s) as Literals in (not a path, only a top-level attribute, currently)<br />
|-<br />
|''mode''<br />
|String : ''FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE''<br />
|runtime<br />
| execution mode. See above for details.<br />
|-<br />
|''separator''<br />
|String<br />
|runtime<br />
| the separation string used for mode ALL_AS_ONE. Default is a blank<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values into one:<br />
<br />
<source lang="xml"><br />
<!-- extract content --><br />
<extensionActivity><br />
<proc:invokePipelet name="extract content"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputPath">Contents/Value</rec:Val><br />
<rec:Val key="outputPath">Content</rec:Val><br />
<rec:Val key="mode">ALL_AS_ONE</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ReplacePipelet ==<br />
<br />
=== Description ===<br />
<br />
Searches for one or more patterns in the literal value of an attribute and substitutes the found occurrences by the configured replacements. <br />
<br />
You can choose from different matching types:<br />
<br />
* ''entity'': Every pattern is matched against the whole attribute value (with respect to the ''ignoreCase'' property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.<br />
* ''substring'': All patterns that are part of the attribute value are replaced.<br />
* ''regexp'': Interpret all patterns as [http://en.wikipedia.org/wiki/Regular_expression regular expression], see [http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String) Matcher#replaceAll(String)]<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute that contains the literal to search in<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute to store the result value as string, defaults to the input attribute<br />
|-<br />
|''type''<br />
|String : ''entity'', ''substring'', ''regexp''<br />
|init<br />
|Identifies the type of the pattern, see above for details. Defaults to ''substring''.<br />
|-<br />
|''ignoreCase''<br />
|Boolean<br />
|init<br />
|indicates that the case is ignored when matching patterns, defaults to ''false''.<br />
|-<br />
|''mapping''<br />
|Map<br />
|init<br />
|A mapping of multiple patterns and replacements. Each key is a pattern and its value the replacement.<br />
|-<br />
|''pattern''<br />
|String<br />
|init<br />
|the pattern to apply to the literal value (see above for a description of possible types), required if no mapping is given<br />
|-<br />
|''replacement''<br />
|String<br />
|init<br />
|the substitution string used to replace all occurrences of the pattern, defaults to the empty string<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to map language ids to their label:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="set language label"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">Language</rec:Val><br />
<rec:Val key="outputAttribute">LanguageLabel</rec:Val><br />
<rec:Val key="type">entity</rec:Val><br />
<rec:Val key="ignoreCase" type="boolean">true</rec:Val><br />
<rec:Map key="mapping"><br />
<rec:Val key="de">German</rec:Val><br />
<rec:Val key="en">English</rec:Val><br />
<rec:Val key="es">Spanish</rec:Val><br />
<rec:Val key="fr">French</rec:Val><br />
...<br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
This configuration can be used to cut the time information from a timestamp:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="cut time"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">ModificationTime</rec:Val><br />
<rec:Val key="outputAttribute">ModificationDate</rec:Val><br />
<rec:Val key="type">regexp</rec:Val><br />
<rec:Val key="pattern">[T ].*</rec:Val><br />
<rec:Val key="replacement"></rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ScriptPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes a script for each record. <br />
<br />
Execution is handled by the [http://en.wikipedia.org/wiki/Scripting_for_the_Java_Platform Java Scripting API (JSR 223)], so any compatible scripting engine can be used. JavaScript is available out of the box and is the default script language.<br />
<br />
The context of the script will contain the following variables:<br />
* ''blackboard'': a reference to the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/blackboard/Blackboard.html blackboard]<br />
* ''id'': the ID of the current record<br />
* ''record'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/AnyMap.html metadata] of the current record<br />
* ''results'': a slightly modified version of a [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/util/ResultCollector.html result collector] that provides methods to add a new record id to the list of result ids (''results.addResult('...id...')'') and to drop the current record from the same list (''results.excludeCurrentRecord()'')<br />
* ''parameterAccessor'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/parameters/ParameterAccessor.html ParameterAccessor] instance for access to the configuration (e.g. ''parameterAccessor.getParameterAny("configMap").asMap().getLongValue("longValue")'').<br />
<br />
Please be aware that the intention of this pipelet is to write pipelines fast, not to write fast pipelines: the script is parsed for every record. Don't use it in production environments where performance matters, but use it to develop an algorithm that you can put into [[SMILA/Development_Guidelines/How_to_write_a_Pipelet|your own pipelet]].<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''type''<br />
|String<br />
|init<br />
|the mime type of the scripting language, defaults to "text/javascript"<br />
|-<br />
|''scriptFile''<br />
|String<br />
|runtime<br />
|the path of the file that contains the script - modifications of this file are observed on every execution of the pipelet<br />
|-<br />
|''script''<br />
|String<br />
|init<br />
|The "inline" script, required unless ''scriptFile'' is specified (ignored in that case)<br />
|-<br />
|''resultAttribute''<br />
|String<br />
|runtime<br />
|The name of an attribute that will receive the result of the script (usually the result of the last expression)<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to concatenate the values of two attributes and save the result into a third one:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="create full name"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="script">record.get("firstName") + " " + record.get("lastName")</rec:Val><br />
<rec:Val key="resultAttribute">fullName</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
This configuration can be used to execute a JavaScript file from $SMILA_PATH$/configuration/example.js:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="execute script"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="scriptFile">configuration/example.js</rec:Val> <br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
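As a sketch of what such a script file might contain: the inline example above accesses the processed record via <tt>record.get(...)</tt>, so a script file can use the same accessors. The mock <tt>record</tt> object below only simulates that API so the snippet is self-contained; it is not part of SMILA.<br />
<br />
```javascript
// Hypothetical contents of configuration/example.js: normalize a title attribute.
// The `record` object is provided by the ScriptPipelet at runtime; this mock
// only simulates its get/set accessors for standalone illustration.
var record = {
  _attrs: { "Title": "  SMILA In Action  " },
  get: function (name) { return this._attrs[name]; },
  set: function (name, value) { this._attrs[name] = value; }
};

// The script body as it could appear in example.js:
var title = record.get("Title");
record.set("Title", title.trim().toUpperCase());

console.log(record.get("Title")); // -> SMILA IN ACTION
```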
<br />
== org.eclipse.smila.processing.pipelets.ExecPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes an external program for each record. <br />
<br />
This pipelet may be used to integrate native programs into the pipeline. <br />
<br />
'''Attention''': This pipelet may lead to security issues! Although the executed command cannot be changed at runtime (this parameter is only evaluated at initialization time), the arguments and input of the command can be changed via values in the processed record. Every pipeline developer should ensure that only arguments in the expected value range are processed (especially if the program accepts file system paths as arguments).<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''command''<br />
|String<br />
|init<br />
|The program to execute (including its path in the file system).<br />
|-<br />
|''directory''<br />
|String<br />
|runtime<br />
|The (optional) working directory for the command. The SMILA directory is used if not given.<br />
|-<br />
|''parameters''<br />
|Sequence of strings<br />
|runtime<br />
|The optional parameters given to the program (ignored if the attribute named by ''parametersAttribute'' exists).<br />
|-<br />
|''parametersAttribute''<br />
|String<br />
|runtime<br />
|The optional name of the attribute that contains the sequence of parameters given to the program.<br />
|-<br />
|''inputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that contains the bytes to send as input for the program.<br />
|-<br />
|''outputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the standard output of the program.<br />
|-<br />
|''exitCodeAttribute''<br />
|String<br />
|runtime<br />
|The name of the attribute that is filled with the exit code of the program.<br />
|-<br />
|''errorAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the error output of the program.<br />
|-<br />
|''failOnError''<br />
|Either a boolean or a sequence of strings<br />
|runtime<br />
|Indicates whether to mark a record as failed if the program returns an error code: either a sequence of exit code ranges, or a boolean where "true" means that every exit code except 0 is an error. Defaults to false.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to execute FFMPEG for transformation of an MP3 input file into a WAV output file:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="ConvertMP3"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ExecPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="command">.../ffmpeg</rec:Val><br />
<rec:Seq key="parameters"><br />
<rec:Val>-i</rec:Val><br />
<rec:Val>.../example.mp3</rec:Val><br />
<rec:Val>-ar</rec:Val><br />
<rec:Val>16000</rec:Val><br />
<rec:Val>.../example.wav</rec:Val><br />
</rec:Seq><br />
<rec:Val key="failOnError" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet is used to identify the MIME type of a document. <br />
It uses an <tt>[[SMILA/Documentation/MimeTypeIdentifier| org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier]]</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.<br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''FileExtensionAttribute''||String||init||Optional||Name of the attribute containing the file extension<br />
|-<br />
|''ContentAttachment''||String||init||Optional||Name of the attachment containing the file content<br />
|-<br />
|''MetaDataAttribute''||String||init||Optional||Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information<br />
|-<br />
|''MimeTypeAttribute''||String||init||Required||Name of the attribute to store the identified MIME type to<br />
|}<br />
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!<br />
<br />
=== Example ===<br />
<br />
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect MimeType"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="FileExtensionAttribute">Extension</rec:Val><br />
<rec:Val key="MetaDataAttribute">MetaData</rec:Val><br />
<rec:Val key="MimeTypeAttribute">MimeType</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<br />
== org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet identifies the language of textual input and stores the returned ISO 639 language code in a target attribute. It uses an <tt>org.eclipse.smila.common.language.LanguageIdentifier</tt> service to perform the actual identification. If the identification does not return a language, the specified <tt>DefaultLanguage</tt> (or <tt>DefaultAlternativeName</tt>) is returned. If no defaults are specified, no value is set.<br />
<br />
The pipelet returns the detected language as an ISO 639 code. If your application needs special language tags, the pipelet can produce<br />
an alternative language code according to a configurable mapping. To define such a mapping, create the file <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt>. The following shows an example mapping:<br />
<br />
<source lang="text"><br />
de=german<br />
en=english<br />
es=spanish<br />
fi=finnish<br />
fr=french<br />
</source><br />
<br />
The pipelet uses [http://tika.apache.org/ Apache Tika] technology for the actual language detection. <br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''ContentAttribute''||String||runtime||Required||Name of the attribute containing the text whose language should be identified<br />
|-<br />
|''LanguageAttribute''||String||runtime||Optional||Name of the attribute to store the code of the identified language to<br />
|-<br />
|''DefaultLanguage''||String||runtime||Optional||Language code to set if no language could be detected. If not set and no language could be identified, the <tt>LanguageAttribute</tt> attribute remains empty.<br />
|-<br />
|''AlternativeNameAttribute''||String||runtime||Optional||Name of the attribute to store the alternative language code of the identified language to. The mapping defining this alternative code must be located in <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt> (see above).<br />
|-<br />
|''DefaultAlternativeName''||String||runtime||Optional||Alternative language code to set if no language could be detected. If not set and no language could be identified, the <tt>DefaultAlternativeName</tt> attribute remains empty. <br />
|-<br />
|''UseCertainLanguagesOnly''||Boolean||runtime||Optional||Boolean flag indicating whether to apply only those languages that were identified with a reasonable certainty (true) or all (false). Default is false.<br />
|}<br />
<br />
<br />
=== Example ===<br />
<br />
The following example could be used to identify the language of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect Language"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="ContentAttribute">Content</rec:Val><br />
<rec:Val key="LanguageAttribute">Language</rec:Val><br />
<rec:Val key="DefaultLanguage">de</rec:Val><br />
<rec:Val key="AlternativeNameAttribute">AltLanguage</rec:Val><br />
<rec:Val key="DefaultAlternativeName">german</rec:Val><br />
<rec:Val key="UseCertainLanguagesOnly" type="boolean">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to read content from a file and add it as an attachment.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the file to read from<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to store the content <br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
<source lang="xml"><br />
<!-- read from file and add attachment --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeReadFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to write the content of an attachment to a file.<br />
<br />
If the attachment does not exist, a warning is logged, but the record is not dropped.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the target file<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to write to the file <br />
|-<br />
|''append''<br />
|Boolean<br />
|runtime<br />
|Indicates to append the attachment to the file (if it exists already), defaults to false<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example saves all bytes of the attachment "content" to the file path that is contained in the attribute "path".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.PushRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Sends all current records to another (asynchronous) job.<br />
<br />
The records are not removed from the pipeline - thus a following pipelet in the current pipeline will process the records as well.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''job''<br />
|String <br />
|init<br />
|The name of the target job.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example sends all current records to the job "TheOtherJob".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="callJob"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.PushRecordsPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="job">TheOtherJob</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
Fills attributes of the record from a JSON string.<br />
<br />
It is not possible to overwrite the record id of the record, even if a key "_recordid" exists in the JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is found in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|init<br />
|name of the input attachment or input attribute that contains the JSON string<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|init<br />
|the optional name of the attribute in the record into which the generated object is put. If no attribute is specified and the object is a map, all contained attributes are written to the current record.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
The following examples use this input object:<br />
<source lang="javascript"><br />
{ "jsonString": "{\"attribute1\": \"value1\"}" }<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the attribute "jsonObject":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
<rec:Val key="outputAttribute">jsonObject</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"jsonObject": { <br />
"attribute1": "value1"<br />
}<br />
}<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the record itself:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"attribute1": "value1"<br />
}<br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
Writes some or all attributes of the record into a JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttributes''<br />
|String/Sequence of String<br />
|init<br />
|the names of the attributes in the record that contain the objects to write into JSON. If nothing is given, the whole record is used. If only a string is given, the content of that attribute is used.<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is written to an attachment or attribute of the record<br />
|-<br />
|''outputName''<br />
|String<br />
|init<br />
|name of the target attachment or attribute<br />
|-<br />
|''printPretty''<br />
|Boolean<br />
|init<br />
|Indicates whether to format the output for better readability, defaults to true.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This example writes the content of attribute "a1" into the attribute "value" without any whitespace:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttributes">a1</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="outputName">value</rec:Val><br />
<rec:Val key="printPretty" type="boolean">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<source lang="javascript"><br />
input : { "a1": [ 1 ], "a2": 2 }<br />
result : { "a1": [ 1 ], "a2": 2, "value": "[1]" }<br />
</source><br />
<br />
This example appends the whole object to the file "records.log":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONLogEntry"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">jsonLog</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONFileName"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">jsonFile</rec:Val><br />
<rec:Val key="value">records.log</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="appendToJSONLog"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">jsonFile</rec:Val><br />
<rec:Val key="contentAttachment">jsonLog</rec:Val><br />
<rec:Val key="append" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet ==<br />
<br />
Splits a single input document into multiple separate output records. Think of a book and its pages as an example. The splitter copies the attribute values of the document to each record. However, if an attribute exists both in the enclosing document and in a sub-record, the resulting record will carry its own attribute value instead of the document's.<br />
<br />
=== Configuration ===<br />
<br />
The configuration property is read from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''partsAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute where the single pages of the document are contained. <br />
|}<br />
<br />
=== Example === <br />
<br />
Imagine this example input to be split by the pipelet:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"subPages":<br />
[<br />
{<br />
"content": "public spending must be pro-cyclical",<br />
"author": "adam smith"<br />
},<br />
{<br />
"content": "public spending must be counter-cyclical"<br />
}<br />
]<br />
}<br />
</source><br />
<br />
The configuration must be as follows then:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="splitDocument"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="partsAttribute">subPages</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The output will be two separate records:<br />
<br />
<source lang="javascript"><br />
<br />
[<br />
{<br />
"_recordid": "document0.pdf###0",<br />
"_documentid": "document0.pdf",<br />
"author": "adam smith",<br />
"content": "public spending must be pro-cyclical"<br />
},<br />
{<br />
"_recordid": "document0.pdf###1",<br />
"_documentid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"content": "public spending must be counter-cyclical" <br />
}<br />
]<br />
<br />
</source><br />
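The attribute precedence shown above (a sub-record's own attribute wins over the document-level one) behaves like a simple map merge. The following JavaScript sketch only illustrates that merge with plain objects standing in for records; the <tt>splitDocument</tt> helper is illustrative, not SMILA API:<br />
<br />
```javascript
// Simulates the DocumentSplitterPipelet's attribute merge: document-level
// attributes are copied to each part record, but a part's own attribute wins.
function splitDocument(doc, partsAttribute) {
  var docAttrs = {};
  for (var key in doc) {
    if (key !== partsAttribute && key !== "_recordid") docAttrs[key] = doc[key];
  }
  return doc[partsAttribute].map(function (part, i) {
    var record = { "_recordid": doc._recordid + "###" + i, "_documentid": doc._recordid };
    for (var k in docAttrs) record[k] = docAttrs[k]; // inherit document attributes
    for (var p in part) record[p] = part[p];         // part's own value overrides
    return record;
  });
}

var doc = {
  "_recordid": "document0.pdf",
  "author": "john maynard keynes",
  "subPages": [
    { "content": "public spending must be pro-cyclical", "author": "adam smith" },
    { "content": "public spending must be counter-cyclical" }
  ]
};
var parts = splitDocument(doc, "subPages");
console.log(parts[0].author); // -> adam smith (own value wins)
console.log(parts[1].author); // -> john maynard keynes (inherited)
```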
<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets&diff=342803SMILA/Documentation/Bundle org.eclipse.smila.processing.pipelets2013-07-08T09:34:28Z<p>Marco.strack.empolis.com: added documentation for new DocumentSplitter pipelet</p>
<hr />
<div>This page describes the SMILA pipelets provided by bundle <tt>org.eclipse.smila.processing.pipelets</tt>.<br />
<br />
== General ==<br />
<br />
All pipelets in this bundle support the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in jobmanager workflows, records causing errors are dropped.<br />
<br />
''' Read Type '''<br />
* ''runtime'': Parameters are read when processing records. The parameter value can be set per record.<br />
* ''init'': Parameters are read once from the pipelet configuration when the pipelet is initialized. The parameter value cannot be overwritten in a record.<br />
<br />
== org.eclipse.smila.processing.pipelets.CommitRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Commits each record in the ''input'' variable on the blackboard to the storages. Can be used to save the records immediately during the workflow instead of only when a workflow has been finished.<br />
<br />
=== Configuration ===<br />
<br />
none.<br />
<br />
== org.eclipse.smila.processing.pipelets.AddValuesPipelet ==<br />
<br />
Adds values to an attribute in the processed records. If the attribute does not already contain a sequence, the current value is wrapped in one before the new values are added.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to add values to<br />
|-<br />
|''valuesToAdd''<br />
|Anything, usually a value or a sequence of values<br />
|runtime<br />
|The values to add<br />
|}<br />
<br />
=== Example ===<br />
<br />
From a test pipeline: This adds two string values to whatever already exists in attribute "out" of the processed records.<br />
<br />
<source lang="xml"><br />
<proc:invokePipelet name="addValuesToNonExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.AddValuesPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">out</rec:Val><br />
<rec:Seq key="valuesToAdd"><br />
<rec:Val>value1</rec:Val><br />
<rec:Val>value2</rec:Val><br />
</rec:Seq><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</source><br />
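The wrapping behavior described above can be sketched in plain JavaScript, with an array standing in for a SMILA value sequence (the <tt>addValues</tt> helper is illustrative, not SMILA API):<br />
<br />
```javascript
// Illustrates AddValuesPipelet semantics: if the target attribute is not
// already a sequence, its current value is wrapped in one before appending.
function addValues(record, outputAttribute, valuesToAdd) {
  var current = record[outputAttribute];
  if (current === undefined) {
    record[outputAttribute] = valuesToAdd.slice();
  } else if (!Array.isArray(current)) {
    record[outputAttribute] = [current].concat(valuesToAdd);
  } else {
    record[outputAttribute] = current.concat(valuesToAdd);
  }
  return record;
}

console.log(addValues({ out: "existing" }, "out", ["value1", "value2"]).out);
// -> [ 'existing', 'value1', 'value2' ]
```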
<br />
== org.eclipse.smila.processing.pipelets.SetValuePipelet ==<br />
<br />
Sets a value for an attribute in every processed record. If the attribute already exists, it is not changed by default. Useful for initialization of required attributes.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''outputAttribute''<br />
|string<br />
|runtime<br />
|The name of the attribute to set the value for<br />
|-<br />
|''value''<br />
|anything<br />
|runtime<br />
|The constant value to set for the attribute (a map or sequence is possible, too)<br />
|-<br />
|''overwrite''<br />
|boolean<br />
|runtime<br />
|Indicates to overwrite any value that the attribute contains already (optional, defaults to false)<br />
|}<br />
<br />
=== Example ===<br />
<br />
This sets a map containing two values into attribute1, even if there is already a value in that attribute.<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="setMapForExistingAttribute"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">attribute1</rec:Val><br />
<rec:Val key="overwrite" type="boolean">true</rec:Val><br />
<rec:Map key="value"><br />
<rec:Val key="key1">value1</rec:Val><br />
<rec:Val key="key2">value2</rec:Val><br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.RemoveAttributePipelet ==<br />
<br />
Removes an attribute from each record. <br />
<br />
=== Configuration ===<br />
<br />
The configuration property is either read from the <tt>_parameters</tt> attribute of a record or from the pipelet configuration. If not set at all, the record remains unchanged.<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''removeAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to remove<br />
|}<br />
<br />
=== Example === <br />
<br />
To remove the complete structure in attribute <tt>_parameters</tt>, use: <br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="removeParameters"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.RemoveAttributePipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="removeAttribute">_parameters</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FilterPipelet ==<br />
<br />
Copies only those record IDs to the result whose records match a configurable regular expression in a configurable single-valued attribute. This is useful for conditional processing while pushing multiple records through the pipeline in a single request: instead of using BPEL conditions, use a FilterPipelet to select only the matching records into a new variable and use this variable as the input variable for the next pipelets. You can still use the original BPEL variable in the BPEL <tt><reply></tt> activity at the end of the pipeline to return all records as the final result.<br />
<br />
=== Configuration ===<br />
The configuration properties are read either from the <tt>_parameters</tt> attribute of each record or from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''filterAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute to match<br />
|-<br />
|''filterExpression''<br />
|A string value<br />
|runtime<br />
|The regular expression to match the attribute value against<br />
|}<br />
<br />
=== Example === <br />
<br />
To get only those records into the <tt>textRecords</tt> BPEL variable that have a MIME type starting with <tt>text</tt>, something like this could be used:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeFilterPipelet"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FilterPipelet" /><br />
<proc:variables input="request" output="textRecords" /><br />
<proc:configuration><br />
<rec:Val key="filterAttribute">MimeType</rec:Val><br />
<rec:Val key="filterExpression">text/.+</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
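To illustrate why the example uses <tt>text/.+</tt> rather than just <tt>text</tt>, here is a small JavaScript sketch of the match. Note the assumption: the pipelet itself uses Java regular expressions, and whether it requires the whole attribute value to match (rather than a substring) is assumed here, not confirmed by this page.<br />
<br />
```javascript
// Assumption: the filter requires the whole attribute value to match the
// expression, so the expression is anchored before testing.
function matchesFilter(value, filterExpression) {
  return new RegExp("^(?:" + filterExpression + ")$").test(value);
}

console.log(matchesFilter("text/html", "text/.+")); // -> true
console.log(matchesFilter("image/png", "text/.+")); // -> false
```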
<br />
== org.eclipse.smila.processing.pipelets.HtmlToTextPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts plain text and metadata from an HTML document found in an attribute or attachment of each record and writes the results to configurable attributes or attachments.<br />
<br />
The pipelet uses the CyberNeko HTML parser [http://nekohtml.sourceforge.net/ NekoHTML] to parse HTML documents.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the HTML input is found in an attachment or in an attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|Defines whether the plain text should be stored in an attachment or in an attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|Name of input attachment or path to input attribute (process literals of attribute)<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
|Name of output attachment or path to output attribute for plain text (store result as literals of attribute)<br />
|-<br />
|''defaultEncoding''<br />
|String<br />
|runtime<br />
|Optional, default encoding to apply to documents when not specified in the documents themselves<br />
|-<br />
|''removeContentTags''<br />
|String<br />
|runtime<br />
|Comma-separated list of HTML tags (case-insensitive) for which the complete content should be removed from the resulting plain text. If not set, it defaults to ''"applet,frame,object,script,style"''. If the value is set, you must add the default tags explicitly to have their contents removed, too.<br />
|-<br />
|''meta:<name>''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><META></tt> tag with ''name="<name>"'' (case insensitive) to the attribute named as the value of the property. E.g. a property named ''"meta:author"'' with value "authors" causes the content attributes of <tt><META name="author" content="..."></tt> tags to be stored in the attribute ''authors'' of the respective record.<br />
|-<br />
|''tag:title''<br />
|String: attribute path<br />
|init<br />
|Store the content of the <tt><TITLE></tt> tag to the attribute named as the value of the property.<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration extracts plain text from the HTML document in attachment ''"html"'' and stores the results to the attribute ''"text"''. It removes the complete content of heading tags <tt><nowiki><h1>, ..., <h4></nowiki></tt>. In addition to that, it looks for <tt><meta></tt> tags with names ''"author"'', ''"keywords"'', and ''"title"'' and stores their contents in attributes ''"author"'', ''"keywords"'', and ''"title"'', respectively:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeHtml2Txt"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.HtmlToTextPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">html</rec:Val><br />
<rec:Val key="outputName">text</rec:Val><br />
<rec:Val key="defaultEncoding">UTF-8</rec:Val><br />
<rec:Val key="meta:author">author</rec:Val><br />
<rec:Val key="meta:keywords">keywords</rec:Val><br />
<rec:Val key="meta:title">title</rec:Val><br />
<rec:Val key="removeContentTags">h1,h2,h3,h4</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.CopyPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to copy or move attribute values to other attributes, or to copy or move a string value between attributes and/or attachments. It supports two execution modes:<br />
* COPY: copy the value from the input attribute/attachment to the output attribute/attachment <br />
* MOVE: same as COPY, but after that delete the value from the input attribute/attachment<br />
When an attribute is copied to another attribute, the type remains the same. When copying an attachment to an attribute, a string value is created by assuming that the attachment contains text in UTF-8 encoding. When copying an attribute value to an attachment, the attribute must be single-valued; its value is interpreted as a string and converted to a byte array using UTF-8 encoding.<br />
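<br />
The UTF-8 conversion rule can be sketched as follows (a stand-alone JavaScript illustration using Node's global TextEncoder/TextDecoder; the pipelet itself performs the equivalent conversion in Java):<br />
<br />
```javascript
// Attribute -> attachment: the single string value is encoded as UTF-8 bytes.
const bytes = new TextEncoder().encode("Grüße");

// Attachment -> attribute: the attachment bytes are decoded as UTF-8 text.
const text = new TextDecoder("utf-8").decode(bytes);

console.log(text); // "Grüße"
```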
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if the input is found in an attachment or attribute of the record<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|runtime<br />
|selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|runtime<br />
|name of input attachment or input attribute<br />
|-<br />
|''outputName''<br />
|String<br />
|runtime<br />
| name of output attachment or output attribute<br />
|-<br />
|''mode''<br />
|String : ''COPY, MOVE''<br />
|runtime<br />
| execution mode. Copy the value or move (copy and delete) the value. Default is COPY.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration shows how to copy the value of attachment 'Content' into the attribute 'TextContent':<br />
<br />
<source lang="xml"><br />
<!-- copy txt from attachment to attribute --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeCopyContent"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.CopyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="outputName">TextContent</rec:Val><br />
<rec:Val key="mode">COPY</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet ==<br />
<br />
=== Description ===<br />
<br />
Extracts literal values from an attribute that has a nested map. The attributes in the nested map can have nested maps themselves. To address an attribute in the nested structure, a path needs to be specified. The pipelet supports different execution modes: <br />
*FIRST: selects only the first literal of the specified attribute<br />
*LAST: selects only the last literal of the specified attribute<br />
*ALL_AS_LIST: selects all literal values of the specified attribute and returns a list<br />
*ALL_AS_ONE: selects all literal values of the specified attribute and concatenates them to a single string, using a separator (default is blank)<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
<b>Note</b>:<br />
If the maps on the path are nested in sequences, the pipelet uses the first element of such a sequence.<br />
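<br />
The path traversal and the ALL_AS_ONE mode described above can be sketched like this (a hypothetical stand-alone JavaScript illustration using plain objects; the pipelet actually operates on SMILA record metadata, not plain JSON):<br />
<br />
```javascript
const record = {
  Contents: [                        // a sequence: only the first element is used
    { Value: ["first part", "second part"] }
  ]
};

// Walk the '/'-separated path; when a sequence is encountered,
// descend into its first element, as the note above describes.
function extractAllAsOne(rec, path, separator = " ") {
  let node = rec;
  for (const step of path.split("/")) {
    if (Array.isArray(node)) node = node[0];
    node = node[step];
  }
  // ALL_AS_ONE: concatenate all literals with the separator.
  return Array.isArray(node) ? node.join(separator) : String(node);
}

console.log(extractAllAsOne(record, "Contents/Value")); // "first part second part"
```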
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputPath''<br />
|String<br />
|runtime<br />
|the path to the input attribute with Literals<br />
|-<br />
|''outputPath''<br />
|String<br />
|runtime<br />
|the name of the attribute in which to store the extracted value(s) as literals (currently only a top-level attribute name, not a path)<br />
|-<br />
|''mode''<br />
|String : ''FIRST, LAST, ALL_AS_LIST, ALL_AS_ONE''<br />
|runtime<br />
| execution mode. See above for details.<br />
|-<br />
|''separator''<br />
|String<br />
|runtime<br />
| the separation string used for mode ALL_AS_ONE. Default is a blank<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This configuration can be applied to records provided by the FeedAgent. It shows how to access the subattribute 'Value' of attribute 'Contents', concatenating all values into one:<br />
<br />
<source lang="xml"><br />
<!-- extract content --><br />
<extensionActivity><br />
<proc:invokePipelet name="extract content"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SubAttributeExtractorPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputPath">Contents/Value</rec:Val><br />
<rec:Val key="outputPath">Content</rec:Val><br />
<rec:Val key="mode">ALL_AS_ONE</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ReplacePipelet ==<br />
<br />
=== Description ===<br />
<br />
Searches for one or more patterns in the literal value of an attribute and replaces the occurrences found with the configured replacements. <br />
<br />
You can choose from different matching types:<br />
<br />
* ''entity'': Every pattern is matched against the whole attribute value (with respect to the ''ignoreCase'' property) and the first matching pattern defines the new value of the attribute. If no pattern matches, the result is the current value of the attribute.<br />
* ''substring'': All patterns that are part of the attribute value are replaced.<br />
* ''regexp'': Interprets all patterns as [http://en.wikipedia.org/wiki/Regular_expression regular expressions], see [http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html#replaceAll(java.lang.String) Matcher#replaceAll(String)].<br />
<br />
This pipelet works only on attributes, not on attachments!<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute that contains the literal to search in<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|runtime<br />
|the name of the attribute in which to store the result value as a string; defaults to the input attribute<br />
|-<br />
|''type''<br />
|String : ''entity'', ''substring'', ''regexp''<br />
|init<br />
|Identifies the type of the pattern, see above for details. Defaults to ''substring''.<br />
|-<br />
|''ignoreCase''<br />
|Boolean<br />
|init<br />
|indicates whether case is ignored when matching patterns; defaults to ''false''.<br />
|-<br />
|''mapping''<br />
|Map<br />
|init<br />
|A mapping of multiple patterns and replacements. Each key is a pattern and its value the replacement.<br />
|-<br />
|''pattern''<br />
|String<br />
|init<br />
|the pattern to apply to the literal value (see above for a description of possible types), required if no mapping is given<br />
|-<br />
|''replacement''<br />
|String<br />
|init<br />
|the substitution string used to replace all occurrences of the pattern, defaults to the empty string<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to map language ids to their label:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="set language label"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">Language</rec:Val><br />
<rec:Val key="outputAttribute">LanguageLabel</rec:Val><br />
<rec:Val key="type">entity</rec:Val><br />
<rec:Val key="ignoreCase" type="boolean">true</rec:Val><br />
<rec:Map key="mapping"><br />
<rec:Val key="de">German</rec:Val><br />
<rec:Val key="en">English</rec:Val><br />
<rec:Val key="es">Spanish</rec:Val><br />
<rec:Val key="fr">French</rec:Val><br />
...<br />
</rec:Map><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
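<br />
The ''entity'' matching with ''ignoreCase'' can be illustrated stand-alone like this (a hypothetical JavaScript sketch; the actual matching is done by the pipelet in Java):<br />
<br />
```javascript
const mapping = { de: "German", en: "English", es: "Spanish", fr: "French" };

// 'entity' mode: the whole value is matched against each pattern
// (case-insensitively here); the first matching pattern defines the
// new value. If no pattern matches, the current value is kept.
function mapEntity(value) {
  for (const [pattern, replacement] of Object.entries(mapping)) {
    if (pattern.toLowerCase() === value.toLowerCase()) return replacement;
  }
  return value;
}

console.log(mapEntity("DE")); // "German"
console.log(mapEntity("it")); // "it" (no match, value unchanged)
```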
<br />
This configuration can be used to cut the time information from a timestamp:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="cut time"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ReplacePipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttribute">ModificationTime</rec:Val><br />
<rec:Val key="outputAttribute">ModificationDate</rec:Val><br />
<rec:Val key="type">regexp</rec:Val><br />
<rec:Val key="pattern">[T ].*</rec:Val><br />
<rec:Val key="replacement"></rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
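<br />
What the "cut time" configuration does to a value can be illustrated with an equivalent stand-alone regular expression replacement (a JavaScript sketch; the pipelet uses Java's Matcher#replaceAll):<br />
<br />
```javascript
// The pattern "[T ].*" matches everything from the first 'T' or blank
// onward; replacing it with the empty string keeps only the date part.
const timestamp = "2014-11-28T10:31:07";
const date = timestamp.replace(/[T ].*/, "");

console.log(date); // "2014-11-28"
```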
<br />
== org.eclipse.smila.processing.pipelets.ScriptPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes a script for each record. <br />
<br />
Script execution is based on the [http://en.wikipedia.org/wiki/Scripting_for_the_Java_Platform Java Scripting API (JSR 223)], so any compatible scripting engine can be used. JavaScript is available out of the box and is the default script language.<br />
<br />
The context of the script will contain the following variables:<br />
* ''blackboard'': a reference to the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/blackboard/Blackboard.html blackboard]<br />
* ''id'': the ID of the current record<br />
* ''record'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/AnyMap.html metadata] of the current record<br />
* ''results'': a slightly modified version of a [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/util/ResultCollector.html result collector] that provides methods to add a new record id to the list of result ids (''results.addResult('...id...')'') and to drop the current record from the same list (''results.excludeCurrentRecord()'')<br />
* ''parameterAccessor'': the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/processing/parameters/ParameterAccessor.html ParameterAccessor] instance for access to the configuration (e.g. ''parameterAccessor.getParameterAny("configMap").asMap().getLongValue("longValue")'').<br />
<br />
Please be aware that the intention of this pipelet is to write pipelines fast, not to write fast pipelines: the script is parsed for every record. Don't use it in production environments where performance matters; use it to develop an algorithm that you can then put into [[SMILA/Development_Guidelines/How_to_write_a_Pipelet|your own pipelet]].<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''type''<br />
|String<br />
|init<br />
|the mime type of the scripting language, defaults to "text/javascript"<br />
|-<br />
|''scriptFile''<br />
|String<br />
|runtime<br />
|the path of the file that contains the script - modifications of this file are observed on every execution of the pipelet<br />
|-<br />
|''script''<br />
|String<br />
|init<br />
|The "inline" script, required unless ''scriptFile'' is specified (ignored in that case)<br />
|-<br />
|''resultAttribute''<br />
|String<br />
|runtime<br />
|The name of an attribute that will receive the result of the script (usually the result of the last expression)<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to concatenate the values of two attributes and save the result into a third one:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="create full name"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="script">record.get("firstName") + " " + record.get("lastName")</rec:Val><br />
<rec:Val key="resultAttribute">fullName</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
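<br />
The inline script above evaluates to the concatenated name, which the pipelet stores in ''resultAttribute''. Stand-alone, with a hypothetical object mimicking the record's get() accessor (in the pipelet, ''record'' is provided by the script context), the logic looks like this:<br />
<br />
```javascript
// Plain-object stand-in for the record metadata with a get() method.
const record = {
  data: { firstName: "Ada", lastName: "Lovelace" },
  get(key) { return this.data[key]; }
};

// The result of the last expression becomes the value of 'resultAttribute'.
const fullName = record.get("firstName") + " " + record.get("lastName");

console.log(fullName); // "Ada Lovelace"
```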
<br />
This configuration can be used to execute a JavaScript file located at $SMILA_PATH$/configuration/example.js:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="execute script"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ScriptPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="scriptFile">configuration/example.js</rec:Val> <br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.ExecPipelet ==<br />
<br />
=== Description ===<br />
<br />
Executes an external program for each record. <br />
<br />
This pipelet may be used to integrate native programs into the pipeline. <br />
<br />
'''Attention''': This pipelet may lead to security issues! Please be aware that although the executed command cannot be changed at runtime (this parameter is only evaluated at initialization time), it is possible to change the arguments and input of the command using values in the processed record. Every pipeline developer should ensure that only arguments in the expected value range are processed (especially if the program accepts files from the file system as arguments).<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''command''<br />
|String<br />
|init<br />
|The program to execute (including its path in the file system).<br />
|-<br />
|''directory''<br />
|String<br />
|runtime<br />
|The (optional) working directory for the command. The SMILA directory is used if not given.<br />
|-<br />
|''parameters''<br />
|Sequence of strings<br />
|runtime<br />
|The optional parameters given to the program (ignored if the attribute named by ''parametersAttribute'' exists).<br />
|-<br />
|''parametersAttribute''<br />
|String<br />
|runtime<br />
|The optional name of the attribute that contains the sequence of parameters given to the program.<br />
|-<br />
|''inputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that contains the bytes to send as input for the program.<br />
|-<br />
|''outputAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the standard output of the program.<br />
|-<br />
|''exitCodeAttribute''<br />
|String<br />
|runtime<br />
|The name of the attribute that is filled with the exit code of the program.<br />
|-<br />
|''errorAttachment''<br />
|String<br />
|runtime<br />
|The optional name of the attachment that is filled with the error output of the program.<br />
|-<br />
|''failOnError''<br />
|Either a boolean or a sequence of strings<br />
|runtime<br />
|Indicates whether to mark a record as failed if the program returns an error code: either a sequence of exit code ranges, or a boolean where "true" means that every exit code except 0 is an error. Defaults to false.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This configuration can be used to execute FFMPEG for transformation of an MP3 input file into a WAV output file:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="ConvertMP3"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.ExecPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="command">.../ffmpeg</rec:Val><br />
<rec:Seq key="parameters"><br />
<rec:Val>-i</rec:Val><br />
<rec:Val>.../example.mp3</rec:Val><br />
<rec:Val>-ar</rec:Val><br />
<rec:Val>16000</rec:Val><br />
<rec:Val>.../example.wav</rec:Val><br />
</rec:Seq><br />
<rec:Val key="failOnError" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet is used to identify the MIME type of a document. <br />
It uses an <tt>[[SMILA/Documentation/MimeTypeIdentifier| org.eclipse.smila.processing.pipelets.mimetype.MimeTypeIdentifier]]</tt> service to perform the actual identification of the MIME type. Depending on the specified properties, the MIME type is detected from the file content, from the file extension, or from both. If the identification does not return a MIME type - and if configured accordingly - the service will search the metadata for this information. The identified MIME type is then stored to an attribute in the record.<br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''FileExtensionAttribute''||String||init||Optional||Name of the attribute containing the file extension<br />
|-<br />
|''ContentAttachment''||String||init||Optional||Name of the attachment containing the file content<br />
|-<br />
|''MetaDataAttribute''||String||init||Optional||Name of the attribute containing metadata information, e.g. a Web Crawler returns a response header containing applicable MIME type information<br />
|-<br />
|''MimeTypeAttribute''||String||init||Required||Name of the attribute to store the identified MIME type to<br />
|}<br />
Note that at least one of the properties ''FileExtensionAttribute'', ''ContentAttachment'', and ''MetaDataAttribute'' must be specified!<br />
<br />
=== Example ===<br />
<br />
The following example is used in the SMILA example application to identify the MIME types of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect MimeType"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.MimeTypeIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="FileExtensionAttribute">Extension</rec:Val><br />
<rec:Val key="MetaDataAttribute">MetaData</rec:Val><br />
<rec:Val key="MimeTypeAttribute">MimeType</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<br />
== org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet ==<br />
<br />
=== Description ===<br />
This pipelet identifies the language of textual input and stores the returned ISO 639 language code to some target attribute. It uses an <tt>org.eclipse.smila.common.language.LanguageIdentifier</tt> service to perform the actual identification. If the identification does not return a language, the specified <tt>DefaultLanguage</tt> (or <tt>DefaultAlternativeName</tt>) is returned. If no defaults are specified, no value is set.<br />
<br />
The pipelet returns the detected language as an ISO 639 code. If you need different language tags in your application, the pipelet can produce<br />
an alternative language code according to a configurable mapping. To define such a mapping, create the file <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt>. The following shows an example mapping:<br />
<br />
<source lang="text"><br />
de=german<br />
en=english<br />
es=spanish<br />
fi=finnish<br />
fr=french<br />
</source><br />
<br />
The pipelet uses [http://tika.apache.org/ Apache Tika] technology for the actual language detection. <br />
<br />
=== Configuration ===<br />
<br />
The pipelet is configured using the <tt><configuration></tt> section inside the <tt><invokePipelet></tt> activity of the corresponding BPEL file. It provides the following properties:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Usage!!Description<br />
|-<br />
|''ContentAttribute''||String||runtime||Required||Name of the attribute containing the text whose language should be identified<br />
|-<br />
|''LanguageAttribute''||String||runtime||Optional||Name of the attribute to store the code of the identified language to<br />
|-<br />
|''DefaultLanguage''||String||runtime||Optional||Language code to set if no language could be detected. If not set and no language could be identified, the <tt>LanguageAttribute</tt> attribute remains empty.<br />
|-<br />
|''AlternativeNameAttribute''||String||runtime||Optional||Name of the attribute to store the alternative language code of the identified language to. The mapping defining this alternative code must be located in <tt>SMILA/configuration/org.eclipse.smila.tika/languageMapping.properties</tt> (see above).<br />
|-<br />
|''DefaultAlternativeName''||String||runtime||Optional||Alternative language code to set if no language could be detected. If not set and no language could be identified, the <tt>DefaultAlternativeName</tt> attribute remains empty. <br />
|-<br />
|''UseCertainLanguagesOnly''||Boolean||runtime||Optional||Boolean flag indicating whether to apply only those languages that were identified with a reasonable certainty (true) or all (false). Default is false.<br />
|}<br />
<br />
<br />
=== Example ===<br />
<br />
The following example could be used to identify the language of documents that are delivered by the File System Crawler or Web Crawler.<br />
<br />
'''addpipeline.bpel'''<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="detect Language"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.LanguageIdentifyPipelet" /><br />
<proc:variables input="request" output="request" /><br />
<proc:configuration><br />
<rec:Val key="ContentAttribute">Content</rec:Val><br />
<rec:Val key="LanguageAttribute">Language</rec:Val><br />
<rec:Val key="DefaultLanguage">de</rec:Val><br />
<rec:Val key="AlternativeNameAttribute">AltLanguage</rec:Val><br />
<rec:Val key="DefaultAlternativeName">german</rec:Val><br />
<rec:Val key="UseCertainLanguagesOnly">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to read content from a file and add it as an attachment.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the file to read from<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to store the content in<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
<source lang="xml"><br />
<!-- read from file and add attachment --><br />
<extensionActivity><br />
<proc:invokePipelet name="invokeReadFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.FileWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
This pipelet can be used to write the content of an attachment to a file.<br />
<br />
If the attachment does not exist, a warning is logged, but the record will not be dropped.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''pathAttribute''<br />
|String <br />
|runtime<br />
|The name of the attribute with the path of the target file<br />
|-<br />
|''contentAttachment''<br />
|String<br />
|runtime<br />
|The name of the attachment to write to the file <br />
|-<br />
|''append''<br />
|Boolean<br />
|runtime<br />
|Indicates whether to append the attachment to the file (if it already exists); defaults to false<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example saves all bytes of the attachment "content" to the file path that is contained in the attribute "path".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeFile"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">path</rec:Val><br />
<rec:Val key="contentAttachment">content</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.PushRecordsPipelet ==<br />
<br />
=== Description ===<br />
<br />
Sends all current records to another (asynchronous) job.<br />
<br />
The records are not removed from the pipeline, so subsequent pipelets in the current pipeline will process the records as well.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''job''<br />
|String <br />
|init<br />
|The name of the target job.<br />
|-<br />
|}<br />
<br />
=== Example ===<br />
<br />
This example sends all current records to the job "TheOtherJob".<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="callJob"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.PushRecordsPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="job">TheOtherJob</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONReaderPipelet ==<br />
<br />
=== Description ===<br />
<br />
Fills attributes of the record from a JSON string.<br />
<br />
It is not possible to overwrite the record id of the record, even if a key "_recordid" exists in the JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is found in an attachment or attribute of the record<br />
|-<br />
|''inputName''<br />
|String<br />
|init<br />
|name of the input attachment or input attribute that contains the JSON string<br />
|-<br />
|''outputAttribute''<br />
|String<br />
|init<br />
|the optional name of the attribute in the record into which the generated object is put. If no attribute is specified and the object is a map, all contained attributes are written to the current record.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
The following examples use this input object:<br />
<source lang="javascript"><br />
{ "jsonString": "{\"attribute1\": \"value1\"}" }<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the attribute "jsonObject":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
<rec:Val key="outputAttribute">jsonObject</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"jsonObject": { <br />
"attribute1": "value1"<br />
}<br />
}<br />
</source><br />
<br />
<br />
This example unwraps the contents of the attribute "jsonString" into the object itself:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="readJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONReaderPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="inputName">jsonString</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The result would be:<br />
<source lang="javascript"><br />
{ <br />
"jsonString": "{\"attribute1\": \"value1\"}",<br />
"attribute1": "value1"<br />
}<br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.JSONWriterPipelet ==<br />
<br />
=== Description ===<br />
<br />
Writes some or all attributes of the record into a JSON string.<br />
<br />
=== Configuration ===<br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''inputAttributes''<br />
|String/Sequence of String<br />
|init<br />
|the names of the attributes in the record that contain the objects to write as JSON. If nothing is given, the whole record is used. If a single string is given, the content of that attribute is used.<br />
|-<br />
|''outputType''<br />
|String : ''ATTACHMENT, ATTRIBUTE''<br />
|init<br />
|selects if the JSON string is written to an attachment or attribute of the record<br />
|-<br />
|''outputName''<br />
|String<br />
|init<br />
|name of the target attachment or attribute<br />
|-<br />
|''printPretty''<br />
|Boolean<br />
|init<br />
|Indicates whether to format the output for better readability; defaults to true.<br />
|-<br />
|}<br />
<br />
=== Examples ===<br />
<br />
This example writes the content of attribute "a1" into the attribute "value" without any whitespace:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="writeJSON"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="inputAttributes">a1</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="outputName">value</rec:Val><br />
<rec:Val key="printPretty" type="boolean">false</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
<source lang="javascript"><br />
input : { "a1": [ 1 ], "a2": 2 }<br />
result : { "a1": [ 1 ], "a2": 2, "value": "[1]" }<br />
</source><br />
<br />
This example appends the whole object to the file "records.log":<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONLogEntry"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.JSONWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">jsonLog</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="createJSONFileName"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.SetValuePipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="outputAttribute">jsonFile</rec:Val><br />
<rec:Val key="value">records.log</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
<extensionActivity><br />
<proc:invokePipelet name="appendToJSONLog"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" /><br />
<proc:variables input="request" /><br />
<proc:configuration><br />
<rec:Val key="pathAttribute">jsonFile</rec:Val><br />
<rec:Val key="contentAttachment">jsonLog</rec:Val><br />
<rec:Val key="append" type="boolean">true</rec:Val><br />
</proc:configuration><br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
== org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet ==<br />
<br />
Splits a single input document into multiple separate output records; think of a book and its pages as an example. The splitter copies the attribute values of the document to each output record. However, if an attribute exists both in the enclosing document and in a sub-record, the resulting record carries its own attribute value instead of the document's value.<br />
<br />
=== Configuration ===<br />
<br />
The configuration property is read from the pipelet configuration. <br />
<br />
{| border="1"<br />
!Property<br />
!Type<br />
!Read Type<br />
!Description<br />
|-<br />
|''partsAttribute''<br />
|A string value<br />
|runtime<br />
|The name of the attribute that contains the single parts (e.g. pages) of the document.<br />
|}<br />
<br />
=== Example === <br />
<br />
Imagine this example input to be split by the pipelet:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "document0.pdf",<br />
"author": "john maynard keynes",<br />
"subPages":<br />
[<br />
{<br />
"content": "public spending must be pro-cyclic",<br />
"author": "this page was written by adam smith"<br />
},<br />
{<br />
"content": "public spending must be anti-cyclic"<br />
}<br />
]<br />
}<br />
</source><br />
<br />
The configuration must then be as follows:<br />
<br />
<source lang="xml"><br />
<extensionActivity><br />
<proc:invokePipelet name="splitDocument"><br />
<proc:pipelet class="org.eclipse.smila.processing.pipelets.DocumentSplitterPipelet" /><br />
<proc:variables input="result" output="result" /><br />
<proc:configuration><br />
<rec:Val key="partsAttribute">subPages</rec:Val><br />
</proc:configuration> <br />
</proc:invokePipelet><br />
</extensionActivity><br />
</source><br />
<br />
The output will be two separate records.<br />
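For illustration, the merge semantics described above can be sketched in JavaScript (this is an illustrative sketch only, not SMILA source code; the record ids of the real output records are omitted because their format is not specified here):<br />
<br />
```javascript
// Illustrative sketch only -- not SMILA code. It reproduces how the
// splitter copies the document's attributes into each sub-record,
// letting values already present in a sub-record win on conflict.
const inputRecord = {
  "_recordid": "document0.pdf",
  "author": "john maynard keynes",
  "subPages": [
    {
      "content": "public spending must be pro-cyclic",
      "author": "this page was written by adam smith"
    },
    { "content": "public spending must be anti-cyclic" }
  ]
};

// Separate the parts attribute and the record id from the plain
// document attributes; only the latter are copied into each part.
const { subPages, _recordid, ...docAttributes } = inputRecord;
const outputRecords = subPages.map((page) => ({ ...docAttributes, ...page }));

console.log(JSON.stringify(outputRecords, null, 2));
```
<br />
Running this prints two records: the first keeps its own ''author'' value ("this page was written by adam smith"), while the second inherits the document's ''author'' ("john maynard keynes").<br />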
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/TikaPipelet&diff=338784SMILA/Documentation/TikaPipelet2013-06-03T13:15:38Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>== Bundle: <tt>org.eclipse.smila.tika</tt> ==<br />
<br />
=== Description ===<br />
<br />
The TikaPipelet converts various document formats (such as PDF, Microsoft Office, OpenOffice, etc.) to plain text using [[SMILA/Glossary#Tika|Tika]] technology: A record attachment containing the binary content can thus be converted to plain text and stored in an attribute. In addition to that, metadata properties of the document (like title, author, etc.) can be extracted and written to record attributes. To improve the Tika parsing process, it is possible to optionally pass the content-type and filename of the document stored in other record attributes via the parameters ''contentTypeAttribute'' and ''fileNameAttribute''. <br />
<br />
The TikaPipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in JobManager workflows, records causing errors are dropped.<br />
<br />
==== Supported document types ====<br />
<br />
By default, SMILA contains only a subset of Tika. Therefore, not all document formats can be converted out-of-the-box by using the TikaPipelet. However, it is easy to extend SMILA so that the TikaPipelet supports ''all'' document formats, see the [[SMILA/Documentation/TikaPipelet#Extending Tika | "Extending Tika"]] section below.<br />
<br />
{| border = 1<br />
!Document format!!supported out-of-the-box!!supported by using!!Hints<br />
|-<br />
|''Microsoft Office''||yes||TikaPipelet||---<br />
|-<br />
|''OpenOffice (OpenDocument formats)''||yes||TikaPipelet||---<br />
|-<br />
|''RTF''||yes||TikaPipelet||---<br />
|-<br />
|''Plain text''||yes||---||no conversion, given input text is used as "converted" text<br />
|-<br />
|''HTML/XML''||yes||[[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]|| [[ SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe|BoilerpipePipelet]] can also be used for HTML text extraction <br />
|-<br />
|''PDF''||no||[[SMILA/Documentation/TikaPipelet#Extending Tika|Tika extension]] || converted text will be empty with out-of-the-box SMILA, a warning will be written to the log<br />
|-<br />
|}<br />
<br />
As you can see, SMILA (or rather its 'AddPipeline', which is the default indexing pipeline) per default uses the TikaPipelet only for converting ''binary'' document formats. When indexing text-based documents, another pipelet ([[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]) is used. However, after [[SMILA/Documentation/TikaPipelet#Extending Tika | extending Tika]] this can be simplified by using the TikaPipelet for ''all'' document formats.<br />
<br />
=== Configuration ===<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Required!!Description<br />
|-<br />
|''inputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||yes||Selects if the input is found in an attachment or attribute of the record. Usually it doesn't make sense to use "ATTRIBUTE" here because the documents to convert are binary content.<br />
|-<br />
|''outputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||yes||Selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''||String||runtime||yes||Name of input attachment or path to input attribute (process a String literal of attribute)<br />
|-<br />
|''outputName''||String||runtime|| yes||Name of output attachment or path to output attribute for plain text (store result as String literal of attribute)<br />
|-<br />
|''extractProperties''||String||runtime||no||Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.<br />
|-<br />
|''contentTypeAttribute''||String||runtime||no||Parameter referencing the attribute that contains the content-type of the document. If specified the content-type is used to better guide the Tika parsing process. Tika also performs a MimeType detection and the resulting value is stored in this attribute.<br />
|-<br />
|''fileNameAttribute''||String||runtime||no||Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified the filename is used to better guide the Tika parsing process.<br />
|-<br />
|''exportAsHtml''||Boolean||runtime||no||Flag that specifies if the output should be in HTML format (true) or not (false). Plain text output (false) is default.<br />
|-<br />
|''pageBreak''||Boolean||runtime||no||Flag that specifies if page breaks should be used to split the content into multiple output records (true) or not (false). The recordId of the output records is generated by concatenating the recordId of the input record with the page number, separated by ''#'' (e.g. testdoc.pdf#1). This parameter is only interpreted if exportAsHtml is ''false''. Default is false.<br />
|-<br />
|''pageNumberAttribute''||String||runtime||no||Parameter that specifies the name of the attribute that should contain the extracted page number. This parameter is only interpreted if pageBreak is ''true''. If not set, the page number is not stored (default).<br />
|-<br />
|''partsAttribute''||String||runtime||no||Setting that enables the output of multi-segment entities. This switches the behaviour of the pipelet to emit page-broken input objects not as multiple output records but rather as one output record with multiple parts. The parts are represented as a sequence of maps in the output record. The value of the ''partsAttribute'' setting serves as the key for this sequence. Default is (not set).<br />
|-<br />
|''keepHyphens''||Boolean||runtime||no||If set to "false", hyphens are removed from words at line breaks so that the separated syllables are contracted to one word (“charac-<newline>teristics” becomes "characteristics"). If set to "true", this dehyphenation is disabled. Default is false.<br />
|-<br />
|''maxLength''||Long||runtime||no||The maximum number of characters to extract. If a document contains more characters than specified, all remaining characters are omitted. To get all available characters, just omit this parameter. Note that this may lead to out-of-memory errors with big documents. Default is -1 (unlimited).<br />
|-<br />
|}<br />
<br />
Some notes on "maxLength" in combination with other parameters: <br />
* If "exportAsHtml" is set to "true", the HTML tags are not counted when checking the limit, so the actual output will be longer than maxLength characters: The output creation stops when the "real" text content of the HTML reaches maxLength characters. After this, no additional tags are appended either. <br />
* The extracted text is trimmed, so the actual output can be shorter than maxLength characters because leading and trailing whitespace is removed.<br />
* When "keepHyphens" and "exportAsHtml" are set to "false", the actual output can be shorter than maxLength characters, because the hyphens and line breaks are removed from the limited output. With "exportAsHtml" set to "true", this effect will probably not be noticeable because the output usually gets longer due to the HTML tags.<br />
<br />
==== Using the multi-segment output record feature ====<br />
<br />
To turn on the multi-segment output use the following configuration (See below for a more complete example with more options set). <br />
<br />
<source lang="xml"><br />
<proc:configuration><br />
...<br />
<rec:Val key="pageBreak">true</rec:Val><br />
<rec:Val key="partsAttribute">pages</rec:Val><br />
... <br />
</proc:configuration><br />
</source><br />
<br />
The parameter ''partsAttribute'' is set and uses the value ''pages''. The output for an example PDF with three pages may look like this:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "file:/home/user/example.pdf",<br />
"filename": "example.pdf",<br />
"pages": <br />
[<br />
{<br />
"text": "this is the content of page 1.",<br />
"pageNo": "1"<br />
},<br />
{<br />
"text": "this is the content of page 2.",<br />
"pageNo": "2"<br />
},<br />
...<br />
],<br />
"_attachments": ["content"]<br />
}<br />
</source><br />
<br />
==== Configuring the Property Mapping ====<br />
<br />
In addition to the plain text content, Tika can extract metadata properties from documents, like title, author, publisher, dates of publication, etc. The names of these properties depend very much on the documents and what is actually extracted. Some well-known names like Dublin Core (dc, dcterms) are used. For a complete list, please refer to the [[SMILA/Glossary#Tika|Tika]] documentation. To check with your documents, you can download Tika and use the Tika Application to see all extracted metadata.<br />
<br />
To store such metadata properties in SMILA records, you must specify the names of the properties you want to store in the ''extractProperties'' parameter. Usually this parameter contains a sequence of maps. The map values have the following format:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Required!!Description<br />
|-<br />
|''metadataName''||String||runtime||yes||The name of the metadata property. This will be matched with the extracted metadata property names in a case-insensitive manner.<br />
|-<br />
|''targetAttribute''||String||runtime||no||The name of the record attribute to store the metadata value(s) in. If not set, the string provided in ''metadataName'' is used as the attribute name.<br />
|-<br />
|''singleResult''||Boolean||runtime||no|| Flag that specifies if only the first value (if multiple values exist) is used in the result (true) or if all values are used (false). Default is false.<br />
|-<br />
|''storeMode''||String ||runtime||no|| Specifies whether attributes already stored in the record target attribute will be left unchanged ("leave"), overwritten ("overwrite") or if the extracted properties will be added to potentially existing ones ("add"). Default is "add".<br />
|-<br />
|}<br />
<br />
===== Example =====<br />
<br />
The following example shows how to configure the pipelet to extract the text from the attachment called ''Content'' and store the extracted text in the attribute ''Text''. Additionally, the metadata properties Company, Creator, and Title, if present, will be stored in record attributes.<br />
<br />
For example, given a Word document with the value "ACME" as company and "John Doe" as creator, the resulting record would contain the plain text in the attribute <tt>Text</tt>, the value <tt>ACME</tt> in the attribute <tt>Company</tt>, and the value <tt>John Doe</tt> in the attribute <tt>Creator</tt>.<br />
<br />
<source lang="xml"><br />
<proc:configuration><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">Text</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="contentTypeAttribute">MimeType</rec:Val><br />
<rec:Val key="fileNameAttribute">FileName</rec:Val><br />
<rec:Val key="exportAsHtml">false</rec:Val><br />
<rec:Val key="pageBreak">false</rec:Val><br />
<rec:Val key="keepHyphens">false</rec:Val><br />
<rec:Val key="maxLength">100000</rec:Val><br />
<rec:Seq key="extractProperties"> <br />
<rec:Map> <br />
<rec:Val key="metadataName">company</rec:Val><br />
<rec:Val key="targetAttribute">Company</rec:Val> <br />
<rec:Val key="singleResult">false</rec:Val> <br />
</rec:Map><br />
<rec:Map> <br />
<rec:Val key="metadataName">creator</rec:Val><br />
<rec:Val key="targetAttribute">Creator</rec:Val> <br />
<rec:Val key="singleResult">false</rec:Val> <br />
</rec:Map><br />
<rec:Map> <br />
<rec:Val key="metadataName">title</rec:Val><br />
<rec:Val key="targetAttribute">Title</rec:Val> <br />
<rec:Val key="singleResult">true</rec:Val> <br />
</rec:Map> <br />
</rec:Seq><br />
</proc:configuration><br />
</source><br />
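To make the mapping semantics concrete, the following JavaScript sketch (illustrative only, not SMILA code; the Tika metadata values below are invented) shows how ''metadataName'' is matched case-insensitively against the extracted properties and how ''singleResult'' affects the stored value:<br />
<br />
```javascript
// Illustrative sketch only -- not SMILA code. The Tika metadata values
// below are invented for demonstration.
const tikaMetadata = {
  "Company": ["ACME"],
  "Creator": ["John Doe"],
  "Title": ["Annual Report", "Alternate Title"]
};

const extractProperties = [
  { metadataName: "company", targetAttribute: "Company", singleResult: false },
  { metadataName: "creator", targetAttribute: "Creator", singleResult: false },
  { metadataName: "title",   targetAttribute: "Title",   singleResult: true }
];

const record = {};
for (const mapping of extractProperties) {
  // metadataName is matched against Tika's property names case-insensitively.
  const match = Object.keys(tikaMetadata).find(
    (name) => name.toLowerCase() === mapping.metadataName.toLowerCase());
  if (match !== undefined) {
    const values = tikaMetadata[match];
    // singleResult=true keeps only the first value; otherwise all values.
    record[mapping.targetAttribute] = mapping.singleResult ? values[0] : values;
  }
}

console.log(record);
```
<br />
With ''storeMode'' left at its default ("add"), the extracted values would be added to any values already present in the target attributes.<br />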
<br />
===== Typical Property-Names =====<br />
* Generic<br />
**"contributor"<br />
**"coverage"<br />
**"creator"<br />
**"description"<br />
**"format"<br />
**"identifier"<br />
**"language"<br />
**"modified"<br />
**"publisher"<br />
**"relation"<br />
**"rights"<br />
**"source"<br />
**"subject"<br />
**"title"<br />
**"type"<br />
<br />
* MS Office<br />
**"Application-Name"<br />
**"Application-Version"<br />
**"Author"<br />
**"Category"<br />
**"Comments"<br />
**"Company"<br />
**"Content-Status"<br />
**"Edit-Time"<br />
**"Keywords"<br />
**"Last-Author"<br />
**"Manager"<br />
**"Notes"<br />
**"Presentation-Format"<br />
**"Revision-Number"<br />
**"Security"<br />
**"Template"<br />
**"Total-Time"<br />
**"custom:"<br />
**"Version"<br />
<br />
<br />
=== Extending Tika ===<br />
<br />
SMILA does not contain the complete Tika distribution, because some converters need third party libraries with licenses that we are not allowed to distribute. However, it is easy (and absolutely legal!) to include those parts of Tika into your SMILA installation yourself:<br />
<br />
* Download org.eclipse.smila.tika.deps bundle from [http://ubuntuone.com/1n9PNxx6akZ0X1Bc7ahYrm here]<br />
* Replace the corresponding bundle of your SMILA distribution by copying the downloaded bundle to the <tt><path-to-your-SMILA>/plugins</tt> folder.<br />
<br />
That's it! After a SMILA restart, all document formats supported by Tika will also be supported by SMILA's TikaPipelet.<br />
<br />
===== For Developers =====<br />
<br />
When working with SMILA in eclipse IDE:<br />
* Remove the <tt>org.eclipse.smila.tika.deps</tt> bundle from your workspace by deleting the project. (You can keep the project contents on disk.)<br />
* Put the downloaded org.eclipse.smila.tika.deps.jar in your <tt>SMILA.extensions</tt> project and reload your target platform.<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/TikaPipelet&diff=338783SMILA/Documentation/TikaPipelet2013-06-03T13:08:06Z<p>Marco.strack.empolis.com: </p>
Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/TikaPipelet&diff=338782SMILA/Documentation/TikaPipelet2013-06-03T12:50:14Z<p>Marco.strack.empolis.com: /* Using the multi-segment output record feature */</p>
<hr />
<div>== Bundle: <tt>org.eclipse.smila.tika</tt> ==<br />
<br />
=== Description ===<br />
<br />
The TikaPipelet converts various document formats (such as PDF, Microsoft Office, OpenOffice, etc.) to plain text using [[SMILA/Glossary#Tika|Tika]] technology: A record attachment containing the binary content can thus be converted to plain text and stored in an attribute. In addition to that, metadata properties of the document (like title, author, etc.) can be extracted and written to record attributes. To improve the Tika parsing process, it is possible to optionally pass the content-type and filename of the document stored in other record attributes via the parameters ''contentTypeAttribute'' and ''fileNameAttribute''. <br />
<br />
The TikaPipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in JobManager workflows, records causing errors are dropped.<br />
<br />
==== Supported document types ====<br />
<br />
By default, SMILA contains only a subset of Tika. Therefore, not all document formats can be converted out-of-the-box using the TikaPipelet. However, it's easy to extend SMILA so that the TikaPipelet supports ''all'' document formats; see the [[SMILA/Documentation/TikaPipelet#Extending Tika | "Extending Tika"]] section below.<br />
<br />
{| border = 1<br />
!Document format!!supported out-of-the-box!!supported by using!!Hints<br />
|-<br />
|''Microsoft Office''||yes||TikaPipelet||---<br />
|-<br />
|''OpenOffice (OpenDocument formats)''||yes||TikaPipelet||---<br />
|-<br />
|''RTF''||yes||TikaPipelet||---<br />
|-<br />
|''Plain text''||yes||---||no conversion, given input text is used as "converted" text<br />
|-<br />
|''HTML/XML''||yes||[[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]|| [[ SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe|BoilerpipePipelet]] can also be used for HTML text extraction <br />
|-<br />
|''PDF''||no||[[SMILA/Documentation/TikaPipelet#Extending Tika|Tika extension]] || converted text will be empty with out-of-the-box SMILA, a warning will be written to the log<br />
|-<br />
|}<br />
<br />
As you can see, SMILA (or rather its 'AddPipeline', which is the default indexing pipeline) by default uses the TikaPipelet only for converting ''binary'' document formats. When indexing text-based documents, another pipelet ([[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]) is used. However, after [[SMILA/Documentation/TikaPipelet#Extending Tika | extending Tika]], this can be simplified by using the TikaPipelet for ''all'' document formats.<br />
<br />
=== Configuration ===<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Required!!Description<br />
|-<br />
|''inputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||yes||Selects whether the input is found in an attachment or an attribute of the record. Usually it doesn't make sense to use "ATTRIBUTE" here, because the documents to convert are binary content.<br />
|-<br />
|''outputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||yes||Selects whether the output should be stored in an attachment or an attribute of the record.<br />
|-<br />
|''inputName''||String||runtime||yes||Name of the input attachment or path to the input attribute (processes the String literal of the attribute).<br />
|-<br />
|''outputName''||String||runtime||yes||Name of the output attachment or path to the output attribute for the plain text (stores the result as a String literal of the attribute).<br />
|-<br />
|''extractProperties''||String||runtime||no||Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.<br />
|-<br />
|''contentTypeAttribute''||String||runtime||no||Parameter referencing the attribute that contains the content-type of the document. If specified, the content-type is used to better guide the Tika parsing process. Tika also performs a MimeType detection and the resulting value is stored in this attribute.<br />
|-<br />
|''fileNameAttribute''||String||runtime||no||Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified, the filename is used to better guide the Tika parsing process.<br />
|-<br />
|''exportAsHtml''||Boolean||runtime||no||Flag that specifies if the output should be in HTML format (true) or not (false). Plain text output (false) is default.<br />
|-<br />
|''pageBreak''||Boolean||runtime||no||Flag that specifies whether page breaks should be used to split the content into multiple output records (true) or not (false). The recordId of each output record is generated by concatenating the recordId of the input record with the page number, separated by ''#'', e.g. <tt>testdoc.pdf#1</tt>. This parameter is only interpreted if exportAsHtml is ''false''. Default is (false).<br />
|-<br />
|''pageNumberAttribute''||String||runtime||no||Parameter that specifies the name of the attribute that should contain the extracted page number. This parameter is only interpreted if pageBreak is ''true''. If not set, the page number is not set (default).<br />
|-<br />
|''partsAttribute''||String||runtime||no||Setting that enables the output of multi-segment entities. This switches the behaviour of the pipelet to emit page-broken input objects not as multiple output records but rather as one output record with multiple parts. The parts are represented as a sequence of maps in the output record. The value of the ''partsAttribute'' setting serves as key for this sequence. Default is (not set).<br />
|-<br />
|''keepHyphens''||Boolean||runtime||no||If set to "false", hyphens are removed from words at line breaks so that the separated syllables are contracted to one word ("charac-<newline>teristics" becomes "characteristics"). If set to "true", this dehyphenation is disabled. Default is (false).<br />
|-<br />
|''maxLength''||Long||runtime||no||The maximum number of characters to extract. If a document contains more characters than specified, all remaining characters are omitted. To get all available characters, just omit this parameter; note that this may lead to out-of-memory errors with big documents. Default is -1 (unlimited).<br />
|-<br />
|}<br />
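To illustrate the ''pageBreak'' record-id scheme from the table above, here is a minimal sketch (plain Python, illustration only, not actual SMILA code; the function name is made up):<br />
<br />
```python
def page_record_ids(record_id, num_pages):
    """Illustrate the id scheme used when pageBreak=true: the input
    record id plus '#' plus the 1-based page number."""
    return ["%s#%d" % (record_id, page) for page in range(1, num_pages + 1)]

print(page_record_ids("testdoc.pdf", 3))
# ['testdoc.pdf#1', 'testdoc.pdf#2', 'testdoc.pdf#3']
```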
<br />
Some notes on "maxLength" in combination with other parameters: <br />
* If "exportAsHTML" is set to "true", the HTML tags will not be counted when checking the limit, so the actual output will be longer than maxLength characters: The output creation stops when the "real" text content of the HTML reaches maxLength characters. After this, no additional tags will be appended either. <br />
* The extracted text is "trimmed", so the actual output can be shorter than maxLength characters because leading and trailing whitespaces are removed.<br />
* When "keepHyphens" and "exportAsHTML" are set to "false", the actual output can be shorter than maxLength characters, because the hyphens and line breaks are removed from the limited output. With "exportAsHTML=true", this effect will probably not be noticeable, because the output usually gets longer due to the HTML tags.<br />
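The dehyphenation and trimming behaviour described in these notes can be sketched as follows (plain Python, illustration only; the real implementation inside Tika/SMILA may differ):<br />
<br />
```python
import re

def dehyphenate(text):
    # keepHyphens=false: drop a hyphen at a line break and join the
    # separated word halves ("charac-\nteristics" -> "characteristics").
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def limit_and_trim(text, max_length):
    # maxLength is applied to the extracted text, which is then trimmed,
    # so the result can end up shorter than max_length characters.
    return text[:max_length].strip()

print(dehyphenate("charac-\nteristics"))      # characteristics
print(limit_and_trim("  hello world  ", 7))   # hello
```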
<br />
==== Using the multi-segment output record feature ====<br />
<br />
Assuming the parameter ''partsAttribute'' is set and uses the value ''pages'', an example output record may look like this:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid": "file:/home/user/example.pdf",<br />
"filename": "example.pdf",<br />
"pages": <br />
[<br />
{<br />
"text": "this is the content of page 1.",<br />
"pageNo": "1"<br />
},<br />
{<br />
"text": "this is the content of page 2.",<br />
"pageNo": "2"<br />
},<br />
...<br />
],<br />
"_attachments": ["content"]<br />
}<br />
</source><br />
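Downstream processing can then iterate over this parts sequence like any other sequence attribute. A minimal sketch in plain Python (not SMILA code), assuming the record above has been parsed into a dictionary:<br />
<br />
```python
record = {
    "_recordid": "file:/home/user/example.pdf",
    "pages": [
        {"text": "this is the content of page 1.", "pageNo": "1"},
        {"text": "this is the content of page 2.", "pageNo": "2"},
    ],
}

# Join the per-page texts back into one full text, keeping page order.
full_text = " ".join(part["text"] for part in record["pages"])
print(full_text)
# this is the content of page 1. this is the content of page 2.
```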
<br />
==== Configuring the Property Mapping ====<br />
<br />
In addition to the plain text content, Tika can extract metadata properties from documents, like title, author, publisher, dates of publication, etc. The names of these properties depend very much on the documents and what is actually extracted. Some well-known names like Dublin Core (dc, dcterms) are used. For a complete list, please refer to the [[SMILA/Glossary#Tika|Tika]] documentation. To check what your documents yield, you can download Tika and use the Tika application to see all extracted metadata.<br />
<br />
To store such metadata properties in SMILA records, you must specify the names of the properties you want to store in the ''extractProperties'' parameter. Usually this parameter contains a sequence of maps. The map values have the following format:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Required!!Description<br />
|-<br />
|''metadataName''||String||runtime||yes||The name of the metadata property. This will be matched with the extracted metadata property names in a case-insensitive manner.<br />
|-<br />
|''targetAttribute''||String||runtime||no||The name of the record attribute to store the metadata value(s) in. If not set, the string provided in ''metadataName'' will be used as the attribute name.<br />
|-<br />
|''singleResult''||Boolean||runtime||no|| Flag that specifies if only the first value (if multiple values exist) is used in the result (true) or if all values are used (false). Default is false.<br />
|-<br />
|''storeMode''||String ||runtime||no|| Specifies whether attributes already stored in the record target attribute will be left unchanged ("leave"), overwritten ("overwrite") or if the extracted properties will be added to potentially existing ones ("add"). Default is "add".<br />
|-<br />
|}<br />
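The ''singleResult'' and ''storeMode'' options can be illustrated with a small sketch (plain Python, not SMILA code; the function name is invented, and the behaviour of "leave" on an empty target attribute is an assumption):<br />
<br />
```python
def map_property(existing, extracted, single_result=False, store_mode="add"):
    """Illustrate how extracted metadata values end up in the target
    attribute, depending on singleResult and storeMode."""
    if single_result:
        extracted = extracted[:1]          # keep only the first value
    if store_mode == "leave":
        # assumption: "leave" still fills an attribute that is empty
        return existing if existing else extracted
    if store_mode == "overwrite":
        return extracted
    return existing + extracted            # "add" (the default)

print(map_property(["old"], ["a", "b"]))                         # ['old', 'a', 'b']
print(map_property(["old"], ["a", "b"], store_mode="overwrite")) # ['a', 'b']
print(map_property(["old"], ["a", "b"], single_result=True,
                   store_mode="leave"))                          # ['old']
```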
<br />
==== Example ====<br />
<br />
The following example shows how to configure the pipelet to extract the text from the attachment called ''Content'' and store the extracted text in the attribute ''Text''. Additionally, the metadata properties Company, Creator, and Title will be stored in attributes, if they are present in the document.<br />
<br />
For example, if a Word document has the value "ACME" as company and "John Doe" as creator, the resulting record would contain the plain text in the attribute <tt>Text</tt>, the value <tt>ACME</tt> in the attribute <tt>Company</tt>, and the value <tt>John Doe</tt> in the attribute <tt>Creator</tt>.<br />
<br />
<source lang="xml"><br />
<proc:configuration><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">Text</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="contentTypeAttribute">MimeType</rec:Val><br />
<rec:Val key="fileNameAttribute">FileName</rec:Val><br />
<rec:Val key="exportAsHtml">false</rec:Val><br />
<rec:Val key="pageBreak">false</rec:Val><br />
<rec:Val key="keepHyphens">false</rec:Val><br />
<rec:Val key="maxLength">100000</rec:Val><br />
<rec:Seq key="extractProperties"> <br />
<rec:Map> <br />
<rec:Val key="metadataName">company</rec:Val><br />
<rec:Val key="targetAttribute">Company</rec:Val> <br />
<rec:Val key="singleResult">false</rec:Val> <br />
</rec:Map><br />
<rec:Map> <br />
<rec:Val key="metadataName">creator</rec:Val><br />
<rec:Val key="targetAttribute">Creator</rec:Val> <br />
<rec:Val key="singleResult">false</rec:Val> <br />
</rec:Map><br />
<rec:Map> <br />
<rec:Val key="metadataName">title</rec:Val><br />
<rec:Val key="targetAttribute">Title</rec:Val> <br />
<rec:Val key="singleResult">true</rec:Val> <br />
</rec:Map> <br />
</rec:Seq><br />
</proc:configuration><br />
</source><br />
<br />
==== Typical Property-Names ====<br />
* Generic<br />
**"contributor"<br />
**"coverage"<br />
**"creator"<br />
**"description"<br />
**"format"<br />
**"identifier"<br />
**"language"<br />
**"modified"<br />
**"publisher"<br />
**"relation"<br />
**"rights"<br />
**"source"<br />
**"subject"<br />
**"title"<br />
**"type"<br />
<br />
* MS Office<br />
**"Application-Name"<br />
**"Application-Version"<br />
**"Author"<br />
**"Category"<br />
**"Comments"<br />
**"Company"<br />
**"Content-Status"<br />
**"Edit-Time"<br />
**"Keywords"<br />
**"Last-Author"<br />
**"Manager"<br />
**"Notes"<br />
**"Presentation-Format"<br />
**"Revision-Number"<br />
**"Security"<br />
**"Template"<br />
**"Total-Time"<br />
**"custom:"<br />
**"Version"<br />
<br />
<br />
=== Extending Tika ===<br />
<br />
SMILA does not contain the complete Tika distribution, because some converters need third party libraries with licenses that we are not allowed to distribute. However, it is easy (and absolutely legal!) to include those parts of Tika into your SMILA installation yourself:<br />
<br />
* Download the <tt>org.eclipse.smila.tika.deps</tt> bundle from [http://ubuntuone.com/1n9PNxx6akZ0X1Bc7ahYrm here].<br />
* Replace the corresponding bundle of your SMILA distribution with the downloaded one by simply copying it to the <tt><path-to-your-SMILA>/plugins</tt> folder.<br />
<br />
That's it! After restarting SMILA, all document formats supported by Tika will also be supported by SMILA's TikaPipelet.<br />
<br />
===== For Developers =====<br />
<br />
When working with SMILA in the Eclipse IDE:<br />
* Remove the <tt>org.eclipse.smila.tika.deps</tt> bundle from your workspace by deleting the project. (You can keep the project contents.)<br />
* Put the downloaded <tt>org.eclipse.smila.tika.deps.jar</tt> in your <tt>SMILA.extensions</tt> project and reload your target platform.<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/TikaPipelet&diff=338778SMILA/Documentation/TikaPipelet2013-06-03T12:00:39Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>== Bundle: <tt>org.eclipse.smila.tika</tt> ==<br />
<br />
=== Description ===<br />
<br />
The TikaPipelet converts various document formats (such as PDF, Microsoft Office, OpenOffice, etc.) to plain text using [[SMILA/Glossary#Tika|Tika]] technology: A record attachment containing the binary content can thus be converted to plain text and stored in an attribute. In addition to that, metadata properties of the document (like title, author, etc.) can be extracted and written to record attributes. To improve the Tika parsing process, it is possible to optionally pass the content-type and filename of the document, stored in other record attributes, via the parameters ''contentTypeAttribute'' and ''fileNameAttribute''. <br />
<br />
The TikaPipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in JobManager workflows, records causing errors are dropped.<br />
<br />
==== Supported document types ====<br />
<br />
By default, SMILA contains only a subset of Tika. Therefore, not all document formats can be converted out-of-the-box using the TikaPipelet. However, it's easy to extend SMILA so that the TikaPipelet supports ''all'' document formats; see the [[SMILA/Documentation/TikaPipelet#Extending Tika | "Extending Tika"]] section below.<br />
<br />
{| border = 1<br />
!Document format!!supported out-of-the-box!!supported by using!!Hints<br />
|-<br />
|''Microsoft Office''||yes||TikaPipelet||---<br />
|-<br />
|''OpenOffice (OpenDocument formats)''||yes||TikaPipelet||---<br />
|-<br />
|''RTF''||yes||TikaPipelet||---<br />
|-<br />
|''Plain text''||yes||---||no conversion, given input text is used as "converted" text<br />
|-<br />
|''HTML/XML''||yes||[[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]|| [[ SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe|BoilerpipePipelet]] can also be used for HTML text extraction <br />
|-<br />
|''PDF''||no||[[SMILA/Documentation/TikaPipelet#Extending Tika|Tika extension]] || converted text will be empty with out-of-the-box SMILA, a warning will be written to the log<br />
|-<br />
|}<br />
<br />
As you can see, SMILA (or rather its 'AddPipeline', which is the default indexing pipeline) by default uses the TikaPipelet only for converting ''binary'' document formats. When indexing text-based documents, another pipelet ([[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]) is used. However, after [[SMILA/Documentation/TikaPipelet#Extending Tika | extending Tika]], this can be simplified by using the TikaPipelet for ''all'' document formats.<br />
<br />
=== Configuration ===<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Required!!Description<br />
|-<br />
|''inputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||yes||Selects if the input is found in an attachment or attribute of the record. Usually it doesn't make sense to use "ATTRIBUTE" here because the documents to convert are binary content.<br />
|-<br />
|''outputType''||String : ''ATTACHMENT, ATTRIBUTE''||runtime||yes||Selects if output should be stored in an attachment or attribute of the record<br />
|-<br />
|''inputName''||String||runtime||yes||Name of input attachment or path to input attribute (process a String literal of attribute)<br />
|-<br />
|''outputName''||String||runtime|| yes||Name of output attachment or path to output attribute for plain text (store result as String literal of attribute)<br />
|-<br />
|''extractProperties''||String||runtime||no||Specifies which metadata properties reported by Tika for the document should be written to which record attribute. See below for details.<br />
|-<br />
|''contentTypeAttribute''||String||runtime||no||Parameter referencing the attribute that contains the content-type of the document. If specified the content-type is used to better guide the Tika parsing process. Tika also performs a MimeType detection and the resulting value is stored in this attribute.<br />
|-<br />
|''fileNameAttribute''||String||runtime||no||Parameter referencing the attribute that contains the name of the file that was the source of the attachment content. If specified the filename is used to better guide the Tika parsing process.<br />
|-<br />
|''exportAsHtml''||Boolean||runtime||no||Flag that specifies if the output should be in HTML format (true) or not (false). Plain text output (false) is default.<br />
|-<br />
|''pageBreak''||Boolean||runtime||no||Flag that specifies if page breaks should be used to split the content into multiple output records (true) or not (false). The recordId of each output record is generated by concatenating the recordId of the input record with the page number, separated by ''#'', e.g. ''testdoc.pdf#1''. This parameter is only interpreted if exportAsHtml is ''false''. Default is false.<br />
|-<br />
|''pageNumberAttribute''||String||runtime||no||Parameter that specifies the name of the attribute that should contain the extracted page number. This parameter is only interpreted if pageBreak is ''true''. If not set, the page number is not stored (default).<br />
|-<br />
|''partsAttributeName''||String||runtime||no||Setting that enables the output of multi-segment records. This switches the behaviour of the pipelet to emit page-broken input objects not as multiple output records but rather as one output record with multiple parts. The parts are represented as a sequence of maps in the output record. The value of the ''partsAttributeName'' setting serves as key for this sequence. Default is not set.<br />
|-<br />
|''keepHyphens''||Boolean||runtime||no||If set to "false", hyphens are removed from words at line breaks so that the separated syllables are contracted to one word ("charac-<newline>teristics" becomes "characteristics"). If set to "true", this dehyphenation is disabled. Default is false.<br />
|-<br />
|''maxLength''||Long||runtime||no||The maximum number of characters to extract. If a document contains more characters than specified, all remaining characters are omitted. To get all available characters, just omit this parameter; note that this may lead to OutOfMemoryErrors with big documents. Default is -1 (unlimited).<br />
|-<br />
|}<br />
<br />
Some notes on "maxLength" in combination with other parameters: <br />
* If "exportAsHtml" is set to "true", the HTML tags will not be counted when checking the limit, so the actual output will be longer than maxLength characters: The output creation stops when the "real" text content of the HTML reaches maxLength characters. After this, no additional tags will be appended either. <br />
* The extracted text is "trimmed", so the actual output can be shorter than maxLength characters because leading and trailing whitespace is removed.<br />
* When "keepHyphens" and "exportAsHtml" are set to "false", the actual output can be shorter than maxLength characters, because the hyphens and line breaks are removed from the limited output. With "exportAsHtml=true", this effect will probably not be noticeable because the output will usually get longer due to the HTML tags.<br />
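For example, a configuration sketch that splits each document into one output record per page could look like this (''Content'', ''Text'' and ''PageNumber'' are example attribute names chosen here, not fixed by the pipelet):<br />
<br />
<source lang="xml"><br />
<proc:configuration><br />
  <rec:Val key="inputName">Content</rec:Val><br />
  <rec:Val key="inputType">ATTACHMENT</rec:Val><br />
  <rec:Val key="outputName">Text</rec:Val><br />
  <rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
  <rec:Val key="pageBreak">true</rec:Val><br />
  <rec:Val key="pageNumberAttribute">PageNumber</rec:Val><br />
</proc:configuration><br />
</source><br />
<br />
An input record with the ID ''testdoc.pdf'' and three pages would then yield the output records ''testdoc.pdf#1'', ''testdoc.pdf#2'' and ''testdoc.pdf#3'', each with its page number stored in the attribute ''PageNumber''.<br />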
<br />
==== Using the multi-segment output record feature ====<br />
<br />
Assuming the parameter ''partsAttributeName'' is set to the value ''pages'', an example output record may look like this (the attribute names inside the part maps depend on the rest of the pipelet configuration; here ''outputName'' is ''Text'' and ''pageNumberAttribute'' is ''PageNumber''):<br />
<br />
<source lang="javascript"><br />
{<br />
  "_recordid" : "web:http://example.org/something",<br />
  "_source" : "web",<br />
  "url" : "http://example.org/something",<br />
  "pages" :<br />
  [ {<br />
      "Text" : "Plain text extracted from the first page ...",<br />
      "PageNumber" : 1<br />
    },<br />
    {<br />
      "Text" : "Plain text extracted from the second page ...",<br />
      "PageNumber" : 2<br />
    } ],<br />
  "_attachments": ["content"]<br />
}<br />
</source><br />
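The multi-segment mode builds on page breaking, so a configuration sketch enabling it might combine ''pageBreak'' with ''partsAttributeName'' like this (attribute names are again only examples):<br />
<br />
<source lang="xml"><br />
<proc:configuration><br />
  <rec:Val key="inputName">Content</rec:Val><br />
  <rec:Val key="inputType">ATTACHMENT</rec:Val><br />
  <rec:Val key="outputName">Text</rec:Val><br />
  <rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
  <rec:Val key="pageBreak">true</rec:Val><br />
  <rec:Val key="pageNumberAttribute">PageNumber</rec:Val><br />
  <rec:Val key="partsAttributeName">pages</rec:Val><br />
</proc:configuration><br />
</source><br />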
<br />
==== Configuring the Property Mapping ====<br />
<br />
In addition to the plain text content, Tika can extract metadata properties from documents, like title, author, publisher, dates of publication, etc. The names of these properties depend very much on the documents and what is actually extracted. Some well-known names like Dublin Core (dc, dcterms) are used. For a complete list please refer to the [[SMILA/Glossary#Tika|Tika]] documentation. To check with your documents, you can download Tika and use the Tika Application to see all extracted metadata.<br />
<br />
To store such metadata properties in SMILA records, you must specify the names of the properties you want to store in the ''extractProperties'' parameter. Usually this parameter contains a sequence of maps. The map values have the following format:<br />
<br />
{| border = 1<br />
!Property!!Type!!Read Type!!Required!!Description<br />
|-<br />
|''metadataName''||String||runtime||yes||The name of the metadata property. This will be matched with the extracted metadata property names in a case-insensitive manner.<br />
|-<br />
|''targetAttribute''||String||runtime||no||The name of the record attribute to store the metadata value(s) in. If not set, the string provided in ''metadataName'' will be used as attribute name.<br />
|-<br />
|''singleResult''||Boolean||runtime||no|| Flag that specifies if only the first value (if multiple values exist) is used in the result (true) or if all values are used (false). Default is false.<br />
|-<br />
|''storeMode''||String ||runtime||no|| Specifies whether attributes already stored in the record target attribute will be left unchanged ("leave"), overwritten ("overwrite") or if the extracted properties will be added to potentially existing ones ("add"). Default is "add".<br />
|-<br />
|}<br />
<br />
==== Example ====<br />
<br />
The following example shows how to configure the pipelet to extract the text from the attachment called ''Content'' and store the extracted text in the attribute ''Text''. Additionally, any contained metadata properties Company, Creator and Title will be stored in attributes.<br />
<br />
E.g., if a Word document contains the value "ACME" as company and "John Doe" as creator, the resulting record would contain the plain text in the attribute <tt>Text</tt>, the value <tt>ACME</tt> in the attribute <tt>Company</tt>, as well as the value <tt>John Doe</tt> in the attribute <tt>Creator</tt>.<br />
<br />
<source lang="xml"><br />
<proc:configuration><br />
<rec:Val key="inputName">Content</rec:Val><br />
<rec:Val key="inputType">ATTACHMENT</rec:Val><br />
<rec:Val key="outputName">Text</rec:Val><br />
<rec:Val key="outputType">ATTRIBUTE</rec:Val><br />
<rec:Val key="contentTypeAttribute">MimeType</rec:Val><br />
<rec:Val key="fileNameAttribute">FileName</rec:Val><br />
<rec:Val key="exportAsHtml">false</rec:Val><br />
<rec:Val key="pageBreak">false</rec:Val><br />
<rec:Val key="keepHyphens">false</rec:Val><br />
<rec:Val key="maxLength">100000</rec:Val><br />
<rec:Seq key="extractProperties"> <br />
<rec:Map> <br />
<rec:Val key="metadataName">company</rec:Val><br />
<rec:Val key="targetAttribute">Company</rec:Val> <br />
<rec:Val key="singleResult">false</rec:Val> <br />
</rec:Map><br />
<rec:Map> <br />
<rec:Val key="metadataName">creator</rec:Val><br />
<rec:Val key="targetAttribute">Creator</rec:Val> <br />
<rec:Val key="singleResult">false</rec:Val> <br />
</rec:Map><br />
<rec:Map> <br />
<rec:Val key="metadataName">title</rec:Val><br />
<rec:Val key="targetAttribute">Title</rec:Val> <br />
<rec:Val key="singleResult">true</rec:Val> <br />
</rec:Map> <br />
</rec:Seq><br />
</proc:configuration><br />
</source><br />
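For a Word document as described above, the resulting record might then look like this in JSON (all values are made up for illustration):<br />
<br />
<source lang="javascript"><br />
{<br />
  "_recordid" : "file:test.doc",<br />
  "FileName" : "test.doc",<br />
  "MimeType" : "application/msword",<br />
  "Text" : "The plain text extracted from the document ...",<br />
  "Company" : "ACME",<br />
  "Creator" : "John Doe",<br />
  "Title" : "A test document"<br />
}<br />
</source><br />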
<br />
==== Typical Property-Names ====<br />
* Generic<br />
**"contributor"<br />
**"coverage"<br />
**"creator"<br />
**"description"<br />
**"format"<br />
**"identifier"<br />
**"language"<br />
**"modified"<br />
**"publisher"<br />
**"relation"<br />
**"rights"<br />
**"source"<br />
**"subject"<br />
**"title"<br />
**"type"<br />
<br />
* MS Office<br />
**"Application-Name"<br />
**"Application-Version"<br />
**"Author"<br />
**"Category"<br />
**"Comments"<br />
**"Company"<br />
**"Content-Status"<br />
**"Edit-Time"<br />
**"Keywords"<br />
**"Last-Author"<br />
**"Manager"<br />
**"Notes"<br />
**"Presentation-Format"<br />
**"Revision-Number"<br />
**"Security"<br />
**"Template"<br />
**"Total-Time"<br />
**"custom:"<br />
**"Version"<br />
<br />
<br />
=== Extending Tika ===<br />
<br />
SMILA does not contain the complete Tika distribution, because some converters need third party libraries with licenses that we are not allowed to distribute. However, it is easy (and absolutely legal!) to include those parts of Tika into your SMILA installation yourself:<br />
<br />
* Download org.eclipse.smila.tika.deps bundle from [http://ubuntuone.com/1n9PNxx6akZ0X1Bc7ahYrm here]<br />
* Replace the corresponding bundle of your SMILA distribution by simply copying the downloaded bundle to the <tt><path-to-your-SMILA>/plugins</tt> folder.<br />
<br />
That's it! After a SMILA restart, all document formats supported by Tika will also be supported by SMILA's TikaPipelet.<br />
<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/TikaPipelet&diff=338767SMILA/Documentation/TikaPipelet2013-06-03T09:42:58Z<p>Marco.strack.empolis.com: </p>
<hr />
<div>== Bundle: <tt>org.eclipse.smila.tika</tt> ==<br />
<br />
=== Description ===<br />
<br />
The TikaPipelet converts various document formats (such as PDF, Microsoft Office, OpenOffice, etc.) to plain text using [[SMILA/Glossary#Tika|Tika]] technology: A record attachment containing the binary content can thus be converted to plain text and stored in an attribute. In addition to that, metadata properties of the document (like title, author, etc.) can be extracted and written to record attributes. To improve the Tika parsing process, it is possible to optionally pass the content-type and filename of the document stored in other record attributes via the parameters ''contentTypeAttribute'' and ''fileNameAttribute''. <br />
<br />
The TikaPipelet supports the configurable error handling as described in [[SMILA/Development_Guidelines/How_to_write_a_Pipelet#Implementation]]. When used in JobManager workflows, records causing errors are dropped.<br />
<br />
==== Supported document types ====<br />
<br />
By default, SMILA contains only a subset of Tika. Therefore, not all document formats can be converted out-of-the-box using the TikaPipelet. However, it's easy to extend SMILA so that the TikaPipelet supports ''all'' document formats, see the [[SMILA/Documentation/TikaPipelet#Extending Tika | "Extending Tika"]] section below.<br />
<br />
{| border = 1<br />
!Document format!!supported out-of-the-box!!supported by using!!Hints<br />
|-<br />
|''Microsoft Office''||yes||TikaPipelet||---<br />
|-<br />
|''OpenOffice (OpenDocument formats)''||yes||TikaPipelet||---<br />
|-<br />
|''RTF''||yes||TikaPipelet||---<br />
|-<br />
|''Plain text''||yes||---||no conversion, given input text is used as "converted" text<br />
|-<br />
|''HTML/XML''||yes||[[SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.HtmlToTextPipelet|HtmlToTextPipelet]]|| [[ SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe|BoilerpipePipelet]] can also be used for HTML text extraction <br />
|-<br />
|''PDF''||no||[[SMILA/Documentation/TikaPipelet#Extending Tika|Tika extension]] || converted text will be empty with out-of-the-box SMILA, a warning will be written to the log<br />
|-<br />
|}<br />
<br />
[[Category:SMILA]] [[Category:SMILA/Pipelet]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/Data_Model_and_Serialization_Formats&diff=333600SMILA/Documentation/Data Model and Serialization Formats2013-04-11T12:23:29Z<p>Marco.strack.empolis.com: included ISO8601 time zone designators</p>
<hr />
<div>== SMILA Data Model == <br />
<br />
* Implementation bundle: <tt>org.eclipse.smila.datamodel</tt><br />
* Current Version: 1.0.0<br />
<br />
=== Concepts ===<br />
<br />
The data to be processed in SMILA is represented as '''records'''. For example, one record could correspond to one document or to any resource which should be indexed or found in a search. A record consists of '''metadata''' and optional '''attachments'''. <br />
<br />
[[Image:SMILA-datamodel-1.0.png|800px|SMILA data model version 1.0]]<br />
<br />
;Metadata:<br />
Metadata contains typed '''values''' (literals) arranged in '''maps''' (key-anything associations) and '''sequences''' (lists of anything). Values can be strings, long integers, double precision floating point numbers, booleans, dates (year, month, day) or datetimes (date + time of day, down to seconds). Maps and sequences can be nested arbitrarily, map keys are always strings. All metadata of one record is arranged in a single Map. <br />
;Attachments:<br />
Attachments can contain any binary content ("byte arrays"), possibly of larger size. Whether the content is kept in-memory or read from a persistence service on demand depends on the implementation of the interface. Currently, the size is limited to 2 GB (the maximum size of a Java <tt>byte[]</tt>), but we are planning to extend this in the future.<br />
<br />
A single entry in a record's metadata map is called '''Metadata element'''.<br />
Depending on the use case, metadata elements can be semantically interpreted as:<br />
<br />
;Attributes: Usually, attributes are used when referring to the metadata of an object which is to be processed from a given data source or which is retrieved as the result of a search request. For example, typical attributes characterizing a web page to be indexed are its URL, the size in bytes, the MIME type, the title, and the plain-text content. These attributes are defined by the application domain. <br />
<br />
;Parameters: Attributes may not be adequate or sufficient for all record types. For example, in search processing, a record represents not a single object from some data source but rather a search request object. In such a case, the record's metadata does not contain attributes from the application domain on the top level but rather ''request parameters'' that configure and influence the request execution. These parameters are defined by the pipelets used in the workflow that was triggered by the search request. Also, their names do not start with underscores. However, a request or result record may contain application-specific attributes on deeper nested levels. An example illustrating the difference between attributes and parameters can be found in the [[SMILA/Documentation/2011.Simplification/Search|Search API]] documentation.<br />
<br />
;Annotations: An annotation can be used to add a data structure to the record which was generated as the result of some processing step. E.g., a named-entity-recognition pipelet could add an annotation describing at which character position some entity was found, meaning that the record was ''annotated'' with this additional information. If annotations appear in the same maps as attributes, their names should be chosen in such a way that they will not conflict with attribute names from the application, e.g. by prefixing them with an underscore "_".<br />
<br />
;System attributes: These attributes are needed by SMILA in order to coordinate the processing of a record (see below). Their names start with an underscore "_", so that they will not conflict with names from the application domain.<br />
<br />
==== System attributes ====<br />
<br />
;RecordID: Every record must contain a single-valued string attribute named "_recordid" which is required to identify the record. It must be unique for all processed records. This must be ensured by whoever created and submitted the record to the system (usually crawlers or agents). There is no predefined format for the record ID, hence it can contain any string; creating UUIDs or something similar would be entirely sufficient. Also, the producer must place any information needed to access the original data from which the record was produced into explicitly named attributes.<br />
;Source: Every record should also contain a second system attribute named "_source" which contains the ID of the data source (e.g. crawler definition) that produced it. This is used by DeltaIndexing or RecordStorage to perform operations on all records from the same source.<br />
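Taken together, a minimal record containing only the system attributes would look like this in the JSON format described below (the source ID ''web'' is just an example):<br />
<br />
<source lang="javascript"><br />
{<br />
  "_recordid" : "web:http://example.org/something",<br />
  "_source" : "web"<br />
}<br />
</source><br />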
<br />
==== Date and DateTime formats ====<br />
<br />
Internally, date and datetime values are represented as instances of [http://download.oracle.com/javase/7/docs/api/java/util/Date.html <code>java.util.Date</code>], which means that they are stored as the number of milliseconds since January 1, 1970, 00:00:00 GMT. For the string serialization used in XML, JSON or BON (see below) the following rules apply:<br />
<br />
* The format of date values is "yyyy-MM-dd" (see [http://download.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html SimpleDateFormat] for the meaning of the format string). The year must have exactly 4 digits, the month and day must have 2 digits.<br />
* The format of datetime values is either "yyyy-MM-dd'T'HH:mm:ss<TZ>" or "yyyy-MM-dd'T'HH:mm:ss.SSS<TZ>", with "<TZ>" being either "X", "XX" or "XXX". <br />
** For the date part the date value rules apply.<br />
** Milliseconds are optional when parsing datetime values from strings, but if given, they must have exactly 3 digits.<br />
** The time zone information must be included and must conform to [http://en.wikipedia.org/wiki/ISO_8601#Time_zone_designators ISO 8601 time zone designators]: either "Z" for GMT/UTC/Zulu time, or one of the forms "[+-]hh", "[+-]hhmm", "[+-]hh:mm", denoting the offset from UTC. Examples would be "+0100" for Central European Time (CET, MEZ) or "-0500" for Eastern Standard Time (EST). Of course, using "+00", "+0000" or "+00:00" for GMT/UTC/Zulu time is fine, too. An exception is a negative sign with a zero offset (like "-00"), which is not a valid time zone in ISO 8601.<br />
** The default time zone designator format in SMILA is of the form "XX" which is represented by "Sign TwoDigitHours Minutes" (like "-0830"). <br />
** When a datetime value is created by parsing from a string (e.g. by parsing XML, JSON or BON, or using the [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/DataFactory.html <code>DataFactory.parseFromString</code>] methods), it will be printed in exactly the same way when it is serialized to XML, JSON or BON again (see [http://build.eclipse.org/rt/smila/javadoc/current/org/eclipse/smila/datamodel/ValueFormatHelper.html <code>ValueFormatHelper.getDefaultDateTimeFormat</code>]).<br />
** When a datetime value was created in Java directly from an instance of <code>java.util.Date</code>, it will be serialized using the default time zone of the creating JVM. The milliseconds will be included, too, even if they are just 000.<br />
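Putting the rules above together, the default datetime serialization corresponds to the JDK pattern "yyyy-MM-dd'T'HH:mm:ss.SSSXX". A small plain-Java sketch of parsing and re-serializing such a value (using only the JDK, not SMILA's own helper classes):<br />

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateTimeFormatDemo {
  public static void main(String[] args) throws ParseException {
    // Default SMILA serialization: milliseconds plus an "XX" time zone designator.
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXX");
    Date parsed = format.parse("2010-12-02T16:20:54.123+0100");

    // Internally a datetime is just milliseconds since 1970-01-01T00:00:00 GMT.
    System.out.println(parsed.getTime());

    // When serializing, the printed designator depends on the formatter's time zone.
    format.setTimeZone(TimeZone.getTimeZone("GMT+01:00"));
    System.out.println(format.format(parsed)); // prints 2010-12-02T16:20:54.123+0100
  }
}
```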
<br />
=== XML format ===<br />
<br />
The XML format of a record is designed to be quite compact:<br />
<br />
<source lang="xml"><br />
<Record xmlns="http://www.eclipse.org/smila/record" version="2.0"><br />
<Val key="_recordid">web:http://example.org/something</Val><br />
<Val key="_source">web</Val><br />
<Val key="url">http://example.org/something</Val><br />
<Val key="filesize" type="long">1234</Val><br />
<Val key="sizeInKb" type="double">1.2</Val><br />
<Val key="checked" type="boolean">true</Val><br />
<Val key="created" type="date">2010-12-02</Val><br />
<Val key="lastModified" type="datetime">2010-12-02T16:20:54.123+0100</Val><br />
<Seq key="trustee"><br />
<Val>group1</Val><br />
<Val>group2</Val><br />
</Seq><br />
<Seq key="author"><br />
<Map><br />
<Val key="firstname">John</Val><br />
<Val key="lastname">Doe</Val><br />
</Map><br />
<Map><br />
<Val key="firstname">Lisa</Val><br />
<Val key="lastname">Müller</Val><br />
</Map><br />
</Seq> <br />
<Map key="contact"><br />
<Val key="email">Homer.Simpson@powerplant.com</Val> <br />
<Map key="address"><br />
<Val key="street">742 Evergreen Terrace</Val><br />
<Val key="city">Springfield</Val><br />
</Map><br />
</Map><br />
<Seq key="emptylist" /><br />
<Map key="emptymap" /><br />
<br />
<Attachment>content</Attachment><br />
<Attachment>fulltext</Attachment><br />
</Record><br />
</source><br />
<br />
'''Notes:'''<br />
* The Any objects are represented by <tt><Val></tt>, <tt><Map></tt>, and <tt><Seq></tt> elements.<br />
* An object that is part of a map must have an additional ''key'' attribute. Elements of sequences must not have the ''key'' attribute.<br />
* The type of a value is defined by an optional ''type'' attribute, the default is "string".<br />
* See above for description of date and datetime formats.<br />
* The top-level <tt><Map></tt> element of a record is omitted from the XML.<br />
* Attachment values (the bytes themselves) are not included in the XML format; only the attachment names are preserved, so that a reader knows that there are attachments to be processed.<br />
<br />
See package <tt>org.eclipse.smila.datamodel.xml</tt> for serialization helper classes.<br />
<br />
{{Bug2|351704|Due to a bug in the JDK's default implementation of XMLStreamReader, you should only use XML version 1.0. When deserializing, either do not specify the XML declaration at all or use <source lang='xml'><?xml version="1.0" encoding="utf-8"?></source>}}<br />
<br />
=== JSON format ===<br />
<br />
The JSON format of a record looks like this:<br />
<br />
<source lang="javascript"><br />
{<br />
"_recordid" : "web:http://example.org/something",<br />
"_source" : "web",<br />
"url" : "web:http://example.org/something",<br />
"filesize" : 1234,<br />
"sizeInKb" : 1.2,<br />
"checked" : true,<br />
"created" : "2010-12-02",<br />
"lastModified" : "2010-12-02T16:20:54.123+0100",<br />
"trustee" : [ "group1", "group2" ],<br />
"author" : <br />
[ {<br />
"firstname" : "John",<br />
"lastname" : "Doe"<br />
},<br />
{<br />
"firstname" : "Lisa",<br />
"lastname" : "Müller"<br />
} ],<br />
"contact" : <br />
{<br />
"email" : "Homer.Simpson@powerplant.com",<br />
"address" : <br />
{<br />
"street" : "742 Evergreen Terrace",<br />
"city" : "Springfield"<br />
}<br />
},<br />
"_attachments": ["content", "fulltext"]<br />
}<br />
</source><br />
<br />
'''Notes:'''<br />
* Number value types are determined implicitly when parsing JSON:<br />
** If a number value can be parsed as a long integer, a long value will be created, else it will become a double value. <br />
* Date and DateTime values are not supported natively by JSON, so they are written as plain strings using the format rules described above. Conversely, when the JSON parser finds a string value that matches the date or datetime format, it creates a date or datetime value. The original string is preserved: accessing the value "as a string" returns it unchanged, and it is reused when the object is written to JSON (or BON or XML) again. This autodetection should therefore be harmless even for string values that happen to match the format but are not meant to be dates or datetimes.<br />
* Map keys are always strings and must be enclosed in quotes.<br />
* Attachments are not supported in the JSON format: only the names of attachments are preserved; the attachment values themselves (the bytes) are lost.<br />
<br />
See package <tt>org.eclipse.smila.datamodel.ipc</tt> for serialization helper classes.<br />
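To make the date/datetime autodetection concrete, here is a minimal sketch of the idea in Python. This is illustrative only, not SMILA's code: a detected temporal string is wrapped in a <tt>str</tt> subclass, so the original string is preserved and is exactly what gets written out again.

```python
import datetime

class TemporalString(str):
    """A string that was recognized as a date/datetime value.

    Being a str subclass, it keeps the original text, so serializing
    it again reproduces the input unchanged.
    """

# Try datetime format first, then plain date (formats are assumptions
# modeled on the examples above, not SMILA's exact rules).
_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%d")

def detect_value(s):
    for fmt in _FORMATS:
        try:
            datetime.datetime.strptime(s, fmt)
            return TemporalString(s)
        except ValueError:
            pass
    return s  # plain string, no temporal format matched
```

A JSON parser hook could run every string value through <tt>detect_value</tt>; values that merely look like dates stay byte-identical on a round-trip.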
<br />
=== BON Binary Object Notation Format ===<br />
<br />
==== Format introduction ====<br />
<br />
The format consists of a sequence of tokens and data with two different types of tokens:<br />
<br />
* Event tokens are single bytes which describe an event (e.g. OBJECT-START, SEQUENCE-START).<br />
* Data tokens are the first part of an entity.<br />
<br />
Every entity consists of up to three parts: the first part is a one-byte token that describes the following data type; for string types, this token is followed by length information (second part); the last part is the data itself (except for the boolean type, whose value is stored within the token).<br />
<br />
Integer values are stored in a compressed format. The sign and the integer length (number of bytes) are stored in the token byte. Strings are generally stored in UTF-8 format.<br />
<br />
The handling of date and datetime values is exactly as in JSON. See above for details.<br />
<br />
Attachments are fully supported.<br />
<br />
==== Scalar Types ==== <br />
<br />
The current release features the following scalar types:<br />
<br />
* Integer: [http://en.wikipedia.org/wiki/Integer_(computer_science) signed int64]<br />
** compressed, bytes are stored in [http://en.wikipedia.org/wiki/Network_byte_order#Endianness_in_networking network byte order (big endian)]<br />
** −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807<br />
* Floating point values:<br />
** double (8 bytes in network byte order IEEE format (java default))<br />
* Bool<br />
* String:<br />
** UTF-8 coded Text Strings, max 2^31-1 bytes<br />
<br />
==== Integer compressing ====<br />
<br />
The token bytes 0..15 define the sign of the number (0–7: positive, 8–15: negative) and the number of bytes needed to store it. The bytes are stored in network byte order.<br />
<br />
{|border=1<br />
|+Examples for integer compression<br />
! value !! token !! data<br />
|-<br />
|17 || 0 (positive, 1 byte) || 17 (0x 11)<br />
|-<br />
|-17 || 8 (negative, 1 byte) || 17 (0x 11)<br />
|-<br />
|17985 || 1 (positive, 2 bytes) || 0x 46 41<br />
|-<br />
|-9,876,543,210 || 12 (negative, 5 bytes) || 0x 02 4C B0 16 EA<br />
|}<br />
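The compression scheme above can be sketched in a few lines of Python. This is an illustrative re-implementation, not SMILA's code; it assumes the token byte is simply <tt>(byte count − 1)</tt>, plus 8 for negative values, with the magnitude following in network byte order.

```python
def encode_int(value):
    """Encode a signed integer as a BON-style token byte plus magnitude."""
    negative = value < 0
    magnitude = -value if negative else value
    # Minimal number of bytes needed for the magnitude (at least one).
    length = max(1, (magnitude.bit_length() + 7) // 8)
    token = (length - 1) + (8 if negative else 0)
    return bytes([token]) + magnitude.to_bytes(length, "big")

def decode_int(data):
    """Inverse of encode_int: token byte 0..15, then magnitude bytes."""
    token = data[0]
    magnitude = int.from_bytes(data[1:], "big")
    return -magnitude if token >= 8 else magnitude
```

For example, <tt>encode_int(-9876543210)</tt> produces token 12 (0x0C) followed by <tt>02 4C B0 16 EA</tt>, matching the table above.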
<br />
==== Binary Type ==== <br />
<br />
The Binary type is used for arbitrary binary content in attachments. A single binary is currently limited to a size of max 2^31-1 bytes.<br />
<br />
==== Token Bytes ====<br />
<br />
There are two different types of tokens. Here is a complete list of all tokens which are supported by the current release of the format:<br />
<br />
{|border=1<br />
|+ List of event tokens<br />
! token !! description !! byte<br />
|-<br />
|OBJECT-START||No version string||25<br />
|-<br />
|OBJECT-START||Followed by version (reserved, not implemented)||26<br />
|-<br />
|OBJECT-END|| ||28<br />
|-<br />
|SEQUENCE-START|| ||29<br />
|-<br />
|SEQUENCE-END|| ||30<br />
|-<br />
|MAPPING-START|| ||31<br />
|-<br />
|MAPPING-END|| ||32<br />
|-<br />
|ATTACHMENTS-START|| ||33<br />
|-<br />
|ATTACHMENTS-END|| ||34<br />
|-<br />
|CUSTOM-TYPE|| ||43<br />
|}<br />
<br />
{|border=1<br />
|+List of data tokens<br />
! token !! description !! byte<br />
|-<br />
|SCALAR-INT||positive, length 1||0<br />
|-<br />
| ||positive, length 2||1<br />
|-<br />
| ||...||...<br />
|-<br />
| ||positive, length 8||7<br />
|-<br />
| ||negative, length 1||8<br />
|-<br />
| ||negative, length 2||9<br />
|-<br />
| ||...||...<br />
|-<br />
| ||negative, length 8||15<br />
|-<br />
|SCALAR-BOOL||true||16<br />
|-<br />
| ||false||17<br />
|-<br />
|SCALAR-FLOAT||float (32 bit)||18 (reserved, not implemented)<br />
|-<br />
|SCALAR-FLOAT||double (64 bit)||19<br />
|-<br />
|SCALAR-FLOAT||long double (80 bit)||20 (reserved, not implemented)<br />
|-<br />
|SCALAR-STRING||1 length byte||21<br />
|-<br />
| ||2 length bytes||22<br />
|-<br />
| ||3 length bytes||23<br />
|-<br />
| ||4 length bytes||24<br />
|-<br />
|BINARY||length 1||35<br />
|-<br />
| ||length 2||36<br />
|-<br />
| ||length 3||37<br />
|-<br />
| ||length 4||38<br />
|-<br />
| ||length 5||39 (reserved, not implemented)<br />
|-<br />
| ||...||...<br />
|-<br />
| ||length 8||42 (reserved, not implemented)<br />
|}<br />
<br />
==== Backward compatible extension concept ====<br />
<br />
If we need a BON format extension (i.e. a new token), we pick an unused token number. Tokens 26 and 27 are reserved to store additional version information, but this is currently not implemented.<br />
<br />
==== Custom Type ====<br />
<br />
The token CUSTOM-TYPE (43), followed by a type identifier (token SCALAR-STRING/21 + string), marks the following map or sequence as a special custom type. Appropriate parser code can create corresponding user objects.<br />
<br />
There is no mapping of this BON feature to JSON. The representation in JSON is the same as if the CUSTOM-TYPE token and the following type identifier were skipped.<br />
<br />
A parser that does not know an appropriate custom type can therefore simply skip this extra information and continue.<br />
<br />
==== Examples ====<br />
<br />
===== Integer =====<br />
<br />
Sample integer value: '''-36364'''<br />
<br />
The BON representation:<br />
<br />
{|border=1<br />
!Value (decimal)!!Info!!Comment<br />
|-<br />
|9||SCALAR-INT||negative int value with 2 bytes length<br />
|-<br />
|36364||the int value without sign||<br />
|}<br />
<br />
and the hex representation: <br />
<source lang="text"><br />
09 8E 0C<br />
</source><br />
<br />
===== String =====<br />
<br />
Sample text: '''ähnlich'''<br />
<br />
The BON representation:<br />
<br />
{|border=1<br />
!Value (decimal)!!Info!!Comment<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|08||length info||the string follows<br />
|-<br />
|ähnlich||the string content||<br />
|}<br />
<br />
and the hex representation:<br />
<br />
<source lang="text"> <br />
15 08 c3 a4 68 6e 6c 69 63 68<br />
</source><br />
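The string example can likewise be sketched in Python (again illustrative, not SMILA's code). The sketch assumes the minimal number of length bytes is used (tokens 21–24 for 1–4 length bytes) and that the length bytes, like integers, are stored in network byte order.

```python
def encode_string(text):
    """Encode a string as a BON-style SCALAR-STRING entity."""
    data = text.encode("utf-8")  # strings are always UTF-8 encoded
    # Minimal number of length bytes; token 21 means one length byte.
    length_bytes = max(1, (len(data).bit_length() + 7) // 8)
    token = 21 + (length_bytes - 1)
    return bytes([token]) + len(data).to_bytes(length_bytes, "big") + data
```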
<br />
===== Complex example =====<br />
<br />
A complex example: This could be some text annotation or highlighting structure. The JSON representation is:<br />
<br />
<source lang="javascript"><br />
{<br />
"title": [<br />
["STEM","the",0,2],<br />
["STEM","title",4,8]<br />
]<br />
}<br />
</source><br />
<br />
{|border=1<br />
!Value (decimal)!!Info!!Comment<br />
|-<br />
|25||OBJECT-START|| "---" (here: without Type:version)<br />
|-<br />
|31||MAPPING-START||<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|5|| || length info of the string<br />
|-<br />
|title|| || the string content<br />
|-<br />
|29||SEQUENCE-START||start of the sequence "STEM,the,0,2"<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|4|| || length info for "STEM"<br />
|-<br />
|STEM|| ||the string content<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|3|| || length info for "the"<br />
|-<br />
|the|| || the string content<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|0|| || the INT value<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|2|| || the INT value<br />
|-<br />
|30||SEQUENCE-END||end of the sequence "STEM,the,0,2"<br />
|-<br />
|29||SEQUENCE-START||start of the sequence "STEM,title,4,8"<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|4|| || length info for "STEM"<br />
|-<br />
|STEM|| || the string content<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|5|| || length info for "title"<br />
|-<br />
|title|| || the string content<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|4|| || the INT value<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|8|| || the INT value<br />
|-<br />
|30||SEQUENCE-END||end of the sequence "STEM,title,4,8"<br />
|-<br />
|32||MAPPING-END||<br />
|-<br />
|28||OBJECT-END||<br />
|}<br />
<br />
<br />
===== Complex example with Custom Type =====<br />
<br />
If there is a custom type "text":<br />
<br />
{|border=1<br />
!Value (decimal)!!Info!!Comment<br />
|-<br />
|25||OBJECT-START|| "---" (here: without Type:version)<br />
|-<br />
|31||MAPPING-START||<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|5|| || length info of the string<br />
|-<br />
|title|| || the string content<br />
|-<br />
|43||CUSTOM-TYPE|| a custom type follows<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|4|| || length info for "text"<br />
|-<br />
|text|| ||the string content<br />
|-<br />
|29||SEQUENCE-START||start of the sequence "STEM,the,0,2"<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|4|| || length info for "STEM"<br />
|-<br />
|STEM|| ||the string content<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|3|| || length info for "the"<br />
|-<br />
|the|| || the string content<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|0|| || the INT value<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|2|| || the INT value<br />
|-<br />
|30||SEQUENCE-END||end of the sequence "STEM,the,0,2"<br />
|-<br />
|29||SEQUENCE-START||start of the sequence "STEM,title,4,8"<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|4|| || length info for "STEM"<br />
|-<br />
|STEM|| || the string content<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|5|| || length info for "title"<br />
|-<br />
|title|| || the string content<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|4|| || the INT value<br />
|-<br />
|0||SCALAR-INT (positive)||with one byte length<br />
|-<br />
|8|| || the INT value<br />
|-<br />
|30||SEQUENCE-END||end of the sequence "STEM,title,4,8"<br />
|-<br />
|32||MAPPING-END||<br />
|-<br />
|28||OBJECT-END||<br />
|}<br />
<br />
<br />
===== Complex example with attachments =====<br />
<br />
Another example with attachments: This could be some input record generated by a crawler (e.g. a mail crawler). The JSON representation is:<br />
<br />
<source lang="javascript"><br />
{<br />
"subject": "a test mail",<br />
"_attachments" : ["pdfFile", "zipFile"]<br />
}<br />
</source><br />
<br />
Note that "_attachments" is not a regular metadata field but contains the names of the attachments. Also note that the JSON representation does not contain the attachments themselves; it is shown here for documentation purposes only.<br />
<br />
<br />
{|border=1<br />
!Value (decimal)!!Info!!Comment<br />
|-<br />
|25||OBJECT-START|| "---" (here: without Type:version)<br />
|-<br />
|31||MAPPING-START||<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|7|| || length info for the string<br />
|-<br />
|subject|| || the string content<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|11|| || length info for "a test mail"<br />
|-<br />
|a test mail|| || the string content<br />
|-<br />
|32||MAPPING-END||<br />
|-<br />
|33||ATTACHMENTS-START||<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|7|| || length info for the string<br />
|-<br />
|pdfFile|| || the string content<br />
|-<br />
|35||BINARY|| binary with 1 byte length info<br />
|-<br />
|12345|| || length info for the binary content<br />
|-<br />
| 0x0815 .... || || the binary content<br />
|-<br />
|21||SCALAR-STRING||string with one byte length info<br />
|-<br />
|7|| || length info for the string<br />
|-<br />
|zipFile|| || the string content<br />
|-<br />
|35||BINARY|| binary with 1 byte length info<br />
|-<br />
|98765|| || length info for the binary content<br />
|-<br />
| 0x4711 .... || || the binary content<br />
|-<br />
|34||ATTACHMENTS-END||<br />
|-<br />
|28||OBJECT-END||<br />
|}<br />
<br />
=== Record Filters ===<br />
<br />
'''Record filters''' produce reduced copies of a record: A record filter has a name and contains a list of metadata element names. When applied to a record, it produces a copy of the record that contains only the elements of the list.<br />
<br />
Record filters are described in a simple XML format:<br />
<br />
<source lang="xml"><br />
<RecordFilters><br />
<Filter name="filter0" /><br />
<Filter name="filter1"><br />
<Element name="attribute" /><br />
</Filter><br />
<Filter name="filter3"><br />
<Element name="attribute1" /><br />
<Element name="attribute2" /><br />
<Element name="attribute3" /><br />
</Filter><br />
<Filter name="filter-all"><br />
<Element name="*" /><br />
</Filter><br />
</RecordFilters><br />
</source><br />
<br />
'''Notes:'''<br />
* A filter always copies the system elements "_recordid" and "_source". Therefore, the apparently empty "filter0" in this definition produces records that still contain these system elements.<br />
* A filter may contain an arbitrary number of element names. It is fine if an element does not appear in the record to copy; it is just ignored.<br />
* A filter always removes attachments: the "filter-all" in this definition produces a copy of the record with all metadata elements, but without attachments.<br />
<br />
Filters are usually applied by asking the blackboard for a filtered copy of the record's metadata. See Blackboard service API for details. To work with filters directly, see package <tt>org.eclipse.smila.datamodel.filter</tt> for utility classes.<br />
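The filter rules above can be modeled in a short Python sketch. This is an illustrative model, not the API of <tt>org.eclipse.smila.datamodel.filter</tt>; the record is represented as a plain dictionary with <tt>_attachments</tt> holding the attachment names.

```python
SYSTEM_ELEMENTS = ("_recordid", "_source")

def apply_filter(record, element_names):
    """Copy a record, keeping only system elements and listed elements.

    A "*" entry keeps all metadata elements; attachments are always dropped.
    """
    if "*" in element_names:
        return {k: v for k, v in record.items() if k != "_attachments"}
    return {k: v for k, v in record.items()
            if k in SYSTEM_ELEMENTS or k in element_names}

record = {
    "_recordid": "web:http://example.org/something",
    "_source": "web",
    "title": "Something",
    "filesize": 1234,
    "_attachments": ["content", "fulltext"],
}
# Like "filter0" above: empty element list -> only system elements survive.
filtered = apply_filter(record, [])
```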
<br />
[[Category:SMILA]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/HowTo/Howto_set_up_dev_environment&diff=333292SMILA/Documentation/HowTo/Howto set up dev environment2013-04-09T13:55:16Z<p>Marco.strack.empolis.com: clarified download for eclipse delta pack</p>
<hr />
<div><br> This HowTo describes the necessary steps for setting up a SMILA development environment. <br />
<br />
==== Preconditions ====<br />
<br />
Here is the list of things that you will definitely need for developing SMILA components: <br />
<br />
* JDK 1.7<br />
* Recent Eclipse SDK - This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/ Eclipse Classic SDK 4.2] (Juno Release) <br> <br />
<br />
==== Getting the source code ====<br />
<br />
There is more than one way of getting the code into your Eclipse workspace. The following sections will describe how to get the source code via SVN (recommended!). <br />
<br />
As an alternative, you could download the complete source code from the [http://www.eclipse.org/smila/downloads.php release download page] or the [http://build.eclipse.org/rt/smila/nightly/ nightly build downloads] and unpack the archive into your workspace. <br />
<br />
===== Installing SVN Provider =====<br />
''(skip this section if SVN Team Provider is already installed in your eclipse IDE)''<br />
<br />
* Install ''Subversive SVN Team Provider'' and ''Subversive SVN JDT Ignore Extensions'' from the Eclipse software repository.<br> <br />
* Restart Eclipse. <br />
* Select ''Window &gt; Preferences &gt; Team &gt; SVN''. This should open the ''Subversive Connector Discovery'' window. <br />
* Select the Subversive SVN Connector that you wish to use. We suggest taking the latest SVN Kit that is offered; at the time of writing, this was SVN Kit 1.3.5. <br />
<br />
===== Get source code from SVN =====<br />
<br />
There are two ways for this, automatically by using the ''Project Set File'' or manually. Both are described in the following: <br />
<br />
''Manually checking out and importing the projects into eclipse afterwards:''<br />
* Use your favorite SVN client (''except the eclipse SVN client'') to check out SMILA's source code from the repository located at:<br> <tt>https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core</tt>. If you later want to be able to build a SMILA distribution, all SMILA projects should be located in the same directory.<br />
:: <pre>svn co https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core</pre><br />
::'''Note:''' ''The upside of doing so is that you can easily get new projects just by updating your working copy and reimporting the sources into eclipse. Removed projects will be deleted on update. Eclipse will indicate this to the user by displaying an empty project.'' <br />
* Import all SMILA projects into your workspace: <br />
** Click ''File'' &gt; ''Import'' &gt; ''General'' &gt; ''Existing Projects into Workspace'' &gt; ''Next.'' <br />
** Select the folder that contains all SMILA projects --&gt; (all projects should be selected automatically) &gt; ''Finish''.<br />
<br />
''Automatic checkout and import by using the Project Set File:''<br />
* In eclipse, create an SVN repository location with URL <tt>https://dev.eclipse.org/svnroot/rt/org.eclipse.smila</tt><br />
* Checkout <tt>trunk/releng</tt> <br />
* Right click on <tt>SMILA.releng/devenv.SMILA-core.psf</tt><br />
* Click ''Import Project Set...'' and choose "No To All"<br />
::'''Hint:''' ''New projects should always be added to the .psf file so you can import them as before: right-click the .psf file and click "Import Project Set...". Be sure to answer "No To All" when asked whether to overwrite existing projects in the workspace; otherwise everything will be checked out again instead of ignoring the projects that are already checked out. If projects are removed, you have to remove them manually from the workspace; this cannot be handled via the .psf file.'' <br />
<br />
After you have imported the source code into your workspace, a lot of errors will show up. Don't worry, they'll disappear after the next steps below.<br />
<br />
==== Defining the target platform ====<br />
<br />
The target platform defines the set of bundles and features that you are developing against. SMILA ships a ''Target Definition File'' that you can open in your IDE to configure the target platform automatically. This file contains all the references needed for developing SMILA with Eclipse Juno (Release 4.2).<br />
<br />
===== Using the target platform provided by SMILA =====<br />
<br />
* Checkout <tt>../org.eclipse.smila/trunk/releng</tt> (''if you haven't already done before'')<br />
* Open the file <tt>SMILA.releng/devenv/SMILA.target</tt> with the ''Target Definition'' editor. <br>Eclipse starts downloading the referenced bundles/features, indicated by "Resolving Target Definition" in its status bar. Be patient, this can take quite a while. When it has finished, click the link "Set as Target Platform" at the top right of the ''Target Definition'' editor. Eclipse will then re-compile the sources, and all error markers should be gone when it finishes.<br />
<br />
===== Defining the target platform manually =====<br />
<br />
* Instead of using the target definition file provided by SMILA (see above) you can also [[SMILA/Development Guidelines/Howto set up target platform|manually set your own target platform]].<br />
<br />
==== Launching SMILA in Eclipse IDE ====<br />
<br />
If you've checked out SMILA's trunk correctly, you should have a project called '''SMILA.launch''' in your workspace. This project contains SMILA's launch configuration for the Eclipse IDE. To start SMILA directly in your Eclipse IDE, just follow the steps below: <br />
<br />
* Click <span style="font-style: italic;">Run</span>--&gt; ''Debug Configurations'' and expand '''''OSGI Framework'''''<b>.</b> <br />
* Select the ''SMILA'' launch file. <br />
* Click '''Debug'''. <br> If everything works fine, you will get an output in the '''Console''' view similar to the following:<br />
<br />
<source lang="text"><br />
osgi> Persistence bundle starting...<br />
ProviderTracker: New service detected...<br />
ProviderTracker: Added service org.eclipse.persistence.jpa.osgi.PersistenceProviderOSGi<br />
Persistence bundle started.<br />
[INFO ] Context /zookeeper: Registered handler(1) ZooKeeperAdminHandler, pattern /(.*)$<br />
[INFO ] Added worker webFetcher to WorkerManager.<br />
...<br />
[INFO ] HTTP server has SMILA handler RequestDispatcher for context /smila.<br />
[INFO ] HTTP server started successfully on port 8080.<br />
</source><br />
<br />
==== You're done ====<br />
<br />
Congratulations! You've just successfully checked out and configured your SMILA development environment and you can now start [[SMILA/Development Guidelines/Create a bundle (plug-in)|developing your own bundles]].<br />
<br />
==== Additional steps ====<br />
<br />
The following steps may be needed for special purposes. If you are a SMILA user who only wants to integrate an own component you won't need them. <br />
<br />
===== Delta Pack =====<br />
''(only needed for building the software outside of eclipse IDE)''<br />
<br />
For building the software you may need to add a "Delta Pack" to an Eclipse SDK installation. You can download it from [http://download.eclipse.org/eclipse/downloads/ here] by selecting the Eclipse version that you have in use. After downloading, copy the contained plugins and features into your Eclipse installation.<br />
<br />
===== Checkstyle configuration =====<br />
<br />
If you have the [http://eclipse-cs.sourceforge.net/ Eclipse Checkstyle plugin] installed, you will get a lot of error messages complaining about missing check configurations when Eclipse builds the workspace.<br />
(''Hint: For installing the Checkstyle plugin, use location: "http://eclipse-cs.sf.net/update/"'')<br />
<br />
<source lang="text"><br />
Errors running builder 'Checkstyle Builder' on project 'org.eclipse.smila.utils'.<br />
Fileset from project "org.eclipse.smila.utils" has no valid check configuration.<br />
...<br />
</source><br />
<br />
You can solve this by importing them: <br />
* Open ''Window -> Preferences'' and go to ''Checkstyle''.<br />
* Click ''New...'', enter <tt>SMILA Checkstyle</tt> as the name, click ''Import...'', and select ''SMILA.builder/checkstyle/smila_checkstyle-5.xml'' from your workspace. Click ''OK''.<br />
* Click ''New...'' again, enter <tt>SMILA Test Checkstyle</tt> as the name, click ''Import...'', and select ''SMILA.builder/checkstyle/smila-test_checkstyle-5.xml'' from your workspace. Click ''OK''.<br />
* Select <tt>SMILA Checkstyle</tt> and click ''Set as Default''.<br />
* Click ''OK''. <br> Now you should not get those error messages again.<br />
<br />
===== Enabling the BPEL Designer =====<br />
<br />
If you want to work with the SMILA extensions for Eclipse BPEL designer, you need to check out the bundles from <tt>trunk/tooling</tt>. Currently, the required bundles are: <br />
<br />
*<tt>org.eclipse.smila.processing.designer.model</tt> <br />
*<tt>org.eclipse.smila.processing.designer.ui</tt><br />
<br />
To compile them you need additional bundles from the [http://www.eclipse.org/bpel Eclipse BPEL Designer] in your target platform. See [[SMILA/BPEL Designer]] for more information.<br />
<br />
<br />
[[Category:SMILA]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/HowTo/Howto_build_a_SMILA-Distribution&diff=333262SMILA/Documentation/HowTo/Howto build a SMILA-Distribution2013-04-09T11:36:08Z<p>Marco.strack.empolis.com: Added remark to not recommend paths with softlinks in build config</p>
<hr />
<div>This HowTo describes how to build a SMILA distribution. <br />
<br />
==== Build Requirements ====<br />
<br />
The build process uses Eclipse's PDE Build tools to build all the bundles, run all unit tests, and create a ZIP archive with a complete SMILA application that can be installed and run independently from any development environment. To run this build process, you should first install the following software: <br />
<br />
*'''Eclipse SDK 4.2''' for your operating system: We recommend installing a fresh Eclipse instance independently from the one you might already be using and use this solely for the purpose of building SMILA. This makes sure that any potential additional Eclipse plugins installed on your existing installation won't interfere with the build process (this shouldn't happen, usually - but just to be safe). You can find the download on [http://download.eclipse.org/eclipse/downloads/ http://download.eclipse.org/eclipse/downloads/]. This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/ Eclipse Classic SDK 4.2].<br />
<br />
*'''DeltaPack''' matching your Eclipse version: The DeltaPack contains some additional bundles needed in the build, mainly for creating the SMILA executable for different platforms. You'll find the download on [http://download.eclipse.org/eclipse/downloads/ http://download.eclipse.org/eclipse/downloads/]. Install it by unpacking it into you Eclipse SDK installation. This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/#DeltaPack DeltaPack 4.2].<br />
<br />
*'''Sun Java Development Kit''': You need a full JDK, version 7, to build SMILA, not just a JRE. You can get it at [http://www.oracle.com/technetwork/java/javase/downloads/index.html]<br />
<br />
*'''Apache Ant''': The build process is executed by Ant, which you can download here: [http://ant.apache.org/ http://ant.apache.org/]. At least version 1.7 is needed (and tested).<br />
<br />
*'''Additional Libraries''' for building which are not included in SMILA repository. The build scripts assume the following directory structure for these libraries. You can either create this structure in your working copy of the SMILA repository next to all the SMILA bundles, or somewhere else on your hard disk and configure the build process to find them there (see below). For your convenience, we have put together a package with current versions of these libraries for download: [http://ubuntuone.com/312LdFKCXLYfapEpxLMR9v smila-build-libraries-1.2.0.zip]. The structure of this package is:<br />
<div style="margin-left: 1.5em"><br />
<source lang="text"><br />
lib/<br />
ant-contrib/<br />
ant-contrib-1.x.jar<br />
checkstyle/<br />
checkstyle-all-5.x.jar<br />
jacoco/<br />
jacocoagent.jar, jacocoant.jar<br />
pmd/<br />
asm-3.2.jar, jaxen-1.1.1.jar, pmd-4.3.jar<br />
xjc/<br />
InterfacesXJCPlugin.jar, jaxb-api.jar, jaxb-impl.jar, jaxb-xjc.jar<br />
</source> <br />
</div> <br />
**ant-contrib: This is required to run the build. You may download it from: [http://sourceforge.net/projects/ant-contrib/files/ant-contrib/1.0b3/ ant-contrib]. You can use the binary versions available there. (Tested with ant-contrib 1.0b3) <br />
**Furthermore our build process optionally generates reports for checkstyle, jacoco (code coverage) and pmd (static code analysis) if these libraries are present. The build is configured to run without these libraries and will just not create the respective reports, but everything else will be OK. To generate these reports you may download these files from: <br />
***[http://checkstyle.sourceforge.net/ checkstyle] (use Checkstyle 5.5 or higher, older versions will not handle Java-7 source code correctly). <br />
***[http://eclemma.org/jacoco/ jacoco]. (Tested with jacoco 0.5.7 and higher) <br />
***[http://pmd.sourceforge.net/ pmd]. (Tested with pmd 4.3) <br />
**xjc: is only needed if you need to generate JAXB classes from an XML schema, see [[SMILA/Development Guidelines/Setup for JAXB code generation]] for details.<br />
<br />
==== Configuring the Build ====<br />
<br />
The folder <tt>SMILA.builder</tt> contains everything needed to build SMILA and/or run all tests locally. The default settings are set to build against Eclipse 4.2 and build a product for Win 32bit and 64bit, Linux 32bit and 64bit as well as MacOS x86 64bit. But it is also possible to build other platforms. &nbsp; <br />
<br />
Whether you build from command line or from Eclipse, in both cases the <tt>make.xml</tt> ant script is executed. Before execution certain properties need to be set to meet the local setup. <br />
<br />
===== Setting the Target Build Platform =====<br />
<br />
''First, [[SMILA/Development Guidelines/Howto set up dev environment|set up a development environment]].'' When finished, copy the file <tt>SMILA.builder/build.properties.template</tt> to <tt>SMILA.builder/build.properties</tt> and adapt the copy: Add the platforms that you want to build as value triplets to the <tt>configs</tt> property and comment out or remove those that you don't need. The available platform triplets are:<br> <br />
<br />
{| border="1"<br />
|+ <br> <br />
|-<br />
! Windows 32bit <br />
| {{Codeblock|<pre>...<br />
configs=win32,win32,x86 <br />
# ... </pre><br />
}}<br />
|-<br />
! Windows 64bit <br />
| {{Codeblock|<pre>...<br />
configs=win32,win32,x86_64 <br />
# ... </pre><br />
}}<br />
|-<br />
! Linux 32bit <br />
| {{Codeblock|<pre>...<br />
configs=linux,gtk,x86<br />
# ... </pre><br />
}}<br />
|-<br />
! Linux 64bit <br />
| {{Codeblock|<pre>...<br />
configs=linux, gtk, x86_64<br />
# ... </pre><br />
}}<br />
|-<br />
! Solaris SPARC <br />
| {{Codeblock|<pre>...<br />
configs=solaris, gtk, sparc<br />
# ... </pre><br />
}}<br />
|}<br />
<br />
If you want to provide several distributions at once, e.g. one for Windows 32bit and one for Linux 32bit (default build plan), concatenate the platform triplets with the '&amp;' character:<br> <br />
<br />
{| border="1"<br />
|-<br />
! Example: <br />
| {{Codeblock|<pre>configs=win32, win32, x86 & \<br />
linux, gtk, x86 </pre><br />
}}<br />
|}<br />
<br />
The archive files of the application distribution are created in the <tt>Application</tt> directory below the specified build directory (see below). For each platform triplet in the <tt>configs</tt> property (<tt>$os, $ws, $arch</tt>) a ZIP file named <tt>SMILA-incubation-$os.$ws.$arch.zip</tt> is built. <br />
<br />
===== Setting Build Properties =====<br />
<br />
These are the main properties that can be used to configure the build process executed by <tt>make.xml</tt>. If you run the build from within Eclipse, you must add them to the Ant launch configuration (see [[#Executing_make.xml_from_within_Eclipse|Executing make.xml from within Eclipse]] below). For running from the command line, we have included templates that you can adapt to your local setup (see [[#Executing_make.xml_from_command_line|Executing make.xml from command line]] below). <br />
<br />
<br> Note: When using Linux, make sure not to use paths containing soft links, because they may not be correctly resolved during the build process. Use the fully qualified path instead. <br />
<br />
{| cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! Property <br />
! Default <br />
! Comment<br />
|-<br />
| <tt>buildDirectory</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/eclipse.build</tt> <br />
| Directory where built output will be created. This should be always a subdirectory of &lt;SMILA_HOME&gt;. The application distribution's ZIP files will be created in the subdirectory <tt>Application</tt> of this directory.<br />
|-<br />
| <tt>builder</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/SMILA.builder</tt> <br />
| Directory where <tt>make.xml</tt> is located.<br />
|-<br />
| <tt>eclipse.home</tt> <br />
| <tt>&lt;ECLIPSE_HOME&gt;</tt> <br />
| Location of the [[#Build_Requirements|Eclipse instance]] used to build SMILA.<br />
|-<br />
| <tt>lib.dir</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/lib</tt> <br />
| Location of the additional build libs (ant-contrib, etc.).<br />
|-<br />
| <tt>os</tt> <br />
| win32 <br />
| rowspan="3" | These properties merely control under which platform the tests will run. The combination must be one of the [[#Setting_the_Target_Build_Platform|target platforms]] you have built.<br />
|-<br />
| <tt>ws</tt> <br />
| win32<br />
|-<br />
| <tt>arch</tt> <br />
| x86<br />
|-<br />
| <tt>test.java.home</tt> <br />
| <tt>&lt;JAVA_HOME&gt;</tt> <br />
| A Java 1.7 SDK instance.<br />
|}<br />
<br />
==== Executing the make.xml ====<br />
<br />
The default target is <tt>all</tt>, building the application and running all unit tests. Note that this can take quite a while. To build the distribution archives only, use the targets <tt>clean</tt> and <tt>final-application</tt>. See [[SMILA/Development Guidelines/Introduction to make.xml|Introduction to make.xml]] for more details. <br />
<br />
===== Executing make.xml from within Eclipse =====<br />
<br />
Steps: <br />
<br />
#Select the <tt>SMILA.builder</tt> bundle. <br />
#Open the ''External Tools Configuration'' dialog (select ''Run -&gt; External Tools -&gt; External Tools Configuration''). <br />
#Create a new ''Ant Build'' configuration. <br />
#In the ''Buildfile'' field, enter: <tt>${workspace_loc:/SMILA.builder/make.xml}</tt>. <br />
#In the ''Base Directory'' field, enter: <tt>${workspace_loc:/SMILA.builder}</tt>. <br />
#Add all properties from [[#Setting_Build_Properties|above]] into the ''Arguments'' field (and adapt them to meet your setup) but prepend each with <tt>-D</tt> so each is passed into <tt>ant</tt> as a property (note that <tt>buildDirectory</tt> should be a subdirectory of your SMILA workspace directory), e.g. when using Eclipse 4.2 to build: <br />
#:-DbuildDirectory=D:/workspace/SMILA/eclipse.build <br />
#:-Declipse.home=D:/eclipse42 <br />
#:-Dbuilder=D:/workspace/SMILA/SMILA.builder <br />
#:-Declipse.running=true <br />
#:-Dos=win32 -Dws=win32 -Darch=x86 <br />
#:-Dlib.dir=D:/workspace/SMILA/lib <br />
#Apply, and run the Ant build.<br>'''Note:''' To start a target other than the default, select the targets of your choice on the ''Targets'' tab.<br />
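For reference, the same properties can be assembled for a plain command-line <tt>ant</tt> call. This is only a sketch: the paths are the example values from the steps above and must be adapted, and <tt>-Declipse.running=true</tt> is omitted because it is only relevant when launching from within Eclipse:

```shell
# Sketch: assembling the same -D properties for a plain command-line Ant
# call. All paths are the example values from the steps above (assumptions).
# The actual call (commented out below) requires a local Ant/Eclipse setup.
SMILA_HOME="D:/workspace/SMILA"
PROPS="-DbuildDirectory=$SMILA_HOME/eclipse.build \
-Declipse.home=D:/eclipse42 \
-Dbuilder=$SMILA_HOME/SMILA.builder \
-Dos=win32 -Dws=win32 -Darch=x86 \
-Dlib.dir=$SMILA_HOME/lib"
# ant -f make.xml $PROPS
echo "$PROPS"
```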
<br />
===== Executing make.xml from command line =====<br />
<br />
The <tt>make.bat</tt> and <tt>make.sh</tt> files are just shell scripts setting the properties that are needed for the Ant script. These files exist only as templates in SVN with <tt>.#~#~#</tt> appended to denote their template nature. Copy the one matching your system and rename it as you like; note that the names <tt>make.bat</tt> and <tt>make.sh</tt> are already on the svn:ignore list to prevent them from being committed accidentally, so it is recommended to use these names. <br />
<br />
Both scripts are very similar: they start by setting some environment variables, which are then used to create the build configuration properties and eventually fed into an Ant call. These are the variables you usually need to check and adapt: <br />
<br />
{| cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! Variable <br />
! Comment<br />
|-<br />
| <tt>SMILA_HOME </tt> <br />
| Location of your SVN working copy. It may be derived automatically in the <tt>.sh</tt> script; in the batch file, however, you must set it yourself.<br />
|-<br />
| <tt>ECLIPSE_HOME </tt> <br />
| Location of the [[#Build_Requirements|Eclipse instance]] used to build SMILA.<br />
|-<br />
| <tt>ARCH </tt> <br />
| Operating system and platform settings for running the tests. See description of <tt>os</tt>, <tt>ws</tt> and <tt>arch</tt> properties above.<br />
|-<br />
| <tt>JAVA_HOME </tt> <br />
| Location of the JDK to build and run tests in. Must match the <tt>ARCH</tt> setting. <br />
'''Tip:''' If your compile log complains about a non-1.6 compatible JVM despite the correct settings, you must also add the {{path|JAVA_HOME}} location to the {{Path|ECLIPSE_HOME/eclipse.ini}} like so:<br> {{Code|<br />
-vm<br/><br />
<JAVA_HOME>\bin}} <br />
<br />
|-<br />
| <tt>ANT_HOME </tt> <br />
| Location of your ANT installation.<br />
|-<br />
| <tt>BUILDLIB_DIR </tt> <br />
| Location of your additional build libraries. If you placed them in <tt>SMILA_HOME</tt>, you can leave this untouched and comment out the <tt>libDir</tt> setting in the next line.<br />
|-<br />
| <tt>buildOpts </tt> <br />
| Use the default <tt>buildOpts</tt> for Eclipse pdebuild or adapt them if you have another version installed.<br />
|}<br />
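A personal copy of <tt>make.sh</tt> might start like the following excerpt. Every value here is a placeholder, not taken from the actual template; the <tt>ARCH</tt> variable is left out because its exact format is defined by the template itself:

```shell
# Hypothetical excerpt of a personal make.sh copy. Every path below is a
# placeholder (assumption) and must be adapted to your machine.
SMILA_HOME=/home/dev/workspace/SMILA      # SVN working copy
ECLIPSE_HOME=/opt/eclipse42               # Eclipse instance used to build
JAVA_HOME=/usr/lib/jvm/java-7             # JDK matching the ARCH setting
ANT_HOME=/opt/apache-ant                  # Ant installation
BUILDLIB_DIR="$SMILA_HOME/lib"            # additional build libraries
echo "$BUILDLIB_DIR"
```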
<br />
Usually you don't need to change anything below the line setting <tt>buildOpts</tt>. <br />
<br />
To run a build with the default target (<tt>all</tt>), open a command prompt or shell in the <tt>SMILA.builder</tt> directory and just enter: <br />
<br />
<source lang="text"><br />
make<br />
</source> <br />
<br />
To execute a target other than the default, just pass it (or them) as an argument: <br />
<br />
<source lang="text"><br />
make build<br />
</source> <br />
<br />
For example, to build the application distribution ZIPs without running the tests (which can take quite long), you can use: <br />
<br />
<source lang="text"><br />
make clean final-application<br />
</source> <br />
<br />
On Windows you will not see much output in the command prompt window, because the batch file redirects it to a logfile (named <tt>log.make</tt> if the batch file is <tt>make.bat</tt>) so that you can check for error details after the build. You can install the [http://gnuwin32.sourceforge.net/ GnuWin32] or [http://www.cygwin.com/ Cygwin] tools and use <tt>tee</tt> to have the output written to both console and logfile. The template contains the changed Ant call as a comment near the end of the script. <br />
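The <tt>tee</tt> approach mentioned above can be sketched as follows. The <tt>echo</tt> stands in for the actual make call, and <tt>log.make</tt> matches the logfile name used by <tt>make.bat</tt>:

```shell
# Sketch of the tee approach: write build output to the console AND to the
# logfile at the same time. The echo stands in for the real call, e.g.:
#   make clean final-application 2>&1 | tee log.make
echo "BUILD SUCCESSFUL" | tee log.make
```

Afterwards the same output is visible on the console and in <tt>log.make</tt>.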
<br />
[[Category:SMILA]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/HowTo/Howto_build_a_SMILA-Distribution&diff=333226SMILA/Documentation/HowTo/Howto build a SMILA-Distribution2013-04-09T08:39:32Z<p>Marco.strack.empolis.com: updated ant-contrib link</p>
<hr />
<div>This HowTo describes how to build a SMILA distribution.<br />
<br />
==== Build Requirements ====<br />
<br />
The build process uses Eclipse's PDE Build tools to build all the bundles, run all unit tests, and create a ZIP archive with a complete SMILA application that can be installed and run independently from any development environment. To run this build process, you should first install the following software: <br />
<br />
*'''Eclipse SDK 4.2''' for your operating system: We recommend installing a fresh Eclipse instance independently from the one you might already be using and use this solely for the purpose of building SMILA. This makes sure that any potential additional Eclipse plugins installed on your existing installation won't interfere with the build process (this shouldn't happen, usually - but just to be safe). You can find the download on [http://download.eclipse.org/eclipse/downloads/ http://download.eclipse.org/eclipse/downloads/]. This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/ Eclipse Classic SDK 4.2].<br />
<br />
*'''DeltaPack''' matching your Eclipse version: The DeltaPack contains some additional bundles needed in the build, mainly for creating the SMILA executable for different platforms. You'll find the download on [http://download.eclipse.org/eclipse/downloads/ http://download.eclipse.org/eclipse/downloads/]. Install it by unpacking it into your Eclipse SDK installation. This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/#DeltaPack DeltaPack 4.2].<br />
<br />
*'''Sun Java Development Kit''': You need a full JDK, version 7, to build SMILA, not just a JRE. You can get it at [http://www.oracle.com/technetwork/java/javase/downloads/index.html]<br />
<br />
*'''Apache Ant''': The build process is executed by Ant, which you can download here: [http://ant.apache.org/ http://ant.apache.org/]. At least version 1.7 is needed (and tested).<br />
<br />
*'''Additional Libraries''' for building, which are not included in the SMILA repository. The build scripts assume the following directory structure for these libraries. You can either create this structure in your working copy of the SMILA repository next to all the SMILA bundles, or somewhere else on your hard disk and configure the build process to find them there (see below).<br />
<div style="margin-left: 1.5em"><br />
<source lang="text"><br />
lib/<br />
ant-contrib/<br />
ant-contrib-1.x.jar<br />
checkstyle/<br />
checkstyle-all-5.x.jar<br />
jacoco/<br />
jacocoagent.jar, jacocoant.jar<br />
pmd/<br />
asm-3.2.jar, jaxen-1.1.1.jar, pmd-4.3.jar<br />
</source> <br />
</div><br />
**ant-contrib: This is required to run the build. You may download it from: [http://sourceforge.net/projects/ant-contrib/files/ant-contrib/1.0b3/ ant-contrib]. You can use the binary versions available there. (Tested with ant-contrib 1.0b3)<br />
**Furthermore our build process optionally generates reports for checkstyle, jacoco (code coverage) and pmd (static code analysis) if these libraries are present. The build is configured to run without these libraries and will just not create the respective reports, but everything else will be OK. To generate these reports you may download these files from:<br />
***[http://checkstyle.sourceforge.net/ checkstyle] (use Checkstyle 5.5, older versions will not handle Java-7 source code correctly).<br />
***[http://eclemma.org/jacoco/ jacoco]. (Tested with jacoco 0.5.7 and higher)<br />
***[http://pmd.sourceforge.net/ pmd]. (Tested with pmd 4.3)<br />
<br />
==== Configuring the Build ====<br />
<br />
The folder <tt>SMILA.builder</tt> contains everything needed to build SMILA and/or run all tests locally. The default settings build against Eclipse 4.2 and produce a product for Windows 32bit and 64bit, Linux 32bit and 64bit, as well as MacOS x86 64bit, but it is also possible to build other platforms.<br />
<br />
Whether you build from the command line or from Eclipse, in both cases the <tt>make.xml</tt> Ant script is executed. Before execution, certain properties need to be set to match your local setup. <br />
<br />
===== Setting the Target Build Platform =====<br />
<br />
''First, [[SMILA/Development Guidelines/Howto set up dev environment|setup a development environment]].'' When finished copy the file <tt>SMILA.builder/build.properties.template</tt> to <tt>SMILA.builder/build.properties</tt> and adapt the copied file: Add the platforms that you want to build as value triplets to the <tt>configs</tt> property and comment out or remove those that you don't need. The available platform triplets are:<br> <br />
<br />
{| border="1"<br />
|+ <br> <br />
|-<br />
! Windows 32bit <br />
| {{Codeblock|<pre>...<br />
configs=win32,win32,x86 <br />
# ... </pre><br />
}}<br />
|-<br />
! Windows 64bit <br />
| {{Codeblock|<pre>...<br />
configs=win32,win32,x86_64 <br />
# ... </pre><br />
}}<br />
|-<br />
! Linux 32bit <br />
| {{Codeblock|<pre>...<br />
configs=linux,gtk,x86<br />
# ... </pre><br />
}}<br />
|-<br />
! Linux 64bit <br />
| {{Codeblock|<pre>...<br />
configs=linux, gtk, x86_64<br />
# ... </pre><br />
}}<br />
|-<br />
! Solaris SPARC <br />
| {{Codeblock|<pre>...<br />
configs=solaris, gtk, sparc<br />
# ... </pre><br />
}}<br />
|}<br />
<br />
If you want to provide several distributions at once, e.g. one for Windows 32bit and one for Linux 32bit (the default build plan), concatenate the platform triplets with the '&amp;' character:<br> <br />
<br />
{| border="1"<br />
|-<br />
! Example: <br />
| {{Codeblock|<pre>configs=win32, win32, x86 & \<br />
linux, gtk, x86 </pre><br />
}}<br />
|}<br />
<br />
The archive files of the application distribution are created in the <tt>Application</tt> directory below the specified build directory (see below). For each platform triplet in the <tt>configs</tt> property (<tt>$os, $ws, $arch</tt>) a ZIP file named <tt>SMILA-incubation-$os.$ws.$arch.zip</tt> is built.<br />
<br />
===== Setting Build Properties =====<br />
<br />
These are the main properties that can be used to configure the build process executed by <tt>make.xml</tt>. If you run the build from within Eclipse, you must add them to the Ant launch configuration (see [[#Executing_make.xml_from_within_Eclipse|Executing make.xml from within Eclipse]] below). For running from the command line, we have included templates that you can adapt to your local setup (see [[#Executing_make.xml_from_command_line|Executing make.xml from command line]] below). <br />
<br />
{| cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! Property <br />
! Default <br />
! Comment<br />
|-<br />
| <tt>buildDirectory</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/eclipse.build</tt> <br />
| Directory where built output will be created. This should be always a subdirectory of &lt;SMILA_HOME&gt;. The application distribution's ZIP files will be created in the subdirectory <tt>Application</tt> of this directory.<br />
|-<br />
| <tt>builder</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/SMILA.builder</tt> <br />
| Directory where <tt>make.xml</tt> is located.<br />
|-<br />
| <tt>eclipse.home</tt> <br />
| <tt>&lt;ECLIPSE_HOME&gt;</tt> <br />
| Location of the [[#Build_Requirements|Eclipse instance]] used to build SMILA.<br />
|-<br />
| <tt>lib.dir</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/lib</tt> <br />
| Location of the additional build libs (ant-contrib, etc.).<br />
|-<br />
| <tt>os</tt> <br />
| win32 <br />
| rowspan="3" | These properties merely control under which platform the tests will run. The combination must be one of the [[#Setting_the_Target_Build_Platform|target platforms]] you have built.<br />
|-<br />
| <tt>ws</tt> <br />
| win32<br />
|-<br />
| <tt>arch</tt> <br />
| x86<br />
|-<br />
| <tt>test.java.home</tt> <br />
| <tt>&lt;JAVA_HOME&gt;</tt> <br />
| A Java 1.7 SDK instance.<br />
|}<br />
<br />
==== Executing the make.xml ====<br />
<br />
The default target is <tt>all</tt>, building the application and running all unit tests. Note that this can take quite a while. To build the distribution archives only, use the targets <tt>clean</tt> and <tt>final-application</tt>. See [[SMILA/Development_Guidelines/Introduction_to_make.xml|Introduction to make.xml]] for more details.<br />
<br />
===== Executing make.xml from within Eclipse =====<br />
<br />
Steps: <br />
<br />
#Select the <tt>SMILA.builder</tt> bundle. <br />
#Open the ''External Tools Configuration'' dialog (select ''Run -&gt; External Tools -&gt; External Tools Configuration''). <br />
#Create a new ''Ant Build'' configuration. <br />
#In the ''Buildfile'' field, enter: <tt>${workspace_loc:/SMILA.builder/make.xml}</tt>. <br />
#In the ''Base Directory'' field, enter: <tt>${workspace_loc:/SMILA.builder}</tt>. <br />
#Add all properties from [[#Setting_Build_Properties|above]] into the ''Arguments'' field (and adapt them to meet your setup) but prepend each with <tt>-D</tt> so each is passed into <tt>ant</tt> as a property (note that <tt>buildDirectory</tt> should be a subdirectory of your SMILA workspace directory), e.g. when using Eclipse 4.2 to build: <br />
#:-DbuildDirectory=D:/workspace/SMILA/eclipse.build <br />
#:-Declipse.home=D:/eclipse42<br />
#:-Dbuilder=D:/workspace/SMILA/SMILA.builder <br />
#:-Declipse.running=true <br />
#:-Dos=win32 -Dws=win32 -Darch=x86 <br />
#:-Dlib.dir=D:/workspace/SMILA/lib <br />
#Apply, and run the Ant build.<br>'''Note:''' To start a target other than the default, select the targets of your choice on the ''Targets'' tab. <br />
<br />
===== Executing make.xml from command line =====<br />
<br />
The <tt>make.bat</tt> and <tt>make.sh</tt> files are just shell scripts setting the properties that are needed for the Ant script. These files exist only as templates in SVN with <tt>.#~#~#</tt> appended to denote their template nature. Copy the one matching your system and rename it as you like; note that the names <tt>make.bat</tt> and <tt>make.sh</tt> are already on the svn:ignore list to prevent them from being committed accidentally, so it is recommended to use these names. <br />
<br />
Both scripts are very similar: they start by setting some environment variables, which are then used to create the build configuration properties and eventually fed into an Ant call. These are the variables you usually need to check and adapt: <br />
<br />
{| cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! Variable <br />
! Comment<br />
|-<br />
| <tt>SMILA_HOME </tt><br />
| Location of your SVN working copy. It may be derived automatically in the <tt>.sh</tt> script; in the batch file, however, you must set it yourself.<br />
|-<br />
| <tt>ECLIPSE_HOME </tt><br />
| Location of the [[#Build_Requirements|Eclipse instance]] used to build SMILA.<br />
|-<br />
| <tt>ARCH </tt><br />
| Operating system and platform settings for running the tests. See description of <tt>os</tt>, <tt>ws</tt> and <tt>arch</tt> properties above.<br />
|-<br />
| <tt>JAVA_HOME </tt><br />
| Location of the JDK to build and run tests in. Must match the <tt>ARCH</tt> setting. <br />
'''Tip:''' If your compile log complains about a non-1.6 compatible JVM despite the correct settings, you must also add the {{path|JAVA_HOME}} location to the {{Path|ECLIPSE_HOME/eclipse.ini}} like so:<br> {{Code|<br />
-vm<br/><br />
<JAVA_HOME>\bin}} <br />
<br />
|-<br />
| <tt>ANT_HOME </tt><br />
| Location of your ANT installation.<br />
|-<br />
| <tt>BUILDLIB_DIR </tt><br />
| Location of your additional build libraries. If you placed them in <tt>SMILA_HOME</tt>, you can leave this untouched and comment out the <tt>libDir</tt> setting in the next line.<br />
|-<br />
| <tt>buildOpts </tt><br />
| Use the default <tt>buildOpts</tt> for Eclipse pdebuild or adapt them if you have another version installed.<br />
|}<br />
<br />
Usually you don't need to change anything below the line setting <tt>buildOpts</tt>. <br />
<br />
To run a build with the default target (<tt>all</tt>), open a command prompt or shell in the <tt>SMILA.builder</tt> directory and just enter: <br />
<br />
<source lang="text"><br />
make<br />
</source> <br />
<br />
To execute a target other than the default, just pass it (or them) as an argument: <br />
<br />
<source lang="text"><br />
make build<br />
</source> <br />
<br />
For example, to build the application distribution ZIPs without running the tests (which can take quite long), you can use: <br />
<br />
<source lang="text"><br />
make clean final-application<br />
</source> <br />
<br />
On Windows you will not see much output in the command prompt window, because the batch file redirects it to a logfile (named <tt>log.make</tt> if the batch file is <tt>make.bat</tt>) so that you can check for error details after the build. You can install the [http://gnuwin32.sourceforge.net/ GnuWin32] or [http://www.cygwin.com/ Cygwin] tools and use <tt>tee</tt> to have the output written to both console and logfile. The template contains the changed Ant call as a comment near the end of the script. <br />
<br />
[[Category:SMILA]]</div>Marco.strack.empolis.comhttps://wiki.eclipse.org/index.php?title=SMILA/Documentation/HowTo/Howto_build_a_SMILA-Distribution&diff=333087SMILA/Documentation/HowTo/Howto build a SMILA-Distribution2013-04-08T13:44:38Z<p>Marco.strack.empolis.com: Replaced minor versions with placeholders</p>
<hr />
<div>This HowTo describes how to build a SMILA distribution.<br />
<br />
==== Build Requirements ====<br />
<br />
The build process uses Eclipse's PDE Build tools to build all the bundles, run all unit tests, and create a ZIP archive with a complete SMILA application that can be installed and run independently from any development environment. To run this build process, you should first install the following software: <br />
<br />
*'''Eclipse SDK 4.2''' for your operating system: We recommend installing a fresh Eclipse instance independently from the one you might already be using and use this solely for the purpose of building SMILA. This makes sure that any potential additional Eclipse plugins installed on your existing installation won't interfere with the build process (this shouldn't happen, usually - but just to be safe). You can find the download on [http://download.eclipse.org/eclipse/downloads/ http://download.eclipse.org/eclipse/downloads/]. This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/ Eclipse Classic SDK 4.2].<br />
<br />
*'''DeltaPack''' matching your Eclipse version: The DeltaPack contains some additional bundles needed in the build, mainly for creating the SMILA executable for different platforms. You'll find the download on [http://download.eclipse.org/eclipse/downloads/ http://download.eclipse.org/eclipse/downloads/]. Install it by unpacking it into your Eclipse SDK installation. This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/#DeltaPack DeltaPack 4.2].<br />
<br />
*'''Sun Java Development Kit''': You need a full JDK, version 7, to build SMILA, not just a JRE. You can get it at [http://www.oracle.com/technetwork/java/javase/downloads/index.html]<br />
<br />
*'''Apache Ant''': The build process is executed by Ant, which you can download here: [http://ant.apache.org/ http://ant.apache.org/]. At least version 1.7 is needed (and tested).<br />
<br />
*'''Additional Libraries''' for building, which are not included in the SMILA repository. The build scripts assume the following directory structure for these libraries. You can either create this structure in your working copy of the SMILA repository next to all the SMILA bundles, or somewhere else on your hard disk and configure the build process to find them there (see below).<br />
<div style="margin-left: 1.5em"><br />
<source lang="text"><br />
lib/<br />
ant-contrib/<br />
ant-contrib-1.x.jar<br />
checkstyle/<br />
checkstyle-all-5.x.jar<br />
jacoco/<br />
jacocoagent.jar, jacocoant.jar<br />
pmd/<br />
asm-3.2.jar, jaxen-1.1.1.jar, pmd-4.3.jar<br />
</source> <br />
</div><br />
**ant-contrib: This is required to run the build. You may download it from: [http://sourceforge.net/projects/ant-contrib ant-contrib]. (Tested with ant-contrib 1.0b3)<br />
**Furthermore our build process optionally generates reports for checkstyle, jacoco (code coverage) and pmd (static code analysis) if these libraries are present. The build is configured to run without these libraries and will just not create the respective reports, but everything else will be OK. To generate these reports you may download these files from:<br />
***[http://checkstyle.sourceforge.net/ checkstyle] (use Checkstyle 5.5, older versions will not handle Java-7 source code correctly).<br />
***[http://eclemma.org/jacoco/ jacoco]. (Tested with jacoco 0.5.7 and higher)<br />
***[http://pmd.sourceforge.net/ pmd]. (Tested with pmd 4.3)<br />
<br />
==== Configuring the Build ====<br />
<br />
The folder <tt>SMILA.builder</tt> contains everything needed to build SMILA and/or run all tests locally. The default settings build against Eclipse 4.2 and produce a product for Windows 32bit and 64bit, Linux 32bit and 64bit, as well as MacOS x86 64bit, but it is also possible to build other platforms.<br />
<br />
Whether you build from the command line or from Eclipse, in both cases the <tt>make.xml</tt> Ant script is executed. Before execution, certain properties need to be set to match your local setup. <br />
<br />
===== Setting the Target Build Platform =====<br />
<br />
''First, [[SMILA/Development Guidelines/Howto set up dev environment|setup a development environment]].'' When finished copy the file <tt>SMILA.builder/build.properties.template</tt> to <tt>SMILA.builder/build.properties</tt> and adapt the copied file: Add the platforms that you want to build as value triplets to the <tt>configs</tt> property and comment out or remove those that you don't need. The available platform triplets are:<br> <br />
<br />
{| border="1"<br />
|+ <br> <br />
|-<br />
! Windows 32bit <br />
| {{Codeblock|<pre>...<br />
configs=win32,win32,x86 <br />
# ... </pre><br />
}}<br />
|-<br />
! Windows 64bit <br />
| {{Codeblock|<pre>...<br />
configs=win32,win32,x86_64 <br />
# ... </pre><br />
}}<br />
|-<br />
! Linux 32bit <br />
| {{Codeblock|<pre>...<br />
configs=linux,gtk,x86<br />
# ... </pre><br />
}}<br />
|-<br />
! Linux 64bit <br />
| {{Codeblock|<pre>...<br />
configs=linux, gtk, x86_64<br />
# ... </pre><br />
}}<br />
|-<br />
! Solaris SPARC <br />
| {{Codeblock|<pre>...<br />
configs=solaris, gtk, sparc<br />
# ... </pre><br />
}}<br />
|}<br />
<br />
If you want to provide several distributions at once, e.g. one for Windows 32bit and one for Linux 32bit (the default build plan), concatenate the platform triplets with the '&amp;' character:<br> <br />
<br />
{| border="1"<br />
|-<br />
! Example: <br />
| {{Codeblock|<pre>configs=win32, win32, x86 & \<br />
linux, gtk, x86 </pre><br />
}}<br />
|}<br />
<br />
The archive files of the application distribution are created in the <tt>Application</tt> directory below the specified build directory (see below). For each platform triplet in the <tt>configs</tt> property (<tt>$os, $ws, $arch</tt>) a ZIP file named <tt>SMILA-incubation-$os.$ws.$arch.zip</tt> is built.<br />
<br />
===== Setting Build Properties =====<br />
<br />
These are the main properties that can be used to configure the build process executed by <tt>make.xml</tt>. If you run the build from within Eclipse, you must add them to the Ant launch configuration (see [[#Executing_make.xml_from_within_Eclipse|Executing make.xml from within Eclipse]] below). For running from the command line, we have included templates that you can adapt to your local setup (see [[#Executing_make.xml_from_command_line|Executing make.xml from command line]] below). <br />
<br />
{| cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! Property <br />
! Default <br />
! Comment<br />
|-<br />
| <tt>buildDirectory</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/eclipse.build</tt> <br />
| Directory where built output will be created. This should be always a subdirectory of &lt;SMILA_HOME&gt;. The application distribution's ZIP files will be created in the subdirectory <tt>Application</tt> of this directory.<br />
|-<br />
| <tt>builder</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/SMILA.builder</tt> <br />
| Directory where <tt>make.xml</tt> is located.<br />
|-<br />
| <tt>eclipse.home</tt> <br />
| <tt>&lt;ECLIPSE_HOME&gt;</tt> <br />
| Location of the [[#Build_Requirements|Eclipse instance]] used to build SMILA.<br />
|-<br />
| <tt>lib.dir</tt> <br />
| <tt>&lt;SMILA_HOME&gt;/lib</tt> <br />
| Location of the additional build libs (ant-contrib, etc.).<br />
|-<br />
| <tt>os</tt> <br />
| win32 <br />
| rowspan="3" | These properties merely control under which platform the tests will run. The combination must be one of the [[#Setting_the_Target_Build_Platform|target platforms]] you have built.<br />
|-<br />
| <tt>ws</tt> <br />
| win32<br />
|-<br />
| <tt>arch</tt> <br />
| x86<br />
|-<br />
| <tt>test.java.home</tt> <br />
| <tt>&lt;JAVA_HOME&gt;</tt> <br />
| A Java 1.7 SDK instance.<br />
|}<br />
<br />
==== Executing the make.xml ====<br />
<br />
[[Image:Smila.build.all.png|thumb|right|all dependency graph]] <br />
<br />
The default target is <tt>all</tt>, building the application and running all unit tests. Note that this can take quite a while. To build the distribution archives only, use the targets <tt>clean</tt> and <tt>final-application</tt>. The [[:image:Smila.build.all.png|dependency graph]] explains what will happen and shows the relevant targets you may call instead. <br />
<br />
===== Executing make.xml from within Eclipse =====<br />
<br />
Steps: <br />
<br />
#Select the <tt>SMILA.builder</tt> bundle. <br />
#Open the ''External Tools Configuration'' dialog (select ''Run -&gt; External Tools -&gt; External Tools Configuration''). <br />
#Create a new ''Ant Build'' configuration. <br />
#In the ''Buildfile'' field, enter: <tt>${workspace_loc:/SMILA.builder/make.xml}</tt>. <br />
#In the ''Base Directory'' field, enter: <tt>${workspace_loc:/SMILA.builder}</tt>. <br />
#Add all properties from [[#Setting_Build_Properties|above]] into the ''Arguments'' field (and adapt them to meet your setup) but prepend each with <tt>-D</tt> so each is passed into <tt>ant</tt> as a property (note that <tt>buildDirectory</tt> should be a subdirectory of your SMILA workspace directory), e.g. when using Eclipse 4.2 to build: <br />
#:-DbuildDirectory=D:/workspace/SMILA/eclipse.build <br />
#:-Declipse.home=D:/eclipse42<br />
#:-Dbuilder=D:/workspace/SMILA/SMILA.builder <br />
#:-Declipse.running=true <br />
#:-Dos=win32 -Dws=win32 -Darch=x86 <br />
#:-Dlib.dir=D:/workspace/SMILA/lib <br />
#Apply, and run the Ant build.<br>'''Note:''' To start a target other than the default, select the targets of your choice on the ''Targets'' tab. <br />
<br />
===== Executing make.xml from command line =====<br />
<br />
The <tt>make.bat</tt> and <tt>make.sh</tt> files are just shell scripts setting the properties that are needed for the Ant script. These files exist only as templates in SVN with <tt>.#~#~#</tt> appended to denote their template nature. Copy the one matching your system and rename it as you like; note that the names <tt>make.bat</tt> and <tt>make.sh</tt> are already on the svn:ignore list to prevent them from being committed accidentally, so it is recommended to use these names. <br />
<br />
Both scripts are very similar: they start by setting some environment variables, which are then used to create the build configuration properties and eventually fed into an Ant call. These are the variables you usually need to check and adapt: <br />
<br />
{| cellspacing="0" cellpadding="5" border="1"<br />
|-<br />
! Variable <br />
! Comment<br />
|-<br />
| <tt>SMILA_HOME </tt><br />
| Location of your SVN working copy. It may be derived automatically in the <tt>.sh</tt> script; in the batch file, however, you must set it yourself.<br />
|-<br />
| <tt>ECLIPSE_HOME </tt><br />
| Location of the [[#Build_Requirements|Eclipse instance]] used to build SMILA.<br />
|-<br />
| <tt>ARCH </tt><br />
| Operating system and platform settings for running the tests. See description of <tt>os</tt>, <tt>ws</tt> and <tt>arch</tt> properties above.<br />
|-<br />
| <tt>JAVA_HOME </tt><br />
| Location of the JDK to build and run tests in. Must match the <tt>ARCH</tt> setting. <br />
'''Tip:''' If your compile log complains about a non-1.6-compatible JVM despite the correct settings, you must also add the {{path|JAVA_HOME}} location to {{Path|ECLIPSE_HOME/eclipse.ini}} like so:<br> {{Code|<br />
-vm<br/><br />
<JAVA_HOME>\bin}} <br />
<br />
|-<br />
| <tt>ANT_HOME </tt><br />
| Location of your ANT installation.<br />
|-<br />
| <tt>BUILDLIB_DIR </tt><br />
| Location of your build files. If you placed them in the SMILA_HOME you can leave this untouched and comment out the <tt>libDir</tt> setting in the next line.<br />
|-<br />
| <tt>buildOpts </tt><br />
| Use the default <tt>buildOpts</tt> for Eclipse pdebuild or adapt them if you have another version installed.<br />
|}<br />
<br />
Usually you don't need to change anything below the line setting <tt>buildOpts</tt>. <br />
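As an illustration, the variable block at the top of <tt>make.sh</tt> might look like the following sketch. The variable names come from the table above; all values are hypothetical examples for a Linux setup and must be adapted to your environment.

```shell
# Hypothetical example values for the make.sh variable block (adapt to your setup).
SMILA_HOME=$HOME/workspace/SMILA          # location of your SVN working copy
ECLIPSE_HOME=$HOME/eclipse                # Eclipse instance used to build SMILA
ARCH="-Dos=linux -Dws=gtk -Darch=x86"     # platform settings for running the tests
JAVA_HOME=/usr/lib/jvm/java-6             # JDK matching the ARCH setting
ANT_HOME=/usr/share/ant                   # Ant installation
BUILDLIB_DIR=$SMILA_HOME/lib              # location of the build files
echo "Building $SMILA_HOME with Eclipse at $ECLIPSE_HOME"
```

The script then turns these variables into the <tt>-D</tt> properties described in the previous sections and passes them to Ant.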
<br />
To run a build with the default target (<tt>all</tt>), open a command prompt or shell in the <tt>SMILA.builder</tt> directory and just enter: <br />
<br />
<source lang="text"><br />
make<br />
</source> <br />
<br />
To execute a target other than the default, just pass it (or them) as arguments: <br />
<br />
<source lang="text"><br />
make build<br />
</source> <br />
<br />
For example, to build the application distribution ZIPs without running the tests (which can take quite a long time), you can use: <br />
<br />
<source lang="text"><br />
make clean final-application<br />
</source> <br />
<br />
On Windows you will not see much output in the command prompt window, because the batch file redirects it to a logfile (named <tt>log.make</tt> if the batch file is <tt>make.bat</tt>) so that you can check for error details after the build. You can install the [http://gnuwin32.sourceforge.net/ GnuWin32] or [http://www.cygwin.com/ Cygwin] tools and use <tt>tee</tt> to have the output written to both console and logfile. The template contains the changed Ant call as a comment near the end of the script. <br />
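The <tt>tee</tt> pattern described above can be sketched as follows. Here <tt>echo</tt> stands in for the actual <tt>make</tt> call so the example is self-contained; <tt>log.make</tt> is the logfile name used by the batch file.

```shell
# Mirror output to both console and logfile (requires the GnuWin32/Cygwin
# tee tool on Windows). In the real script the command would be the make
# call, e.g.:
#   make clean final-application 2>&1 | tee log.make
echo "BUILD SUCCESSFUL" 2>&1 | tee log.make
# log.make now contains the same output that was shown on the console
```

The `2>&1` redirection merges stderr into stdout so that error details also end up in the logfile.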
<br />
[[Category:SMILA]]</div>
Marco.strack.empolis.com
https://wiki.eclipse.org/index.php?title=SMILA/Documentation/HowTo/Howto_set_up_dev_environment&diff=333064
SMILA/Documentation/HowTo/Howto set up dev environment
2013-04-08T09:42:43Z
<p>Marco.strack.empolis.com: Some changes to "getting the source code from svn". Example included.</p>
<hr />
<div>This HowTo describes the necessary steps for setting up a SMILA development environment. <br />
<br />
==== Preconditions ====<br />
<br />
Here is the list of things that you will definitely need for developing SMILA components: <br />
<br />
* JDK 1.7<br />
* Recent Eclipse SDK - This HowTo was tested with [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/ Eclipse Classic SDK 4.2] (Juno Release) <br> <br />
<br />
==== Getting the source code ====<br />
<br />
There is more than one way of getting the code into your Eclipse workspace. The following sections will describe how to get the source code via SVN (recommended!). <br />
<br />
As an alternative, you could download the complete source code from the [http://www.eclipse.org/smila/downloads.php release download page] or the [http://build.eclipse.org/rt/smila/nightly/ nightly build downloads] and unpack the archive into your workspace. <br />
<br />
===== Installing SVN Provider =====<br />
''(skip this section if SVN Team Provider is already installed in your eclipse IDE)''<br />
<br />
* Install ''Subversive SVN Team Provider'' and ''Subversive SVN JDT Ignore Extensions'' from the Eclipse software repository.<br> <br />
* Restart Eclipse. <br />
* Select ''Window &gt; Preferences &gt; Team &gt; SVN''. This should open the ''Subversive Connector Discovery'' window. <br />
* Select the Subversive SVN Connector that you wish to use. We suggest taking the latest SVN Kit that is offered. At the time of writing this was SVN Kit 1.3.5. <br />
<br />
===== Get source code from SVN =====<br />
<br />
There are two ways to do this: automatically, by using the ''Project Set File'', or manually. Both are described in the following: <br />
<br />
''Manually checking out and importing the projects into eclipse afterwards:''<br />
* Use your favorite SVN client (''except the eclipse SVN client'') to check out SMILA's source code from the repository located at:<br> <tt>https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core</tt>. If you later want to be able to build a SMILA distribution, all SMILA projects should be located in the same directory.<br />
:: <pre>svn co https://dev.eclipse.org/svnroot/rt/org.eclipse.smila/trunk/core</pre><br />
::'''Note:''' ''The upside of doing so is that you can easily get new projects just by updating your working copy and reimporting the sources into eclipse. Removed projects will be deleted on update. Eclipse will indicate this to the user by displaying an empty project.'' <br />
* Import all SMILA projects into your workspace: <br />
** Click ''File'' &gt; ''Import'' &gt; ''General'' &gt; ''Existing Projects into Workspace'' &gt; ''Next.'' <br />
** Select the folder that contains all SMILA projects (all projects should be selected automatically) &gt; ''Finish''.<br />
<br />
''Automatic checkout and import by using the Project Set File:''<br />
* In eclipse, create an SVN repository location with URL <tt>https://dev.eclipse.org/svnroot/rt/org.eclipse.smila</tt><br />
* Checkout <tt>trunk/releng</tt> <br />
* Right click on <tt>SMILA.releng/devenv.SMILA-core.psf</tt><br />
* Click ''Import Project Set...'' and choose "No To All"<br />
::'''Hint:''' ''New projects should always be added to the .psf file so you can import them as before: right-click the .psf file and click "Import Project Set...". Be sure to click "No To All" when asked whether to overwrite existing projects in the workspace; otherwise everything will be checked out again instead of ignoring the projects that are already checked out. If projects are removed, you have to remove them from the workspace manually; this cannot be handled via the .psf file.'' <br />
<br />
After you have imported the source code into your workspace, it will show a lot of errors. Don't worry, they'll disappear after the next steps below.<br />
<br />
==== Defining the target platform ====<br />
<br />
The target platform defines the set of bundles and features that you are developing against. SMILA ships a ''Target Definition File'' that you can open in your IDE to configure the target platform automatically. This file contains all the references needed for developing SMILA with Eclipse Juno (Release 4.2).<br />
<br />
===== Using the target platform provided by SMILA =====<br />
<br />
* Checkout <tt>../org.eclipse.smila/trunk/releng</tt> (''if you haven't already done before'')<br />
* Open the file <tt>SMILA.releng/devenv/SMILA.target</tt> with the ''Target Definition'' editor. <br>Eclipse starts downloading the referenced bundles/features, which it indicates by stating "Resolving Target Definition" in its status bar. Be patient, this will take quite a while. After it has finished, you can click the link "Set as Target Platform" at the top right of the ''Target Definition'' editor. Doing so will cause Eclipse to start re-compiling the sources, and all error markers should be gone when it finishes.<br />
<br />
===== Defining the target platform manually =====<br />
<br />
* Instead of using the target definition file provided by SMILA (see above) you can also [[SMILA/Development Guidelines/Howto set up target platform|manually set your own target platform]].<br />
<br />
==== Launching SMILA in Eclipse IDE ====<br />
<br />
If you've checked out SMILA's trunk correctly, you should have a project called '''SMILA.launch''' in your workspace. This project contains SMILA's launch configuration for the Eclipse IDE. To start SMILA directly in your Eclipse IDE, just follow the steps below: <br />
<br />
* Click ''Run'' &gt; ''Debug Configurations'' and expand '''OSGi Framework'''. <br />
* Select the ''SMILA'' launch file. <br />
* Click '''Debug'''. <br> If everything works fine, you will get an output in the '''Console''' view similar to the following:<br />
<br />
<source lang="text"><br />
osgi> Persistence bundle starting...<br />
ProviderTracker: New service detected...<br />
ProviderTracker: Added service org.eclipse.persistence.jpa.osgi.PersistenceProviderOSGi<br />
Persistence bundle started.<br />
[INFO ] Context /zookeeper: Registered handler(1) ZooKeeperAdminHandler, pattern /(.*)$<br />
[INFO ] Added worker webFetcher to WorkerManager.<br />
...<br />
[INFO ] HTTP server has SMILA handler RequestDispatcher for context /smila.<br />
[INFO ] HTTP server started successfully on port 8080.<br />
</source><br />
<br />
==== You're done ====<br />
<br />
Congratulations! You've just successfully checked out and configured your SMILA development environment and you can now start [[SMILA/Development Guidelines/Create a bundle (plug-in)|developing your own bundles]].<br />
<br />
==== Additional steps ====<br />
<br />
The following steps may be needed for special purposes. If you are a SMILA user who only wants to integrate your own component, you won't need them. <br />
<br />
===== Delta Pack =====<br />
''(only needed for building the software outside of eclipse IDE)''<br />
<br />
For building the software you may need to add a "Delta Pack" to an Eclipse SDK installation. You can download it from [http://download.eclipse.org/eclipse/downloads/drops4/R-4.2-201206081400/ here]. After downloading, copy the contained plugins and features into your Eclipse installation.<br />
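A minimal sketch of the copy step, assuming you have already unzipped the Delta Pack. Both <tt>DELTA_DIR</tt> and <tt>ECLIPSE_HOME</tt> are placeholder paths standing in for your actual download and Eclipse installation.

```shell
# Placeholder paths; adjust DELTA_DIR and ECLIPSE_HOME to your setup.
DELTA_DIR=/tmp/delta-pack/eclipse
ECLIPSE_HOME=/tmp/eclipse-sdk
# stand-in directories so this sketch runs on its own; in reality they come
# from the downloaded ZIP and your existing Eclipse SDK installation
mkdir -p "$DELTA_DIR/plugins" "$DELTA_DIR/features" "$ECLIPSE_HOME"
# copy the Delta Pack plugins and features into the Eclipse installation
cp -r "$DELTA_DIR/plugins" "$DELTA_DIR/features" "$ECLIPSE_HOME/"
ls "$ECLIPSE_HOME"
```

After this, the Eclipse installation contains the additional platform-specific plugins and features needed for headless builds.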
<br />
===== Checkstyle configuration =====<br />
<br />
If you have the [http://eclipse-cs.sourceforge.net/ Eclipse Checkstyle plugin] installed, you will get a lot of error messages complaining about missing check configurations when Eclipse builds the workspace.<br />
(''Hint: For installing the Checkstyle plugin, use location: "http://eclipse-cs.sf.net/update/"'')<br />
<br />
<source lang="text"><br />
Errors running builder 'Checkstyle Builder' on project 'org.eclipse.smila.utils'.<br />
Fileset from project "org.eclipse.smila.utils" has no valid check configuration.<br />
...<br />
</source><br />
<br />
You can solve this by importing them: <br />
* Open ''Window -> Preferences'' and go to ''Checkstyle''.<br />
* Click ''New...'', enter <tt>SMILA Checkstyle</tt> as the name, click ''Import...'', and select ''SMILA.builder/checkstyle/smila_checkstyle-5.xml'' from your workspace. Click ''OK''.<br />
* Click ''New...'' again, enter <tt>SMILA Test Checkstyle</tt> as the name, click ''Import...'', and select ''SMILA.builder/checkstyle/smila-test_checkstyle-5.xml'' from your workspace. Click ''OK''.<br />
* Select <tt>SMILA Checkstyle</tt> and click ''Set as Default''.<br />
* Click ''OK''. <br> Now you should not get those error messages again.<br />
<br />
===== Enabling the BPEL Designer =====<br />
<br />
If you want to work with the SMILA extensions for Eclipse BPEL designer, you need to check out the bundles from <tt>trunk/tooling</tt>. Currently, the required bundles are: <br />
<br />
*<tt>org.eclipse.smila.processing.designer.model</tt> <br />
*<tt>org.eclipse.smila.processing.designer.ui</tt><br />
<br />
To compile them you need additional bundles from the [http://www.eclipse.org/bpel Eclipse BPEL Designer] in your target platform. See [[SMILA/BPEL Designer]] for more information.<br />
<br />
<br />
[[Category:SMILA]]</div>
Marco.strack.empolis.com