Difference between revisions of "SMILA/Documentation/Importing/Crawler/Feed"
Tip: In contrast to the old FeedAgent component, the FeedCrawler does not support checking the feeds for new entries at regular intervals. Currently, you can simulate this by regularly starting a job that uses the FeedCrawler from outside, e.g. by using cron or another scheduler. A dedicated scheduling component for SMILA is planned for the future.
Revision as of 04:21, 27 June 2012
IN PROGRESS
The FeedCrawler is used to read RSS or Atom feeds in importing workflows.
FeedCrawler
The FeedCrawler reads RSS and Atom feeds. The implementation uses ROME and ROME Fetcher (https://rometools.jira.com/wiki/display/ROME/Home) to retrieve and parse the feeds. ROME supports the following feed formats:
- RSS 0.90
- RSS 0.91 Netscape
- RSS 0.91 Userland
- RSS 0.92
- RSS 0.93
- RSS 0.94
- RSS 1.0
- RSS 2.0
- Atom 0.3
- Atom 1.0
Configuration
The FeedCrawler worker is usually the first worker in a workflow and the job is started in runOnce mode.
- Worker name: feedCrawler
- Parameters:
  - dataSource (req.): Value for the attribute _source, needed e.g. by the delta service.
  - feedUrls (req.): URLs (usually HTTP) of the feeds to read. Can be a single string value or a list of string values. Currently, all feeds are read in a single task.
  - mapping (req.): Mapping of feed and feed item properties to record attribute names. See below for the available property names.
  - deltaProperties (opt.): A list of feed or feed item property names (see below) used to generate the value for the attribute _deltaHash. If not set, a unique _deltaHash value is generated for each record, so that the record is always updated when delta checking is enabled.
  - maxRecordsPerBulk (opt.): Maximum number of item records in one bulk in the output bucket (default: 1000).
- Output slots:
  - crawledRecords: One record per item read from the feeds.
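To illustrate how these parameters fit together, the following Python sketch assembles a parameter set and checks that the required entries are present. Only the parameter names come from the list above. The job name, workflow name, feed URL and target attribute names are made up for the example, and the surrounding name/workflow/parameters layout is an assumption about how a SMILA job definition is structured.

```python
import json

# Parameters marked (req.) in the list above.
REQUIRED = ("dataSource", "feedUrls", "mapping")


def build_feed_crawler_parameters(data_source, feed_urls, mapping,
                                  delta_properties=None,
                                  max_records_per_bulk=None):
    """Assemble feedCrawler parameters; optional ones are omitted when unset."""
    params = {
        "dataSource": data_source,
        "feedUrls": feed_urls,   # a single string or a list of strings
        "mapping": mapping,      # feed/item property name -> record attribute name
    }
    if delta_properties is not None:
        params["deltaProperties"] = list(delta_properties)
    if max_records_per_bulk is not None:
        params["maxRecordsPerBulk"] = max_records_per_bulk
    missing = [name for name in REQUIRED if not params.get(name)]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return params


job_definition = {
    "name": "crawlFeeds",        # hypothetical job name
    "workflow": "feedCrawling",  # hypothetical workflow name
    "parameters": build_feed_crawler_parameters(
        data_source="feeds",
        feed_urls=["http://example.org/news.rss"],  # hypothetical feed URL
        mapping={"itemTitle": "Title", "itemUri": "Url", "feedTitle": "Source"},
        delta_properties=["itemUri", "itemUpdateDate"],
    ),
}

print(json.dumps(job_definition, indent=2))
```

Because maxRecordsPerBulk is left out here, the documented default of 1000 would apply.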
The following properties of the feed can be mapped to record attributes. Their values are identical for all records created from entries of a single feed. Some properties are not simple values but maps (or even lists of maps). The attributes of these map objects are described in the tables further below; they cannot be changed via the mapping.
Attribute | Type | Description |
---|---|---|
feedAuthors | List<Person> | Returns the feed authors |
feedCategories | List<Category> | Returns the feed categories |
feedContributors | List<Person> | Returns the feed contributors |
feedCopyright | String | Returns the feed copyright information |
feedDescription | String | Returns the feed description |
feedEncoding | String | Returns the charset encoding of the feed |
feedType | String | Returns the feed type |
feedImage | Image | Returns the feed image |
feedLanguage | String | Returns the feed language |
feedLinks | List<Link> | Returns the feed links |
feedPublishDate | Date | Returns the publication date of the feed |
feedTitle | String | Returns the feed title |
feedUri | String | Returns the feed uri |
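The effect of the mapping on feed-level properties can be sketched in Python as follows. This is not the FeedCrawler implementation: records are modelled as plain dictionaries, the sample values and target attribute names are invented, and skipping properties that are absent from a feed is an assumption. The sketch only shows that each record created from the same feed receives the same values for mapped feed properties, while item properties (listed in the next table) differ per record.

```python
def apply_mapping(properties, mapping):
    """Copy mapped properties to record attributes.

    Properties that are not mentioned in the mapping, or that the feed
    does not provide, are skipped (an assumption made for this sketch).
    """
    return {attr: properties[prop]
            for prop, attr in mapping.items()
            if prop in properties}


# Hypothetical values read from one feed and two of its items.
feed_properties = {
    "feedTitle": "Example News",
    "feedLanguage": "en",
    "feedUri": "http://example.org/news.rss",
}
items = [{"itemTitle": "First"}, {"itemTitle": "Second"}]

# Property name -> record attribute name; feedUri is deliberately unmapped.
mapping = {"feedTitle": "Source", "feedLanguage": "Language", "itemTitle": "Title"}

# One record per item; feed-level values are merged into every record.
records = [apply_mapping({**feed_properties, **item}, mapping) for item in items]

for record in records:
    print(record)
```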
The following properties are extracted from the individual feed items:
Attribute | Type | Description |
---|---|---|
itemAuthors | List<Person> | Returns the authors of a feed entry |
itemCategories | List<Category> | Returns the categories of a feed entry |
itemContents | List<Content> | Returns the contents of a feed entry |
itemContributors | List<Person> | Returns the contributors of a feed entry |
itemDescription | Content | Returns the description of a feed entry |
itemEnclosures | List<Enclosure> | Returns the enclosures of a feed entry |
itemLinks | List<Link> | Returns the links of a feed entry |
itemPublishDate | Date | Returns the publication date of a feed entry |
itemTitle | String | Returns the title of a feed entry |
itemUpdateDate | Date | Returns the update date of a feed entry |
itemUri | String | Returns the URI of a feed entry |
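The role of the deltaProperties parameter can be illustrated with a small Python sketch. How SMILA actually computes _deltaHash is not described here, so SHA-256 and the random fallback are stand-ins chosen for this example, and the item values are invented. The sketch only demonstrates the documented behaviour: with deltaProperties set, the hash changes only when one of the selected properties changes; without it, every record gets a unique value and is therefore always treated as changed.

```python
import hashlib
import uuid


def delta_hash(properties, delta_properties=None):
    """Return a hash over the selected properties, or a unique value if none are selected."""
    if not delta_properties:
        # No deltaProperties configured: a fresh value on every call.
        return uuid.uuid4().hex
    digest = hashlib.sha256()
    for name in delta_properties:
        # Include the property name and a separator so values cannot run together.
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(repr(properties.get(name)).encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


# Hypothetical item as read on two consecutive crawls.
first_crawl = {"itemUri": "http://example.org/a", "itemUpdateDate": "2012-06-27T04:21:00Z"}
second_crawl = dict(first_crawl, itemUpdateDate="2012-06-28T09:00:00Z")

selected = ["itemUri", "itemUpdateDate"]
print(delta_hash(first_crawl, selected) == delta_hash(first_crawl, selected))
print(delta_hash(first_crawl, selected) == delta_hash(second_crawl, selected))
```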
Content maps can contain these properties:
Attribute | Type | Description |
---|---|---|
Mode | String | Returns the mode of the content |
Value | String | Returns the value of the content |
Type | String | Returns the type of the content |
Person maps can contain these properties:
Attribute | Type | Description |
---|---|---|
Email | String | Returns the email of the person |
Name | String | Returns the name of the person |
Uri | String | Returns the uri of the person |
Image maps can contain these properties:
Attribute | Type | Description |
---|---|---|
Link | String | Returns the link of the image |
Title | String | Returns the title of the image |
Url | String | Returns the url of the image |
Description | String | Returns the description of the image |
Category maps can contain these properties:
Attribute | Type | Description |
---|---|---|
Name | String | Returns the name of the category |
TaxanomyUri | String | Returns the taxonomy uri of the category |
Enclosure maps can contain these properties:
Attribute | Type | Description |
---|---|---|
Type | String | Returns the type of the enclosure |
Url | String | Returns the url of the enclosure |
Length | Integer | Returns the length of the enclosure |
Link maps can contain these properties:
Attribute | Type | Description |
---|---|---|
Href | String | Returns the href of the link |
Hreflang | String | Returns the hreflang of the link |
Rel | String | Returns the rel of the link |
Title | String | Returns the title of the link |
Type | String | Returns the type of the link |
Length | Integer | Returns the length of the link |
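Since properties such as itemLinks arrive as lists of maps, a consumer of the crawled records typically has to unpack them. The following Python sketch shows one way to collect the Href values from such a list. The record is a plain dictionary standing in for a SMILA record, the attribute name Links is a hypothetical mapping target for itemLinks, and the sample values are invented; only the key names Href, Rel and Type come from the Link table above.

```python
def link_hrefs(record, attribute="Links", rel=None):
    """Return the Href values of the Link maps stored under the given attribute.

    If rel is given, only links with that Rel value are returned.
    Link maps without an Href entry are ignored.
    """
    links = record.get(attribute) or []
    return [link["Href"]
            for link in links
            if "Href" in link and (rel is None or link.get("Rel") == rel)]


# Hypothetical record, assuming itemLinks was mapped to the attribute "Links".
record = {
    "Title": "First",
    "Links": [
        {"Href": "http://example.org/a", "Rel": "alternate", "Type": "text/html"},
        {"Href": "http://example.org/a.pdf", "Rel": "enclosure", "Type": "application/pdf"},
        {"Title": "link without href"},
    ],
}

print(link_hrefs(record))
print(link_hrefs(record, rel="alternate"))
```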