SMILA/Documentation/Importing/Crawler/Feed

The FeedCrawler is used to read RSS or Atom feeds in importing workflows.

Note: In contrast to the old FeedAgent component, the FeedCrawler does not support checking feeds for new entries at regular intervals. You can currently simulate this by starting a FeedCrawler job regularly from outside SMILA, e.g. using cron or another scheduler. We plan to integrate a dedicated scheduling component into SMILA in the future.
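The external scheduling described above could be sketched as follows. This is a minimal illustration, not part of SMILA: the REST path `/jobmanager/jobs/<job name>/`, the `{"mode": "runOnce"}` request body, the base URL, and the job name `crawlFeed` are assumptions that should be checked against your installation. The helper names `build_start_request` and `run_periodically` are hypothetical.

```python
import json
import time
import urllib.request


def build_start_request(base_url, job_name):
    """Build (but do not send) a POST request that starts one job run.

    Assumption: the job manager accepts POST <base_url>/jobmanager/jobs/<name>/
    with a JSON body selecting the runOnce mode.
    """
    url = f"{base_url}/jobmanager/jobs/{job_name}/"
    body = json.dumps({"mode": "runOnce"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_periodically(start_job, interval_seconds, runs, sleep=time.sleep):
    """Call start_job() `runs` times, pausing interval_seconds between calls."""
    for i in range(runs):
        start_job()
        if i < runs - 1:
            sleep(interval_seconds)


if __name__ == "__main__":
    # Hypothetical base URL and job name.
    request = build_start_request("http://localhost:8080/smila", "crawlFeed")
    print(request.get_method(), request.full_url)
    # To actually trigger the job every 15 minutes, pass a callable that
    # sends the request, for example:
    # run_periodically(lambda: urllib.request.urlopen(request).close(), 900, runs=4)
```

In practice a cron entry that issues the same HTTP request is usually simpler than a long-running script; the loop above only shows the idea.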


FeedCrawler

The FeedCrawler reads RSS and Atom feeds. The implementation uses ROME and ROME Fetcher to retrieve and parse the feeds. ROME supports the following feed formats:

  • RSS 0.90
  • RSS 0.91 Netscape
  • RSS 0.91 Userland
  • RSS 0.92
  • RSS 0.93
  • RSS 0.94
  • RSS 1.0
  • RSS 2.0
  • Atom 0.3
  • Atom 1.0

Configuration

The FeedCrawler worker is usually the first worker in a workflow and the job is started in runOnce mode.

  • Worker name: feedCrawler
  • Parameters:
    • dataSource (req.) value for attribute _source, needed e.g. by the delta service
    • feedUrls (req.) URLs (usually HTTP) of the feeds to read. Can be a single string value or a list of string values. Currently, all feeds are read in a single task.
    • mapping (req.) Mapping of feed and feed item properties to record attribute names. See below for the available property names.
    • deltaProperties (opt.) a list of feed or feed item property names (see below) used to generate the value for attribute _deltaHash. If not set, a unique _deltaHash value is generated for each record, so that every record is updated whenever delta checking is enabled.
    • maxRecordsPerBulk (opt.) maximum number of item records in one bulk in the output bucket. (default: 1000)
  • Output slots:
    • crawledRecords: One record per item read from the feeds.
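
The parameters listed above could be combined into a job definition along the following lines. This is a sketch under assumptions: the outer structure with `name`, `workflow`, and `parameters` reflects how SMILA job definitions are commonly written, and the job name, workflow name, feed URL, and record attribute names (`Url`, `Title`, `LastModifiedDate`) are hypothetical. The helper `build_feed_job` is not a SMILA API; it only checks the required parameters and shows that `feedUrls` may be given as a single string or as a list.

```python
import json

# Parameters marked "(req.)" in the list above.
REQUIRED_PARAMETERS = ("dataSource", "feedUrls", "mapping")


def build_feed_job(name, workflow, parameters):
    """Return a job definition dict after checking the required parameters."""
    missing = [p for p in REQUIRED_PARAMETERS if p not in parameters]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    params = dict(parameters)  # shallow copy, the caller's dict stays untouched
    # feedUrls may be a single string or a list of strings; this sketch
    # normalizes to a list purely for illustration.
    if isinstance(params["feedUrls"], str):
        params["feedUrls"] = [params["feedUrls"]]
    return {"name": name, "workflow": workflow, "parameters": params}


job = build_feed_job(
    name="crawlFeed",          # hypothetical job name
    workflow="feedCrawling",   # hypothetical workflow name
    parameters={
        "dataSource": "feeds",
        "feedUrls": "http://example.org/news/rss.xml",  # placeholder URL
        # feed item property -> record attribute name
        "mapping": {
            "itemUri": "Url",
            "itemTitle": "Title",
            "itemUpdateDate": "LastModifiedDate",
        },
        # records count as changed only if one of these values changes
        "deltaProperties": ["itemUpdateDate", "itemTitle"],
        "maxRecordsPerBulk": 500,
    },
)

print(json.dumps(job, indent=2))
```

A workflow may require further job parameters (for example a store name or a target job); those depend on the workflow definition and are omitted here.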

Feed properties

These feed properties can be mapped to record attributes. Their values are identical for all records created from entries of a single feed. Some are not simple values but maps (or even lists of maps). The attributes of these map objects are described in the tables further below; they cannot be changed via the mapping.

Attribute         Type            Description
feedAuthors       List<Person>    Returns the feed authors
feedCategories    List<Category>  Returns the feed categories
feedContributors  List<Person>    Returns the feed contributors
feedCopyright     String          Returns the feed copyright information
feedDescription   String          Returns the feed description
feedEncoding      String          Returns the charset encoding of the feed
feedType          String          Returns the feed type
feedImage         Image           Returns the feed image
feedLanguage      String          Returns the feed language
feedLinks         List<Link>      Returns the feed links
feedPublishDate   Date            Returns the feed published date
feedTitle         String          Returns the feed title
feedUri           String          Returns the feed uri

Feed Item properties

These properties are extracted from the individual feed items:

Attribute         Type             Description
itemAuthors       List<Person>     Returns the authors of a feed entry
itemCategories    List<Category>   Returns the categories of a feed entry
itemContents      List<Content>    Returns the contents of a feed entry
itemContributors  List<Person>     Returns the contributors of a feed entry
itemDescription   Content          Returns the description of a feed entry
itemEnclosures    List<Enclosure>  Returns the enclosures of a feed entry
itemLinks         List<Link>       Returns the links of a feed entry
itemPublishDate   Date             Returns the publish date of a feed entry
itemTitle         String           Returns the title of a feed entry
itemUpdateDate    Date             Returns the update date of a feed entry
itemUri           String           Returns the uri of a feed entry
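
To illustrate how the mapping parameter relates feed and feed item properties to record attributes, here is a small sketch. It is not the FeedCrawler implementation: the function `apply_mapping`, the sample property values, and the attribute names on the right-hand side of the mapping are hypothetical. It assumes that properties missing from a feed are simply left out of the record.

```python
def apply_mapping(properties, mapping, data_source):
    """Copy mapped feed/item properties into a new record dict.

    `mapping` maps property names (e.g. "itemTitle") to record attribute
    names (e.g. "Title"). Properties without a value are skipped, which is
    an assumption made for this sketch.
    """
    record = {"_source": data_source}
    for property_name, attribute_name in mapping.items():
        if property_name in properties:
            record[attribute_name] = properties[property_name]
    return record


# Hypothetical values as they might be read from one feed entry.
properties = {
    "feedTitle": "Example News",
    "itemUri": "http://example.org/news/42",
    "itemTitle": "Something happened",
    "itemAuthors": [{"Name": "Jane Doe", "Email": "jane@example.org"}],
}

mapping = {
    "feedTitle": "FeedName",
    "itemUri": "Url",
    "itemTitle": "Title",
    "itemAuthors": "Authors",
    "itemUpdateDate": "LastModifiedDate",  # not present in this entry
}

record = apply_mapping(properties, mapping, data_source="feeds")
print(record)
```

Note that only the top-level property names are renamed. The keys inside structured values such as the Person maps in `Authors` stay as they are, in line with the statement that they cannot be changed via the mapping.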

Properties of structured feed/item properties

Content maps can contain these properties:

Attribute  Type    Description
Mode       String  Returns the mode of the content
Value      String  Returns the value of the content
Type       String  Returns the type of the content

Person maps can contain these properties:

Attribute  Type    Description
Email      String  Returns the email of the person
Name       String  Returns the name of the person
Uri        String  Returns the uri of the person

Image maps can contain these properties:

Attribute    Type    Description
Link         String  Returns the link of the image
Title        String  Returns the title of the image
Url          String  Returns the url of the image
Description  String  Returns the description of the image

Category maps can contain these properties:

Attribute    Type    Description
Name         String  Returns the name of the category
TaxanomyUri  String  Returns the taxonomy uri of the category

Enclosure maps can contain these properties:

Attribute  Type     Description
Type       String   Returns the type of the enclosure
Url        String   Returns the url of the enclosure
Length     Integer  Returns the length of the enclosure

Link maps can contain these properties:

Attribute  Type     Description
Href       String   Returns the href of the link
Hreflang   String   Returns the hreflang of the link
Rel        Integer  Returns the rel of the link
Title      String   Returns the title of the link
Type       String   Returns the type of the link
Length     Integer  Returns the length of the link
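
Because structured properties arrive as maps or lists of maps, code that consumes the records has to navigate them. The following sketch shows one way to do that for Enclosure and Person maps. The record layout, the attribute names `Enclosures` and `Authors` (which would come from your own mapping), and both helper functions are hypothetical. Since the tables above say the maps "can contain" these properties, the helpers treat every key as optional.

```python
def enclosure_urls(record, attribute="Enclosures", type_prefix="audio/"):
    """Return the Url of every enclosure whose Type starts with type_prefix."""
    return [
        enclosure["Url"]
        for enclosure in record.get(attribute, [])
        if "Url" in enclosure
        and enclosure.get("Type", "").startswith(type_prefix)
    ]


def author_names(record, attribute="Authors"):
    """Return the Name of every Person map that has one."""
    return [person["Name"] for person in record.get(attribute, []) if "Name" in person]


# Hypothetical record, assuming itemEnclosures was mapped to "Enclosures"
# and itemAuthors to "Authors".
record = {
    "Title": "Episode 7",
    "Enclosures": [
        {"Type": "audio/mpeg", "Url": "http://example.org/ep7.mp3", "Length": 123456},
        {"Type": "image/png", "Url": "http://example.org/ep7.png", "Length": 2048},
        {"Type": "audio/ogg"},  # no Url, so it is skipped
    ],
    "Authors": [{"Name": "Jane Doe"}, {"Email": "anonymous@example.org"}],
}

print(enclosure_urls(record))
print(author_names(record))
```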

Processing

Sample Feed Crawler Job
