Latest revision as of 04:23, 28 January 2013

The FeedCrawler is used to read RSS or Atom feeds in importing workflows.

Note: In contrast to the old FeedAgent component, the FeedCrawler does not support checking the feeds for new entries at regular intervals. Currently you can simulate this by starting a job using the FeedCrawler regularly from outside, e.g. by using cron or another scheduler. We are planning to integrate a dedicated scheduling component into SMILA in the future.
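Triggering such a periodic crawl from outside can be sketched in a few lines of Python. This is an illustration only: the job manager URL and port below are assumptions based on a default SMILA installation, and you would normally use cron or a proper scheduler instead of a sleep loop.

```python
import time
import urllib.request

# Assumption: default SMILA REST endpoint; adjust host/port to your installation.
BASE_URL = "http://localhost:8080/smila/jobmanager/jobs"

def job_start_url(job_name):
    """Build the URL used to start a run of the given job (assumed REST layout)."""
    return "%s/%s/" % (BASE_URL, job_name)

def start_job(job_name):
    """POST to the job manager to start one job run."""
    request = urllib.request.Request(job_start_url(job_name), data=b"{}", method="POST")
    urllib.request.urlopen(request)

def crawl_periodically(job_name, interval_seconds, runs):
    """Start the crawl job at a fixed interval -- a stand-in for a cron entry."""
    for _ in range(runs):
        start_job(job_name)
        time.sleep(interval_seconds)
```

In production you would put the equivalent of `start_job("crawlFeed")` into a cron job rather than keeping a process alive for the loop.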


FeedCrawler

The FeedCrawler offers the functionality to read RSS and Atom feeds. The implementation uses the ROME Fetcher to retrieve and parse the feeds. ROME supports the following feed formats:

  • RSS 0.90
  • RSS 0.91 Netscape
  • RSS 0.91 Userland
  • RSS 0.92
  • RSS 0.93
  • RSS 0.94
  • RSS 1.0
  • RSS 2.0
  • Atom 0.3
  • Atom 1.0

Configuration

The FeedCrawler worker is usually the first worker in a workflow and the job is started in runOnce mode.

  • Worker name: feedCrawler
  • Parameters:
    • dataSource (req.) value for attribute _source, needed e.g. by the delta service
    • feedUrls (req.) URLs (usually HTTP) of the feeds to read. Can be a single string value or a list of string values. Currently, all feeds are read in a single task.
    • mapping (req.) Mapping of feed and feed item properties to record attribute names. See below for the available property names.
    • deltaProperties (opt.) a list of feed or feed item property names (see below) used to generate the value for attribute _deltaHash. If not set, a unique _deltaHash value is generated for each record, so that the record is updated in any case if delta checking is enabled.
    • maxRecordsPerBulk (opt.) maximum number of item records in one bulk in the output bucket. (default: 1000)
  • Output slots:
    • crawledRecords: One record per item read from the feeds.
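The effect of the deltaProperties parameter can be illustrated with a small sketch. This is not SMILA's actual hashing code, just the idea: hash the values of the configured properties so that unchanged items produce the same _deltaHash, and fall back to a unique value when no properties are configured.

```python
import hashlib
import uuid

def delta_hash(record, delta_properties):
    """Illustrative sketch of the deltaProperties idea (not SMILA's actual algorithm).

    With properties configured, the hash depends only on their values, so an
    unchanged item keeps its _deltaHash. Without properties, every record gets
    a unique value and is therefore always treated as updated.
    """
    if not delta_properties:
        return uuid.uuid4().hex
    digest = hashlib.sha1()
    for name in delta_properties:
        digest.update(repr(record.get(name)).encode("utf-8"))
    return digest.hexdigest()
```

With deltaProperties set to e.g. ["itemUri", "itemUpdateDate"], a feed item is only re-imported when its update date changes.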

You can enable the use of an HTTP proxy for fetching the feeds by setting the system properties http.proxyHost and http.proxyPort. You can do this by adding them to the SMILA.ini file before starting SMILA:

...
-Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Log4JLogger
-Dlog4j.configuration=file:log4j.properties
-Dhttp.proxyHost=proxy.example.com 
-Dhttp.proxyPort=3128

For additional information about proxy usage in Java see JavaSE documentation.

Delta indexing strategy

When you crawl a feed regularly and do not want to lose older entries, it makes sense to use the additive strategy for delta import in your job parameters:

 "parameters":{
    ...
    "deltaImportStrategy":"additive",
    ...
  }

This ensures that entries from former crawl runs won't be deleted, while items that are already indexed are filtered out. Keep in mind that this also means the items are never deleted from the index by delta indexing. (See also the DeltaChecker and UpdatePusher workers.)
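The additive behavior can be sketched as follows. This is an illustration of the strategy, not SMILA's DeltaChecker code: records whose _deltaHash is unchanged are filtered out, new or changed records pass through, and entries are never removed from the delta state.

```python
def delta_check(crawled, index_state):
    """Sketch of the 'additive' delta import strategy (illustration only).

    index_state maps an item key to the last seen _deltaHash. Unchanged records
    are skipped; new or changed records are returned for further processing.
    Nothing is ever deleted from index_state, mirroring that additive delta
    import never removes items from the index.
    """
    updated = []
    for record in crawled:
        key = record["itemUri"]
        if index_state.get(key) != record["_deltaHash"]:
            updated.append(record)
            index_state[key] = record["_deltaHash"]
    return updated
```

Running the check twice on the same crawl result returns nothing the second time, which is exactly why repeated crawls do not re-index unchanged feed items.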

Feed properties

These are properties of the feed that can be mapped to record attributes. The values will be identical for all records created from entries of a single feed. Some are not simple values but structured ones, i.e. (mostly lists of) maps. The properties of these map objects are described in further tables below; they cannot be changed via the mapping. Attributes associated with structured properties are never set to empty objects, e.g. a list attribute is either not set at all or the list does indeed have elements.

Property Type Description
feedSourceUrl String The URL from which the feed was read, as given in the job definition, not parsed from the feed content.
feedAuthors Sequence<Person> Returns the feed authors
feedCategories Sequence<Category> Returns the feed categories
feedContributors Sequence<Person> Returns the feed contributors
feedCopyright String Returns the feed copyright information
feedDescription String Returns the feed description
feedEncoding String Returns the charset encoding of the feed
feedType String Returns the feed type
feedImage Image Returns the feed image
feedLanguage String Returns the feed language
feedLinks Sequence<Link> Returns the feed links
feedPublishDate DateTime Returns the feed published date
feedTitle String Returns the feed title
feedUri String Returns the feed URI
Feed Item properties

These are the properties extracted from the individual feed items:

Property Type Description
itemAuthors Sequence<Person> Returns the feed entry's authors
itemCategories Sequence<Category> Returns the feed entry's categories
itemContents Sequence<Content> Returns the feed entry's contents
itemContributors Sequence<Person> Returns the feed entry's contributors
itemDescription Content Returns the feed entry's description
itemEnclosures Sequence<Enclosure> Returns the feed entry's enclosures
itemLinks Sequence<Link> Returns the feed entry's links
itemPublishDate DateTime Returns the feed entry's publish date
itemTitle String Returns the feed entry's title
itemUpdateDate DateTime Returns the feed entry's update date
itemUri String Returns the feed entry's URI
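How the mapping parameter turns these properties into record attributes can be sketched like this. This is an illustration of the documented behavior, not SMILA's implementation; in particular it mirrors the rule that missing or empty structured properties are not set at all.

```python
def apply_mapping(item_properties, mapping):
    """Sketch of the 'mapping' parameter (illustration only).

    Each feed/item property listed in the mapping is copied to the configured
    record attribute. Properties that are missing or empty (e.g. empty lists)
    are not set, matching the documented behavior for structured properties.
    """
    record = {}
    for prop, attribute in mapping.items():
        value = item_properties.get(prop)
        if value not in (None, [], {}):
            record[attribute] = value
    return record
```

So with the mapping from the sample job below, an item without contents simply produces a record without a "Contents" attribute.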
Properties of structured feed/item properties

Content maps can contain these properties:

  • Mode: String
  • Value: String
  • Type: String

Person maps can contain these properties:

  • Email: String
  • Name: String
  • Uri: String

Image maps can contain these properties:

  • Link: String
  • Title: String
  • Url: String
  • Description: String

Category maps can contain these properties:

  • Name: String
  • TaxanomyUri: String

Enclosure maps can contain these properties:

  • Type: String
  • Url: String
  • Length: Integer

Link maps can contain these properties:

  • Href: String
  • Hreflang: String
  • Rel: Integer
  • Title: String
  • Type: String
  • Length: Integer

Processing

The FeedCrawler is relatively simple: It uses ROME to fetch and parse the configured feed URLs and creates a record for each item read from the feeds according to the configured mapping. These records are written to the output bulks. No follow-up "to-crawl" bulks are created, and therefore no follow-up tasks will be needed.

If none of the configured feed URLs can be fetched and parsed successfully, the task and therefore the complete job will fail. If at least one URL can be processed successfully, the task finishes successfully, and warnings about the failed feeds are written to the log.

Which properties are set depends very much on the feed content, so you will have to experiment with the actual feeds you want to crawl: not every feed provides everything, and some elements are used for different purposes in different feeds. You may find more information about how the feed content is mapped to the properties described above in the ROME Wiki.
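The variation between feeds is easy to see with a tiny parser. This sketch uses Python's standard XML library rather than ROME, purely to show that two items in the same feed can expose different elements, so mapped attributes may be missing on some records.

```python
import xml.etree.ElementTree as ET

# Minimal RSS 2.0 sample: the second item has no <link> element.
RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>http://example.com/1</link></item>
  <item><title>Second</title></item>
</channel></rss>"""

def item_properties(rss_text):
    """Return, per item, which elements the feed actually provides."""
    root = ET.fromstring(rss_text)
    result = []
    for item in root.iter("item"):
        result.append({child.tag: child.text for child in item})
    return result
```

Here a mapping for itemUri would yield a record attribute for the first item but not for the second, which is the kind of difference you only find by trying your real feeds.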

Sample Feed Crawler Job

SMILA already provides a sample feed crawling job "crawlFeed" which uses the "feedCrawling" workflow. Crawled feed item records are pushed to the job "indexUpdateFeed" which uses the BPEL pipeline "AddFeedPipeline" for transforming and indexing the data.

Here's another simple example of a feed crawling job definition:

  {
    "name":"crawlSpiegelFeed",
    "workflow":"feedCrawling",
    "parameters":{
      "tempStore":"temp",
      "dataSource":"feed",
      "jobToPushTo":"indexUpdateFeed",
      "feedUrls":"http://www.spiegel.de/schlagzeilen/tops/index.rss",                
      "mapping": {          
        "itemUri":"Url",
        "itemTitle":"Title",          
        "itemUpdateDate":"LastModifiedDate",
        "itemContents": "Contents"          
      }
    }
  }

To test a one-time crawl of the feed, you can start the indexing job "indexUpdateFeed" and the crawl job "crawlSpiegelFeed", and (after a short time) you should be able to search.

Extending feed workflow to fetch content

The job described above uses the text from the feed items as indexing content. In most feeds this is just a summary of the content of an underlying web site that is linked from the feed item. In the following, we describe how to extend the scenario above to index the content of the underlying web site instead of the feed item's summary.

What we do in short:

  • create a new feed crawling workflow with a "webFetcher" worker to get the content as attachment
  • create a new feed crawling job with parameters for the "webFetcher" worker
  • create a new pipeline for indexing the attachment content
  • create a new feed indexing job which uses the new pipeline

Creating the new feed crawling workflow

The new workflow is just a copy of the original "feedCrawling" workflow which additionally uses a "webFetcher" worker:

  {
    "name":"feedCrawlingWithFetching",
    "modes":[
      "runOnce"
    ],
    "startAction":{
      "worker":"feedCrawler",        
      "output":{          
        "crawledRecords":"crawledRecordsBucket"
      }
    },
    "actions":[
      {
        "worker":"deltaChecker",
        "input":{
          "recordsToCheck":"crawledRecordsBucket"
        },
        "output":{
          "updatedRecords":"updatedLinksBucket"     
        }
      },        
      {
        "worker":"webFetcher",
        "input":{
          "linksToFetch":"updatedLinksBucket"
        },
        "output":{
          "fetchedLinks":"fetchedLinksBucket"
        }
      },
      {
        "worker":"updatePusher",
        "input":{
          "recordsToPush":"fetchedLinksBucket"
        }
      }
    ]
  }
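When you modify a workflow like this, the buckets have to line up: each action's input bucket must be produced by the start action or an earlier action. A quick sanity check for that wiring, written in Python against the JSON structure above (this is a helper for illustration, not part of SMILA):

```python
def check_wiring(workflow):
    """Check that every action's input bucket is produced earlier in the workflow.

    Walks the actions in order, collecting output bucket names, and verifies
    each input bucket was already produced. Illustration only, not SMILA code.
    """
    produced = set(workflow["startAction"]["output"].values())
    for action in workflow["actions"]:
        for bucket in action.get("input", {}).values():
            if bucket not in produced:
                return False
        produced.update(action.get("output", {}).values())
    return True
```

For the workflow above this confirms that crawledRecordsBucket feeds the deltaChecker, updatedLinksBucket feeds the webFetcher, and fetchedLinksBucket feeds the updatePusher.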


Creating the new feed crawling job

The new job is just a copy of the original "crawlFeed" job with the following changes:

  • no mapping entry for the feed item's "itemContents"
  • additional parameters for the "webFetcher" worker
  • we use another indexing job (see below), so "jobToPushTo" changes to "indexUpdateFeedWithFetching"
    
  {
    "name":"crawlFeedWithFetching",
    "workflow":"feedCrawlingWithFetching",
    "parameters":{
      "tempStore":"temp",
      "dataSource":"feed",
      "jobToPushTo":"indexUpdateFeedWithFetching",
      "feedUrls":"http://www.spiegel.de/schlagzeilen/tops/index.rss",                
      "mapping": {          
        "itemUri":"Url",
        "itemTitle":"Title",          
        "itemUpdateDate":"LastModifiedDate",
        "httpCharset": "Charset",
        "httpContenttype": "ContentType",          
        "httpMimetype": "MimeType",
        "httpSize": "Size",
        "httpUrl": "Url",
        "httpContent": "Content"
      }
    }
  }

Creating the new indexing pipeline

The new pipeline "AddFeedWithFetchingPipeline" is just a copy of the "AddFeedPipeline" with some changes:

   <process name="AddFeedWithFetchingPipeline" ...
   ...   

The activities "extractMimeType" and "extractContent" are not needed here, so we can remove them:

  <!-- extract mimetype -->    
  <extensionActivity>
    <proc:invokePipelet name="extractMimeType">
    ...
  </extensionActivity>    
  <!-- extract content -->    
  <extensionActivity>
    <proc:invokePipelet name="extractContent">
    ...
  </extensionActivity>    

The web fetcher delivers the content as attachment, so the activity "extractTextFromHTML" must use inputType ATTACHMENT:

    
  <extensionActivity>
    <proc:invokePipelet name="extractTextFromHTML">
      ...
      <proc:configuration>
        <rec:Val key="inputType">ATTACHMENT</rec:Val>
        ...
      </proc:configuration>
    </proc:invokePipelet>
  </extensionActivity>
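Conceptually, switching inputType to ATTACHMENT means the pipelet reads the raw fetched bytes of the page instead of a record attribute, and extracts the visible text from the HTML. A toy Python stand-in for that extraction step (not the actual extractTextFromHTML pipelet) looks like this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Toy stand-in for an HTML text extraction pipelet (illustration only):
    collects text content while skipping script and style elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def extract_text(html_bytes, charset="utf-8"):
    """Decode the fetched attachment bytes and return their visible text."""
    parser = TextExtractor()
    parser.feed(html_bytes.decode(charset))
    return " ".join(parser.parts)
```

The httpCharset attribute mapped in the crawl job above is what tells such a step how to decode the attachment bytes before extraction.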

Creating the new indexing job

Now we create an indexing job which uses the new pipeline:

  {
    "name":"indexUpdateFeedWithFetching",
    "workflow":"importToPipeline",
    "parameters":{
      "tempStore":"temp",
      "addPipeline":"AddFeedWithFetchingPipeline",
      "deletePipeline":"AddFeedWithFetchingPipeline"
    }
  }

That's it! Now you can start the new indexing and crawl job as described before, and (after a short time) you should be able to search.