SMILA/Project Concepts/Exceptions configurations and processing

Description

Please have a look at the exception handling of crawlers.

At the moment a Crawler can have an internal problem without crawling being stopped.

Therefore we need a concept for the framework that lets Crawlers report two kinds of problems: small problems, after which crawling can continue, and greater problems, on which the CrawlerController should stop. (If delta indexing is executed after a failed crawl, the whole index information would be lost, because all entries in the index would be deleted.)

Please have a look at the current code of the crawlers and the CrawlerController and write a concept to solve this. The CrawlerController should be aware of small problems (warnings?) and greater problems (errors?).

A Crawler is supposed to be a component developed by a 3rd party and is described only as an interface.

There are at least two problems:

  1. Both currently implemented crawlers (file system and web) are based on the Java producer-consumer pattern. The CrawlerController thread acts as the consumer, but exceptions may also occur on the producer level.
  2. Who decides whether a problem is critical and the process should be stopped, or not critical and the process may continue?

Discussion

Sebastian Voigt:

Especially for the CrawlerController and the crawler workflow we have only two different error types:

  • Crawler Errors: the Crawler cannot start or cannot connect to the data source. The CrawlerController should stop crawling.
  • Crawling Errors: errors that can occur for every record or only for specific records. The user has to know that an error occurred, but crawling should not be stopped.

Usually I would use CrawlerException only for Crawler Errors. Crawling Errors would be logged by the Crawler itself, and the Crawler would not throw an Exception for them. Do we need to throw an Exception in this case? What is the advantage? What can the CrawlerController do with this information? I see only one advantage: if this occurs often, the CrawlerController can stop crawling to prevent the case in which the Crawler throws a "Crawling" Exception for every entry in the data source. Otherwise all records in the index would be deleted when DI is used.

What is the advantage of collecting exceptions? For logging it is important that the log method is called where the problem occurs, and not one class further up because an exception was thrown. Thus we should not use exceptions to move logging to other classes. Also, logging and throwing should not be done at the same time.

Ivan Churkin:

  1. What should I throw/log for FileSystemCrawler if the crawling folder is not found? IMHO this is OK and the index should be deleted.
  2. Unfortunately the crawling process is not encapsulated inside the crawler. For example, the crawler was able to read the DI data for a file and pass it to the CrawlerController (CC), but was unable to read all properties of the Record: then we should throw an error to the CC. On the other hand, if we were unable to read only the user rights, but the content was read OK, should we throw an error or skip the record?

Sebastian Voigt: I think that is correct: if the folder does not exist on a second crawler run for a specific folder, the index has to be cleared. Interesting point. For your examples it depends; the crawler developer has to decide from case to case what information to send to the CrawlerController: stop crawling or skip only this entry.

Daniel Stucky: I will describe my point of view. I agree that during crawling there are critical and non-critical errors. Non-critical errors mean that the crawl process itself is stable but a record could not be read (e.g. temporary network problem, file is accessed by another application, etc.); crawling should continue. Then there are critical errors where the crawl process should be stopped (e.g. bad configuration, root folder not available, no authentication, recurring network problems, etc.). I guess every Crawler has to decide for itself what is critical and non-critical. In either case, DeltaIndexing:delete() MUST NOT be performed. It is crucial that the DeltaIndexing logic removes only those records from the index that are really deleted from the original source! I don't know whether it is better if the CrawlerController stops the Crawler on critical errors or if the Crawler does this on its own. What is important is that the CrawlerController is informed that there were errors and so can skip DeltaIndexing:delete(). The CrawlerController doesn't even need to know which record an error belongs to. If each Crawler stops itself on critical errors, it would be enough to provide a method hasExceptions() on the Crawler that the CrawlerController calls to check for errors. Otherwise the CrawlerController needs to decide whether to stop crawling or continue. To differentiate between errors we could use either different exceptions or error levels.

If a root folder for crawling (or the start URL) is not found, the Crawler should throw an exception. Usually this is a wrong configuration. Imagine someone builds an index with millions of documents from one directory. Then, because of a typo in the configuration, suddenly the complete index is deleted. This is not good behavior. We have to provide other mechanisms for deleting whole indexes or data sources from an index. If, however, the root folder exists and is empty, the Crawler should process it normally (thus sending 0 records).

Jürgen Schumacher: A remark from an "outsider": I think that empty data sources always indicate some kind of problem. The above case "root folder exists and is empty" could be caused by a mount failure: the root folder could be a directory in which another file system should have been mounted. This case should not delete a million-document index! Therefore I propose for now to NEVER delete the index if the crawl did not find ANY document. As an advanced feature we could think of adding a config option to delta indexing that specifies a threshold for index deletion: "do not delete obsolete documents from the index if more than X% would be deleted!".
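
The two proposals above (never delete after an empty crawl, and a percentage threshold) could be combined in a small guard that the delta indexing logic asks before deleting. This is only a sketch: the class name DeltaDeleteGuard, the parameter maxDeleteRatio and the three counters are assumptions made for illustration and are not part of SMILA.

```java
/** Hypothetical guard for delta-delete; not part of the SMILA API. */
final class DeltaDeleteGuard {
  // highest share of the index that one crawl may delete, e.g. 0.5 for 50%
  private final double maxDeleteRatio;

  DeltaDeleteGuard(double maxDeleteRatio) {
    if (maxDeleteRatio < 0.0 || maxDeleteRatio > 1.0) {
      throw new IllegalArgumentException("maxDeleteRatio must be in [0, 1]");
    }
    this.maxDeleteRatio = maxDeleteRatio;
  }

  /**
   * @param indexedBefore records of this data source in the index before the crawl
   * @param visited records the crawl actually found
   * @param obsolete indexed records that the crawl did not see
   * @return true if the obsolete records may be deleted from the index
   */
  boolean mayDelete(long indexedBefore, long visited, long obsolete) {
    if (visited == 0) {
      return false; // an empty crawl never deletes anything
    }
    if (indexedBefore == 0) {
      return true; // the index was empty, so nothing can be lost
    }
    return (double) obsolete / indexedBefore <= maxDeleteRatio;
  }
}
```

With a threshold of 50%, a crawl that finds nothing deletes nothing, and a crawl that would remove 800 of 1000 indexed records is refused as well.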

Resulting Rules for SMILA

Daniel Stucky:

  • What is the reason for 4.4, never logging when throwing an exception? The worst case is that the same (or a similar) exception is logged multiple times.
  • What is the result for Crawlers? I assume it is:
    • Crawlers make use of CrawlerException and CrawlerCriticalException.
    • If either exception occurs, the CrawlerController MUST NOT execute delta-delete.
    • On CrawlerCriticalException the CrawlerController stops the crawl process.

I see the following issues:

  • If the Crawler does not stop itself on critical errors but waits to be stopped by the CrawlerController, it may produce lots of internal errors.
  • getNextDeltaIndexingData() returns an MObject[]. Imagine a (non-critical) exception occurs while the Crawler builds this return object. What happens to the MObjects built so far? They are never returned, because an Exception is thrown! There is no problem if the size of the MObject[] is always 1, but for better performance we want to use bigger sizes (in tests values of up to 100 proved feasible). So an exception on the n-th element would lose the n-1 elements before it. I guess this is really bad behavior.
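
The second issue can be reproduced in a few lines. The sketch below is not SMILA code: String stands in for MObject, and BatchLossDemo, RecordReadException and the built counter are hypothetical names that exist only to make the loss visible.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchLossDemo {
  /** Stand-in for a non-critical crawler problem (hypothetical). */
  static class RecordReadException extends Exception {
    RecordReadException(String message) {
      super(message);
    }
  }

  /** Counts every element that was built, whether or not it reached the caller. */
  static int built = 0;

  /** Builds a batch of the given size; reading the element at index failAt fails. */
  static String[] nextBatch(int size, int failAt) throws RecordReadException {
    List<String> batch = new ArrayList<String>();
    for (int i = 0; i < size; i++) {
      if (i == failAt) {
        throw new RecordReadException("cannot read element " + i);
      }
      batch.add("record-" + i);
      built++;
    }
    return batch.toArray(new String[0]);
  }

  /** Returns how many elements the caller actually received. */
  static int receive(int size, int failAt) {
    try {
      return nextBatch(size, failAt).length;
    } catch (RecordReadException e) {
      return 0; // the elements built before the failure are lost with the batch
    }
  }
}
```

Calling receive(100, 41) builds 41 elements and delivers none of them, which is exactly the behavior described above.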



Technical proposal

First problem "producer-consumer"

It is obvious that the producer should collect exceptions in some exception list/queue.

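// java.util.concurrent.ConcurrentLinkedQueue, because producer and consumer run in different threads
private final Queue<Throwable> _producerExceptions = new ConcurrentLinkedQueue<Throwable>();

  ...

  // producer thread
  try {
    // produce the next records
    ...
  } catch (Throwable ex) {
    _producerExceptions.add(ex);
    if (isCritical(ex)) {
      // stop the producer
    }
  }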

There are two ways to pass all errors to the CrawlerController. The first is to check for producer exceptions inside every public (consumer) method.

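public MObject[] getNextDeltaIndexingData() throws CrawlerException {
  // rethrow the oldest producer exception, if there is one
  Throwable ex = _producerExceptions.poll();
  if (ex != null) {
    throw new CrawlerException(ex);
  }
  ...
}

public Record getRecord(int pos) throws CrawlerException {
  // the same check
  ...
}

// the same check in every other public method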

The second solution is to enrich the Crawler interface

interface Crawler {
  ...
  boolean hasInnerExceptions();
  Throwable[] getInnerExceptions();
}

and to process the collected exceptions in the CrawlerController.

...
  MObject[] diData;
  while (true) {
    if (_crawler.hasInnerExceptions()) {
      // check whether one of the producer exceptions is critical
    }
    try {
      diData = _crawler.getNextDeltaIndexingData();
    } catch (Throwable ex) {
      // check whether ex is critical
    }
  }
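
The second solution can be tried out with a self-contained sketch. Everything in it is an assumption made for illustration: String stands in for a record, items starting with "broken" simulate a read problem, and QueueingCrawlerSketch and ControllerSketch are hypothetical names, not the SMILA Crawler or CrawlerController.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical crawler: a producer thread fills a queue, the controller consumes it. */
final class QueueingCrawlerSketch {
  private static final String END = "<end>"; // marker: the producer has finished
  private final BlockingQueue<String> _records = new LinkedBlockingQueue<String>();
  private final Queue<Throwable> _producerExceptions = new ConcurrentLinkedQueue<Throwable>();

  QueueingCrawlerSketch(final List<String> source) {
    Thread producer = new Thread(() -> {
      for (String item : source) {
        try {
          if (item.startsWith("broken")) { // simulated read problem
            throw new IllegalStateException("cannot read " + item);
          }
          _records.add(item);
        } catch (RuntimeException ex) {
          _producerExceptions.add(ex); // collect it instead of losing it in the producer thread
        }
      }
      _records.add(END);
    });
    producer.setDaemon(true);
    producer.start();
  }

  boolean hasInnerExceptions() {
    return !_producerExceptions.isEmpty();
  }

  /** Removes and returns all exceptions collected so far. */
  Throwable[] getInnerExceptions() {
    List<Throwable> drained = new ArrayList<Throwable>();
    for (Throwable ex = _producerExceptions.poll(); ex != null; ex = _producerExceptions.poll()) {
      drained.add(ex);
    }
    return drained.toArray(new Throwable[0]);
  }

  /** Returns the next record, or null when the producer has finished. */
  String next() {
    try {
      String item = _records.take();
      return END.equals(item) ? null : item;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for records", e);
    }
  }
}

final class ControllerSketch {
  /** Consumes all records; returns true only if no producer exception was reported. */
  static boolean crawl(QueueingCrawlerSketch crawler, List<String> indexed) {
    int errors = 0;
    for (String record = crawler.next(); record != null; record = crawler.next()) {
      indexed.add(record);
      errors += crawler.getInnerExceptions().length; // a real controller would check criticality here
    }
    errors += crawler.getInnerExceptions().length; // exceptions reported after the last record
    return errors == 0; // delta-delete is only safe after an error-free crawl
  }
}
```

The producer adds its exception before it enqueues the end marker, so once the controller has seen the end of the data, every collected exception is visible to it.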

Second problem "is critical?"

There are also two solutions.

Let the crawler developer decide

The first is to let the crawler developer decide, and to define two fixed exception types in the interface, for example CrawlerCriticalException and CrawlerNonCriticalException.
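
A minimal sketch of how a controller could react to the two fixed types, following the rules proposed in the discussion (any error forbids delta-delete, a critical error also stops the crawl). The common base class CrawlerException, the RecordSource interface, CrawlResult and TwoLevelController are assumptions for illustration; String again stands in for a record.

```java
import java.util.ArrayList;
import java.util.List;

/** Assumed common base class of both exception types. */
class CrawlerException extends Exception {
  CrawlerException(String message) {
    super(message);
  }
}

class CrawlerCriticalException extends CrawlerException {
  CrawlerCriticalException(String message) {
    super(message);
  }
}

class CrawlerNonCriticalException extends CrawlerException {
  CrawlerNonCriticalException(String message) {
    super(message);
  }
}

/** Hypothetical, reduced crawler interface: returns null when the crawl is finished. */
interface RecordSource {
  String next() throws CrawlerException;
}

final class CrawlResult {
  final List<String> records;
  final boolean stopped;            // true if a critical error ended the crawl
  final boolean deltaDeleteAllowed; // true only after an error-free crawl

  CrawlResult(List<String> records, boolean stopped, boolean deltaDeleteAllowed) {
    this.records = records;
    this.stopped = stopped;
    this.deltaDeleteAllowed = deltaDeleteAllowed;
  }
}

final class TwoLevelController {
  static CrawlResult crawl(RecordSource source) {
    List<String> records = new ArrayList<String>();
    boolean errorSeen = false;
    while (true) {
      try {
        String record = source.next();
        if (record == null) {
          return new CrawlResult(records, false, !errorSeen);
        }
        records.add(record);
      } catch (CrawlerCriticalException e) {
        return new CrawlResult(records, true, false); // stop crawling, never delta-delete
      } catch (CrawlerException e) {
        errorSeen = true; // skip this record and continue, but remember the error
      }
    }
  }
}
```

The catch order matters: the critical subtype has to be caught before the base type, otherwise the controller would treat every error as non-critical.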

Configurable SMILAException

The second solution is to make exception processing configurable. This can be done in a generic way: the list of possible exceptions is declared in an XML configuration file, and an XSD schema plus a helper class are provided for loading and validating the configuration and for creating a new exception from its name.

bundle: org.eclipse.smila.exception

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified"
xmlns="http://www.eclipse.org/smila/exception"
targetNamespace="http://www.eclipse.org/smila/exception"
>
<xs:element name="ExceptionConfigurations">
  <xs:annotation>
    <xs:documentation>list of exceptions</xs:documentation>
  </xs:annotation>
  <xs:complexType>
    <xs:sequence>
      <xs:element name="ExceptionConfig" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Text" type="xs:string"/>
          </xs:sequence>
          <xs:attribute name="name" type="xs:string" use="required"/>
          <xs:attribute name="critical" type="xs:boolean" use="optional" default="true"/>
          <!-- for extensibility: the class should extend EilfException or implement some common interface -->
          <xs:attribute name="class" type="xs:string" use="optional"/>
        </xs:complexType>
       </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

public class EilfException extends Exception {
  ...
  public boolean isCritical() {
    ...
  }
  ...
}

public class ExceptionHelper {

  // loads the configuration, e.g. for usage in the crawler controller
  public ExceptionHelper(final String bundleId) {
    ...
  }

  // loads the configuration and validates that the required exceptions are declared,
  // e.g. for usage in a crawler
  public ExceptionHelper(final String bundleId, final String[] requiredExceptions) {
    ...
  }

  // creates the EilfException that is configured under the given name
  public EilfException createException(final String name) {
    return null; // not implemented yet
  }

  // creates and throws the exception
  public void throwException(final String name) throws EilfException {
    throw createException(name);
  }

  public boolean isCritical(final String name) {
    ...
  }
}

Core components, for example the CrawlerController, will process exceptions using the same configuration file.
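
How such a helper could read the critical flag from a configuration that follows the schema above is sketched below, using only the JDK's DOM parser. The class name ExceptionConfigSketch, reading the configuration from a String instead of a bundle, and treating unknown names as critical are assumptions; the sketch also skips XSD validation, so it applies the schema default (critical="true") by hand.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical reader for the ExceptionConfigurations format; not the real ExceptionHelper. */
final class ExceptionConfigSketch {
  private static final String NS = "http://www.eclipse.org/smila/exception";
  private final Map<String, Boolean> _critical = new HashMap<String, Boolean>();

  ExceptionConfigSketch(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList configs = doc.getElementsByTagNameNS(NS, "ExceptionConfig");
      for (int i = 0; i < configs.getLength(); i++) {
        Element config = (Element) configs.item(i);
        // without schema validation a missing attribute is returned as "",
        // so the schema default critical="true" is applied here by hand
        String flag = config.getAttribute("critical");
        boolean critical = flag.isEmpty() || flag.equals("true") || flag.equals("1");
        _critical.put(config.getAttribute("name"), critical);
      }
    } catch (Exception e) {
      throw new IllegalArgumentException("invalid exception configuration", e);
    }
  }

  /** Unknown names are treated as critical (an assumption, chosen to fail safe). */
  boolean isCritical(String name) {
    Boolean critical = _critical.get(name);
    return critical == null || critical;
  }
}
```

A crawler and the controller that load the same file then agree on which named problems stop the crawl, without either of them hard-coding the decision.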

Resulting Rules for SMILA (suggestion)

  1. Whether to throw an Exception depends on the situation. Every bundle/package should decide for itself when to throw an Exception and when to handle the problem/error itself.

Bad Example:

class Bundle {
  public void run() throws MyException {
    try {
      ...
    } catch (Exception e) {
      throw new MyException("Problem!");
    }
  }
}

Or

class Bundle {
  public void run() {
    try {
      ...
    } catch (Exception e) {
      _log.error("Problem");
    }
  }
}

Good example:

public void run() throws FileNotFoundException {
  try {
    ...
  } catch (IOException ioException) {
    _log.debug("IOException, will try it again");
  } catch (UrlException urlException) {
    throw new FileNotFoundException("Can't work anymore, inform caller");
  }
}

  2. Bundles/Packages (same scope of work) can define their own checked Exceptions. These can be used to return more specific information, because some Exceptions thrown inside methods are not informative for the caller. Therefore, throwing "internal" Exceptions (e.g., URLException) to the caller should be avoided.

Bad Example:

class FileSystemCrawler {
  // without an explicit description the URLException makes no sense for the caller
  public Object getThing() throws URLException {
    ...
  }
}

Good Example:

class FileSystemCrawler {
  public Object getThings() throws CrawlerException {
    try {
      ...
    } catch (URLException e) {
      throw new CrawlerException("Problems with ...", e);
    }
  }
}

  3. Self-defined Exceptions should encapsulate layer-dependent / internal exceptions (like IOException). The caller can use getCause() or the stack trace to get more information about the problem.
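
A small sketch of what the caller gains from the encapsulation: the message speaks the language of the crawler layer, and the low-level exception stays reachable through getCause(). WrappingSketch, readRoot and the simulated IOException are hypothetical; CrawlerException is reduced to the one constructor needed here.

```java
import java.io.IOException;

class CrawlerException extends Exception {
  CrawlerException(String message, Throwable cause) {
    super(message, cause);
  }
}

final class WrappingSketch {
  /** Simulates a crawler method whose low-level file access fails. */
  static void readRoot(String path) throws CrawlerException {
    try {
      // stands in for a failing low-level call
      throw new IOException("permission denied: " + path);
    } catch (IOException e) {
      // wrap the internal exception, keep it as the cause
      throw new CrawlerException("Cannot read root folder " + path, e);
    }
  }
}
```

A caller that catches the CrawlerException only has to know the crawler's own exception type, and can still log e.getCause() when the details are needed.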

Bundles/Packages should define different Exceptions in case the caller has to react differently to different problems.

Bad Example:

class FileSystemCrawler {
  public Object get...() throws CrawlerException {
    if (state == 1) {
      throw new CrawlerException("not really bad");
    } else {
      throw new CrawlerException("Crawler should be stopped");
    }
  }
}

Good Example:

class FileSystemCrawler {
  public Object getThings() throws CrawlerException, CrawlerCriticalException {
    try {
      ...
    } catch (IOException e) {
      if (state == 1) {
        throw new CrawlerException("not really bad", e);
      } else {
        throw new CrawlerCriticalException("Crawler should be stopped", e);
      }
    }
  }
}

  4. Never log and throw the same Exception (doing both should be reserved for special cases).

Bad Example:

catch (IOException e) {
  _log.error("We have a problem");
  throw new BundleException();
}

  5. Unchecked Exceptions should only be used in specific cases, namely when the developer wants to ensure that preconditions are fulfilled. If these preconditions are not fulfilled, the developer should throw an unchecked Exception to stop the execution immediately.

Bad Example:

public Object getThings() {
  try {
    ...
  } catch (Exception e) {
    throw new NullPointerException();
  }
}

Good Example:


/**
 * This method should only be executed when _crawler is instantiated.
 */
public Object getThings() {
  ...
  if (_crawler == null) { // here we know that a developer has used this method incorrectly
    throw new NullPointerException();
  }
}