SMILA/Project Concepts/MimeTypeIdentifier

Description

We need the functionality to identify the mimetype of documents, e.g. for compound handling or to control data transformation in BPEL.

Technical proposal

The MimeTypeIdentifier has to provide functionality to identify the mimetype of a document, either by the document's content or by mapping its file extension to a mimetype. The interface supports both approaches, as they may be combined for better results. The implementation could be done stepwise:

initial implementation: file extension to mime type mapping
advanced implementation: magic bytes analysis

{info:title=Useful Information} Mimetype identification is one of the core tasks of Aperture. Its approach combines magic bytes interpretation with a file extension to mimetype mapping. Perhaps Aperture could contribute this functionality to SMILA? For details see [[1]]

Aperture is also capable of identifying (and converting) Office Open XML formats like .docx, so these formats are not confused with ZIP archives containing XML files. See [[2]] {info}

Configuration

A configuration has to be provided for the mapping from file extensions to mimetypes. Multiple extensions may be associated with a single mimetype; this is supported. In theory a single extension may also be associated with multiple mimetypes. This may be a valid case, but the mapping used by the MimeTypeIdentifier has to be unambiguous. The implementation has to ensure this and prevent such configurations (e.g. by simply overriding the earlier mapping). A configuration could look like this:

<mimetypes>
    <mimetype id="application/rtf">
        <extensions>
            <ext>rtf</ext>
        </extensions>
    </mimetype>
    <mimetype id="application/vnd.ms-excel">
        <extensions>
            <ext>xls</ext>
        </extensions>
    </mimetype>
    <mimetype id="application/vnd.ms-powerpoint">
        <extensions>
            <ext>ppt</ext>
        </extensions>
    </mimetype>
    <mimetype id="image/jpeg">
        <extensions>
            <ext>jpe</ext>
            <ext>jpeg</ext>
            <ext>jpg</ext>
        </extensions>
    </mimetype>
    <mimetype id="text/html">
        <extensions>
            <ext>htm</ext>
            <ext>html</ext>
        </extensions>
    </mimetype>
    ...
</mimetypes>
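
To illustrate how such a configuration could be turned into an unambiguous mapping, here is a minimal sketch. The class name ExtensionMappingLoader and the decision to lowercase extensions are assumptions made for this example, not part of the proposal. It reads the XML with the JDK's DOM parser and lets a later entry override an earlier one for the same extension.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not an existing SMILA class): loads the <mimetypes>
// configuration into an extension -> mimetype map.
class ExtensionMappingLoader {

    /** Parses the configuration XML and returns the extension to mimetype mapping. */
    static Map<String, String> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> extensionToMimeType = new HashMap<>();
            // getElementsByTagName returns the elements in document order
            NodeList mimeTypes = doc.getElementsByTagName("mimetype");
            for (int i = 0; i < mimeTypes.getLength(); i++) {
                Element mimeType = (Element) mimeTypes.item(i);
                String id = mimeType.getAttribute("id");
                NodeList extensions = mimeType.getElementsByTagName("ext");
                for (int j = 0; j < extensions.getLength(); j++) {
                    // Assumption: extensions are compared case-insensitively
                    String ext = extensions.item(j).getTextContent()
                            .trim().toLowerCase(Locale.ROOT);
                    // put() replaces an earlier mapping, so every extension
                    // ends up with exactly one mimetype
                    extensionToMimeType.put(ext, id);
                }
            }
            return extensionToMimeType;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid mimetype configuration", e);
        }
    }
}
```

With this approach an ambiguous configuration is not rejected but silently resolved in favour of the last entry. An implementation could also log a warning or fail at that point.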

At the moment I don't know if there is anything to be configured for magic bytes analysis. The implementation should be extensible so that magic byte detection can be added for additional mimetypes.
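
One way to keep magic bytes detection extensible is a registry of signatures that further mimetypes can be added to. The following is only a sketch: the class MagicByteDetector and its methods are hypothetical, and it supports just one signature per mimetype. The three built-in signatures are the well-known file prefixes of PDF, PNG and JPEG.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not an existing SMILA or Aperture class):
// a registry of magic-byte signatures that can be extended at runtime.
class MagicByteDetector {

    // mimetype -> leading bytes; insertion order is the order of checking
    private final Map<String, byte[]> signatures = new LinkedHashMap<>();

    MagicByteDetector() {
        register("application/pdf", new byte[] {'%', 'P', 'D', 'F'});
        register("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        register("image/jpeg", new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
    }

    /** Adds (or replaces) the signature for a mimetype. */
    void register(String mimeType, byte[] signature) {
        signatures.put(mimeType, signature.clone());
    }

    /** Returns the first mimetype whose signature matches the start of the document, or null. */
    String detect(byte[] document) {
        for (Map.Entry<String, byte[]> entry : signatures.entrySet()) {
            if (startsWith(document, entry.getValue())) {
                return entry.getKey();
            }
        }
        return null;
    }

    /** Length of the longest registered signature. */
    int getMinByteCount() {
        int max = 0;
        for (byte[] signature : signatures.values()) {
            max = Math.max(max, signature.length);
        }
        return max;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Support for an additional mimetype is then a single call, e.g. register("image/gif", new byte[] {'G', 'I', 'F', '8'}).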

Interface

interface MimeTypeIdentifier
{
    /**
       Identifies the mimetype of a document.
       @param document - the document to identify the mimetype for. Note that the provided bytes do not need to be the whole document.
       @param filename - the filename of the document. This could be a simple filename, a full path or even a complex URI
       @return a String containing the mimetype or null if none could be identified
    */
    public String identify( byte[] document, String filename );
 
    /**
       Returns the minimum number of bytes needed to identify mimetypes. The document parameter of method identify must contain at least this many bytes. Otherwise identification cannot be done.
       @return the minimum number of bytes needed to identify mimetypes
    */
    public int getMinByteCount();
}

This functionality may be needed at various stages in SMILA. Besides a simple POJO implementation that provides the core functionality, we should also consider wrapping the functionality in a BPEL service.
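
A POJO implementation that combines both approaches could look roughly like the sketch below. The class CombinedMimeTypeIdentifier is hypothetical, and the two strategies are passed in as plain functions so the example stays independent of any concrete detector. The order (content first, filename as fallback) and the fallback for documents shorter than the minimum byte count are assumptions of this sketch, not decisions made by the proposal.

```java
import java.util.function.Function;

// Repeated here (without the comments) only to keep the sketch self-contained.
interface MimeTypeIdentifier {
    String identify(byte[] document, String filename);
    int getMinByteCount();
}

// Hypothetical POJO: tries content-based detection first and falls back
// to a filename-based lookup if the content gives no result.
class CombinedMimeTypeIdentifier implements MimeTypeIdentifier {

    private final Function<byte[], String> contentDetector;
    private final Function<String, String> filenameLookup;
    private final int minByteCount;

    CombinedMimeTypeIdentifier(Function<byte[], String> contentDetector,
                               Function<String, String> filenameLookup,
                               int minByteCount) {
        this.contentDetector = contentDetector;
        this.filenameLookup = filenameLookup;
        this.minByteCount = minByteCount;
    }

    @Override
    public String identify(byte[] document, String filename) {
        // Content detection is only attempted if enough bytes were provided
        if (document != null && document.length >= minByteCount) {
            String byContent = contentDetector.apply(document);
            if (byContent != null) {
                return byContent;
            }
        }
        // Assumption of this sketch: fall back to the filename instead of giving up
        return filename == null ? null : filenameLookup.apply(filename);
    }

    @Override
    public int getMinByteCount() {
        return minByteCount;
    }
}
```

Preferring the content over the extension guards against wrongly named files, while the filename fallback still covers formats without a registered signature.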

Related functionality

A utility component is needed to extract the file extension from a filename, path or URI.

interface ExtensionExtractor
{
    /**
       Extracts the file extension of a filename
       @param filename - the filename of the document. This could be a simple filename, a full path or even a complex URI
       @return a String containing the file extension or null if none could be identified
    */
    public String getExtension( String filename );
}
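
A possible implementation is sketched below. The class SimpleExtensionExtractor is hypothetical, and several choices are assumptions of this example: the extension is returned in lower case, names that start with a dot (such as .profile) are treated as having no extension, and a value is only parsed as a URI if it contains "://". Values that cannot be parsed as a URI are handled like plain paths.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Repeated here (without the comments) only to keep the sketch self-contained.
interface ExtensionExtractor {
    String getExtension(String filename);
}

// Hypothetical implementation for simple filenames, full paths and URIs.
class SimpleExtensionExtractor implements ExtensionExtractor {

    @Override
    public String getExtension(String filename) {
        if (filename == null) {
            return null;
        }
        String path = pathOf(filename);
        // Keep only the last segment; both '/' and '\' are accepted as separators
        int separator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(separator + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot only, or a trailing dot: no extension
        if (dot <= 0 || dot == name.length() - 1) {
            return null;
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Strips scheme, host, query and fragment if the value looks like a URI.
    private static String pathOf(String filename) {
        if (filename.contains("://")) {
            try {
                String path = new URI(filename).getPath();
                if (path != null) {
                    return path;
                }
            } catch (URISyntaxException e) {
                // Not a well-formed URI: treat the value as a plain path
            }
        }
        return filename;
    }
}
```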

Another component could be needed that extracts the encoding/charset of text documents (txt, html, xml, etc.).

This can be done by checking for a BOM (byte order mark) and/or by checking for encoding/charset tags or attributes in markup documents.

interface EncodingIdentifier
{
    /**
       Identifies the charset of a text or markup document.
       @param document - the document to identify the charset for. Note that the provided bytes do not need to be the whole document.
       @return a String containing the charset or null if none could be identified
    */
    public String identify( byte[] document );
 
    /**
       Returns the minimum number of bytes needed to identify the charset. The document parameter of method identify must contain at least this many bytes. Otherwise identification cannot be done.
       @return the minimum number of bytes needed to identify the charset
    */
    public int getMinByteCount();
}
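
The BOM check could be sketched as follows. The class BomEncodingIdentifier is hypothetical and covers only the BOM part; inspecting charset declarations in markup is left out. The byte sequences are the standard Unicode byte order marks for UTF-8, UTF-16 and UTF-32.

```java
// Repeated here (without the comments) only to keep the sketch self-contained.
interface EncodingIdentifier {
    String identify(byte[] document);
    int getMinByteCount();
}

// Hypothetical implementation that only looks at the byte order mark.
class BomEncodingIdentifier implements EncodingIdentifier {

    // The UTF-32 BOMs are checked first because the UTF-32LE BOM (FF FE 00 00)
    // starts with the UTF-16LE BOM (FF FE). A UTF-16LE text whose first
    // character is U+0000 is therefore reported as UTF-32LE.
    private static final String[] CHARSETS = {
        "UTF-32BE", "UTF-32LE", "UTF-8", "UTF-16BE", "UTF-16LE"
    };
    private static final int[][] BOMS = {
        {0x00, 0x00, 0xFE, 0xFF},
        {0xFF, 0xFE, 0x00, 0x00},
        {0xEF, 0xBB, 0xBF},
        {0xFE, 0xFF},
        {0xFF, 0xFE}
    };

    @Override
    public String identify(byte[] document) {
        for (int i = 0; i < BOMS.length; i++) {
            if (startsWith(document, BOMS[i])) {
                return CHARSETS[i];
            }
        }
        return null; // no BOM found
    }

    @Override
    public int getMinByteCount() {
        return 4; // length of the longest BOM
    }

    private static boolean startsWith(byte[] data, int[] bom) {
        if (data == null || data.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            // Mask to compare the signed Java byte as an unsigned value
            if ((data[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Most text files carry no BOM at all, so a null result from this check would normally be followed by the markup-based check or by a configured default charset.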

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
