Binary Metadata

Updated: March 18, 2024

The Nuxeo Platform can extract information from the uploaded files attached to a document and automatically fill in the document metadata at creation time. This lets you leverage metadata that exists outside the Nuxeo Platform to categorize documents automatically, sparing users from editing each document to enter this metadata by hand. Automated metadata extraction is activated by default on Nuxeo DAM: the IPTC caption, copyright and source are used to automatically fill in the description, rights and source metadata of pictures.

How It Works

A Nuxeo listener watches for document creation and modification, and triggers metadata mapping under the following conditions:

  • On document creation, if the attached binary is not empty, the listener reads the metadata and updates the document.
  • On document modification:
    • If the attached binary has changed and the document metadata has not, the listener reads the metadata from the attached binary and stores it in the document.
    • If the attached binary has changed and the document metadata also has changed, the listener writes the metadata from the document to the attached binary.
    • If the attached binary hasn't changed and the document metadata has changed, the listener writes the metadata from the document to the attached binary.

This means that a metadata mapping is bidirectional by default: values are read from the blob and stored in fields, and if the value of a mapped field is modified and the document is saved, this value is then written into the blob. This behavior is sometimes not desirable and can be controlled with the readonly attribute of a mapping contributed to the metadataMappings extension point. With readonly="true", modifying a field does not change the binary. The default value is readonly="false", to preserve compatibility with existing configurations.
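The listener rules above can be summarized as a small decision function. The following is a minimal, self-contained sketch and not Nuxeo code: the class, enum and method names are hypothetical, the "nothing changed" case is assumed to be a no-op, and a readonly mapping is assumed to do nothing when a mapped field was edited, even if the binary changed too.

```java
public class ListenerRulesSketch {

    /** Direction of the synchronization performed for one document event. */
    public enum Action { READ_FROM_BLOB, WRITE_TO_BLOB, NONE }

    /** Document creation: metadata is read only when a non-empty binary is attached. */
    public static Action onCreation(boolean blobIsEmpty) {
        return blobIsEmpty ? Action.NONE : Action.READ_FROM_BLOB;
    }

    /** Document modification, for one mapping with the given readonly setting. */
    public static Action onModification(boolean blobChanged, boolean metadataChanged, boolean readOnly) {
        if (metadataChanged) {
            // An edited mapped field is written to the binary, whether or not
            // the binary changed too. Assumption: readonly="true" suppresses
            // the write and nothing else happens.
            return readOnly ? Action.NONE : Action.WRITE_TO_BLOB;
        }
        // Only the binary changed: read its metadata into the document.
        // Assumption: when neither side changed, nothing happens.
        return blobChanged ? Action.READ_FROM_BLOB : Action.NONE;
    }

    public static void main(String[] args) {
        System.out.println(onCreation(false));                  // READ_FROM_BLOB
        System.out.println(onModification(true, false, false)); // READ_FROM_BLOB
        System.out.println(onModification(true, true, false));  // WRITE_TO_BLOB
        System.out.println(onModification(false, true, true));  // NONE
    }
}
```

Note that an edited field takes precedence over a changed binary: when both changed, the document values are written to the binary rather than the other way round.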

You can contribute your own metadata mappings and have them applied by the listener under the same rules, through Nuxeo Automation operations, or both.

By default, the Nuxeo Platform uses ExifTool, which supports many different metadata formats including EXIF, GPS, IPTC and XMP. You can refer to its documentation for further details and a complete list of formats. Other processors can be added if needed.

Contributing Metadata Mappings

Metadata mappings are defined through an XML contribution to the metadataMappings extension point:

  <!-- Map binary metadata to Nuxeo document metadata -->
  <extension target="org.nuxeo.binary.metadata"
             point="metadataMappings">
    <!-- Define the "processor" to use and specify the attached binary's xpath ("blobXPath") -->
    <!-- The technical "id" should be unique -->
    <!-- "ignorePrefix" is set to true by default. Here the metadata names have prefixes, so it is set to false. -->
    <metadataMapping id="Example" processor="exifTool" blobXPath="file:content" ignorePrefix="false">
      <!-- "name" = binary metadata, "xpath" = document metadata -->
      <!-- See the PDF metadata extraction example on this page -->
      <metadata name="PDF:Producer" xpath="dc:title"/>
      <metadata name="PDF:Author" xpath="dc:description"/>
    </metadataMapping>

    <!-- A metadata mapping with no bidirectional update. Changing the myschema:my_exif_resolution_unit value will not change EXIF/ResolutionUnit in the binary (readonly="true") -->
    <metadataMapping id="ExampleReadOnly" processor="exifTool" blobXPath="file:content" ignorePrefix="true" readonly="true">
      <metadata name="ResolutionUnit" xpath="myschema:my_exif_resolution_unit"/>
    </metadataMapping>
  </extension>

Contributing Metadata Rules

This part is only needed if you plan to use your metadata mapping with the standard listener.

Metadata rules are defined through an XML contribution on the metadataRules extension point:

  <!-- Define which mappings will be called by the listener, and under which conditions -->
  <extension target="org.nuxeo.binary.metadata"
             point="metadataRules">
    <!-- "order" = priority, "async" = listener mode (set "true" to apply mapping as background work) -->
    <!-- Technical "id" should be unique -->
    <rule id="default" order="0" enabled="true" async="false">
      <metadataMappings>
        <metadataMapping-id>Example</metadataMapping-id>
        <metadataMapping-id>...</metadataMapping-id>
      </metadataMappings>
      <!-- see the link below for filter contributions -->
      <filters>
        <filter-id>hasFileType</filter-id>
        <filter-id>...</filter-id>
      </filters>
    </rule>
  </extension>

  <extension target="org.nuxeo.ecm.platform.actions.ActionService"
             point="filters">
    <filter id="hasFileType">
      <rule grant="true">
        <type>File</type>
      </rule>
    </filter>
  </extension>

For details, see the filters contribution documentation.

Default Operations

  • Document.SetMetadataFromBlob: Writes metadata to a Document from a binary, according to a contributed metadata mapping.
  • Blob.SetMetadataFromDocument: Writes metadata to a Blob (xpath parameter, or BlobHolder if empty) from a document (input), given a custom metadata mapping defined in a Properties parameter (xpath=metadataName) and using a named processor (exifTool for instance).
  • Blob.SetMetadataFromContext: Writes the given metadata from the Context to a Blob using a named processor (exifTool for instance), and returns the updated Blob.
  • Context.SetMetadataFromBlob: Reads metadata from a Blob (input) using a named processor (exifTool for instance), and puts the result (a Map) in the Context. The metadata to read is given in a StringList parameter (metadataName1, metadataName2, ...); this list is optional, and when it is omitted all the metadata returned by the processor is read.
  • Blob.ReadMetadata: Returns a Map of all the metadata properties of the input binary.

Contributing a New Processor

The default binary metadata processor contributed by Nuxeo is ExifTool:

<extension target="org.nuxeo.binary.metadata"
           point="metadataProcessors">
  <processor id="exifTool"
             class="org.nuxeo.binary.metadata.internals.ExifToolProcessor"
             prefix="true"/>
</extension>

If you need to add a new processor:

  1. Declare a new contribution with a specific id and class.

    <extension target="org.nuxeo.binary.metadata"
               point="metadataProcessors">
      <processor id="myProcessor"
                 class="org.mycompany.my.MyProcessorClazz"/>
    </extension>
    
  2. Extend org.nuxeo.binary.metadata.api.BinaryMetadataProcessor and implement the following methods:

    /**
     * Write given metadata into given blob. Since Nuxeo 7.3 ignorePrefix is added.
     *
     * @param blob Blob to write.
     * @param metadata Metadata to inject.
     * @param ignorePrefix Whether metadata name prefixes are ignored.
     * @return the updated blob, or {@code null} if there was an error
     */
    public Blob writeMetadata(Blob blob, Map<String, Object> metadata, boolean ignorePrefix);

    /**
     * Read the given list of metadata from a given blob. Since Nuxeo 7.3 ignorePrefix is added.
     *
     * @param blob Blob to read.
     * @param metadata Metadata to extract.
     * @param ignorePrefix Whether metadata name prefixes are ignored.
     * @return Metadata map.
     */
    public Map<String, Object> readMetadata(Blob blob, List<String> metadata, boolean ignorePrefix);

    /**
     * Read all metadata from a given blob. Since Nuxeo 7.3 ignorePrefix is added.
     *
     * @param blob Blob to read.
     * @param ignorePrefix Whether metadata name prefixes are ignored.
     * @return Metadata map.
     */
    public Map<String, Object> readMetadata(Blob blob, boolean ignorePrefix);
    

    For a reference implementation, see the ExifTool processor, org.nuxeo.binary.metadata.internals.ExifToolProcessor, and the command line documentation, which explains how to execute command lines from the Nuxeo Platform.

ExifTool Extraction Example

Metadata extraction example from a PDF file using ExifTool:

> exiftool -G -json hello.pdf
[{
  "SourceFile": "hello.pdf",
  "ExifTool:ExifToolVersion": 9.76,
  "File:FileName": "hello.pdf",
  "File:Directory": ".",
  "File:FileSize": "30 kB",
  "File:FileModifyDate": "2015:01:05 14:57:19+01:00",
  "File:FileAccessDate": "2015:01:05 17:02:43+01:00",
  "File:FileInodeChangeDate": "2015:01:05 14:57:19+01:00",
  "File:FilePermissions": "rwxr-xr-x",
  "File:FileType": "PDF",
  "File:MIMEType": "application/pdf",
  "PDF:PDFVersion": 1.4,
  "PDF:Linearized": "No",
  "PDF:PageCount": 1,
  "PDF:Language": "en-US",
  "PDF:Author": "John Doe",
  "PDF:Creator": "Writer",
  "PDF:Producer": "OpenOffice.org 3.2",
  "PDF:CreateDate": "2010:10:26 15:48:33+02:00"
}]
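To connect this output with the "Example" mapping shown earlier (PDF:Producer to dc:title, PDF:Author to dc:description), here is a minimal, self-contained sketch of how a mapping can be applied to the extracted values. It is an illustration only, not the Nuxeo implementation: the class and method names are hypothetical, and skipping names that are absent from the extraction is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MappingSketch {

    /**
     * Applies a mapping (binary metadata name -> document xpath) to the flat
     * map of values extracted from a binary. Assumption: names absent from
     * the extraction are skipped, leaving those document fields untouched.
     */
    public static Map<String, Object> toDocumentProperties(Map<String, Object> extracted,
            Map<String, String> mapping) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            Object value = extracted.get(entry.getKey());
            if (value != null) {
                properties.put(entry.getValue(), value);
            }
        }
        return properties;
    }

    public static void main(String[] args) {
        // A few values taken from the ExifTool output above.
        Map<String, Object> extracted = new LinkedHashMap<>();
        extracted.put("PDF:Author", "John Doe");
        extracted.put("PDF:Producer", "OpenOffice.org 3.2");
        extracted.put("PDF:PageCount", 1);

        // Same pairs as the "Example" metadataMapping contribution.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("PDF:Producer", "dc:title");
        mapping.put("PDF:Author", "dc:description");

        System.out.println(toDocumentProperties(extracted, mapping));
        // prints {dc:title=OpenOffice.org 3.2, dc:description=John Doe}
    }
}
```

The keys carry their group prefix (PDF:) because the command was run with -G, which is why the "Example" mapping sets ignorePrefix="false".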

Metrics

The Binary Metadata service exposes metrics for monitoring the performance of the default and custom processors.

To activate them, set the following property in nuxeo.conf:

binary.metadata.monitor.enable=true

Alternatively, set the log4j level to TRACE for org.nuxeo.binary.metadata.internals.BinaryMetadataComponent.

Execution times are then available through JMX under org.nuxeo.StopWatch.

Default Contribution

  • IPTC schema has been removed from document type Picture
  • Only IPTC:Source, IPTC:CopyrightNotice and IPTC:Caption-Abstract are stored, respectively in dc:source, dc:rights and dc:description.
  • Widget summary_picture_iptc has been removed from document summary
  • Mistral engine is removed from metadata extraction of the Nuxeo Platform
  • EXIF mapping remains identical

Here is the default metadata mapping contribution in the Nuxeo Platform:

<extension target="org.nuxeo.binary.metadata"
 point="metadataMappings">
  <metadataMapping id="EXIF" processor="exifTool" blobXPath="file:content" ignorePrefix="false">
    <metadata name="EXIF:ImageDescription" xpath="imd:image_description"/>
    <metadata name="EXIF:UserComment" xpath="imd:user_comment"/>
    <metadata name="EXIF:Equipment" xpath="imd:equipment"/>
    <metadata name="EXIF:DateTimeOriginal" xpath="imd:date_time_original"/>
    <metadata name="EXIF:XResolution" xpath="imd:xresolution"/>
    <metadata name="EXIF:YResolution" xpath="imd:yresolution"/>
    <metadata name="EXIF:PixelXDimension" xpath="imd:pixel_xdimension"/>
    <metadata name="EXIF:PixelYDimension" xpath="imd:pixel_ydimension"/>
    <metadata name="EXIF:Copyright" xpath="imd:copyright"/>
    <metadata name="EXIF:ExposureTime" xpath="imd:exposure_time"/>
    <metadata name="EXIF:ISO" xpath="imd:iso_speed_ratings"/>
    <metadata name="EXIF:FocalLength" xpath="imd:focalLength"/>
    <metadata name="EXIF:ColorSpace" xpath="imd:color_space"/>
    <metadata name="EXIF:WhiteBalance" xpath="imd:white_balance"/>
    <metadata name="EXIF:IccProfile" xpath="imd:icc_profile"/>
    <metadata name="EXIF:Orientation" xpath="imd:orientation"/>
    <metadata name="EXIF:FNumber" xpath="imd:fnumber"/>
  </metadataMapping>
  <metadataMapping id="IPTC" processor="exifTool" blobXPath="file:content" ignorePrefix="false">
    <metadata name="IPTC:Source" xpath="dc:source"/>
    <metadata name="IPTC:CopyrightNotice" xpath="dc:rights"/>
    <metadata name="IPTC:Caption-Abstract" xpath="dc:description"/>
  </metadataMapping>
</extension>
<extension target="org.nuxeo.binary.metadata"
 point="metadataRules">
  <rule id="iptc" order="0" enabled="true" async="false">
    <metadataMappings>
      <metadataMapping-id>EXIF</metadataMapping-id>
      <metadataMapping-id>IPTC</metadataMapping-id>
    </metadataMappings>
    <filters>
      <filter-id>hasPictureType</filter-id>
    </filters>
  </rule>
</extension>
<extension target="org.nuxeo.ecm.platform.actions.ActionService"
 point="filters">
  <filter id="hasPictureType">
    <rule grant="true">
      <type>Picture</type>
    </rule>
  </filter>
</extension>