The Nuxeo Platform can extract information from the files attached to a document and automatically fill in the document metadata at creation time. This lets you leverage metadata that exists outside the Nuxeo Platform to categorize documents automatically, so users do not have to edit each document to enter this metadata by hand. Automated metadata extraction is enabled by default on Nuxeo DAM: the IPTC caption, copyright and source are used to fill in the description, rights and source metadata of pictures.
How It Works
A Nuxeo listener watches for document creation and modification and triggers metadata mapping under the following conditions:
- On document creation, if the attached binary is not empty, the listener reads the metadata and updates the document.
- On document modification:
  - If the attached binary has changed and the document metadata has not, the listener reads the metadata from the attached binary and stores it in the document.
  - If both the attached binary and the document metadata have changed, the listener writes the metadata from the document to the attached binary.
  - If the attached binary has not changed and the document metadata has, the listener writes the metadata from the document to the attached binary.
This means that the MetadataMapping is bidirectional by default: values are read from the blob and stored in fields, and if the value of a mapped field is modified and the document is saved, this value is then written into the blob. This behavior is sometimes not desirable and can be controlled using the readonly attribute of the metadataMappings extension point. If readonly="true", modifying a field will not change the binary. The default value is readonly="false" to preserve compatibility with existing configurations.
You can contribute your own metadata mappings and have them applied through the same listener rules, through Nuxeo Automation operations, or both.
By default, the Nuxeo Platform uses ExifTool, which supports many different data formats including EXIF, GPS, IPTC, XMP. You can refer to its documentation for further details and a complete list of formats. Other processors can be added if needed.
Contributing Metadata Mappings
Metadata mappings are defined through an XML contribution on the metadataMappings extension point:
<!-- Map binary metadata to Nuxeo document metadata -->
<extension target="org.nuxeo.binary.metadata"
point="metadataMappings">
<!-- Define "processor" to use and specify the attached binary's xpath ("blobXPath") -->
<!-- Technical "id" should be unique -->
<!-- "ignorePrefix" is by default set to true. Here metadata have prefixes, so set it to false. -->
<metadataMapping id="Example" processor="exifTool" blobXPath="file:content" ignorePrefix="false">
<!-- "name" = binary metadata , "xpath" = document metadata -->
<!-- See PDF metadata extraction example in this page -->
<metadata name="PDF:Producer" xpath="dc:title"/>
<metadata name="PDF:Author" xpath="dc:description"/>
</metadataMapping>
<!-- A metadata mapping with no bidirectional update. Changing the myschema:my_exif_resolution_unit value will not change the EXIF/ResolutionUnit in the binary (readonly="true")-->
<metadataMapping id="ExampleReadOnly" processor="exifTool" blobXPath="file:content" ignorePrefix="true" readonly="true">
<metadata name="ResolutionUnit" xpath="myschema:my_exif_resolution_unit" />
</metadataMapping>
</extension>
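The two mappings above write tag names differently: with ignorePrefix="false" the names carry a group prefix (PDF:Producer), while with ignorePrefix="true" they do not (ResolutionUnit). The Java sketch below only illustrates that naming relationship, assuming the prefix is everything up to the first colon. The class and method names are hypothetical, and this is not how the ExifTool processor is implemented.

```java
/** Hypothetical illustration of prefixed vs. unprefixed tag names; not Nuxeo code. */
final class PrefixNaming {

    /** Drops the leading "Group:" part of a tag name, if there is one. */
    static String stripGroup(String tagName) {
        int colon = tagName.indexOf(':');
        return colon < 0 ? tagName : tagName.substring(colon + 1);
    }

    public static void main(String[] args) {
        // Name as used with ignorePrefix="false", reduced to its unprefixed form.
        System.out.println(stripGroup("PDF:Producer"));
        // Already unprefixed, as used with ignorePrefix="true".
        System.out.println(stripGroup("ResolutionUnit"));
    }
}
```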
Contributing Metadata Rules
Metadata rules are defined through an XML contribution on the metadataRules extension point:
<!-- Define which mappings will be called by the listener, and under which conditions -->
<extension target="org.nuxeo.binary.metadata"
point="metadataRules">
<!-- "order" = priority , "async" = listener mode (set "true" to apply mapping as background work) -->
<!-- Technical "id" should be unique -->
<rule id="default" order="0" enabled="true" async="false">
<metadataMappings>
<metadataMapping-id>Example</metadataMapping-id>
<metadataMapping-id>...</metadataMapping-id>
</metadataMappings>
<!-- see the link below for filter contributions -->
<filters>
<filter-id>hasFileType</filter-id>
<filter-id>...</filter-id>
</filters>
</rule>
</extension>
<extension target="org.nuxeo.ecm.platform.actions.ActionService"
point="filters">
<filter id="hasFileType">
<rule grant="true">
<type>File</type>
</rule>
</filter>
</extension>
For more details, see the filters contribution documentation.
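To make the roles of order, enabled and the filters more concrete, here is a minimal Java sketch of how a set of rules could be resolved into a list of mapping ids for a given document type. It is a simplification under stated assumptions, not the platform's implementation: the Rule class is a hypothetical stand-in for the rule element, filters are reduced to a list of granted document types, and a lower order value is assumed to be considered first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of rule selection; not the platform's implementation. */
final class RuleSelectionSketch {

    /** Stand-in for a rule element, with its filters reduced to granted document types. */
    static final class Rule {
        final String id;
        final int order;
        final boolean enabled;
        final List<String> mappingIds;
        final List<String> grantedTypes;

        Rule(String id, int order, boolean enabled, List<String> mappingIds, List<String> grantedTypes) {
            this.id = id;
            this.order = order;
            this.enabled = enabled;
            this.mappingIds = mappingIds;
            this.grantedTypes = grantedTypes;
        }
    }

    /** Collects the mapping ids of every enabled rule whose filters grant the document type. */
    static List<String> mappingIdsFor(String docType, List<Rule> rules) {
        List<Rule> sorted = new ArrayList<>(rules);
        // Assumption: a lower "order" value means the rule is considered first.
        sorted.sort(Comparator.comparingInt((Rule r) -> r.order));
        List<String> ids = new ArrayList<>();
        for (Rule rule : sorted) {
            if (rule.enabled && rule.grantedTypes.contains(docType)) {
                ids.addAll(rule.mappingIds);
            }
        }
        return ids;
    }

    /** Sample rules loosely modeled on the contributions shown on this page. */
    static List<Rule> sampleRules() {
        return Arrays.asList(
                new Rule("iptc", 1, true, Arrays.asList("EXIF", "IPTC"), Arrays.asList("Picture")),
                new Rule("default", 0, true, Arrays.asList("Example"), Arrays.asList("File")),
                new Rule("disabled", 2, false, Arrays.asList("Other"), Arrays.asList("File")));
    }

    public static void main(String[] args) {
        System.out.println(mappingIdsFor("File", sampleRules()));
        System.out.println(mappingIdsFor("Picture", sampleRules()));
    }
}
```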
Default Operations
- Document.SetMetadataFromBlob: Writes metadata to a document from a binary, according to a contributed metadata mapping.
- Blob.SetMetadataFromDocument: Writes metadata to a blob (given by the xpath parameter, or the BlobHolder if empty) from a document (input), using a custom metadata mapping defined in a Properties parameter (xpath=metadataName) and a named processor (exifTool for instance).
- Blob.SetMetadataFromContext: Writes the given metadata from the Context to a blob using a named processor (exifTool for instance), and returns the updated blob.
- Context.SetMetadataFromBlob: Reads metadata from a blob (input) using a named processor (exifTool for instance), and puts the result (a Map) in the Context. The metadata to read is given as a StringList parameter (metadataName1, metadataName2, ...); this list is optional, and when it is omitted all metadata returned by ExifTool is read.
- Blob.ReadMetadata: Returns a Map of all binary properties of the input blob.
Contributing a New Processor
The default binary metadata processor contributed by the Nuxeo Platform is ExifTool:
<extension target="org.nuxeo.binary.metadata"
point="metadataProcessors">
<processor id="exifTool"
class="org.nuxeo.binary.metadata.internals.ExifToolProcessor"
prefix="true"/>
</extension>
If you need to add a new processor:
Declare a new contribution with its own id and class:
<extension target="org.nuxeo.binary.metadata"
point="metadataProcessors">
<processor id="myProcessor"
class="org.mycompany.my.MyProcessorClazz"/>
</extension>
Implement org.nuxeo.binary.metadata.api.BinaryMetadataProcessor and provide the following methods:
/**
 * Write given metadata into given blob. Since Nuxeo 7.3 ignorePrefix is added.
 *
 * @param blob Blob to write.
 * @param metadata Metadata to inject.
 * @param ignorePrefix
 * @return the updated blob, or {@code null} if there was an error
 */
public Blob writeMetadata(Blob blob, Map<String, Object> metadata, boolean ignorePrefix);

/**
 * Read the given list of metadata from a given blob. Since Nuxeo 7.3 ignorePrefix is added.
 *
 * @param blob Blob to read.
 * @param metadata Metadata to extract.
 * @param ignorePrefix
 * @return Metadata map.
 */
public Map<String, Object> readMetadata(Blob blob, List<String> metadata, boolean ignorePrefix);

/**
 * Read all metadata from a given blob. Since Nuxeo 7.3 ignorePrefix is added.
 *
 * @param blob Blob to read.
 * @param ignorePrefix
 * @return Metadata map.
 */
public Map<String, Object> readMetadata(Blob blob, boolean ignorePrefix);
For an example, see the ExifTool implementation, org.nuxeo.binary.metadata.internals.ExifToolProcessor, and the command line documentation, which explains how to execute command lines from the Nuxeo Platform.
ExifTool Extraction Example
Metadata extraction example from a PDF file using ExifTool:
> exiftool -G -json hello.pdf
[{
"SourceFile": "hello.pdf",
"ExifTool:ExifToolVersion": 9.76,
"File:FileName": "hello.pdf",
"File:Directory": ".",
"File:FileSize": "30 kB",
"File:FileModifyDate": "2015:01:05 14:57:19+01:00",
"File:FileAccessDate": "2015:01:05 17:02:43+01:00",
"File:FileInodeChangeDate": "2015:01:05 14:57:19+01:00",
"File:FilePermissions": "rwxr-xr-x",
"File:FileType": "PDF",
"File:MIMEType": "application/pdf",
"PDF:PDFVersion": 1.4,
"PDF:Linearized": "No",
"PDF:PageCount": 1,
"PDF:Language": "en-US",
"PDF:Author": "John Doe",
"PDF:Creator": "Writer",
"PDF:Producer": "OpenOffice.org 3.2",
"PDF:CreateDate": "2010:10:26 15:48:33+02:00"
}]
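To connect this output to a metadata mapping, the Java sketch below applies the "Example" mapping from this page (PDF:Producer to dc:title, PDF:Author to dc:description) to a few of the values extracted above. It is a self-contained illustration only: the class and method names are hypothetical, plain maps stand in for the blob metadata and the document properties, and skipping tags that are absent from the binary is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of applying a metadata mapping; not Nuxeo code. */
final class MappingSketch {

    /** Copies each mapped tag found in the extracted metadata to its document xpath. */
    static Map<String, Object> apply(Map<String, String> nameToXpath, Map<String, Object> extracted) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : nameToXpath.entrySet()) {
            Object value = extracted.get(entry.getKey());
            if (value != null) {
                // Assumption: tags missing from the binary leave the document field untouched.
                properties.put(entry.getValue(), value);
            }
        }
        return properties;
    }

    /** The "Example" mapping: binary metadata name to document xpath. */
    static Map<String, String> exampleMapping() {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("PDF:Producer", "dc:title");
        mapping.put("PDF:Author", "dc:description");
        return mapping;
    }

    /** A subset of the values ExifTool reports for hello.pdf. */
    static Map<String, Object> helloPdfMetadata() {
        Map<String, Object> metadata = new LinkedHashMap<>();
        metadata.put("File:FileType", "PDF");
        metadata.put("PDF:PageCount", 1);
        metadata.put("PDF:Author", "John Doe");
        metadata.put("PDF:Producer", "OpenOffice.org 3.2");
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(apply(exampleMapping(), helloPdfMetadata()));
    }
}
```

Only the two mapped tags end up in the result; the other extracted values, such as File:FileType, are ignored because no mapping refers to them.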
Metrics
The Binary Metadata service exposes metrics to monitor the performance of the default and custom processors.
To activate them, set the following property in nuxeo.conf:
binary.metadata.monitor.enable=true
Alternatively, set the log4j level to TRACE for org.nuxeo.binary.metadata.internals.BinaryMetadataComponent.
Execution times are then available through JMX under org.nuxeo.StopWatch.
Default Contribution
- The IPTC schema has been removed from the Picture document type.
- Only IPTC:Source, IPTC:CopyrightNotice and IPTC:Caption-Abstract are stored, in dc:source, dc:rights and dc:description respectively.
- The summary_picture_iptc widget has been removed from the document summary.
- The Mistral engine has been removed from metadata extraction in the Nuxeo Platform.
- The EXIF mapping remains identical.
Here is the default metadata mapping contribution in the Nuxeo Platform:
<extension target="org.nuxeo.binary.metadata"
point="metadataMappings">
<metadataMapping id="EXIF" processor="exifTool" blobXPath="file:content" ignorePrefix="false">
<metadata name="EXIF:ImageDescription" xpath="imd:image_description"/>
<metadata name="EXIF:UserComment" xpath="imd:user_comment"/>
<metadata name="EXIF:Equipment" xpath="imd:equipment"/>
<metadata name="EXIF:DateTimeOriginal" xpath="imd:date_time_original"/>
<metadata name="EXIF:XResolution" xpath="imd:xresolution"/>
<metadata name="EXIF:YResolution" xpath="imd:yresolution"/>
<metadata name="EXIF:PixelXDimension" xpath="imd:pixel_xdimension"/>
<metadata name="EXIF:PixelYDimension" xpath="imd:pixel_ydimension"/>
<metadata name="EXIF:Copyright" xpath="imd:copyright"/>
<metadata name="EXIF:ExposureTime" xpath="imd:exposure_time"/>
<metadata name="EXIF:ISO" xpath="imd:iso_speed_ratings"/>
<metadata name="EXIF:FocalLength" xpath="imd:focalLength"/>
<metadata name="EXIF:ColorSpace" xpath="imd:color_space"/>
<metadata name="EXIF:WhiteBalance" xpath="imd:white_balance"/>
<metadata name="EXIF:IccProfile" xpath="imd:icc_profile"/>
<metadata name="EXIF:Orientation" xpath="imd:orientation"/>
<metadata name="EXIF:FNumber" xpath="imd:fnumber"/>
</metadataMapping>
<metadataMapping id="IPTC" processor="exifTool" blobXPath="file:content" ignorePrefix="false">
<metadata name="IPTC:Source" xpath="dc:source"/>
<metadata name="IPTC:CopyrightNotice" xpath="dc:rights"/>
<metadata name="IPTC:Caption-Abstract" xpath="dc:description"/>
</metadataMapping>
</extension>
<extension target="org.nuxeo.binary.metadata"
point="metadataRules">
<rule id="iptc" order="0" enabled="true" async="false">
<metadataMappings>
<metadataMapping-id>EXIF</metadataMapping-id>
<metadataMapping-id>IPTC</metadataMapping-id>
</metadataMappings>
<filters>
<filter-id>hasPictureType</filter-id>
</filters>
</rule>
</extension>
<extension target="org.nuxeo.ecm.platform.actions.ActionService"
point="filters">
<filter id="hasPictureType">
<rule grant="true">
<type>Picture</type>
</rule>
</filter>
</extension>