The Scan Documents Importer addon creates documents from XML files located on the file system each time a dedicated event is fired. It can therefore easily be configured to import data on a regular basis.
Installation
This addon requires no specific installation steps. It can be installed like any other package, either with the nuxeoctl command line or from the Update Center.
Import process
The scan import process is composed of several elements:
- The files to import, classified in a folder structure.
- XML files linked to those files, which provide each document's type and property values.
- An output folder where XML files are moved once processed by the scan importer.
- Scan importer configurations, declared in an XML extension in Nuxeo Studio:
  - The import frequency (every 30 seconds or every night, for example)
  - Specific import sizing information (batch size, number of threads...)
  - The document type to apply to folders (Workspace by default) and to files (File by default). To import specific properties, change the default document type applied to files (targetLeafType property in the XML contribution).
  - The property mapping between the XML file tags and the Nuxeo document model (XPath values)
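The file-system side of this process can be pictured with a small Python sketch. It is not the addon's code: the function name `process_scan_folder` and the directory layout are assumptions made for illustration, and the "import" step is reduced to recording the descriptor path. It only shows the idea of finding XML descriptors in a folder structure and moving them to an output folder once they have been handled.

```python
import shutil
from pathlib import Path


def process_scan_folder(source_dir: Path, processed_dir: Path) -> list[str]:
    """Collect XML descriptors under source_dir, then move them to processed_dir.

    Hypothetical helper: a real import would create a document from each
    descriptor before moving it.
    """
    imported = []
    # sorted() materializes the listing first, since files are moved in the loop.
    for descriptor in sorted(source_dir.rglob("*.xml")):
        relative = descriptor.relative_to(source_dir)
        imported.append(relative.as_posix())
        # Recreate the same folder structure in the output folder.
        target = processed_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(descriptor), str(target))
    return imported
```

Running the function a second time on the same folder returns an empty list, because processed descriptors are no longer in the source folder. The files they describe are left untouched in this sketch.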
Configuration
A step-by-step example explaining the addon configuration can be found on the Nuxeo blog: [Monday Dev Heaven] Multi-threaded, transactional bulk import with Nuxeo
Please note that the XML can only be mapped to non-multivalued and non-complex fields. If you need to map multivalued or complex fields, see the Advanced XML Parsing section.
A Java mapper class example can be found on GitHub. Such a mapper lets you create a specific Nuxeo document type depending on the XML source.
Advanced XML Parsing
Advanced XML parsing for complex and/or multivalued fields can be achieved by adding the following bundles to your platform (copy the JAR files into the nxserver/bundles directory):
These bundles provide a new service (org.nuxeo.ecm.platform.importer.xml.parser.XMLImporterComponent) and extension points that must be used instead of the regular ones:
- documentMapping: determines which document type should be created depending on a set of conditions
- attributeMapping: parses the XML and maps it to the corresponding metadata
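As a rough illustration of the first extension point, the Python sketch below picks a document type from the tag names found in an XML source. It is not Nuxeo code: the `DOC_TYPE_BY_TAG` table and the `resolve_doc_types` function are hypothetical names, and the only condition modeled here is a match on the tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: XML tag name -> document type to create.
DOC_TYPE_BY_TAG = {"invoice": "Invoice"}


def resolve_doc_types(xml_text):
    """Return (tag, document type) pairs for every element matching a rule."""
    root = ET.fromstring(xml_text)
    # iter() walks the root element first, then its descendants in document order.
    return [
        (element.tag, DOC_TYPE_BY_TAG[element.tag])
        for element in root.iter()
        if element.tag in DOC_TYPE_BY_TAG
    ]
```

Elements without a matching rule, such as child tags that only carry metadata, do not produce a document in this sketch.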
Detailed documentation on advanced XML parsing can be found on the nuxeo-importer-xml-parser GitHub page. To get you started, below is a working example with the original XML file and the corresponding XML configuration that can be pasted into Nuxeo Studio.
<invoice>
  <order_number value="Invoice NX38937987-421-690" />
  <software_source value="My accounting software" />
  <supplier value="Papeterie Stylo Dépôt" />
  <order_date value="2005-03-12T11:00:00.000Z" />
  <planned_delivery_date value="2005-04-17" />
  <total_incl_taxes value="65.90" />
  <file name="order made on march 12 2005.pdf" />
  <item>
    <ref>373668</ref>
    <desc>Pens</desc>
    <amount>12.30</amount>
    <delivery_date>2005.04.17</delivery_date>
  </item>
  <item>
    <ref>737282</ref>
    <desc>Poster</desc>
    <amount>3.70</amount>
    <delivery_date>2005.04.17</delivery_date>
  </item>
  <item>
    <ref>029938</ref>
    <desc>Glue sticks</desc>
    <amount>7.75</amount>
    <delivery_date>2005.04.20</delivery_date>
  </item>
</invoice>
<!-- Doctype to create depending on XML formatting.
     In this case, having an invoice tag means I should create an Invoice document in Nuxeo -->
<extension target="org.nuxeo.ecm.platform.importer.xml.parser.XMLImporterComponent" point="documentMapping">
  <docConfig tagName="invoice">
    <docType>Invoice</docType>
  </docConfig>
</extension>
<!-- XML to metadata mapping
     In this case, my invoice schema is as follows:
       order_number           string
       software_source        string
       supplier               string
       total_incl_taxes       float
       order_date             date
       planned_delivery_date  date
       items                  complex, multivalued
         ref                  string
         description          string
         amount               float
         deliverydate         date
-->
<extension target="org.nuxeo.ecm.platform.importer.xml.parser.XMLImporterComponent" point="attributeMapping">
  <attributeConfig tagName="order_number" docProperty="dc:title" xmlPath="@value"/>
  <attributeConfig tagName="software_source" docProperty="dc:source" xmlPath="@value"/>
  <attributeConfig tagName="supplier" docProperty="invoice:supplier" xmlPath="@value"/>
  <attributeConfig tagName="total_incl_taxes" docProperty="invoice:amount" xmlPath="@value"/>
  <attributeConfig tagName="order_date" docProperty="invoice:orderdate" xmlPath="@value"/>
  <attributeConfig tagName="planned_delivery_date" docProperty="invoice:planneddeliverydate" xmlPath="@value"/>
  <attributeConfig tagName="file" docProperty="file:content">
    <mapping documentProperty="filename">@name</mapping>
    <mapping documentProperty="content">@name</mapping>
  </attributeConfig>
  <attributeConfig tagName="item" docProperty="invoice:items">
    <mapping documentProperty="ref">ref/text()</mapping>
    <mapping documentProperty="description">desc/text()</mapping>
    <mapping documentProperty="amount">amount/text()</mapping>
    <mapping documentProperty="deliverydate">
      <![CDATA[ #{
        String date = currentElement.selectNodes('delivery_date/text()')[0].getText().trim();
        return Fn.parseDate(date, 'yyyy.MM.dd')
      }]]>
    </mapping>
  </attributeConfig>
</extension>
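To make the result of this mapping easier to picture, here is a minimal Python sketch that reads a shortened copy of the invoice above and builds a plain dictionary keyed by the same property names. It is only an illustration of the mapping logic, not Nuxeo code: `SCALAR_MAPPINGS` and `map_invoice` are hypothetical names, values are kept as strings except for the delivery date, and the file mapping is left out.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened copy of the sample invoice.
SAMPLE = """<invoice>
  <order_number value="Invoice NX38937987-421-690" />
  <supplier value="Papeterie Stylo Dépôt" />
  <total_incl_taxes value="65.90" />
  <item>
    <ref>373668</ref>
    <desc>Pens</desc>
    <amount>12.30</amount>
    <delivery_date>2005.04.17</delivery_date>
  </item>
  <item>
    <ref>029938</ref>
    <desc>Glue sticks</desc>
    <amount>7.75</amount>
    <delivery_date>2005.04.20</delivery_date>
  </item>
</invoice>"""

# Tag name -> target property, read from the "value" attribute (xmlPath="@value").
SCALAR_MAPPINGS = {
    "order_number": "dc:title",
    "supplier": "invoice:supplier",
    "total_incl_taxes": "invoice:amount",
}


def map_invoice(xml_text):
    """Build a dict of property values from an invoice XML string."""
    root = ET.fromstring(xml_text)
    doc = {"invoice:items": []}
    for tag, prop in SCALAR_MAPPINGS.items():
        element = root.find(tag)
        if element is not None:
            doc[prop] = element.get("value")
    # Each <item> becomes one entry of the complex, multivalued property.
    for item in root.findall("item"):
        raw_date = item.findtext("delivery_date").strip()
        doc["invoice:items"].append({
            "ref": item.findtext("ref"),
            "description": item.findtext("desc"),
            "amount": item.findtext("amount"),
            # Same pattern as 'yyyy.MM.dd' in the configuration above.
            "deliverydate": datetime.strptime(raw_date, "%Y.%m.%d").date(),
        })
    return doc
```

Calling `map_invoice(SAMPLE)` yields one entry in `invoice:items` per `item` element, which mirrors how the item mapping fills a complex, multivalued property.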