Nuxeo Add-Ons

Scan Documents Importer

Updated: March 18, 2024

Follow the related video course and exercises on Hyland University.

The Scan Documents Importer addon creates documents from XML files located on the file system every time a dedicated event is fired. It can therefore easily be configured to import data on a regular basis.

Installation

This addon requires no specific installation steps. It can be installed like any other package, either with the nuxeoctl command line or from the Update Center.

Import process

The scan import process is composed of several elements:

  1. The files to import, classified in a folder structure.
  2. XML files linked to the files to import, which define each document's type and property values.
  3. An output folder where XML files are moved once processed by the scan importer.
  4. Scan importer configurations, declared in an XML extension in Nuxeo Studio:

    1. The import frequency (every 30 seconds or every night, for example).
    2. Specific import sizing information (batch size, number of threads...).
    3. The document type to apply to the folders (Workspace by default) and to the files (File by default).

    If you need to import specific properties, change the default document type applied to the files (targetLeafType property in the XML contribution) to a type that carries them.

    4. The property mapping between the XML file tags and the Nuxeo document model (XPath values).
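
As a rough sketch of how the frequency and sizing settings above could be declared, the contribution below triggers an import every 30 seconds and sets a batch size and thread count. The ScannedFileMapperComponent target, its config extension point, the element names and the ScanIngestionStart event ID are recalled from the add-on and should be treated as assumptions to verify against the version you deploy. The folder paths and the schedule ID are placeholders.

<!-- Sketch only: component, element and event names are assumptions to verify -->
<extension target="org.nuxeo.ecm.platform.scanimporter.service.ScannedFileMapperComponent" point="config">
    <importerConfig>
        <!-- Placeholder folders: where files are read and where processed XML files are moved -->
        <sourcePath>/path/to/input</sourcePath>
        <processedPath>/path/to/output</processedPath>
        <!-- Import sizing -->
        <nbThreads>1</nbThreads>
        <batchSize>10</batchSize>
        <!-- Repository location that receives the imported documents -->
        <targetPath>/default-domain/workspaces</targetPath>
        <useXMLMapping>true</useXMLMapping>
    </importerConfig>
</extension>

<extension target="org.nuxeo.ecm.core.scheduler.SchedulerService" point="schedule">
    <schedule id="scanImporterSchedule">
        <!-- Assumed ID of the event that starts the import -->
        <eventId>ScanIngestionStart</eventId>
        <eventCategory>default</eventCategory>
        <!-- Quartz cron expression: every 30 seconds -->
        <cronExpression>0/30 * * * * ?</cronExpression>
    </schedule>
</extension>

For a nightly import, a cron expression such as 0 0 2 * * ? (every day at 2 AM) could be used instead.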

Configuration

A step-by-step example explaining the addon configuration can be found on the Nuxeo blog: [Monday Dev Heaven] Multi-threaded, transactional bulk import with Nuxeo

Please note that the XML can only be mapped to fields that are neither multivalued nor complex. If you need to map such fields, see the Advanced XML Parsing section.

A Java mapper class example can be found on GitHub. It makes it possible to create a specific Nuxeo document type depending on the XML source.
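
To illustrate the kind of simple mapping described above, here is a minimal sketch that maps one XML attribute to a plain string property and attaches the referenced file. The mapping extension point, the fieldMapping and blobMapping attribute names, and the placement of targetLeafType are recalled from the add-on and are assumptions to check against its documentation. The supplier and file tags and the MyDocType document type are hypothetical.

<!-- Sketch only: element and attribute names are assumptions to verify -->
<extension target="org.nuxeo.ecm.platform.scanimporter.service.ScannedFileMapperComponent" point="mapping">
    <mapping>
        <!-- Hypothetical document type applied to imported files instead of the default File -->
        <targetLeafType>MyDocType</targetLeafType>
        <fieldMappings>
            <!-- Reads the value attribute of a hypothetical <supplier> tag into dc:source -->
            <fieldMapping sourceXPath="//supplier" sourceAttribute="value" targetXPath="dc:source" targetType="string"/>
        </fieldMappings>
        <blobMappings>
            <!-- Reads the file location and file name from the name attribute of a hypothetical <file> tag -->
            <blobMapping sourceXPath="//file" sourcePathAttribute="name" sourceFilenameAttribute="name"/>
        </blobMappings>
    </mapping>
</extension>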

Advanced XML Parsing

Advanced XML parsing for complex and/or multivalued fields can be achieved by adding the following bundles to your platform (copy the JAR files into the nxserver/bundles directory):

  1. nuxeo-importer-xml-parser
  2. nuxeo-importer-scan-xml-parser

These bundles provide you with a new service (org.nuxeo.ecm.platform.importer.xml.parser.XMLImporterComponent) and extension points that need to be used instead of the regular ones:

  1. documentMapping to determine which document type should be created depending on a set of conditions

  2. attributeMapping to do the XML parsing and map to the corresponding metadata

Detailed documentation on advanced XML parsing usage can be found on the nuxeo-importer-xml-parser GitHub page. To get you started, below is a working example with the original XML file and the corresponding XML configuration that can be pasted into Nuxeo Studio.

Original XML file

<invoice>
  <order_number value="Invoice NX38937987-421-690" />
  <software_source value="My accounting software" />
  <supplier value="Papeterie Stylo Dépôt" />
  <order_date value="2005-03-12T11:00:00.000Z" />
  <planned_delivery_date value="2005-04-17" />
  <total_incl_taxes value="65.90" />
  <file name="order made on march 12 2005.pdf" />
  <item>
    <ref>373668</ref>
    <desc>Pens</desc>
    <amount>12.30</amount>
    <delivery_date>2005.04.17</delivery_date>
  </item>
  <item>
    <ref>737282</ref>
    <desc>Poster</desc>
    <amount>3.70</amount>
    <delivery_date>2005.04.17</delivery_date>
  </item>
  <item>
    <ref>029938</ref>
    <desc>Glue sticks</desc>
    <amount>7.75</amount>
    <delivery_date>2005.04.20</delivery_date>
  </item>
</invoice>

Corresponding XML extension in Nuxeo Studio

<!-- Doctype to create depending on XML formatting
     In this case, having an invoice tag means I should create an Invoice document in Nuxeo -->
<extension target="org.nuxeo.ecm.platform.importer.xml.parser.XMLImporterComponent" point="documentMapping">
    <docConfig tagName="invoice">
      <docType>Invoice</docType>
    </docConfig>
</extension>

<!-- XML to metadata mapping
     In this case, my invoice schema is as follows:
         order_number             string
         software_source          string
         supplier                 string
         total_incl_taxes         float
         order_date               date
         planned_delivery_date    date
         items                    complex, multivalued
             ref                  string
             description          string
             amount               float
             deliverydate         date
-->
<extension target="org.nuxeo.ecm.platform.importer.xml.parser.XMLImporterComponent" point="attributeMapping">
    <attributeConfig tagName="order_number" docProperty="dc:title" xmlPath="@value"/>
    <attributeConfig tagName="software_source" docProperty="dc:source" xmlPath="@value"/>
    <attributeConfig tagName="supplier" docProperty="invoice:supplier" xmlPath="@value"/>
    <attributeConfig tagName="total_incl_taxes" docProperty="invoice:amount" xmlPath="@value"/>
    <attributeConfig tagName="order_date" docProperty="invoice:orderdate" xmlPath="@value"/>
    <attributeConfig tagName="planned_delivery_date" docProperty="invoice:planneddeliverydate" xmlPath="@value"/>

    <attributeConfig tagName="file" docProperty="file:content">
        <mapping documentProperty="filename">@name</mapping>
        <mapping documentProperty="content">@name</mapping>
    </attributeConfig>

    <attributeConfig tagName="item" docProperty="invoice:items">
        <mapping documentProperty="ref">ref/text()</mapping>
        <mapping documentProperty="description">desc/text()</mapping>
        <mapping documentProperty="amount">amount/text()</mapping>
        <!-- The source date uses dots (2005.04.17), so it is parsed explicitly with an inline expression -->
        <mapping documentProperty="deliverydate">
            <![CDATA[
            #{
                String date = currentElement.selectNodes('delivery_date/text()')[0].getText().trim();
                return Fn.parseDate(date, 'yyyy.MM.dd')
            }]]>
        </mapping>
    </attributeConfig>
</extension>