Nuxeo Bulk Document Importer

Updated: September 3, 2024

Available for any Nuxeo Platform-based application, the Bulk Document Importer package enables multi-threaded mass document import into a Nuxeo repository. A single HTTP query launches a full import from the server file system.

Usage

The file importer comes as a Java library (with the Nuxeo Runtime Service) and a sample JAX-RS interface to launch, monitor and abort import jobs.

Quick Start

To import the folder /path/to/import into the workspace /default-domain/workspaces/some-workspace while monitoring the import logs from a REST client, use the following HTTP GET queries:

  • GET http://NUXEO_SERVER/nuxeo/site/fileImporter/logActivate
  • GET http://NUXEO_SERVER/nuxeo/site/fileImporter/run?targetPath=/default-domain/workspaces/some-workspace&inputPath=/path/to/import&batchSize=10&interactive=false&nbThreads
  • GET http://NUXEO_SERVER/nuxeo/site/fileImporter/log

A basic user interface is available when you open http://NUXEO_SERVER/nuxeo/site/fileImporter in a browser.

To execute these HTTP queries you can either use a browser with an active Nuxeo session (JSESSIONID cookie) or use a third party stateless HTTP client with HTTP Basic Authentication. This is an example with the curl command line client:

$ curl --basic -u 'Administrator:Administrator' "http://localhost:8080/nuxeo/site/fileImporter/log"

Don't forget to put the URL in quotes if it includes special shell characters such as &. You can also use the generic HTTP GUI client from the rest-client Java project: https://github.com/wiztools/rest-client. Be sure to fill in the Auth tab with your user credentials.
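The same authenticated call can be prepared from Java. The following is a minimal sketch, assuming Java 11 or later and the standard java.net.http client; the class and method names (ImporterLogRequest, basicAuth, logRequest) are hypothetical and not part of the importer. It only builds the request, so it can be inspected without a running server.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of nuxeo-platform-importer.
final class ImporterLogRequest {

    /** Builds the value of an HTTP Basic "Authorization" header. */
    static String basicAuth(String user, String password) {
        String raw = user + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds (but does not send) a GET request for the importer log buffer. */
    static HttpRequest logRequest(String serverBase, String user, String password) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/nuxeo/site/fileImporter/log"))
                .header("Authorization", basicAuth(user, password))
                .GET()
                .build();
    }
}
```

To actually send it, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and read the plain-text body.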

Memory

The importer requires a lot of memory. Make sure your maximum heap size is set as high as possible for your environment. Maximum heap size can be set in nuxeo.conf in the JAVA_OPTS variable. For example, argument -Xmx4g will set maximum heap size to 4 gigabytes. See Configuration Parameters Index (nuxeo.conf) for more details.

REST API

Method | URL | Description | Output
GET | nuxeo/site/randomImporter/run | Random text generator for load testing | text/plain; charset=UTF-8
GET | nuxeo/site/fileImporter/run | Default file importer | text/plain; charset=UTF-8
GET | nuxeo/site/fileImporter/log | Get current log buffer content | text/plain; charset=UTF-8
GET | nuxeo/site/fileImporter/logActivate | Activate logging | text/plain; charset=UTF-8
GET | nuxeo/site/fileImporter/logDesactivate | Deactivate logging | text/plain; charset=UTF-8
GET | nuxeo/site/fileImporter/status | Get importer thread status | text/plain; charset=UTF-8; "Running" or "Not Running"
GET | nuxeo/site/fileImporter/kill | Stop the importer thread if running | text/plain; charset=UTF-8
GET | nuxeo/site/fileImporter | Display a user interface that lets the user set the parameters | html
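A client that waits for an import to finish has to tell the two status answers apart, and "Not Running" contains the word "Running". The sketch below assumes the response body is exactly one of the two strings, possibly surrounded by whitespace. The class name ImporterStatus and the injected fetchStatus supplier are hypothetical; the supplier stands in for whatever code performs the GET on fileImporter/status.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of nuxeo-platform-importer.
final class ImporterStatus {

    /** True only for the exact "Running" answer; a substring test would also match "Not Running". */
    static boolean isRunning(String statusBody) {
        return "Running".equals(statusBody.trim());
    }

    /**
     * Polls the status until the importer is no longer running or maxPolls is reached.
     * Returns true if the importer was seen to stop, false otherwise.
     */
    static boolean waitUntilStopped(Supplier<String> fetchStatus, int maxPolls, long pauseMillis) {
        for (int i = 0; i < maxPolls; i++) {
            if (!isRunning(fetchStatus.get())) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag and give up
                return false;
            }
        }
        return false;
    }
}
```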

fileImporter/run

Parameter | Default value | Description
leafType | null | Leaf type used by the documentModelFactory for the import.
folderishType | null | Folderish type used by the documentModelFactory for the import.
inputPath | N/A | Root path to import (local to the server).
targetPath | N/A | Target path in Nuxeo.
skipRootContainerCreation | false | If true, the root container won't be created.
batchSize | 5 | Number of documents that will be created before doing a commit.
nbThreads | 5 | Maximum number of importer threads that can be allocated.
interactive | false |
transactionTimeout | 600 | Timeout for the transaction (in seconds). Can be increased when importing very big files, for example.

N/A: no default value, the parameter is required.
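As an illustration of how these parameters combine into a query string, here is a minimal sketch. The class name FileImporterRunUrl and its build method are hypothetical. The sketch rejects a missing inputPath or targetPath, spells out the defaults from the table so the request is self-describing, and percent-encodes values (the examples in this page pass raw paths instead).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper, not part of nuxeo-platform-importer.
final class FileImporterRunUrl {

    /** Builds the fileImporter/run URL; overrides replace the documented defaults. */
    static String build(String serverBase, String inputPath, String targetPath,
            Map<String, String> overrides) {
        Objects.requireNonNull(inputPath, "inputPath is required");
        Objects.requireNonNull(targetPath, "targetPath is required");

        Map<String, String> params = new LinkedHashMap<>();
        params.put("targetPath", targetPath);
        params.put("inputPath", inputPath);
        // Defaults copied from the parameter table.
        params.put("skipRootContainerCreation", "false");
        params.put("batchSize", "5");
        params.put("nbThreads", "5");
        params.put("interactive", "false");
        params.put("transactionTimeout", "600");
        params.putAll(overrides);

        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return serverBase + "/nuxeo/site/fileImporter/run?" + query;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

For example, build("http://localhost:8080", "/path/to/import", "/default-domain/workspaces/some-workspace", Map.of("batchSize", "10")) keeps every default except batchSize.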

With the following contribution you can configure the importer to work in non-bulk mode, which is a bit slower but allows regular Work instances to be created and directed to specific queues (see NXP-19573 for details):

<extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent"
    point="importerConfiguration">
  <importerConfig>
    <bulkMode>false</bulkMode>
    <documentModelFactory/>
  </importerConfig>
</extension>

randomImporter/run

Parameter | Default value | Description
targetPath | N/A | Target path in Nuxeo.
skipRootContainerCreation | |
batchSize | | Number of documents that will be created before doing a commit.
nbThreads | | Maximum number of importer threads that can be allocated.
interactive | |
nbNodes | N/A | Number of nodes to create.
fileSizeKB | |
onlyText | true |
blockSyncPostCommitProcessing | |
blockAsyncProcessing | |
bulkMode | true |
blockIndexing | false | When indexing is blocked, the import is faster and the reindexing can be done after the mass import.
nonUniform | false | Allows a non-uniform distribution of the number of nodes per folder:
  • A small number of nodes (~= 1) 10% of the time.
  • A big number of nodes (~= 5000) 10% of the time.
  • A random variation of the default number of nodes ( ~= 100) 80% of the time.

N/A: no default value, the parameter is required.
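To make the nonUniform distribution concrete, the following sketch samples a node count per folder with the 10% / 10% / 80% split described above. It is only an illustration, not the random importer's own code: the class name NonUniformNodes is hypothetical, and the spread of plus or minus 50% around the default of 100 nodes is an assumption, since the table only says "a random variation".

```java
import java.util.Random;

// Hypothetical illustration, not the random importer's implementation.
final class NonUniformNodes {

    static final int DEFAULT_NODES = 100;

    /** Draws a node count for one folder using the given random source. */
    static int nodesPerFolder(Random random) {
        return nodesPerFolder(random.nextDouble(), random.nextDouble());
    }

    /**
     * draw in [0,1) picks the bucket; jitter in [0,1) picks the variation
     * inside the 80% bucket.
     */
    static int nodesPerFolder(double draw, double jitter) {
        if (draw < 0.10) {
            return 1;      // small folder, 10% of the time
        }
        if (draw < 0.20) {
            return 5000;   // big folder, 10% of the time
        }
        // Remaining 80%: assumed spread of 50% to 150% of the default.
        return (int) Math.round(DEFAULT_NODES * (0.5 + jitter));
    }
}
```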

Listeners/Event Handlers

When importing with sidecar metadata (see the Importer and Metadata section), the default importer works in two steps:

  1. It creates a document with the title (and the file when importing the leaf)
  2. It applies the metadata

This means that listeners handling the about to create and document created events triggered during step 1 will see the metadata fields as null (or set to their default values, if any). The metadata values are only set at the second step, with the before document modification and/or document modified events.

Also, the importer triggers the documentImportedWithPlatformImporter event once the document has been imported and fully set up. This event is a good place to set up related fields or behaviors, because all the data has been set by then.

If your configuration has listeners handling the about to create and document created events, be careful, for example, when testing whether a field is null in these events. Depending on the context (creation by the importer versus creation in the UI, for example), a null value may or may not be normal.
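The ordering described in this section can be pictured with a small, dependency-free model. This is a toy replay, not Nuxeo code: the class ImportEventOrder is hypothetical, a document is reduced to a map of properties, and the first four event names are assumed to be the Nuxeo identifiers behind "about to create", "document created", "before document modification" and "document modified". Only documentImportedWithPlatformImporter is taken verbatim from this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Toy model of the two-step import; a real listener would use the Nuxeo event API.
final class ImportEventOrder {

    /** Replays the event sequence for one document, calling the listener at each event. */
    static void simulateImport(String title, Map<String, String> sidecar,
            BiConsumer<String, Map<String, String>> listener) {
        Map<String, String> doc = new HashMap<>();
        doc.put("dc:title", title);
        // Step 1: the document is created with its title only.
        listener.accept("aboutToCreate", doc);
        listener.accept("documentCreated", doc);
        // Step 2: the sidecar metadata is applied.
        doc.putAll(sidecar);
        listener.accept("beforeDocumentModification", doc);
        listener.accept("documentModified", doc);
        // Final event: everything is set.
        listener.accept("documentImportedWithPlatformImporter", doc);
    }

    /** Records what a listener reading dc:source would observe at each event. */
    static List<String> sourceSeenAtEachEvent(Map<String, String> sidecar) {
        List<String> seen = new ArrayList<>();
        simulateImport("hello.pdf", sidecar,
                (event, doc) -> seen.add(event + ":" + doc.get("dc:source")));
        return seen;
    }
}
```

In this model a listener that rejects a null dc:source at creation time would fail for every imported document, which is why such checks are better placed on the final event.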

Extend

You can easily write your own importer, extending the org.nuxeo.ecm.platform.importer.base.GenericMultiThreadedImporter class.

Using XML extension points you can also define the different building blocks of the importer:

  • class for reading source nodes
  • docType used for leaf Documents
  • docType used for folderish Documents
  • documentModelFactoryClass

See nuxeo-platform-importer Javadoc.

Directory Tree and Threading

The default importer targets a simple use case: importing a complete filesystem tree into a Nuxeo repository.

On most computers you have several CPUs and several cores: this means you can import more documents per second by using several threads. However, when importing a tree, threading must be considered carefully:

  • Each thread will be associated with a Transaction (remember we import several documents before doing a commit)
  • Each transaction is isolated from others (MVCC mode)

This means that a new thread should be created only when a new branch becomes accessible in the source filesystem. At least, this is what the default ImporterThreadingPolicy (DefaultMultiThreadingPolicy) does.

As a result, if you import a big folder with a flat structure, you will only have one importer thread, even if the configuration allows more.

To make sure you can leverage multi-threading, you can either:

  • Ensure the source filesystem is a tree with at least two levels
  • Change the importer threading policy.
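Before launching a large import, it can help to check whether the source is flat. The sketch below is a rough pre-flight check under stated assumptions, not the logic of DefaultMultiThreadingPolicy: the class SourceTreeShape is hypothetical, and it works on a list of paths relative to the import root, using "/" as separator.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical pre-flight check, not part of nuxeo-platform-importer.
final class SourceTreeShape {

    /** Returns the distinct top-level folders found among the relative paths. */
    static Set<String> topLevelFolders(List<String> relativePaths) {
        Set<String> folders = new TreeSet<>();
        for (String path : relativePaths) {
            int slash = path.indexOf('/');
            if (slash > 0) {
                folders.add(path.substring(0, slash));
            }
        }
        return folders;
    }

    /** A flat source has no sub-folder, so there is no branch to hand to another thread. */
    static boolean isFlat(List<String> relativePaths) {
        return topLevelFolders(relativePaths).isEmpty();
    }
}
```

A source for which isFlat returns true is a candidate for restructuring into sub-folders or for a different threading policy.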

Importer and Metadata

The default importer provides three classes to read the source files as well as metadata:

FileWithMetadataSourceNode

This is the default implementation. It mainly targets the import of a filesystem where files are organized in folders.

The idea is to associate a set of metadata on a per-folder basis: the metadata.properties file defines the metadata for all files inside the same folder. By default, metadata is inherited from the parent folder, but it can be completed or overridden by a local metadata.properties file.

Here is a sample structure:

├── TopicA
│   ├── file1.pdf
│   ├── file2.pdf
│   ├── metadata.properties
│   ├── TopicA1
│   │   ├── file1.pdf
│   │   ├── file2.pdf
│   │   └── metadata.properties
│   ├── TopicA2
│   │   ├── file1.pdf
│   │   ├── file2.pdf
│   │   └── metadata.properties
│   └── TopicA3
│       ├── file1.pdf
│       ├── file2.pdf
│       └── metadata.properties
└── TopicB
    ├── file1.pdf
    ├── metadata.properties
    ├── TopicB1
    │   ├── file1.pdf
    │   ├── file2.pdf
    │   └── metadata.properties
    └── TopicB12
        ├── file1.pdf
        ├── file2.pdf
        └── metadata.properties

The metadata.properties file is a simple property file in the format xpath = value. Typically:

dc\:description=some description
dc\:source=some source
dc\:subjects=subject4|subject5
dc\:issued=2015-04-30T09:39:43.00Z

Please note that:

  • Date properties must be formatted using the ISO 8601 standard
  • Multi-valued property syntax is dc\:subjects=subject4|subject5, the default separator being |
  • A multi-valued property must contain the separator (| by default for lists) for the value to be interpreted correctly, even when it holds a single value. For example: dc\:subjects=subject4|
  • Complex properties are currently not supported
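The format and the inheritance rule can be explored with the standard java.util.Properties parser. This is a minimal sketch, not the importer's own parsing code; the class MetadataProperties and its methods are hypothetical. It shows why the colon is escaped in the file (an unescaped colon would be read as the key/value separator), how a local file overrides inherited values, and how a value is split on the default | separator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration of the metadata.properties format.
final class MetadataProperties {

    /** Parses properties content; "dc\:source=x" yields the key "dc:source". */
    static Map<String, String> parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : props.stringPropertyNames()) {
            result.put(name, props.getProperty(name));
        }
        return result;
    }

    /** Parent values are inherited; local values complete or override them. */
    static Map<String, String> inherit(Map<String, String> parent, Map<String, String> local) {
        Map<String, String> merged = new LinkedHashMap<>(parent);
        merged.putAll(local);
        return merged;
    }

    /** Splits a multi-valued property on "|"; a trailing separator adds no empty value. */
    static List<String> values(String raw) {
        return Arrays.asList(raw.split("\\|"));
    }
}
```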

You can use the ecm:primaryType field to tell the importer to create a specific document type. In the following example, the importer will create a DesignArt custom document type for each file in the folder (da is the schema's prefix):

ecm\:primaryType=DesignArt
dc\:description=Created by the bulk-importer
da\:batch_import_id=123456
da\:author=John Doe

If the ecm:primaryType field is not found, the leafType is used.

FileWithIndividualMetadasSourceNode

This second implementation tries to find a property file for each imported file. This makes it possible to have a per-file metadata set.

A sample structure would be:

├── branch1
│   ├── branch11
│   │   ├── hello11.pdf
│   │   └── hello11.properties
│   ├── hello1.pdf
│   └── hello1.properties
├── hello.pdf
└── hello.properties

The format of this .properties file is the same as the one described above. If you use the ecm:primaryType field, you will be able to create a specific document type for each file.
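When preparing such a source tree, each file needs a matching .properties file. The helper below derives the sidecar name by swapping the extension, a rule inferred from the sample structure above (hello.pdf and hello.properties); the exact lookup performed by the source node class may differ, and the class name SidecarNames is hypothetical.

```java
// Hypothetical helper for preparing a source tree; the naming rule is inferred from the sample.
final class SidecarNames {

    /** Returns the expected sidecar file name, for example "hello.properties" for "hello.pdf". */
    static String sidecarFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No extension, or a leading dot only: append the suffix to the whole name.
        String base = dot > 0 ? fileName.substring(0, dot) : fileName;
        return base + ".properties";
    }

    /** True for the metadata files themselves, which are not documents to import. */
    static boolean isSidecar(String fileName) {
        return fileName.endsWith(".properties");
    }
}
```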

To use this node type you need to redefine the importer. There are two ways to do so:

  • Add an XML extension in your Nuxeo Studio project with the following content:

     <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
    
    <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent" point="importerConfiguration">
       <importerConfig sourceNodeClass ="org.nuxeo.ecm.platform.importer.source.FileWithIndividualMetadasSourceNode" >
          <documentModelFactory leafType="File" folderishType="Folder" documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
        </importerConfig>
    </extension>
    
  • Create an importer-config.xml with the following content in nxserver/config:

    <?xml version="1.0"?>
    <component name="customImporter">
    <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
    
    <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent" point="importerConfiguration">
       <importerConfig sourceNodeClass ="org.nuxeo.ecm.platform.importer.source.FileWithIndividualMetadasSourceNode" >
          <documentModelFactory leafType="File" folderishType="Folder" documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
        </importerConfig>
    </extension>
    </component>
    

    You can name this file whatever you want, as long as the suffix is -config.xml. See the page Runtime and Component Model.

FileWithNonHeritedIndividalMetaDataSourceNode

This implementation provides a per-file metadata set. It differs from FileWithIndividualMetadasSourceNode in that it does not provide inheritance: properties defined for a folderish document are not propagated to its children. If you do not need this inheritance for your import, using this class can improve performance, because it avoids the structure shared between import threads and the overhead of synchronizing information between them.

The structure used by this class is identical to what FileWithIndividualMetadasSourceNode uses:

├── branch1
│   ├── branch11
│   │   ├── hello11.pdf
│   │   └── hello11.properties
│   ├── hello1.pdf
│   └── hello1.properties
├── hello.pdf
└── hello.properties

To enable this node type, redefine the importer using one of these methods:

  • Add an XML extension in your Nuxeo Studio project with the following content:

    <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
    
    <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent" point="importerConfiguration">
       <importerConfig sourceNodeClass ="org.nuxeo.ecm.platform.importer.source.FileWithNonHeritedIndividalMetaDataSourceNode" >
          <documentModelFactory leafType="File" folderishType="Folder" documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
        </importerConfig>
    </extension>
    
  • Create an importer-config.xml with the following content in nxserver/config:

    <?xml version="1.0"?>
    <component name="customImporter">
    <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
    
    <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent" point="importerConfiguration">
       <importerConfig sourceNodeClass ="org.nuxeo.ecm.platform.importer.source.FileWithNonHeritedIndividalMetaDataSourceNode" >
          <documentModelFactory leafType="File" folderishType="Folder" documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
        </importerConfig>
    </extension>
    </component>
    

    You can name this file whatever you want, as long as the suffix is -config.xml. See the page Runtime and Component Model.

Instantiating the Importer

The importer is a configurable framework whose main part is the org.nuxeo.ecm.platform.importer.base.GenericMultiThreadedImporter class. This importer performs the import according to the way it is configured, and its configuration starts when it is instantiated.

To instantiate an importer, you need to provide:

  • a source node, which is the entry point of what will be imported,
  • a path to where the import should be made on the current repository,
  • parameters that will control the maximum number of threads that will be created during the import,
  • a logger that will be used during the import (a default one, which is provided by the module, can be used).

If you need audit support for the import, you can obtain it by providing a 'jobName', which is used to identify the import in the audit. The audit support can then be used to avoid running the same import again later (if the import finished successfully).

Here is an example of how such an importer can be instantiated:

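// TestSourceNode is a source node implementation; its constructor arguments are elided here.
TestSourceNode sourceNode = new TestSourceNode(...);
// Arguments: source node, target path in the repository, batch size, number of threads, logger.
GenericMultiThreadedImporter importer = new GenericMultiThreadedImporter(
    sourceNode, "/", 10, 5, super.getLogger());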

Configuring the Importer

After instantiation, an importer can be configured by providing it with the 'tools' that are used during the import.

factory

One of these tools is the 'factory', which is used when importing a document. A factory is usually expected to handle both cases: importing a folderish document and importing a leaf document. The org.nuxeo.ecm.platform.importer.factories.ImporterDocumentModelFactory interface is provided for this purpose.

filter

Another tool is the 'filter'. More than one filter can be provided to an importer, and their purpose is to handle the events raised during the import. Importing a document amounts to creating a Nuxeo document model and saving properties on it, which often raises events. To increase the performance of the import, it is usually better to block all the events raised during and after the import of a document.

Note that the events are blocked for the whole system, so this feature should be used during a mass import while the system is not yet in production, for example.

Also, filters cannot be configured via an XML extension, but they can be used in your own code extending the importer (you can find an example in the code of the random importer).

Thread Policy

The last tool that can be provided to an importer is the thread policy. If no thread policy is specified, the default multi-threaded one is used (provided by the org.nuxeo.ecm.platform.importer.threading.DefaultMultiThreadingPolicy class).

Here is an example of how such tools can be provided to an instantiated importer:

TestDocumentModelFactory documentModelFactory = new TestDocumentModelFactory(...);
importer.setFactory(documentModelFactory);
if (useMultiThread) {
    importer.setThreadPolicy(super.getThreadPolicy());
} else {
    importer.setThreadPolicy(new MonoThreadPolicy());
}
ImporterFilter filter = new TestImporterFilter(true,
    true, true, false);
importer.addFilter(filter);

Usually, such an importer should be instantiated and configured in an instance method of a class that extends the org.nuxeo.ecm.platform.importer.executor.AbstractImporterExecutor class. In this method, once the importer is instantiated and configured, call the superclass method that starts the import:

super.doRun(importer, Boolean.TRUE);

The second parameter specifies whether the import runs synchronously or asynchronously.

This class is the base class for the import. To run the import, call the method that instantiates, configures, and starts the importer.

Download

To download nuxeo-platform-importer, check the Nuxeo Marketplace or, if needed, download a more recent version of the JAR (to be installed by hand) from the Nuxeo Maven repository.