Nuxeo Bulk Document Importer

Updated: July 17, 2023

The nuxeo-importer-core module supports multi-threaded import into a Nuxeo repository.

Usage

The file importer comes as a Java library (with the Nuxeo Runtime Service) and a sample JAX-RS interface to launch, monitor and abort import jobs.

Quick Start

To import the folder /path/to/import into the workspace /default-domain/workspaces/some-workspace while monitoring the import logs from a REST client, use the following HTTP GET queries:

  • GET http://NUXEO_SERVER/nuxeo/site/fileImporter/logActivate
  • GET http://NUXEO_SERVER/nuxeo/site/fileImporter/run?targetPath=/default-domain/workspaces/some-workspace&inputPath=/path/to/import&batchSize=10&interactive=false&nbThreads
  • GET http://NUXEO_SERVER/nuxeo/site/fileImporter/log

A basic user interface is available when you open http://NUXEO_SERVER/nuxeo/site/fileImporter in a browser.

To execute these HTTP queries, you can either use a browser with an active Nuxeo session (JSESSIONID cookie) or a third-party stateless HTTP client with HTTP Basic Authentication. Here is an example with the curl command line client:

$ curl --basic -u 'Administrator:Administrator' "http://localhost:8080/nuxeo/site/fileImporter/log"

Don't forget to put the URL in quotes if it includes special shell characters such as &. You can also use the generic HTTP GUI client from the rest-client Java project: https://github.com/wiztools/rest-client. Be sure to fill in the Auth tab with your user credentials.
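The same authenticated call can also be prepared from Java. The following is a minimal sketch, assuming Java 11 or later and the standard java.net.http API; the class and method names (ImporterLogRequest, basicAuth, logRequest) are hypothetical and not part of the importer module. It only builds the request; sending it requires a running server.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of nuxeo-importer-core.
final class ImporterLogRequest {

    /** Builds the value of the Authorization header for HTTP Basic Authentication. */
    static String basicAuth(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds (but does not send) a GET request for the importer log endpoint. */
    static HttpRequest logRequest(String serverBaseUrl, String user, String password) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverBaseUrl + "/nuxeo/site/fileImporter/log"))
                .header("Authorization", basicAuth(user, password))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = logRequest("http://localhost:8080", "Administrator", "Administrator");
        // To actually send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).body()
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Because the credentials travel in a header rather than in a session cookie, each request is stateless, which matches the curl usage above.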

Memory

The importer requires a lot of memory. Make sure your maximum heap size is set as high as possible for your environment. The maximum heap size can be set in nuxeo.conf in the JAVA_OPTS variable. For example, the argument -Xmx4g sets the maximum heap size to 4 gigabytes. See Configuration Parameters Index (nuxeo.conf) for more details.
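To check which limit a JVM actually applies, the standard Runtime API can report it. This is a minimal sketch; the class name HeapCheck is hypothetical, and it reports the limit of the JVM it runs in, so it only reflects nuxeo.conf when executed inside the Nuxeo server JVM.

```java
// Hypothetical helper, not part of nuxeo-importer-core.
final class HeapCheck {

    /** Returns the maximum amount of memory the current JVM will attempt to use, in MiB. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```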

REST API

| Resource URL | Description | Output |
| --- | --- | --- |
| GET nuxeo/site/randomImporter/run | Random text generator for load testing | text/plain; charset=UTF-8 |
| GET nuxeo/site/fileImporter/run | Default file importer | text/plain; charset=UTF-8 |
| GET nuxeo/site/fileImporter/log | Get current log buffer content | text/plain; charset=UTF-8 |
| GET nuxeo/site/fileImporter/logActivate | Activate logging | text/plain; charset=UTF-8 |
| GET nuxeo/site/fileImporter/logDesactivate | Deactivate logging | text/plain; charset=UTF-8 |
| GET nuxeo/site/fileImporter/status | Get importer thread status | text/plain; charset=UTF-8 ("Running" or "Not Running") |
| GET nuxeo/site/fileImporter/kill | Stop the importer thread if running | text/plain; charset=UTF-8 |
| GET nuxeo/site/fileImporter | Display a user interface for setting the parameters | HTML |

fileImporter/run

| Parameter | Default value | Description |
| --- | --- | --- |
| leafType | null | Leaf type used by the documentModelFactory for the import. |
| folderishType | null | Folderish type used by the documentModelFactory for the import. |
| inputPath | N/A | Root path to import (local to the server). |
| targetPath | N/A | Target path in Nuxeo. |
| skipRootContainerCreation | false | If true, the root container won't be created. |
| batchSize | 5 | Number of documents that will be created before doing a commit. |
| nbThreads | 5 | Maximum number of importer threads that can be allocated. |
| interactive | false | |
| transactionTimeout | 600 | Timeout for the transaction (in seconds). Can be increased, for example when importing very big files. |

N/A: no default value, the parameter is required.
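The parameters above are passed as a URL query string. The following is a minimal sketch of assembling it in Java, assuming Java 11 or later; the class FileImporterRunUrl and its build method are hypothetical helpers, not part of the importer. It rejects a call where one of the two required parameters is missing and percent-encodes every value, so characters such as / and & cannot break the query.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of nuxeo-importer-core.
final class FileImporterRunUrl {

    /** Builds the fileImporter/run URL; inputPath and targetPath have no default and are required. */
    static String build(String serverBaseUrl, Map<String, String> params) {
        for (String required : new String[] {"inputPath", "targetPath"}) {
            String value = params.get(required);
            if (value == null || value.isEmpty()) {
                throw new IllegalArgumentException("Missing required parameter: " + required);
            }
        }
        // Encode each name and value, then join the pairs with '&'.
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return serverBaseUrl + "/nuxeo/site/fileImporter/run?" + query;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("targetPath", "/default-domain/workspaces/some-workspace");
        params.put("inputPath", "/path/to/import");
        params.put("batchSize", "10");
        System.out.println(build("http://localhost:8080", params));
    }
}
```

Parameters left out of the map are simply not sent, so the server-side defaults listed in the table apply.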

randomImporter/run

| Parameter | Default value | Description |
| --- | --- | --- |
| targetPath | N/A | Target path in Nuxeo. |
| skipRootContainerCreation | | |
| batchSize | | Number of documents that will be created before doing a commit. |
| nbThreads | | Maximum number of importer threads that can be allocated. |
| interactive | | |
| nbNodes | N/A | Number of nodes to create. |
| fileSizeKB | | |
| onlyText | true | |
| blockSyncPostCommitProcessing | | |
| blockAsyncProcessing | | |
| bulkMode | true | |

N/A: no default value, the parameter is required.

Extend

You can easily write your own importer by extending the org.nuxeo.ecm.platform.importer.base.GenericMultiThreadedImporter class.

Using XML extension points you can also define the different building blocks of the importer:

  • class for reading source nodes
  • docType used for leaf Documents
  • docType used for folderish Documents
  • documentModelFactoryClass

See the developer documentation of Nuxeo Bulk Document Importer for details.

See nuxeo-platform-importer Javadoc.

Directory Tree and Threading

The default importer targets a simple use case: importing a complete filesystem tree into a Nuxeo repository.

On most computers you have several CPUs and several cores: this means you can import more documents per second by using several threads. However, when importing a tree, threading must be considered carefully:

  • Each thread will be associated with a Transaction (remember we import several documents before doing a commit)
  • Each transaction is isolated from others (MVCC mode)

This means that a new thread must be created only when a new branch becomes accessible inside the source filesystem. At least, this is what the default ImporterThreadingPolicy (DefaultMultiThreadingPolicy) does.

As a result, if you import a big folder with a flat structure, you will only have one importer thread, even if you configure the importer to allow more.

To make sure multi-threading can be leveraged, you can either:

  • Ensure the source filesystem is a tree with at least two levels
  • Change the importer threading policy

Importer and Metadata 

The default importer provides two classes to read the source files as well as metadata:

FileWithMetadataSourceNode

This is the default implementation. It mainly targets the import of a filesystem where files are organized in folders.

The idea is to associate a set of metadata on a per-folder basis: the metadata.properties file defines the metadata for all files inside the same folder. By default, metadata is inherited from the parent folder, but may be completed or overridden by a local metadata.properties.

Here is a sample structure:

├── TopicA
│   ├── file1.pdf
│   ├── file2.pdf
│   ├── metadata.properties
│   ├── TopicA1
│   │   ├── file1.pdf
│   │   ├── file2.pdf
│   │   └── metadata.properties
│   ├── TopicA2
│   │   ├── file1.pdf
│   │   ├── file2.pdf
│   │   └── metadata.properties
│   └── TopicA3
│       ├── file1.pdf
│       ├── file2.pdf
│       └── metadata.properties
└── TopicB
    ├── file1.pdf
    ├── metadata.properties
    ├── TopicB1
    │   ├── file1.pdf
    │   ├── file2.pdf
    │   └── metadata.properties
    └── TopicB12
        ├── file1.pdf
        ├── file2.pdf
        └── metadata.properties

The metadata.properties file is a simple property file in the format xpath = value. Typically:

dc\:description=some description
dc\:source=some source
dc\:subjects=subject4|subject5
dc\:issued=2015-04-30T09:39:43.00Z

Please note that:

  • Date properties must be formatted using the ISO 8601 standard.
  • The multi-valued property syntax is dc\:subjects=subject4|subject5, the default separator being |.
  • A multi-valued property must contain the separator (| by default for lists) for the value to be interpreted correctly, even with a single value. For example: dc\:subjects=subject4|
  • Complex properties are currently not supported.
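The format rules above can be illustrated with the standard Java Properties and java.time APIs. This is a minimal sketch, not the importer's own parsing code; the class MetadataPropertiesSketch and its methods are hypothetical. It shows that the escaped key dc\:subjects is read back as dc:subjects, how a multi-valued string splits on the default | separator, and that a date with year, month and day in ISO 8601 order parses cleanly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of nuxeo-importer-core.
final class MetadataPropertiesSketch {

    /** Loads metadata.properties content; the "\:" escape yields a plain ':' in the key. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits a multi-valued property on the default '|' separator, dropping empty entries. */
    static List<String> splitValues(String raw) {
        List<String> values = new ArrayList<>();
        for (String part : raw.split("\\|")) {
            if (!part.isEmpty()) {
                values.add(part);
            }
        }
        return values;
    }

    /** Parses an ISO 8601 UTC timestamp; throws DateTimeParseException on an invalid date. */
    static Instant parseDate(String raw) {
        return Instant.parse(raw);
    }

    public static void main(String[] args) {
        Properties props = load("dc\\:subjects=subject4|subject5\n"
                + "dc\\:issued=2015-04-30T09:39:43.00Z\n");
        System.out.println(splitValues(props.getProperty("dc:subjects")));
        System.out.println(parseDate(props.getProperty("dc:issued")));
    }
}
```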

You can use the ecm:primaryType field to tell the importer to create a specific document type. In the following example, the importer will create a DesignArt custom document type for each file in the folder (da is the schema's prefix):

ecm\:primaryType=DesignArt
dc\:description=Created by the bulk-importer
da\:batch_import_id=123456
da\:author=John Doe

If the ecm:primaryType field is not found, the leafType is used.
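The inheritance rule and the leafType fallback can be sketched together as follows. This is an illustration under stated assumptions, not the importer's implementation: the class FolderMetadataSketch and its methods are hypothetical, and the text does not say whether ecm:primaryType itself is inherited from a parent folder, so the sketch simply treats it like any other property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of nuxeo-importer-core.
final class FolderMetadataSketch {

    /** Returns the parent's metadata, completed or overridden by the local metadata.properties content. */
    static Properties inherit(Properties parent, String localText) {
        Properties merged = new Properties();
        merged.putAll(parent);
        try {
            // Keys loaded here replace the inherited values with the same name.
            merged.load(new StringReader(localText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return merged;
    }

    /** Picks ecm:primaryType when present, otherwise falls back to the configured leaf type. */
    static String documentType(Properties metadata, String leafType) {
        return metadata.getProperty("ecm:primaryType", leafType);
    }

    public static void main(String[] args) {
        Properties topicA = inherit(new Properties(),
                "dc\\:source=some source\ndc\\:description=Topic A\n");
        Properties topicA1 = inherit(topicA,
                "dc\\:description=Topic A1\necm\\:primaryType=DesignArt\n");
        System.out.println(topicA1.getProperty("dc:source"));      // inherited from TopicA
        System.out.println(topicA1.getProperty("dc:description")); // overridden locally
        System.out.println(documentType(topicA1, "File"));
        System.out.println(documentType(topicA, "File"));
    }
}
```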

FileWithIndividualMetadasSourceNode

This second implementation will try to find a property file for each imported file. This allows you to define a separate metadata set for each file.

A sample structure would be:

├── branch1
│   ├── branch11
│   │   ├── hello11.pdf
│   │   └── hello11.properties
│   ├── hello1.pdf
│   └── hello1.properties
├── hello.pdf
└── hello.properties

The format of this .properties file is the same as the one described above. If you use the ecm:primaryType field, you will be able to create a specific document type for each file.
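The pairing between a file and its property file can be sketched as follows. The naming rule used here (same folder, same base name, .properties extension) is inferred from the sample structure above rather than taken from the importer's source code, and the class SidecarMetadataSketch is hypothetical. The sketch assumes Java 11 or later.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of nuxeo-importer-core.
final class SidecarMetadataSketch {

    /** Maps hello.pdf to hello.properties in the same folder, following the sample tree. */
    static Path sidecarFor(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension (or only a leading dot).
        String base = dot > 0 ? name.substring(0, dot) : name;
        return file.resolveSibling(base + ".properties");
    }

    /** Returns the property file path only if it exists on disk. */
    static Optional<Path> findSidecar(Path file) {
        Path candidate = sidecarFor(file);
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(sidecarFor(Path.of("branch1", "branch11", "hello11.pdf")));
    }
}
```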

To use this node type you need to redefine the importer. There are two ways to do so:

  • Add an XML extension in your Nuxeo Studio project with the following content:

    <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
    
    <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent" point="importerConfiguration">
      <importerConfig sourceNodeClass="org.nuxeo.ecm.platform.importer.source.FileWithIndividualMetadasSourceNode">
        <documentModelFactory leafType="File" folderishType="Folder" documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
      </importerConfig>
    </extension>
    
  • Create an importer-config.xml with the following content in nxserver/config:

    <?xml version="1.0"?>
    <component name="customImporter">
      <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>

      <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent" point="importerConfiguration">
        <importerConfig sourceNodeClass="org.nuxeo.ecm.platform.importer.source.FileWithIndividualMetadasSourceNode">
          <documentModelFactory leafType="File" folderishType="Folder" documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
        </importerConfig>
      </extension>
    </component>
    

    You can name this file whatever you want, as long as the suffix is "-config.xml". See the page Runtime and Component Model.

Instantiating the Importer

The module provides a configurable framework whose main part is the org.nuxeo.ecm.platform.importer.base.GenericMultiThreadedImporter class. Depending on how it is configured, this 'importer' is responsible for performing the import. The configuration of an 'importer' starts with its instantiation.

You need to provide:

  • a source node for the import, which should contain the entry point of what will be imported,
  • a path to where the import should be made on the current repository,
  • parameters that control the maximum number of threads created during the import,
  • a logger that will be used during the import (a default one, provided by the module, can be used).

If you need audit support for the import, you can obtain it by providing a 'jobName', which is used to represent the import workflow in the audit. The audit support can then be used to avoid running the same import again later, in case it already finished successfully.

Here is an example of how such an importer can be instantiated:

TestSourceNode sourceNode = new TestSourceNode(...);
GenericMultiThreadedImporter importer = new GenericMultiThreadedImporter(
    sourceNode, "/", 10, 5, super.getLogger());

Configuring the Importer

After instantiation, an 'importer' can be configured by providing it with the 'tools' that are used during the import.

factory

One of these 'tools' is the 'factory', which is used when performing the import of a document. A 'factory' is usually expected to handle both cases: importing a folderish document and importing a leaf document. The org.nuxeo.ecm.platform.importer.factories.ImporterDocumentModelFactory interface is provided for this purpose.

filter

Another 'tool' is the 'filter'. More than one 'filter' can be added to an 'importer', and their purpose is to handle the events raised during the import. Importing a document amounts to creating a Nuxeo document model and saving properties on it, which often raises events. To increase the performance of the import, it is usually better to block all the events raised during and after the import of a document.

Thread Policy

The last 'tool' that can be provided to an 'importer' is the thread policy. If no thread policy is specified, the default multi-threaded one is used (provided by the org.nuxeo.ecm.platform.importer.threading.DefaultMultiThreadingPolicy class).

Here is an example of how such tools can be provided to an instantiated importer:

TestDocumentModelFactory documentModelFactory = new TestDocumentModelFactory(...);
importer.setFactory(documentModelFactory);
if (useMultiThread) {
    importer.setThreadPolicy(super.getThreadPolicy());
} else {
    importer.setThreadPolicy(new MonoThreadPolicy());
}
ImporterFilter filter = new TestImporterFilter(true,
    true, true, false);
importer.addFilter(filter);

Usually such an 'importer' should be instantiated and configured in an instance method of a class that extends the org.nuxeo.ecm.platform.importer.executor.AbstractImporterExecutor class. Once the importer is instantiated and configured, this instance method should call the following superclass method, which starts the import:

super.doRun(importer, Boolean.TRUE);

The second parameter specifies whether the import should start synchronously or asynchronously.

This class will be the base class for the import, and its method that instantiates, configures and starts the import should be called.

Download

To download nuxeo-importer-core, check the Nuxeo Marketplace or, if needed, download a more recent version of the JAR (to be installed by hand) from the Nuxeo Maven repository.