Available for any Nuxeo platform-based application, the Bulk document importer package enables mass document import in a Nuxeo repository. A single HTTP query launches a full, multi-threaded import from the server file system.
This module is designed to offer support for multi-threaded import on a Nuxeo repository.
Usage
The file importer comes as a Java library (with the Nuxeo Runtime Service) and a sample JAX-RS interface to launch, monitor and abort import jobs.
Quick Start
To import the folder /path/to/import into the workspace /default-domain/workspaces/some-workspace while monitoring the import logs from a REST client, use the following HTTP GET queries:
GET http://NUXEO_SERVER/nuxeo/site/fileImporter/logActivate
GET http://NUXEO_SERVER/nuxeo/site/fileImporter/run?targetPath=/default-domain/workspaces/some-workspace&inputPath=/path/to/import&batchSize=10&interactive=false&nbThreads
GET http://NUXEO_SERVER/nuxeo/site/fileImporter/log
Alternatively, open http://NUXEO_SERVER/nuxeo/site/fileImporter in a browser to set the import parameters through a user interface.
To execute these HTTP queries you can either use a browser with an active Nuxeo session (JSESSIONID cookie) or use a third party stateless HTTP client with HTTP Basic Authentication. This is an example with the curl command line client:
$ curl --basic -u 'Administrator:Administrator' "http://localhost:8080/nuxeo/site/fileImporter/log"
Don't forget to put the URL in quotes if it includes special shell characters such as &. You can also use the generic HTTP GUI client from the rest-client Java project: https://github.com/wiztools/rest-client.
Be sure to fill in the Auth tab with your user credentials.
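For scripted use, the same authenticated GET can also be issued from Java. The sketch below mirrors the curl example with the JDK's java.net.http client (Java 11 or later). The class and method names are illustrative and are not part of the importer; the server URL and credentials are the ones from the curl example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: not part of nuxeo-platform-importer.
final class ImporterLogClient {

    // Builds the Authorization header value for HTTP Basic Authentication.
    static String basicAuthHeader(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Prepares GET <baseUrl>/nuxeo/site/fileImporter/log without sending it.
    static HttpRequest logRequest(String baseUrl, String user, String password) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/nuxeo/site/fileImporter/log"))
                .header("Authorization", basicAuthHeader(user, password))
                .GET()
                .build();
    }

    // Sends the request and returns the plain-text log buffer.
    static String fetchLog(String baseUrl, String user, String password)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(logRequest(baseUrl, user, password),
                        HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With a server running locally, ImporterLogClient.fetchLog("http://localhost:8080", "Administrator", "Administrator") would return the same text as the curl command.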
Memory
The importer requires a lot of memory. Make sure your maximum heap size is set as high as possible for your environment. Maximum heap size can be set in nuxeo.conf in the JAVA_OPTS variable. For example, the argument -Xmx4g sets the maximum heap size to 4 gigabytes. See Configuration Parameters Index (nuxeo.conf) for more details.
REST API
Resource URL | Description | Output |
---|---|---|
GET nuxeo/site/randomImporter/run | Random text generator for load testing | text/plain; charset=UTF-8 |
GET nuxeo/site/fileImporter/run | Default file importer | text/plain; charset=UTF-8 |
GET nuxeo/site/fileImporter/log | Get current log buffer content | text/plain; charset=UTF-8 |
GET nuxeo/site/fileImporter/logActivate | Activate logging | text/plain; charset=UTF-8 |
GET nuxeo/site/fileImporter/logDesactivate | Deactivate logging | text/plain; charset=UTF-8 |
GET nuxeo/site/fileImporter/status | Get importer thread status | text/plain; charset=UTF-8 ("Running" or "Not Running") |
GET nuxeo/site/fileImporter/kill | Stop the importer thread if running | text/plain; charset=UTF-8 |
GET nuxeo/site/fileImporter | Displays a user interface letting the user set the parameters | (html) |
fileImporter/run
Parameter | Default value | Description |
---|---|---|
leafType | null | Leaf type used by the documentModelFactory for the import. |
folderishType | null | Folderish type used by the documentModelFactory for the import. |
inputPath | N/A | Root path to import (local to the server). |
targetPath | N/A | Target path in Nuxeo |
skipRootContainerCreation | false | If true the root container won't be created |
batchSize | 5 | Number of documents that will be created before doing a commit |
nbThreads | 5 | Maximum number of importer threads that can be allocated |
interactive | false | |
transactionTimeout | 600 | Timeout for the transaction (in seconds). Can be increased when importing very big files for example |
N/A: no default value, the parameter is required.
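When the run query is assembled programmatically, the parameters need URL encoding and the two required ones should be checked up front. The following is a minimal sketch using only the JDK (Java 10 or later for the Charset overload of URLEncoder.encode). The class name, the server address and the sample values are assumptions for illustration; only the parameter names come from the table.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: not part of nuxeo-platform-importer.
final class FileImporterRunUrl {

    // Parameters listed with N/A as default value, i.e. required.
    private static final String[] REQUIRED = {"inputPath", "targetPath"};

    // Returns the full fileImporter/run URL with an encoded query string.
    static String build(String serverBase, Map<String, String> params) {
        for (String name : REQUIRED) {
            String value = params.get(name);
            if (value == null || value.isEmpty()) {
                throw new IllegalArgumentException("Missing required parameter: " + name);
            }
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return serverBase + "/nuxeo/site/fileImporter/run?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("inputPath", "/path/to/import");
        params.put("targetPath", "/default-domain/workspaces/some-workspace");
        params.put("batchSize", "10");
        System.out.println(build("http://localhost:8080", params));
    }
}
```

Optional parameters that are left out of the map are simply omitted from the query, so the server-side defaults from the table apply.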
With the following contribution you can configure the importer to work in non-bulk mode, which is a bit slower but allows regular Work instances to be created and directed to specific queues (see NXP-19573 for details):
<extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent"
point="importerConfiguration">
<importerConfig>
<bulkMode>false</bulkMode>
<documentModelFactory/>
</importerConfig>
</extension>
randomImporter/run
Parameter | Default value | Description |
---|---|---|
targetPath | N/A | Target path in Nuxeo |
skipRootContainerCreation | | |
batchSize | | Number of documents that will be created before doing a commit |
nbThreads | | Maximum number of importer threads that can be allocated |
interactive | | |
nbNodes | N/A | Number of nodes to create |
fileSizeKB | | |
onlyText | true | |
blockSyncPostCommitProcessing | | |
blockAsyncProcessing | | |
bulkMode | true | |
blockIndexing | false | When indexing is blocked, the import will be faster and the reindexing can be done after the mass import. |
nonUniform | false | Allows a non-uniform distribution of the number of nodes per folder |
N/A: no default value, the parameter is required.
Listeners/Event Handlers
When importing with sidecar metadata (see the Importer and Metadata section), the default importer works in two steps:
- It creates a document with the title (and the file when importing the leaf)
- It applies the metadata
This means that when the about to create and document created events are triggered during step 1, the metadata fields are still null (or set to their default values, if any). Only at the second step, with the before document modification and/or document modified events, are the metadata values set.
Also, the importer triggers the documentImportedWithPlatformImporter event once the document has been imported and fully set up. This event is a good place to set up related fields or behaviors while being certain that all the data has been set.
If your configuration has listeners handling the about to create and document created events, be careful, for example when testing whether a field is null in these handlers. Depending on the context (creation by the importer versus creation in the UI, for example), a null field value may or may not be normal.
Extend
You can easily write your own importer by extending the org.nuxeo.ecm.platform.importer.base.GenericMultiThreadedImporter class.
Using XML extension points you can also define the different building blocks of the importer:
- class for reading source nodes
- docType used for leaf Documents
- docType used for folderish Documents
- documentModelFactoryClass
See nuxeo-platform-importer Javadoc.
Directory Tree and Threading
The default importer is targeting a simple use case: import a complete filesystem tree inside a Nuxeo repository.
On most computers you have several CPUs and several cores: this means you can import more documents per second by using several threads. However, when importing a tree, threading must be considered carefully:
- Each thread will be associated with a Transaction (remember we import several documents before doing a commit)
- Each transaction is isolated from others (MVCC mode)
This means that a new thread must be created only when a new branch becomes accessible inside the source filesystem. At least, this is what the default ImporterThreadingPolicy (DefaultMultiThreadingPolicy) does.
As a result, if you import a big folder with a flat structure, you will only have one importer thread, even if you configure the importer to allow more.
To make sure you can leverage multi-threading, you can either:
- Ensure the source filesystem is a tree with at least two levels
- Change the importer threading policy.
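Before launching a large import, it can be useful to check whether the source folder is completely flat, since that case ends up on a single importer thread with the default policy. The sketch below is a hypothetical pre-check written with java.nio.file only; it is not part of the importer and only detects the fully flat case, it does not predict how many threads will actually be used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical pre-check: not part of nuxeo-platform-importer.
final class SourceTreeCheck {

    // Counts the folders found anywhere below root (root itself excluded).
    static long countSubFolders(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(p -> !p.equals(root))
                    .filter(Files::isDirectory)
                    .count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A flat source contains files only, so it offers no branch for extra threads.
    static boolean isFlat(Path root) {
        return countSubFolders(root) == 0;
    }
}
```

If isFlat returns true for the folder passed as inputPath, consider reorganizing the source into sub-folders or changing the threading policy, as listed above.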
Importer and Metadata
The default importer provides three classes to read the source files as well as metadata:
FileWithMetadataSourceNode
This is the default implementation. It mainly targets the import of a filesystem where files are organized in folders.
The idea is to associate a set of metadata on a per-folder basis: the metadata.properties file defines the metadata for all files inside the same folder. By default, metadata is inherited from the parent folder, but may be completed or overridden by a local metadata.properties.
Here is a structure:
├── TopicA
│   ├── file1.pdf
│   ├── file2.pdf
│   ├── metadata.properties
│   ├── TopicA1
│   │   ├── file1.pdf
│   │   ├── file2.pdf
│   │   └── metadata.properties
│   ├── TopicA2
│   │   ├── file1.pdf
│   │   ├── file2.pdf
│   │   └── metadata.properties
│   └── TopicA3
│       ├── file1.pdf
│       ├── file2.pdf
│       └── metadata.properties
└── TopicB
    ├── file1.pdf
    ├── metadata.properties
    ├── TopicB1
    │   ├── file1.pdf
    │   ├── file2.pdf
    │   └── metadata.properties
    └── TopicB12
        ├── file1.pdf
        ├── file2.pdf
        └── metadata.properties
The metadata.properties file is a simple property file in the format xpath = value. Typically:
dc\:description=some description
dc\:source=some source
dc\:subjects=subject4|subject5
dc\:issued=2015-04-30T09:39:43.00Z
Please note that:
- Date properties must be formatted using the ISO 8601 standard
- Multi-valued property syntax is dc\:subjects=subject4|subject5, the default separator being |
- A multi-valued property must embed the corresponding separator (| by default for lists) for the value to be interpreted correctly. For example, with a single value: dc\:subjects=subject4|
- Complex properties are currently not supported
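To check a metadata.properties file before an import, the format rules above can be exercised with the JDK alone. This is a minimal sketch, not the importer's own parsing code: the class and method names are hypothetical, and Instant.parse is used as a strict stand-in for ISO 8601 validation (the importer itself may accept other ISO 8601 forms).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical validation helper: not part of nuxeo-platform-importer.
final class MetadataPropertiesDemo {

    // Loads the content; the escaped key dc\:subjects is read back as dc:subjects.
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a multi-valued property on the default | separator, dropping empty items.
    static List<String> multiValue(String raw) {
        return Arrays.stream(raw.split("\\|"))
                .filter(item -> !item.isEmpty())
                .collect(Collectors.toList());
    }

    // Throws DateTimeParseException when the value is not a valid ISO 8601 instant.
    static Instant isoDate(String raw) {
        return Instant.parse(raw);
    }
}
```

For instance, isoDate accepts 2015-04-30T09:39:43.00Z but rejects a value whose month and day are swapped, and multiValue returns a one-element list for the single-value form subject4|.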
You can use the ecm:primaryType field to tell the importer to create a specific document type. In the following example, the importer will create a DesignArt custom document type for each file in the folder (da is the schema's prefix):
ecm\:primaryType=DesignArt
dc\:description=Created by the bulk-importer
da\:batch_import_id=123456
da\:author=John Doe
If the ecm:primaryType field is not found, the leafType is used.
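The per-folder inheritance described above (a child folder completes or overrides what its parent defines) can be pictured with the defaults mechanism of java.util.Properties. This is only an illustration of the behavior under that assumption, not the importer's implementation; the class name and the sample values are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration: not part of nuxeo-platform-importer.
final class FolderMetadataInheritance {

    // Loads a folder's metadata.properties content on top of its parent's metadata.
    // Keys missing locally fall back to the parent; local keys override it.
    static Properties load(String content, Properties parent) {
        Properties props = new Properties(parent);
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // TopicA defines two properties; TopicA1 overrides only the description.
        Properties topicA = load("dc\\:source=archive\ndc\\:description=Topic A\n", null);
        Properties topicA1 = load("dc\\:description=Topic A1\n", topicA);
        System.out.println(topicA1.getProperty("dc:source"));
        System.out.println(topicA1.getProperty("dc:description"));
    }
}
```

Here TopicA1 still resolves dc:source through its parent, while its own dc:description takes precedence.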
FileWithIndividualMetadasSourceNode
This second implementation will try to find a properties file for each imported file. This makes it possible to have a per-file metadata set.
A sample structure would be:
├── branch1
│   ├── branch11
│   │   ├── hello11.pdf
│   │   └── hello11.properties
│   ├── hello1.pdf
│   └── hello1.properties
├── hello.pdf
└── hello.properties
The format of this .properties file is the same as the one described above. If you use the ecm:primaryType field, you will be able to create a specific document type for each file.
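When preparing the source folder, each file needs its sidecar next to it. Judging from the sample structure (hello.pdf next to hello.properties), the sidecar appears to share the file's base name; the sketch below computes that path under this assumption. The class and method names are hypothetical and not part of the importer.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, assuming the <base name>.properties convention
// suggested by the sample structure.
final class SidecarLocator {

    // Returns the expected sidecar path for a file, in the same folder.
    static Path sidecarFor(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension or the name starts with a dot.
        String base = dot > 0 ? name.substring(0, dot) : name;
        return file.resolveSibling(base + ".properties");
    }

    public static void main(String[] args) {
        System.out.println(sidecarFor(Paths.get("branch1", "hello1.pdf")));
    }
}
```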
To use this node type you need to redefine the importer. There are two ways to do so:
- Add an XML extension in your Nuxeo Studio project with the following content:
<require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
<extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent"
    point="importerConfiguration">
  <importerConfig sourceNodeClass="org.nuxeo.ecm.platform.importer.source.FileWithIndividualMetadasSourceNode">
    <documentModelFactory leafType="File" folderishType="Folder"
        documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
  </importerConfig>
</extension>
- Create an importer-config.xml with the following content in nxserver/config:
<?xml version="1.0"?>
<component name="customImporter">
  <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
  <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent"
      point="importerConfiguration">
    <importerConfig sourceNodeClass="org.nuxeo.ecm.platform.importer.source.FileWithIndividualMetadasSourceNode">
      <documentModelFactory leafType="File" folderishType="Folder"
          documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
    </importerConfig>
  </extension>
</component>
You can name this file whatever you want, as long as the suffix is -config.xml. See the page Runtime and Component Model.
FileWithNonHeritedIndividalMetaDataSourceNode
This implementation will provide a per-file metadata set. It differs from the FileWithIndividualMetadasSourceNode in that it does not provide inheritance – properties defined for a folderish document will not sync to its children. If this inheritance is not necessary for your import, using this class can improve performance, because it avoids the use of a structure shared between import threads, and avoids the overhead generated by the need to sync information between import threads.
The structure used by this class is identical to what FileWithIndividualMetadasSourceNode uses:
├── branch1
│   ├── branch11
│   │   ├── hello11.pdf
│   │   └── hello11.properties
│   ├── hello1.pdf
│   └── hello1.properties
├── hello.pdf
└── hello.properties
To enable this node type, redefine the importer using one of these methods:
- Add an XML extension in your Nuxeo Studio project with the following content:
<require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
<extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent"
    point="importerConfiguration">
  <importerConfig sourceNodeClass="org.nuxeo.ecm.platform.importer.source.FileWithNonHeritedIndividalMetaDataSourceNode">
    <documentModelFactory leafType="File" folderishType="Folder"
        documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
  </importerConfig>
</extension>
- Create an importer-config.xml with the following content in nxserver/config:
<?xml version="1.0"?>
<component name="customImporter">
  <require>org.nuxeo.ecm.platform.importer.service.jaxrs.contrib</require>
  <extension target="org.nuxeo.ecm.platform.importer.service.DefaultImporterComponent"
      point="importerConfiguration">
    <importerConfig sourceNodeClass="org.nuxeo.ecm.platform.importer.source.FileWithNonHeritedIndividalMetaDataSourceNode">
      <documentModelFactory leafType="File" folderishType="Folder"
          documentModelFactoryClass="org.nuxeo.ecm.platform.importer.factories.DefaultDocumentModelFactory" />
    </importerConfig>
  </extension>
</component>
You can name this file whatever you want, as long as the suffix is -config.xml. See the page Runtime and Component Model.
Instantiating the Importer
The module is built as a configurable framework whose main part is the org.nuxeo.ecm.platform.importer.base.GenericMultiThreadedImporter class. This 'importer' performs the import according to the way it is configured, and its configuration starts with its instantiation.
To instantiate it, you need to provide:
- a source node, which is the entry point of what will be imported,
- a path to where the import should be made on the current repository,
- parameters that will control the maximum number of threads that will be created during the import,
- a logger that will be used during the import (a default one, which is provided by the module, can be used).
In case you need to have an audit support for the import, you can obtain one by providing a 'jobName', which will be used to represent the workflow of the import that will be started in audit. The audit support can be used to avoid later imports (in case the import finished with success).
Here is an example of how such an importer can be instantiated:
TestSourceNode sourceNode = new TestSourceNode(...);
GenericMultiThreadedImporter importer = new GenericMultiThreadedImporter(
sourceNode, "/", 10, 5, super.getLogger());
Configuring the Importer
Next, an 'importer' can be configured after instantiation, by providing it 'tools' that are used during the import.
factory
One of these 'tools' is the so-called 'factory', which is used when importing a document. Such a 'factory' is usually expected to handle both cases, importing a folderish document and importing a leaf document (the org.nuxeo.ecm.platform.importer.factories.ImporterDocumentModelFactory interface is provided for this purpose).
filter
Another 'tool' is the 'filter'. More than one 'filter' can be provided to an 'importer', and their purpose is to handle the events raised during the import. To increase import performance, it is usually better to block all the events raised during and after the import of a document (importing a document amounts to creating a Nuxeo document model and saving properties on it, which often raises events).
Note that the events are blocked for the whole system, so this feature should be used during a mass import while the system is not yet in production, for example.
Also, filters cannot be configured via an XML extension, but they can be used in your own code extending the importer (you can find an example in the code of the random importer).
Thread Policy
The last 'tool' that can be provided to an 'importer' is the thread policy to use. If no thread policy is specified, the default multi-threaded one is used (provided by the org.nuxeo.ecm.platform.importer.threading.DefaultMultiThreadingPolicy class).
Here is an example of how such tools can be provided to an instantiated importer.
TestDocumentModelFactory documentModelFactory = new TestDocumentModelFactory(...);
importer.setFactory(documentModelFactory);
if (useMultiThread) {
    importer.setThreadPolicy(super.getThreadPolicy());
} else {
    importer.setThreadPolicy(new MonoThreadPolicy());
}
ImporterFilter filter = new TestImporterFilter(true, true, true, false);
importer.addFilter(filter);
Such an 'importer' is usually instantiated and configured in an instance method of a class that extends the org.nuxeo.ecm.platform.importer.executor.AbstractImporterExecutor class. In this method, once the importer is instantiated and configured, call the following superclass method to start the import:
super.doRun(importer, Boolean.TRUE);
The second parameter specifies whether the import should run synchronously or asynchronously.
This class then serves as the entry point for the import: to launch it, call the method that instantiates, configures and starts the importer.
Download
To download nuxeo-platform-importer, check the Nuxeo Marketplace or, if needed, download a more recent version of the JAR (to be installed by hand) from the Nuxeo Maven repository.