This page provides several configuration use cases for Elasticsearch.
Setting up an Elasticsearch Cluster
Elasticsearch Supported Versions
The Nuxeo Platform communicates with Elasticsearch using the Transport Client Java API. As stated in the Elasticsearch documentation: "You are encouraged to use the same version on client and cluster sides. You may hit some incompatibility issues when mixing major versions".
The Nuxeo Platform 7.3 (and above) uses the Elasticsearch 1.5.2 library and has been successfully tested against clusters running versions 1.1.2 to 1.7.x.
We recommend using the same JVM version for all Elasticsearch nodes and for Nuxeo.
The default configuration uses an embedded Elasticsearch instance that runs in the same JVM as the Nuxeo Platform.
This embedded mode is only for testing purposes and should not be used in production.
For production you need to set up an Elasticsearch cluster.
Installing the Elasticsearch Cluster
Refer to the Elasticsearch documentation to install and secure your cluster. Basically:
- Don’t run Elasticsearch open to the public.
- Don’t run Elasticsearch as root.
- Disable dynamic scripting (disabled by default since 1.2.X).
- Use an explicit cluster name by setting cluster.name in the /etc/elasticsearch/elasticsearch.yml file; this avoids conflicts with other environments.
Recommended Tuning
If you have a large number of documents or if you run Nuxeo in cluster mode, you may reach the limits of the default configuration. Here are some recommended tunings:
Consider disabling OS swapping, or use an Elasticsearch option to prevent the heap from being swapped.
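For example, with the Elasticsearch 1.x series you can ask the JVM to lock its heap in RAM via the bootstrap.mlockall setting (make sure the process is allowed to lock memory, e.g. with ulimit -l unlimited):

```yaml
# /etc/elasticsearch/elasticsearch.yml
# Lock the JVM heap in memory so the OS cannot swap it out
bootstrap.mlockall: true
```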
In the /etc/default/elasticsearch file you can increase the JVM heap to half of the available OS memory:
# For a dedicated node with 12g of RAM
ES_HEAP_SIZE=6g
To prevent indexing errors like:
EsRejectedExecutionException[rejected execution (queue capacity 50)
increase the bulk queue size in the /etc/elasticsearch/elasticsearch.yml configuration file:
threadpool.bulk.queue_size: 500
Configuring Nuxeo to Access the Cluster
Nuxeo manages three Elasticsearch indexes:
- The repository index, used to index document content. This index can be rebuilt from scratch by extracting content from the repository.
- The audit logs index, used to store audit entries. This index is a primary storage and cannot be rebuilt.
- A sequence index, used to serve unique values that can be used as primary keys. This index is also a primary storage.
To connect the Nuxeo Platform instance to the Elasticsearch cluster, check the following options in the nuxeo.conf file and edit them if you need to change the default values:
elasticsearch.addressList=somenode:9300,anothernode:9300
elasticsearch.clusterName=elasticsearch
elasticsearch.indexName=nuxeo
elasticsearch.indexNumberOfReplicas=0
audit.elasticsearch.indexName=${elasticsearch.indexName}-audit
seqgen.elasticsearch.indexName=${elasticsearch.indexName}-uidgen
Where:
- elasticsearch.addressList points to one or many Elasticsearch nodes. Note that Nuxeo connects to the API port 9300, not the HTTP port 9200.
- elasticsearch.clusterName is the name of the cluster to join, elasticsearch being the default cluster name.
- elasticsearch.indexName is the name of the Elasticsearch index for the default document repository.
- elasticsearch.indexNumberOfReplicas is the number of replicas. By default you have 5 shards and 1 replica. If you have a single node in your cluster, you should set indexNumberOfReplicas to 0. Visit the Elasticsearch documentation for more information on shards and replicas.
- audit.elasticsearch.indexName is the name of the Elasticsearch index for audit logs.
- seqgen.elasticsearch.indexName is the name of the Elasticsearch index for the uid sequencer, extensively used for audit logs.
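For example, a single-node setup with Elasticsearch running on the same host could use the following values in nuxeo.conf (the cluster name here is illustrative):

```
# Single-node Elasticsearch on the same host
elasticsearch.addressList=localhost:9300
elasticsearch.clusterName=nuxeo-cluster
elasticsearch.indexNumberOfReplicas=0
```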
You can find all the available options in the nuxeo.defaults.
Disabling Elasticsearch
Elasticsearch is enabled by default. If you want to disable Elasticsearch indexing and search, you can simply add the following option to nuxeo.conf:
elasticsearch.enabled=false
Disabling Elasticsearch for Audit Logs
When Elasticsearch is enabled and the audit.elasticsearch.enabled property is set to true in nuxeo.conf, which is the case by default, Elasticsearch is used as the backend for audit logs.
This improves scalability, especially when using Nuxeo Drive with a large set of users.
When Elasticsearch is used as the backend for audit logs, it becomes the reference storage; the SQL backend is no longer used, as was the case in Nuxeo versions lower than 7.3.
Because of this, make sure you read the Backing Up and Restoring the Audit Elasticsearch Index section.
If you want to disable Elasticsearch and use the SQL database as the default backend for audit logs you can simply update this property in nuxeo.conf
:
audit.elasticsearch.enabled=false
Triggering SQL to Elasticsearch Audit Logs Migration
When upgrading a Nuxeo instance from a version lower than 7.3 to 7.3 or higher, if you decide to use Elasticsearch as the backend for audit logs, you need to add the following property to nuxeo.conf to trigger the migration of existing audit log entries:
audit.elasticsearch.migration=true
This will launch a background job at server startup to migrate data from the nxp_logs, nxp_logs_extinfo and nxp_logs_mapextinfo tables of the SQL database to the ${audit.elasticsearch.indexName} Elasticsearch index.
Migration uses batch processing. The number of log entries processed per batch can be configured by adding the following property to nuxeo.conf
:
audit.elasticsearch.migration.batchSize=5000
The default value is 1000. As an example, we successfully tested the migration of 22,000,000 log entries at an average speed of 1,500 entries per second using audit.elasticsearch.migration.batchSize=10000
on a Linux virtual machine with two cores, 4 GB of RAM, a local PostgreSQL instance and an embedded Elasticsearch instance.
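As a rough sanity check on those figures, integer shell arithmetic gives the expected duration at that rate:

```shell
# Expected duration for 22,000,000 entries at ~1,500 entries/second
seconds=$(( 22000000 / 1500 ))
echo "$seconds seconds"            # 14666 seconds
echo "$(( seconds / 3600 )) hours" # about 4 hours
```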
Once the migration is done you should remove the audit.elasticsearch.migration property from nuxeo.conf, otherwise you will see a warning about it in the logs.
Rebuilding the Repository Index
If you need to reindex the whole repository, you can do this from the Admin > Elasticsearch > Admin tab.
You can fine tune the indexing process using the following options:
Sizing the indexing worker thread pool. The default size is 4; using more threads makes crawling the repository faster:
elasticsearch.indexing.maxThreads=4
Tuning the number of documents per worker and the number of document submitted using the Elasticsearch bulk API:
# Reindexing option, number of documents to process per worker
elasticsearch.reindex.bucketReadSize=500
# Reindexing option, number of documents to submit to Elasticsearch per bulk command
elasticsearch.reindex.bucketWriteSize=50
Changing the Mappings and Settings of Indexes
Updating the Repository Index Configuration
Nuxeo comes with a default mapping that sets the locale for full-text and declares some fields as being date or numeric.
For fields that are not explicitly defined in the mapping, Elasticsearch tries to guess the type the first time it indexes the field; if the field is empty, it is treated as a String field. This is why most of the time you need to explicitly set the mapping for your custom fields of type date, numeric or full-text. Fields that are used for sorting and that could be empty also need to be defined, to prevent an unmapped field error.
The default mapping is located in ${NUXEO_HOME}/templates/common-base/nxserver/config/elasticsearch-config.xml.nxftl.
To override and tune the default mapping:
Create a custom template like myapp with a nuxeo.defaults file that contains:
myapp.target=.
In this custom template create a nxserver/config/elasticsearch-myapp-config.xml.nxftl file and override the mapping contribution:
<component name="org.nuxeo.elasticsearch.myapp">
  <require>org.nuxeo.elasticsearch.defaultConfig</require>
  <extension target="org.nuxeo.elasticsearch.ElasticSearchComponent"
      point="elasticSearchIndex">
    <elasticSearchIndex name="nuxeo" type="doc" repository="default">
      <mapping>
        ... Here copy and adapt the default mapping
      </mapping>
    </elasticSearchIndex>
  </extension>
</component>
Update nuxeo.conf to use your custom template:
nuxeo.templates=default,myapp
Restart and re-index the entire repository from the Admin tab (see previous section).
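The template-creation steps above can be sketched as shell commands; the myapp name is the same illustrative name as above, and NUXEO_HOME is assumed to point to your server installation directory:

```shell
# Sketch of the custom template layout; point NUXEO_HOME at your server directory
NUXEO_HOME=${NUXEO_HOME:-$(mktemp -d)}   # scratch dir fallback for a dry run
mkdir -p "$NUXEO_HOME/templates/myapp/nxserver/config"
echo "myapp.target=." > "$NUXEO_HOME/templates/myapp/nuxeo.defaults"
# elasticsearch-myapp-config.xml.nxftl goes into templates/myapp/nxserver/config/
```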
For mapping customization examples, see the page Configuring the Elasticsearch Mapping.
Updating the Audit Logs Index Configuration
Here the index is a primary storage and you cannot rebuild it, so we need a tool that extracts the _source of the documents from one index and submits it to a new index that has been set up with the new configuration.
- Update the mappings or settings configuration by overriding ${NUXEO_HOME}/templates/common-base/nxserver/config/elasticsearch-audit-index-config.xml (follow the same procedure as in the section above for the repository index).
- Use a new name for audit.elasticsearch.indexName (like nuxeo-audit2).
- Start the Nuxeo Platform. The new index is created with the new mapping.
- Stop the Nuxeo Platform.
- Copy the audit log entries into the new index using stream2es. Here we copy nuxeo-audit to nuxeo-audit2:
curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
./stream2es es --source http://localhost:9200/nuxeo-audit --target http://localhost:9200/nuxeo-audit2 --replace
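To check that the copy is complete, you can compare the document counts of the two indexes with the _count API (assuming Elasticsearch listens on localhost:9200 and the index names used above):

```shell
# Compare document counts between the old and the new audit index
curl 'localhost:9200/nuxeo-audit/_count?pretty'
curl 'localhost:9200/nuxeo-audit2/_count?pretty'
```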
Configuration for Multi Repositories
You need to define an index for each repository. This is done by adding an elasticSearchIndex contribution.
- Create a custom template as described in the above section "Changing the Mappings and Settings of Indexes".
Add a second elasticSearchIndex contribution:
<elasticSearchIndex name="nuxeo-repo2" type="doc" repository="repo2">
....
Where name is the Elasticsearch index name and repository is the repository name.
Investigating and Reporting Problems
Activate Traces
To understand why a document is not present in search results or not indexed, you can activate a debug trace.
Open the lib/log4j.xml
file and uncomment the ELASTIC section:
<appender name="ELASTIC" class="org.apache.log4j.FileAppender">
<errorHandler class="org.apache.log4j.helpers.OnlyOnceErrorHandler" />
<param name="File" value="${nuxeo.log.dir}/elastic.log" />
<param name="Append" value="false" />
<layout class="org.apache.log4j.PatternLayout">
<param name="ConversionPattern" value="%d{ISO8601} %-5p [%t][%c] %m%X%n" />
</layout>
</appender>
<category name="org.nuxeo.elasticsearch" additivity="false">
<priority value="TRACE" />
<appender-ref ref="ELASTIC" />
</category>
The elastic.log
file will contain all the requests sent by the Nuxeo Platform to Elasticsearch, including the curl
command ready to be copied and pasted into a terminal for debugging.
Reporting Settings Mapping and Stats
It is also important to report the current settings and mapping of the Elasticsearch index (here called nuxeo):
curl localhost:9200/nuxeo/_settings?pretty > /tmp/nuxeo-settings.json
curl localhost:9200/nuxeo/_mapping?pretty > /tmp/nuxeo-mapping.json
# misc info and stats on Elasticsearch
curl localhost:9200 > /tmp/es-info.txt
curl localhost:9200/_cluster/stats?pretty >> /tmp/es-info.txt
curl localhost:9200/_nodes/stats?pretty >> /tmp/es-info.txt
curl localhost:9200/_cat/health?v >> /tmp/es-info.txt
curl localhost:9200/_cat/nodes?v >> /tmp/es-info.txt
curl localhost:9200/_cat/indices?v >> /tmp/es-info.txt
Testing an Analyzer
To test the full-text analyzer:
curl -XGET 'localhost:9200/nuxeo/_analyze?analyzer=fulltext&pretty' -d 'This is a text for testing, file_name/1-foos-BAR.jpg'
To test an analyzer derived from the mapping:
curl -XGET 'localhost:9200/nuxeo/_analyze?field=ecm:path.children&pretty' -d 'workspaces/main folder/folder'
Viewing Indexed Terms for Document Field
This can be done using a customized Luke tool to look at the Lucene index level, or you can use aggregations and retrieve the first 1000 tokens:
# view indexed tokens for dc:title.fulltext of document 3d50118c-7472-4e99-9cc9-321deb4fe053
curl -XGET 'localhost:9200/nuxeo/doc/_search?search_type=count&pretty' -d '{
"query" : {"ids" : { "values" : ["3d50118c-7472-4e99-9cc9-321deb4fe053"] }},
"aggs": {"my_aggs": {"terms": {"field": "dc:title.fulltext", "order" : { "_count" : "desc" }, "size": 1000}}}}'
Comparing the Elasticsearch Index with the Database Content
You can use the esync tool to compare the content of both and pinpoint discrepancies.
This tool is a read-only, standalone tool; it requires access to both the database and Elasticsearch (using the transport client on port 9300).