This page provides several configuration use cases for Elasticsearch.
Setting up an Elasticsearch Cluster
Elasticsearch Supported Versions
The Nuxeo Platform communicates with Elasticsearch using the HTTP REST protocol (port 9200 by default), which provides looser coupling with Elasticsearch.
| Nuxeo Platform Version | LTS 2021 | LTS 2019 | LTS 2017 | LTS 2016 |
|---|---|---|---|---|
| Elasticsearch | Library: 7.9.2; Cluster: 7.x with x >= 7, RestClient protocol only (Elastic ensures forward compatibility on minor versions, 7.7 has been validated) | Library: 6.5.3; Cluster: | Library: 5.6.3; Cluster: 5.6.x | From 8.1 to 8.3: Library 1.5.2, Cluster 1.5.2 to 1.7.x; From 8.10: Library 2.3.5, Cluster 2.3.x to 2.4.x |
The default configuration uses an embedded Elasticsearch instance that runs in the same JVM as the Nuxeo Platform.
For production you need to set up an Elasticsearch cluster.
Installing the Elasticsearch Cluster
Refer to the Elasticsearch documentation to install and secure your cluster. Basically:
- Don't run Elasticsearch open to the public.
- Don't run Elasticsearch as root.
- Set a dedicated cluster.name in the /etc/elasticsearch/elasticsearch.yml file; this will avoid conflicts with other environments (see the example below).
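For example, a minimal /etc/elasticsearch/elasticsearch.yml sketch could look like this (the cluster name and bind address are example values):
# /etc/elasticsearch/elasticsearch.yml
cluster.name: nuxeo-prod        # dedicated name, avoids joining another environment by accident
network.host: 10.0.0.5          # bind to a private interface, not a public one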
Recommended Tuning
If you have a large number of documents or if you run Nuxeo in cluster mode, you may reach the limits of the default configuration. Here are some recommended tunings:
Consider disabling OS swapping or using another Elasticsearch option to prevent the heap from being swapped.
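As a sketch, on Linux you could either disable swap entirely or, for Elasticsearch 5 and above, lock the JVM heap in memory with the standard bootstrap.memory_lock setting (the commands below are examples):
# disable swap at the OS level
sudo swapoff -a
# or, in /etc/elasticsearch/elasticsearch.yml, lock the JVM heap in memory
bootstrap.memory_lock: true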
In the /etc/default/elasticsearch file, you can increase the JVM heap to half of the available OS memory:
# For a dedicated node with 12g of RAM for Elasticsearch < 6
# ES_HEAP_SIZE=6g
# For a dedicated node with 12g of RAM for Elasticsearch >= 6
ES_JAVA_OPTS="-Xms6g -Xmx6g"
To prevent indexing errors like:
EsRejectedExceptionException[rejected execution (queue capacity 50)
Increase the bulk queue size in the /etc/elasticsearch/elasticsearch.yml configuration file:
# For Elasticsearch 2.x
# threadpool.bulk.queue_size: 500
# For Elasticsearch 5.6
# thread_pool.bulk.queue_size: 500
# For Elasticsearch 6.x
thread_pool.write.queue_size: 500
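To check whether the queue fills up or requests get rejected, you can query the thread pool statistics; the write pool name applies to Elasticsearch 6.x and above, older versions use the bulk pool:
curl "localhost:9200/_cat/thread_pool/write?v&h=node_name,name,queue,queue_size,rejected"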
To reduce disk IO you should consider changing the default translog durability from request to async.
Since Nuxeo 10.3 this can be done from nuxeo.conf:
elasticsearch.index.translog.durability=async
If your indexes are already created, you need a manual operation to change the translog durability:
curl -H "Content-Type: application/json" -XPUT "http://localhost:9200/nuxeo-uidgen/_settings" -d '{
"index.translog.durability" : "async"
}'
curl -H "Content-Type: application/json" -XPUT "http://localhost:9200/nuxeo-audit/_settings" -d '{
"index.translog.durability" : "async"
}'
curl -H "Content-Type: application/json" -XPUT "http://localhost:9200/nuxeo/_settings" -d '{
"index.translog.durability" : "async"
}'
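You can verify the change afterwards by reading the setting back (shown here for the nuxeo index):
curl "localhost:9200/nuxeo/_settings/index.translog.durability?pretty"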
Configuring Nuxeo to Access the Elasticsearch Cluster
Nuxeo supports two protocols to access the Elasticsearch cluster: the Transport Client protocol and the REST client.
The REST Client (default)
This protocol is supported since Nuxeo 9.3:
elasticsearch.client=RestClient
elasticsearch.addressList=http://somenode:9200,https://anothernode:443
Where:
- elasticsearch.client selects the RestClient protocol.
- elasticsearch.addressList is a comma-separated list of URLs.
The Transport Client protocol
Here are the nuxeo.conf options available for the Transport Client protocol:
elasticsearch.client=TransportClient
elasticsearch.addressList=somenode:9300,anothernode:9300
elasticsearch.clusterName=elasticsearch
Where:
- elasticsearch.client selects the TransportClient protocol.
- elasticsearch.addressList points to one or many Elasticsearch nodes; this is a comma-separated list of host:port. Note that the default port for this protocol is 9300 (and not 9200).
- elasticsearch.clusterName is the cluster name to join, elasticsearch being the default cluster name.
Advanced REST Client configuration
If you have installed Elasticsearch X-Pack, you can secure the communication between Nuxeo and Elasticsearch using the REST client (supported since Nuxeo 9.10-HF01).
For Elasticsearch, please follow the Securing Elasticsearch and Kibana guide.
Basic Authentication
If you have chosen to configure Basic User Authentication, then you can set up Nuxeo using nuxeo.conf with the following properties:
elasticsearch.restClient.username=your_username
elasticsearch.restClient.password=your_password
Creating a role with access to the Nuxeo indexes could look something like this:
curl -XPOST -u elastic 'localhost:9200/_xpack/security/role/nuxeo_role' -H "Content-Type: application/json" -d '{
"cluster" : [
"all"
],
"indices" : [
{
"names" : [ "nuxeo*" ],
"privileges" : [ "all" ]
}
]
}'
Configuring a user for that role could look something like this:
curl -XPOST -u elastic 'localhost:9200/_xpack/security/user/nuxeo_user' -H "Content-Type: application/json" -d '{
"password" : "nuxeo_secret_password",
"full_name" : "Nuxeo User",
"roles" : [ "nuxeo_role" ]
}'
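As a quick sanity check, you can verify that the new user is allowed to query the Nuxeo index (user, password and index names as created above):
curl -u nuxeo_user:nuxeo_secret_password "localhost:9200/nuxeo/_count?pretty"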
TLS/SSL Configuration
If you have chosen to configure TLS/SSL, then you can set up Nuxeo using nuxeo.conf with the following properties:
elasticsearch.restClient.truststore.path
elasticsearch.restClient.truststore.password
elasticsearch.restClient.truststore.type
elasticsearch.restClient.keystore.path
elasticsearch.restClient.keystore.password
elasticsearch.restClient.keystore.type
elasticsearch.addressList will need to be updated to include https.
See the Trust Store and Key Store Configuration page for more.
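A nuxeo.conf sketch could look like the following; the paths, passwords and store type are placeholder values, and the keystore entries are only needed when Elasticsearch requires client certificate authentication:
elasticsearch.addressList=https://somenode:9200
elasticsearch.restClient.truststore.path=/etc/nuxeo/es-truststore.jks
elasticsearch.restClient.truststore.password=changeit
elasticsearch.restClient.truststore.type=JKS
elasticsearch.restClient.keystore.path=/etc/nuxeo/es-keystore.jks
elasticsearch.restClient.keystore.password=changeit
elasticsearch.restClient.keystore.type=JKS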
Index Names
Nuxeo manages 3 Elasticsearch indexes:
- The repository index, used to index document content; this index can be rebuilt from scratch by extracting content from the repository.
- The audit logs index, which stores audit entries; this index is a primary storage and cannot be rebuilt.
- A sequence index, used to serve unique values that can be used as primary keys; this index is also a primary storage.
To connect the Nuxeo Platform instance to the Elasticsearch cluster, check the following options in the nuxeo.conf file and edit them if you need to change the default values:
elasticsearch.indexName=nuxeo
elasticsearch.indexNumberOfReplicas=0
audit.elasticsearch.indexName=${elasticsearch.indexName}-audit
seqgen.elasticsearch.indexName=${elasticsearch.indexName}-uidgen
Where:
- elasticsearch.indexName is the name of the Elasticsearch index for the default document repository.
- elasticsearch.indexNumberOfReplicas is the number of replicas. By default you have 5 shards and 1 replica. If you have a single node in your cluster you should set indexNumberOfReplicas to 0. Visit the Elasticsearch documentation for more information on shards and replicas.
- audit.elasticsearch.indexName is the name of the Elasticsearch index for audit logs.
- seqgen.elasticsearch.indexName is the name of the Elasticsearch index for the UID sequencer, extensively used for audit logs.
You can find all the available options in the nuxeo.defaults.
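Once the Nuxeo Platform has started, you can check that the expected indexes exist on the cluster, assuming the default nuxeo prefix:
curl "localhost:9200/_cat/indices/nuxeo*?v"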
Index Aliases and Reindexing without Service Interruption
Reindexing the repository can be a long operation depending on the size of the repository. This is an administrative procedure that is required in order to apply a new Elasticsearch mapping or setting.
By default, when Nuxeo re-indexes the repository, it deletes and re-creates the Elasticsearch index, then submits all the documents for indexing. During this operation only the already indexed documents are searchable; this strongly impacts the user experience and requires a service interruption.
To avoid this, Nuxeo can manage 2 indexes at the same time: the current one continues to serve queries and index new document modifications, while the new one re-indexes the entire repository (including the new updates). On completion, Nuxeo switches to the new index.
Nuxeo leverages Elasticsearch aliases to do this. It manages 2 aliases: one for searching, using the name of the contribution (defaults to nuxeo), and one for writing, with a -write suffix (defaults to nuxeo-write). Both aliases point to the same index (nuxeo-0000 at the beginning). The index name ends with a number that is automatically incremented when reindexing.
Here is how to proceed:
- Nuxeo must be configured to manage Elasticsearch aliases: add elasticsearch.manageAlias.enabled=true to your nuxeo.conf. Note that if you are switching an existing instance to managed aliases, it will require a service interruption: stop Nuxeo and drop the existing nuxeo index, then activate the manage aliases option, start Nuxeo and proceed to a repository reindexing while the service is interrupted. The next reindexing will not require a service interruption.
- Perform a full reindexing using the Bulk Service (the legacy reindexing will not work properly during reindexing).
- On completion, you have to delete the old unused index (see the example below).
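For example, assuming the default nuxeo alias prefix, you can list the aliases to find the index that is no longer referenced and then delete it (nuxeo-0000 is an example name):
# list the aliases and the indexes they point to
curl "localhost:9200/_cat/aliases/nuxeo*?v"
# delete the old index once nothing points to it anymore
curl -X DELETE "localhost:9200/nuxeo-0000"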
Disabling Elasticsearch
Elasticsearch is enabled by default. If you want to disable Elasticsearch indexing and search, simply add the following option to nuxeo.conf:
elasticsearch.enabled=false
Disabling Elasticsearch for Audit Logs
When Elasticsearch is enabled and the audit.elasticsearch.enabled property is set to true in nuxeo.conf (which is the case by default), Elasticsearch is used as the backend for audit logs.
This improves scalability, especially when using Nuxeo Drive with a large set of users.
For this purpose make sure you read the Backing Up and Restoring the Audit Elasticsearch Index page.
If you want to disable Elasticsearch and use the SQL database as the default backend for audit logs, you can simply update this property in nuxeo.conf:
audit.elasticsearch.enabled=false
Rebuilding the Repository Index
If you need to reindex the whole repository, you have different possibilities:
Re-index the Repository Using the WorkManager (the legacy way)
There are 3 ways to run it:
- From the Nuxeo Dev Tool Browser Extension.
- From JSF UI (DEPRECATED) > Admin center > Elasticsearch > Admin.
- Using curl:
curl -X POST "<NUXEO_URL>/nuxeo/site/automation/Elasticsearch.Index" -u Administrator:<PASSWORD> -H 'content-type: application/json' -d '{"params":{},"context":{}}'
Look at the server.log file; you should see 3 WARN entries in the logs:
# start of re-indexing
WARN [http-nio-0.0.0.0-8080-exec-31] [org.nuxeo.elasticsearch.web.admin.ElasticSearchManager] Re-indexing the entire repository: default
...
# the whole repository has been scrolled; we know how many documents are going to be re-indexed
WARN [Nuxeo-Work-elasticSearchIndexing-1:785116626625974.1486048658] [org.nuxeo.elasticsearch.work.ScrollingIndexingWorker] Re-indexing job: /elasticSearchIndexing:785116626625974.1486048658 has submited 270197 documents in 541 bucket workers
...
# end of the re-indexing
WARN [Nuxeo-Work-elasticSearchIndexing-1:785120666169686.1890981267] [org.nuxeo.elasticsearch.work.BucketIndexingWorker] Re-indexing job: /elasticSearchIndexing:785116626625974.1486048658 completed.
You can fine tune the WorkManager indexing process using the following options:
- Sizing the indexing worker thread pool. The default size is 4; using more threads will crawl the repository faster:
elasticsearch.indexing.maxThreads=4
- Tuning the number of documents per worker and the number of documents submitted using the Elasticsearch bulk API:
# Reindexing option, number of documents to process per worker
elasticsearch.reindex.bucketReadSize=500
# Reindexing option, number of documents to submit to Elasticsearch per bulk command
elasticsearch.reindex.bucketWriteSize=50
Re-index the Repository Using the Bulk Service
Run a bulk command to re-index the repository; the command id is returned:
curl -s -X POST "<SERVER_URL>/nuxeo/site/automation/Elasticsearch.BulkIndex" -u Administrator:<PASSWORD> -H 'content-type: application/json' -d '{"params":{},"context":{}}'
{"commandId": "21aeaea1-0ef0-4a89-a92d-fa8f679361de"}
At any time, you can request the status of the re-indexing using the previous command id:
curl -s -X GET "<SERVER_URL>/nuxeo/api/v1/bulk/21aeaea1-0ef0-4a89-a92d-fa8f679361de" -u Administrator:<PASSWORD> -H 'content-type: application/json'
{
"entity-type": "bulkStatus",
"commandId": "21aeaea1-0ef0-4a89-a92d-fa8f679361de",
"state": "RUNNING",
"processed": 200,
"error": false,
"errorCount": 0,
"total": 42932,
"action": "index",
"username": "Administrator",
"submitted": "2020-11-16T15:26:50.346Z",
"scrollStart": "2020-11-16T15:26:50.432Z",
"scrollEnd": "2020-11-16T15:26:50.446Z",
"processingStart": null,
"processingEnd": null,
"completed": null,
"processingMillis": 0
}
Changing Mappings and Settings of Indexes
Updating Repository Index Configuration
Nuxeo comes with a default mapping that sets the locale for full-text and declares some fields as being date or numeric.
The default mapping is located in ${NUXEO_HOME}/templates/common-base/nxserver/config/elasticsearch-config.xml.nxftl.
To override and tune the default mapping:
Since Nuxeo 9.3, instead of overriding the extension point you can simply override the default mapping or settings JSON files:
- Create a custom template, for example myapp, with a nuxeo.defaults file that contains:
myapp.target=.
- In this custom template, create a file named nxserver/config/elasticsearch-doc-mapping.json to override the mapping. You can also create a file named nxserver/config/elasticsearch-doc-settings.json to override the settings.
Important: You must add your custom mapping to the existing one; you cannot just set your custom mapping in the file, because Nuxeo does not merge your mapping with the default one. So you must duplicate the original file ${NUXEO_HOME}/templates/common-base/nxserver/config/elasticsearch-doc-mapping.json to myapp/nxserver/config/elasticsearch-doc-mapping.json and modify the copy.
- Update nuxeo.conf to use your custom template:
nuxeo.templates=default,myapp
- Restart and re-index the entire repository (see previous section). A re-indexing is needed to apply the new settings and mapping.
For mapping customization examples, see the page Configuring the Elasticsearch Mapping.
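As a reference, the resulting custom template could be laid out as follows (myapp is the example name used above):
templates/myapp/nuxeo.defaults                                   # contains: myapp.target=.
templates/myapp/nxserver/config/elasticsearch-doc-mapping.json   # full mapping: default content plus your additions
templates/myapp/nxserver/config/elasticsearch-doc-settings.json  # optional: full settings override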
Updating the Audit Logs Index Configuration
Here the index is a primary storage and you cannot rebuild it. So we need a tool that extracts the _source of documents from one index and submits it to a new index that has been set up with the new configuration.
- Update the mappings or settings configuration by overriding {NUXEO_HOME}/templates/common-base/nxserver/config/elasticsearch-audit-index-config.xml (follow the same procedure as in the section above for the repository index).
- Use a new name for audit.elasticsearch.indexName (like nuxeo-audit2).
- Start the Nuxeo Platform. The new index is created with the new mapping.
- Stop the Nuxeo Platform.
- Copy the audit log entries into the new index using the _reindex endpoint. Here we copy nuxeo-audit to nuxeo-audit2:
curl -X POST http://localhost:9200/_reindex -H 'Content-Type: application/json' -d '{
  "source": { "index": "nuxeo-audit" },
  "dest": { "index": "nuxeo-audit2" }
}'
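You can then compare the document counts of the old and new indexes to check that the copy is complete:
curl "localhost:9200/nuxeo-audit/_count?pretty"
curl "localhost:9200/nuxeo-audit2/_count?pretty"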
Configuration for Multi Repositories
You need to define an index for each repository. This is done by adding an elasticSearchIndex contribution.
- Create a custom template as described in the section "Updating Repository Index Configuration" above.
- Add a second elasticSearchIndex contribution (see the more complete sketch below):
<elasticSearchIndex name="nuxeo-repo2" type="doc" repository="repo2"> ....
Where name is the Elasticsearch index name and repository is the repository name.
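A sketch of such a contribution could look like the following; the component name is an example and the extension target is an assumption based on the default Nuxeo Elasticsearch configuration, so check it against the elasticsearch-config.xml.nxftl shipped with your version:
<?xml version="1.0"?>
<component name="org.mycompany.project.elasticsearch.repo2">
  <extension target="org.nuxeo.elasticsearch.ElasticSearchComponent" point="elasticSearchIndex">
    <elasticSearchIndex name="nuxeo-repo2" type="doc" repository="repo2">
      <!-- settings and mapping blocks go here, as in the default contribution -->
    </elasticSearchIndex>
  </extension>
</component>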
Investigating and Reporting Problems
Activate Traces
To understand why a document is not present in search results or not indexed, you can activate a debug trace.
Open the lib/log4j.xml file and uncomment the ELASTIC section:
<!-- Elasticsearch logging -->
<File name="ELASTIC" fileName="${sys:nuxeo.log.dir}/elastic.log" append="false">
<PatternLayout pattern="%d{ISO8601} %-5p [%t] [%c] %m%n" />
</File>
<Logger name="org.nuxeo.elasticsearch" level="trace" additivity="false">
<AppenderRef ref="ELASTIC" />
</Logger>
The elastic.log file will contain all the requests made by the Nuxeo Platform to Elasticsearch, including the equivalent curl commands ready to be copied, pasted and debugged in a terminal.
Reporting Settings and Mapping
It is also important to report the current settings and mapping of an Elasticsearch index (here called nuxeo):
curl localhost:9200/nuxeo/_settings?pretty > /tmp/nuxeo-settings.json
curl localhost:9200/nuxeo/_mapping?pretty > /tmp/nuxeo-mapping.json
# misc info and stats on Elasticsearch
curl localhost:9200 > /tmp/es-info.txt
curl localhost:9200/_cluster/stats?pretty >> /tmp/es-info.txt
curl localhost:9200/_nodes/stats?pretty >> /tmp/es-info.txt
curl localhost:9200/_cat/health?v >> /tmp/es-info.txt
curl localhost:9200/_cat/nodes?v >> /tmp/es-info.txt
curl localhost:9200/_cat/indices?v >> /tmp/es-info.txt
Testing an Analyzer
To test the full-text analyzer:
curl -s -X GET "localhost:9200/nuxeo/_analyze" -H 'Content-Type: application/json' -d' {
"analyzer" : "fulltext",
"text" : "This is a text for testing, file_name/1-foos-BAR.jpg"
}'
To test an analyzer derived from the mapping:
curl -s -X GET "localhost:9200/nuxeo/_analyze" -H 'Content-Type: application/json' -d' {
"field" : "ecm:path.children",
"text" : "workspaces/main folder/sub-folder"
}'
Viewing Indexed Terms for a Document Field
This can be done using a tool like Luke to analyze the index at the Lucene level.
It is also possible to use an aggregate on fields that are not text, or on text fields with the fielddata option:
# view indexed tokens for dc:title.fulltext of document 3d50118c-7472-4e99-9cc9-321deb4fe053
curl -XGET 'localhost:9200/nuxeo/doc/_search?pretty' -H 'Content-Type: application/json' -d'{
"query" : {"ids" : { "values" : ["3d50118c-7472-4e99-9cc9-321deb4fe053"] }},
"aggs": {"my_aggs": {"terms": {"field": "dc:title", "order" : { "_count" : "desc" }, "size": 1000}}}}'
You may need to change the size parameter to get more or fewer indexed terms.
Explain and Profile Elasticsearch Queries
When trace-level logs are activated, the Elasticsearch curl commands are present in the elastic.log file. To get more details on what happens during query execution, you can use either explain or profile.
These two approaches help to understand the mapping and the field scoring; they can also give hints about unmapped fields, for example.
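For example, starting from a query copied out of elastic.log, you can ask Elasticsearch to explain the scoring of each hit or to profile the query execution (the query below is just an illustration on the dc:title field):
# explain scoring for each hit
curl -s "localhost:9200/nuxeo/_search?pretty" -H 'Content-Type: application/json' -d '{
  "explain": true,
  "query": { "match": { "dc:title": "report" } }
}'
# profile the execution of the same query
curl -s "localhost:9200/nuxeo/_search?pretty" -H 'Content-Type: application/json' -d '{
  "profile": true,
  "query": { "match": { "dc:title": "report" } }
}'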
Comparing the Elasticsearch Index with the Database Content
You can use the esync tool to compare both contents and pinpoint discrepancies.
This is a read-only standalone tool; it requires access to both the database and Elasticsearch (using the transport client on port 9300).