Elasticsearch Setup

This page provides several configuration use cases for Elasticsearch.

Elasticsearch Supported Versions

The Nuxeo Platform communicates with Elasticsearch using the transport client Java API. As stated in the Elasticsearch documentation: "You are encouraged to use the same version on client and cluster sides. You may hit some incompatibility issues when mixing major versions".

Nuxeo Platform 6.0 uses the Elasticsearch 1.1.2 library and has been successfully tested against a 1.7.x cluster.

We recommend using the same JVM version for Elasticsearch and Nuxeo.

Setting up an Elasticsearch Cluster

The default configuration uses an embedded Elasticsearch instance that runs in the same JVM as the Nuxeo Platform.

This embedded mode is only for testing purposes and should not be used in production.

For production you need to set up an Elasticsearch cluster.

Installing the Elasticsearch Cluster

Refer to the Elasticsearch documentation to install and secure your cluster. Basically:

Don’t run Elasticsearch open to the public.
Don’t run Elasticsearch as root.
Disable dynamic scripting (disabled by default since 1.2.X).

Use an explicit cluster name by setting the cluster.name in the /etc/elasticsearch/elasticsearch.yml file. This will avoid conflicts with other environments.

Tuning Elasticsearch

If you have a large number of documents, or if you run Nuxeo in a cluster, you may reach the limits of the default configuration. Here are some recommended tuning options.

In the /etc/default/elasticsearch file you can increase the JVM heap:

ES_HEAP_SIZE=6g

To prevent indexing errors like:

EsRejectedExecutionException[rejected execution (queue capacity 50)

Increase the bulk queue size in the /etc/elasticsearch/elasticsearch.yml configuration file:

threadpool.bulk.queue_size: 1000

Configuring Nuxeo to Access the Cluster

To connect the Nuxeo Platform instance to the Elasticsearch cluster, edit the nuxeo.conf file and set the following options:

elasticsearch.addressList=somenode:9300,anothernode:9300
elasticsearch.clusterName=elasticsearch
elasticsearch.indexName=nuxeo
elasticsearch.indexNumberOfReplicas=0

Where:

addressList points to one or more Elasticsearch nodes. Note that Nuxeo connects to the transport (Java API) port 9300 and not the HTTP port 9200.
clusterName is the name of the cluster to join, elasticsearch being the default cluster name.
indexName is the name of the Elasticsearch index.
indexNumberOfReplicas is the number of replicas. By default you have 5 shards and 1 replica. If you have a single node in your cluster, set indexNumberOfReplicas to 0. See the Elasticsearch documentation for more information on shards and replicas.

You can find all the available options in the nuxeo.defaults file.
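
As a rough illustration of the addressList format, the following Python sketch splits the value into host and port pairs and flags entries that point at the HTTP port by mistake. The helper names parse_address_list and http_port_entries are hypothetical and not part of Nuxeo; the default port of 9300 is an assumption taken from the example above.

```python
def parse_address_list(value, default_port=9300):
    """Split a 'host:port,host:port' string into (host, port) tuples."""
    nodes = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas
        host, _, port = entry.partition(":")
        # Fall back to the assumed transport port when none is given.
        nodes.append((host, int(port) if port else default_port))
    return nodes


def http_port_entries(nodes, http_port=9200):
    """Return the entries that target the HTTP port instead of the transport port."""
    return [node for node in nodes if node[1] == http_port]


nodes = parse_address_list("somenode:9300,anothernode:9300")
print(nodes)
print(http_port_entries(nodes))
```

An empty result from http_port_entries suggests that no entry was accidentally written with port 9200.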

Disabling Elasticsearch

Elasticsearch is enabled by default. If you want to disable Elasticsearch indexing and search, add the following option to the nuxeo.conf file:

elasticsearch.enabled=false

Rebuilding the Index

If you need to re-index the whole repository, you can do this from the Admin > Elasticsearch > Administration tab.

Changing the Mapping of the Index

Nuxeo comes with a default mapping that sets the locale for full-text and declares some fields as being date or numeric.

For fields that are not explicitly defined in the mapping, Elasticsearch will try to guess the type the first time it indexes the field. If the field is empty it will be treated as a String field. This is why most of the time you need to explicitly set the mapping for your custom fields that are of type date, numeric or full-text. Also fields that are used to sort and that could be empty need to be defined to prevent an unmapped field error.
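
To make this concrete, here is a minimal Python sketch that builds the kind of explicit field declarations described above and prints them as JSON, ready to be adapted inside a mapping. The field names (myschema:deadline, myschema:price, myschema:reference) are hypothetical, and the type names assume the Elasticsearch 1.x mapping syntax (date, double, and a not_analyzed string for a sortable field).

```python
import json


def explicit_properties():
    """Declare types for custom fields so Elasticsearch does not have to guess them."""
    return {
        # Hypothetical date field: would otherwise be guessed from the first value.
        "myschema:deadline": {"type": "date"},
        # Hypothetical numeric field.
        "myschema:price": {"type": "double"},
        # Hypothetical sort field that may be empty: declared to avoid an unmapped field error.
        "myschema:reference": {"type": "string", "index": "not_analyzed"},
    }


fragment = {"properties": explicit_properties()}
print(json.dumps(fragment, indent=2))
```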

The default mapping is located in ${NUXEO_HOME}/templates/common-base/nxserver/config/elasticsearch-config.xml.nxftl.

To override and tune the default mapping:

Create a custom template like myapp with a nuxeo.defaults file that contains:
```
myapp.target=.
```

In this custom template, create a nxserver/config/elasticsearch-myapp-config.xml.nxftl file and override the mapping contribution.

<component name="org.nuxeo.elasticsearch.myapp">
  <require>org.nuxeo.elasticsearch.defaultConfig</require>
  <extension target="org.nuxeo.elasticsearch.ElasticSearchComponent"
    point="elasticSearchIndex">
    <elasticSearchIndex name="nuxeo" type="doc" repository="default">
     <mapping>
... Here copy and adapt the default mapping
     </mapping>
    </elasticSearchIndex>
  </extension>
</component>

Update the nuxeo.conf file to use your custom template:
```
nuxeo.templates=default,myapp
```
Restart and re-index the entire repository from the Admin tab (see previous section).

For mapping customization examples, see the page Configuring the Elasticsearch Mapping.

Fast Rebuild of the Index to Update a Mapping

You may want to change the mapping but keep the existing indexed data. You can do this quickly with a tool that extracts the _source of documents from one index and submits it to a new index that has another mapping. This is fast because the Nuxeo Platform doesn't have to read all documents and submit them to Elasticsearch.

After changing the mapping (see previous section), point to a new index in your nuxeo.conf by setting elasticsearch.indexName=nuxeo_v2.
Start the Nuxeo Platform. The new index is created with the new mapping.

Copy the index content using stream2es. Here we copy nuxeo_v1 to nuxeo_v2.

curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
./stream2es es --source http://localhost:9200/nuxeo_v1 --target http://localhost:9200/nuxeo_v2

Using an Index Alias

You can even change a mapping without restarting Nuxeo if you use an alias as the index name.

For instance, the Nuxeo Platform only knows the nuxeo alias, and once your mapping is ready on nuxeo_v2 you can switch atomically:

curl -XPOST 'localhost:9200/_aliases' -d '{ "actions" : [
    { "remove" : { "index" : "nuxeo_v1", "alias" : "nuxeo" } },
    { "add" : { "index" : "nuxeo_v2", "alias" : "nuxeo" } } ] }'
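
The same switch can be scripted. The sketch below builds the request body in Python and, separately, posts it with the standard library. The function names alias_switch_body and switch_alias are hypothetical, and the base URL assumes a node listening on localhost:9200 as in the curl example.

```python
import json
import urllib.request


def alias_switch_body(alias, old_index, new_index):
    """Build the _aliases payload that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def switch_alias(alias, old_index, new_index, base_url="http://localhost:9200"):
    """POST the payload; both actions travel in a single request."""
    data = json.dumps(alias_switch_body(alias, old_index, new_index)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_aliases",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Only prints the payload; call switch_alias(...) against a reachable cluster.
print(json.dumps(alias_switch_body("nuxeo", "nuxeo_v1", "nuxeo_v2"), indent=2))
```

Keeping the remove and add actions in one request is what makes the switch atomic from the point of view of the Nuxeo Platform.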

Configuration for Multi Repositories

You need to define an index for each repository. This is done by adding an elasticSearchIndex contribution.

Create a custom template as described in the section "Changing the Mapping of the Index" above.
Add a second elasticSearchIndex contribution:
```
<elasticSearchIndex name="nuxeo-repo2" type="doc" repository="repo2"> ....
```
Where name is the Elasticsearch index name and repository the repository name.

Investigating and Reporting Problems

Activate Traces

To understand why a document is not present in search results or not indexed, you can activate a debug trace.

Open the lib/log4j.xml file and uncomment the ELASTIC section:

      <appender name="ELASTIC" class="org.apache.log4j.FileAppender">        
        <errorHandler class="org.apache.log4j.helpers.OnlyOnceErrorHandler" />
        <param name="File" value="${nuxeo.log.dir}/elastic.log" />
        <param name="Append" value="false" />
        <layout class="org.apache.log4j.PatternLayout">
          <param name="ConversionPattern" value="%d{ISO8601} %-5p [%t][%c] %m%X%n" />
        </layout>
      </appender>
      <category name="org.nuxeo.elasticsearch" additivity="false">
        <priority value="TRACE" />
        <appender-ref ref="ELASTIC" />
      </category>

The elastic.log file will contain all the requests sent by the Nuxeo Platform to Elasticsearch, including the equivalent curl command, ready to be copied, pasted and debugged in a terminal.

Reporting Settings and Mapping

It is also important to report the current settings and mapping of an Elasticsearch index (here called nuxeo):

curl 'localhost:9200/nuxeo/_settings?pretty' > /tmp/nuxeo-settings.json
curl 'localhost:9200/nuxeo/_mapping?pretty' > /tmp/nuxeo-mapping.json
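
If you collect these reports regularly, a small script may be convenient. The following Python sketch derives the two URLs and output files for a given index and downloads them. The function names report_targets and save_reports are hypothetical, and the base URL and output directory are assumptions mirroring the curl commands.

```python
import urllib.request


def report_targets(index, base_url="http://localhost:9200", out_dir="/tmp"):
    """Map each output file to the URL whose response should be saved in it."""
    return {
        f"{out_dir}/{index}-{kind}.json": f"{base_url}/{index}/_{kind}?pretty"
        for kind in ("settings", "mapping")
    }


def save_reports(index, **kwargs):
    """Download settings and mapping; requires a reachable Elasticsearch node."""
    for path, url in report_targets(index, **kwargs).items():
        with urllib.request.urlopen(url) as response, open(path, "wb") as out:
            out.write(response.read())


# Show what would be fetched without contacting the cluster.
for path, url in report_targets("nuxeo").items():
    print(url, "->", path)
```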

Testing an Analyzer

To test the full-text analyzer:

curl -XGET 'localhost:9200/nuxeo/_analyze?analyzer=fulltext&pretty=true' -d 'This is a text for testing, file_name/1-foos-BAR.jpg'

Viewing Indexed Terms for Document Field

This can be done using a customized Luke tool and looking at the Lucene index level, or you can use a terms aggregation to retrieve the first 1000 tokens:

# view indexed tokens for dc:title.fulltext of document 3d50118c-7472-4e99-9cc9-321deb4fe053
curl -XGET 'localhost:9200/nuxeo/doc/_search?search_type=count&pretty' -d'{
 "query" : {"ids" : { "values" : ["3d50118c-7472-4e99-9cc9-321deb4fe053"] }},
 "aggs": {"my_aggs": {"terms": {"field": "dc:title.fulltext", "order" : { "_count" : "desc" }, "size": 1000}}}}'

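Hand-written JSON bodies like the one in the curl command are easy to unbalance, so it can help to generate them. This Python sketch builds the same query for any document ID and field. The function name indexed_terms_query is hypothetical, and the body assumes the Elasticsearch 1.x ids query and terms aggregation syntax used in the curl command.

```python
import json


def indexed_terms_query(doc_id, field, size=1000):
    """Build a search body that lists the indexed terms of `field` for one document."""
    return {
        "query": {"ids": {"values": [doc_id]}},
        "aggs": {
            "my_aggs": {
                "terms": {"field": field, "order": {"_count": "desc"}, "size": size}
            }
        },
    }


body = indexed_terms_query("3d50118c-7472-4e99-9cc9-321deb4fe053", "dc:title.fulltext")
# The printed JSON can be passed to curl with the -d option.
print(json.dumps(body))
```
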
Comparing the Elasticsearch Index with the Database Content

You can use the esync tool to compare the two and pinpoint discrepancies.

This is a read-only, standalone tool. It requires access to both the database and Elasticsearch (using the transport client on port 9300).