Nuxeo Server

Configuring the Elasticsearch Mapping

Updated: September 22, 2017

This page describes the settings you can tune to improve the search experience for users who search the document repository index.

Nuxeo comes with a default mapping that can work with custom fields of your schemas, but in a limited way. To leverage the search capabilities of Elasticsearch you need to define your own mapping, for instance in the following cases:

  • using a non-English or custom analyzer
  • using specific NXQL operators on a custom field: ILIKE, ecm:fulltext.custom, STARTSWITH
  • excluding a field from the full-text search
  • sorting on a custom field that may not exist

To do this, create your own custom template that redefines the Elasticsearch mapping. This way the reference mapping stays on the Nuxeo configuration side, and you should not update the mapping directly on the Elasticsearch side.

Nuxeo updates the mapping on Elasticsearch only when:

  • the Elasticsearch index does not exist
  • a full repository reindexing is performed

Customizing the Language

The Nuxeo code uses a full-text analyzer named fulltext. This is an alias that points to the en_fulltext analyzer by default.

To switch to the French analyzer, for instance, move the following line into the fr_fulltext analyzer definition:

 "alias" : "fulltext"

If you dump the mapping from the Elasticsearch HTTP API (or using an Elasticsearch plug-in like head or kopf), you will see that aliases are replaced by their target.
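The alias mechanism described above can be pictured with a small Python sketch. This is only an illustration, not Nuxeo or Elasticsearch code: the `analyzers` dictionary is a hypothetical, trimmed-down stand-in for the analyzer settings, and `resolve_alias` and `move_alias` are helper names invented for this example.

```python
# Hypothetical, trimmed-down analyzer settings: only the "alias" key matters here.
analyzers = {
    "en_fulltext": {"type": "custom", "alias": "fulltext"},
    "fr_fulltext": {"type": "custom"},
}


def resolve_alias(settings, alias):
    """Return the name of the single analyzer that carries the given alias."""
    targets = [name for name, conf in settings.items() if conf.get("alias") == alias]
    if len(targets) != 1:
        raise ValueError(f"alias {alias!r} must be declared exactly once, found {targets}")
    return targets[0]


def move_alias(settings, alias, new_target):
    """Return a copy of the settings with the alias moved to new_target."""
    moved = {
        name: {k: v for k, v in conf.items() if not (k == "alias" and v == alias)}
        for name, conf in settings.items()
    }
    moved[new_target]["alias"] = alias
    return moved


french = move_alias(analyzers, "fulltext", "fr_fulltext")
print(resolve_alias(analyzers, "fulltext"))
print(resolve_alias(french, "fulltext"))
```

The point of the sketch is that the alias must live in exactly one analyzer definition: moving the line, rather than copying it, keeps the fulltext name unambiguous.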

To run a case-insensitive search with the ILIKE operator, declare your field as a multi_field with a lowercase index, like this:

"my:field" : {
  "type" : "multi_field",
  "fields" : {
    "my:field" : {
      "include_in_all" : "true",
      "type" : "string"
    },
    "lowercase" : {
      "type": "string",
      "analyzer" : "lowercase_analyzer"
    }
  }
}
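To see why the lowercase sub-field makes ILIKE possible, here is a minimal Python sketch. It assumes that lowercase_analyzer keeps the whole value as a single lowercased token, which the mapping above does not spell out, and it does not reproduce how Nuxeo actually translates ILIKE into an Elasticsearch query. The helper names are invented for the example.

```python
import re


def lowercase_analyzer(value):
    # Assumption: the whole value is kept as one token and lowercased.
    return value.lower()


def index_document(value):
    # Mimics the two sub-fields of the multi_field mapping.
    return {"my:field": value, "my:field.lowercase": lowercase_analyzer(value)}


def like_to_regex(pattern):
    # Translate SQL-style wildcards: % matches any run of characters, _ matches one.
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def ilike(doc, pattern):
    # Both sides are lowercased, so the comparison ignores case.
    regex = like_to_regex(lowercase_analyzer(pattern))
    return regex.fullmatch(doc["my:field.lowercase"]) is not None


doc = index_document("Annual Report 2017")
print(ilike(doc, "annual%"))
print(ilike(doc, "report%"))
```

The original value stays available in the default sub-field, while the comparison runs only against the lowercased copy.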

Making STARTSWITH Work with a Custom Field

To use the STARTSWITH operator on a field with a path pattern, such as a hierarchical vocabulary, turn your field into a multi_field and add a children sub-field:

"my:field" : {
  "type" : "multi_field",
  "fields" : {
    "my:field" : {
      "index" : "not_analyzed",
      "type" : "string"
    },
    "children" : {
      "search_analyzer" : "keyword",
      "index_analyzer" : "path_analyzer",
      "type" : "string"
    }
  }
}
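The following Python sketch illustrates why this pair of analyzers supports STARTSWITH. It assumes that path_analyzer is built on a path-hierarchy tokenizer with "/" as the delimiter, which is not shown in the mapping above. The function names and the sample vocabulary entry are hypothetical.

```python
def path_analyzer(value, delimiter="/"):
    """Emit every ancestor path of value, as a path-hierarchy tokenizer would."""
    parts = value.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        token = delimiter.join(parts[:i])
        if token:  # skip the empty token produced by a leading delimiter
            tokens.append(token)
    return tokens


def keyword_analyzer(value):
    # The search side keeps the queried prefix as one unmodified term.
    return value


def startswith(indexed_value, prefix):
    # A match is a plain term lookup: the prefix must equal one indexed ancestor.
    return keyword_analyzer(prefix) in path_analyzer(indexed_value)


print(path_analyzer("europe/france/paris"))
print(startswith("europe/france/paris", "europe/france"))
print(startswith("europe/france/paris", "europe/fra"))
```

Under these assumptions the prefix has to end on a segment boundary: "europe/france" matches, while the partial segment "europe/fra" does not.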

Adding a New Full-Text Field

To use the full-text search syntax on a custom field, create a multi_field with a fulltext index, like this:

"my:text" : {
  "type" : "multi_field",
  "fields" : {
    "my:text" : {
      "include_in_all" : "true",
      "type" : "string"
    },
    "fulltext" : {
      "type": "string",
      "analyzer" : "fulltext"
    }
  }
}

Note that if you:

  • don't perform non-full-text searches on this field
  • don't use this field with an IS NULL or IS NOT NULL operation
  • don't sort on this field

then you can disable the default index on the field by adding "index" : "no" after the second "my:text" line, that is, inside the inner sub-field:

"my:text" : {
    "index" : "no",
    ...
},

Suppose you want to exclude the my:secret field from the ecm:fulltext search:

 "my:secret" : {
    "type" : "string",
    "include_in_all" : false
 }
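As a rough mental model, the effect of include_in_all can be sketched in Python. The sketch assumes that ecm:fulltext is served by a catch-all field built from every property that has not opted out, and that include_in_all defaults to true. The function name and the sample document are hypothetical.

```python
def build_all_field(document, mapping):
    """Concatenate the values of fields that keep include_in_all enabled."""
    included = []
    for name, value in document.items():
        field_mapping = mapping.get(name, {})
        # Assumption: a field with no explicit setting is included.
        if field_mapping.get("include_in_all", True):
            included.append(str(value))
    return " ".join(included)


mapping = {"my:secret": {"type": "string", "include_in_all": False}}
document = {"dc:title": "Budget 2017", "my:secret": "internal code 42"}

print(build_all_field(document, mapping))
```

The secret value never reaches the catch-all text, so a full-text query cannot match on it, while the field itself remains indexed and can still be queried directly.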

When you need to search with left truncation (or both left and right truncation), the NXQL syntax to use is LIKE '%foo%'. This kind of query uses an Elasticsearch wildcard search, but left truncation is expensive because the term index cannot be used efficiently. Using an NGram index is a good alternative in such a case.

First you need to define an nGram analyzer in your settings:

   "analysis" : {
...
      "tokenizer" : {
...
         "ngram_tokenizer": {
           "type": "nGram",
           "min_gram": 3,
           "max_gram": 12
          },
...
      "analyzer" : {
...
        "ngram_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "ngram_tokenizer"
        },
...

Then use it in the mapping:

   "properties" : {
...
      "dc:title" : {
         "type" : "multi_field",
         "fields" : {
           "dc:title" : {
             "index" : "not_analyzed",
             "type" : "string"
           },
           "fulltext" : {
             "boost": 2,
             "type": "string",
             "analyzer" : "fulltext"
           },
           "ngram": {
             "type": "string",
             "analyzer": "ngram_analyzer"
           }
         }
      },

Now you can do an efficient version of:

SELECT * FROM Document WHERE dc:title ILIKE '%Foo%'

Using:

SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) ANALYZER(lowercase_analyzer) OPERATOR(match) */ dc:title = 'Foo'
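The Python sketch below illustrates why the hinted query can replace the wildcard search. It emulates the ngram_analyzer settings shown earlier (lowercase filter, grams of 3 to 12 characters) and assumes that the tokenizer splits on no character class and that lowercase_analyzer turns the query into a single lowercased term. The function names are invented for the example.

```python
def ngram_analyzer(text, min_gram=3, max_gram=12):
    """Return every lowercased substring of length min_gram..max_gram."""
    lowered = text.lower()
    grams = set()
    for start in range(len(lowered)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(lowered):
                break
            grams.add(lowered[start:start + size])
    return grams


def lowercase_analyzer(query):
    # Assumption: the query stays a single term and is only lowercased.
    return query.lower()


def ngram_match(title, query):
    # The match is a direct term lookup, with no wildcard scan of the term index.
    return lowercase_analyzer(query) in ngram_analyzer(title)


title = "My Foobar report"
print(ngram_match(title, "Foo"))
print(ngram_match(title, "fo"))
print(ngram_match(title, "foobar report"))
```

Under these assumptions, a search term shorter than min_gram or longer than max_gram can never match, so the two bounds should be chosen to cover the term lengths users are expected to type. The trade-off is a larger index, since every title is stored as many overlapping grams.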

Index the Main Attachment Content for Use with the Common Operator

 

Suppose you want to search your documents' main attachment content using the common operator. This Elasticsearch operator is interesting for two reasons:

  • The common operator can be seen as an alternative to the full-text search. One notable difference is that it lets you search on terms that the full-text analyzer would have removed. For example, when searching for “Not Beyond Space Travel Agencies”, you can still search on the “Not” keyword.
  • The common operator is smart. It divides query terms between those that are rare in the index and those that are common in it. Rare terms get a boost, and common terms are given less weight. Say you have lots of contracts in your repository and you search for "confidentiality clause". If both query terms were given the same importance, the most relevant results might be drowned out. The common operator recognizes that the term "confidentiality" is rare and boosts it, while lowering the importance of the common term "clause". This helps you get the most relevant results first.
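The split between rare and common terms can be sketched in a few lines of Python. This is a simplified illustration and not the Elasticsearch implementation: the document frequencies, the repository size and the cutoff value below are made up for the example, and the function name is hypothetical.

```python
def split_terms(query_terms, doc_freq, total_docs, cutoff_frequency=0.1):
    """Separate query terms into rare and common ones by document frequency."""
    rare, common = [], []
    for term in query_terms:
        ratio = doc_freq.get(term, 0) / total_docs
        # A term found in more than cutoff_frequency of the documents is "common".
        if ratio > cutoff_frequency:
            common.append(term)
        else:
            rare.append(term)
    return rare, common


# Hypothetical statistics for a repository of 1000 contracts.
doc_freq = {"confidentiality": 40, "clause": 900}
rare, common = split_terms(["confidentiality", "clause"], doc_freq, total_docs=1000)

print(rare)
print(common)
```

With these numbers, "confidentiality" lands in the rare group that drives the selection of documents, while "clause" lands in the common group that only refines the ranking.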

To implement this use case:

  • In the analyzer configuration, add an analyzer that will be used to index the main attachment's content:
"my_attachment_analyzer" : {
  "type" : "custom",
  "filter" : [
    "word_delimiter_filter",
    "lowercase",
    "asciifolding"
  ],
  "tokenizer" : "standard"
}
  • In the properties configuration, update the ecm:binarytext field mapping configuration to the following:
"ecm:binarytext" : {
  "type" : "multi_field",
  "fields" : {
    "ecm:binarytext" : {
      "type" : "string",
      "index" : "no",
      "include_in_all" : true
    },
    "common" : {
      "type": "string",
      "analyzer" : "my_attachment_analyzer",
      "include_in_all" : false
    }
  }
}
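To get a feel for what my_attachment_analyzer leaves in the index, here is an approximate Python emulation. It is a sketch under several assumptions: the standard tokenizer is approximated by a word-character regular expression, asciifolding by stripping combining accents, and word_delimiter_filter is left out because its definition is not shown on this page.

```python
import re
import unicodedata


def ascii_fold(token):
    # Approximates asciifolding: decompose characters, then drop the accents.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def attachment_analyzer(text):
    # Approximates the standard tokenizer with runs of word characters.
    tokens = re.findall(r"\w+", text)
    # lowercase filter, then the asciifolding approximation; no stop-word filter.
    return [ascii_fold(token.lower()) for token in tokens]


print(attachment_analyzer("Not Beyond Space Travel Agencies"))
print(attachment_analyzer("Caf\u00e9 R\u00e9sum\u00e9"))
```

Because the filter chain contains no stop-word filter, a word such as "Not" stays in the index, which is what makes it searchable through the common operator.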

You can now configure hints in Nuxeo Studio using the common operator when querying on the ecm:binarytext.common index.


History: Created by Alain Escaffre