Elasticsearch Hints Cheat Sheet

Updated: March 18, 2024

This page lists interesting use cases of Elasticsearch Hints.

More Like This

The More Like This hint leverages the More Like This query of Elasticsearch to find documents that are "like" a given set of documents.

Example:

SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.fulltext,dc:description.fulltext) OPERATOR(more_like_this) */ ecm:uuid = '1234'

It takes the most representative terms (those with the highest tf-idf) from the title and description of document 1234 and finds documents that also match those terms.
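
For orientation, the hint above corresponds roughly to an Elasticsearch more_like_this query. The sketch below builds such a request body as a plain Python dictionary. The helper name build_mlt_query and the index name "nuxeo" are assumptions made for illustration, and the exact body that Nuxeo generates may differ.

```python
import json


def build_mlt_query(doc_id, fields, index="nuxeo"):
    """Return a more_like_this request body for one reference document.

    The index name is an assumption; use the name of your repository index.
    """
    return {
        "query": {
            "more_like_this": {
                # Fields whose terms are compared between documents.
                "fields": list(fields),
                # The reference document, addressed by index and id.
                "like": [{"_index": index, "_id": doc_id}],
            }
        }
    }


body = build_mlt_query("1234", ["dc:title.fulltext", "dc:description.fulltext"])
print(json.dumps(body, indent=2))
```

The "fields" list mirrors the INDEX(...) part of the hint, and the "like" entry mirrors the ecm:uuid value of the NXQL predicate.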

Fuzzy Search on Full Text Index

Nuxeo Studio Configuration

  • In your Page Provider in Nuxeo Studio, drag and drop any string field as a predicate
  • Use the following values for the ES hints configuration:
    • Index: all_field
    • Analyzer: fulltext
    • Operator: fuzzy

Once these values are filled in, any value chosen for the main "Operator" item (=, !=, etc.) is ignored.

Test case

  • Create a new document with a text file attachment that contains the string "Nuxeo rocks"
  • Search for "Nuxo": the document created previously appears in the results
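
To see why "Nuxo" finds "Nuxeo", here is a minimal sketch of the edit-distance check behind fuzzy matching. It assumes the analyzer lowercases terms and that the fuzzy operator runs with Elasticsearch's AUTO fuzziness (0 edits for terms of 1 or 2 characters, 1 edit for 3 to 5 characters, 2 edits beyond that). It also uses plain Levenshtein distance, whereas Elasticsearch additionally counts a transposition as a single edit. The function names are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Number of edits allowed by AUTO fuzziness for a term of this length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


query_term, indexed_term = "nuxo", "nuxeo"
distance = edit_distance(query_term, indexed_term)
matches = distance <= auto_fuzziness(query_term)
print(distance, matches)  # prints "1 True"
```

A single inserted "e" turns "nuxo" into "nuxeo", which is within the one edit allowed for a four-character term.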

Using the Common Operator on the Main Attachment Content

Suppose you want to be able to search using the common operator on your documents' main attachment content. This Elasticsearch operator is interesting for two reasons:

  • The common operator can be seen as an alternative to the full-text search.
    One notable difference is that it lets you search on terms that the full-text analyzer would have removed. For example, to find the "Not Beyond Space Travel Agencies", you can search for the "Not" keyword.

  • The common operator is smart. It divides query terms into those that are rare in the index and those that are common in it.
    Rare terms get a boost, while common terms are given less weight. Suppose you have many contracts in your repository and you search for "confidentiality clause". If both query terms were considered equally important, the most relevant results might be drowned out. The common operator recognizes that the term "confidentiality" is rare and boosts it, while lowering the importance of the common term "clause". This helps you get the most relevant results first.
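
The split between rare and common terms can be illustrated with a small sketch. This is a simplified model of the idea, not the actual Elasticsearch implementation: the function name split_by_frequency, the document counts, and the 0.1 cutoff are hypothetical values chosen for the example.

```python
def split_by_frequency(query_terms, doc_freq, total_docs, cutoff=0.1):
    """Split terms into (rare, common) by the share of documents containing them."""
    rare, common = [], []
    for term in query_terms:
        ratio = doc_freq.get(term, 0) / total_docs
        # Terms found in more than `cutoff` of all documents count as common.
        (common if ratio > cutoff else rare).append(term)
    return rare, common


# Hypothetical counts: number of documents that contain each term.
doc_freq = {"confidentiality": 40, "clause": 9000}
rare, common = split_by_frequency(
    ["confidentiality", "clause"], doc_freq, total_docs=10000
)
print(rare, common)  # prints "['confidentiality'] ['clause']"
```

In this model the rare group drives which documents are selected, and the common group only adjusts the score of documents that already matched.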

To implement this use case:

  • In the analyzer configuration (in the Elasticsearch settings file), add an analyzer that will be used to index the main attachment's content:
{
  ...
  "analysis": {
    ...
    "analyzer": {
      ...
      "my_attachment_analyzer" : {
        "filter" : [
          "word_delimiter_filter",
          "lowercase",
          "asciifolding"
        ],
        "type" : "custom",
        "tokenizer" : "standard"
      }
    }
  }
}
  • In the properties configuration (in the Elasticsearch mapping file), update the ecm:binarytext field mapping as follows:
{
  ...
  "properties": {
    ...
    "ecm:binarytext" : {
      "type" : "text",
      "analyzer": "fulltext",
      "copy_to": "all_field",
      "fields": {
        "common" : {
          "type": "text",
          "analyzer" : "my_attachment_analyzer"
        }
      }
    }
  }
}

You can now configure hints in Nuxeo Studio using the common operator when querying on the ecm:binarytext.common index.

Nuxeo Studio Configuration

  • In your Page Provider in Nuxeo Studio, drag and drop any string field as a predicate
  • Use the following values for the ES hints configuration:
    • Index: ecm:binarytext.common
    • Analyzer: my_attachment_analyzer
    • Operator: common

Once these values are filled in, any value chosen for the main "Operator" item (=, !=, etc.) is ignored.
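
If you write the query by hand rather than in Studio, the same settings can be expressed as an NXQL hint. The sketch below assembles such a query string, following the hint syntax of the More Like This example on this page. The helper nxql_with_hint is hypothetical, the ANALYZER(...) keyword and the quote escaping are assumptions to check against your Nuxeo version, and dc:title only stands in for "any string field".

```python
def nxql_with_hint(field, value, index, operator, analyzer=None):
    """Build an NXQL query whose predicate carries an Elasticsearch hint."""
    hint_parts = [f"INDEX({index})"]
    if analyzer:
        # Assumption: the analyzer is passed with an ANALYZER(...) keyword.
        hint_parts.append(f"ANALYZER({analyzer})")
    hint_parts.append(f"OPERATOR({operator})")
    hint = " ".join(hint_parts)
    # Assumption: backslashes and single quotes are escaped with a backslash.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"SELECT * FROM Document WHERE /*+ES: {hint} */ {field} = '{escaped}'"


query = nxql_with_hint(
    field="dc:title",  # placeholder predicate field
    value="Not",
    index="ecm:binarytext.common",
    operator="common",
    analyzer="my_attachment_analyzer",
)
print(query)
```

As in Studio, the "=" in the generated predicate is only a placeholder, since the operator given in the hint takes precedence.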

Test case

  • Create a new document with an attachment that contains the string "Not Beyond Space Travel Agency"
  • Search for "Not": the document created previously appears in the results

Please note this is a basic test case. The common operator is best used on very large indexes.