Page Provider Aggregates

When using the Elasticsearch Page Provider, you can define aggregates that will be returned along with the query result.

You can define a page provider that will query documents from Elasticsearch. The Nuxeo Platform takes advantage of the Elasticsearch aggregations module, and you can define your own aggregates within a page provider definition. Please refer to the Elasticsearch documentation about aggregations for more information.

For now, the Nuxeo Platform focuses on bucket aggregations. In addition to accessing the documents returned by a page provider, you can get and expose each bucket of the aggregates you have defined in this page provider. Quoting the Elasticsearch documentation:

Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document sets. In addition to the buckets themselves, the 'bucket' aggregations also compute and return the number of documents that "fell in" to each bucket.

Nuxeo Platform default search leverages Elasticsearch aggregates on some default document properties.

The picture above shows the default search of the Nuxeo Platform. On the left-hand side panel, you can see the search layout with the search criteria for the current search (such as full text) as well as some aggregate results (Creation date, Modification date, Author, Nature, Subjects, Coverage, Size). For example, according to the screenshot, the current result set contains 58 documents whose size is less than 100KB.

The aggregate navigation allows multiple selections and is adaptive. For instance, if you select the Size bucket "less than 100KB", this filter is applied to the search results and to the other aggregates. The filter is not applied to the Size aggregate itself, however, so you can still see the other Size buckets and extend the selection by checking other values.

Even with multiple aggregates, all of this is done with a single Elasticsearch query including the search results. The technical principle is similar to the one described in the blog post "Build Zappos like faceted navigation with ElasticSearch".

Note that aggregate results are displayed with either a checkbox-based widget or a select2-based widget. For further documentation on widgets displaying aggregate results, please refer to the page Aggregate Widget Types.

An aggregate is defined inside the page provider definition according to the following syntax:

<aggregate id="aggregate_id" type="aggregate_type" parameter="aggregate_parameter">
  <field schema="agg_field_schema" name="agg_field_name" />
  <properties>
    ...
  </properties>
</aggregate>

where:

aggregate_id is the id of the aggregate. It is used when defining a widget for a given aggregate, see Aggregate Widget Types.
aggregate_type is the type of the aggregate. See below for possible values.
aggregate_parameter is the document property on which the aggregate is calculated.
agg_field_schema and agg_field_name point to the field of the search document model that holds the current selection of the aggregate.
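To show where an aggregate sits, here is a minimal sketch of a page provider contribution. It assumes the PageProviderService "providers" extension point, the ElasticSearchNxqlPageProvider class and an aggregates wrapper element, as found in the default search contribution; the provider name, fixed part, page size and field names are hypothetical, so check them against your Nuxeo version before reusing them.

```xml
<extension target="org.nuxeo.ecm.platform.query.api.PageProviderService"
  point="providers">
  <!-- "my_es_search" is a hypothetical provider name -->
  <genericPageProvider name="my_es_search"
    class="org.nuxeo.elasticsearch.provider.ElasticSearchNxqlPageProvider">
    <!-- The search document type owns the fields referenced by <field> below -->
    <whereClause docType="DefaultSearch">
      <fixedPart>ecm:isVersion = 0</fixedPart>
    </whereClause>
    <!-- Aggregates are declared next to the query definition -->
    <aggregates>
      <aggregate id="dc_nature_agg" type="terms" parameter="dc:nature">
        <field schema="default_search" name="dc_nature_agg" />
        <properties>
          <property name="size">10</property>
        </properties>
      </aggregate>
    </aggregates>
    <pageSize>20</pageSize>
  </genericPageProvider>
</extension>
```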

Here are the types of aggregates currently supported by the Nuxeo Platform.

Terms Aggregate

In the search layout in the picture above, you can see the Author, Nature, Subjects and Coverage aggregate results. These are Terms aggregates on the dc:creator, dc:nature, dc:subjects and dc:coverage document properties respectively, defined as follows:

<aggregate id="dc_nature_agg" type="terms" parameter="dc:nature">
  <field schema="default_search" name="dc_nature_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
<aggregate id="dc_subjects_agg" type="terms" parameter="dc:subjects">
  <field schema="default_search" name="dc_subjects_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
<aggregate id="dc_coverage_agg" type="terms" parameter="dc:coverage">
  <field schema="default_search" name="dc_coverage_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
<aggregate id="dc_creator_agg" type="terms" parameter="dc:creator">
  <field schema="default_search" name="dc_creator_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>

The type of such aggregate is terms. The parameter must be of type string.

It has the following properties:

size: defines how many term buckets should be returned out of the overall terms.
minDocCount: only returns buckets containing at least this number of documents (default is 1).
order: orders the buckets. Possible values are count desc, count asc, term desc, term asc.

Other properties can be:

exclude: used to filter out values. Use the following syntax to exclude several values: (value1)|(value2).
include: used to filter values and show only defined values. Use the following syntax to define the values to show: (value1)|(value2).
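As an illustration, the properties listed above can be combined in a single Terms aggregate. This is only a sketch: the property values and the two excluded subjects are made up, and only the property names come from the lists above.

```xml
<!-- Hypothetical variation of the dc_subjects_agg aggregate -->
<aggregate id="dc_subjects_agg" type="terms" parameter="dc:subjects">
  <field schema="default_search" name="dc_subjects_agg" />
  <properties>
    <!-- Return at most 5 term buckets -->
    <property name="size">5</property>
    <!-- Drop sparsely populated buckets (threshold of 2 documents) -->
    <property name="minDocCount">2</property>
    <!-- Sort the buckets alphabetically by term -->
    <property name="order">term asc</property>
    <!-- Filter out two values; both values are invented for the example -->
    <property name="exclude">(draft)|(obsolete)</property>
  </properties>
</aggregate>
```

To show only a fixed set of values instead, replace the exclude property with an include property using the same (value1)|(value2) syntax.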

Significant Terms Aggregate

This aggregate works like the Terms aggregate but returns buckets of significant terms. It accepts the same properties. Please refer to the Elasticsearch documentation on significant terms.

The type of such aggregate is significant_terms. The parameter must be of type string.
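A Significant Terms aggregate is declared like a Terms aggregate with a different type. The sketch below assumes a dc_subjects_sig_agg field has been added to the default_search schema; that id and field name are hypothetical.

```xml
<!-- Hypothetical aggregate id and field name -->
<aggregate id="dc_subjects_sig_agg" type="significant_terms" parameter="dc:subjects">
  <field schema="default_search" name="dc_subjects_sig_agg" />
  <properties>
    <!-- Same properties as the Terms aggregate -->
    <property name="size">10</property>
  </properties>
</aggregate>
```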

Range Aggregate

Here is an example of Range aggregate on the file:content/length document property.

<aggregate id="common_size_agg" type="range" parameter="file:content/length">
  <field schema="default_search" name="common_size_agg" />
  <ranges>
    <range key="tiny" to="102400"/>
    <range key="small" from="102400" to="1048576"/>
    <range key="medium" from="1048576" to="10485760"/>
    <range key="big" from="10485760" to="104857600" />
    <range key="huge" from="104857600" />
  </ranges>
</aggregate>

The type of such aggregate is range. The parameter must be a numeric field of type integer, double or long.

It has no specific properties, but at least one range must be defined. A range must have a key and at least a from or a to. For each range, the from value is included and the to value is excluded: in the example above, a file of exactly 102400 bytes falls into the small bucket, not the tiny one. Note that a bucket is always returned for each defined range, even if its document count is 0.

Date Range Aggregate

Here is an example of Date Range aggregate on the dc:modified document property.

<aggregate id="dc_modified_agg" type="date_range" parameter="dc:modified">
  <field schema="default_search" name="dc_modified_agg" />
  <properties>
    <property name="format">dd-MM-yyyy</property>
  </properties>
  <dateRanges>
    <dateRange key="last24h" fromDate="now-24H" toDate="now"/>
    <dateRange key="lastWeek" fromDate="now-7d" toDate="now-24H"/>
    <dateRange key="lastMonth" fromDate="now-1M" toDate="now-7d"/>
    <dateRange key="lastYear" fromDate="now-1y" toDate="now-1M"/>
    <dateRange key="priorToLastYear" toDate="now-1y"/>
  </dateRanges>
</aggregate>

The type of such aggregate is date_range. The parameter must be of type date.

At least one dateRange must be defined. A dateRange must have a key and at least a fromDate or a toDate. For each dateRange, the fromDate value is included and the toDate value is excluded. Note that a bucket is always returned for each defined dateRange, even if its document count is 0.

The fromDate and toDate attributes accept values that can be:

relative: for instance now, now-24H, now-1y, etc.
absolute: for instance 2014-11-06, 14/02/2014 04:00:45, etc. In that case, you must define the format property accordingly, e.g. yyyy-MM-dd, dd/MM/yyyy HH:mm:ss, etc.
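For absolute dates, the format property and the date values have to agree. Here is a minimal sketch with two half-year buckets; the aggregate id, field name and range keys are hypothetical.

```xml
<!-- Hypothetical aggregate id, field name and keys -->
<aggregate id="dc_modified_2014_agg" type="date_range" parameter="dc:modified">
  <field schema="default_search" name="dc_modified_2014_agg" />
  <properties>
    <!-- Must match the way fromDate and toDate are written below -->
    <property name="format">yyyy-MM-dd</property>
  </properties>
  <dateRanges>
    <!-- fromDate is included, toDate is excluded, so the buckets do not overlap -->
    <dateRange key="firstHalf2014" fromDate="2014-01-01" toDate="2014-07-01"/>
    <dateRange key="secondHalf2014" fromDate="2014-07-01" toDate="2015-01-01"/>
  </dateRanges>
</aggregate>
```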

Histogram Aggregate

The type of such aggregate is histogram. The parameter must be a numeric field of type integer, double or long.

It has the following properties:

interval: Defines the interval covered by each returned bucket. It is required.
minDocCount: Only returns buckets containing at least this number of documents (default is 1).
order: Orders the buckets. Possible values are count desc, count asc, key desc, key asc. Note that the key of the histogram bucket is the to value.
extendedBoundsMin: Forces the histogram aggregate to return buckets (even if empty) starting from this value.
extendedBoundsMax: Forces the histogram aggregate to return buckets (even if empty) up to this value.

The use of extendedBoundsMin and extendedBoundsMax is strongly recommended. It prevents Elasticsearch from scanning the range of all possible values, which may affect performance.
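As an illustration, here is a sketch of a Histogram aggregate on the file:content/length document property, with one bucket per 10 MB between 0 and 100 MB. The aggregate id, the field name and all values are hypothetical. minDocCount is set to 0 on the assumption that, as in Elasticsearch itself, empty buckets are only returned when the minimum document count is 0.

```xml
<!-- Hypothetical aggregate id and field name -->
<aggregate id="size_histogram_agg" type="histogram" parameter="file:content/length">
  <field schema="default_search" name="size_histogram_agg" />
  <properties>
    <!-- Required: width of each bucket, here 10 MB expressed in bytes -->
    <property name="interval">10485760</property>
    <!-- Keep empty buckets so that the extended bounds take effect -->
    <property name="minDocCount">0</property>
    <property name="order">key asc</property>
    <!-- Limit the buckets to the 0 to 100 MB span -->
    <property name="extendedBoundsMin">0</property>
    <property name="extendedBoundsMax">104857600</property>
  </properties>
</aggregate>
```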

Date Histogram Aggregate

Here is an example of Date Histogram aggregate on the dc:created document property.

<aggregate id="dc_created_agg" type="date_histogram" parameter="dc:created">
  <field schema="default_search" name="dc_created_agg" />
  <properties>
    <property name="interval">year</property>
    <property name="format">yyyy</property>
    <property name="order">key desc</property>
  </properties>
</aggregate>

The type of such aggregate is date_histogram. The parameter must be of type date.

It has the same properties as the histogram aggregate except:

interval: Accepts values such as year, week, day, hour, minute or second. It also accepts: 1M, 1.5h, etc.
extendedBoundsMin and extendedBoundsMax: Accept values formatted according to the format property.

The use of extendedBoundsMin and extendedBoundsMax is strongly recommended. It prevents Elasticsearch from scanning the range of all possible values, which may affect performance.
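A sketch of a Date Histogram aggregate with extended bounds could look like the following. The aggregate id, field name and the years used as bounds are hypothetical; the bounds are written as yyyy to match the format property, and minDocCount is set to 0 on the assumption that empty buckets are only returned in that case, as in Elasticsearch itself.

```xml
<!-- Hypothetical aggregate id and field name -->
<aggregate id="dc_created_bounded_agg" type="date_histogram" parameter="dc:created">
  <field schema="default_search" name="dc_created_bounded_agg" />
  <properties>
    <property name="interval">year</property>
    <property name="format">yyyy</property>
    <!-- Keep empty buckets so that the extended bounds take effect -->
    <property name="minDocCount">0</property>
    <property name="order">key desc</property>
    <!-- Bounds follow the yyyy format declared above -->
    <property name="extendedBoundsMin">2010</property>
    <property name="extendedBoundsMax">2014</property>
  </properties>
</aggregate>
```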