When using the Elasticsearch Page Provider, you can define aggregates that will be returned along with the query result.
A page provider can query documents from Elasticsearch. The Nuxeo Platform takes advantage of the Elasticsearch aggregate module, and you can define your own aggregates within a page provider definition. Please refer to the Elasticsearch documentation about aggregates for more information.
For now, the Nuxeo Platform focuses on bucket aggregations. In addition to accessing the documents returned by a page provider, you can get and expose each bucket of the aggregates defined in this page provider. Quoting the Elasticsearch documentation:
Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents that "fell in" to each bucket.
Nuxeo Platform default search leverages Elasticsearch aggregates on some default document properties.
The picture above shows the default search of the Nuxeo Platform. In the left-hand panel, you can see the search layout with the search criteria for the current search (such as full text) and some aggregate results (Creation date, Modification date, Author, Nature, Subjects, Coverage, Size). For example, according to the screenshot, the current result set contains 58 documents whose size is less than 100KB.
The aggregate navigation allows multiple selections and is adaptive. For instance, if you select the Size aggregate "less than 100KB", this filter is applied to the search result and to the other aggregates. The filter is not applied to the Size aggregate itself, so you can still see the other Size buckets and extend the selection by checking other values.
Even with multiple aggregates, all of this is done with a single Elasticsearch query that also returns the search results. The technical principle is similar to the one described in this blog post.
Note that aggregate results are displayed with either a checkbox-based widget or a select2-based widget. For further documentation on the widgets used to display aggregate results, please refer to the page Aggregate Widget Types.
An aggregate is defined inside the page provider definition according to the following syntax:
<aggregate id="aggregate_id" type="aggregate_type" parameter="aggregate_parameter">
  <field schema="agg_field_schema" name="agg_field_name" />
  <properties>
    ...
  </properties>
</aggregate>
where:
- aggregate_id is the id of the aggregate. It is used when defining a widget for a given aggregate, see Aggregate Widget Types.
- aggregate_type is the type of the aggregate. See below for possible values.
- aggregate_parameter is the field on which the aggregate will be calculated.
- agg_field_schema and agg_field_name point to the search document model field which will handle the current selection of the aggregate.
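To show where an aggregate definition sits, here is a minimal sketch of a page provider contribution holding one aggregate. The provider name my_search, the search document type MySearch, the my_search schema and the NXQL clause are hypothetical. The extension point, the provider class and the aggregates wrapper element are assumptions to check against your Nuxeo version.

```xml
<extension target="org.nuxeo.ecm.platform.query.api.PageProviderService"
  point="providers">
  <!-- Hypothetical provider name; the class is assumed to be the Elasticsearch NXQL page provider -->
  <genericPageProvider name="my_search"
    class="org.nuxeo.elasticsearch.provider.ElasticSearchNxqlPageProvider">
    <!-- Hypothetical search document type that carries the my_search schema -->
    <searchDocumentType>MySearch</searchDocumentType>
    <whereClause>
      <!-- Illustrative NXQL filter, not taken from the default search -->
      <fixedPart>ecm:mixinType != 'HiddenInNavigation'</fixedPart>
    </whereClause>
    <aggregates>
      <!-- Same syntax as above: id, type and parameter, plus the selection field -->
      <aggregate id="dc_nature_agg" type="terms" parameter="dc:nature">
        <field schema="my_search" name="dc_nature_agg" />
        <properties>
          <property name="size">10</property>
        </properties>
      </aggregate>
    </aggregates>
    <pageSize>20</pageSize>
  </genericPageProvider>
</extension>
```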
Here are the types of aggregates currently supported by the Nuxeo Platform.
Terms Aggregate
In the search layout in the picture above, you can see the Author, Nature, Subjects and Coverage aggregate results. These are Terms aggregates on the dc:creator, dc:nature, dc:subjects and dc:coverage document properties respectively, defined as follows:
<aggregate id="dc_nature_agg" type="terms" parameter="dc:nature">
  <field schema="default_search" name="dc_nature_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
<aggregate id="dc_subjects_agg" type="terms" parameter="dc:subjects">
  <field schema="default_search" name="dc_subjects_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
<aggregate id="dc_coverage_agg" type="terms" parameter="dc:coverage">
  <field schema="default_search" name="dc_coverage_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
<aggregate id="dc_creator_agg" type="terms" parameter="dc:creator">
  <field schema="default_search" name="dc_creator_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
The type of such an aggregate is terms. The parameter must be of type string.
It has the following properties:
- size: Defines how many term buckets should be returned out of the overall terms. Note: Set size to 0 to get all the buckets (mandatory when the terms aggregate is rendered as a select widget).
- minDocCount: Only returns buckets whose document count is at least the defined value (default is 1).
- order: Orders the buckets. Possible values are count desc, count asc, term desc, term asc.
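As an illustration of these properties, the following sketch combines size, minDocCount and order on the dc:creator aggregate. The values chosen are arbitrary examples, not the defaults of the default search.

```xml
<aggregate id="dc_creator_agg" type="terms" parameter="dc:creator">
  <field schema="default_search" name="dc_creator_agg" />
  <properties>
    <!-- Return at most 5 term buckets -->
    <property name="size">5</property>
    <!-- Document count threshold applied to the returned buckets (example value) -->
    <property name="minDocCount">2</property>
    <!-- Sort buckets alphabetically by term instead of by document count -->
    <property name="order">term asc</property>
  </properties>
</aggregate>
```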
Significant Terms Aggregate
This aggregate does the same as Terms but returns significant terms buckets. Its properties are also the same. Please refer to the Elasticsearch documentation on significant terms.
The type of such an aggregate is significant_terms. The parameter must be of type string.
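Since the syntax mirrors the Terms aggregate, a Significant Terms definition could look like the sketch below. The aggregate id and the dc_subjects_sig_agg field of the search schema are hypothetical names chosen for the example.

```xml
<!-- Hypothetical id and search field; only the type differs from a terms aggregate -->
<aggregate id="dc_subjects_sig_agg" type="significant_terms" parameter="dc:subjects">
  <field schema="default_search" name="dc_subjects_sig_agg" />
  <properties>
    <property name="size">10</property>
  </properties>
</aggregate>
```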
Range Aggregate
Here is an example of a Range aggregate on the common:size document property.
<aggregate id="common_size_agg" type="range" parameter="common:size">
  <field schema="default_search" name="common_size_agg" />
  <ranges>
    <range key="tiny" to="102400"/>
    <range key="small" from="102400" to="1048576"/>
    <range key="medium" from="1048576" to="10485760"/>
    <range key="big" from="10485760" to="104857600" />
    <range key="huge" from="104857600" />
  </ranges>
</aggregate>
The type of such an aggregate is range. The parameter must be a numeric of type integer, double or long.
It has no specific properties, but at least one range must be defined. A range must have a key and at least a from or a to. The from values are included and the to values are excluded for each range defined. Note that a bucket is always returned for each defined range, even if its document count is 0.
Date Range Aggregate
Here is an example of a Date Range aggregate on the dc:modified document property.
<aggregate id="dc_modified_agg" type="date_range" parameter="dc:modified">
  <field schema="default_search" name="dc_modified_agg" />
  <properties>
    <property name="format">"dd-MM-yyyy"</property>
  </properties>
  <dateRanges>
    <dateRange key="last24h" fromDate="now-24H" toDate="now"/>
    <dateRange key="lastWeek" fromDate="now-7d" toDate="now-24H"/>
    <dateRange key="lastMonth" fromDate="now-1M" toDate="now-7d"/>
    <dateRange key="lastYear" fromDate="now-1y" toDate="now-1M"/>
    <dateRange key="priorToLastYear" toDate="now-1y"/>
  </dateRanges>
</aggregate>
The type of such an aggregate is date_range. The parameter must be of type date.
At least one dateRange must be defined. A dateRange must have a key and at least a fromDate or a toDate. The fromDate values are included and the toDate values are excluded for each dateRange defined. Note that a bucket is always returned for each defined range, even if its document count is 0.
The fromDate and toDate attributes accept values which can be:
- relative: For instance now, now-24H, now-1y, etc.
- absolute: For instance 2014-11-06, 14/02/2014 04:00:45, etc. In this case you must define the format property accordingly, e.g. yyyy-MM-dd, dd/MM/yyyy HH:mm:ss, etc.
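For absolute dates, the format property and the date values have to match. A minimal sketch, assuming a hypothetical dc_created_range_agg field in the search schema and arbitrary example dates:

```xml
<!-- Hypothetical aggregate id and search field, for illustration only -->
<aggregate id="dc_created_range_agg" type="date_range" parameter="dc:created">
  <field schema="default_search" name="dc_created_range_agg" />
  <properties>
    <!-- The absolute dates below are written in this format -->
    <property name="format">yyyy-MM-dd</property>
  </properties>
  <dateRanges>
    <!-- fromDate is included, toDate is excluded, so the ranges do not overlap -->
    <dateRange key="before2014" toDate="2014-01-01"/>
    <dateRange key="year2014" fromDate="2014-01-01" toDate="2015-01-01"/>
    <dateRange key="since2015" fromDate="2015-01-01"/>
  </dateRanges>
</aggregate>
```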
Histogram Aggregate
The type of such an aggregate is histogram. The parameter must be a numeric of type integer, double or long.
It has the following properties:
- interval: Defines the interval covered by each returned bucket. It is required.
- minDocCount: Only returns buckets whose document count is at least the defined value (default is 1).
- order: Orders the buckets. Possible values are count desc, count asc, key desc, key asc. Note that the key of the histogram bucket is the to value.
- extendedBoundsMin: Forces the histogram aggregate to return buckets (even if empty) starting from this value.
- extendedBoundsMax: Forces the histogram aggregate to return buckets (even if empty) up to this value.
The use of extendedBoundsMin and extendedBoundsMax is strongly recommended. It prevents Elasticsearch from scanning the range of all possible values, which may affect performance.
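A histogram aggregate using these properties might be declared as in the sketch below. The aggregate id, the common_size_histo_agg search field and all numeric values are hypothetical and only illustrate the syntax.

```xml
<!-- Hypothetical aggregate id and search field, for illustration only -->
<aggregate id="common_size_histo_agg" type="histogram" parameter="common:size">
  <field schema="default_search" name="common_size_histo_agg" />
  <properties>
    <!-- One bucket per 1048576 bytes (1 MB); interval is required -->
    <property name="interval">1048576</property>
    <property name="order">key asc</property>
    <!-- Bound the histogram between 0 and 104857600 bytes (100 MB) -->
    <property name="extendedBoundsMin">0</property>
    <property name="extendedBoundsMax">104857600</property>
  </properties>
</aggregate>
```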
Date Histogram Aggregate
Here is an example of a Date Histogram aggregate on the dc:created document property.
<aggregate id="dc_created_agg" type="date_histogram" parameter="dc:created">
  <field schema="default_search" name="dc_created_agg" />
  <properties>
    <property name="interval">year</property>
    <property name="format">yyyy</property>
    <property name="order">key desc</property>
  </properties>
</aggregate>
The type of such an aggregate is date_histogram. The parameter must be of type date.
It has the same properties as the histogram aggregate except:
- interval: Accepts values such as year, week, day, hour, minute or second. It also accepts: 1M, 1.5h, etc.
- extendedBoundsMin and extendedBoundsMax accept values formatted according to the format property.
The use of extendedBoundsMin and extendedBoundsMax is strongly recommended. It prevents Elasticsearch from scanning the range of all possible values, which may affect performance.
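For a date_histogram aggregate, the bounds are written in the same pattern as the format property. A minimal sketch with a hypothetical aggregate id, a hypothetical search field and arbitrary example dates:

```xml
<!-- Hypothetical aggregate id and search field, for illustration only -->
<aggregate id="dc_created_daily_agg" type="date_histogram" parameter="dc:created">
  <field schema="default_search" name="dc_created_daily_agg" />
  <properties>
    <property name="interval">day</property>
    <property name="format">yyyy-MM-dd</property>
    <property name="order">key asc</property>
    <!-- Bounds follow the yyyy-MM-dd format declared above -->
    <property name="extendedBoundsMin">2014-11-01</property>
    <property name="extendedBoundsMax">2014-11-30</property>
  </properties>
</aggregate>
```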