Metrics

Updated: November 15, 2024

The Nuxeo code is instrumented using Dropwizard Metrics.

This library can report different types of metrics:

  • Counters: a value that can increase (or decrease), for instance the number of created documents.
  • Gauges: an instantaneous measurement of a value, for example the size of a cache.
  • Timers: a timer measures both the rate at which a particular piece of code is called and the distribution of its duration. For example, it provides the min, max, average, and percentiles for a request duration, as well as the number of requests per second, the total number of requests, and the total elapsed time.
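
For reference, here is a minimal, standalone sketch of these three metric types using the Dropwizard Metrics API. It uses the classic com.codahale.metrics package names and illustrative metric names only; Nuxeo itself uses tagged (dimensional) metric names and a newer fork of the library, so the actual instrumentation differs.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.codahale.metrics.Counter;
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class MetricTypesSketch {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Map<String, String> cache = new ConcurrentHashMap<>();

        // Counter: a value that increases (or decreases), e.g. created documents
        Counter created = registry.counter("documents.created");
        created.inc();

        // Gauge: an instantaneous measurement, e.g. the size of a cache
        registry.register("cache.size", (Gauge<Integer>) cache::size);

        // Timer: rate of calls plus distribution of their duration
        Timer queries = registry.timer("repository.query");
        try (Timer.Context ignored = queries.time()) {
            // ... timed section ...
        }
    }
}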

Nuxeo metrics can be reported to various Monitoring systems.

For complete monitoring, you should also monitor the operating system where Nuxeo is running, as well as the other services involved: databases, Elasticsearch, Redis, or Kafka.

Supported Monitoring Systems

Nuxeo uses a MetricReporter interface to publish its metrics and provides multiple implementations; the system can be extended to support other monitoring systems.

By default, since Nuxeo 11, we have the following metric reporters:

  • JMX
  • Graphite
  • Datadog
  • Prometheus
  • GCP Metrics (previously known as Stackdriver)

It is possible to activate multiple reporters at the same time.

Depending on the chosen metric reporter, either Nuxeo pushes metrics or the monitoring system polls Nuxeo for them. For instance, metrics are pushed to Graphite and Datadog, while Prometheus polls Nuxeo.

Monitoring systems are either hierarchical or dimensional:

  • Hierarchical monitoring systems use a flat metric name, like a file system tree. Examples of such systems are: JMX, Graphite, StatsD, Ganglia, ...
  • Dimensional monitoring systems use metric names enriched with tag key/value pairs. Examples of such systems are: Prometheus, Datadog, InfluxDB, ...

Since Nuxeo 11, the instrumentation of the code uses metric names with tags (dimensional metrics). Hierarchical monitoring reporters flatten the tags into the metric name.

For instance, a gauge that measures the pending video conversions on a Nuxeo cluster is named nuxeo.works.global.queue.scheduled and has the tag queue=videoConversion, which expresses the work queue dimension. The metric name is translated as follows:


  • JMX (hierarchical): #mbean = metrics:name=nuxeo.works.global.queue.scheduled,queue=videoConversion,type=gauges
  • Graphite (hierarchical): ${PREFIX}+nuxeo.works.global.queue.videoConversion.scheduled
  • Datadog (dimensional): nuxeo.works.global.queue.scheduled queue:videoConversion host:nuxeo
  • Prometheus (dimensional): dropwizard5_nuxeo_works_global_queue_scheduled_gauge{instance="nuxeo:8888",job="nuxeo",queue="videoConversion"}
  • GCP Metrics / Stackdriver (dimensional): custom.googleapis.com/nuxeo/dropwizard5_nuxeo.works.global.queue.scheduled queue:videoConversion

Note that the tag value (videoConversion in the above example) can be rewritten as a valid identifier; for instance, a value of audit/writer is translated into audit-writer.
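
As an illustration of this flattening and sanitizing, here is a small self-contained sketch. It is not the Nuxeo reporter code; inserting the tag value before the last name segment simply mirrors the Graphite example above.

import java.util.Map;
import java.util.stream.Collectors;

public class FlattenSketch {

    // Flatten a dimensional metric (name + tags) into a hierarchical name by
    // inserting sanitized tag values before the last segment.
    static String flatten(String name, Map<String, String> tags) {
        int lastDot = name.lastIndexOf('.');
        String values = tags.values().stream()
                .map(v -> v.replace('/', '-'))      // e.g. audit/writer -> audit-writer
                .collect(Collectors.joining("."));
        return name.substring(0, lastDot + 1) + values + name.substring(lastDot);
    }

    public static void main(String[] args) {
        System.out.println(flatten("nuxeo.works.global.queue.scheduled",
                Map.of("queue", "videoConversion")));
        // -> nuxeo.works.global.queue.videoConversion.scheduled
    }
}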

JMX Reporter

To activate JMX reporting, set the following options in nuxeo.conf:

metrics.enabled=true
metrics.jmx.enabled=true

Metrics are exposed in a metrics MBean domain.

Note that to access the JMX MBean you need to activate JMX using JAVA_OPTS:

JAVA_OPTS=$JAVA_OPTS -Dcom.sun.management.jmxremote=true

You then have to manage security for this access, since there is no authentication by default.

The following example shows how to dump metrics using jmxterm:

$ echo "get -b metrics:name=nuxeo.works.queue.timer,queue=default,type=timers *" | java -jar /tmp/jmxterm-1.0.1-uber.jar -l localhost:1089 -n
Welcome to JMX terminal. Type "help" for available commands.
#mbean = metrics:name=nuxeo.works.queue.timer,queue=default,type=timers:
50thPercentile = 0.0;
DurationUnit = microseconds;
Max = 76000.0;
75thPercentile = 0.0;
95thPercentile = 0.0;
98thPercentile = 0.0;
99thPercentile = 0.0;
999thPercentile = 24000.0;
Mean = 67.96712231400461;
StdDev = 1250.994358152994;
Min = 0.0;
Count = 60;
RateUnit = events/second;
Sum = 509000.0;
FifteenMinuteRate = 0.021929133134494853;
FiveMinuteRate = 0.011975928378019026;
MeanRate = 0.047353163048175675;
OneMinuteRate = 5.608847244891202E-4;

Graphite Reporter

Metrics are pushed to Graphite every pollInterval. Default reporting is done using TCP, but UDP should be chosen if possible. If unspecified, the default prefix is server.<HOSTNAME>.nuxeo.

The nuxeo.conf options are:

metrics.enabled=true
metrics.graphite.enabled=true
metrics.graphite.host=GRAPHITE_HOST
metrics.graphite.port=2003
metrics.graphite.udp=false
metrics.graphite.pollInterval=60
metrics.graphite.prefix=


Datadog Reporter

Configuration:

metrics.enabled=true
metrics.datadog.enabled=true
metrics.datadog.pollInterval=60
metrics.datadog.apiKey=YOUR_API_KEY
metrics.datadog.tags=nuxeo
metrics.datadog.hostname=
metrics.datadog.udp=false
metrics.datadog.host=localhost
metrics.datadog.port=8125

The metrics.datadog.apiKey is required when publishing directly to Datadog; in this case, host and port are not required.

When UDP is chosen, the metrics.datadog.host and metrics.datadog.port must refer to a Datadog agent in charge of forwarding metrics into the cloud.

Additional tags can be added to all metrics using metrics.datadog.tags. Metrics are collected from Nuxeo and pushed to Datadog at the defined pollInterval.


Prometheus Reporter

Configuration:

metrics.enabled=true
metrics.prometheus.enabled=true
metrics.prometheus.port=9090

Nuxeo exposes a Prometheus endpoint on metrics.prometheus.port. You have to manage security for this access and make sure the endpoint is reachable from Prometheus so it can scrape the metrics.

Note that the Dropwizard metrics are exposed using OpenCensus; they are automatically prefixed with dropwizard5_ and suffixed depending on the metric type (_gauge, _count and _timer).


GCP Metrics (Stackdriver)

You need to provide a valid GCP project identifier.

Configuration:

metrics.enabled=true
metrics.stackdriver.enabled=true
metrics.stackdriver.gcpProjectId=MY-GCP-PROJECT-ID

The metrics will be prefixed by custom.googleapis.com/nuxeo/dropwizard5_


Metrics Filtering

A counter or gauge creates a single metric; a timer, on the other hand, generates 16 metrics (see the sketch after the list):

  • min: minimum measured time
  • max: maximum measured time
  • mean: average measured time
  • stddev: standard deviation
  • p50: median or percentile 50%
  • p75, p95, p98, p99, p999: percentiles
  • m1_rate: moving average of the throughput over the last minute
  • m5_rate, m15_rate: moving averages of the throughput over 5 and 15 minutes
  • mean_rate: average throughput
  • count: number of timer calls
  • sum: total cumulative time
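
To relate these expansions to the underlying library, here is a minimal sketch using the classic Dropwizard Metrics API (com.codahale.metrics package; the newer fork used by Nuxeo exposes similar accessors). The metric name is illustrative only.

import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class TimerExpansionsSketch {
    public static void main(String[] args) {
        Timer timer = new MetricRegistry().timer("works.queue.timer");
        timer.update(42, TimeUnit.MILLISECONDS);   // record one timed call

        Snapshot s = timer.getSnapshot();
        double min = s.getMin();                   // min
        double max = s.getMax();                   // max
        double mean = s.getMean();                 // mean
        double stddev = s.getStdDev();             // stddev
        double p50 = s.getMedian();                // p50
        double p99 = s.get99thPercentile();        // p99 (p75, p95, p98, p999 are similar)
        long count = timer.getCount();             // count
        double m1 = timer.getOneMinuteRate();      // m1_rate (m5_rate, m15_rate are similar)
        double meanRate = timer.getMeanRate();     // mean_rate
        System.out.printf("count=%d mean=%.0f p99=%.0f m1_rate=%.4f%n", count, mean, p99, m1);
    }
}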

The number of metrics depends on the deployed repositories, directories, caches, work queues, computations, streams, etc., which can add up to a huge number. Fortunately, it is possible to filter metrics.

Nuxeo comes with a default filter that:

  • Reduces timer metrics by removing some percentiles and rates
  • Accepts only a few cache and directory metrics
  • Denies the other cache and directory metrics
  • Denies log4j metrics for levels below WARN

This filter can be found in the common-base template, in the metrics-config.xml file:

  <extension target="org.nuxeo.runtime.metrics.MetricsService" point="configuration">
    <configuration enabled="${metrics.enabled}">
      ...
      <filter>
        <allow>
          <prefix>nuxeo.cache.default-cache.</prefix>
          <prefix>nuxeo.cache.user-entry-cache.</prefix>
          <prefix>nuxeo.cache.group-entry-cache.</prefix>
          <prefix>nuxeo.directories.directory.userDirectory</prefix>
          <prefix>nuxeo.directories.directory.groupDirectory</prefix>
        </allow>
        <deny>
          <prefix>nuxeo.cache</prefix>
          <prefix>nuxeo.directories</prefix>
          <prefix>nuxeo.ActionService</prefix>
          <prefix>org.apache.logging.log4j.core.Appender.debug.count</prefix>
          <prefix>org.apache.logging.log4j.core.Appender.info</prefix>
          <prefix>org.apache.logging.log4j.core.Appender.debug</prefix>
          <prefix>org.apache.logging.log4j.core.Appender.trace</prefix>
          <prefix>org.nuxeo.ecm.core.management.standby.StandbyComponent</prefix>
          <!-- Timer expansion to remove -->
          <expansion>stddev</expansion>
          <expansion>p75</expansion>
          <expansion>p98</expansion>
          <expansion>p999</expansion>
          <expansion>m5_rate</expansion>
          <expansion>m15_rate</expansion>
          <expansion>mean_rate</expansion>
        </deny>
      </filter>
    </configuration>
  </extension>

Note that the prefix value used in the allow/deny rules is matched against the beginning of the flattened metric name.
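
The following sketch illustrates these allow/deny semantics on flattened names. It is a simplified illustration, not the actual MetricsService implementation; in particular, the precedence of allow over deny shown here is an assumption inferred from the default filter above, where specific cache prefixes are allowed while the broader nuxeo.cache prefix is denied.

import java.util.List;
import java.util.Set;

public class MetricsFilterSketch {

    // Decide whether a flattened metric name is reported, given allow/deny
    // prefixes and denied timer expansions (illustrative logic only).
    static boolean reported(String flatName, List<String> allow, List<String> deny,
                            Set<String> deniedExpansions) {
        String lastSegment = flatName.substring(flatName.lastIndexOf('.') + 1);
        if (deniedExpansions.contains(lastSegment)) {
            return false;                                   // e.g. drop p999 or m15_rate
        }
        if (allow.stream().anyMatch(flatName::startsWith)) {
            return true;                                    // allowed prefixes win
        }
        return deny.stream().noneMatch(flatName::startsWith);
    }

    public static void main(String[] args) {
        List<String> allow = List.of("nuxeo.cache.default-cache.");
        List<String> deny = List.of("nuxeo.cache", "nuxeo.directories");
        Set<String> expansions = Set.of("stddev", "p999", "m15_rate");
        System.out.println(reported("nuxeo.cache.default-cache.hit", allow, deny, expansions)); // true
        System.out.println(reported("nuxeo.cache.other-cache.hit", allow, deny, expansions));   // false
    }
}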

Metrics Description

This section describes the main metrics available by categories.

You will find the metric names and their associated dimensions.

Nuxeo Scaling metrics

Since Nuxeo 2021.35, there is a scaling metric that indicates whether the current load requires scaling up (adding worker nodes) or scaling down (removing worker nodes).

The optimal number of worker nodes is the one that gives the maximum concurrency on active computations, the concurrency being limited by the number of partitions in their input streams (Kafka topics).

Note that we never scale down to 0 worker nodes, because a worker node is always needed to process async tasks such as reporting metrics, logging audit entries, and processing scheduled tasks.

  • nuxeo.streams.scale.metric: The number of nodes to add (> 0) or to remove (< 0); 0 is the target for optimal processing
  • nuxeo.cluster.worker.count: The current number of worker nodes

The sum of these two metrics is the optimal number of worker nodes to process the current load.

You can use the Management API endpoint /management/stream/scale to get more information and understand the value of the metric.
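
As a simple worked example of how the two metrics combine (the values below are made up; in practice they come from your monitoring system or from /management/stream/scale):

public class ScaleSketch {
    public static void main(String[] args) {
        int scaleMetric = 2;  // nuxeo.streams.scale.metric: nodes to add (> 0) or remove (< 0)
        int workerCount = 3;  // nuxeo.cluster.worker.count: current number of worker nodes

        // Optimal number of worker nodes for the current load; never scale to 0.
        int optimal = Math.max(1, workerCount + scaleMetric);
        System.out.println("Optimal worker nodes: " + optimal); // 5
    }
}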

Nuxeo Pubsub publish

Since Nuxeo 2021.35, there is a per-node counter that counts the number of messages published to the PubSub service.

  • nuxeo.pubsub.publish: Number of messages published from this node since the instance started.

Nuxeo Cache

All the following metrics have a cache dimension with the name of the cache. Common caches are default-cache, or the low-level caches used by directories, like user-entry-cache or group-entry-cache.

  • nuxeo.cache.hit: Number of cache hits
  • nuxeo.cache.miss: Number of cache misses
  • nuxeo.cache.hit.ratio: Hit ratio of the cache
  • nuxeo.cache.invalidation: Number of cache invalidations
  • nuxeo.cache.size: Size of the cache
  • nuxeo.cache.read: Number of cache gets (miss + hit)
  • nuxeo.cache.write: Number of cache puts

Nuxeo Directories

Directory metrics have a directory dimension with the name of the directory. Common directories are userDirectory and groupDirectory.

  • nuxeo.directories.directory.sessions.active: Number of active sessions
  • nuxeo.directories.directory.sessions.max: Maximum number of sessions used
  • nuxeo.directories.directory.cache.hit: Number of entry hits
  • nuxeo.directories.directory.cache.hit.null: Number of hits on non-existing entries
  • nuxeo.directories.directory.cache.invalidation: Number of invalidations
  • nuxeo.directories.directory.cache.miss: Number of cache misses
  • nuxeo.directories.directory.cache.size: Cache size

Nuxeo Elasticsearch

The Nuxeo Elasticsearch service uses timers.

  • nuxeo.elasticsearch.service.timer: Timer on the service dimension below
    • service=search: Invoking an Elasticsearch query
    • service=scroll: Invoking a next scroll query
    • service=index: Invoking indexation of a single document
    • service=bulkIndex: Invoking indexation of one or more documents
    • service=delete: Invoking a deletion query
    • service=fetch: Fetching documents from the repository to build the search response

Nuxeo Stream Importer

When using nuxeo-platform-importer/nuxeo-importer-stream, some metrics are reported to follow producer and consumer progress.

  • nuxeo.importer.stream.producer: Timer on producer hook
  • nuxeo.importer.stream.producers: Number of producers
  • nuxeo.importer.stream.consumers: Number of consumers
  • nuxeo.importer.stream.consumer.committed: Number of committed documents
  • nuxeo.importer.stream.consumer.batchCommit: Timer on batch transaction commit
  • nuxeo.importer.stream.consumer.batchFailure: Counter of batch failures

Nuxeo Instance UP

Each time a Nuxeo node starts, it reports the following metric, which can be aggregated to count the number of Nuxeo nodes in the cluster.

  • nuxeo.instance-up: One if the node is up

Nuxeo Repositories

Repository metrics have a repository dimension with the name of the repository.

  • nuxeo.repositories.repository.documents.create: Count the number of documents created since the node is up
  • nuxeo.repositories.repository.documents.delete: Count the number of documents deleted since the node is up
  • nuxeo.repositories.repository.documents.update: Count the number of documents updated since the node is up

  • nuxeo.repositories.repository.query: Timer on NXQL queries

  • nuxeo.repositories.repository.save: Timer on session saves
  • nuxeo.repositories.sessions: Counter of active sessions

  • nuxeo.repositories.repository.connection.count: Number of connections used

  • nuxeo.repositories.repository.connection.idle: Number of connections idle
  • nuxeo.repositories.repository.connection.killed: Killed connections

Repository document cache stats:

  • nuxeo.repositories.repository.cache.eviction.count
  • nuxeo.repositories.repository.cache.hit.count
  • nuxeo.repositories.repository.cache.hit.ratio
  • nuxeo.repositories.repository.cache.miss.count
  • nuxeo.repositories.repository.cache.read.count
  • nuxeo.repositories.repository.cache.size

Repository children documents cache stats:

  • nuxeo.repositories.repository.childCache.eviction.count
  • nuxeo.repositories.repository.childCache.hit.count
  • nuxeo.repositories.repository.childCache.hit.ratio
  • nuxeo.repositories.repository.childCache.miss.count
  • nuxeo.repositories.repository.childCache.read.count
  • nuxeo.repositories.repository.childCache.size

Nuxeo Streams

The Nuxeo Stream processor relies on a topology of computations. It is important to monitor failures because they require manual intervention.

  • nuxeo.streams.failure: Total number of computation threads terminated because of a failure on this node (all computations)

Each computation has the following metrics, with the computation name as a dimension, for instance audit/writer, bulk/csvExport, ...

  • nuxeo.streams.computation.failure: Number of computation threads terminated by a failure
  • nuxeo.streams.computation.skippedRecord: Number of records skipped after a failure, without terminating the computation thread
  • nuxeo.streams.computation.processRecord: Timer around the processing of an incoming record
  • nuxeo.streams.computation.processTimer: Timer around the computation's timer processing
  • nuxeo.streams.computation.running: Number of active computation threads processing records or timers

All these metrics are local to each Nuxeo node.

The following metrics represent the lag and latencies of computations. These metrics are global to the Nuxeo cluster and have 2 dimensions:

  • stream: the stream name, like audit/audit or bulk/command
  • group: the name of the consumer group, which is the computation name, like audit/writer or bulk/scroller

  • nuxeo.streams.global.stream.group.lag: Number of records between the end of the stream and the position of the consumer group
  • nuxeo.streams.global.stream.group.pos: Offset of the consumer group on this stream
  • nuxeo.streams.global.stream.group.end: Offset of the last record in the stream
  • nuxeo.streams.global.stream.group.latency: Latency of the consumer group when there is a lag

Understanding the Lag and Latency Metrics

Given a producer that appends 3 records in a stream source:

  • record1 at time t1, offset 1
  • record2 at time t2, offset 2
  • record3 at time t3, offset 3

A processor is started with a computation C1 reading from the previous stream. Until the first record of the stream has been processed, there is no lag or latency reported.

Once the first record has been processed we have the following metrics for group: C1, stream: source:

  • pos: offset 1
  • end: offset 3
  • lag: 2 (which is end - pos)
  • latency: now() - t1

After processing the second record we have:

  • pos: offset 2
  • end: offset 3
  • lag: 1
  • latency: now() - t2

After the third and last record, there is no more lag or latency, as the computation has caught up with the producer:

  • pos: offset 3
  • end: offset 3
  • lag: 0
  • latency: 0
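
The same computation, written as a minimal sketch (variable names and values are illustrative, not Nuxeo APIs):

public class LagLatencySketch {
    public static void main(String[] args) {
        long end = 3;  // offset of the last record appended to the stream
        long pos = 1;  // offset of the consumer group: record1 has just been processed
        long t1 = System.currentTimeMillis() - 5_000;  // timestamp of that processed record

        long lag = end - pos;                                            // 2
        long latencyMs = lag > 0 ? System.currentTimeMillis() - t1 : 0;  // now() - t1, 0 once caught up
        System.out.println("lag=" + lag + " latencyMs=" + latencyMs);
    }
}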

Nuxeo Transactions

  • nuxeo.transactions.concurrency: Current number of concurrent transactions
  • nuxeo.transactions.concurrency.max: Maximum registered concurrent transactions
  • nuxeo.transactions.rollbacks: Number of transaction rollbacks
  • nuxeo.transactions.timer: Timer on all transactions

Nuxeo Web

Metrics reported by the Web NuxeoAuthenticationFilter:

  • nuxeo.web.authentication.logged-users: Number of users who logged in successfully and did not explicitly log out; session timeout is not taken into account
  • nuxeo.web.authentication.requests.concurrency: Current number of concurrent requests in the filter
  • nuxeo.web.authentication.requests.concurrency.max: Maximum registered number of concurrent requests
  • nuxeo.web.authentication.requests.timer: Timer on the request web filter

Nuxeo WorkManager

The WorkManager service provides metrics per queue (default, fulltextUpdater, ...).

Some are global to the Nuxeo cluster:

  • nuxeo.works.global.queue.scheduled: Number of scheduled Work
  • nuxeo.works.global.queue.completed: Number of completed Work
  • nuxeo.works.global.queue.running: Number of running Work
  • nuxeo.works.global.queue.canceled: Number of canceled Work

Some are local to the node:

  • nuxeo.works.dlq: Number of Work in failure moved to the dead letter queue
  • nuxeo.works.queue.timer: Timer for Work execution

JVM

Can be disabled using:

metrics.jvm.enabled=false

Some metrics depend on the chosen garbage collector or the version of the JVM:

  • jvm.buffers.direct.capacity
  • jvm.buffers.direct.count
  • jvm.buffers.direct.used
  • jvm.buffers.mapped.capacity
  • jvm.buffers.mapped.count
  • jvm.buffers.mapped.used
  • jvm.files
  • jvm.garbage.G1*
  • jvm.memory.heap.committed
  • jvm.memory.heap.init
  • jvm.memory.heap.max
  • jvm.memory.heap.usage
  • jvm.memory.heap.used
  • jvm.memory.non-heap.committed
  • jvm.memory.non-heap.init
  • jvm.memory.non-heap.max
  • jvm.memory.non-heap.usage
  • jvm.memory.non-heap.used
  • jvm.memory.total.committed
  • jvm.memory.total.init
  • jvm.memory.total.max
  • jvm.memory.total.used
  • jvm.threads.blocked.count
  • jvm.threads.count
  • jvm.threads.daemon.count
  • jvm.threads.deadlock.count
  • jvm.threads.deadlocks
  • jvm.threads.new.count
  • jvm.threads.runnable.count
  • jvm.threads.terminated.count
  • jvm.threads.timed_waiting.count
  • jvm.threads.waiting.count

Tomcat

Can be disabled using:

metrics.tomcat.enabled=false

  • tomcat.activeSessions
  • tomcat.bytesReceived
  • tomcat.bytesSent
  • tomcat.currentThreadBusy
  • tomcat.currentThreadCount
  • tomcat.errorCount
  • tomcat.jdbc-numActive
  • tomcat.jdbc-numIdle
  • tomcat.processingTime
  • tomcat.requestCount

Log4j

Can be disabled using:

metrics.log4j.enabled=false

  • org.apache.logging.log4j.core.Appender.all: Number of logs traced (all levels)
  • org.apache.logging.log4j.core.Appender.warn: Number of logs at WARN level
  • org.apache.logging.log4j.core.Appender.error: Number of logs at ERROR level
  • org.apache.logging.log4j.core.Appender.fatal: Number of logs at FATAL level

Monitoring Dashboard

The above metrics can be rendered in dashboards depending on the chosen monitoring systems.

Grafana

Grafana can be used to render Graphite or Prometheus metrics.

The Nuxeo Stacks tool, which generates dev and testing environments for Nuxeo, provides a Grafana dashboard using a Graphite backend.

The JSON dashboard can be found in its sources.


Datadog

The same dashboard can be built using Datadog.
