Nuxeo can be clustered between several nodes (a.k.a. instances or machines) with the appropriate configuration. In addition, an HTTP load balancer with session affinity must be used in front of the nodes.
Requirements
To enable clustering, you must have at least two nodes with:
- A shared database
- A shared filesystem (unless you use an external binary store like S3)
- A dedicated Elasticsearch cluster, if using Elasticsearch
- A Redis server
- A load balancer with sticky sessions
The shared filesystem is usually an NFS mount. You must not share the whole Nuxeo installation tree (see below).
The load balancer must use sticky sessions if the clustering delay is not 0. Having a non-0 clustering delay is recommended for performance reasons. See below for more.
Shared Filesystem Configuration
The complete Nuxeo instance hierarchy must not be shared between all instances. However, a few things must or should be shared.
Binaries
The repository.binary.store directory (nxserver/data/binaries by default) must be shared by all Nuxeo instances in order for VCS to function correctly.
Temporary Directory
The temporary directory configured through nuxeo.tmp.dir must not be shared by all instances, because a few name collision issues may still occur, especially during startup.
However, in order for various no-copy optimizations to be effective, the temporary directory should be on the same filesystem as the binaries directory. The recommended way to achieve this is to have each instance's nuxeo.tmp.dir point to a different subdirectory of the shared filesystem.
Transient Store
The caching directory used by any Transient Store accessed by multiple Nuxeo instances must be shared by all instances. This caching directory is located in nxserver/data/transientstores/<transientstore_name>.
By default there is only one Transient Store contribution, named default:
<extension target="org.nuxeo.ecm.core.transientstore.TransientStorageComponent"
    point="store">
  <store name="default" class="...">
    ...
  </store>
</extension>
Therefore, you need to create a symbolic link named default in the nxserver/data/transientstores directory, pointing to a shared directory. Do the same for any other TransientStore you have contributed that is intended to be shared by multiple instances of the cluster.
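The symbolic link described above can be created by hand. As a minimal sketch, the hypothetical Python helper below does the same thing and refuses to replace an existing directory. The function name and the example paths are assumptions made for illustration, and the sketch assumes the data directory is the default nxserver/data under the Nuxeo home.

```python
import os
from pathlib import Path


def link_transient_store(nuxeo_home: str, shared_root: str,
                         store_name: str = "default") -> Path:
    """Point nxserver/data/transientstores/<store_name> at a shared directory.

    nuxeo_home and shared_root are illustrative parameters, not Nuxeo settings.
    """
    stores_dir = Path(nuxeo_home) / "nxserver" / "data" / "transientstores"
    target = Path(shared_root) / store_name
    link = stores_dir / store_name

    # Make sure both the local parent directory and the shared target exist.
    stores_dir.mkdir(parents=True, exist_ok=True)
    target.mkdir(parents=True, exist_ok=True)

    if link.is_symlink():
        # Already linked: accept it only if it points at the expected target.
        if Path(os.readlink(link)) == target:
            return link
        raise FileExistsError(f"{link} already links to {os.readlink(link)}")
    if link.exists():
        # A real directory is in the way; refuse rather than delete cached data.
        raise FileExistsError(f"{link} exists and is not a symbolic link")

    link.symlink_to(target, target_is_directory=True)
    return link
```

For example, link_transient_store("/opt/nuxeo", "/mnt/nuxeo-shared/transientstores") could be run once on each node, where both paths are placeholders for your own installation and shared mount.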
VCS Cluster Configuration
Setup
The cluster nodes must only share the binaries folder (configured with repository.binary.store), not the entire data directory (configured with nuxeo.data.dir). The data directory contains data related to features that do not work in a cluster environment, in particular everything related to Nuxeo Package management.
To set up clustering, update the following parameters in nuxeo.conf:
- repository.clustering.enabled must be true to enable clustering.
- repository.clustering.id: it is now highly recommended to set an explicit cluster node id. The id must be an integer for all databases, unless you are using Oracle, which accepts a string. See NXP-17180 for more explanations.
- repository.clustering.delay is expressed in milliseconds, and specifies a delay during which invalidations don't need to be processed. Using a non-0 value is an important optimization, as otherwise every single transaction, even a read-only one, would have to hit the database to check invalidations between several nodes. However, this means that one node may not immediately see the changes made on another node, which is a problem if you don't use sticky sessions on the load balancer.
- repository.binary.store must point to a shared storage unless you use an external binary store like S3. Under Windows, the path value can be UNC formatted, for instance \\servername\sharename.
- nuxeo.db.validationQuery must contain a SELECT clause for validating connections in the pool according to your database type, for instance SELECT 1 on PostgreSQL or SELECT 1 FROM dual on Oracle.
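To make the parameters above concrete, here is a minimal sketch that renders the corresponding nuxeo.conf lines for each node. The helper names, the 1000 ms default delay, the example mount path and the check that node ids differ between nodes are assumptions made for illustration, not values prescribed by Nuxeo. Integer ids are used here; on Oracle a string would also be accepted.

```python
from typing import Dict, List


def cluster_conf_lines(node_id: int, binary_store: str,
                       delay_ms: int = 1000,
                       validation_query: str = "SELECT 1") -> List[str]:
    """Return the nuxeo.conf clustering lines for one node.

    delay_ms defaults to an illustrative non-0 value; tune it for your setup.
    """
    if delay_ms < 0:
        raise ValueError("repository.clustering.delay must not be negative")
    return [
        "repository.clustering.enabled=true",
        f"repository.clustering.id={node_id}",
        f"repository.clustering.delay={delay_ms}",
        f"repository.binary.store={binary_store}",
        f"nuxeo.db.validationQuery={validation_query}",
    ]


def conf_for_cluster(node_ids: List[int],
                     binary_store: str) -> Dict[int, List[str]]:
    """Render the lines for every node, rejecting duplicate node ids."""
    if len(set(node_ids)) != len(node_ids):
        raise ValueError("each node needs its own repository.clustering.id")
    return {node_id: cluster_conf_lines(node_id, binary_store)
            for node_id in node_ids}
```

Calling conf_for_cluster([1, 2], "/mnt/nuxeo-shared/binaries") returns one list of lines per node id, which can then be appended to each node's nuxeo.conf. For Oracle, pass validation_query="SELECT 1 FROM dual" to cluster_conf_lines.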
There is a dedicated page detailing all the VCS configuration options.
Checking the SQL Tables Initialization
- Start the SQL server, all Nuxeo nodes (the first one alone and the others afterwards, to avoid concurrent initialization of the SQL tables) and the load balancer.
- Log in on the HTTP user interface on each cluster node.
- Check in the database that the cluster_nodes table is initialized with one row per node:
nuxeo-db=# select * from cluster_nodes;
 nodeid |          created
--------+----------------------------
  25767 | 2009-07-29 14:36:08.769657
  32546 | 2009-07-29 14:39:18.437264
(2 rows)
Checking VCS Cache Invalidations
- Create a document and browse it from two different nodes.
- Edit the title from one node.
- Navigate back to the document from the second node to check that the change is visible.
- You can also monitor what's happening in the cluster_invals table to see cache invalidation information.
Quartz Scheduler Cluster Configuration
A clustered Nuxeo environment should be configured to use Quartz scheduling. The Quartz scheduling component allows nodes to coordinate scheduled tasks between themselves: a single task is routed to a single node and executed on that node only. This ensures that scheduled events, like periodic cleanups or periodic imports, are executed on only one node and not on all nodes at the same time.
Standard configurations for Tomcat are available as Nuxeo templates for PostgreSQL, Oracle and SQL Server.
In most cases, each node in the cluster should be configured to include the template matching its database.
- Populate the database with the tables needed by Quartz (named QRTZ_*). The DDL scripts come from the standard Quartz distribution and are available in the Nuxeo templates in $NUXEO_HOME/templates/<database>-quartz-cluster/bin/create-quartz-tables.sql.
- Enable the Quartz-specific cluster templates by adding the template <database>-quartz-cluster. In most cases you will include this template on each node in the cluster.
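The two steps above can be sketched as a small helper that derives the template name, the location of the DDL script and the resulting template list. This is only an illustration: the function names are hypothetical, /opt/nuxeo is a placeholder for $NUXEO_HOME, and the sketch assumes templates are enabled through the nuxeo.templates property in nuxeo.conf.

```python
from pathlib import PurePosixPath


def quartz_template(database: str) -> str:
    """Name of the Quartz cluster template for a database template name."""
    return f"{database}-quartz-cluster"


def quartz_ddl_script(nuxeo_home: str, database: str) -> PurePosixPath:
    """Path of the script that creates the QRTZ_* tables."""
    return (PurePosixPath(nuxeo_home) / "templates" / quartz_template(database)
            / "bin" / "create-quartz-tables.sql")


def templates_line(current_templates: str, database: str) -> str:
    """Return a nuxeo.templates line with the Quartz template appended once."""
    templates = [name.strip() for name in current_templates.split(",")
                 if name.strip()]
    name = quartz_template(database)
    if name not in templates:
        templates.append(name)
    return "nuxeo.templates=" + ",".join(templates)
```

With these helpers, quartz_ddl_script("/opt/nuxeo", "postgresql") gives the script to run against the database with your usual SQL client, and templates_line("postgresql", "postgresql") gives the line to set on each node.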
In cluster mode the schedule contributions (deployed from plugins or configuration files) must be the same on all nodes.
Any instance using a clustered Quartz configuration tries to get a lock on the next scheduled job execution. Those locks are managed and shared through the database. The time must be synchronized on all instances. You should use NTP for that.
While performing a rolling upgrade on Nuxeo servers, the lock may be swapped between the instances, in which case you may encounter a warning on startup:
This scheduler instance (host-nuxeo.example.com1478524375548) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.
This warning is harmless as long as the clocks are correctly synchronized through NTP.
About Session Affinity
We advise using a session affinity mechanism: once a user is connected to a node, their requests should always be directed to that node.
There are several reasons why we advise this configuration.
Invalidations
The Nuxeo Cluster system takes care of propagating invalidations between all nodes of the cluster.
However, for performance reasons, there is a small delay by default: this means that without affinity, one call could create a document and a second call might not see it. Of course this state is transient, and after a few milliseconds it will be resolved. However, in the context of a "multi-page transaction" this could be an issue.
Having session affinity solves the visible issues. If the session affinity cannot be restored, for example because the target server has been shut down, this won't be an issue in 99.99% of cases.
Authentication
The Nuxeo Platform requires all calls to be authenticated. Depending on your architecture, authentication can be stateless (e.g. Basic Auth) or stateful (e.g. Form + Cookie). Either way, you probably don't want to replay authentication on every call.
That's why combining session-based authentication with session affinity can make sense: you don't have to re-authenticate each time you call the server.
If the session affinity cannot be restored, for example because the target server has been shut down:
- stateless authentication (e.g. Basic Auth) will be replayed automatically
- for stateful authentication:
  - if you have an SSO, this will be transparent
  - if you don't have an SSO, the user will have to authenticate again.
State Management and UI Rendering
The UI can be stateful or stateless:
- The default back office is based on JSF, which is stateful.
- The Nuxeo Platform also provides stateless UIs such as WebEngine/Freemarker and AngularJS.
If the UI layer you use is stateful, you have to use stateful load balancing.
However, in the case of Nuxeo JSF, since most of the navigation links are RESTful, switching servers won't be an issue. But of course, for real JSF POST requests, since the server-side state is not shared, session affinity is required.
Technically, we don't push for using shared server-side state: JSF state is complex and changes a lot, so replicating it between servers is too costly.
HTTP Load Balancer Configuration
Set up an HTTP or AJP load balancer such as Apache with mod_proxy or mod_proxy_ajp, or Pound, and configure it to keep session affinity by tracking the value of the JSESSIONID cookie and the ;jsessionid URL parameter.
If you use a stateless load balancer such as the Apache modules mod_jk and mod_proxy_balancer, you need to make the HTTP server generate JSESSIONID cookies with values that end with .nxworker<n>, where nxworker<n> is a string suffix specific to each node (you can use any string).
- In nuxeo.conf, specify a different nuxeo.server.jvmRoute for each node, for instance nuxeo.server.jvmRoute=nxworker1. This will instruct the Nuxeo preprocessing phase to correctly fill the jvmRoute attribute of the Engine element in the generated server.xml.
- Configure your stateless balancer to follow these routes. For instance, here is the relevant configuration fragment when using mod_proxy_balancer:
ProxyPass /nuxeo balancer://sticky-balancer stickysession=JSESSIONID|jsessionid nofailover=On
<Proxy balancer://sticky-balancer>
  BalancerMember http://192.168.2.101:8080/nuxeo route=nxworker1
  BalancerMember http://192.168.2.102:8080/nuxeo route=nxworker2
</Proxy>
To enable automatic eviction of unhealthy instances on your balancer, you may need a health check.
The following check ensures that the Nuxeo runtime is initialized and up: HTTP:200:/nuxeo/running_status?info=reload.
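If your balancer cannot run such a check itself, the same probe can be scripted. The sketch below is a hypothetical helper, not part of Nuxeo: it requests the running_status URL given above and treats an HTTP 200 response as healthy and anything else (including connection errors) as unhealthy. The function name, the 5-second timeout and the injectable opener parameter (used only to test without a network) are assumptions.

```python
import urllib.error
import urllib.request


def nuxeo_is_up(base_url: str, timeout: float = 5.0,
                opener=urllib.request.urlopen) -> bool:
    """Return True if <base_url>/running_status?info=reload answers HTTP 200.

    base_url is the node's Nuxeo context, e.g. http://192.168.2.101:8080/nuxeo
    """
    url = base_url.rstrip("/") + "/running_status?info=reload"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx answers raise HTTPError (a URLError); refused connections
        # and timeouts raise URLError or OSError. All count as "not up".
        return False
```

A monitoring job could call nuxeo_is_up("http://192.168.2.101:8080/nuxeo") for each balancer member and take a node out of rotation when it returns False.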
Troubleshooting Session Affinity Problems
To test that the load balancer forwards the HTTP requests of a given session to the same node:
- Add a new file, $NUXEO_HOME/nxserver/nuxeo.war/clusterinfo.html, on each node (after Tomcat has started).
  On the first node:
  <html><body>Node 1</body></html>
  and on the second node:
  <html><body>Node 2</body></html>
- Using a browser with an active Nuxeo session (an already logged-in user), go to http://yourloadbalancer/nuxeo/clusterinfo.html and check that you always return to the same node when hitting the refresh button of the browser.
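The manual refresh test can also be automated. The following is a rough sketch under stated assumptions: it replays the JSESSIONID cookie of an existing session (copied from the browser, for instance) against the clusterinfo.html page and collects the distinct page bodies returned. The function names, the number of attempts and the idea of passing the cookie value by hand are all illustrative choices.

```python
import urllib.request
from typing import Callable, Set


def fetch_with_cookie(url: str, jsessionid: str, timeout: float = 5.0) -> str:
    """Fetch url while presenting an existing JSESSIONID cookie."""
    request = urllib.request.Request(
        url, headers={"Cookie": f"JSESSIONID={jsessionid}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace").strip()


def nodes_seen(fetch: Callable[[], str], attempts: int = 20) -> Set[str]:
    """Call fetch repeatedly and return the distinct bodies it produced."""
    return {fetch() for _ in range(attempts)}


def affinity_holds(fetch: Callable[[], str], attempts: int = 20) -> bool:
    """True when every attempt was answered by the same node."""
    return len(nodes_seen(fetch, attempts)) == 1
```

For example, affinity_holds(lambda: fetch_with_cookie("http://yourloadbalancer/nuxeo/clusterinfo.html", "<your session id>")) should return True when sticky sessions work. If it returns False, nodes_seen shows which node pages were served.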