Amazon S3 Online Storage

Updated: November 15, 2024

The Amazon S3 Online Storage is a Nuxeo Binary Manager for S3. It stores Nuxeo's binaries (the attached documents) in an Amazon S3 bucket.

Before You Start

You should be familiar with Amazon S3 and be in possession of your credentials.

Installation

This addon requires no specific installation steps. It can be installed like any other package, either with the nuxeoctl command line or from the Marketplace.

Nuxeo Configuration

In order to configure the package, you will need to provide values for the configuration parameters that define your S3 credentials, bucket and encryption choices.

For a single repository, you can do the configuration using the nuxeo.conf properties described below. For more complex setups, you will need to use an XML extension point, which is described in the section Nuxeo Configuration Through Extension Point.

Specifying Your Amazon Credentials and Region

In nuxeo.conf, add the following lines:

nuxeo.aws.accessKeyId=your_AWS_ACCESS_KEY_ID
nuxeo.aws.secretKey=your_AWS_SECRET_ACCESS_KEY
nuxeo.aws.region=your_AWS_REGION

If your Nuxeo instance runs on Amazon EC2 or Amazon ECS, you can also transparently use IAM instance roles, in which case you do not need to specify the AWS ID and secret (the credentials will be fetched automatically from the instance metadata). The same applies to the region.

If you used explicit configuration, the file nuxeo.conf now contains S3 secret access keys, so you should restrict who can read it.
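One way to limit exposure is to make nuxeo.conf readable only by the account that runs Nuxeo. The following is a minimal sketch for POSIX systems, not something shipped with the addon; the function name restrict_to_owner is hypothetical.

```python
import os
import stat


def restrict_to_owner(path):
    """Make the file readable and writable by its owner only (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can verify them.
    return stat.S_IMODE(os.stat(path).st_mode)
```

You would call it with the path to your own nuxeo.conf, using the same operating system account that runs the Nuxeo server.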

The region code can be found in the S3 Region Documentation. The default is us-east-1. At the time this documentation was written, the list was:

  • us-east-1: US East (N. Virginia) (default)
  • us-east-2: US East (Ohio)
  • us-west-1: US West (N. California)
  • us-west-2: US West (Oregon)
  • eu-west-1: EU (Ireland)
  • eu-west-2: EU (London)
  • eu-west-3: EU (Paris)
  • eu-central-1: EU (Frankfurt)
  • ap-south-1: Asia Pacific (Mumbai)
  • ap-southeast-1: Asia Pacific (Singapore)
  • ap-southeast-2: Asia Pacific (Sydney)
  • ap-northeast-1: Asia Pacific (Tokyo)
  • ap-northeast-2: Asia Pacific (Seoul)
  • ap-northeast-3: Asia Pacific (Osaka-Local)
  • sa-east-1: South America (São Paulo)
  • ca-central-1: Canada (Central)
  • cn-north-1: China (Beijing)
  • cn-northwest-1: China (Ningxia)

Specifying Your Amazon S3 Parameters

You must specify the S3 bucket to use, and optionally a prefix:

nuxeo.s3storage.bucket=your_BUCKET
# the following is optional
nuxeo.s3storage.bucket_prefix=yourfolder/

Bucket names are unique across all of Amazon S3, so you should choose something original and specific.

The optional bucket prefix is used to localize your binaries within a specific S3 folder (the bucket_prefix syntax is available since Nuxeo 7.10-HF03).

If you are using an S3-compatible storage service, you will most likely also need to set the endpoint parameter in nuxeo.conf. Depending on the service vendor (check their documentation), you may also want to configure connections to use path-style access for the bucket name:

nuxeo.s3storage.endpoint=hostname
nuxeo.s3storage.pathstyleaccess=true

If you installed the bundle JAR manually instead of using the Nuxeo Package you will also need:

nuxeo.core.binarymanager=org.nuxeo.ecm.blob.s3.S3BlobProvider

The following are compatibility properties that can still be used but are deprecated (you should use global AWS configuration or IAM instance roles as described above):

nuxeo.s3storage.awsid=your_AWS_ACCESS_KEY_ID
nuxeo.s3storage.awssecret=your_AWS_SECRET_ACCESS_KEY
nuxeo.s3storage.region=your_AWS_REGION
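As a quick sanity check of the properties described so far, the sketch below parses nuxeo.conf-style text and reports a missing bucket or the use of deprecated properties. It is only an illustration: the helper names parse_properties and check_s3_settings are hypothetical, and the parser handles plain key=value lines only.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


# Deprecated property -> preferred replacement, as listed in this section.
DEPRECATED = {
    "nuxeo.s3storage.awsid": "nuxeo.aws.accessKeyId",
    "nuxeo.s3storage.awssecret": "nuxeo.aws.secretKey",
    "nuxeo.s3storage.region": "nuxeo.aws.region",
}


def check_s3_settings(props):
    """Return a list of human-readable findings for the S3 settings."""
    findings = []
    if not props.get("nuxeo.s3storage.bucket"):
        findings.append("missing required property nuxeo.s3storage.bucket")
    for old, new in DEPRECATED.items():
        if old in props:
            findings.append(f"{old} is deprecated, use {new} or an IAM role")
    return findings


sample = """
# the bucket line was forgotten here
nuxeo.s3storage.bucket_prefix=yourfolder/
nuxeo.s3storage.awsid=your_AWS_ACCESS_KEY_ID
"""
for finding in check_s3_settings(parse_properties(sample)):
    print(finding)
```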

Client-Side Crypto Options

With S3 you have the option of storing your data encrypted using S3 Client-Side Encryption. Note that the local cache will not be encrypted.

The S3 Binary Manager can use a keystore containing a keypair, but there are a few caveats to be aware of:

Don't forget to specify the key algorithm if you create your keypair with the keytool command, as this won't work with the default (DSA). The S3 Binary Manager has been tested with a keystore generated with this command:

keytool -genkeypair -keystore </path/to/keystore/file> -alias <key alias> -storepass <keystore password> -keypass <key password> -dname <key distinguished name> -keyalg RSA

If you get keytool error: java.io.IOException: Incorrect AVA format, then ensure that the distinguished name parameter has a form such as: -dname "CN=AWS S3 Key, O=example, DC=com".

Don't forget to make backups of the /path/to/keystore/file file along with the store password, key alias and key password. If you lose them (for instance if the EC2 machine hosting the Nuxeo instance with the original keystore is lost) you will lose the ability to recover any encrypted blob from the S3 bucket.

With all that above in mind, here are the crypto options that you can add to nuxeo.conf (they are all mandatory once you specify a keystore):

nuxeo.s3storage.crypt.keystore.file=/absolute/path/to/the/keystore/file
nuxeo.s3storage.crypt.keystore.password=the_keystore_password
nuxeo.s3storage.crypt.key.alias=the_key_alias
nuxeo.s3storage.crypt.key.password=the_key_password
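Because all four options become mandatory as soon as a keystore is specified, a partial configuration is easy to miss. The sketch below, which assumes the properties have already been loaded into a dictionary, lists the options that are still missing; the function name missing_crypto_options is hypothetical.

```python
CRYPTO_KEYS = (
    "nuxeo.s3storage.crypt.keystore.file",
    "nuxeo.s3storage.crypt.keystore.password",
    "nuxeo.s3storage.crypt.key.alias",
    "nuxeo.s3storage.crypt.key.password",
)


def missing_crypto_options(props):
    """Return the crypto options still missing once a keystore is configured.

    An empty list means either that no keystore is configured at all, or
    that all four options are present.
    """
    if not props.get("nuxeo.s3storage.crypt.keystore.file"):
        return []
    return [key for key in CRYPTO_KEYS if not props.get(key)]
```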

Server-Side Crypto Options

Alternatively, you can use S3 Server-Side Encryption with the following option:

nuxeo.s3storage.crypt.serverside=true

Client-Side Encryption is safer than Server-Side Encryption. With Client-Side Encryption an attacker needs both access to the AWS credentials and the key to be able to access the unencrypted data while Server-Side Encryption will only require the potential attacker to provide the AWS credentials.

If you want to use Server-Side Encryption with AWS KMS–Managed Keys, specify your key with:

nuxeo.s3storage.crypt.kms.key = your-key-id

Cache Options

Files retrieved from S3 are cached locally for speed. You can configure the maximum cache size (in bytes or with the standard KB, MB, GB or TB suffixes), the maximum number of files in the cache, and the minimum age (in seconds) a file should have before being eligible for purge (the age is the time since last file access).

nuxeo.s3storage.cachesize=100MB
nuxeo.s3storage.cachecount=10000
nuxeo.s3storage.cacheminage=3600

cachecount and cacheminage are available since Nuxeo 7.10-HF03.
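To get a feel for what a cachesize value means in bytes, here is a small sketch that converts a size with an optional suffix. It assumes binary multiples (1 KB = 1024 bytes), which may not match how your Nuxeo version interprets the suffixes; the function name parse_size is hypothetical.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes); check your Nuxeo version.
_UNITS = {"": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(value):
    """Convert a size such as '100MB' or '4096' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]B)?\s*", value, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[(unit or "").upper()]
```

Under that assumption, parse_size("100MB") gives 104857600 bytes.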

Digest Algorithm

By default the blobs are stored in S3 based on their MD5 digest (hash). The digest algorithm can be changed if required, for instance:

nuxeo.s3storage.digest=SHA-256
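To illustrate what storing a blob under its digest means, the sketch below computes the hexadecimal digest of some content with Python's hashlib. Treating the object key as the bucket prefix followed by the hex digest is an assumption made for illustration, and blob_key is a hypothetical name.

```python
import hashlib


def blob_key(data, algorithm="MD5", bucket_prefix=""):
    """Return the hex digest of data, preceded by the bucket prefix."""
    # hashlib uses names like 'md5' and 'sha256' rather than 'MD5' / 'SHA-256'.
    name = algorithm.replace("-", "").lower()
    return bucket_prefix + hashlib.new(name, data).hexdigest()
```

Two blobs with identical content produce the same digest, which is why changing the digest algorithm changes the key under which new content is stored.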

Download From S3 Options

You can configure downloads to be directly served to the user from S3 without going through Nuxeo. To do so, use:

nuxeo.s3storage.directdownload=true
nuxeo.s3storage.directdownload.expire=3600

The expire time is expressed in seconds (the default is one hour) and determines how long the generated S3 URLs are valid. Having short-lived URLs is better for security, but too short an expiration time could be problematic if your server clock is not exactly in sync with the absolute official time used by S3.

Before Nuxeo 7.10 the configuration was done using property nuxeo.s3storage.downloadfroms3 instead of nuxeo.s3storage.directdownload (same with expire). This is still available for backward compatibility after Nuxeo 7.10 but will be removed in a future version, so the nuxeo.s3storage.directdownload version above should be preferred.
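The trade-off between expiration time and clock drift can be sketched with simple arithmetic. The model below is a simplification: it only covers a server clock that lags behind S3, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def url_expiry(signed_at, expire_seconds=3600):
    """Return the instant at which a URL signed at signed_at stops working."""
    return signed_at + timedelta(seconds=expire_seconds)


def effective_lifetime(expire_seconds, server_lag_seconds):
    """Seconds a client really has when the signing clock lags behind S3."""
    return max(0, expire_seconds - server_lag_seconds)
```

For example, with an expire value of 60 and a server clock that is 90 seconds late, effective_lifetime returns 0, meaning the URL would already be expired from the point of view of S3 when the client receives it.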

CORS Configuration

Web UI triggers some blob downloads from XHR (e.g. Bulk Download, CSV Export, etc.) and will require the following CORS configuration on your S3 bucket:

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <CORSRule>
    <AllowedOrigin>http://localhost:8080</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
    <ExposeHeader>Content-Disposition</ExposeHeader>
  </CORSRule>
</CORSConfiguration>

Make sure to replace http://localhost:8080 with the address where your Nuxeo instance is deployed.
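The CORS editor in the current S3 console expects JSON rather than XML. The following sketch converts a CORSConfiguration document like the one above into the equivalent JSON rule list using only the Python standard library; the function name cors_xml_to_json is hypothetical and only the elements used in this documentation are handled.

```python
import json
import xml.etree.ElementTree as ET

NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"

# XML element name -> key used in the JSON form of a CORS rule.
_LIST_FIELDS = {
    "AllowedOrigin": "AllowedOrigins",
    "AllowedMethod": "AllowedMethods",
    "AllowedHeader": "AllowedHeaders",
    "ExposeHeader": "ExposeHeaders",
}


def cors_xml_to_json(xml_text):
    """Convert a CORSConfiguration XML document to a JSON list of rules."""
    # Strip surrounding whitespace so the XML declaration stays at the start.
    root = ET.fromstring(xml_text.strip())
    rules = []
    for rule in root.findall(f"{NS}CORSRule"):
        converted = {}
        for child in rule:
            tag = child.tag.replace(NS, "")
            if tag in _LIST_FIELDS:
                converted.setdefault(_LIST_FIELDS[tag], []).append(child.text)
            elif tag == "MaxAgeSeconds":
                converted[tag] = int(child.text)
        rules.append(converted)
    return json.dumps(rules, indent=2)
```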

Transient stores are much easier to configure now thanks to the combination of NXP-26594 and NXP-26581.
Since Nuxeo Platform 10.10 we are no longer using SimpleTransientStore because it is not cluster-compatible.

Now, unless configured otherwise, transient stores share the S3 configuration of the default binary store and keep their blobs in a "subfolder" of the S3 bucket. Each transient store still has its own TTL/GC cleanup lifecycle, separate from the default one, because everything is per-"folder".

So given that now transient blobs come from S3, if direct download is enabled and JavaScript generates the download, a CORS configuration is needed on the bucket.

Connection Pool Options

You can configure the internal S3 connection pool. This pool has a size of 50 by default, so if you've configured Nuxeo to use more sessions than this and all the sessions are accessing S3, you may run out of connections.

The following parameters can be used to change some connection pool parameters (the defaults are shown):

nuxeo.s3storage.connection.max=50
nuxeo.s3storage.connection.retry=3
nuxeo.s3storage.connection.timeout=50000
nuxeo.s3storage.socket.timeout=50000

The timeouts are expressed in milliseconds.

You can read more about these parameters on the AWS ClientConfiguration documentation page.

Checking Your Configuration

To verify that the installation went well, check your startup logs and look for a line like:

INFO  [CachingBinaryManager] Registering binary manager 'default' using S3BinaryManager

Don't forget to enable the INFO level for the group org.nuxeo in $NUXEO_HOME/lib/log4j2.xml to see INFO level messages from Nuxeo classes.

If your configuration is incorrect, this line will be followed by some error messages describing the problems encountered.
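This check can be automated with a few lines of Python. The sketch below looks for the registration line and collects the error lines that follow it. It assumes that error messages contain the word ERROR and that looking at the next 20 lines is enough, both of which are arbitrary choices; the function name check_startup_log is hypothetical.

```python
MARKER = "Registering binary manager 'default' using S3BinaryManager"


def check_startup_log(lines, window=20):
    """Return (found, errors).

    found tells whether the registration line was logged; errors holds the
    ERROR lines among the `window` lines that follow it.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        if MARKER in line:
            following = lines[index + 1:index + 1 + window]
            errors = [entry.rstrip("\n") for entry in following if "ERROR" in entry]
            return True, errors
    return False, []
```

The lines argument can be any iterable of strings, for instance an open file object for your server log.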

AWS Configuration

AWS S3 Permissions

You must have appropriate permissions set on your bucket. In particular, note that the less commonly-used permissions s3:AbortMultipartUpload and s3:ListMultipartUploadParts are needed on the bucket objects, and s3:ListBucketMultipartUploads and s3:GetBucketVersioning are needed on the bucket itself.

If you plan on using Retention, you'll also need s3:PutObjectRetention and s3:PutObjectLegalHold on the bucket objects, and s3:GetBucketObjectLockConfiguration on the bucket itself. When testing Retention in Governance mode, you'll need a user with s3:BypassGovernanceRetention in order for blob garbage collection to work correctly.

Here is a sample AWS S3 Policy that you can use; make sure that you replace yourbucketname with your own bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetBucketVersioning",
                "s3:ListBucketMultipartUploads",
                "s3:GetBucketObjectLockConfiguration"
            ],
            "Resource": "arn:aws:s3:::yourbucketname"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
                "s3:PutObjectRetention",
                "s3:PutObjectLegalHold"
            ],
            "Resource": "arn:aws:s3:::yourbucketname/*"
        }
    ]
}
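If you manage several buckets, generating the policy avoids copy and paste mistakes in the bucket name. The sketch below rebuilds the sample policy above for a given bucket and only includes the Retention permissions on request; build_policy is a hypothetical helper, not part of the addon.

```python
import json


def build_policy(bucket, retention=False):
    """Build the sample policy for the given bucket as a Python dict."""
    bucket_actions = [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketVersioning",
        "s3:ListBucketMultipartUploads",
    ]
    object_actions = [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
    ]
    if retention:
        # Extra permissions that are only needed when using Retention.
        bucket_actions.append("s3:GetBucketObjectLockConfiguration")
        object_actions += ["s3:PutObjectRetention", "s3:PutObjectLegalHold"]

    def allow(actions, resource):
        return {"Effect": "Allow", "Action": actions, "Resource": resource}

    return {
        "Version": "2012-10-17",
        "Statement": [
            allow(["s3:ListAllMyBuckets"], "arn:aws:s3:::*"),
            allow(bucket_actions, f"arn:aws:s3:::{bucket}"),
            allow(object_actions, f"arn:aws:s3:::{bucket}/*"),
        ],
    }


print(json.dumps(build_policy("yourbucketname", retention=True), indent=4))
```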

AWS S3 Cleanup Lifecycle Rule

If versioning is enabled on your S3 bucket, you should define a Cleanup Lifecycle rule to remove expired object delete markers. With S3 versioning enabled, the Orphaned Blobs GC only adds a delete marker on the garbage-collected object. The object is permanently deleted only if such a lifecycle rule is defined.
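As an illustration, the sketch below builds a lifecycle configuration in the JSON shape accepted by the AWS CLI command aws s3api put-bucket-lifecycle-configuration. In S3, a delete marker only counts as expired once no noncurrent version remains behind it, so the rule also expires noncurrent versions. The rule ID, the one-day delay and the function name cleanup_lifecycle are assumptions to adapt to your own retention needs.

```python
import json


def cleanup_lifecycle(noncurrent_days=1, prefix=""):
    """Build a lifecycle configuration that purges garbage-collected blobs."""
    return {
        "Rules": [
            {
                "ID": "nuxeo-gc-cleanup",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Permanently delete versions hidden behind a delete marker.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Then remove delete markers that have no versions left.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    }


print(json.dumps(cleanup_lifecycle(), indent=2))
```

The printed JSON can be saved to a file and applied to the bucket with the CLI command mentioned above, or the same rule can be created from the Management tab of the bucket in the S3 console.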

Nuxeo Configuration Through Extension Point

The above configuration uses nuxeo.conf properties prefixed with nuxeo.s3storage, which is useful for simple configurations. However, if you plan on using several S3 blob managers, you must configure them using an XML extension point. The following is an example for the default blob manager:

<extension target="org.nuxeo.ecm.core.blob.BlobManager" point="configuration">
  <blobprovider name="default">
    <class>org.nuxeo.ecm.core.storage.sql.S3BinaryManager</class>
    <property name="awsid">your_AWS_ACCESS_KEY_ID</property> <!-- optional -->
    <property name="awssecret">your_AWS_SECRET_ACCESS_KEY</property> <!-- optional -->
    <property name="region">us-west-1</property> <!-- optional -->
    <property name="bucket">your_s3_bucket_name</property>
    <property name="bucket_prefix">myprefix/</property>
    <property name="digest">SHA-256</property>
    <property name="directdownload">true</property>
    <property name="directdownload.expire">3600</property>
    <property name="cachesize">100MB</property>
    <property name="crypt.keystore.file">/my/keystore.jks</property>
    <property name="crypt.keystore.password">password</property>
    <property name="crypt.key.alias">mykey</property>
    <property name="crypt.key.password">password</property>
    <property name="connection.max">50</property>
    <property name="connection.retry">3</property>
    <property name="connection.timeout">50000</property>
    <property name="socket.timeout">50000</property>
  </blobprovider>
</extension>

Note that this needs to override the default configuration present in the default Nuxeo template default-repository-config.xml.nxftl, which already defines the standard configuration for the default blob provider. You may need to <require>default-repository-config</require> in order for the override to be correctly taken into account.

S3 Direct Upload

By default, binaries are uploaded to the Nuxeo server, which then uploads them to S3.

Another possibility is for the client to ask the Nuxeo server for temporary S3 credentials to a second S3 bucket, called the transient bucket and used as a facade, where the client (typically Web UI) uploads binaries directly. The S3 reference is then passed to the server, which moves the binary to its actual S3 bucket.

To enable this capability you will need a dedicated S3 bucket and an IAM role that can write to it and that can be assumed with the Nuxeo server AWS configuration.

The role must possess at least the right s3:PutObject on the transient bucket.

Please note that the S3 transient bucket has to be configured to allow CORS on the PUT and POST methods. This can be done in the Permissions tab of the AWS bucket configuration page.

The following CORS configuration allows Web UI to send files to S3. Feel free to adapt it if needed.

<?xml version="1.0" encoding="UTF-8"?>
<CORSConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>DELETE</AllowedMethod>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <MaxAgeSeconds>3000</MaxAgeSeconds>
    <ExposeHeader>ETag</ExposeHeader>
    <AllowedHeader>*</AllowedHeader>
</CORSRule>
</CORSConfiguration>

To activate S3 direct upload, you have to declare the mandatory fields from nuxeo.defaults in nuxeo.conf.

The optional bucket_prefix allows you to use a "subfolder" of the bucket. The optional crypt.serverside allows you to use server-side encryption (SSE-S3).

The awsid, awssecret, awstoken and region properties are deprecated and should instead be configured through nuxeo.aws.accessKeyId, nuxeo.aws.secretKey, nuxeo.aws.sessionToken and nuxeo.aws.region, or through implicit IAM instance roles (see above).

S3 direct upload is implemented by a BatchHandler and a TransientStore using contributions that can be found in the s3binaries template file s3directupload-config.xml.nxftl.