In this section:
- Document in Nuxeo
- Security management
- Indexing and Query
- Other repository features
- Repository Storage
- Advanced features
Document in Nuxeo
Document vs File
Inside the Nuxeo Repository, a document is not just a simple file.
A document is defined as a set of fields.
Fields can be of several types:
- simple fields (String, Integer, Boolean, Date, Double),
- simple lists (multi-valued simple fields),
- complex types.
A file is a special case of a complex field that contains:
- a binary stream,
- a filename,
- a mime-type,
- a size.
As a result, a Nuxeo Document can contain 0, 1 or several files.
In fact, inside the Nuxeo repository, even a Folder is seen as a document because it holds meta-data (title, creation date, creator, ...).
Document structure is defined using XSD schemas.
XSD schemas provide:
- a standard way to express structure,
- a way to define meta-data blocks.
Each document type can use one or several schemas.
Here is a simple example of an XSD schema used in Nuxeo Core (a subset of Dublin Core):
Inside the Nuxeo Repository, each document has a Document Type.
A document type is defined by:
- a name,
- a set of schemas,
- a set of facets,
- a base document type.
Document types can inherit from each other.
By using schemas and inheritance you can carefully design how you want to reuse the meta-data blocks.
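The way a document type accumulates schemas and facets through inheritance can be sketched in a few lines of Java. This is only an illustrative model, not the Nuxeo type service: the record name `DocTypeSketch` and its methods are hypothetical, and the sketch assumes a type simply adds its own schemas and facets to those of its base type.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of a document type: name, schemas, facets, optional base type. */
record DocTypeSketch(String name, Set<String> schemas, Set<String> facets, DocTypeSketch base) {

    /** Schemas declared on this type plus those inherited from its base types. */
    Set<String> effectiveSchemas() {
        Set<String> all = (base == null) ? new LinkedHashSet<>() : base.effectiveSchemas();
        all.addAll(schemas);
        return all;
    }

    /** Facets declared on this type plus those inherited from its base types. */
    Set<String> effectiveFacets() {
        Set<String> all = (base == null) ? new LinkedHashSet<>() : base.effectiveFacets();
        all.addAll(facets);
        return all;
    }
}
```

With a base type carrying the `dublincore` schema and a derived type adding a `Folderish` facet, the derived type would report both, which is the kind of meta-data reuse described above.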
At the pure storage level, facets are simple declarative markers. These markers are used by the repository and other Nuxeo EP services to define how the document must be handled.
Default facets include:
Here are some examples of Document Type definitions:
At the UI level, Document Types defined in the Repository are mapped to high-level document types that have additional attributes:
- display name,
Nuxeo Core includes a Life-Cycle service.
Each document type can be bound to a life-cycle.
The life-cycle is responsible for defining:
- the possible states of the document (e.g. draft, validated, obsolete, ...),
- the possible transitions between states (e.g. validate, make obsolete, ...).
Life-cycle is not workflow, but:
- workflows usually use the life-cycle of the document as one of the state variables of the process,
- you can simulate a simple review process using life-cycle and listeners (very easy to do using Nuxeo Studio and content automation).
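A life-cycle is essentially a small state machine. The following sketch shows the idea in plain Java; it is not the Nuxeo Life-Cycle service, and the class `LifeCycleSketch` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal life-cycle sketch: states plus named transitions between them. */
final class LifeCycleSketch {
    // state -> (transition name -> destination state)
    private final Map<String, Map<String, String>> transitions = new HashMap<>();

    /** Declares that the named transition leads from one state to another. */
    LifeCycleSketch addTransition(String from, String name, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(name, to);
        return this;
    }

    /** Returns the state reached by following the transition, or throws if it is not allowed. */
    String follow(String currentState, String transition) {
        String next = transitions.getOrDefault(currentState, Map.of()).get(transition);
        if (next == null) {
            throw new IllegalStateException(
                "Transition '" + transition + "' is not allowed from state '" + currentState + "'");
        }
        return next;
    }
}
```

A life-cycle declared with `draft -> validated -> obsolete` would accept `validate` from `draft` but reject `makeObsolete` from `draft`, which is the constraint a workflow can rely on.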
Security management
By default, security is always on inside the Nuxeo Repository: each time a document is accessed or a search is issued, security is verified.
Nuxeo Repository security relies on a list of unitary permissions that are used within the repository to grant or deny access. These atomic permissions (Read_Children, Write_Properties ...) are grouped in Permissions Groups (Read, Write, Everything ...) so that security can be managed more easily.
Nuxeo comes with a default set of permissions and permissions groups, but you can contribute your own too.
The main model for security management is based on an ACL (Access Control List) system.
Each document can be associated with an ACP (Access Control Policy). This ACP is composed of a list of ACLs, each of which is composed of ACEs (Access Control Entries).
Each ACE is a triplet:
- User or Group,
- Permission or Permission group,
- grant or deny.
ACPs are inherited by default: the security check is done against the merged ACP of the current document and all its parents. Inheritance can be blocked at any time if necessary.
Each document can be assigned several ACLs (one ACP) in order to better manage separation of concerns between the rules that define security:
- a document has a default ACL: the one that can be managed via the back-office UI,
- a document can have several workflow ACLs: ACLs that are set by workflows including the document.
Thanks to this separation between ACLs, it's easy to have the document return to the right security settings once a workflow has ended.
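The ACE triplet and the merge of local and inherited entries can be illustrated with a short Java sketch. It makes two assumptions that the text above does not state: entries are evaluated in order with the first match deciding, and no match means access is denied. The names `AceSketch` and `AcpSketch` are hypothetical, and permission groups are left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** One access control entry: who, which permission, grant or deny. */
record AceSketch(String userOrGroup, String permission, boolean granted) {}

final class AcpSketch {
    /** Merges a document's own entries with inherited ones, unless inheritance is blocked. */
    static List<AceSketch> merge(List<AceSketch> own, List<AceSketch> inherited, boolean blockInheritance) {
        List<AceSketch> merged = new ArrayList<>(own);
        if (!blockInheritance) {
            merged.addAll(inherited);
        }
        return merged;
    }

    /**
     * Checks a permission for a user identified by a set of principals (user name plus groups).
     * Assumption of this sketch: the first matching entry wins, and no match means deny.
     */
    static boolean hasPermission(List<AceSketch> mergedAces, Set<String> principals, String permission) {
        for (AceSketch ace : mergedAces) {
            if (principals.contains(ace.userOrGroup()) && ace.permission().equals(permission)) {
                return ace.granted();
            }
        }
        return false;
    }
}
```

Under these assumptions, a local "deny Read to bob" entry overrides an inherited "grant Read to members" entry for bob, while other members keep their access.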
The ACP/ACL/ACE model is already very flexible. But in some cases, using ACLs to define the security policy is not enough. A classic example would be confidentiality.
Imagine you have a system with confidential documents and you want only people accredited to the matching confidentiality level to be able to see them. Since confidentiality will be a meta-data, if you use the ACL system, you have to compute a given ACL each time this meta-data is set. You will also have to compute a dedicated user group for each confidentiality level.
In order to resolve this kind of issue, Nuxeo Repository includes a pluggable security policy system. This means you can contribute custom code that will be run to verify security each time it's needed.
Such policies are usually very easy to write, since in most cases it's only a match between a user attribute (confidentiality clearance) and the document's meta-data (confidentiality level).
A custom security policy could have an impact on performance, especially when doing an open search on a big content repository. To prevent this risk, security policies can be converted into low-level query filters that are applied at the storage level (SQL when VCS is used) when doing searches.
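The confidentiality example can be sketched as follows. This is an illustration of the idea only, not the Nuxeo security policy extension point: the interface `SecurityPolicySketch`, the property name `confidentiality:level`, and the user attribute `clearance` are all hypothetical.

```java
import java.util.Map;

/** Hypothetical policy contract: inspect the user and the document, return a verdict. */
interface SecurityPolicySketch {
    enum Verdict { GRANT, DENY, UNKNOWN }

    Verdict check(Map<String, Object> userAttributes, Map<String, Object> documentProperties);
}

/** Denies access when the user's clearance is below the document's confidentiality level. */
final class ConfidentialityPolicySketch implements SecurityPolicySketch {
    @Override
    public Verdict check(Map<String, Object> user, Map<String, Object> doc) {
        Object level = doc.get("confidentiality:level");
        if (!(level instanceof Integer required)) {
            return Verdict.UNKNOWN; // no confidentiality level set: let the ACLs decide
        }
        // A user without a clearance attribute is treated as having clearance 0.
        int clearance = (user.get("clearance") instanceof Integer c) ? c : 0;
        return clearance >= required ? Verdict.UNKNOWN : Verdict.DENY;
    }
}
```

The policy never grants access by itself; it only vetoes, so the regular ACL check still applies to users with sufficient clearance. No per-level ACL or user group has to be computed.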
Indexing and Query
All documents stored in Nuxeo Repository are automatically indexed on their metadata. Files contained in Documents are also by default Full Text indexed.
For that, Nuxeo Core includes a conversion service that provides full text conversion from most usual formats (MSOffice, OpenOffice, PDF, Html, Xml, Zip, RFC 822, ...).
So, in addition to meta-data indexing, the Nuxeo Repository will maintain a fulltext index that will be composed of: all meta-data text content + all text extracted from files.
Configuration options depend on the storage backend, but you can define what should be put into the fulltext index and even define several separated fulltext indexes.
Of course, indexing is only interesting if you can issue queries.
The Nuxeo Repository includes a Query system with a pluggable QueryParser that lets you search the repository content. The Nuxeo Repository supports 2 types of queries:
- NXQL: Native SQL Like query language
- CMISQL: Normalized query language included in CMIS specification
Both query languages let you search documents based on Keyword match (meta-data) and/or full text expressions. You can also manage ordering.
In CMISQL you can do cross queries (i.e. JOINs).
Here is an example of an NXQL query, to search for all non-folderish documents that have been contributed by a given user:
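As a hedged illustration, such a query can be assembled in Java as a plain string. The helper class `NxqlQuerySketch` is hypothetical; the property names `dc:contributors` and `ecm:mixinType`, the `Folderish` facet name, and the backslash escaping of quotes are assumptions based on common NXQL usage rather than on this page.

```java
/** Builds the NXQL query described above as a plain string. */
final class NxqlQuerySketch {

    /** Quotes a value as a string literal (assumes backslash escaping of quotes). */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** All non-folderish documents that list the given user as a contributor. */
    static String nonFolderishContributedBy(String username) {
        return "SELECT * FROM Document"
            + " WHERE dc:contributors = " + quote(username) // match on a multi-valued field
            + " AND ecm:mixinType != 'Folderish'";          // exclude documents with the Folderish facet
    }
}
```

For the user `jdoe`, the helper produces `SELECT * FROM Document WHERE dc:contributors = 'jdoe' AND ecm:mixinType != 'Folderish'`.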
As you may see, there is no security clause, because the Repository will always only return documents that the current user can see. Security filtering is built-in, so you don't have to post-filter results returned by a search, even if you use complex custom security policies.
Other repository features
The Nuxeo Repository includes a versioning system.
At any moment, you can ask the repository to create and archive a version from a document. Versioning can be configured to be automatic (each document modification would create a new version) or on demand (this is bound to a radio button in default Nuxeo DM UI).
Each version has:
- a label,
- a major version number,
- a minor version number.
The versioning service is configurable so you can define the numbering policy. In fact, even the version storage service is pluggable so you can define your own storage for versions.
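One possible numbering policy is sketched below. The label format "major.minor" and the reset of the minor number on a major increment are assumptions of this sketch, since the actual policy is configurable; the record name `VersionSketch` is hypothetical.

```java
/** Version identified by a major and a minor number; the label format is an assumption. */
record VersionSketch(int major, int minor) {

    /** Minor increment, e.g. for a small edit. */
    VersionSketch nextMinor() {
        return new VersionSketch(major, minor + 1);
    }

    /** Major increment; this sketch resets the minor number to 0. */
    VersionSketch nextMajor() {
        return new VersionSketch(major + 1, 0);
    }

    String label() {
        return major + "." + minor;
    }
}
```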
The Nuxeo Repository includes the concept of Proxy.
A proxy is very much like a symbolic link on a Unix-like OS: a proxy points to a document and will look like a document from the user's point of view:
- the proxy will have the same meta-data as the target document,
- the proxy will hold the same files as the target document (since a file is a special kind of meta-data).
A proxy can point to a live document or to a version (a checked-in, archived version).
Proxies are used to be able to see the same document from several places without duplicating any data.
The initial use case for proxies in Nuxeo DM is local publishing: when you are happy with a document (and possibly after it has successfully completed a review workflow), you want to create a version of this document. This version will be the validated one, and the live document stays in the workspace where you created it. Then you may want to give several people access to this validated document. For that, you can publish the document into one or several sections: this means creating proxies pointing to the validated version.
Depending on their rights, people who cannot read the document from the workspace (because they cannot access it) may be able to see it from one or several sections (that may even be public).
The second use case for proxies is multi-filing.
Although a proxy cannot hold meta-data of its own, it can hold security descriptors (ACP/ACL). So a user may be able to see one proxy and not another.
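The split between shared meta-data and separate security can be sketched as follows. All names (`ReadableDocSketch`, `LiveDocSketch`, `ProxySketch`) are hypothetical, and security is reduced to a plain set of reader names for the sake of the example.

```java
import java.util.Map;
import java.util.Set;

/** What a reader sees: properties plus the set of principals allowed to read. */
interface ReadableDocSketch {
    Map<String, Object> properties();

    Set<String> readers();
}

/** A regular document (or version) that owns its properties and its security. */
record LiveDocSketch(Map<String, Object> properties, Set<String> readers) implements ReadableDocSketch {}

/** A proxy: properties come from the target, security is the proxy's own. */
record ProxySketch(ReadableDocSketch target, Set<String> readers) implements ReadableDocSketch {
    @Override
    public Map<String, Object> properties() {
        return target.properties(); // no data is duplicated
    }
}
```

A version readable only by its authors can thus be exposed in a public section through a proxy whose readers include everyone, without copying any meta-data or files.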
When the Nuxeo Repository performs an operation, an event will be raised before and after.
Events raised by the Repository are:
- aboutToCreate / emptyDocumentModelCreated / documentCreated
- aboutToRemove / documentRemoved
- aboutToRemoveVersion / versionRemoved
- beforeDocumentModification / documentModified
- beforeDocumentSecurityModification / documentSecurityUpdated
- documentLocked / documentUnlocked
- aboutToCopy / documentCreatedByCopy / documentDuplicated
- aboutToMove / documentMoved
- documentPublished / documentProxyPublished / documentProxyUpdated / sectionContentPublished
- beforeRestoringDocument / documentRestored
- aboutToCheckout / documentCheckedOut
- incrementBeforeUpdate / aboutToCheckIn
These events are forwarded on the Nuxeo Event Bus and can be processed by custom handlers. As for all Events Handlers inside Nuxeo Platform, these Handlers can be:
- Synchronous: meaning they can alter the processing of the current operation (e.g. change the Document content or mark the transaction for RollBack).
- Synchronous Post-Commit: executed just after the transaction has been committed (can be used to update some data before the user gets the result).
- Asynchronous: executed asynchronously after the transaction has been committed.
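The three handler kinds can be illustrated with a toy event bus. This is not the Nuxeo event service API: the class `EventBusSketch` is hypothetical, the "transaction" is only a comment, and a synchronous handler aborts the operation simply by throwing an exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Toy event bus with the three handler kinds described above. */
final class EventBusSketch {
    final List<Consumer<String>> synchronous = new ArrayList<>();
    final List<Consumer<String>> postCommit = new ArrayList<>();
    final List<Consumer<String>> asynchronous = new ArrayList<>();

    /**
     * Raises one event inside a simulated transaction.
     * If a synchronous handler throws, the exception propagates and no later handler runs.
     * The returned future completes once all asynchronous handlers have finished.
     */
    CompletableFuture<Void> fire(String event) {
        synchronous.forEach(h -> h.accept(event)); // inside the transaction
        // ... a real implementation would commit the transaction here ...
        postCommit.forEach(h -> h.accept(event));  // after commit, before the caller gets control back
        CompletableFuture<?>[] pending = asynchronous.stream()
            .map(h -> CompletableFuture.runAsync(() -> h.accept(event))) // on another thread
            .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(pending);
    }
}
```

Callers that do not care about asynchronous handlers can ignore the returned future; it is only there so that the sketch can be observed deterministically.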
Inside the Nuxeo Repository this event system is used to provide several features:
- some fields are automatically computed (creation date, modification date, author, contributors ...),
- documents can be automatically versioned,
- fulltext extraction is managed by a listener too.
Using the event listener system for these features offers several advantages:
- you can override the listeners to inject your own logic,
- you can deactivate the listeners if you don't need the processing,
- you can add your own listeners to provide extra features.
Repository Storage
The Nuxeo Repository consists of several services.
One of them is responsible for actually managing persistence of Documents. This service is pluggable. Nuxeo Repository can have two different persistence backends:
- Nuxeo VCS,
- Apache Jackrabbit (only up to Nuxeo 5.3.2).
Choosing between the two backends depends on your constraints and requirements, but from the application point of view it is transparent:
- The API remains the same,
- Documents are the same.
The only impact is that VCS has additional features that are not supported by the Jackrabbit backend.
Nuxeo Visible Content Store (VCS)
Nuxeo VCS was designed to provide a clean SQL Mapping. This means that VCS does a normal mapping between XSD schemas and the SQL database:
- a schema is mapped as a table,
- a simple field is mapped as a column,
- a complex type is mapped as a foreign key pointing to a table representing the complex type structure.
Using such a mapping provides several advantages:
- a DBA can see the database content and fine tune indexes if needed,
- you can use standard SQL based BI tools to do reporting,
- you can do low level SQL bulk inserts for data migration.
Binary files are never stored in the database; they are stored via the BinaryManager on the file system, using their digest. Files are only deleted from the file system by a garbage collector script.
This storage strategy has several advantages:
- storing the same file several times in Nuxeo won't store it several times on disk,
- Binary storage can be easily snapshotted.
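The deduplication effect of digest-based storage can be shown with a small sketch. It is not the Nuxeo BinaryManager: the class `BinaryStoreSketch` is hypothetical, an in-memory map stands in for the file system, and SHA-256 is an arbitrary choice of digest algorithm for the example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HexFormat;
import java.util.Map;

/** Content-addressed store: each binary is kept once, under the digest of its content. */
final class BinaryStoreSketch {
    private final Map<String, byte[]> filesByDigest = new HashMap<>();

    /** Stores the content under its digest and returns the digest, which serves as the key. */
    String store(byte[] content) {
        String digest = digestOf(content);
        filesByDigest.putIfAbsent(digest, content.clone()); // identical content is stored only once
        return digest;
    }

    /** Returns a copy of the stored content, or null if the digest is unknown. */
    byte[] read(String digest) {
        byte[] content = filesByDigest.get(digest);
        return content == null ? null : content.clone();
    }

    int storedFileCount() {
        return filesByDigest.size();
    }

    private static String digestOf(byte[] content) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(content));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }
}
```

Because nothing is ever overwritten in place, a store of this kind can be copied or snapshotted while it is in use, and unreferenced entries can be removed later by a separate clean-up pass.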
VCS being now the default Nuxeo backend, it also provides some features that are not available when using the JCR backend:
- Tag Service,
- Possibility to import a Document with a fixed UUID (useful for application level synchronization).
In addition, VCS provides a native Cluster mode that does not rely on any external clustering system.
This means you can have 2 (or more) Nuxeo servers sharing the same data: you only have to turn on Nuxeo VCS Cluster mode.
Advantages of VCS:
- SQL storage is usable by DBAs and by BI reporting tools,
- supports Hot Backup,
- supports Cluster mode,
- supports extra features,
- supports low level SQL bulk imports,
- VCS scales well with big volumes of Documents.
Drawbacks of VCS:
- storage is not JCR compliant.
Apache Jackrabbit Backend
This backend is not present in new Nuxeo versions.
The Jackrabbit backend is compliant with the JSR-170 standard (JCR).
This is the "historical" backend, since first versions of Nuxeo were using this backend by default (it was the only one available).
Jackrabbit provides a storage abstraction layer and can be configured:
- to store everything on the file system (meta-data + files),
- to store everything in a SQL DataBase (meta-data + files),
- to store meta-data in the SQL DataBase and store files on the filesystem (recommended solution).
Advantages of this backend:
- it is JSR-170 compliant so you can use any compliant browser to access your Nuxeo Documents, even without Nuxeo code,
- it can run on a pure filesystem (not recommended for production).
Drawbacks of this backend:
- SQL storage is cryptic (Database stores serialized java objects),
- JackRabbit uses a Lucene index on filesystem (so clustering and hot-backup are complicated),
- doing reporting on JackRabbit data is complex.
Advanced features
Lazy Loading and binary files streaming
In Java API, a Nuxeo Document is represented as a DocumentModel object.
Because a Document can be big (lots of fields including several files), a DocumentModel Object could be big:
- big object in memory,
- big object to transfer on the network (in case of remoting),
- big object to fetch from the storage backend.
Furthermore, even when you have very complex documents, you don't need all this data on each screen: in most screens you just need a few properties (title, version, life-cycle state, author...).
In order to avoid these problems, the Nuxeo DocumentModel supports Lazy-Fetching: a DocumentModel is by default not fully loaded, only the fields defined as prefetch are initially loaded. The DocumentModel is bound to the Repository Session that was used to read it and it will transparently fetch the missing data, block by block, when needed.
You still have the possibility to disconnect a DocumentModel from the repository (all data will be fetched), but the default behavior is to have a lightweight Java object that will fetch additional data when needed.
The same kind of mechanism applies to files, with one difference: files are transported via a dedicated streaming service that is built-in. Because default RMI remoting is not so smart when it comes to transferring big chunks of binary data, Nuxeo uses a custom streaming mechanism for transferring files from and to the Repository.
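The lazy-fetching idea can be sketched in a few lines. This is not the DocumentModel implementation: the class `LazyDocumentSketch` is hypothetical, a function stands in for the Repository Session, and the sketch fetches field by field, whereas the text above describes fetching block by block.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Lazily loaded document: prefetched fields are loaded up front, others on first access. */
final class LazyDocumentSketch {
    private final Map<String, Object> loaded = new HashMap<>();
    private final Function<String, Object> fetchFromRepository; // stands in for the session
    private int lazyFetches = 0;

    LazyDocumentSketch(Set<String> prefetch, Function<String, Object> fetchFromRepository) {
        this.fetchFromRepository = fetchFromRepository;
        for (String field : prefetch) {
            loaded.put(field, fetchFromRepository.apply(field));
        }
    }

    /** Returns the field value, fetching it from the repository only if it is not loaded yet. */
    Object get(String field) {
        if (!loaded.containsKey(field)) {
            lazyFetches++;
            loaded.put(field, fetchFromRepository.apply(field));
        }
        return loaded.get(field);
    }

    /** Number of fields that had to be fetched after construction. */
    int lazyFetchCount() {
        return lazyFetches;
    }
}
```

A screen that only shows prefetched properties such as the title never triggers an extra fetch, while the first access to a heavier field costs exactly one round trip.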
The Nuxeo Repository uses the notion of Session.
All the modifications to documents are done inside a session and modifications are saved (written in the backend) only when the session is saved.
In a JTA/JCA aware environment, the Repository Session is bound to a JCA Connector that allows:
- the Repository Session to be part of the global JTA transaction,
- the session to be automatically saved when the transaction commits.
This means that in a JTA/JCA compliant environment you can be sure that the Repository will always be safe and have the expected transactional behavior. This is important because a single user action could trigger modifications in several services (update documents in repository, update a workflow process state, create an audit record) and you want to be sure that either all these modifications are done, or none of them: you don't want to end up in an inconsistent state.
In a lot of cases, Documents are used to represent Business Objects: Invoice, Subscription, Contract...
The DocumentModel class will let you design the data structure using schemas, but you may want to add some business logic to it:
- provide helper methods that compute or update some fields,
- add integrity checks based on business rules,
- add business methods.
For this, Nuxeo Core contains an adapter system that lets you bind a custom Java class to a DocumentModel.
The binding can be made directly against a document type or can be associated with a facet.
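A minimal sketch of such an adapter mechanism, bound by document type, might look like this. It does not use the Nuxeo adapter service: the names `BusinessDocSketch`, `InvoiceAdapterSketch`, and `AdapterRegistrySketch`, as well as the `invoice:*` property names, are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Bare-bones document: a type name and a property map. */
record BusinessDocSketch(String type, Map<String, Object> properties) {}

/** Hypothetical business adapter adding a computed value on top of an "Invoice" document. */
record InvoiceAdapterSketch(BusinessDocSketch doc) {
    long totalCents() {
        long net = (Long) doc.properties().get("invoice:netCents");
        long tax = (Long) doc.properties().get("invoice:taxCents");
        return net + tax;
    }
}

/** Registry binding adapter factories to document types. */
final class AdapterRegistrySketch {
    private final Map<String, Map<Class<?>, Function<BusinessDocSketch, ?>>> factories = new HashMap<>();

    <T> void register(String docType, Class<T> adapterClass, Function<BusinessDocSketch, T> factory) {
        factories.computeIfAbsent(docType, k -> new HashMap<>()).put(adapterClass, factory);
    }

    /** Returns the adapter for this document, or empty if none is bound to its type. */
    <T> Optional<T> adapt(BusinessDocSketch doc, Class<T> adapterClass) {
        Function<BusinessDocSketch, ?> factory =
            factories.getOrDefault(doc.type(), Map.of()).get(adapterClass);
        return factory == null
            ? Optional.empty()
            : Optional.of(adapterClass.cast(factory.apply(doc)));
    }
}
```

Binding by facet instead of by type would only change the lookup key; the calling code that asks for an adapter class would stay the same.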
By default, Nuxeo EP provides some generic adapters:
- BlobHolder: lets you read and write Binary files stored in a document,
- CommentableDocument: encapsulates Comment Service logic so that you can easily comment a document,
- MultiViewPicture: provides an abstraction and easy API to manipulate a picture with multiple views,