What is a persistence layer?
The persistence layer is an architectural layer whose job is to provide an abstract interface to one or more information storage mechanisms. Such an API should be abstract and independent of storage technologies and vendors. It would typically provide features such as the following:
- storage / retrieval of whole object networks by key
- logical database cursor abstraction for accessing some / all instances of a given type
- transaction support - open, commit, abort
- session management
- some basic querying support, e.g. how many instances of a given type exist
Usually such layers are built from at least two internal layers of software: the first being the abstract interface, the second being a set of bindings, one for each target database. In practice, there may be three layers since there may be an internal division between the logic for object and relational (and other) storage mechanisms.
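As a concrete illustration of the two-layer split described above, the following sketch shows what the abstract interface might look like, with a toy in-memory binding standing in for a real per-database binding. All names here are illustrative assumptions, not part of any openEHR specification.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterator, Optional

class PersistenceLayer(ABC):
    """Abstract interface; concrete bindings target specific databases."""

    @abstractmethod
    def store(self, key: str, obj: Any) -> None: ...

    @abstractmethod
    def retrieve(self, key: str) -> Optional[Any]: ...

    @abstractmethod
    def instances_of(self, type_name: str) -> Iterator[Any]: ...

    @abstractmethod
    def count(self, type_name: str) -> int: ...

    # transaction support: open, commit, abort
    @abstractmethod
    def begin(self) -> None: ...

    @abstractmethod
    def commit(self) -> None: ...

    @abstractmethod
    def abort(self) -> None: ...

class InMemoryBinding(PersistenceLayer):
    """A toy binding; a real one would be written per target database."""

    def __init__(self) -> None:
        self._data: dict = {}
        self._pending: Optional[dict] = None

    def begin(self) -> None:
        self._pending = dict(self._data)

    def commit(self) -> None:
        if self._pending is not None:
            self._data, self._pending = self._pending, None

    def abort(self) -> None:
        self._pending = None

    def store(self, key: str, obj: Any) -> None:
        target = self._pending if self._pending is not None else self._data
        target[key] = obj

    def retrieve(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    def instances_of(self, type_name: str) -> Iterator[Any]:
        return (o for o in self._data.values() if type(o).__name__ == type_name)

    def count(self, type_name: str) -> int:
        return sum(1 for _ in self.instances_of(type_name))
```

A third, internal layer (e.g. separate object and relational mapping logic) would sit behind a binding like `InMemoryBinding`, invisible to callers of the abstract API.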
Is persistence different in openEHR?
The openEHR architecture is different from other architectures in the health domain, and indeed in most domains. The main difference is that it doesn't only have an object model from which to create software, database schemas etc; it also has a layer of domain models called archetypes. As a consequence, the part of the architecture that is defined as object models (known as the "reference model" or RM) is smaller and more generic than many models. The RM can be considered for most purposes as a typical object model. To get a feel for the architecture, the Architecture Overview (PDF) is a good place to start.
On one level therefore, persisting openEHR data is no different from persisting data in any object model of an equivalent number of classes (around 100, including all data types, EHR types, demographic types). There are two main challenges:
- bridging the object/relational gap: how to persist object data in relational databases which may be a requirement of the deployment environment
- the query trade-off: in any database, storing and retrieving numerous fine-grained objects is costly in terms of disc access time, yet the finer the granularity of storage, the more likely it is that the database's inbuilt query engine can query the data directly, rather than forcing the application to retrieve the data and process the query itself.
These problems are related in their solution since if we opt for very coarse-grained data storage, the object/relational gap diminishes as well.
How can openEHR data be characterised?
To understand how to persist openEHR objects, we need to have a feel for the data. The business objects are known as "top-level structures" in openEHR, and are the items that are version controlled, including:
- Compositions (see EHR IM, ehr.composition package; Composition XML-schema)
- Folder structures (see Common IM, common.directory package)
- Ehr_status object in an EHR (see EHR IM, ehr package)
- Party structures in the demographic model, e.g. Person, Role instances (see Demographic IM)
Each of these types is equivalent to a document in the sense that it defines the level of granularity of store and retrieve from a service. None of the top-level structure types contains any "live" references to objects outside its own hierarchy. This means that a "store" operation on such an object will cause only that object hierarchy to be stored. Changing the interior parts of such an object is generally done by retrieving the whole thing, modifying it in memory, then committing the changed version. The real question of interest with respect to how data are persisted is to do with querying, rather than the store/retrieve/modify/store cycle.
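The retrieve/modify/commit cycle just described can be sketched as follows; the nested dict stands in for a Composition-like hierarchy, and all names are illustrative:

```python
import copy

# Toy store of top-level structures keyed by uid; the nested dict
# stands in for a Composition-like object hierarchy.
store = {"comp-1": {"content": {"items": {"value": {"magnitude": 8}}}}}

def retrieve(uid):
    # The whole hierarchy comes back as a single unit: it holds no
    # "live" references to anything outside itself.
    return copy.deepcopy(store[uid])

def commit(uid, structure):
    # The whole (possibly modified) hierarchy is stored again as one unit.
    store[uid] = copy.deepcopy(structure)

# retrieve -> modify in memory -> commit the changed version
comp = retrieve("comp-1")
comp["content"]["items"]["value"]["magnitude"] = 9
commit("comp-1", comp)
```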
The following figure illustrates typical openEHR data and the scope of archetyping and templating.
Where openEHR data differs from most other object data is that it is archetyped, meaning that it contains an archetype node identifier in every data node (the archetype_node_id attribute inherited from the LOCATABLE class). Further, every node (descendant of LOCATABLE) has a unique name attribute (also inherited from LOCATABLE). These attributes guarantee that an Xpath-style path can be defined for every single node in openEHR data, down to the leaf ELEMENTs (an ELEMENT holds a DATA_VALUE instance). See the Architecture Overview for a detailed explanation.
All openEHR data contains two node identifiers - archetype_node_id and name, both inherited from the LOCATABLE class. These enable the creation of Xpath-style paths that can be used to uniquely identify every node in a data composition (see the openEHR Architecture Overview for details on paths). However, these paths differ from Xpath paths in that they carry the node meanings from archetypes, not just the reference model attribute names as typical Xpaths do. For example, consider the following path:
- [openEHR-EHR-COMPOSITION.birth_note.v1]/content[at0001]/items**[openEHR-EHR-OBSERVATION.Apgar.v1]**/data/events[at0003]/data/items[at0025]/value/magnitude
This is a path to an Apgar result in a note recording a birth. In English, this path translates to:
- [birth_note]/content[Objective]/items**[Apgar result]**/data/events[1 minute]/data/items[Apgar total]/value/magnitude
The meanings of the "at" codes are defined in the relevant archetypes (i.e., within the birth_note archetype, "at0001" is the code for "Objective" in English). In the above paths, parts such as content, items, data, events, value and magnitude are attribute names from classes in the reference model; the bracketed codes and names come from archetypes. The parts in bold are archetype "root points", i.e. points at which the data is based on a new archetype.
The availability of such paths means that every node in openEHR data is addressable using a meaningful path, opening the way for some novel possibilities in data storage, particularly relational storage.
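To make the path structure concrete, here is a minimal sketch of splitting such a path into (attribute, predicate) segments, where a predicate is either an "at" code or an archetype identifier. The regular expression is an illustrative assumption, not the path grammar defined by openEHR:

```python
import re

# Each path segment is an attribute name optionally followed by a
# bracketed predicate: an at-code ('at0001') or an archetype id.
SEGMENT = re.compile(r"([a-z_0-9]*)(?:\[([^\]]+)\])?")

def parse_path(path: str):
    segments = []
    for part in path.strip("/").split("/"):
        m = SEGMENT.fullmatch(part)
        if not m:
            raise ValueError(f"bad segment: {part}")
        segments.append((m.group(1), m.group(2)))
    return segments

# The Apgar path from the example above
apgar_path = ("[openEHR-EHR-COMPOSITION.birth_note.v1]/content[at0001]"
              "/items[openEHR-EHR-OBSERVATION.Apgar.v1]/data/events[at0003]"
              "/data/items[at0025]/value/magnitude")
```

A segment with an empty attribute and an archetype-id predicate marks an archetype root point; a segment with no predicate is a plain reference-model attribute.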
Is openEHR a proprietary data format?
No, it's the opposite of proprietary. openEHR data are defined by the Reference Model specifications published by openEHR since 2001. These specifications define every detail of openEHR data, and are available in UML (XMI) form and XML Schema form.
But what if openEHR data are stored on a proprietary (i.e. commercial) database?
A great deal of production data (probably the majority) in the world, in all industries including health, is stored on proprietary databases such as Oracle, IBM DB2 and Microsoft SQL Server. Indeed, some of the openEHR vendor solutions are deployed on these databases. However, the data always follow the openEHR Reference Model, and can always be retrieved in the standard open XML Schema form, as well as in the object form defined by the RM, via the EHR Service interface, whose calls return openEHR RM structures as Java, C# or other programming language objects.
This is in contrast with EMR solutions that define a proprietary schema for the database and no logical Reference Model.
What about Performance?
Regardless of what kind of persistence mechanism is chosen, performance of storage and retrieval is important if the system is to be scalable to large numbers of users and database accesses. Object-oriented data generally takes the form of fine-grained hierarchical structures, and openEHR data is no exception. Storing data at its finest granularity is almost guaranteed to be infeasible for scalable systems: retrieval tests on typical object data stored in fine-grained form almost always reveal extreme inefficiency. Addressing this problem usually means storing the data at a coarser granularity, i.e. converting the fine-grained in-memory data into "blobs" and storing those instead. The questions raised by doing that include:
- what is the right level of granularity?
- how to store the part of the data not being serialised into blobs (i.e. higher parts of the hierarchy)?
- what about queries that need data from serialised-blob parts of the data?
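One common answer to the last question, sketched below with invented names (this is not an openEHR-specified scheme), is to serialise each top-level structure into a single blob while copying selected leaf values out into an index table, so that the database engine can still query those values directly:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compositions (uid TEXT PRIMARY KEY, blob TEXT)")
conn.execute("CREATE TABLE leaf_index (uid TEXT, path TEXT, value TEXT)")

def store_composition(uid, composition, indexed_paths):
    """Store the whole structure as one blob; copy out selected leaf
    values so the database can query them without unpacking the blob."""
    conn.execute("INSERT INTO compositions VALUES (?, ?)",
                 (uid, json.dumps(composition)))
    for path in indexed_paths:
        node = composition
        for key in path.strip("/").split("/"):
            node = node[key]
        conn.execute("INSERT INTO leaf_index VALUES (?, ?, ?)",
                     (uid, path, str(node)))

def retrieve_composition(uid):
    row = conn.execute("SELECT blob FROM compositions WHERE uid = ?",
                       (uid,)).fetchone()
    return json.loads(row[0]) if row else None
```

Queries over un-indexed parts of the data would still require retrieving and deserialising the blob in the application, which is exactly the trade-off the questions above point at.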
In tests at UCL on the Java implementation of openEHR, retrieval of 1 openEHR Party object stored as fine-grained objects via Hibernate over MySQL and queried by primary key took seconds; retrieval of the same object stored in a "blob" form took a few milliseconds.
Are openEHR data Versioned?
Yes. Versioning is a key part of the reference model. Its semantics are defined by the Common Information Model specification.
How versioning is implemented will have a major impact on the storage approach. Logically, every top-level object in openEHR is versioned, meaning that separate versions from each commit are always available. Further, the Contribution concept means that any particular commit causes a set of versions (often called a "change-set") to be committed in one go. Rolling back to previous states of the data means retrieving the state of the data at each Contribution commit point, not just at arbitrary previous points in time.
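A toy model of Contribution-based commits might look like the following; class and method names are invented for illustration:

```python
class VersionedRepository:
    """Toy model of Contribution ("change-set") commits."""

    def __init__(self):
        self.versions = {}        # (uid, version_no) -> data
        self.latest = {}          # uid -> latest version_no
        self.contributions = []   # each entry: list of (uid, version_no)

    def commit_contribution(self, changes):
        """changes: dict of uid -> new data; all committed in one go."""
        change_set = []
        for uid, data in changes.items():
            vn = self.latest.get(uid, 0) + 1
            self.versions[(uid, vn)] = data
            self.latest[uid] = vn
            change_set.append((uid, vn))
        self.contributions.append(change_set)

    def state_at_contribution(self, n):
        """Reconstruct repository state after the nth contribution (0-based),
        i.e. roll back to a Contribution commit point, not an arbitrary time."""
        state = {}
        for change_set in self.contributions[: n + 1]:
            for uid, vn in change_set:
                state[uid] = self.versions[(uid, vn)]
        return state
```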
We also have to be mindful of the requirements of versioned openEHR data - any solution should take account of the following features:
- only a very small number of Compositions (probably fewer than 20) in a given EHR will have many versions. These are the Compositions representing things like:
- problem list
- medications list
- allergies and alerts
- family history
- social history
- patient preferences
- other objects like the EHR Folder tree and EHR_status object may well have numerous versions
- for the vast majority of EHR data, there will only be one version; new versions will only be created to correct errors.
- for the vast majority of accesses, only the latest version of the EHR is of interest. Previous versions are only likely to be requested for reasons of medico-legal investigation, or process-improvement studies.
The options for implementing versioning in databases are broadly similar to those used in software version control systems like CVS, Subversion, BitKeeper etc. There are two major schemes:
- version control top-level items (i.e. document-level items like Compositions) singly, and manage overall versioning of the repository as lists of paths and version ids. Systems like CVS and BitKeeper do this, using RCS and SCCS respectively to achieve versioning of each item.
- version the entire database in such a way that any nominated version corresponds to a complete view of all the data. Either the first or latest version of the repository is usually known in full, with all other versions containing changed items only with respect to the adjacent version. Subversion works like this.
The big difference between software version control systems and versioned data systems is usually that the former are file-based, whereas the latter are database system-based, allowing for fine-grained data storage, retrieval, querying and better multi-user session management.
Object databases often provide versioning. The Matisse database for example has inbuilt low-level versioning, achieved efficiently with its never-write-in-the-same-place storage approach. Such facilities on their own probably won't do what is required by openEHR, since openEHR has an explicit notion of identified versions of each top-level object. Given that the main data access need is for the latest version, it may be quite reasonable to treat the latest version as a normal database, and to manage older versions in a way that is not tightly bound to the latest version. This probably favours storing the latest version in full, and earlier versions as differences.
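The "latest version in full, earlier versions as differences" idea can be sketched with line-based reverse deltas; this is an illustration under assumed names, not how any openEHR implementation necessarily works:

```python
import difflib

def make_delta(src_lines, dst_lines):
    """Opcodes sufficient to rebuild dst_lines from src_lines."""
    sm = difflib.SequenceMatcher(a=src_lines, b=dst_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse src_lines[i1:i2]
        else:
            delta.append(("insert", dst_lines[j1:j2]))
    return delta

def apply_delta(src_lines, delta):
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(src_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

class DeltaStore:
    """Latest version kept in full; each older version kept only as a
    reverse delta against the next-newer version."""

    def __init__(self):
        self.latest = None
        self.reverse_deltas = []  # entry k rebuilds version k from version k+1

    def commit(self, lines):
        if self.latest is not None:
            self.reverse_deltas.append(make_delta(lines, self.latest))
        self.latest = list(lines)

    def get(self, version):
        """Versions count from 0 (oldest); the highest index is the latest."""
        lines = self.latest
        for delta in reversed(self.reverse_deltas[version:]):
            lines = apply_delta(lines, delta)
        return lines
```

This matches the access pattern noted above: reading the latest version costs nothing extra, while older versions pay a reconstruction cost proportional to how far back they are.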
What are the Options for Storage?
There is a wealth of knowledge on the subject of persisting object data. One useful general reference is the Barry & Associates site, another is Scott Ambler's site. Here we try to cover just some of the major ideas, in rough order of priority. To summarise at the outset, we need to consider issues such as:
- why not use an object database?
- do the data need to be retrievable from the database by software written in other languages?
- what granularity of data needs to be queryable?
- can we use openEHR paths to help?
Object databases and persistence frameworks
One option, depending on whether your deployment environment constrains what kind of database or persistence mechanism you are allowed/encouraged to use, is to forget all about relational databases for your persistence. The attraction of object databases and other native object mechanisms is that you don't have to think too much about how your data fits the database, because there is no semantic gap between your objects and the database. If an object database or framework satisfies all the needs of the service you aim to provide, then this is a good option. You have to carefully consider all your requirements and assess them against the product you are considering. Issues to consider:
- Adding an object database to an existing environment means adding more database administration, including start, stop, backup, archive and other operations, most likely in a tool that existing sysadmin / operations staff have never seen. Make sure the overhead is acceptable. An object persistence framework probably won't be visible at all to such people.
- Some object databases and most object persistence frameworks store data in native object form, e.g. Java objects are stored in native binary form, only retrievable by the same software and instantiable as Java objects. This may be fine, but you need to be sure.
- What is the finest grain of query you need to be able to do? There is probably no point in storing data smaller than this granularity.
An object persistence framework is typically a fairly lightweight library that provides: a persistence API, a method of persisting data to disk, and a smart cache. The API is typically of the form where calls like store(an_object) can be made, where an_object is the root object in a network of objects that together make up a whole top-level structure. Object persistence frameworks don't usually provide all the session management, querying, security, and transactional power of full database systems. They may or may not be scalable to large numbers of users, and may be more oriented to client-side than server-side persistence. Examples for Java include db4o.
An object database, on the other hand, is a proper scalable and secure database management system that supports querying as well as persistence - in other words, like a relational database system, except that it deals directly with objects rather than tables. Usually some object-flavoured SQL will be supported. Example products include Matisse (a language-neutral database with SQL querying). There are also clinical information systems based on object databases such as InterSystems Caché and Jade. Zope is a Python-based object database that is quite widely used behind active websites and has been used in health information systems, e.g. FreePM, OIO.
An object/relational (O/R) product is one that ultimately relies on an underlying relational database to store the data, but does all the hard work of turning objects into relational form to write into the database. From the programmer's point of view, it may look just like an object database. The advantage of this approach is that it allows you to use an existing relational database in your environment that is already required for some other purpose. O/R products solve the problem of performing the object/relational mapping in a generic way, but they don't know anything about your data a priori. In particular, they don't know what the patterns of querying are, where the business object boundaries are, or anything else. Some products may allow such things to be specified.
The default situation will be that using an O/R product on a typical object model over a relational database will result in numerous tables and extremely fine-grained object storage and retrieval, with the consequent performance penalty. Most likely, an O/R product will not know about business object boundaries and will do the same thing as an object database with a naively designed object model: store and retrieve everything reachable by reference-following. Avoiding these problems means at a minimum reducing the granularity of the objects being stored; see below.
Object data can be directly stored in a relational database, but then the schema design becomes a greater issue. If the intention is that the schema be a derivative of the object model - i.e. the "classical" approach to mapping, for which various typical strategies exist - then the schema design may not be trivial. This kind of schema design is what many of the O/R tools try to automate and/or hide. However, other strategies are available, including one very interesting one made possible by the paths in openEHR data.
See this wiki page for an approach called 'node+path' which shows how a relational database could be used to store path-based archetyped data such as that found in openEHR.
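The essence of the node+path idea can be sketched as a single table keyed by (structure id, path), with one row per leaf value; the schema and names below are assumptions for illustration, not the design described on that page:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE node (
    structure_uid TEXT, path TEXT, value TEXT,
    PRIMARY KEY (structure_uid, path))""")

def store_tree(uid, tree, prefix=""):
    """Flatten a nested dict into (uid, path, leaf value) rows."""
    for key, child in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(child, dict):
            store_tree(uid, child, path)
        else:
            conn.execute("INSERT INTO node VALUES (?, ?, ?)",
                         (uid, path, str(child)))

def query(path_pattern):
    """SQL LIKE over stored paths stands in for path-based querying."""
    return conn.execute(
        "SELECT structure_uid, path, value FROM node WHERE path LIKE ?",
        (path_pattern,)).fetchall()
```

Because every openEHR node has a unique path, the path column alone is enough to identify any node within a structure, and the database's own query engine can match path patterns without the application deserialising anything.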