Terminology Querying for AQL

(A long read; feel free to skim the parts you already know.)

In short, 4 points:

  1. SNOMED-CT is able to describe and query every clinical idea.
  2. SNOMED-CT has an enormous amount of clinical knowledge in its smart databases.
  3. OpenEHR can use this, cooperate with this, but not out of the box.
  4. Read below how to change this.

The problem is in fact a combination of a few problems that can be solved with one solution. Read the post linked above for a more extensive description.

  • Medical data in classical systems are stored without context. The context lives in the GUIs, not in the data, so machines cannot process it.
  • Medical data in classical systems are a labyrinth: it is hard to discover their meaning without studying the GUIs that were used to produce the data. And what if one table serves several GUIs, so that the data in one table can have several semantic meanings? This happens a lot: classes reused for similar but different purposes, table structures reused. I have seen it many times, also in the hands of well-known companies.
  • Medical data in OpenEHR systems are also hard to query, because of the large number of data constellations (archetypes in OpenEHR) and their internal structures.
  • Medical data in OpenEHR systems also impose a level of granularity on the result sets of queries.

To understand the solution we have to become familiar with some terms I use, and with the aspects of them that I consider important.

Archetypes and Templates

Archetypes

OpenEHR archetypes are structures that define datasets. Besides OpenEHR archetypes there are also ISO 13606 archetypes, but I do not know of a single system in the world that uses ISO 13606 archetypes, while a fair number of systems, some of them large, use OpenEHR archetypes. The idea is a bit similar to XML Schema, which defines an XML dataset. But that is about where the similarities end.

Archetypes define OpenEHR datasets, but there are many differences between XML Schema and archetypes. Archetypes define much more than XML Schema does, and they are constrained by a Reference Model focused on clinical ideas. Archetypes define structured data, and every item has a path to its location in the structure. Archetype paths use node identifiers to identify the items in a path. The archetype node ids are unique within an archetype.

What does an archetype do? Let me picture an example archetype: the boring blood pressure example. The designer of the archetype wants the system that uses it to store not only the systolic and diastolic blood pressure, but also some details about the moment of measurement. That is asking for context, and context is one of the advantages of archetypes. The designer can enforce that the context is described: the patient was standing, sitting, had just been exercising, etc. Other, classical systems can do that too, but with archetypes the context is stored in the same dataset that records the blood pressure. After 50 years a researcher can still see that the blood pressure was measured under certain circumstances.

That is the big advantage of archetypes. The necessary context is defined, and the data are stored together with that context in a clear way. The ontology section and the meta-data section explain the data.

This is the big achievement of OpenEHR: you always know what the data mean. They are semantically rich. You can find a nice collection of archetypes here.

But now the disadvantages.

  • Archetypes offer a lot of freedom in how to constrain and structure data. So in one archetype Code System X is used, and in another archetype Code System Y. It is impossible to query archetyped datasets without studying the archetype. At the moment OpenEHR is a success, but not a big success. Imagine it became a big success, and devices and medications were accompanied by vendor-designed archetypes that fit very well with what the vendor sells. A hospital would have repositories of thousands of archetypes. It would become impossible to study them all, and for that reason impossible to query the archetyped datasets. Imagine you are researching pain and how it is treated. You search all datasets for situations in which there is pain, to see how it is treated and what works best. This is very difficult in a hospital that has thousands of archetypes, because you do not know on which paths the pain is described, and you do not know the structures you need to traverse from that point. So generic querying is not an option.
  • Archetypes do not have a hierarchical structure like SNOMED concepts have. There are some efforts to create one, but that mimics the work of IHTSDO without the possibility to do it right. And why do something that has already been done very well? The disadvantage of not having a hierarchical structure is this: you can search for “laparoscopic appendectomy” and miss “emergency appendectomy” or “excision of intestinal structure”, and in this way miss an important situation in the body of a patient. I call that imposed granularity: imposed by the archetype, which does not offer a semantic hierarchical structure of medical data out of the box.
  • Patient: “How could you not know that I was treated for cancer 20 years ago? You would have known about the risk of having it again. It was even treated in this hospital.” Specialist: “Well, dear patient, our system does not support searching on coarse-grained generic terms. If you had been treated by an oncologist connected to this hospital, we would have found it; we searched for that (our supported way of some kind of generic searching). But apparently it was treated by another specialty, so no alarm bells rang. That it also happened in this hospital does not make it look better for us. Another problem is that a system migration took place in between.”

Templates

Templates use parts of archetypes to define how data are represented. Templates can be used for screen presentation. There can also be templates for other purposes: messages, reports, etc.

Archetypes, apart from their original purpose of representing data structure, need to be purpose-independent. Archetypes are the base to build templates on, and depending on the purpose of the template, the data will be represented in a different way.

So archetypes need to be machine-processable in a generic way as much as possible. If you use all kinds of convenient terms in different archetypes to indicate, for example, pain, then you will never be able to do data mining (research in the history of a patient, or in a population) in a more generic way, because you will need to know the specific archetypes fully, and there can be thousands of them in a hospital with a history of ten years or more of using archetypes.

Keep this in mind for the rest of my explanation. Templates are good: they are predefined presentation definitions, and they are the constructions for which archetypes are the building blocks. We need archetypes and templates. I repeat it because this post is not meant to oppose OpenEHR. I am searching for an improved use of it.

SNOMED-CT

This post is about combining SNOMED CT and OpenEHR into something better, the combination being richer than the two parts apart, so I need to explain which parts of SNOMED-CT are important in this idea.

As SNOMED-CT is often used now, it is a dictionary to look up codes and retrieve descriptions. It is about being sure that the other side, the message receiver or the reader of the data, understands exactly what the data are. That is important, and there is nothing against it. But if you use SNOMED-CT only in that way, you could, apart from the translations, just as well not use it. SNOMED-CT is so much more than that. It consists of medical ideas, clinical ideas; they are called concepts. It describes relations between them and gives a semantic meaning to those relations (and of course it is translated into many languages). It has hierarchies, like “laparoscopic appendectomy” IS A “appendectomy”. “Emergency appendectomy” IS (also) A “appendectomy”. It has about 450,000 clinical ideas predefined in the concepts, and over 3 million relations defined between them. Besides this huge base of information, which is unique in the world and still growing, it has an expression language to define clinical ideas that are not predefined in those 450,000 concepts.
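To make the IS A idea concrete, here is a minimal Python sketch of a subsumption lookup over a toy hierarchy. It uses the labels from the text instead of real concept identifiers, and the edge from “appendectomy” to “excision of intestinal structure” is an assumption made only for illustration. The function names are hypothetical, and a real terminology server does far more than this.

```python
# Toy is-a table: child -> set of direct parents.
# Labels stand in for real concept identifiers; the last edge is an
# assumption for illustration only.
IS_A = {
    "laparoscopic appendectomy": {"appendectomy"},
    "emergency appendectomy": {"appendectomy"},
    "appendectomy": {"excision of intestinal structure"},
}


def ancestors(concept):
    """Return every direct and indirect parent of a concept."""
    found = set()
    stack = [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def subsumes(general, specific):
    """True if `specific` is `general` itself or one of its subtypes."""
    return general == specific or general in ancestors(specific)


def descendants_or_self(general):
    """Return all known concepts that `general` subsumes."""
    known = set(IS_A) | {p for parents in IS_A.values() for p in parents}
    return {c for c in known if subsumes(general, c)}


if __name__ == "__main__":
    print(sorted(descendants_or_self("appendectomy")))
```

A search on the coarse term “appendectomy” then also finds the laparoscopic and the emergency variant, which is exactly what a search on literal codes misses.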

Besides this, it has reference sets, which can be of all kinds: mappings to other code systems, and selections of clinical ideas based on specific medical conditions (extensional) or on hierarchies (intensional), which are machine-processed and machine-processable.

And it is growing…..

OpenEHR is not using this at all, and has no syntax to use it. That is what I want to change in this post. This post is an idea for how to combine the full potential of OpenEHR and SNOMED-CT. Maybe my ideas are not very elegant and need refinement; see them as a signal: “Hey, we need to do something about it.”

SNOMED-CT Expressions

In SNOMED CT there are two kinds of expressions to record clinical meanings.

Pre-coordinated Expressions

These are pre-coordinated concept definitions. These are the simple ones: they are defined in the SNOMED CT releases, and they connect a ConceptID to a description and to relations. In this way, users can discover through the internal relations everything that is known in SNOMED CT about that specific clinical term. For example: “31978002 |fracture of tibia|“. SNOMED allows a description to be given after a code, between pipe characters: “|”. This makes expressions readable.

Because this is a pre-coordinated expression, we can find its relations in the database. Let’s take a look, just for fun. You see, it has two parents, 8 direct children, and many more children of children. Quite a thing, “fracture of tibia”. And the good thing about it being a pre-coordinated expression: its relations have already been researched and defined, and not only from a hierarchical point of view.

Let us take a look at that (on the left). We see the parents in purple (IS A relation) and we see some attributes: the finding site and the associated morphology. All these items can have further child and parent items. This is so much information. When a system does not use it, that will in the end be hard to explain to the stakeholders (patients).

Post-coordinated Expressions

The other kind is post-coordinated expressions. Such an expression contains two or more concept identifiers. They exist so that not every clinical meaning needs a new conceptId of its own, which avoids an explosion in the number of SNOMED CT concepts. The syntax of post-coordinated expressions is similar to that of pre-coordinated expressions: conceptIds and, optionally, descriptions.

There are no absolute boundaries between pre-coordinated and post-coordinated expressions. For example, the post-coordinated expression:  “22253000|pain|:363698007|finding site|=56459004|foot|” is a specialization of the pre-coordinated expression:  “10601006| pain in lower limb | ” which is defined as:  “22253000|pain|:363698007|finding site |= 61685007|lower limb structure| ” and  “56459004|foot| ” is a subtype of “61685007|lower limb structure | ”

The important thing to remember is that it is possible to write expressions at any level of detail, and that post-coordination is the way to escape from the boundaries of the pre-coordinated SNOMED CT content. Without doubt, you can express every clinical idea in SNOMED-CT. This is not (yet) the case for OpenEHR.
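The specialization claim in this example can be checked mechanically. Below is a minimal Python sketch, assuming only the simplest expression shape (one focus concept, optional `attribute = value` pairs separated by commas, no attribute groups or nested expressions). The function names are hypothetical, and the single is-a fact in the table is the one stated in the text.

```python
import re

# The only hierarchy fact used here comes from the text:
# 56459004 |foot| is a subtype of 61685007 |lower limb structure|.
IS_A = {"56459004": {"61685007"}}


def parse_expression(text):
    """Parse 'focus : attr = value, ...' into (focus, {attr: value}).

    Handles only flat expressions; the optional |term| parts are dropped.
    """
    bare = re.sub(r"\|[^|]*\|", "", text)  # remove |term| descriptions
    bare = re.sub(r"\s+", "", bare)        # remove all whitespace
    focus, _, refinement = bare.partition(":")
    attributes = {}
    if refinement:
        for pair in refinement.split(","):
            name, _, value = pair.partition("=")
            attributes[name] = value
    return focus, attributes


def is_subtype_or_self(child, parent):
    """True if `child` equals `parent` or reaches it through is-a links."""
    if child == parent:
        return True
    return any(is_subtype_or_self(p, parent) for p in IS_A.get(child, ()))


def specializes(specific, general):
    """True if every constraint in `general` is met or narrowed by `specific`."""
    s_focus, s_attrs = parse_expression(specific)
    g_focus, g_attrs = parse_expression(general)
    if not is_subtype_or_self(s_focus, g_focus):
        return False
    return all(
        name in s_attrs and is_subtype_or_self(s_attrs[name], value)
        for name, value in g_attrs.items()
    )


PAIN_IN_FOOT = "22253000|pain|:363698007|finding site|=56459004|foot|"
PAIN_IN_LOWER_LIMB = (
    "22253000|pain|:363698007|finding site |= 61685007|lower limb structure|"
)

if __name__ == "__main__":
    print(specializes(PAIN_IN_FOOT, PAIN_IN_LOWER_LIMB))
```

The check compares the focus concepts and then each refinement, so “pain in foot” specializes “pain in lower limb” but not the other way around.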

Compositional Grammar

Compositional Grammar is the “language” in which expressions are written. It has a syntax that allows complex structures, like subexpressions, attribute groups, concrete values and subsumptions. So it can be used for authoring, validating, storing, displaying and many more purposes.

It can be used to represent the definition part of an archetype. Because SNOMED CT can describe every clinical concept, every archetype can be represented in an expression, mostly a post-coordinated one. So it is not necessary to lose information. Not all elements of an archetype definition necessarily need to be translated; this is a design decision of the information analyst who creates the archetype. I will come back to a small syntax adjustment that can be very helpful when post-coordinated expressions are used to represent archetypes.

Expression Constraint Language

The Expression Constraint Language is somewhat similar to the Compositional Grammar, but it is designed for querying rather than for defining clinical concepts. For example:  “<19829001 |disorder of lung|: 116676008|associated morphology| = <<79654002 |edema| ” would return datasets that record a subtype of disorder of lung whose associated morphology is edema or one of its subtypes, such as  “40829002|acute edema| ” and  “103619005 |inflammatory edema| ”

The Expression Constraint Language needs its own engine, and is able to query everything that is in SNOMED-CT.  See the link for a description.
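How such a constraint is evaluated can be sketched in a few lines of Python. This is a toy evaluator under stated assumptions, not the real engine: it supports only the operators `<` (descendants) and `<<` (descendants or self) and at most one `attribute = value` refinement. The concepts `HYP-1`, `HYP-2`, `HYP-3` and the morphology `HYP-M` are hypothetical placeholders, not real SNOMED CT content.

```python
import re

# Toy content. The edema codes come from the text; every HYP-* entry is a
# hypothetical placeholder invented for this sketch.
IS_A = {
    "40829002": {"79654002"},    # acute edema is-a edema
    "103619005": {"79654002"},   # inflammatory edema is-a edema
    "HYP-1": {"19829001"},       # hypothetical lung disorders
    "HYP-2": {"19829001"},
    "HYP-3": {"19829001"},
}
ATTRIBUTES = {
    "HYP-1": {"116676008": "40829002"},   # associated morphology = acute edema
    "HYP-2": {"116676008": "103619005"},  # ... = inflammatory edema
    "HYP-3": {"116676008": "HYP-M"},      # ... = hypothetical non-edema morphology
}


def is_descendant(child, ancestor):
    """True if `ancestor` is reachable from `child` through is-a links."""
    return any(
        p == ancestor or is_descendant(p, ancestor) for p in IS_A.get(child, ())
    )


def matches_operand(concept, operand):
    """Test a concept against '<id' (descendants), '<<id' (or self) or 'id'."""
    m = re.fullmatch(r"(<<|<)?(\S+)", operand)
    operator, target = m.group(1), m.group(2)
    if operator == "<<":
        return concept == target or is_descendant(concept, target)
    if operator == "<":
        return is_descendant(concept, target)
    return concept == target


def evaluate(constraint):
    """Return the toy concepts that satisfy 'focus [: attribute = value]'."""
    bare = re.sub(r"\|[^|]*\|", "", constraint)  # drop |term| descriptions
    bare = re.sub(r"\s+", "", bare)
    focus, _, refinement = bare.partition(":")
    attribute, _, value = refinement.partition("=")
    result = set()
    for concept in set(IS_A) | set(ATTRIBUTES):
        if not matches_operand(concept, focus):
            continue
        if refinement:
            actual = ATTRIBUTES.get(concept, {}).get(attribute)
            if actual is None or not matches_operand(actual, value):
                continue
        result.add(concept)
    return result


QUERY = "<19829001 |disorder of lung|: 116676008|associated morphology| = <<79654002 |edema|"

if __name__ == "__main__":
    print(sorted(evaluate(QUERY)))
```

Only the two hypothetical lung disorders with an edema morphology come back; the third is filtered out by the refinement.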

AQL

AQL (Archetype Query Language) is a query language for archetyped data, using the archetype paths as a roadmap through the data structures. The syntax is reminiscent of SQL, because it has keywords like SELECT, FROM and WHERE, and the structure is also SQL-like. But it is also reminiscent of XQuery, because the FROM, WHERE and SELECT arguments are path-based. AQL is very powerful: it can disclose the complete archetype-structured data. It can also query for SNOMED CT concepts, when these are part of the archetype structure. But at the time of writing it cannot search using the full potential of SNOMED CT. It has no syntax for that, and the engine does not support it. See the section Process, below, for how this can change.

Process

1) There is a need for a post-coordinated expression representing all facets (where appropriate) of an archetype definition. This expression needs to be created by hand or with supporting tooling. It cannot be generated automatically from existing archetypes. It will be a string that lives on a specific and predictable path in an archetype.

There are no changes to the existing syntax or semantic rules of archetypes, and it can be used in ADL 1.4 as well as in ADL 2.0. This post-coordinated expression needs to be augmented with archetype node ids. (This is necessary to map the data, in step 2, to the parts of the SNOMED CT expression.)

So the expression will look like: “272741003 | laterality | id123 | =7771000 | left | id456 |” or, when omitting the optional description: “272741003 || id123 | =7771000 || id456 |“. Another possibility, at the cost of readability, is to omit the description and use its place for the archetype node id. In this way, the expression remains interoperable with other SNOMED-based software. Also, the added expression does not conflict with the rest of an archetype, and already existing archetypes can add this expression without losing backwards compatibility.

2) When a dataset is processed (validated against the archetypes, plus whatever other process steps a specific OpenEHR implementation has), an additional step is needed that completes the post-coordinated expression from point 1 with the data values. This is a separate software process that will not conflict with already existing software processes.
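The augmented form can be handled by a small preprocessing step. The Python sketch below assumes the proposed shape `conceptId | term | nodeId |` exactly as written in the example above; since that syntax is itself only a proposal, the pattern and the function names are hypothetical. It extracts the node id bindings and strips them again, so that what remains is an ordinary expression.

```python
import re

# Proposed augmented form: conceptId | optional term | archetype-node-id |
AUGMENTED = re.compile(r"(\d+)\s*\|([^|]*)\|\s*([^|\s]+)\s*\|")


def node_bindings(expression):
    """Map each archetype node id to the concept id it is attached to."""
    return {m.group(3): m.group(1) for m in AUGMENTED.finditer(expression)}


def strip_node_ids(expression):
    """Remove the node ids, leaving a plain 'conceptId|term|' expression."""

    def plain(match):
        term = match.group(2).strip()
        return f"{match.group(1)}|{term}|" if term else match.group(1)

    stripped = AUGMENTED.sub(plain, expression)
    # Tidy whitespace around the structural symbols of the expression.
    return re.sub(r"\s*([:=,])\s*", r"\1", stripped).strip()


WITH_TERMS = "272741003 | laterality | id123 | =7771000 | left | id456 |"
WITHOUT_TERMS = "272741003 || id123 | =7771000 || id456 |"

if __name__ == "__main__":
    print(node_bindings(WITH_TERMS))
    print(strip_node_ids(WITH_TERMS))
    print(strip_node_ids(WITHOUT_TERMS))
```

Keeping the stripping step separate means that software which knows nothing about archetypes can still be fed the plain expression.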

3) The query. All archetyped data that have pre- or post-coordinated expressions can be queried using AQL, and OpenEHR query arguments can be combined with SNOMED CT arguments. All old queries remain functional. The AQL query engine needs a small extension: it must call, where appropriate, the SNOMED expression constraint engine (which will be a separate software module). That is the only change to the existing software in a specific OpenEHR implementation (one line in a Spring configuration ;-). The expression constraint is a boolean constraint, like all expressions in the WHERE section of AQL, so its result can be evaluated in the same way as the other clauses in the WHERE section.

4) An example of a SNOMED CT query integrated in AQL. Note that this is a proposal; I expect it will look like this, although some keywords may still change.

SELECT
    e/ehr_status/subject/external_ref/id/value, diagnosis/data/items[at0002.1]/value
FROM
    EHR e
        CONTAINS Composition c[openEHR-EHR-COMPOSITION.problem_list.v1]
            CONTAINS Evaluation diagnosis[openEHR-EHR-EVALUATION.problem-diagnosis.v1]
WHERE
    c/name/value='Current Problems'
    AND diagnosis/data/items[at0001]/value/value matches { terminology://Snomed-CT/expressionConstraint? "<19829001:116676008=<<79654002"}
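To show how small the hook in the AQL engine could be, here is a Python sketch that pulls the terminology call out of the WHERE clause above and treats the constraint engine as a boolean predicate. The URI shape is taken from the proposed query; the function names, the row layout and the stub engine are hypothetical and exist only for this illustration.

```python
import re

# Matches the proposed clause:
#   matches { terminology://<system>/<function>? "<argument>" }
TERMINOLOGY_CLAUSE = re.compile(
    r'matches\s*\{\s*terminology://([^/\s]+)/(\w+)\?\s*"([^"]*)"\s*\}'
)


def extract_terminology_call(where_clause):
    """Return the terminology system, function and argument, or None."""
    m = TERMINOLOGY_CLAUSE.search(where_clause)
    if m is None:
        return None
    return {
        "terminology": m.group(1),
        "function": m.group(2),
        "argument": m.group(3),
    }


def row_matches(row, call, ecl_engine):
    """Combine an ordinary AQL condition with the terminology predicate.

    `ecl_engine(code, constraint)` stands for the separate expression
    constraint module; it only has to return True or False.
    """
    return row["name"] == "Current Problems" and ecl_engine(
        row["code"], call["argument"]
    )


WHERE = (
    "c/name/value='Current Problems' "
    "AND diagnosis/data/items[at0001]/value/value matches "
    '{ terminology://Snomed-CT/expressionConstraint? "<19829001:116676008=<<79654002"}'
)

if __name__ == "__main__":
    call = extract_terminology_call(WHERE)
    # Stub engine for the demo: accepts exactly one hypothetical code.
    stub_engine = lambda code, constraint: code == "HYP-1"
    print(call)
    print(row_matches({"name": "Current Problems", "code": "HYP-1"}, call, stub_engine))
```

Because the engine is passed in as a plain callable, the rest of the WHERE evaluation does not need to know anything about SNOMED CT.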

Purpose

Data-processing and data-mining

For example, in OpenEHR you can only query for things that are in archetypes. It does not implement clinical ideas hierarchically. OpenEHR could have 30 different archetypes for 30 subtypes of a disease. Each subtype can have its own specific information requirements, so 30 archetypes may be needed to represent them well.

SNOMED CT has another approach: it has hierarchical information (is-a relationships) and can use inferred attributes (inheritance). Using SNOMED CT, it would be possible to do a coarse-grained query that returns all people who have a specific disease as a supertype. This query could also return all archetyped data of people who had one of the subtypes of that disease.

So SNOMED-CT queries give more opportunities to define the result of a query.
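The effect on data mining can be sketched in Python with records that come from different archetypes. Everything in this example is hypothetical: the archetype names, the patients and the disease labels are placeholders, and each concept has a single parent to keep the sketch short.

```python
# Toy single-parent hierarchy; all labels are hypothetical placeholders.
PARENT = {
    "disease X subtype A": "disease X",
    "disease X subtype B": "disease X",
    "unrelated finding": None,
}

# Records stored under different (hypothetical) archetypes.
RECORDS = [
    {"patient": "p1", "archetype": "hypothetical-subtype-a.v1", "concept": "disease X subtype A"},
    {"patient": "p2", "archetype": "hypothetical-subtype-b.v1", "concept": "disease X subtype B"},
    {"patient": "p3", "archetype": "hypothetical-other.v1", "concept": "unrelated finding"},
]


def is_a(concept, ancestor):
    """Walk up the hierarchy until the ancestor is found or the top is reached."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def patients_with(ancestor):
    """Coarse-grained query: patients with the disease or any of its subtypes."""
    return sorted({r["patient"] for r in RECORDS if is_a(r["concept"], ancestor)})


if __name__ == "__main__":
    print(patients_with("disease X"))
```

The query never mentions an archetype, yet it collects patients whose data were recorded with two different ones.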

Templates and SNOMED-GUI’s

Templates are already described above. They serve data representation. SNOMED-CT is also advertised for data representation, but it has no real means to generate a data representation, or to build one with RAD tooling. Hard coding is still necessary, in the form of: when this is selected, build that GUI. This is the bug-prone part that SNOMED-CT can lead to.

Templates are much more elegant. See the GUI builder from Marand for how to get from data to archetypes, from archetypes to templates, and from templates to a GUI.

We need to make that available for SNOMED-CT, and archetypes can help with this.

Entry types

The OpenEHR reference model knows six entry models; these are the top hierarchies in which clinical ideas are arranged. These entry models are: AdminEntry, Observation, Evaluation, Instruction, Activity and Action. That is it. Behind every entry model is a specific data structure. This data structure is well thought out, and it was designed around the normal use of the clinical ideas behind the entry type.

This is also a problem. You will not always be able to define how data are stored. There is no genericity. I described this problem before. SNOMED-CT can solve this problem in two ways. One is that the archetype contains a SNOMED-CT expression representing the archetype. This SNOMED-CT expression can be read and queried generically.

I would like to go one step further. Instead of having humans create post-coordinated expressions that represent an archetype, we can think the other way around: we can create a post-coordinated expression (or use a pre-coordinated one) and generate an archetype from it. This will then always be an archetype of the Generic Entry type, and it will have a data structure that can be derived from the expression.

Then we unleash the real power of SNOMED-CT in OpenEHR and in templates, and OpenEHR can help to extend the use of SNOMED-CT.

Conclusion

OpenEHR can help SNOMED-CT to achieve better communication between GUI layers and the pre- and post-coordinated expressions.

SNOMED-CT can help OpenEHR to do deeper and more sophisticated queries on hierarchies, and on the knowledge expressed in SNOMED CT clinical ideas.

Neither SNOMED-CT nor OpenEHR needs (many) adjustments to its specifications. The only thing really needed is good tooling.

The desired adjustments are a small optional augmentation of the compositional grammar specification (archetype node ids) and a small augmentation of the AQL specification (a query expression as an argument in an AQL query). The rest of what is needed can be done within the current specifications, and there is no problem regarding compatibility with current software and practices.

That is the whole story on one page. Think about it.

Thanks

Bert Verhees