This treatise on the pros and cons of codes and coding is based on this initial paper by Eric Browne, October 2008
One often hears that coded data are essential for semantic interoperability and decision support. Coding is the use of symbolic or alphanumeric identifiers to tag data items as referring to concepts or terms from an agreed vocabulary or ontology. Coding may, in many circumstances, have some value. But it also comes at a price. This article looks at the balance sheet to tease out the issues facing those making recommendations for electronic health records and semantic interoperability. Although some more general aspects of terminologies are touched on, the focus is not on clinical terminologies and classifications per se, but rather on the use of codes attached to, or used in lieu of, words and terms.
Codes have been around since before the days of computers, but all digital computers must rely on codes for their very operation. Each instruction is represented by a combination of 0s and 1s. So too is every piece of data. With respect to data, codes can be applied at different structural levels. Thus, each character is represented by a code or codes, from the simple 128-character ASCII set used to write this article, to the complex kanji, Chinese and other characters and symbols that are represented by more complex coding schemes such as Unicode. So, by combining character codes, we can represent and store words and phrases - i.e. text strings. The string "openEHR" is built from the ASCII codes:
01101111 01110000 01100101 01101110 01000101 01001000 01010010
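The character-level coding described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_ascii_bits` is an assumption made for this example and does not come from the original paper.

```python
def to_ascii_bits(text: str) -> str:
    """Return the 8-bit ASCII code of each character, separated by spaces."""
    # str.encode("ascii") raises UnicodeEncodeError for non-ASCII characters,
    # which is exactly where richer schemes such as Unicode take over.
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


print(to_ascii_bits("openEHR"))
# 01101111 01110000 01100101 01101110 01000101 01001000 01010010
```

Each group of eight bits is the code for one character, so the stored "word" is nothing more than a sequence of character codes.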
In the early days of computing, memory, storage space and communications bandwidth were very limited. It made sense to not only code individual characters within a string, but even to code text strings themselves, by replacing the set of codes representing each character of the string by a single code representing the entire text string. Particularly so if the string was likely to be repeated in other locations. "Diabetes Mellitus" could be replaced by code "1101110011010001", for example. Or even by just a shorter string, say "DM". It saved precious space and bandwidth. Codes were easier for computers to identify in searches and to place into predefined message structures.
But most of these barriers have evaporated over the years, as computer storage first increased a thousandfold, then a millionfold and onwards. The bandwidth of many of our network links has scaled by several orders of magnitude each decade. Now our programming languages and programmers can support sophisticated pattern matching through "regular expressions" and other advances which allow them to operate directly on text strings instead of codes. Still, the legacy of those old constraints lives on in the specifications and the mindset of many current authors and standards development bodies.
The above history tells only part of the story. Surely there are other factors influencing the scene when it comes to representing clinical terms and data through the use of codes?
One argument cited for coding clinical text strings is to ensure agreement between communicating humans on a set of terms that describe their universe of discourse - the concepts related to health and healthcare, or some subset thereof. These are variously known as vocabularies, nomenclatures, value sets, termsets, codesets, terminologies, classifications, etc. The communicating parties agree on the set of valid or prescribed terms, and from then on they use the pre-allocated code as a reference to (proxy for) each term in the prescribed set. The set constrains the permitted vocabulary for a specific scope and purpose. Thus the set can simplify the processing for that purpose. This raises two issues.
Some termsets allow for concepts to be represented independently of the words used to describe each concept. This allows a concept to be described by more than one term. Synonyms and language translations are the two examples commonly cited. The concept "level of glycosylated haemoglobin in blood plasma" might legitimately be known and referred to as "hemoglobin A1c", "HbA1C", "HBA1C level", "glycated hemoglobin concentration", "la hemoglobina glucosilada" or a host of other terms.
Conversely, a term can describe more than one concept. A "left ventricle" could be a compartment of the heart or part of the brain.
Though not strictly necessary, codes can help handle the 'many to one' and 'one to many' relationships that might be needed, particularly in a large complex terminology where such requirements are certainly likely to occur.
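One way to picture these 'many to one' and 'one to many' relationships is a pair of indexes built from (concept, term) rows. The sketch below is illustrative only: the concept identifiers `C001` to `C003` are hypothetical and are not taken from any real terminology.

```python
from collections import defaultdict

# Hypothetical concept identifiers paired with terms from the text above.
DESCRIPTIONS = [
    ("C001", "hemoglobin A1c"),
    ("C001", "HbA1C"),
    ("C001", "glycated hemoglobin concentration"),
    ("C001", "la hemoglobina glucosilada"),
    ("C002", "left ventricle"),  # compartment of the heart
    ("C003", "left ventricle"),  # part of the brain
]

terms_by_concept = defaultdict(set)   # one concept -> many terms
concepts_by_term = defaultdict(set)   # one term -> possibly many concepts
for concept_id, term in DESCRIPTIONS:
    terms_by_concept[concept_id].add(term)
    concepts_by_term[term.lower()].add(concept_id)


def concepts_for(term: str) -> list:
    """Return every concept a term may refer to (case-insensitive)."""
    return sorted(concepts_by_term.get(term.lower(), set()))


print(concepts_for("HbA1C"))           # a synonym resolves to a single concept
print(concepts_for("left ventricle"))  # an ambiguous term resolves to two
```

The code is what keeps the two "left ventricle" entries apart; the text alone cannot.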
As medical knowledge expands and evolves, so too do the concepts, and the language used to describe them, change. Sometimes the concept morphs but the term used remains. Sometimes the concept remains the same but an alternative term is used. Sometimes the two occur simultaneously. One old concept is cleaved into two or more new concepts and each new concept given a new term. The original concept may even fade out of our daily lexicon. It is desirable, particularly in a longitudinal health record spanning many decades, for the current viewers and users of the record to somehow be able to make sense of the concepts and language of yesteryear. But for these purposes the terminology has to be designed and managed well, the electronic health architecture needs to support this, and the implementations current at the time of viewing may need access to the state of the terminology at the time the entries were made.
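The need to consult the terminology as it stood when an entry was made can be sketched as a dated lookup. Everything here is an assumption for illustration: the code `X42`, the release dates and the placeholder terms are invented, and a real terminology service would carry far richer history.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical release history: code -> [(release date, term in force), ...],
# kept sorted by release date.
HISTORY = {
    "X42": [
        (date(1990, 1, 1), "old preferred term"),
        (date(2000, 1, 1), "new preferred term"),
    ],
}


def term_at(code: str, when: date):
    """Return the term that `code` carried on the date `when`, or None."""
    releases = HISTORY.get(code, [])
    release_dates = [released for released, _ in releases]
    # Index of the last release on or before `when`; -1 if there is none.
    index = bisect_right(release_dates, when) - 1
    return releases[index][1] if index >= 0 else None


print(term_at("X42", date(1995, 6, 1)))  # resolves against the earlier release
print(term_at("X42", date(2010, 6, 1)))  # resolves against the later release
```

A record entry would then need to store the date it was made (or the terminology version in force) alongside the code, so that a later viewer can resolve it in its original context.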
Concept permanence is probably of even greater importance for statistical comparisons and research analyses that span long periods of time and patient cohorts. One only has to trace the history of the classification of, say, the various manifestations of hepatitis through successive versions of the International Statistical Classification of Diseases and Related Health Problems (ICD-10 being the most recent release at the time of writing) to appreciate the complexities and interaction of changing concepts and changing codes.
Codes provide a mechanism for anchoring a term to a particular spot in a terminology where such terminologies provide relationships between concepts. In SNOMED CT, for instance, codes play a pivotal role in uniquely identifying concepts, and in uniquely identifying relationships between concepts via relationship types. Codes are also used to support other management and searching functions within the terminology.
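As a rough sketch of how codes anchor concepts within a set of relationships, the snippet below stores "is a" links between codes and answers a subsumption question by walking them. The codes `D0` to `D3` and their links are hypothetical; this is not SNOMED CT's actual data model or release format.

```python
# Hypothetical codes; each maps to the set of its direct "is a" parents.
# A concept may have more than one parent (a polyhierarchy).
IS_A = {
    "D0": set(),
    "D1": {"D0"},
    "D2": {"D0"},
    "D3": {"D1", "D2"},
}


def ancestors(code: str) -> set:
    """Return every code reachable from `code` by following "is a" links."""
    seen = set()
    stack = [code]
    while stack:
        for parent in IS_A.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(code: str, candidate: str) -> bool:
    """True if `code` is the same as, or a descendant of, `candidate`."""
    return code == candidate or candidate in ancestors(code)


print(sorted(ancestors("D3")))
print(is_subsumed_by("D3", "D0"))
```

Searches of the form "find every record coded with this concept or anything more specific" rely on exactly this kind of traversal, which is only possible because each concept has a unique identifier.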
Compound, multi-axis terminologies, like LOINC, use codes to refer to legal combinations from a set of components. In the case of the LOINC terminology, each test in a catalogue of laboratory tests is given a code that describes the test in terms of 6 components: the name of the component or analyte measured, its property (substance concentration, mass, volume), the timing of the measurement, the type of sample (serum, urine, etc.), the scale of measurement (qualitative vs. quantitative, etc.), and the method. The compound, or "pre-coordinated", codes embody knowledge beyond that ascribed to the individual components, since only valid combinations of analyte, sample, scale etc. are constructed. Each compound concept can be processed as a single entity. If the processing system has access to the LOINC table, then each of the components of the compound entity can also be accessed separately.
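The idea of a pre-coordinated code that can be treated as one entity, or opened up into its six components when a table is available, might be sketched as follows. The class name, field names, the code `T-0001` and its row are all invented for illustration; they are not real LOINC content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompoundTest:
    """The six axes described in the text, as plain strings."""
    component: str
    measured_property: str
    timing: str
    sample: str
    scale: str
    method: Optional[str] = None


# Hypothetical lookup table standing in for the terminology's own table.
TABLE = {
    "T-0001": CompoundTest(
        component="glucose",
        measured_property="substance concentration",
        timing="point in time",
        sample="serum",
        scale="quantitative",
    ),
}


def sample_type(code: str) -> Optional[str]:
    """Return the sample axis for a code, or None if the table lacks it."""
    entry = TABLE.get(code)
    return entry.sample if entry else None


print(sample_type("T-0001"))   # components are reachable via the table
print(sample_type("T-9999"))   # without a table row the code stays opaque
```

The second call shows the dependency the text points to: the code alone carries no accessible structure unless the processing system holds the table.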
In even more complex terminologies, such as SNOMED CT, the terminology can provide the ability to combine terms according to rules. SNOMED CT uses the codes attached to each atomic concept, a compositional grammar, and formal "concept models" based on Description Logic for some key clinical topics such as clinical findings, to govern the production of compound concepts. Some of the compound, or pre-coordinated, concepts are already provided in the terminology. Given the existence of appropriate software, many more can be constructed by users of the terminology (post-coordination) for specific data entry requirements on an as-needs basis. Currently there is no standardised mechanism for labeling or coding these post-coordinated concepts - they simply exist as expressions of codes together with their syntactic glue. The issues associated with using these SNOMED CT expressions in clinical systems and electronic health records are complex and significant. A mere hint of some of the issues can be gleaned from the extensive analyses conducted by David Markwell [MAR2008] for the UK's National Health Service.
Codesets have been used for value sets of vastly differing sizes and for a considerable range of different purposes. Most of these codesets are local to a geographical region, local to individual manufacturers of clinical systems, or tied to nationally mandated statistical data collections. The proliferation of codesets and the variability in their complexity combine in a way that inhibits semantic interoperability if they need to co-exist in a given clinical system.
There are many codesets that have been developed and intended for very specific data fields, and which are exhaustive for the intended field. They may have only 2, 3, 17, or at most several dozen terms to cover. Examples of such simple codesets include administrative gender:
Code | Meaning |
---|---|
M | male |
F | female |
U | undifferentiated |
Other codesets are larger, but still often only flat lists of terms and corresponding codes.
HL7 v3, for example, has some 250 'supported vocabularies', about half of which are managed by HL7, and half managed by organisations external to HL7. Even among those supposedly mostly flat codesets internal to HL7, many are not stable from release to release in any sense, have contradictory definitions in different places, and have a plethora of different code forms and inconsistent information. Some codesets have a specialisation hierarchy implicit in the set. Some codesets have a specialisation hierarchy encoded into their codes. Some have a combination of both approaches. If HL7 International cannot manage its own codesets consistently and effectively, then how can systems trying to parse incoming HL7-based messages ever be expected to cope?
For many of these codesets, it is left to local implementers, national standards bodies, vendors and possibly even clinicians and others to decide if the codeset is appropriate for their scope of implementation. If not, then they must decide to either replace the set, modify it, or augment it with the codes and corresponding terms peculiar to their scope. The ongoing synchronisation often becomes an impossible treadmill of reaction to change, well beyond the control of the clinicians and clinical institutions trying to provide health care based on such an ad hoc approach.
A small number of large, well designed terminologies offer much for decision support. However, terminologies of this ilk, such as SNOMED CT, are much more complex than simple flat codesets. One single release of SNOMED CT has millions of codes, pointing to concepts, terms and relationships. Its codes form a multipurpose polyhierarchy of concepts and terms, with multiple relationship types. It has mechanisms for extension, multi-language translations and subsetting for defined purposes. Its compositional grammar, as already mentioned, allows for terminological expressions to be constructed as needed, based on the concepts available. Even without SNOMED CT's significant and documented problems, implementing and harnessing all this power in any one real system is a profound challenge. Deploying it broadly and effectively across a range of systems to aid semantic interoperability takes the challenge to even greater heights.
As a general principle, codes are for computers, not humans. Codes should work behind the scenes and not be exposed to users, particularly busy clinicians. They should not be deliberately exposed, unless absolutely necessary, to those who are only peripherally likely to understand their meaning, such as software developers or data modellers. Writing standards and specifications for humans that are littered with abbreviations and codes often dreamed up on a whim, that have to be understood, transcribed, embedded in program code, put into test scripts and test specifications and otherwise discussed and manipulated, and above all remembered, is fraught with danger. It is not sound engineering practice. It dramatically narrows the pool of experts who can understand and use the specifications, and risks misunderstanding and transcription errors and the resultant clinical errors that can ensue.
The above notwithstanding, there are places where codes and humans legitimately need to meet. These situations are where textual descriptions are too awkward to use. Common examples in daily life are things like postal codes and bus codes. It is far simpler to refer to postcode "5068" than to "that area bounded by the Sunnybank River to the North, Franklin Bridge, Rainsford Rd and Elm St. to the east, holes 6-14, 17 and 18 of the Royal Plunkett Golf Course to the South, and ....", or to bus "J1E" instead of "the bus that departs from the corner of Edmund St. Walkerville and ...".
In health IT, examples might be genes and gene sequences, tumour staging, or the classification of diseases. In these circumstances, it is often easier for the humans involved to refer to these concepts by codes. Where codes are to be used by humans, it is sensible for the codes to carry additional meaning or representational hints to help the humans disambiguate the codes and reduce the chance of error during human processing and transcription. Thus in the bus code "J1E", the 'E' might denote express. Similarly, where codes are to be used by humans, the shorter the code, the lower the likelihood of error. In Australian hospitals, it is common practice for a patient's identity to be verbally cross-checked by nurses prior to procedures, including administration of some drugs. This cross check usually uses the hospital's own Unit Record Number for the patient, which usually has few digits and so is relatively human-friendly.
It is probably the history of human abbreviations in the early days of coding and messaging that has led to the proliferation of a vast array of semi-interpretable "codes" creeping into what should only be computer-processable identifiers of many codesets. Even in the most recent versions of HL7, these are variously and conflictingly referred to as "mnemonics", "codes", "conceptIds".
EHR systems impose requirements on data far exceeding those required in messages. Data may come from a vast array of sources, including direct input by humans, messages from laboratories, pharmacies etc., and referral and other documents from other healthcare providers. Data may have to be available for decades. Data may have to be processed into different forms for different users and purposes - e.g. aggregation across time and other variables. Data may need to be searched using search criteria expressed at a variety of levels of detail. Data may have to be presented in different forms to humans whose medical knowledge varies considerably.
Codes can help in this process, but they can also hinder. They can hinder because they are always at least one step away from the human meaning conveyed by the code, and so their processing is critically dependent a) on the availability of the code system that can provide the link to the term or meaning of the code, b) on the quality of the code system and underlying terminology, c) on the capability of the processing system to deal with problems when the links can't be resolved or generate conflicts, and d) on the ability of the EHR system to handle evolution of the coding scheme or terminology over time.
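The dependency on an available code system, and the need to cope when a link cannot be resolved, can be illustrated with a small sketch. The `CodedValue` class, the `local-gender` system name and the table contents are assumptions made for this example; the point is only that keeping the original text beside the code gives the receiver something to fall back on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodedValue:
    system: str                          # which code system the code belongs to
    code: str                            # the code itself
    original_text: Optional[str] = None  # what the author actually saw or typed


# Hypothetical code tables available to this particular receiving system.
CODE_TABLES = {
    ("local-gender", "M"): "male",
    ("local-gender", "F"): "female",
}


def display(value: CodedValue, tables=CODE_TABLES) -> str:
    """Resolve a coded value to human-readable text, degrading gracefully."""
    term = tables.get((value.system, value.code))
    if term is not None:
        return term                      # the link to the term was available
    if value.original_text:
        return value.original_text       # fall back to the stored text
    return f"[unresolved {value.system}:{value.code}]"


print(display(CodedValue("local-gender", "M")))
print(display(CodedValue("other-system", "9", "unknown sex")))
print(display(CodedValue("other-system", "9")))
```

The last call is the failure mode the paragraph warns about: with neither the code table nor the original text, the receiving system can only report that it does not know what was meant.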
When small termsets such as gender are used within a given language realm, what possible gain is there in replacing the value "male" by a code such as "1", or "M"? There certainly is plenty to lose! Why should every information system that receives such a code have to deal with this? Humans can understand "male" easily. Computers can process "male" easily. Humans cannot understand the code "1" in any meaningful way! Computers cannot process "1" in any meaningful way, other than perhaps saying that "1" (male) is less than "2" (female)! This may be true in one sense, but is this the intention of the sender of such coded data - to obfuscate and compromise patient safety? The code is absolutely useless without access to the accompanying meaning - e.g. via some code table. Who can guarantee that that access will always be available? Why place such a burden on every clinical system needing to process gender for absolutely no benefit? It is far more important to give the clinicians definitional information about the meaning of appropriate terms in the particular context of the data field. Does this refer to administrative or physical gender?
The more small codesets that information systems have to deal with, where disambiguation of multiple-meaning terms is not required, the less likely we are to achieve a reasonable level of useful information exchange. We should not be blindly advocating that all data be coded. We should stop and think of the ramifications of such recommendations.
One ramification is that we are forced to build code maps between many different standards and coding systems in order to meet the coding requirements demanded by each system. There is no longer any opportunity to consider the importance of insisting on an appropriate code for each data item. Instead, software developers and implementers and message "integrators" are left trying to force square pegs into round holes. Continuing the simple gender example above, we have many examples such as the following internet discussion forum snippet:
> We have a case where a HIS system has added some definitions to their
> possible values for patient sex. They are:
>
> M = male
> F = female
> T = transgender
> U = Undifferentiated
> ? = Unknown
>
> However, DICOM only supports:
>
> F - Female
> M - Male
> O - Other
>
> And I found this table in HL7 2.4:
>
> 3.4.2.8 PID-8 Administrative sex (IS) 00111
> Definition: This field contains the patient's sex. Refer to
> User-defined Table 0001 - Administrative sex
> for suggested values.
> User-defined Table 0001 - Administrative sex
> Value Description
> F Female
> M Male
> O Other
> U Unknown
> A Ambiguous
> N Not applicable
Evaluation of a specific terminology or codeset is often undertaken in isolation from its use. Criteria such as Jim Cimino's 12 desiderata [CIM1998] are often considered for this task. Cimino [CIM2006] further augmented these structural requirements with desirable characteristics to support the purpose of a terminology, citing the following:
Whilst not exhaustive, these useful criteria can also be used to judge the utility of the codes used to underpin the functioning of the terminology. But judging a terminology, and the coding thereof, should be undertaken in the context of the entire clinical information system(s), the clinicians and other users of the data, the flows of the data from system to system, and all of the other information and terminology components that are also involved, both now and into the future. A truly daunting assessment task.
The consequence of requiring so many codes and coding systems is that comprehensive electronic health record systems need many code tables for each implementation; they need to maintain versions of those code tables; they need the capability to process the code tables; they need to process the versioning of the code tables; they often need mapping tables to map between code tables; they need the capability to parse and interpret and map based on the peculiarities of both the source and target coding systems; they may need to hold multiple versions of the mapping tables; they need the capability to process the versioning of the mapping tables; they need the capability to map between different versions of different mapping tables. And in almost every case, the maps are not one to one. Compromises and arbitrary decisions are made on an institution by institution, map by map and code by code basis.
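To make "the maps are not one to one" concrete, the sketch below takes the patient sex value sets quoted in the forum snippet earlier and applies one possible mapping from the HIS codes to the DICOM codes. The value sets come from that snippet; the mapping itself (sending T, U and ? to O) is an arbitrary assumption of exactly the kind the text describes, not a recommendation.

```python
# Value sets as quoted in the forum snippet.
HIS = {"M": "male", "F": "female", "T": "transgender",
       "U": "undifferentiated", "?": "unknown"}
DICOM = {"F": "Female", "M": "Male", "O": "Other"}

# Hypothetical mapping decision made by one implementer.
HIS_TO_DICOM = {"M": "M", "F": "F", "T": "O", "U": "O", "?": "O"}


def collisions(mapping: dict) -> dict:
    """Return target codes that receive more than one source code."""
    grouped = {}
    for source, target in mapping.items():
        grouped.setdefault(target, []).append(source)
    return {target: sorted(sources)
            for target, sources in grouped.items() if len(sources) > 1}


print(collisions(HIS_TO_DICOM))
```

Three distinct source meanings collapse into the single target code "O", so the original value cannot be recovered from the mapped one. A different institution could reasonably choose a different mapping, which is how maps come to differ institution by institution.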
Another cost of the variability in the formulation of coding systems is the difficulty in providing generic tools. In Australia, there are published examples informing general practitioners how to access key hidden information locked in their patient records - information important for managing their patients' health - by referencing arcane codes and SQL queries peculiar to one particular system, using one particular coding system at one particular point in time.
SELECT CM_PATIENT.PATIENT_ID, CM_PATIENT.SURNAME, CM_PATIENT.FIRST_NAME,
       MAX(MD_PATHOLOGY_ATOM.RESULT_DATE) AS MaxResultDate,
       MAX(VISIT.VisitDate) AS MaxVisitDate
FROM MD_PATHOLOGY_ATOM
RIGHT OUTER JOIN MD_PATHOLOGY
    ON MD_PATHOLOGY_ATOM.PATHOLOGY_ID = MD_PATHOLOGY.PATHOLOGY_ID
RIGHT OUTER JOIN CM_PATIENT
    ON MD_PATHOLOGY.PATIENT_ID = CM_PATIENT.PATIENT_ID
FULL OUTER JOIN VISIT
    ON CM_PATIENT.PATIENT_ID = VISIT.PatientNo
WHERE DateDiff(yy, VISIT.VisitDate, GetDate()) < 1
  AND (CM_PATIENT.DECEASED_DATE IS NULL)
  AND (CM_PATIENT.GENDER_CODE = 'M')
  AND (DATEDIFF(yy, CM_PATIENT.DOB, GETDATE()) > 50)
  AND (DATEDIFF(yy, CM_PATIENT.DOB, GETDATE()) < 74)
  AND (MD_PATHOLOGY_ATOM.LOINC = '2857-1' OR MD_PATHOLOGY_ATOM.LOINC IS NULL)
GROUP BY CM_PATIENT.PATIENT_ID, CM_PATIENT.SURNAME, CM_PATIENT.FIRST_NAME
In the above example, one keen clinician wanting to send recall notices to patients deemed to be candidates for Prostate Specific Antigen (PSA) tests has delved into the bowels of his patient records, determined the relevant database tables, determined that LOINC has been used to code test names, determined the LOINC code (from some 40,000+ codes) historically used by the particular pathology lab in their HL7 message for the PSA test, determined how gender is coded in this specific clinical system, built and run the requisite SQL query, and hopes that nothing changes next time the query is run! A great piece of detective work, but clearly neither an acceptable nor a sustainable way to empower clinicians with usable, semantically interoperable electronic health records that meet their requirements.
There are still some significant areas related to representing clinical concepts that need further, substantial research and which may affect the decisions we make about the coding of data, including:
So, at the end of this short treatise, how does the balance sheet look? Is the answer to code, or not to code?
For want of a better cliche, "semantic interoperability is a journey, not a destination". It is a long, slow, expensive journey that will probably never end. As with most journeys, it is cheaper and wiser to make the right steps, at the right time, in the right order. It is sensible to avoid steps that will later need retracing. The journey should not start with a mad rush to "code" data as fast as we can, particularly if it means every system is beholden to a raft of separate, inconsistent coding schemes. Far better to apply some sound architectural principles and sufficient engineering to ensure that as far as possible we take steps in the right direction and take steps that we won't inevitably have to retrace.
----
[CIM1998] Cimino J, Desiderata for Controlled Medical Vocabularies in the Twenty-First Century, Methods of Information in Medicine, 1998
[CIM2006] Cimino J, In defense of the Desiderata, http://www.dbmi.columbia.edu/cimino/Publications/2006%20-%20JBI%20-%20In%20Defense%20of%20the%20Desiderata.pdf
[CEU2006] Ceusters W and Smith B, Strategies for Referent Tracking in Electronic Health, Journal of Biomedical Informatics, 2006:39(3):288-98
[MAR2008] Markwell D, Terminology Requirements and Principles, NHS publication, 2008 (http://www.ehr.chime.ucl.ac.uk/download/attachments/3375121/TerminologyBindingRequirementsAndPrinciples_v1.0.pdf)
Rector A et al, and various publications at http://www.semantichealth.org/
openEHR Architecture Overview, http://www.openehr.org/releases/1.0.1/architecture/overview.pdf