Terminology Identification

The problem

There are various ways of identifying terminologies. Two key categories are machine identifiers, such as OIDs and UUIDs, and human-readable identifiers such as "SNOMEDCT". Here I am concerned with the latter. We need a reliable system of identifiers for terminologies that can be used in specifications, documents, and data. The identifiers need to be:

  • human readable
  • computable (i.e. mutual uniqueness is assured)
  • able to take account of versions / releases in a computable way
  • interpretable without dereferencing (unlike OIDs), so that it is clear which terminology is being identified
  • controlled

The first question we have to ask is 'what is a terminology', or more practically, when are two 'terminologies' distinct, and when are they the same? Similarly, when are two physical 'terminologies' in fact just two variants of the same one? Part of the answer should rely on solid design principles, including never re-using codes. However, to be practical, we might accept a couple of violations in a very large terminology release with respect to a previous one.
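One way to make the 'never re-use codes' principle testable is to compare two releases and count the codes whose meaning appears to have changed. The Python sketch below is only an illustration: it assumes a release can be represented as a mapping from code to a description, and it treats a changed description as a sign of a changed meaning, which is a crude test that a real terminology would need to replace with something stronger. The function names and the max_violations threshold are hypothetical, and the codes in the example are invented.

```python
from typing import Dict, List


def reused_codes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return the codes present in both releases whose description differs.

    Assumption: a differing description stands in for a changed meaning.
    """
    shared = old.keys() & new.keys()
    return sorted(code for code in shared if old[code] != new[code])


def is_same_terminology(old: Dict[str, str], new: Dict[str, str],
                        max_violations: int = 2) -> bool:
    """Accept `new` as a release of the same terminology as `old` if at most
    `max_violations` codes were re-used (the 'couple of violations' tolerance)."""
    return len(reused_codes(old, new)) <= max_violations


# Invented example data, not taken from any real terminology.
release_a = {"X1": "alpha", "X2": "beta", "X3": "gamma"}
release_b = {"X1": "alpha", "X2": "changed", "X4": "delta"}
print(reused_codes(release_a, release_b))
print(is_same_terminology(release_a, release_b))
```

Codes that are added or retired between releases do not count as violations here; only a code that survives with a different description does.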

ICD and variants

A seemingly simple case is the editions of ICD, e.g. ICD9, ICD10, ICD11 - these are usually considered separate terminologies, since they are issued years apart and have significant differences. But we also have ICD9CM (US clinical modifications), ICD10-AM (Australian modifications) and many other variants. Can a code within ICD10-AM be considered to have the same meaning (i.e. to classify the same real-world phenomena) as the same code in the original ICD10?

SNOMED CT

SNOMED CT makes things even more complicated. It seems clear that SNOMED CT (20090101) and SNOMED CT (20090701) are two releases of the 'same' terminology, and that meanings don't change. But SNOMED CT also has the concept of 'modules', i.e. variants and extensions, such as 'the Australian extension', 'UK extension' and so on. The module concept can (apparently) be taken to any level of specificity, e.g. 'Australian pathology'. How are these extensions identified? For practical purposes we need to know if a term comes from the Australian extension or the international release. How do we identify the Australian extension? Is it 'SNOMED CT-AU' or somesuch?

Terminology identification systems

The NLM Versioned Source Identifiers List

NLM has a list of such identifiers called Versioned Source Abbreviation (VSAB) identifiers, such as 'ICD9CM_2010', 'SNOMEDCT_20090101' and so on, for all terminologies known in the UMLS.

These identifiers are reliable and human readable, but it is not clear whether they properly take account of versions. For example, two releases of SNOMEDCT are just two different strings like 'SNOMEDCT_2009_01_01' and 'SNOMEDCT_2009_01_07'. These are two releases of the same terminology, and terms in data from each release should be comparable. Different variants of ICD10 exist, e.g. 'ICD10AM_2000', 'ICD10_1998' and so on. It appears that the part before the first underscore may be a reliable id of the terminology, and the remaining part can be taken as a release id. If this can be trusted, then ids from this list encountered as part of coded data can be safely processed. In openEHR we use the form 'SNOMEDCT(20090101)', which makes the distinction between terminology id and release id easier to see, but the effect should be the same.
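The splitting rule suggested above can be sketched in Python as follows. This is an illustration only, resting on the unverified assumption stated in the text: everything before the first underscore is the terminology id, and the rest is the release id. The names TerminologyId, parse_vsab, parse_openehr and same_terminology are hypothetical and are not part of UMLS or openEHR.

```python
import re
from typing import NamedTuple


class TerminologyId(NamedTuple):
    """A terminology id plus an optional release id (hypothetical helper type)."""
    name: str
    release: str = ""


def parse_vsab(vsab: str) -> TerminologyId:
    """Split a VSAB-style string at the first underscore.

    Assumption (from the text, not checked against the full NLM list):
    the part before the first underscore identifies the terminology and
    the remainder identifies the release.
    """
    name, _, release = vsab.partition("_")
    return TerminologyId(name, release)


# openEHR-style form: a name, then an optional release id in parentheses.
_OPENEHR_FORM = re.compile(r"^([^()]+?)(?:\(([^()]*)\))?$")


def parse_openehr(text: str) -> TerminologyId:
    """Parse the 'SNOMEDCT(20090101)' form mentioned in the text."""
    match = _OPENEHR_FORM.match(text)
    if match is None:
        raise ValueError(f"not a valid terminology id: {text!r}")
    return TerminologyId(match.group(1), match.group(2) or "")


def same_terminology(a: TerminologyId, b: TerminologyId) -> bool:
    """Two ids denote the same terminology if the names match; releases may differ."""
    return a.name == b.name
```

Under this rule 'SNOMEDCT_2009_01_01' and 'SNOMEDCT_2009_01_07' parse to the same terminology name with different release ids, while 'ICD10AM_2000' and 'ICD10_1998' parse to different names and so are treated as distinct terminologies. Whether that is the right answer for ICD variants is exactly the open question raised earlier.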