Idea: reduce all intervals in the OPT XML to a String notation like [0..*)

Raise

Analysis

Description

In OPT XML, thousands of bytes are used just to represent integer intervals for occurrences, existence and cardinality. Note that those elements appear on all C_OBJECT and C_ATTRIBUTE nodes in the template.

We can reduce the size and improve the overall readability of OPT XML if, instead of having this:

<existence>
  <lower_included>true</lower_included>
  <upper_included>true</upper_included>
  <lower_unbounded>false</lower_unbounded>
  <upper_unbounded>false</upper_unbounded>
  <lower>0</lower>
  <upper>1</upper>
</existence>

it is simplified to this:

<existence>[0..1]</existence>

Note the first existence content is 186 bytes and the second is 6. Multiply that by the number of nodes in an OPT, maybe 200, and that's about 37 KB that could be saved on every OPT without losing semantics: it's just the XML notation, and it doesn't affect the OPT/AM model. With more nodes, the size grows linearly by around 180 bytes per node.

[ = lower included true
( = lower included false
] = upper included true
) = upper included false

0..0 = not allowed, lower and upper bounded
0..1 = optional, lower and upper bounded
0..5 = optional, multiple, lower and upper bounded
2..5 = required, multiple, lower and upper bounded
0..* = optional, lower bounded, upper unbounded
1..* = required, lower bounded, upper unbounded
*..* = lower unbounded, upper unbounded
(1..*) = required but lower not included, so it's equivalent to [2..*): since lower is an integer, "not included" means the interval starts at the limit + 1.

Since the new notation is just a text node, the XSD won't be able to verify its contents.
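
To make the notation above concrete, here is a minimal sketch of how a consumer might read it back into the six values of the verbose form. The names `IntInterval`, `parse_interval` and `normalize` are hypothetical and not part of any openEHR library; the sketch assumes non-negative integer bounds and `*` for an unbounded side.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical model: mirrors the elements of the verbose <existence> form.
@dataclass(frozen=True)
class IntInterval:
    lower: Optional[int]   # None means lower_unbounded
    upper: Optional[int]   # None means upper_unbounded
    lower_included: bool
    upper_included: bool

    @property
    def lower_unbounded(self) -> bool:
        return self.lower is None

    @property
    def upper_unbounded(self) -> bool:
        return self.upper is None

# Bracket, bound, "..", bound, bracket; a bound is digits or "*".
_INTERVAL = re.compile(r"([\[(])(\d+|\*)\.\.(\d+|\*)([\])])")

def parse_interval(text: str) -> IntInterval:
    """Parse a string such as "[0..1]" or "[0..*)" (assumed syntax)."""
    match = _INTERVAL.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an interval expression: {text!r}")
    open_br, lo, hi, close_br = match.groups()
    lower = None if lo == "*" else int(lo)
    upper = None if hi == "*" else int(hi)
    if lower is not None and upper is not None and lower > upper:
        raise ValueError(f"lower bound exceeds upper bound: {text!r}")
    return IntInterval(lower, upper, open_br == "[", close_br == "]")

def normalize(interval: IntInterval) -> IntInterval:
    """For integers, an excluded bound equals the included bound shifted by one."""
    lower, upper = interval.lower, interval.upper
    if lower is not None and not interval.lower_included:
        lower += 1
    if upper is not None and not interval.upper_included:
        upper -= 1
    # After shifting, bounded sides are included and unbounded sides are open.
    return IntInterval(lower, upper, lower is not None, upper is not None)
```

With this sketch, `normalize(parse_interval("(1..*)"))` and `normalize(parse_interval("[2..*)"))` compare equal, which is the equivalence described in the list above.
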
An alternative that I think can be verified by a regex match or an enumeration in the XSD is to use an attribute:

<existence value="[0..1]" />

Which is even shorter than:

<existence>[0..1]</existence>

Sample XSD:

<xs:attribute name="value">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:pattern value="__REGEX_HERE__" />
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>

OR:

<xs:simpleType name="intervalExpression">
  <xs:restriction base="xs:string">
    <xs:enumeration value="[0..1]" />
    .... __ALL_OPTIONS_HERE__
  </xs:restriction>
</xs:simpleType>

<xs:attribute name="value" type="intervalExpression" />
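
As an illustration of the kind of check a pattern facet could perform on the `value` attribute, here is a sketch in Python. The pattern is only a candidate written for this example, not the regex intended by the proposal, and XSD's regex flavor differs from Python's in places, so it would need checking against a real schema validator. `is_valid_interval` and `check_existence` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# Candidate pattern (an assumption): bracket, bound, "..", bound, bracket.
INTERVAL_PATTERN = r"[\[(]([0-9]+|\*)\.\.([0-9]+|\*)[\])]"

def is_valid_interval(value: str) -> bool:
    # fullmatch mimics the implicit anchoring of an xs:pattern facet.
    return re.fullmatch(INTERVAL_PATTERN, value) is not None

def check_existence(xml_text: str) -> bool:
    """Validate the value attribute of a single <existence value="..."/> element."""
    element = ET.fromstring(xml_text)
    return is_valid_interval(element.get("value", ""))
```

A pattern like this only checks the shape of the string. It would still accept values such as `[5..2]`, so range checks would remain the job of the code that maps the value into the OPT model.
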

Activity


Pablo Pazos February 14, 2024 at 9:30 PM

I agree kilobytes are cheap; size isn’t the important part of this issue, though the change does affect it.

The point is: what is expensive is manually reviewing templates, which could have 100s of nodes, and about 80% of those have the same existence/occurrences elements, each taking 8 lines that could be represented as 1. So it’s more about readability.

Consider an OPT with tens of these:

<existence>
  <lower_included>true</lower_included>
  <upper_included>true</upper_included>
  <lower_unbounded>false</lower_unbounded>
  <upper_unbounded>false</upper_unbounded>
  <lower>0</lower>
  <upper>1</upper>
</existence>

Then an OPT with tens of these:

<existence>[0..1]</existence>

I generally don’t like custom micro-syntaxes; in fact I think the syntaxes for the C_DV_ORDINAL and the C_DV_QUANTITY in ADL make implementation more complicated, because the same DOMAIN constraints could be expressed with the standard ones. But in this case, it’s all wins IMHO.
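
For anyone who wants to compare the two forms on their own templates, here is a rough sketch of collapsing a verbose interval element into the one-line notation. `compact_interval` is a hypothetical helper; it assumes un-namespaced child elements as in the example above, and treats a missing boolean flag as false.

```python
import xml.etree.ElementTree as ET

def compact_interval(element: ET.Element) -> str:
    """Render a verbose existence/occurrences element as e.g. "[0..1]"."""

    def flag(name: str) -> bool:
        # Assumption: an absent flag element counts as false.
        return (element.findtext(name) or "false").strip() == "true"

    def bound(name: str) -> str:
        if flag(f"{name}_unbounded"):
            return "*"
        text = element.findtext(name)
        if text is None:
            raise ValueError(f"missing <{name}> on a bounded interval")
        return str(int(text))  # int() also rejects non-numeric content

    open_bracket = "[" if flag("lower_included") else "("
    close_bracket = "]" if flag("upper_included") else ")"
    return f"{open_bracket}{bound('lower')}..{bound('upper')}{close_bracket}"


verbose = """
<existence>
  <lower_included>true</lower_included>
  <upper_included>true</upper_included>
  <lower_unbounded>false</lower_unbounded>
  <upper_unbounded>false</upper_unbounded>
  <lower>0</lower>
  <upper>1</upper>
</existence>
"""
compact = compact_interval(ET.fromstring(verbose))
```

Running it over every existence, occurrences and cardinality interval in a real OPT would give a direct measure of how many lines the notation saves.
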

I also know OPTs are meant to be read by programs, not people, and I would be glad if I didn’t need to read them manually, but until all the modeling tools out there spit out exactly the same XML for the same OPT, I will be manually reviewing these to be sure they don’t have extra invalid nodes or custom stuff from vendors, like the is_integral that is added by some editors, which shouldn’t be there. /end_rant

Matija Polajnar February 14, 2024 at 10:09 AM

Although it doesn’t really affect us, I’m with Thomas on this: standard is fine and preferable, the few kilobytes are cheap.

Pablo Pazos February 10, 2024 at 8:45 PM

“But they are not interoperable with generic tools for those syntaxes, since obviously XML stream processors”

That would mean some functionalities in the system depend on the XML Object Model, which adds a lot of extra complexity to a system that would be simpler if it only had to handle the OPT Object Model. I did that in the past with many functionalities in my architecture, including the LOCATABLE instance generators from OPTs, and it is bad design (it should be an antipattern). Then I changed it and isolated the XML Object Model so it is used only inside the parsers, and used only openEHR Model instances in the rest.

So I don’t think the XML parser not knowing about the value (it’s just a string for the parser) is an issue, since the value should end up being mapped to the OPT model either way.

In our case, what we do is:

OPT (XML String) = (parse) => XML Object Model = (map) => OPT Object Model

Then all functions interact with the OPT Object Model instance, not with the XML Object Model, which is only used by the OPT parser internally. That is independent from any parser and technology you want to use to parse, also independent from any XML or JSON Object Model (the result of the string parsing ends up in those models).
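
The parse and map steps described above could look roughly like the sketch below, where only the mapper touches the XML Object Model and it accepts either interval form. The classes `Interval` and `AttributeNode` and the function names are hypothetical stand-ins for illustration, not the actual implementation; the element names `rm_attribute_name` and `existence` are assumed to be un-namespaced.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Interval:
    lower: Optional[int]   # None means unbounded
    upper: Optional[int]   # None means unbounded
    lower_included: bool
    upper_included: bool

@dataclass(frozen=True)
class AttributeNode:       # hypothetical stand-in for a C_ATTRIBUTE
    rm_attribute_name: str
    existence: Interval

def _bound(token: str) -> Optional[int]:
    return None if token == "*" else int(token)

def _from_compact(text: str) -> Interval:
    text = text.strip()
    lower, upper = text[1:-1].split("..")
    return Interval(_bound(lower), _bound(upper), text[0] == "[", text[-1] == "]")

def _from_verbose(element: ET.Element) -> Interval:
    def flag(name: str) -> bool:
        return (element.findtext(name) or "false").strip() == "true"
    lower = None if flag("lower_unbounded") else int(element.findtext("lower"))
    upper = None if flag("upper_unbounded") else int(element.findtext("upper"))
    return Interval(lower, upper, flag("lower_included"), flag("upper_included"))

def map_interval(element: ET.Element) -> Interval:
    # Child elements present: verbose form. Otherwise: compact text node.
    if len(element) > 0:
        return _from_verbose(element)
    return _from_compact(element.text or "")

def load_attribute(xml_text: str) -> AttributeNode:
    root = ET.fromstring(xml_text)            # parse: string -> XML object model
    existence = root.find("existence")
    if existence is None:
        raise ValueError("attribute has no <existence>")
    name = root.findtext("rm_attribute_name") or ""
    return AttributeNode(name, map_interval(existence))   # map -> OPT object model
```

Code downstream of `load_attribute` sees only `AttributeNode` and `Interval`, so it cannot tell which XML form was used.
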

Though, this is just an idea because I’m tired of scrolling through OPTs where most of the vertical space is taken up by occurrences and existence XML objects.

Thomas Beale February 10, 2024 at 8:12 PM

This problem is what ODIN (which we can consider obsolete these days) was designed for, and it is one of the huge problems with XML, JSON, and even OWL: they don’t have basic extended primitive types like Interval<Ordered> etc., instances of which could be represented with very few characters. We have in the past produced versions of JSON (and I thought also XML) output that have these micro-syntax strings to represent things like intervals, saving hundreds of lines in files. But they are not interoperable with generic tools for those syntaxes, since obviously XML stream processors, or the Jackson framework (for JSON serialise/deserialise), don’t know about them. Personally I’ve given up on using anything ‘smart’ in these dumb syntaxes - it’s better to just use them in the standard way.

A way to do what you want is what we do with various parts of openEHR, including AOM and BMM - there is a P_XXX version of the classes that converts e.g. Interval<Integer> to a string like “0..1” etc, which is of course valid XML or JSON. In this case, you are instantiating P_XXX objects, and you then need to call some function that traverses the structure and generates the original XXX objects. The logic to do that is usually simple, and Archie, ADL WB and other tools use it to simplify the serial form of archetypes, for example.
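
A minimal sketch of that P_XXX idea, under loud assumptions: `PCObject`, `CObject`, `Interval` and the two functions below are invented for illustration and are not the actual classes used by Archie or ADL WB. The persisted form keeps the interval as a plain string, so a generic serialiser can handle it, and a traversal rebuilds the in-memory objects.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class Interval:
    lower: int
    upper: Optional[int]    # None means unbounded ("*")

@dataclass
class PCObject:             # hypothetical persisted (P_) form
    rm_type_name: str
    occurrences: str        # e.g. "0..1" or "1..*"
    children: List["PCObject"] = field(default_factory=list)

@dataclass
class CObject:              # hypothetical in-memory form
    rm_type_name: str
    occurrences: Interval
    children: List["CObject"] = field(default_factory=list)

def interval_from_string(text: str) -> Interval:
    lower, upper = text.strip().split("..")
    return Interval(int(lower), None if upper == "*" else int(upper))

def to_runtime(persisted: PCObject) -> CObject:
    """Traverse the P_ structure and generate the original objects."""
    return CObject(
        persisted.rm_type_name,
        interval_from_string(persisted.occurrences),
        [to_runtime(child) for child in persisted.children],
    )

tree = PCObject("OBSERVATION", "0..1", [PCObject("ELEMENT", "1..*")])
serialised = json.dumps(asdict(tree))   # the string field needs no custom handling
runtime = to_runtime(tree)
```

The trade-off is the one described above: the serial form stays valid for generic tools, at the cost of one extra conversion pass after loading.
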

Details

Reporter

Pablo Pazos

Components

ITS - XML Schemas

Priority

Lowest

Created February 10, 2024 at 5:28 PM

Updated February 14, 2024 at 9:30 PM
