Description of the cartesian product problem.

So far, we’ve had a few attempts to describe the nature of the cartesian product problem, as it is called by many members of the SEC and the openEHR community. This page in Discourse has useful pointers to these past discussions.

I don’t want to produce yet another attempt at formally describing the problem. Instead, I’m writing this page to clarify why we’ve been unable to provide a formal description of the problem as per SPECQUERY-9 and maybe suggest a way forward so that we can overcome this obstacle.

What’s the problem?

Users of AQL are getting query results with some extra rows, which they do not expect. The users then do one of these two things:

Suggest that the implementation is doing the wrong thing
Trust the implementation and ask for an explanation for the results they did not expect

Responding to either of these actions requires that an implementer can offer a well-defined model of query semantics and query execution semantics. Ideally, all openEHR implementations should offer the same pair of models, but that’s not where we are now.

What makes this problem a tricky one to solve is that whether or not the problem exists is subjective. It is entirely possible that the user is getting incorrect results according to the query semantics the user has in mind, which may not be the query semantics the implementer has in mind. Without shared and agreed semantics for queries and their execution, how can we know if the user is right or wrong?

Therefore, the only description of the so-called “cartesian product problem” we can offer is “the user is getting rows in a result set that they did not expect”.

However, it is worth noting that the name of the problem, at least accepted among vendors, hints at the shared perception of its nature: a cartesian product emerging as a result of the query execution. It is not a great stretch of imagination to suggest that we are comfortable with a relational algebra based view of query semantics, at least to some extent, because we’re borrowing a concept from this view to name this problem.
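To make the relational-algebra intuition concrete, here is a minimal Python sketch of how extra rows can appear. It does not describe how any openEHR implementation evaluates AQL: the data layout, the names `composition`, `cross_product_rows` and `paired_rows`, and both evaluation strategies are assumptions made purely for illustration. One hypothetical composition holds two independent lists of values, and the same "select both" request is read in two different ways.

```python
from itertools import product, zip_longest

# Hypothetical data: one composition with two independent multi-valued paths.
composition = {
    "systolic": [120, 135],
    "medication": ["aspirin", "ibuprofen", "paracetamol"],
}


def cross_product_rows(comp):
    """Relational-style reading: every combination of the two value lists."""
    return list(product(comp["systolic"], comp["medication"]))


def paired_rows(comp):
    """Alternative reading: pair values by position, padding with None."""
    return list(zip_longest(comp["systolic"], comp["medication"]))


rows_a = cross_product_rows(composition)
rows_b = paired_rows(composition)

# The two readings disagree on how many rows the "same" query returns.
print(len(rows_a), len(rows_b))
```

Neither result is wrong in itself. A user who has the second reading in mind will see the first result as containing unexpected rows, which is exactly the situation described above.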

Rather than attempting to define a formal model based on this vague sign of sympathy for relational algebra, I’ll just present it as strong evidence that all of us are subconsciously looking for a formal model.

How can we solve the problem?

We define a formal model of query semantics and execution. Such a model should satisfy the following criteria:

It should define behaviour in a consistent way, i.e. its semantics should not change with the logical query operators, functions, or the underlying information model references used in the query.
Its information model scope should be clearly stated. Does it apply to demographics? Does it generalise to any information model?
It should be documented for implementers and users, considering the characteristics of the audience.
<more points are welcome>

TODO: any other suggestions to solve the problem are welcome, if anybody can think of one.

What are the risks?

The following are the risks associated with defining a formal model as suggested above.

It’ll almost certainly break some or all of the AQL implementations, to varying degrees.
This means we’ll have to commit to spending money as vendors.
We may have to make some tough choices regarding scope.
We may have to consider limiting the scope of AQL if we cannot find a way of supporting everything under the openEHR umbrella. Different vendors and stakeholders may have different visions, and consequently different views of scope for AQL. We need to have a discussion and define a clear scope.
<more points are welcome>

Next steps

We would probably like to have a discussion in SEC and attempt to identify the expectations, goals and scope based on input from SEC members. I think if the previous attempts and discussions to solve this problem delivered anything, it is the realisation that we do not seem to have a shared view or vision of AQL. That’s what we need to establish first.