Detailed review of DIRECTORY operations on the Service Model
Description
Activity
Pablo Pazos October 1, 2019 at 12:54 AM
This is what I proposed on HiGHmed
Current operations are:
has_directory(ehr_id): Boolean
has_path(ehr_id, path): Boolean // path from the root EHR.directory; also, the idea is that this path is defined by archetypes (this is another issue mentioned below)
create_directory(ehr_id, folder) //root directory
get_directory(ehr_id): FOLDER // this might need to be VERSION<FOLDER>
get_directory_at_time(ehr_id, time): FOLDER // this might also be VERSION<FOLDER>
update_directory(ehr_id, folder) // folder is the full EHR.directory modified
delete_directory(ehr_id)
has_directory_version(ehr_id, version_uid): Boolean
get_directory_at_version(ehr_id, version_uid): FOLDER // this might also be VERSION<FOLDER>
get_versioned_directory(ehr_id): VERSIONED_FOLDER
Issues:
To update, the client has to get the full EHR.directory structure, make the changes on the client side (so the management happens on the client), and then commit the whole structure with the modifications. This adds a lot of complexity on the client side and might not be the most natural way of managing an EHR.directory. (Luis: agree. FOLDERs in my view should be self-standing structures, and the system should allow updating only one of them (e.g. changing its items or details) as long as editing the FOLDER does not introduce inconsistencies in other FOLDERs.)
The has_path operation uses a path that should be defined by an archetype (mentioned by Thomas on the SEC Slack). My interpretation was that those were instance paths over the EHR.directory tree structure, which makes sense since it is impractical to have the whole EHR.directory structure defined by archetypes, and some of those FOLDERs will even be created in an ad-hoc way (IMO most will be created this way, using a generic archetype for definition; this is also the approach of Code24, which has been using folders for 6 years). Also, paths are name-based, which makes them language dependent and creates the need for a constraint requiring sibling FOLDERs to have unique names. (Luis: the uid inherited from LOCATABLE (now optional, 0..1) should be mandatory (1..1) in our implementation for implementation reasons. At the moment it is the primary key in the database, thus it is mandatory and unique.)
Also related to paths, the current spec shows name-based paths to reference internal FOLDERs, but to reference items in a FOLDER the path uses numeric indexes, which seems inconsistent. One possibility is to use the item name in the path. The issue is that the items are really VERSIONED_OBJECTs, which don't have a name, but VERSIONED_OBJECT.latest_version(), which is a VERSION<T>, has a name if T is LOCATABLE, so FOLDER.items[i].latest_version().data.name could be used in the path. That creates another couple of issues: a. the name is not really from the item but from the contained data, and b. since the data could be updated, the name could change, changing the path. So the name-based path IMO is not really useful for any use case.
That last part makes me think about the name-based paths for FOLDERs: FOLDER.name could also change, and FOLDERs could be created, renamed, deleted, etc., so paths that were valid at one point could be invalid later. One idea was to also use these paths for AQL, but IMO it is almost impossible to get something very detailed from AQL using paths for FOLDERs, since I think most FOLDERs will be created ad-hoc and might not have a full structure defined by archetypes, only the basic structure and maybe the new FOLDER.details structure, which could be archetyped but could also be used in an ad-hoc way.
Not issues from the operations but from the model: a. a FOLDER could have more than one parent, b. a FOLDER could have an ancestor as subfolder. These break the tree structure and openEHR needs to add some invariants to prevent this on the model.
We should clearly commit to implementing FOLDER directories as trees in the computational sense. The aim is to guarantee some performance characteristics (approx. O(log n) when rearranged optimally) and to avoid possible cycles that may derive from graph-like directories. This is in contradiction with some implementations that allow graphs to be defined virtually using the LINK class.
The operation "has_directory(ehr_id): Boolean" makes sense in EHRs; however, for phenotyping in clinical research it may actually be the opposite. For example, a clinical study on back pain surgery may have a folder containing many EHRs rather than the other way around.
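The two model issues listed above (a FOLDER with more than one parent, a FOLDER with an ancestor as subfolder) could be caught by a single traversal. The following is only a sketch under assumptions: `Folder` is a hypothetical, simplified stand-in for the RM FOLDER class, `is_tree` is an illustrative name, and every FOLDER is assumed to carry a uid.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Folder:
    # Hypothetical minimal stand-in for the RM FOLDER class.
    uid: str
    name: str
    folders: List["Folder"] = field(default_factory=list)


def is_tree(root: Folder) -> bool:
    """Return True if every FOLDER below root is reached exactly once."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.uid in seen:
            # Second visit: the folder has two parents or is its own ancestor.
            return False
        seen.add(node.uid)
        stack.extend(node.folders)
    return True
```

A server could run a check like this before accepting a committed directory, which is one way of enforcing the invariant without changing the model.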
Proposals for operations:
has_directory(ehr_id): Boolean // MAINTAIN
has_folder(ehr_id, folder_uid): Boolean // NEW, uses uid not path
has_path(ehr_id, path): Boolean // MAINTAIN - 1. spec needs to explicitly state "path" is an archetype path, not an instance path, 2. add an example with archetype paths to show how this operation will work; I think it looks good on paper but it can be difficult to implement
create_directory(ehr_id, folder) // MAINTAIN - discuss the EHR and support for self-standing FOLDERs that do not belong to an EHR.
get_directory(ehr_id): FOLDER // MAINTAIN
get_directory_at_time(ehr_id, time): FOLDER // MAINTAIN
get_folder(ehr_id, folder_uid): FOLDER // NEW, like cd + ls commands (this is optional since the information will be included in the result of get_directory). This will return the latest version of the directory provided that folder is not versioned.
create_folder(ehr_id, parent_folder_uid, new_folder) // NEW, like mkdir command, if no parent_folder_uid is provided, the new_folder will be created under the EHR.directory
update_folder(ehr_id, updated_folder) // NEW, allows modifying an individual FOLDER and what it contains, including name, details, folders and items. The updated_folder contains its uid so there is no need for an extra parameter. If subfolders are deleted in the updated folder, they are deleted in the directory as well in EHRbase.
remove_folder(ehr_id, folder_uid) // NEW, like rmdir -r (removes also subfolders and items)
add_item(ehr_id, folder_uid, versioned_object_uid) // NEW, like the touch command, adds the item to the FOLDER.items via OBJECT_REF (TODO: verify OBJECT_REF needs namespace and type values but I think those could be set to default values set in the server config so we might not need to add extra parameters for those)
remove_item(ehr_id, folder_uid, versioned_object_uid) // NEW, like the rm command, removes the versioned object reference from the FOLDER.items
delete_directory(ehr_id) // MAINTAIN, but is contained in remove_folder when it is invoked with the EHR.directory.uid as folder_uid value
has_directory_version(ehr_id, version_uid): Boolean // MAINTAIN
get_directory_at_version(ehr_id, version_uid): FOLDER // MAINTAIN
get_versioned_directory(ehr_id): VERSIONED_FOLDER // MAINTAIN
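To make the proposed granular operations more concrete, here is a rough in-memory sketch. It is not an implementation of the Service Model: `DirectoryService` and the simplified `Folder` class are hypothetical, items are held as plain uid strings instead of OBJECT_REFs, uids are assumed to be assigned by the server, and versioning is left out entirely.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Folder:
    # Hypothetical, simplified stand-in for the RM FOLDER class.
    name: str
    uid: str = ""
    folders: List["Folder"] = field(default_factory=list)
    items: List[str] = field(default_factory=list)  # versioned object uids


class DirectoryService:
    """In-memory sketch: one root FOLDER (EHR.directory) per ehr_id."""

    def __init__(self) -> None:
        self._roots: Dict[str, Folder] = {}

    def _assign_uids(self, folder: Folder) -> None:
        # Assumption: the server assigns FOLDER.uid, also for subfolders.
        folder.uid = str(uuid.uuid4())
        for child in folder.folders:
            self._assign_uids(child)

    def _find(self, node: Folder, uid: str, parent: Optional[Folder] = None
              ) -> Optional[Tuple[Folder, Optional[Folder]]]:
        # Depth-first search returning (folder, its parent) or None.
        if node.uid == uid:
            return node, parent
        for child in node.folders:
            hit = self._find(child, uid, node)
            if hit is not None:
                return hit
        return None

    def create_directory(self, ehr_id: str, folder: Folder) -> str:
        self._assign_uids(folder)
        self._roots[ehr_id] = folder
        return folder.uid

    def has_folder(self, ehr_id: str, folder_uid: str) -> bool:
        root = self._roots.get(ehr_id)
        return root is not None and self._find(root, folder_uid) is not None

    def get_folder(self, ehr_id: str, folder_uid: str) -> Folder:
        hit = self._find(self._roots[ehr_id], folder_uid)
        if hit is None:
            raise KeyError(folder_uid)
        return hit[0]

    def create_folder(self, ehr_id: str, parent_folder_uid: Optional[str],
                      new_folder: Folder) -> str:
        # Like mkdir: without a parent uid the folder goes under EHR.directory.
        if parent_folder_uid is None:
            parent = self._roots[ehr_id]
        else:
            parent = self.get_folder(ehr_id, parent_folder_uid)
        self._assign_uids(new_folder)
        parent.folders.append(new_folder)
        return new_folder.uid

    def remove_folder(self, ehr_id: str, folder_uid: str) -> None:
        # Like rmdir -r: subfolders and item references go with the folder.
        hit = self._find(self._roots[ehr_id], folder_uid)
        if hit is None:
            raise KeyError(folder_uid)
        node, parent = hit
        if parent is None:
            del self._roots[ehr_id]  # same effect as delete_directory
        else:
            parent.folders = [f for f in parent.folders if f is not node]

    def add_item(self, ehr_id: str, folder_uid: str,
                 versioned_object_uid: str) -> None:
        self.get_folder(ehr_id, folder_uid).items.append(versioned_object_uid)

    def remove_item(self, ehr_id: str, folder_uid: str,
                    versioned_object_uid: str) -> None:
        self.get_folder(ehr_id, folder_uid).items.remove(versioned_object_uid)
```

With this shape, one user action maps to one call, e.g. `svc.create_folder("ehr1", None, Folder("episodes"))` followed by `svc.add_item("ehr1", uid, "some-versioned-object-uid")`, and the client never handles the whole directory.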
Notes:
Referencing FOLDERs by uid requires that the FOLDER.uid is set for all FOLDERs by the server. In the RM the uid is optional, so this could be an implementation constraint but still "spec valid".
The added operations seem to be a more natural way of managing FOLDERs and their items, like a user would do on a Linux terminal, and avoid the extra complexity of managing the whole EHR.directory on the client side for creating new FOLDERs, adding new references to items, and deleting things. Instead of having one big operation, we could map one user action to one operation on the Service Model. Still, the create_folder() operation could receive a full FOLDER structure with subfolders and references to items, or just the basic data like name and details, which could then be modified using the other operations, or using create_folder() again to add subfolders to it. That also adds more flexibility for client-side implementations.
About versioning: per the spec, the only versionable FOLDER is the EHR.directory; no internal FOLDERs can be versioned. Considering the new operations, each creation, update and removal of FOLDERs and items would generate a new version of the containing EHR.directory, so this is an implementation consideration. Either way, this should be done with the current operations in the SM spec; this is just to note that individual FOLDERs shouldn't be versioned (Code24 is versioning individual FOLDERs and they might propose a change request to make that valid in the spec, but that won't be any time soon).
Using the parent_folder_uid to create new FOLDERs prevents the generation of non-tree structures, since a. FOLDER.uid should always be assigned by the server and b. only children of a given parent can be created.
TODO: we still need to discuss AQL requirements for FOLDERs and what will be needed to support those (from archetype modeling to internal implementation).
Pablo Pazos July 14, 2019 at 12:52 AM
Related question: for these operations that return FOLDER, wouldn't it be better to return VERSION<FOLDER> instead?
get_directory(ehr_id): FOLDER
get_directory_at_time(ehr_id, time): FOLDER
get_directory_at_version(ehr_id, version_uid): FOLDER
Pablo Pazos July 11, 2019 at 5:51 AM
In 1.b I detected a potential issue:
I mentioned using the name of the COMPOSITION in the path instead of an index, to be consistent with the FOLDER names that are also used in the path.
item.name is referenced, but item is really a VERSIONED_COMPOSITION, so to get the name we really need to use item.last_version.data.name, and this generates a new problem: what if a new version of the item has a different name?
We have two options: keep the inconsistent path structure using names for FOLDERs but indexes for COMPOSITIONs, or avoid paths as a pointing/identification mechanism.
Related to that last part: talking with Thomas on Slack, he was considering paths to be the ones taken from an archetype, while I was considering them to be instance paths, and I wrote this issue report under that assumption. So for me paths point to parts of a directory instance, not to a model of the directory, since archetypes might not define the whole structure (parts could be defined ad-hoc by software), and archetypes might also lack some constraints needed to create the paths, like FOLDER names. So IMO archetype paths don't work for referencing FOLDERs the way they work for COMPOSITION internals. Another point mentioned: for a given archetype we will have many COMPOSITIONs that comply with it, but for FOLDERs we will have just one instance if the archetype is for the EHR.directory (this of course in the context of one EHR).
Thomas also mentioned the use of archetype paths to query FOLDERs in AQL, but because of the above, I'm not sure that is practical at all.
And another argument was: if a client system needs to point to a specific FOLDER using a path, it needs to get the full EHR.directory first, then calculate the (instance) path, then use that path to do some operation on the server (maybe using the REST API), like adding a subfolder or an item to the FOLDER referenced by the path. Using just the FOLDER.uid avoids the task of generating paths on the client side (currently I don't think the server can provide FOLDER paths to the client, at least we don't have an operation for that; we can only retrieve the full EHR.directory).
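To illustrate the client-side work described here, this is roughly what calculating an instance path would look like once the client has retrieved the full EHR.directory. It is a sketch under assumptions: `Folder` is a hypothetical simplified class, `instance_path` is an illustrative helper, and the `/folders[name]` segment syntax is taken from the example in the spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Folder:
    # Hypothetical minimal stand-in for the RM FOLDER class.
    uid: str
    name: str
    folders: List["Folder"] = field(default_factory=list)


def instance_path(root: Folder, folder_uid: str) -> Optional[str]:
    """Name-based path from root to the folder with this uid, or None."""
    if root.uid == folder_uid:
        return "/"
    for child in root.folders:
        sub = instance_path(child, folder_uid)
        if sub is not None:
            segment = f"/folders[{child.name}]"
            return segment if sub == "/" else segment + sub
    return None
```

Note that the result depends on every FOLDER.name along the way, so renaming any ancestor changes the path, while the uid stays the same.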
For 1.c, I agree with Thomas' comment that the only versionable FOLDER is the EHR.directory (by the current spec), so my argument about using FOLDER.uid for versioning doesn't apply. Also point 2 is discarded by the current spec: no, we can't version internal FOLDERs.
Those are strong arguments to try using uids to identify FOLDERs, the EHR.directory and internal ones.
In response to Thomas' comment "I don't think it will help create comprehensible paths": I don't propose to use uids in paths; my proposal is to use uids directly, instead of paths, to locate one folder or item in the directory. I'm also not saying "don't use paths": both mechanisms could work in parallel, and my opinion is that paths generate extra complexity for client systems (mentioned above).
And about "3.b .... note that what is happening in the server is that the directory structure is being retrieved, and being modified with the changed bit, and then rewritten - the change audit is still on the whole directory, not a single FOLDER", to clarify:
a. yes, the retrieval is needed by the current update_folder operation; of course it is another operation. After the client adds the modified bit (the client needs to manage the whole structure on their side = extra complexity), the modified directory is committed via update_directory, so we need a couple of service calls plus logic on the client for the update.
b. with the proposed operations, which use a path or just the uid of the parent folder where the modifications happen, the whole directory could still be versioned very easily on the server side (just copy the current structure, mark that as a previous version, and set the updated version as the latest; this is also quick to do algorithmically if we know which folder or item was modified, and might not need a full tree traversal to copy the previous structure). Also, I think the operations I proposed are insufficient, and having more granular functionality might be a better option, something like:
- create_folder(ehr, parent_folder_uid, new_folder)
- add_item(ehr, parent_folder_uid, versioned_object)
Similarly for modify and remove/delete of specific folders and items. Modify could be used, for instance, to change the FOLDER.details or FOLDER.name; then if we need to add a subfolder, instead of update we use create_folder and set the corresponding parent_folder_uid. IMO this is simpler than passing whole FOLDER structures with references to subfolders all at once, because the client also needs to create those structures (more logic) and then commit everything. Creating FOLDERs one by one is more natural, like mkdir in the terminal.
Sebastian Iancu July 9, 2019 at 7:50 PM (edited)
One thing to keep in mind for SM is which RM we are referring to/using. RM 1.1 has some changes on FOLDER which, in my opinion, make versioning of folders necessary (if they are used next to EHR.directory).
In a recent discussion on Slack, @Pablo Pazos:
1. one is how to reference folders (paths and uids)
2. check if it makes any sense to have folder archetypes (my guess is no but not sure how others operate)
3. check if the paths we use for folders are archetype paths or instance paths
4. then check how versioning should work for folders (I'm OK with versioning the whole directory on each change but don't agree with providing the whole directory for every change we need to make, and this is linked to item 1, since we will need to identify the parent folder where changes will happen)
5. it was also mentioned recently to add an invariant to prevent cycles in a folder structure (ensure the directory is a tree)
@Sebastian Iancu
0.) directory is a must, but to version the whole tree is a pain - that's why i salute RM 1.1 changes which also allows easier handling of folders
1.) instance identification we do it based on name-based predicate path (ODIN like above), or UID of the folder in the tree, or natural (name) path (root/f1/f2/f3/etc) which resolves to name-based predicate paths
1b) name/value are unique per siblings
2) we have them (archetype based id) to distinguish between types of folders, especially in RM 1.1 would be very useful with support of /details (archetyped)
3) don't get this question?!
4) it will work for RM 1.1, and with a bit of hacky inventive mind can be done also in RM 1.0.x, although not really correct
5) we can add it
Thomas Beale July 9, 2019 at 8:58 AM
With respect to point 3a, we probably need to distinguish between has_path(an_exact_path: String) and has_matching_path(a_path_matcher: String).
W.r.t. 3b, updating just a single folder, there's no reason not to support this, but note that what is happening in the server is that the directory structure is being retrieved, and being modified with the changed bit, and then rewritten - the change audit is still on the whole directory, not a single FOLDER.
1. Directory paths clarifications
REF: https://specifications.openehr.org/releases/RM/latest/common.html#_paths
[See also: https://specifications.openehr.org/releases/BASE/latest/architecture_overview.html#_paths]
The current spec for directory paths is name-based, allows duplicated sibling folder names, and proposes a “uniqueness modifier” to handle the duplicated sibling names. It shows examples like: /folders[hospital episodes]/items[1]
Potential issues / improvements / discussion points:
a. it is not stated how that “uniqueness modifier” is actually used and what happens if the folder name has the brackets proposed to signal the modifier
b. in the example, there is an inconsistent pointing mechanism: for subfolders their name is used, for items their index is used, why not use the item.name also in the path?
c. the main idea of the path is to reference a folder or item in the directory structure, so why not consider using uids? (I know names are mandatory and uids are optional, but the uid might be needed either way for versioning folders, commented below)
d. should each path reference just one folder/item at most? would it be useful to reference a list of siblings with same name by one path? (like an XPath when multiple siblings have the same name)
Pros / Cons / Actions for name-based paths:
+ Pro: name is mandatory on LOCATABLE
+ Con: name is not unique between siblings, a path can point to many siblings
+ Actions: make the name sibling-unique for folders if we want one path-one node, or add explicit support for referencing many siblings with the same name by a path.
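To show why duplicated sibling names matter for has_path, here is a small sketch of name-based resolution. It rests on assumptions: `Folder`, `resolve` and `has_path` are illustrative names, only `/folders[name]` segments are handled (no `items[...]` segments and no "uniqueness modifier"), and a name containing `]` is simply rejected, which is exactly the open question in point a.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Folder:
    # Hypothetical minimal stand-in for the RM FOLDER class.
    name: str
    folders: List["Folder"] = field(default_factory=list)


SEGMENT = re.compile(r"/folders\[([^\]]+)\]")


def resolve(root: Folder, path: str) -> List[Folder]:
    """Return every FOLDER matched by a name-based path (possibly several)."""
    names = SEGMENT.findall(path)
    # Reject anything that is not a pure sequence of /folders[name] segments.
    if "".join(f"/folders[{n}]" for n in names) != path:
        raise ValueError(f"unsupported path: {path}")
    current = [root]
    for name in names:
        current = [child
                   for node in current
                   for child in node.folders
                   if child.name == name]
    return current


def has_path(root: Folder, path: str) -> bool:
    # Assumption: "true" means the path matches at least one FOLDER.
    return len(resolve(root, path)) > 0
```

If two siblings are both named "hospital episodes", `resolve` returns both of them, so the spec has to say whether has_path is true in that case or whether sibling names must be unique.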
2. Considering UIDs to reference FOLDERs and items (mainly COMPOSITIONS)
Currently we agreed on setting the COMPOSITION.uid field to the OBJECT_VERSION_ID to have some version management without the VERSION<COMPOSITION>. So to reference COMPOSITIONs (items) we are covered and really don't need a path to get a COMPO from the directory. To do things like "remove item from folder" we can also use the COMPO.uid, but we also need a reference to the parent FOLDER.
Since FOLDER is also a top-level versionable object, wouldn't it make sense to also set FOLDER.uid to the OBJECT_VERSION_ID? This would make the uid mandatory for FOLDERs, as I think we have de facto for COMPOSITIONs.
If that makes sense, then, there is no need to reference FOLDERs by path, we could use the UID, and even use the UID to reference a specific version of the FOLDER, since it is an OBJECT_VERSION_ID (same as we currently use COMPOSITION.uid).
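As a small illustration of how a uid of this kind can point at a specific version, an OBJECT_VERSION_ID is a string of three parts separated by `::` (object_id, creating_system_id, version_tree_id). The helper names below are hypothetical and the example values are made up; this is a sketch, not a validator for the full identifier syntax.

```python
from typing import NamedTuple


class ObjectVersionId(NamedTuple):
    object_id: str            # identifies the versioned object as a whole
    creating_system_id: str   # system where this version was created
    version_tree_id: str      # e.g. "1", "2", or "1.1.1" for a branch


def parse_object_version_id(value: str) -> ObjectVersionId:
    parts = value.split("::")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not an OBJECT_VERSION_ID: {value!r}")
    return ObjectVersionId(*parts)


def same_versioned_object(a: str, b: str) -> bool:
    """True if both ids are versions of the same versioned object."""
    return (parse_object_version_id(a).object_id
            == parse_object_version_id(b).object_id)
```

With FOLDER.uid in this form, `version_tree_id` tells which version is meant, while `object_id` is stable across versions.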
3. Service Model operations
a. has_path(ehr_id, path): Boolean
We need to review the analysis in point 1 to be sure the correct result is returned from this operation, e.g. if the path points to multiple folders, is that "true"? (I guess yes, but we need to specify whether referencing multiple folders is possible or not.)
b. update_directory(ehr_id, folder)
This is the only operation that allows modifying the folder structure, and it requires the whole EHR.directory structure to be passed for each modification, which IMHO is not an optimal solution.
We could provide something like:
+ update_folder(ehr_id, path_to_folder, folder)
+ update_folder(ehr_id, folder_uid, folder)
path_to_folder / folder_uid would identify the parent folder where the "folder" structure should be updated, requiring only the internal structure that changes to be passed, not unrelated folders and references to items.
The idea is to simplify things for clients, so they don't have to handle the whole directory structure to change one thing.
Considerations for versioning:
+ I know the current operation might lead to a really simple implementation for FOLDER versioning, but it seems too much to pass the whole thing to update one thing.
+ With the proposed operations above I don't think FOLDER versioning would be much more complex; for instance, just one VERSIONED_OBJECT<FOLDER> for the EHR.directory is what is needed to version the whole thing when one internal FOLDER is modified. I would clone the current structure internally and apply the changes to that structure, then set that as the latest version of the EHR.directory, while maintaining the previous one. Of course this is linear, full-structure versioning; smarter people than me could think of delta versioning to save space.
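The clone-and-apply idea above can be sketched in a few lines. This is only an illustration under assumptions: `Folder` and `VersionedDirectory` are hypothetical simplified classes (not the RM VERSIONED_OBJECT), versions are kept in a plain list, and the whole structure is deep-copied on every change, so it is linear, full-structure versioning with no deltas.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Folder:
    # Hypothetical minimal stand-in for the RM FOLDER class.
    uid: str
    name: str
    folders: List["Folder"] = field(default_factory=list)
    items: List[str] = field(default_factory=list)


def find(node: Folder, uid: str) -> Optional[Folder]:
    if node.uid == uid:
        return node
    for child in node.folders:
        hit = find(child, uid)
        if hit is not None:
            return hit
    return None


class VersionedDirectory:
    """Keeps every version of the whole EHR.directory, oldest first."""

    def __init__(self, root: Folder) -> None:
        self._versions: List[Folder] = [root]

    def latest(self) -> Folder:
        return self._versions[-1]

    def at_version(self, number: int) -> Folder:
        return self._versions[number - 1]  # version numbers start at 1

    def commit_change(self, folder_uid: str,
                      change: Callable[[Folder], None]) -> int:
        # Clone the current structure, change one FOLDER in the clone,
        # and make the clone the latest version; older versions stay intact.
        draft = copy.deepcopy(self.latest())
        target = find(draft, folder_uid)
        if target is None:
            raise KeyError(folder_uid)
        change(target)
        self._versions.append(draft)
        return len(self._versions)
```

For example, `vd.commit_change("f1", lambda f: f.items.append("vo-1"))` modifies a single internal FOLDER from the caller's point of view, while the stored version is still the whole directory.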
c. delete_directory(ehr_id)
For deleting, we currently have one operation to delete the whole directory, which IMO doesn't make sense. For that I would define operations to delete specific items from FOLDERs or FOLDERs from FOLDERs.
Some alternatives:
+ delete_folder(ehr_id, path_to_folder, path_to_folder_to_del)
+ delete_folder(ehr_id, folder_uid, folder_uid_to_del)
+ delete_folder(ehr_id, folder_uid_to_del)
The third alternative points directly to the folder to delete, which internally should be deleted from its parent on the server, but the client doesn't need to manage that.
I think deleting one folder would be a more common use case than deleting the whole directory of an EHR. Still, those operations include the ability to delete the directory.