iSMART: Ontology-based Semantic query of CDA documents

June 19, 2017 | Author: Jing Mei | Category: Natural Language Processing, Semantics, Electronic Health Records, Information Storage and Retrieval


Shengping Liu, PhD, Yuan Ni, PhD, Jing Mei, PhD, Hanyu Li, PhD, Guotong Xie, MS, Gang Hu, PhD, Haifeng Liu, PhD, Xueqiao Hou, MS, Yue Pan, PhD
IBM China Research Laboratory, ZGC Software Park 19A, Beijing 100193, China

Abstract

The Health Level 7 Clinical Document Architecture (CDA) is widely accepted as the format for electronic clinical documents. With the rich ontological references in CDA documents, ontology-based semantic queries can be performed to retrieve CDA documents. In this paper, we present iSMART (interactive Semantic MedicAl Record reTrieval), a prototype system designed for ontology-based semantic query of CDA documents. The clinical information in CDA documents is extracted into RDF triples by a declarative XML-to-RDF transformer. An ontology reasoner is developed to infer additional information by combining the background knowledge from the SNOMED CT ontology. An RDF query engine is then leveraged to enable the semantic queries. The system has been evaluated using real clinical documents collected from a large hospital in southern China.

Introduction

As the use of Electronic Medical Records (EMRs) becomes more widespread, so does the need for efficient retrieval of clinical documents according to users’ requirements. The IHE XDS (Cross Enterprise Document Sharing) [1] profile provides an architecture for managing the sharing and retrieval of clinical documents between healthcare enterprises. In XDS, the query of clinical documents is restricted to the metadata provided during the submission of documents, such as the submission time and patient ID. However, many users’ query requirements target the contents of the clinical documents, for example, finding patients with certain clinical observations who are eligible for a clinical trial. In general, keyword-based search is used to search for clinical documents based on their contents. But compared to formal query languages, such as SQL in databases and the formal conjunctive query language in logic, it suffers from two problems: (1) the keywords cannot fully capture the user’s requirements, and (2) the soundness and completeness of the results cannot be guaranteed.

In this paper, we propose the iSMART system to support formal query answering on clinical documents. The iSMART system uses the Health Level 7 Clinical Document Architecture (CDA) [2] to represent electronic clinical documents, as CDA is a widely adopted standard. Besides the hierarchical structure of the documents, CDA also specifies the semantic meaning of the documents to avoid ambiguity in information exchange. A key characteristic of CDA is the frequent use of ontological (terminological) references. Fragments of CDA documents are associated with ontological concepts defined in ontologies such as SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) [3]. We observe that the ontological references in CDA documents can be the key enabler for ontology-based semantic query of the CDA documents, because the CDA documents can be interpreted as fact assertions about the ontology.

“Ontology-based semantic query” means that: (1) the semantic query is a formal conjunctive query formulated using the vocabulary of the ontology [4]; (2) ontology inference should be leveraged to answer the query. That is, the definitions of the concepts in the ontology should be integrated with the information in CDA documents to infer new information, so that more complete answers can be provided for queries on CDA documents.

Example 1. Consider the scenario: a user wants to find the CDA documents of patients who have a disease of the respiratory system. As shown in List 1, there is a CDA document fragment stating that the associated patient has an observation of asthma:

Bronchial asthma

List 1. The sample CDA document fragment

Initially, this document does not satisfy the user’s requirement. However, this fragment contains an

AMIA 2009 Symposium Proceedings Page - 375

ontological reference to the Asthma concept, primitively defined in SNOMED CT, which indicates that:

Asthma
    is-a          Disorder of respiratory system
    Finding site  Structure of respiratory system

With a reasoner, it can be inferred that Asthma is also a disease found in the respiratory system. Therefore, this document should be an answer to the user’s query. To enable such semantic queries on CDA documents, the iSMART system extracts the fact assertions about the ontology from CDA documents and represents them as RDF triples [5]. A reasoner then infers new information from these triples using the domain ontology SNOMED CT. The semantic queries are performed on the set of inferred triples.

Methods

The overall design of the iSMART system is shown in Figure 1. The system consists of three components, which correspond to the three steps needed to process CDA documents and enable semantic query on them. First, the X2R Transformer extracts useful RDF triples from the CDA level 3 documents and outputs RDF documents. Second, the EL+ Reasoner enriches the RDF documents with background knowledge from SNOMED CT and outputs the inferred RDF documents. Third, the RDF query engine Semplore [7] is leveraged to perform semantic queries on the CDA documents. The detailed functions and implementations of these components are elaborated in the following sections.

Example 2. The fact assertions extracted from the CDA document in Example 1 can be represented in RDF format as follows (“ex” and “sct” are the respective namespace prefixes for the sample CDA document and the SNOMED CT ontology):

    ex:CDA_doc_1  rdf:type           ex:CDADocument .
    ex:CDA_doc_1  ex:hasObservation  ex:obs_1 .
    ex:obs_1      rdf:type           sct:Asthma .

After the inference by the reasoner, four additional triples can be added as follows:

    ex:obs_1   rdf:type         sct:DisorderOfRespiratorySystem .
    ex:obs_1   rdf:type         sct:Disease .
    ex:obs_1   sct:findingSite  ex:node_1 .
    ex:node_1  rdf:type         sct:StructureOfRespiratorySystem .

The query to find the CDA documents of patients who have a disease of the respiratory system can be formulated as a formal conjunctive query:

    Q(x) :- ex:CDADocument(x), ex:hasObservation(x,y), sct:Disease(y),
            sct:findingSite(y,z), sct:StructureOfRespiratorySystem(z).
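To make the effect of inference on this query concrete, the following is a minimal Python sketch that evaluates the conjunctive query over the triples of Example 2, first without and then with the four inferred triples. It is an illustration only: the function names `match` and `answers`, the `?x` variable syntax, and the naive nested-loop join are assumptions made for this sketch and are not how the Semplore engine evaluates queries.

```python
RDF_TYPE = "rdf:type"

# Triples extracted from the CDA document (Example 2).
asserted = {
    ("ex:CDA_doc_1", RDF_TYPE, "ex:CDADocument"),
    ("ex:CDA_doc_1", "ex:hasObservation", "ex:obs_1"),
    ("ex:obs_1", RDF_TYPE, "sct:Asthma"),
}

# Triples added by the reasoner (Example 2).
inferred = {
    ("ex:obs_1", RDF_TYPE, "sct:DisorderOfRespiratorySystem"),
    ("ex:obs_1", RDF_TYPE, "sct:Disease"),
    ("ex:obs_1", "sct:findingSite", "ex:node_1"),
    ("ex:node_1", RDF_TYPE, "sct:StructureOfRespiratorySystem"),
}

# Q(x) written as triple patterns; names starting with "?" are variables.
QUERY = [
    ("?x", RDF_TYPE, "ex:CDADocument"),
    ("?x", "ex:hasObservation", "?y"),
    ("?y", RDF_TYPE, "sct:Disease"),
    ("?y", "sct:findingSite", "?z"),
    ("?z", RDF_TYPE, "sct:StructureOfRespiratorySystem"),
]


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None on a clash."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def answers(query, triples, answer_var="?x"):
    """Naive nested-loop evaluation of a conjunctive query over a triple set."""
    bindings = [{}]
    for pattern in query:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return {binding[answer_var] for binding in bindings}


without_inference = answers(QUERY, asserted)
with_inference = answers(QUERY, asserted | inferred)
```

In this sketch `without_inference` is empty because no asserted triple types the observation as `sct:Disease`, while `with_inference` contains `ex:CDA_doc_1`.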

It is observed that “CDA_doc_1” is an answer over the inferred set of triples.

The iSMART system makes use of CDA documents and the terminology SNOMED CT, whose expressivity is at the level of the recently proposed Description Logic language EL+ [6]. However, the presented techniques and results are applicable to any XML-based electronic clinical documents and any ontology within the expressivity of EL+. The main contributions of the paper can be summarized as:

1) We have proposed an architecture for ontology-based semantic query on CDA documents.
2) We have proposed an approach to convert CDA documents into RDF triples of clinical statements.
3) We have developed an EL+ reasoner to enrich the clinical statements with knowledge from the ontology.

Figure 1. The architecture of iSMART

X2R Transformer. Effectively extracting RDF triples from XML-based CDA documents provides the opportunity to infer additional information from CDA documents. To the best of our knowledge, the only existing solution is GRDDL (Gleaning Resource Descriptions from Dialects of Languages) [8]. It basically constructs the RDF triples by concatenating text in XSLT scripts, but ignores the semantics that the ontologies provide for the RDF triples. This makes writing mapping scripts between XML and RDF labor-intensive, error-prone and difficult to maintain. In our system, we propose a new tool, the X2R Transformer, to overcome these drawbacks. The core of the X2R Transformer is an engine that takes an XML document and a declarative X2R mapping as input, and outputs RDF. The X2R mapping defines the relationship between XML data and RDF triples, based on the structures of the ontologies that the output RDF triples should follow. Due to limited space, we only introduce two important features of the X2R mapping language, ClassMap and PropertyMap, to give the reader a rough idea. A ClassMap specifies the information needed to generate instances of an RDF class. A PropertyMap specifies the values of the properties or the objects the properties refer to. For example, List 2 shows a mapping fragment which converts the CDA fragment in List 1 to the original RDF triples in Example 2.

    X2R:ClassMap Document;
        x2r:class ex:CDADocument;
        x2r:uriPattern ;    // i is a variable
        x2r:location //ClinicalDocument;

    X2R:ClassMap Observation;
        x2r:class ;
        x2r:uriPattern ;    // j is a variable
        x2r:location //Observation;

    X2R:PropertyMap Doc_Obs;
        x2r:belongsToClassMap Document;
        x2r:refersToClassMap Observation;
        x2r:property ex:hasObservation;
        x2r:relation $Document//$Observation;

List 2. The sample X2R mapping fragment

In this example, two ClassMaps and one PropertyMap are given. In the ClassMap Document, x2r:location gives the position where the XML data (corresponding to a class instance in RDF) occurs. It uses standard XPath to navigate the XML and find all such RDF instances. All these instances belong to the class ex:CDADocument (attribute x2r:class). In addition, the ClassMap Document gives the method to generate the URI for each instance, that is, a sequence consisting of the fixed text ex:CDA_doc and a variable i, which represents an integer starting from a pre-defined value. As a result, the ClassMap Document generates the RDF triple:

    ex:CDA_doc_1 rdf:type ex:CDADocument .

The ClassMap Observation has a similar structure. The PropertyMap Doc_Obs links Document instances with Observation instances and requires that the relationship between them be ancestor-descendant ($Document//$Observation). This produces the triple:

    ex:CDA_doc_1 ex:hasObservation ex:obs_1 .

Given a valid mapping and XML data, the X2R engine searches the XML data to generate RDF resources based on the class locations in the mapping, and assigns them URIs according to the given naming method. After that, each property in the mapping is processed either to attach the corresponding values to the existing resources, or to connect the existing resources together based on the values of object properties.

EL+ Reasoner. With the ontological references in CDA documents, the background knowledge in the ontology can be added to enrich the meaning of CDA documents. In our system, as the supporting ontology, SNOMED CT, has the formal expressivity of the Description Logic language EL+, we employ an EL+ reasoner to generate inferred RDF documents. Considering that SNOMED CT includes more than 300,000 concepts and that millions of triples can be generated from CDA documents, we build our EL+ Reasoner on top of a relational database system for scalability.

First, we normalize SNOMED CT and store it in the database. Figure 2 depicts a storage schema for the SNOMED CT ontology in a normal form according to the normalization rules for EL+ ontologies [9].

Figure 2. Storage schema for the SNOMED CT ontology

All concepts (URIs) in the ontology are assigned an internal ID, stored in the table IDURI. The tables ATOMICSUB, GCIINTER, GCIEXISTS, EXISTSSUB, SUBROLE and ROLECHAIN correspond respectively to the EL+ normalized axioms $\mathit{sub} \sqsubseteq \mathit{sup}$, $\mathit{sub}_1 \sqcap \mathit{sub}_2 \sqsubseteq \mathit{sup}$, $\exists \mathit{role}.\mathit{sub} \sqsubseteq \mathit{sup}$, $\mathit{sub} \sqsubseteq \exists \mathit{role}.\mathit{sup}$, $\mathit{role}_1 \sqsubseteq \mathit{role}_2$ and $\mathit{role}_1 \circ \mathit{role}_2 \sqsubseteq \mathit{role}_3$. We also store all CDA triples in two additional tables, TYPEOF(ind, concept) and RELATIONSHIP(ind, role, ind’). The TYPEOF table is designed for RDF membership triples, while RELATIONSHIP stores all relationship triples.

Second, we compute the completion of the CDA triples so as to make explicit the implicit clinical information in CDA documents. Specifically, we defined Datalog rules for the EL+ normalized axioms, as shown in Table 1. Taking $\mathit{sub} \sqsubseteq \mathit{sup}$ as an example, if a concept name sub is subsumed by another concept name sup and an individual ind has type sub, then ind is inferred to be an instance of sup. A more technical detail is that we preprocess the existential axiom $A \sqsubseteq \exists R.B$, whose expressivity is beyond Datalog rules. The underlying meaning of the existential axiom is:

$$A(x) \rightarrow \exists y \ \text{s.t.}\ R(x, y) \wedge B(y)$$

However, Datalog rules are not powerful enough to generate existential individuals. Even if we extend Datalog with a generation function, an infinite sequence of individuals might be generated. Taking $C \sqsubseteq \exists R.C$ and TYPEOF($\mathit{ind}_0$, C) as an example, the generation function would introduce infinitely many individuals $\mathit{ind}_1, \mathit{ind}_2, \ldots$ such that RELATIONSHIP($\mathit{ind}_{i-1}$, R, $\mathit{ind}_i$) and TYPEOF($\mathit{ind}_i$, C), where $i \geq 1$.
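As an illustration of how a single Datalog rule can be evaluated bottom-up on a relational database, the sketch below applies the atomic subsumption rule TYPEOF(ind, sup) :- ATOMICSUB(sub, sup), TYPEOF(ind, sub) with Python's built-in sqlite3 module until no new row appears. It is a minimal sketch under stated assumptions, not the authors' implementation: it uses an in-memory SQLite database, stores concept names directly instead of the internal IDs from IDURI, and the direct edge from DisorderOfRespiratorySystem to Disease is a toy assumption chosen to reproduce the inferred types of Example 2. The helper names are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with toy ATOMICSUB and TYPEOF tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ATOMICSUB (sub TEXT, sup TEXT, PRIMARY KEY (sub, sup))")
    conn.execute("CREATE TABLE TYPEOF (ind TEXT, concept TEXT, PRIMARY KEY (ind, concept))")
    # Toy subsumption axioms (assumed for illustration, not real SNOMED CT data).
    conn.executemany(
        "INSERT INTO ATOMICSUB VALUES (?, ?)",
        [
            ("sct:Asthma", "sct:DisorderOfRespiratorySystem"),
            ("sct:DisorderOfRespiratorySystem", "sct:Disease"),
        ],
    )
    # The membership triple extracted from the CDA document in Example 2.
    conn.execute("INSERT INTO TYPEOF VALUES (?, ?)", ("ex:obs_1", "sct:Asthma"))
    return conn


def count_typeof(conn):
    return conn.execute("SELECT COUNT(*) FROM TYPEOF").fetchone()[0]


def complete_atomic_subsumption(conn):
    """Apply the rule repeatedly until a round adds no row; return the number of rounds."""
    rounds = 0
    while True:
        before = count_typeof(conn)
        # One bottom-up step: join the known types with the subsumption axioms.
        # The primary key plus OR IGNORE discards facts that are already present.
        conn.execute(
            "INSERT OR IGNORE INTO TYPEOF (ind, concept) "
            "SELECT t.ind, a.sup FROM TYPEOF AS t "
            "JOIN ATOMICSUB AS a ON a.sub = t.concept"
        )
        rounds += 1
        if count_typeof(conn) == before:
            return rounds


conn = build_demo_db()
rounds = complete_atomic_subsumption(conn)
concepts = {
    row[0]
    for row in conn.execute("SELECT concept FROM TYPEOF WHERE ind = 'ex:obs_1'")
}
```

After the loop terminates, `concepts` holds the asserted type together with both inferred supertypes. The other rules in Table 1 could be handled in the same style with one INSERT ... SELECT per rule inside the loop.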


As a result, we proposed the idea of a canonical individual: for each existential axiom $A \sqsubseteq \exists R.B$, we generate up front one canonical individual ind’ w.r.t. the role R and the class B, and store all canonical individuals in a table CANONIND(ind’, R, B). We can then define a Datalog rule such that if an individual ind has type sub, where $\mathit{sub} \sqsubseteq \exists \mathit{role}.\mathit{sup}$, and there is a canonical individual ind’ w.r.t. role and sup in the table CANONIND, then the triples RELATIONSHIP(ind, role, ind’) and TYPEOF(ind’, sup) are inferred. Then, using the well-known bottom-up strategy, all the Datalog rules in Table 1 are evaluated iteratively until no new inferred triple is generated. In theory, the rule evaluation can be done in polynomial time in the size of the original data, and in practice, our EL+ reasoner addresses the problem more efficiently. Besides, our theoretical work has proved the soundness and completeness of the semantic query approach for the unary tree queries supported in our system.

EL+ axiom | Datalog rule
$\mathit{sub} \sqsubseteq \mathit{sup}$ | TYPEOF(ind, sup) :- ATOMICSUB(sub, sup), TYPEOF(ind, sub)
$\mathit{sub}_1 \sqcap \mathit{sub}_2 \sqsubseteq \mathit{sup}$ | TYPEOF(ind, sup) :- GCIINTER(sub1, sub2, sup), TYPEOF(ind, sub1), TYPEOF(ind, sub2)
$\exists \mathit{role}.\mathit{sub} \sqsubseteq \mathit{sup}$ | TYPEOF(ind, sup) :- GCIEXISTS(role, sub, sup), RELATIONSHIP(ind, role, ind’), TYPEOF(ind’, sub)
$\mathit{sub} \sqsubseteq \exists \mathit{role}.\mathit{sup}$ | RELATIONSHIP(ind, role, ind’), TYPEOF(ind’, sup) :- EXISTSSUB(sub, role, sup), TYPEOF(ind, sub), CANONIND(ind’, role, sup)
$\mathit{role}_1 \sqsubseteq \mathit{role}_2$ | RELATIONSHIP(ind, role2, ind’) :- SUBROLE(role1, role2), RELATIONSHIP(ind, role1, ind’)
$\mathit{role}_1 \circ \mathit{role}_2 \sqsubseteq \mathit{role}_3$ | RELATIONSHIP(ind1, role3, ind3) :- ROLECHAIN(role1, role2, role3), RELATIONSHIP(ind1, role1, ind2), RELATIONSHIP(ind2, role2, ind3)

Table 1. Datalog rules for EL+ normalized axioms

Semplore. In iSMART, Semplore [7] is used to serve semantic queries on CDA documents. The supported query is the unary tree-shaped conjunctive query q of the form $q(x)$ :- $\exists Y.\ \mathit{conj}(x, Y)$, where x is the only answer variable, Y is a vector of existentially quantified variables, and conj(x, Y) is a conjunction of atoms of the form C(u) and R(u, v), where u and v are constants or variables in x or Y. R is a relationship name and C is a concept name in the ontology. The query graph is a tree rooted at the answer variable x. Semplore uses inverted lists to store the keywords, the type information of resources and the structure information of triples. A faceted search interface is provided in which users can start with a keyword search and then navigate by relationship or type constraints.

Results

The effectiveness and efficiency of the iSMART system are evaluated in an experimental study in which 100, 900 and 9000 documents are uploaded into iSMART in three runs. These documents were collected from a large hospital, with privacy information protected. Table 2 lists statistics about these documents, including the number of triples extracted from the CDA documents and the number of triples generated by the EL+ reasoner. The number of triples increases about fivefold after the inference. The additional inferred triples provide more comprehensive answers to the semantic queries.

#documents | #triples from CDA | #triples from reasoner
100 | 56,451 | 289,342
900 | 426,251 | 2,137,702
9000 | 4,550,055 | 23,129,897

Table 2. Statistics for the CDA documents

Table 3 shows the running time for each step, that is, the time taken to make a batch of documents ready for semantic querying. The time increases approximately linearly with the number of triples.

#documents | X2R | Reasoning | Indexing
100 | 105 | 100 | 278
900 | 749 | 628 | 2228
9000 | 8001 | 4550 | 25800

Table 3. Running time (seconds) for each step

To illustrate the usefulness of the inference, we design two queries (shown in Table 4) to show the difference between querying the triples without and with inference.

Q1 | Find documents of patients who smoke, are older than 39, and have a bronchiolar disease whose associated morphology is narrowing
Q2 | Find documents of patients who are pregnant and have upper abdominal pain

Table 4. Sample queries

Q1 is a simplified query for finding patients eligible for a clinical trial. Q2 is a query for alerting, because upper abdominal pain in pregnancy may indicate preeclampsia. These two queries are performed on the dataset with 9000 documents. For Q1, no results are returned from the triples without inference, while 312 documents are returned with inference. The reason is that the associated morphology is derived by the EL+ reasoner. For Q2, 12 documents are returned with inference while 4 documents are returned without inference, due to the subsumption inference on the concept “upper abdominal pain”. This comparison illustrates the usefulness of semantic querying: with inference, users can obtain more complete results.

Discussion

Ontologies can be used to improve precision and recall for keyword-based searching of medical information, including the literature and clinical documents [10]. Recently, it was observed that the ontological references in CDA documents and the relationships within the ontology can be leveraged to create and rank the search results [10] for CDA documents [11]. We further demonstrate that the ontological references in CDA documents can also enable semantic query on the CDA documents.

We have presented a prototype system supporting semantic query on CDA documents. As far as we know, the only similar system is that of Patel C, et al. [12], which investigates the feasibility of using ontology reasoning to match patients to clinical trials. The key insight is that a clinical trial criterion can be expressed as a semantic query, which a reasoner can then use together with SNOMED CT to infer implicit information that results in retrieving eligible patients. Our system differs in that we handle the standard CDA format of electronic clinical documents, and the query answering engine is based on a relational database, which is more scalable than the memory-based reasoner used by Patel C, et al. [12].

The CDA standard poses unique research challenges for semantic query. One is to handle the semantic overlap between the CDA model and SNOMED CT. When used together, SNOMED CT and CDA often offer multiple possible approaches to represent the same clinical information. For example, the attribute “targetSiteCode” in CDA is very close to the relationship “Finding site” in SNOMED CT. So when extracting the semantic representation of a CDA document, the tool must clearly understand the semantics of both in order to manage areas of overlap or apparent conflict. Another challenge is to handle the semantic ambiguity of CDA documents. For example, the code/value pairs of the observation element in a CDA document may have many different meanings. The HL7 TermInfo [13] project aims to resolve both issues by defining guidelines for using SNOMED CT codes in CDA documents. The X2R Transformer in iSMART currently assumes that the CDA documents follow the HL7 TermInfo guide. Handling CDA documents that do not conform to this guide is still a great challenge.

Conclusion

This paper describes a prototype system to support ontology-based semantic query of CDA documents. The semantic query can be leveraged to provide clinical alerts and to find eligible patients for clinical trials. Its key components are the XML-to-RDF transformer and the EL+ ontology reasoner. Our system demonstrates that the ontological reference is the key enabler of the semantic queries.

Acknowledgement

We thank GuangDong Hospital of Traditional Chinese Medicine for their great support.

References

1. ACC, HIMSS and RSNA. Integrating the Healthcare Enterprise IT Infrastructure Technical Framework Volume 1: Integration Profiles. 2007.
2. Dolin RH, Alschuler L, Boyer S, et al. HL7 Clinical Document Architecture, Release 2.0. J Am Med Inform Assoc. 2006;13(1):30-9.
3. SNOMED Clinical Terms. Northfield, IL: College of American Pathologists, 2007.
4. Krotzsch M, Rudolph S, and Hitzler P. Conjunctive Queries for a Tractable Fragment of OWL 1.1. ISWC/ASWC Proc. 2007: 310-23.
5. Manola F and Miller E. RDF primer. W3C recommendation, Feb. 2004.
6. Suntisrivaraporn B, Baader F, Schulz S, and Spackman K. Replacing SEP-Triplets in SNOMED CT using tractable Description Logic operators. AIME Proc. 2007: 287-91.
7. Zhang L, Liu Q, Zhang J, et al. Semplore: An IR approach to scalable hybrid query of semantic web data. ISWC/ASWC Proc. 2007: 652-65.
8. Connolly D. Gleaning Resource Descriptions from Dialects of Languages (GRDDL), W3C Recommendation, 11 September 2007.
9. Baader F, Brandt S, Lutz C. Pushing the EL envelope. IJCAI Proc. 2005.
10. Bodenreider O. Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearb Med Inform. 2008.
11. Farfán F, Hristidis V, Ranganathan A, et al. XOntoRank: ontology-aware search of electronic medical records. IEEE ICDE Proc. 2009.
12. Patel C, Cimino JJ, Dolby J, et al. Matching patient records to clinical trials using ontologies. ISWC/ASWC Proc. 2007: 816-29.
13. Cheetham E, Markwell D, Dolin RH, et al. Using SNOMED CT in HL7 Version 3; Implementation Guide, Release 1.4. 2007.
