Structural group auditing of a UMLS semantic type’s extent


Structural group auditing of a UMLS semantic type’s extent ARTICLE in JOURNAL OF BIOMEDICAL INFORMATICS · JULY 2008 Impact Factor: 2.19 · DOI: 10.1016/j.jbi.2008.06.001 · Source: PubMed


Journal of Biomedical Informatics 42 (2009) 41–52


Yan Chen a,b,*, Huanying (Helen) Gu c, Yehoshua Perl a, James Geller a, Michael Halper d

a New Jersey Institute of Technology, Newark, NJ 07102, USA
b Borough of Manhattan Community College, CUNY, New York, NY 10007, USA
c University of Medicine and Dentistry of New Jersey, Newark, NJ 07107, USA
d Kean University, Union, NJ 07083, USA

Article info

Article history: Received 27 November 2007; available online 17 June 2008.

Keywords: UMLS; Partition; Semantic network; Semantic type assignment; Auditing; Group auditing; Structural auditing; Semantic refinement; Refined semantic network; Refined semantic type

Abstract

Each UMLS concept is assigned one or more of the semantic types (STs) from the Semantic Network. Due to the size and complexity of the UMLS, errors are unavoidable. We present two auditing methodologies for groups of semantically similar concepts. The straightforward procedure starts with the extent of an ST, which is the group of all concepts assigned this ST. We divide the extent into groups of concepts that have been assigned exactly the same set of STs. An algorithm finds subgroups of suspicious concepts. The human auditor is presented with these subgroups, which purportedly exhibit the same semantics, and thus she will notice different concepts with wrong or missing ST assignments. The dynamic procedure detects concepts which become suspicious in the course of the auditing process. Both procedures are applied to two semantic types. The results are compared with a comprehensive manual audit and show a very high error recall with a much higher precision. © 2008 Elsevier Inc. All rights reserved.

1. Introduction

The UMLS [1] is a two-level terminological knowledge base, consisting of the Metathesaurus (META) [2] with 1.5 million biomedical concepts and the Semantic Network (SN) [3–5], which is an upper-level abstraction network of 135 broad categories called semantic types (STs). The two levels are related via assignments of one or more STs to each concept of the META. These ST assignments capture an aspect of the semantics of a concept in the sense of identifying its nature. For example, a concept assigned Mammal¹ has an aspect of its semantics defined by placing it within the animal kingdom. However, this is not a complete specification of the semantics of the concept, as there are many different kinds of mammals. The ST assignments are thus critical to the UMLS’s framework. The set of concepts assigned a semantic type T is called the extent of T, denoted by E(T). All the concepts of this extent share the semantics of T.

* Corresponding author: Y. Chen.
¹ Semantic types are written in bold font.

The UMLS’s extensive size and inherent complexity make some wrong assignments unavoidable. Some categorization errors and inconsistencies may have been introduced into it. This may have been caused by the nature of the UMLS, integrating various source terminologies which are not always consistent with each other, or by the different views of domain experts who categorize the concepts. Incorrect ST assignments (“mis-assignments”) may, in fact, reflect various kinds of misunderstandings, such as inaccurate or wrong meaning or ambiguity of concepts, which may result in their misuse. Thus, an ST assignment error for a concept may indicate the potential presence of other errors, such as missing or incorrect hierarchical or lateral relationships or redundant or ambiguous concepts [6–8].

In a recent study of UMLS user preferences [9], users expressed a recommendation that 35% of a putative UMLS budget should be spent for auditing (more than for any other task). Wrong or missing semantic types were the users’ top concerns. Therefore, it is imperative to audit the META for semantic type assignments to ensure the overall quality and usability of the UMLS.

In this paper, we describe a group-centered approach, applying a “divide and conquer” technique to an extent of a semantic type to facilitate the task of auditing the semantic type assignments of the concepts in the extent. We partition an extent into smaller logical units, which are more comprehensible due to their uniform semantics and are therefore easier to audit. The auditing methodology facilitates the finding of ST mis-assignments. Our basic premise is that a review of purportedly semantically similar concepts in a group is more likely to be effective in locating such errors than a review of random concepts with disparate semantics. In such a group, all concepts are intended to share an overarching broad meaning, so those that do not share
it may be more readily detected by a human auditor. This methodology utilizes semantic types of the Refined Semantic Network (RSN) [6,7] having the characteristic of semantic uniformity of their extents (see Section 2).

The methodology utilizes an algorithm which suggests to the auditor that certain concepts are “suspicious” and warrant review. This algorithm uses a structural approach to define formal conditions regarding the ST assignments of parents and children for identifying suspicious concepts. These conditions utilize the dual level architecture of the UMLS, i.e., the META and the SN together. An interesting feature of the methodology is its dynamic nature, where a re-invocation after the correction of an ST mis-assignment at a parent concept can lead to the discovery of additional suspicious children, which were initially not suspicious.

We demonstrate our new auditing techniques, designed for processing one semantic type at a time, by examining the extents of the STs Experimental Model of Disease (EMD) and Environmental Effect of Humans (EEH) of the UMLS 2006AB version.

2. Background

The META is a large and complex collection of biomedical concepts. Each concept in the META is assigned one or more semantic types from the SN. Each semantic type T has an extent E(T), the set of all concepts it is assigned to. However, it may be that not all concepts in E(T) exhibit the same semantics. For example, in E(EMD), the concept Melanoma, Experimental² is assigned EMD and Neoplastic Process (NP), while the concept Arthritis, Experimental is assigned only EMD. Thus, an extent, such as E(EMD), may be semantically non-uniform. That is, two concepts of the EMD extent may be different in their semantics expressed as a broad category; some are experimental diseases while others are experimental cancer diseases.

[Figure 1: Venn diagram of E(EMD). Pure EMD: 46 concepts (e.g., Arthritis, Experimental; Disease Model). EMD ∩ NP: 26 concepts (e.g., Melanoma, Experimental; Experimental Hepatoma). EMD ∩ Mammal: 1 concept (Knock-in Mouse).]

Fig. 1. Partition of the extent of EMD into the extents of its refined semantic types, E(pure EMD), E(EMD ∩ NP) and E(EMD ∩ Mammal).
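As a minimal sketch of the definitions above, assume the ST assignments are available as a plain mapping from concept name to a set of ST names. This is a hypothetical input format chosen for illustration, not the UMLS file layout, and the helper names `extent` and `group_by_assignment` are not taken from the paper. Under those assumptions, the extent E(T) and its groups of concepts with identical ST assignments could be computed as follows.

```python
from collections import defaultdict

# Toy ST assignments based on the examples in the text (illustrative only).
assignments = {
    "Arthritis, Experimental": {"EMD"},
    "Disease Model": {"EMD"},
    "Melanoma, Experimental": {"EMD", "NP"},
    "Experimental Hepatoma": {"EMD", "NP"},
    "Knock-in Mouse": {"EMD", "Mammal"},
    "Carcinosarcoma": {"NP"},
}


def extent(st, assignments):
    """E(st): all concepts that are assigned the semantic type st."""
    return {c for c, types in assignments.items() if st in types}


def group_by_assignment(st, assignments):
    """Split E(st) into groups of concepts with exactly the same set of STs."""
    groups = defaultdict(set)
    for concept in extent(st, assignments):
        groups[frozenset(assignments[concept])].add(concept)
    return dict(groups)


groups = group_by_assignment("EMD", assignments)
for types in sorted(groups, key=sorted):
    print(sorted(types), "->", sorted(groups[types]))
```

Each key of `groups` is one distinct combination of STs, so concepts assigned only EMD end up in a different group from those assigned both EMD and NP, which is the non-uniformity the paragraph describes.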
In our previous research [6,7], we developed a technique for automatically constructing a Refined Semantic Network (RSN), which makes explicit the kind of distinction demonstrated with Melanoma, Experimental and Arthritis, Experimental. The RSN is a semantically uniform abstraction network for a two-level terminological system such as the UMLS, consisting of META and SN. We now provide a brief summary of how the RSN is constructed.

² Concept names are in italic font.

Given a set of original semantic types and their assignments to concepts, our technique creates an RSN consisting of two kinds of semantic types: pure semantic types (pure STs) and intersection semantic types (intersection STs). Fig. 1 uses a Venn diagram [10] to show how the RSN is constructed for E(EMD). Each ellipse represents the extent of the semantic type written above it. Each box represents a concept. Overlapping ellipses represent intersections of extents, corresponding to intersection STs. In Fig. 1, there are 46 concepts, for example, Arthritis, Experimental and Disease Model, which are assigned the pure ST EMD. Exactly 26 concepts, for example, Melanoma, Experimental and Experimental Hepatoma, are assigned the intersection ST EMD ∩ Neoplastic Process. The intersection is denoted by the mathematical operator ∩. One concept, Knock-in Mouse, is assigned EMD ∩ Mammal.

A concept with an assignment of a pure ST has the semantics of a single category. We call this “simple semantics.” For example, Arthritis, Experimental has the simple semantics Experimental Model of Disease. A concept with an assignment of an intersection ST has the semantics of more than one category. We call this “compound semantics.” For example, the concept Neoplasms, Experimental has the compound semantics EMD ∩ Neoplastic Process. The meaning of the compound semantics is that Neoplasms, Experimental is both an Experimental Model of Disease and a Neoplastic Process.

Collectively, pure STs and intersection STs are called refined semantic types (refined STs). The RSN is the graph of all the refined STs. Each pure ST is derived directly from one of the original semantic types; however, only those concepts that were not assigned any other semantic type, e.g. Arthritis, Experimental, are still assigned this ST. A concept originally assigned more than one ST, e.g. Melanoma, Experimental, is now assigned a unique intersection ST. An intersection ST is defined for each non-empty intersection of extents, involving any number of original semantic types. Here, we are using “intersection” in the sense of the standard mathematical notion of set intersection, since extents are defined as sets. For example, EMD is a pure ST; the combination EMD ∩ Neoplastic Process is an intersection ST.

In this paper, we utilize the framework of the RSN designed in [6,7], and utilized for auditing in [11], to propose the approach of group auditing for groups of uniform semantics. The extents of the refined semantic types provide such units. This approach enables effective auditing of large groups. We note that in [6,7] auditing was carried out based on intersection STs. However, it was only done with respect to intersection STs having very small extents, e.g. the extent of EMD ∩ Mammal. No effort was made to audit the whole ST extent. In [11], we extensively studied auditing of small extents of intersections of STs. In that study, we observed
that more errors were found in very small intersection ST extents (with up to 6 concepts) than in larger extents.

The connection between the two levels of the UMLS, the META and the SN, used in this paper to identify suspicious concepts, has also been used to tackle a variety of other problems. For example, the semantic resources of META and the SN were applied to identify related biomedical concepts in a biomedical text summarizer [12]. Methods for aligning Metathesaurus relationships with their counterparts in the Semantic Network were investigated in [13]. The META, together with the associated semantic types, was used as a resource for a medically specific named entity tagger for Medical Question-Answering [14]. An ontology was created, using META and the SN, for the construction of a knowledge base for Bayesian Decision Models [15]. In [16], the semantic categorization of META concepts through the SN and semantic groups was utilized in exploring the semantics of the relationships between co-occurring UMLS concepts. A distributional similarity approach was discussed in [17,18] to classify and/or validate the semantic classification of the UMLS concepts.

The huge size and complexity of the UMLS make human comprehension very difficult and error prone. Considerable research has been carried out on quality assurance of the UMLS. Semantic techniques, complemented with lexical techniques, were used to detect classification errors [8,19]. The redundancy of hierarchical relations in the UMLS was investigated in [20]. Formal and naïve approaches for identifying and eliminating circular hierarchical relations in the UMLS, which lead to the creation of a directed acyclic Metathesaurus graph, have been proposed in [21,22]. In [23], an algorithm was presented for identifying all redundant ST assignments. In a redundant assignment, a concept is assigned the semantic types X and Y, and X IS-A Y. Such redundant assignments are forbidden by the rules of the UMLS [5].
The respective intersection types are not legal and would not appear in the RSN after redundant assignments are removed. A method for finding undetected synonyms in the UMLS has been developed [24]. Object-oriented models have been employed to support navigation, maintenance, and auditing of the UMLS [25]. Revisions of the SN were described in [26] to facilitate terminology integration. Auditing is an important phase in any terminology’s life cycle [27]. Various techniques have been proposed and applied to different medical terminologies. For example, ontological and linguistic techniques have been utilized in auditing the NCI thesaurus and the SNOMED [28–31]. An analysis has been carried out to determine how well the SNOMED’s IS-A hierarchy adheres to certain ontological principles [32]. This latter work makes use of the SNOMED’s description-logic (DL) [33] formalism. Such DL representations have also been used for the development of algorithms to detect terminological inconsistencies [34] and synonymy [35]. Error detection [36] for the ‘‘Diagnoses for Intensive Care Evaluation” (DICE) System [37] is based on migration to DL. In [38], computational methods were demonstrated to automatically identify terms and definitions which are defined in a circular or unintelligible way in the Gene Ontology. Other techniques have also been used to find errors caused by design problems in the Gene Ontology [39–42]. Principles for enforcing ontological soundness were discussed in [43]. In the META, all relationships are recorded in the MRREL file, carrying a general label (REL). One type of hierarchical relationship is called child-of relationship. These hierarchical relationships, which are marked as PAR and CHD in the MRREL file, are migrated from hierarchical relationships in source terminologies [21]. The child-of relationships in the META also carry an additional label (RELA), such as is_a, branch_of and part_of [44], obtained from a source vocabulary. 
This label further specifies the child-of relationship and explains its nature more exactly. Table 1 shows³ the RELA distribution of child-of in the UMLS. As we see, around 60% are not labeled, while 38% are labeled with is_a. The is_a labels typically come from well-designed source terminologies, such as SNOMED, NCI, GO, UWDA and NDFRT. Thus, we consider the child-of relationships with the is_a label as more expressive child-of relationships than those which are unlabeled.

Table 1
RELA distribution of child-of in the UMLS

RELA                 Number of concepts   Label percentage (%)
branch_of                         5,036                   0.42
has_codesystem                    1,893                   0.16
is_a                            453,680                  37.83
member_of_cluster                 2,674                   0.22
part_of                          16,841                   1.4
subtype_of                        5,631                   0.47
tributary_of                      1,502                   0.13
null                            711,849                  59.36
Total                         1,199,106                 100

³ The data in Table 1 were retrieved by Kuo-Chuan Huang of the New Jersey Institute of Technology.

3. Methods

In the group-based approach underlying our methodology, we present a human auditor with a group comprising concepts purportedly exhibiting exactly the same overarching semantics. In such a group we expect to find similar concepts. Thus, a concept that stands out in some respect should be readily discernible.

3.1. Deriving refined semantic type extents

A concept assigned more than one ST resides in several different extents. The extent of an ST is thus typically semantically non-uniform. As such, extents of STs are themselves not suitable groups for our group auditing purpose. However, every concept has exactly one assigned refined ST. Therefore, the refined STs derived from a semantic type T serve as a partition of its extent E(T). For example, in Fig. 1, E(EMD) is partitioned into three disjoint groups of concepts, E(pure EMD), E(EMD ∩ NP) and E(EMD ∩ Mammal). Importantly, an individual refined ST Ti, derived from T, is characterized by exhibiting a unique set of ST assignments across all its concepts. Therefore, we have chosen the extents E(Ti) of the refined STs Ti as the concept groups underlying our search for ST mis-assignments in E(T). An auditor is presented with the extents of the refined STs of an original ST T one by one. Note that the size of each such E(Ti) is smaller than that of E(T). Therefore, the auditor is not only given the advantage of reviewing semantically uniform groups, e.g. all neoplastic experimental diseases, but also the added benefit of reviewing smaller groups.

3.2. Creating ST assignment table

Our algorithm for identifying concepts with possible ST mis-assignments uses a table of semantic type assignments, called the ST table. The table contains all concepts of E(T) divided into sections for the refined STs of E(T). For each concept c, the table lists its ST assignments, denoted as Types(c), as well as its parent concepts and their respective ST assignments. Table 2 shows an excerpt of the ST table for E(EMD), where the following abbreviations are used: DS (disease or syndrome), IP (intellectual product), RA (research activity) and RD (research device). For example, Mouse Models of Human Cancer and its
parent Rodent Model are both assigned EMD. However, Animal Model, the parent of Rodent Model, is assigned Animal. Some concepts have multiple parents; for example, Carcinoma 256, Walker has two parents, Carcinosarcoma and Neoplasms, Experimental. In this case, the STs of each parent are included in the ST table.

Table 2
Sample ST table for the extent of Experimental Model of Disease (EMD)

Concept                          ST             Parent                            Parent ST
EMD
Mouse Models of Human Cancer     EMD            Rodent Model                      EMD
Cancer Model                     EMD            Biological Models                 RD ∩ IP
Predictive Cancer Model          EMD            Cancer Model                      EMD
Tissue Model                     EMD            In vitro Model                    RA
Liver Cirrhosis, Experimental    EMD            Animal Disease Models             EMD
                                                Liver Cirrhosis                   DS
Rodent Model                     EMD            Animal Model                      Animal
EMD ∩ NP
Carcinoma 256, Walker            EMD ∩ NP       Carcinosarcoma                    NP
                                                Neoplasms, Experimental           EMD ∩ NP
Rous Sarcoma                     EMD ∩ NP       Experimental Organism Diagnosis   Classification
EMD ∩ Mammal
Knock-in Mouse                   EMD ∩ Mammal   Genetically Engineered Mouse      EMD

RD – research device, IP – intellectual product, RA – research activity, DS – disease or syndrome, NP – neoplastic process.

3.3. Identifying suspicious concepts

As discussed in [45], ideally, a concept in the META is supposed either to be assigned all ST assignments of its parent(s) or to have ST assignments that are more specific than those of its parents. In this
paper, we use “subtype” to refer to the descendants of a semantic type along a path of IS-A relations. For example, if B IS-A C and C IS-A D, then both B and C are subtypes of D. It is reasonable to suspect that a concept, say c, is in error if it satisfies the following condition: a parent p of c is assigned an ST Z such that c is assigned neither Z nor an ST Y that is a subtype of Z (see Fig. 2). We note that if c were assigned such a Y, it would be a legitimate configuration of ST assignments for a pair of parent and child concepts according to [45].

For example, the ST assigned to Mouse Choroid Plexus Carcinoma is EMD, while the ST assigned to its parents, Mouse Carcinoma and Mouse Choroid Plexus Tumors, is Neoplastic Process. Since EMD is different from Neoplastic Process and is not a subtype of Neoplastic Process in the SN, Mouse Choroid Plexus Carcinoma is identified as a suspicious concept with a possible ST mis-assignment. Upon review of its definition, we find that Mouse Choroid Plexus Carcinoma is “a malignant Choroid Plexus Tumor which shows anaplastic features and usually invades neighboring brain structures.” It should thus be reassigned the intersection ST EMD ∩ Neoplastic Process. With the reassignment, a legitimate ST assignment configuration of Mouse Choroid Plexus Carcinoma and its parents will be achieved.

In another example, sewage is assigned Environmental Effect of Humans (EEH). Its parents waste product and waste management are assigned the STs Substance and Human Caused Phenomenon or Process (HCPP), respectively. In this case, the child’s ST EEH IS-A the parent’s ST HCPP. According to the algorithm for identifying suspicious concepts, in a case where the ST assigned to the child concept IS-A the ST assigned to the parent concept, this ST contributes to a legitimate configuration. But the other parent of sewage, waste product, is assigned the ST Substance, which is different from sewage’s ST EEH, and EEH is not a subtype of Substance in the SN. Thus, the concept sewage is deemed suspicious by our definition. Upon review, the ST Substance was added for sewage by the auditor. With this addition, a legitimate configuration of STs for the concept sewage was achieved.

The algorithm IdentifySuspiciousConcepts(G) uses a pseudo code description for identifying all suspicious concepts in a given set of concepts G. In an initial invocation of the algorithm, G is the extent of some refined ST of interest. Once all suspicious concepts in that refined ST’s extent have been identified by the algorithm, they are presented to the auditor for consideration.
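The suspicion condition stated above can be sketched in a few lines. This is an illustrative reading of the condition, not the authors' pseudo code: the function names, the dictionary-based inputs, and the toy data are assumptions, and the only IS-A edge encoded is the one the text mentions (EEH IS-A HCPP).

```python
# Hypothetical toy inputs; only the IS-A edge named in the text is encoded.
IS_A = {"EEH": {"HCPP"}}  # ST -> its direct parent STs in the SN


def is_subtype(y, z, is_a=IS_A):
    """True if ST y is a proper descendant of ST z along IS-A paths."""
    stack, seen = list(is_a.get(y, ())), set()
    while stack:
        t = stack.pop()
        if t == z:
            return True
        if t not in seen:
            seen.add(t)
            stack.extend(is_a.get(t, ()))
    return False


def covers(child_types, z, is_a=IS_A):
    """The child is assigned z itself or some subtype of z."""
    return any(y == z or is_subtype(y, z, is_a) for y in child_types)


def identify_suspicious_concepts(group, types, parents, is_a=IS_A):
    """Concepts in group with a parent ST that they neither share nor refine."""
    suspicious = set()
    for c in group:
        for p in parents.get(c, ()):
            if any(not covers(types[c], z, is_a) for z in types[p]):
                suspicious.add(c)
                break
    return suspicious


types = {
    "sewage": {"EEH"},
    "waste product": {"Substance"},
    "waste management": {"HCPP"},
    "Mouse Choroid Plexus Carcinoma": {"EMD"},
    "Mouse Carcinoma": {"NP"},
    "Mouse Choroid Plexus Tumors": {"NP"},
    "Predictive Cancer Model": {"EMD"},
    "Cancer Model": {"EMD"},
}
parents = {
    "sewage": ["waste product", "waste management"],
    "Mouse Choroid Plexus Carcinoma": ["Mouse Carcinoma", "Mouse Choroid Plexus Tumors"],
    "Predictive Cancer Model": ["Cancer Model"],
}
group = {"sewage", "Mouse Choroid Plexus Carcinoma", "Predictive Cancer Model"}
suspicious = identify_suspicious_concepts(group, types, parents)
```

With this toy data, sewage and Mouse Choroid Plexus Carcinoma are flagged, while Predictive Cancer Model is not, because it shares EMD with its parent. Adding Substance to the STs of sewage, as the auditor did, removes it from the result.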

3.4. Straightforward auditing procedure for semantic type assignment of suspicious concepts

If the human auditor deems that an original ST assignment to some suspicious concept c is incorrect, then a reassignment is made and the ST table is updated. Procedure StraightforwardAuditingSuspiciousSTAssignments(Susp) describes the process of correcting a suspicious concept’s ST assignments.

Using this procedure, an auditor can reassign the suspicious concept Mouse Choroid Plexus Carcinoma the intersection ST EMD ∩ Neoplastic Process and the suspicious concept sewage the intersection ST EEH ∩ Substance. The auditor needs only to focus on suspicious concepts rather than the whole refined ST’s extent. The number of suspicious concepts is expected to be much smaller than the number of concepts in the refined ST’s extent. This procedure thus significantly reduces the auditor’s scope of review and identifies possible ST mis-assignments with higher precision.

3.5. Dynamic auditing procedure for semantic type assignments

In the previous subsection, we presented the straightforward procedure, in which auditing is conducted for every suspicious concept c. Suspicious concepts were identified by comparing the ST assignments of c and those of the parents of c. However, it is possible that c has children in the same extent with the same ST assignments as c. In such a case, if c’s ST assignments are changed, the ST assignments of its children become different from those of c. To take such cases into account, we introduce the DynamicAuditingSuspiciousSTAssignments(Susp) procedure for detecting all concepts which become suspicious during the auditing process. In this procedure, the algorithm IdentifySuspiciousConcepts(G) will be re-applied to the set of c’s children to determine if any have become suspicious. The procedure will be repeated recursively until there are no more suspicious concepts in the extent. (A recursive procedure is reapplied to smaller subproblems of the original problem. Eventually, all the applications of the procedure to the subproblems result in a solution to the original problem.) The dynamic procedure is described in pseudo code.

The dynamic nature of the recursion of the procedure enables the auditor to increase the number of errors found with only a little more effort. As an example, the concept Cancer Model is identified as a suspicious concept. According to its NCI definition, “Any model that can be used to study issues important in cancer such as cancer development or prediction,” Cancer Model is reassigned Intellectual Product. Due to this change, the procedure examines its children Breast Cancer Model and Predictive Cancer Model and finds that they are now suspicious. When using the straightforward auditing procedure, they were not deemed suspicious because they are assigned EMD just as their parent Cancer Model originally was. These two concepts are subsequently reviewed and also reassigned Intellectual Product, for reasons similar to those for their parent Cancer Model.

[Figure 2: diagram of the two UMLS levels, with semantic types Z, Y (Y IS-A Z) and X in the SN, and concepts p and c (c child-of p) in the META.]

Fig. 2. A child concept c is not assigned the semantic types (or their subtypes) of its parent p. c is suspected to have an erroneous semantic type assignment.

4. Results

We have chosen to demonstrate our partitioning and auditing techniques for the extents of Experimental Model of Disease (EMD), defined as “representation in a non-human organism of a human disease for the purpose of research into its mechanism or treatment,” and Environmental Effect of Humans (EEH), defined as “change in the natural environment that is a result of the activities of human beings,” of the UMLS 2006AB version.

4.1. Auditing the extent of Experimental Model of Disease

4.1.1. Deriving refined semantic type extents

The original extent of EMD contains 73 concepts. If an auditor tries to review these concepts, they indeed seem to have an EMD semantics. The original extent does not lend itself easily to detecting ST assignment errors, since this extent is semantically non-uniform. However, the original EMD extent is divisible into three refined STs with semantically uniform extents, as shown in Fig. 1 and Table 3. As we shall see, these extents facilitate effective auditing due to their uniform semantics. The pure ST EMD is assigned to 46 concepts. The two intersection STs are EMD ∩ Neoplastic Process, assigned to 26 concepts, and EMD ∩ Mammal, assigned to one concept, Knock-in Mouse.

4.1.2. Creating ST assignment table

Once refined STs have been derived, we create an ST table for E(EMD) separated into different refined ST portions. Table 2 shows an excerpt from the ST table for the original ST EMD.

Table 3
Refined semantic types and their extents derived from Experimental Model of Disease (EMD)

EMD (pure ST) (46 concepts):
Alloxan Diabetes; Animal Cancer Model; Animal Disease Models; Arthritis, Adjuvant-Induced; Arthritis, Collagen-Induced; Arthritis, Experimental; Autoimmune Myositis, Experimental; Breast Cancer Model; Cancer Model; Decorticate CNS; Diabetes Mellitus, Experimental; Diencephalic Drain model; Disease model; Experimental Autoimmune Encephalomyelitis; Experimental Autoimmune Myasthenia Gravis, Passive Transfer; Experimental Epilepsy; Experimental Lung Inflammation; Experimental Pneumococcal Meningitis; Experimental Spinal Cord Ischemia; Gene Knock-Out Model; Genetically Engineered Mouse; Hypokinesia, Experimental; Knock-out; Leukemia, Experimental; Liver Cirrhosis, Experimental; Models for Cancer; Mouse Choroid Plexus Carcinoma; Mouse Choroid Plexus Papilloma; Mouse Glucagonoma; Mouse Models of Human Cancer; Murine Acquired Immunodeficiency Syndrome; Myasthenia Gravis, Autoimmune, Experimental; Nervous System Autoimmune; Neuritis, Autoimmune, Experimental; Non-Mammalian Organisms as; Non-Rodent Model; Parkinsonism, Experimental; Predictive Cancer Model; Rodent Model; spinal model; Streptozotocin Diabetes; Tissue Model; Transgenic Model; Transient Gene Knock-Out Model; Tumor Cell Graft; Xenograft Model

EMD ∩ Neoplastic Process (NP) (26 concepts):
Carcinoma 256, Walker; Carcinoma, Ehrlich Tumor; Carcinoma, Krebs 2; Carcinoma, Lewis Lung; Experimental Hepatoma; Hepatoma, Morris; Hepatoma, Novikoff; Leukemia L1210; Leukemia L5178; Leukemia P388; Liver Neoplasms, Experimental; Mammary Neoplasms, Experimental; Melanoma, B16; Melanoma, Cloudman S91; Melanoma, Experimental; Melanoma, Harding-Passey; Neoplasms, Experimental; Rous Sarcoma; Sarcoma 180; Sarcoma 37; Sarcoma, Avian; Sarcoma, Engelbreth-Holm-Swarm; Sarcoma, Experimental; Sarcoma, Jensen; Sarcoma, Yoshida; Tumor Virus Infections

EMD ∩ Mammal (1 concept):
Knock-in Mouse

4.1.3. Identifying suspicious concepts

Applying the algorithm IdentifySuspiciousConcepts(G) to the extent of EMD yields 30 suspicious concepts out of its total of 73 concepts. These suspicious concepts are listed in Table 4, where the following additional abbreviation is used: OTF (Organ or Tissue Function).

Table 4
Suspicious concepts identified in extents of refined STs EMD and EMD ∩ Neoplastic Process

Shaded concepts are those found to have incorrect ST assignments by applying StraightforwardAuditingSuspiciousSTAssignments(Susp). DS – disease or syndrome, RA – research activity, IP – intellectual product, RD – research device, OTF – organ or tissue function, NP – neoplastic process.

4.1.4. Correcting semantic type assignments of suspicious concepts

After reviewing the 30 suspicious concepts, 12 of them, which are shaded in Table 4, were found to have incorrect ST assignments. The auditing was performed by Yan Chen, who has a degree in medicine. These 12 ST assignments were corrected and the ST table was updated by applying StraightforwardAuditingSuspiciousSTAssignments(Susp). The corrected ST assignments are shown as non-shaded entries in Table 5.

Due to the modified ST assignments for these concepts, assignments of their children may also change. We applied the DynamicAuditingSuspiciousSTAssignments(Susp) procedure to the same 30 suspicious concepts and discovered three additional suspicious concepts, Breast Cancer Model, Predictive Cancer Model and Knock-in Mouse. Breast Cancer Model and Predictive Cancer Model were reassigned the ST Intellectual Product, joining their parent Cancer Model. Knock-in Mouse was reassigned the ST Mammal, following its parent Genetically Engineered Mouse. Therefore, a total of 15 concepts’ ST assignments were changed with the aid of our procedure. The ST reassignments are listed in Table 5, where the three reassignments due only to use of the dynamic procedure are shaded.

The only error missed by our dynamic auditing procedure was the assignment of Mouse Models of Human Cancer. It was not suspicious, as it has the same EMD assignment as its parent Rodent Model. Although Rodent Model was a suspicious concept, its assignment was not changed, so the dynamic procedure did not expose the above error, which was later found by an exhaustive human review.

Fig. 1 used a Venn diagram to show the intersections involving EMD before structural auditing. Fig. 3 shows a Venn diagram for the same concepts after structural auditing. The numbers in the diagram indicate the numbers of concepts of the respective intersection STs and pure STs. For example, the intersection EMD ∩ Mammal in Fig. 1, which represents the concept Knock-in Mouse, is removed after the auditing (see Fig. 3). In total, six concepts were moved from EMD to EMD ∩ NP. Nine concepts were moved away from EMD: two to Mammal, four to Intellectual Product, two to Research Activity and one to Organ or Tissue Function.

4.2. Auditing the extent of EEH

4.2.1. Deriving refined semantic type extents In [6], we pointed out errors in the ST assignments of some of the EEH concepts assigned also other STs such as Finding. That is, we concentrated on intersections with small extents. Those errors were communicated to the NLM in the workshop ‘‘The future of the UMLS Semantic Network” [46]. In the release 2007AB of the UMLS, the assignments of the concepts with erroneous ST assignments reported in [6] were changed. The changes were not necessarily following our recommendations. We see the role of an auditor to raise questions and sometimes suggest alternative models. But it is up to an editor of the UMLS to make an authoritative decision about a change in the modeling. Only such an editor can be aware of the general approach used in systematic modeling, while designing a terminology or assigning STs to concepts of UMLS source terminologies. As a result of these changes following our auditing report in [6], the 2007AB release has only one intersection for EEH with Hazadorous or Poisonous Substance. All other intersections of EEH disappeared. In this paper, the assignments of the 2007AB release were our starting point, and we analyzed the whole extent of EEH. The original EEH extent consists of 61 concepts, which are also semantically non-uniform. For example, Second hand cigarette smoke is assigned only EEH and exhibits simple semantics, while Smoke has compound semantics since it is assigned both EEH

Table 5 Revised ST reassignments for erroneous concepts in E(EMD) by applying the DynamicAuditingSuspiciousSTAssignments(Susp) procedure

The three shaded reassignments are due only to use of the dynamic procedure, while non-shaded entries are corrected ST assignments also identified by the straightforward procedure. NP – neoplastic process, IP – intellectual product, RA – research activity, OTF – organ or tissue function.

Fig. 3. Intersection STs and pure STs involving the original EMD extent after auditing. The numbers of concepts of each refined ST are shown. [Venn diagram counts: Experimental Model of Disease (pure), 31; EMD ∩ Neoplastic Process, 33; Mammal, 2; Research Activity, 2; Intellectual Product, 4; Organ or Tissue Function, 1.]


Table 6 Sample ST table for Environmental Effect of Humans (EEH)

Concept | ST | Parent | Parent ST
EEH
Noise Pollution | EEH | Environmental Pollution | EEH
Air Pollution, Radioactive | EEH | Radiologic Health | BOD
Automobile Emission | EEH | Smog | NPP
Classroom Environment | EEH | Academic Environment | IC
 | | Student Characteristics | Classification
 | | Academic Environment (PsycINFO Subcluster Term) | Classification
PBC airborne level | EEH | Specific Occupational Equipment and Hazards | Classification
 | | Air Pollution | EEH
Sewage | EEH | Waste Products | Substance
Garbage | EEH | Refuse Disposal | OA
Indoor pollution | EEH | Environmental Pollution | EEH
EEH ∩ HPS
Industrial Waste | EEH ∩ HPS | Waste Products | Substance
 | | Environmental Pollutants | HPS
 | | Environmental Contamination | EEH
 | | Industrial Product | CVS
Smoke | EEH ∩ HPS | Air Pollution | EEH
 | | Substance Categorized Structurally | Substance
 | | Physical Forces | NPP
 | | Natural Physical Forces | Classification
 | | Gaseous Substance | CVS

BOD – biomedical occupation or discipline, NPP – natural phenomenon or process, IC – idea or concept, OA – occupational activity, HPS – hazardous or poisonous substance, CVS – chemical viewed structurally.

Table 7 Suspicious concepts identified in extents of refined STs EEH and EEH ∩ HPS

and Hazardous or Poisonous Substance (HPS). When reviewing this semantically non-uniform extent, all concepts seem to have the EEH semantics, except four: college, drug-free school, classroom environment and educational environment. The last two were probably categorized as EEH by a string matching technique due to the word “environment.” In order to facilitate auditing semantically uniform extents, two refined STs, EEH (pure ST) and EEH ∩ HPS (intersection ST), and their extents were derived. There are 56 concepts assigned only EEH, having simple semantics. The remaining five concepts are assigned EEH ∩ HPS, having compound semantics. For example, Air Pollution is assigned EEH, while Acid Rain is assigned EEH ∩ HPS.

4.2.2. Creating the semantic type table for EEH

Once the refined STs have been derived, we create an ST table for E(EEH) separated into the different refined ST portions. Table 6 shows an excerpt from the ST table for the original EEH, where the following abbreviations are used: BOD (Biomedical Occupation or Discipline), NPP (Natural Phenomenon or Process), IC (Idea or Concept), OA (Occupational Activity) and CVS (Chemical Viewed Structurally).

4.2.3. Identifying suspicious concepts

We apply the algorithm IdentifyingSuspiciousConcepts(G) to the extents of the refined STs EEH and EEH ∩ HPS, respectively. This application yields 26 suspicious concepts out of a total of 56 concepts in EEH, and four suspicious concepts out of five in EEH ∩ HPS. These suspicious concepts are listed in Table 7, where additional abbreviations are used as follows: MO (Manufactured Object), SC (Spatial Concept) and SB (Social Behavior). For example, Second hand cigarette smoke is assigned EEH; however, the set of ST assignments of its parents (Natural Physical Forces and smoke) is {Classification, EEH, Hazardous or Poisonous

Shaded concepts are those found to have incorrect ST assignments by applying StraightforwardAuditingSuspiciousSTAssignments(Susp). BOD – biomedical occupation or discipline, NPP – natural phenomenon or process, IC – idea or concept, MO – manufactured object, SC – spatial concept, PP – phenomenon or process, OA – occupational activity, SB – social behavior, HPS – hazardous or poisonous substance, CVS – chemical viewed structurally.

Substance}, which is a superset of the ST assignment of Second hand cigarette smoke, rather than a subset of it as it should be. Therefore, this is identified as a suspicious ST assignment. Another example is Garbage, whose parents (Refuse Disposal, Specific Occupational Equipment and Hazards and Occupational hazard) have ST assignments {Occupational Activity, Classification, Phenomenon or Process}. Although EEH is a subtype of Phenomenon or Process, the other two STs, Occupational Activity and Classification, are STs that Garbage is lacking. Thus, the ST assignment of Garbage is suspicious. All concepts with incorrect ST assignments are highlighted in Table 7.

4.2.4. Correcting suspicious semantic type assignments

Out of the 30 suspicious concepts assigned EEH, 14 have erroneous ST assignments. For example, Second hand cigarette smoke is known to be associated with an increased risk of developing lung cancer. Obviously it is a hazardous or poisonous substance. Therefore, it should be assigned EEH ∩ HPS, that is, both EEH and HPS. Another example is Garbage, which is assigned EEH, according to the definition of EEH, which emphasizes the “change of the environment.” However, Garbage is not only an environmental effect of humans but also a substance. Therefore, it should be reassigned EEH ∩ Substance. In this example, none of the 14 erroneous concepts identified in the previous steps has children. In this case, the dynamic and straightforward auditing procedures yield the same results. The ST assignments of EEH to 13 concepts and one assignment of EEH ∩ HPS were corrected (see Table 8). The only ST assignment

Table 8 Corrected ST assignments of concepts in extents of refined STs EEH and EEH ∩ Hazardous or Poisonous Substance (HPS), by either the dynamic or the straightforward auditing procedure

Concept | Correct ST
Automobile Emission | {EEH, HPS}
Classroom Environment | {Organization}
College | {Organization}
Drug-free school | {Governmental or Regulatory Activity}
Educational Environment | {IC, Classification}
Exhaust Fumes | {EEH, HPS}
Factory Smoke | {EEH, HPS}
Garbage | {EEH, Substance}
Industrial Smog | {EEH, HPS}
PBC Airborne Level | {Quantitative Concept}
Second Hand Cigarette Smoke | {EEH, HPS}
Sewage | {EEH, Substance}
Tobacco Smoke Pollution | {EEH, HPS}
Industrial Waste | {EEH, Substance}

HPS – hazardous or poisonous substance, IC – idea or concept.

error missed by our methodology is for Environmental sludge, which was manually reassigned EEH ∩ Substance. The reason for this omission is that this concept has no parents. Table 8 lists the 14 concepts with the semantic type reassignments after group auditing. Six concepts are moved from pure ST EEH to EEH ∩ HPS, three concepts to EEH ∩ Substance, and one concept to EEH ∩ Quantitative Concept. There are also five concepts which should not be assigned EEH at all. They are Classroom environment and College, which should be reassigned Organization; Drug-free school, which should be assigned Governmental or Regulatory Activity; Educational environment, which should be assigned Idea or Concept ∩ Classification; and PBC airborne level, which should be assigned Quantitative Concept. Since none of these corrected concepts has children, no recursive calls are needed, and no difference exists between the dynamic and straightforward procedures. One concept, Industrial Waste, originally assigned EEH ∩ HPS, was reassigned EEH ∩ Substance. As a result, among the 61 original concepts in the extent of the ST EEH, 43 concepts ended up in the extent of the pure ST EEH, nine ended up in the extent of the intersection ST EEH ∩ HPS, and four in the extent of the intersection ST EEH ∩ Substance, one of which was missed by our auditing methodology. Five concepts are not assigned EEH anymore.
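The suspiciousness test and the dynamic re-checking of children used in Sections 4.2.3 and 4.2.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the worklist formulation (instead of recursive calls), and the `review` callback standing in for the human auditor are assumptions made for illustration. The ST data come from the examples in the text and Table 8; the pairing of each of Garbage's parents with one particular ST is assumed, since the text gives only the combined set.

```python
from collections import deque

# Toy fragment of the Semantic Network: ST -> set of ancestor STs.
# Only "EEH is a subtype of Phenomenon or Process" is taken from the text.
ST_ANCESTORS = {"EEH": {"Phenomenon or Process"}}


def covered(parent_st, concept_sts):
    """True if parent_st equals, or is an ancestor of, one of the concept's STs."""
    return any(parent_st == st or parent_st in ST_ANCESTORS.get(st, set())
               for st in concept_sts)


def is_suspicious(concept, sts, parents):
    """A concept is suspicious if some parent carries an ST that is not covered."""
    return any(not covered(p_st, sts[concept])
               for parent in parents.get(concept, ())
               for p_st in sts[parent])


def dynamic_audit(sts, parents, review):
    """Review suspicious concepts; after each change, re-test the children.

    review(concept, current_sts) stands in for the auditor and returns the
    corrected ST set, or None to leave the assignment unchanged.
    Simplification: each concept is queued for review at most once.
    """
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)

    queue = deque(c for c in sts if is_suspicious(c, sts, parents))
    queued = set(queue)
    changed = {}
    while queue:
        concept = queue.popleft()
        new_sts = review(concept, sts[concept])
        if new_sts is None or new_sts == sts[concept]:
            continue
        sts[concept] = changed[concept] = new_sts
        for child in children.get(concept, ()):
            if child not in queued and is_suspicious(child, sts, parents):
                queued.add(child)
                queue.append(child)
    return changed


# ST assignments before auditing (examples from Section 4.2.3).
sts = {
    "Second hand cigarette smoke": {"EEH"},
    "Smoke": {"EEH", "HPS"},
    "Natural Physical Forces": {"Classification"},
    "Garbage": {"EEH"},
    "Refuse Disposal": {"OA"},                                        # assumed pairing
    "Specific Occupational Equipment and Hazards": {"Classification"},  # assumed pairing
    "Occupational hazard": {"Phenomenon or Process"},                 # assumed pairing
    "Environmental sludge": {"EEH"},                                  # has no parents
}
parents = {
    "Second hand cigarette smoke": ["Natural Physical Forces", "Smoke"],
    "Garbage": ["Refuse Disposal",
                "Specific Occupational Equipment and Hazards",
                "Occupational hazard"],
}
# Auditor decisions, taken from Table 8.
corrections = {
    "Second hand cigarette smoke": {"EEH", "HPS"},
    "Garbage": {"EEH", "Substance"},
}

suspicious = sorted(c for c in sts if is_suspicious(c, sts, parents))
changed = dynamic_audit(sts, parents, lambda concept, current: corrections.get(concept))
print(suspicious)
print(changed)
```

In this toy extent, Environmental sludge is never flagged because it has no parents, which mirrors the single error the methodology missed. Since neither corrected concept has children, the dynamic loop reviews exactly the initially suspicious concepts, as reported for EEH.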

5. Additional experiments

In this paper, we concentrated on two STs with small extents. A natural question is how the presented techniques scale to larger extents. To address the issue of scalability, we applied our technique to the extent of the pure ST Governmental or Regulatory Activity. For this extent of 512 concepts, about 40% are suspicious, and 18% of those suspicious concepts are actually erroneous. No exhaustive manual audit was performed for the whole extent of the pure Governmental or Regulatory Activity ST to find how many errors exist among non-suspicious concepts. While the percentage of errors found is lower than for EMD and EEH, it is still relatively high for the expanded auditing effort. We further applied the algorithm IdentifySuspiciousConcepts(G) to the largest extent of a pure ST in the UMLS, Disease or Syndrome. Due to the huge number of suspicious concepts (12,063) in the latter extent, we did not apply the labor-intensive auditing step. The results of applying our techniques to all four pure STs and the times required are listed in Table 9. One could extrapolate from the run time results for Disease or Syndrome that the run time for finding suspicious concepts for the whole UMLS, which is about 20 times larger, would be more than 250 h. The time for human auditing of the 12,000 suspicious concepts of Disease or Syndrome is estimated at over 200 h. It is impossible to estimate the expert auditing time for the whole UMLS based on the data provided in Table 9, as the number of suspicious concepts is unknown. However, one could use such estimates to decide whether to apply the presented techniques to a specific ST of interest for a given application.
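The two extrapolations above can be re-derived from the Table 9 figures with a short calculation. The sketch assumes linear scaling of run time with extent size, and it assumes that the auditing-time estimate reuses the review rate observed for Governmental or Regulatory Activity; the text does not state how the 200 h figure was obtained, so the second assumption is only one plausible reading.

```python
# Figures copied from Table 9 and the surrounding text.
dos_runtime_min = 885.5   # minutes to find suspicious concepts in pure Disease or Syndrome
umls_scale = 20           # the whole UMLS is "about 20 times larger"
gra_suspicious = 206      # suspicious concepts in Governmental or Regulatory Activity
gra_audit_h = 3.5         # hours spent auditing them
dos_suspicious = 12_063   # suspicious concepts in pure Disease or Syndrome

# Assumption: run time grows linearly with the number of concepts.
umls_runtime_h = dos_runtime_min / 60 * umls_scale

# Assumption: auditors keep the review rate observed for the 206 concepts.
audit_rate_per_h = gra_suspicious / gra_audit_h
dos_audit_h = dos_suspicious / audit_rate_per_h

print(f"Estimated run time for the whole UMLS: {umls_runtime_h:.0f} h")
print(f"Estimated auditing time for Disease or Syndrome: {dos_audit_h:.0f} h")
```

Under these assumptions both estimates land in the ranges quoted in the text (more than 250 h of run time, over 200 h of auditing).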
In order to test whether the precision of our results could be further improved, we experimented with limiting the condition for suspicious concepts to child-of relationships with is-a RELA labels and then auditing the resulting suspicious concepts by applying the procedure DynamicAuditingSuspiciousSTAssignments(Susp). Table 10 shows the results of applying two different conditions, general child-of relationships and child-of relationships with is-a RELA labels.

6. Discussion

6.1. Evaluation

In order to evaluate the auditing results obtained by our methodologies, we applied them to two different STs with small extents, EMD and EEH. To measure the performance of our methodologies, we conducted a comprehensive manual audit of the ST assignments

Table 9 Results and processing times of applying group auditing techniques to pure semantic type extents of different sizes

Extent of pure ST | Number of concepts | Number of suspicious concepts | Running time finding suspicious concepts (min) | Number of erroneous concepts | Error (%) | Error/suspicious (%) | Auditing time (h)
Experimental model of disease | 46 | 29 | 0.7 | 14 | 30 | 48 | 0.5
Environmental effect of humans | 56 | 26 | 0.8 | 13 | 23 | 50 | 0.5
Governmental or regulatory activity | 512 | 206 | 3.8 | 37 | 7 | 18 | 3.5
Disease or syndrome | 81,267 | 12,063 | 885.5 | ?ᵃ | ?ᵃ | ?ᵃ | ?ᵃ

ᵃ Auditing was not conducted on E(Disease or Syndrome).

Table 10 Comparisons of two different conditions for identifying suspicious concepts

Column groups: General child-of (columns 3–6); Child-of with is-a labels (columns 7–10)

Extent of pure ST | Number of concepts | Number of suspicious concepts | Number of erroneous concepts | Precision | Recall | Number of suspicious concepts | Number of erroneous concepts | Precision | Recall
Experimental model of disease | 46 | 29 | 14 | 0.48 | 0.93 | 23 | 11 | 0.48 | 0.73
Environmental effect of humans | 56 | 26 | 13 | 0.50 | 0.93 | 6 | 2 | 0.33 | 0.14
Governmental or regulatory activity | 512 | 206 | 37 | 0.18 | ?ᵃ | 72 | 8 | 0.11 | ?ᵃ

ᵃ Exhaustive manual auditing was not conducted on E(Governmental or Regulatory Activity).
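The precision and recall columns of Table 10 follow directly from the counts in the same table, as the short Python sketch below shows. The names `TABLE10` and `summarize` are illustrative. The totals of 15 and 14 erroneous concepts for the pure EMD and pure EEH extents are not listed in the table; they are assumptions inferred from the reported recall values (14/15 and 13/14).

```python
# (suspicious, erroneous found) per condition, copied from Table 10.
# total_errors is inferred from the reported recall, not stated in the table;
# None marks the extent for which no exhaustive manual audit was done.
TABLE10 = {
    "EMD": {"general": (29, 14), "is_a": (23, 11), "total_errors": 15},
    "EEH": {"general": (26, 13), "is_a": (6, 2), "total_errors": 14},
    "GRA": {"general": (206, 37), "is_a": (72, 8), "total_errors": None},
}


def summarize(table):
    """Return {(extent, condition): (precision, recall)} rounded to two decimals."""
    rows = {}
    for extent, data in table.items():
        for condition in ("general", "is_a"):
            suspicious, erroneous = data[condition]
            precision = round(erroneous / suspicious, 2)
            total = data["total_errors"]
            recall = round(erroneous / total, 2) if total else None
            rows[(extent, condition)] = (precision, recall)
    return rows


results = summarize(TABLE10)
for key, (precision, recall) in results.items():
    print(key, precision, recall)
```

Precision here is the fraction of suspicious concepts that turned out to be erroneous, and recall is the fraction of all known errors that were flagged as suspicious.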


for both ST extents. With respect to the extents of EMD, the auditing results achieved with our dynamic procedure nearly matched those obtained with a comprehensive manual review of all the extents’ concepts. Assuming that the comprehensive manual review found all errors, our straightforward and dynamic auditing methodologies have a recall of 12/16 = 0.75 and a recall of 15/16 = 0.94, respectively. The precisions are 12/30 = 0.40 and 15/30 = 0.50, respectively, much higher than the precision 16/73 = 0.22 for the exhaustive manual review, where all concepts of the extent were considered.

With regard to the EEH extent, there was no difference between the straightforward and the dynamic procedures. The reason is that no concept for which a reassignment was made had children whose ST assignments could have become suspicious during auditing. The precision of the exhaustive review is 15/61 = 0.25. Assuming that the comprehensive manual review found all errors, the algorithmic (either dynamic or straightforward) procedure has a recall of 14/15 = 0.93. The precision is 14/30 = 0.47, much higher than for the manual review.

6.2. Interpretation

As we have seen in Section 4, a large portion, above 20%, of the concepts in the extents of EMD and EEH had ST assignment errors. Some concepts were missing a second semantic type while others should not be assigned this ST at all. As a result, the extents of the pure STs become meaningfully smaller for both of these STs. Perhaps the high percentage of ST errors is caused by the complex nature of these two specific STs, which are leaves in the SN and have many ancestor STs. Experiments with more and larger ST extents are needed for further study of the situation in other STs. As mentioned earlier, errors found in ST assignments do not simply indicate incorrect categorizations but may also expose a potentially wrong perception, which may have led to other kinds of errors. For example, a concept may have an incorrect child-of relationship to a parent from which it inherits an incorrect ST assignment. Once the ST mis-assignment has been uncovered, the incorrect child-of may be corrected in the next step. For example, the child-of relationship from Genetically Engineered Mouse (reassigned from EMD to Mammal), originally directed to Organism Modification (assigned RA), was indeed redirected (in 2006AD) to Laboratory Animal. Furthermore, lateral relationships inherited via the erroneous child-of relationship have been removed. If a concept is assigned a new ST, it may indicate that the concept was missing a child-of relationship. For example, the concept Mouse Models of Human Cancer, originally assigned EMD, and now additionally assigned Neoplastic Process, should have had a child-of to Animal Cancer Model, also assigned EMD ∩ Neoplastic Process.

In the case of the ST EMD, one can further limit the number of suspicious concepts needing review. This improvement is based on the observation that many concepts which represent an experimental disease have as a parent the respective concept representing the same disease in humans. For example, Melanoma, Experimental has the parent Melanoma. Thus, we should allow Disease or Syndrome as a legitimate assignment for a parent of an EMD concept. For example, the concept Arthritis, Experimental, assigned EMD, will not be considered suspicious due to its parent Arthritis being assigned Disease or Syndrome. Utilizing this improvement, eight suspicious concepts in Table 4 would not be suspicious any more, lowering an auditor’s effort by reducing the number of reviewed concepts from 30 to 22 while at the same time improving the precision from 15/30 = 0.50 to 15/22 = 0.68, since all 15 erroneous concepts are still detected. We note that the same effect would have occurred if the IS-A link of the semantic type EMD had been changed from “IS-A Pathologic Function,” which it currently is, to “IS-A Disease or Syndrome.” By the definition of the two STs, this is a needed change, since EMD models a human disease represented in an experimental organism. The current “IS-A Pathologic Function” will be implied by the transitivity of the IS-A relationship, since Disease or Syndrome IS-A Pathologic Function.

6.3. Limitations

More experiments with STs having large extents are needed to further examine the efficiency of our methodology for auditing ST assignments and the percentages of suspicious concepts found. The methodology described in this paper may still be difficult to apply to very large ST extents, since for such extents, even the number of suspicious concepts may be overwhelming, as illustrated above with the Disease or Syndrome ST. In such cases, it may help to partition the suspicious concepts into groups of narrower semantics, e.g. singly rooted hierarchies [47]. Reviews of such smaller groups will be easier. Furthermore, by looking at the roots of such groups, one may choose to manually review only those promising a potentially higher likelihood of errors. Those ideas require further experiments with STs with large extents.

When comparing the results of using child-of with and without the is_a RELA label in Section 5, we found that by limiting auditing to child-of relationships with is_a RELA labels, precision was lower than without the restriction, except for the precision for E(pure EMD), which shows no difference. Furthermore, the restriction of the condition to is_a labels produced a loss of recall ranging from 20% to 80%, that is, the major part of the concepts with wrong ST assignments were overlooked. As shown in Table 1, only 38% of the child-of relationships in META are is_a, while the majority (60%) are marked null, meaning their kinds are not specified. The is_a labels are typically from well-designed source terminologies, such as SNOMED, NCI, GO, UWDA and NDFRT. The concepts with wrong ST assignments in E(pure EMD) are mainly from NCI, and therefore, even though we restricted the condition to is_a, most of the concepts with wrong ST assignments were still found suspicious. However, this is not the case for the other two extents, EEH and Governmental or Regulatory Activity. Thus, in general, we recommend applying our technique to all child-of relationships.

6.4. Differences between partitioning techniques

We want to stress the difference between partitioning the META’s concepts of one ST extent, which we employ here, and partitioning of the STs of the UMLS Semantic Network into groups of STs as employed in [48–50]. Partitioning of the SN also appears in underlying metaschemas of the Semantic Network [51–53]. The first kind of partitioning, of the META’s concepts into semantically uniform extents, helps auditing the ST assignments of concepts. The second kind of partitioning helps in abstraction and comprehension of the Semantic Network and may help in auditing its structure. That is, those two partitioning tasks occur on different levels.
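The first kind of partitioning, which splits one ST extent into refined ST extents, amounts to grouping concepts by their exact set of assigned STs. The following Python sketch illustrates this with four EEH concepts mentioned in Section 4.2.1; the function name `refine_extent` is hypothetical, and the sketch is an illustration of the idea rather than the authors' implementation.

```python
from collections import defaultdict


def refine_extent(extent):
    """Group the concepts of an ST extent by their exact set of assigned STs.

    Each key is a frozenset of STs: a singleton corresponds to a pure ST,
    a larger set to an intersection ST.
    """
    groups = defaultdict(list)
    for concept, assigned in extent.items():
        groups[frozenset(assigned)].append(concept)
    return dict(groups)


# ST assignments as described in Section 4.2.1 (2007AB starting point).
eeh_extent = {
    "Air Pollution": {"EEH"},
    "Second hand cigarette smoke": {"EEH"},
    "Acid Rain": {"EEH", "HPS"},
    "Smoke": {"EEH", "HPS"},
}

refined = refine_extent(eeh_extent)
for sts, concepts in refined.items():
    print(" ∩ ".join(sorted(sts)), "->", concepts)
```

Every concept lands in exactly one group, so the groups form a partition of the extent, and each group is semantically uniform in the sense that all its concepts share the same ST set.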

7. Conclusions

We presented a group auditing paradigm for the UMLS which is based on groups of concepts which, by their categorization in the UMLS, are purportedly of similar semantics. A human expert auditor looking at such a group is usually able to tell quickly whether a concept does not fit in. Our approach is based on the extents of semantic types. However, because ST extents are often not semantically uniform, we used the Refined Semantic Network (RSN). Every concept of the UMLS is assigned exactly one refined semantic type from the RSN, and all concepts in the extent of such a refined semantic type have uniform semantics. As a result, auditors are given smaller groups of concepts of uniform semantics, and detecting concepts that do not fit in becomes easier. However, groups may still be large, and this paper presented an additional mechanism to select suspicious concepts. A concept is suspicious if it has a parent assigned a semantic type that is neither equal to one of the concept's semantic types nor an ancestor of one of them in the Semantic Network. The straightforward procedure utilizes an algorithm for detecting suspicious concepts. The dynamic procedure applies auditing also to concepts which were not originally suspicious but became suspicious due to changes in the assignments of their parents. Our methodologies were demonstrated with the extents of the two semantic types Experimental Model of Disease and Environmental Effect of Humans, which have small extents. The methodologies display high error recall with higher precision in comparison with an exhaustive manual audit. Scalability was shown for STs with larger extents.

Acknowledgements

This work was partially supported by the United States National Library of Medicine under Grant R 01 LM008445-01A2. We thank J.J. Cimino for his feedback on an earlier draft.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jbi.2008.06.001.

References

[1] Humphreys BL, Lindberg DAB, Schoolman HM, Barnett GO. The Unified Medical Language System: an informatics research collaboration. JAMIA 1998;5(1):1–11.
[2] Schuyler PL, Hole WT, Tuttle MS, Sherertz DD. The UMLS Metathesaurus: representing different views of biomedical concepts. Bull Med Libr Assoc 1993;81(2):217–22.
[3] McCray AT, Hole WT. The scope and structure of the first version of the UMLS Semantic Network. In: Proceedings of 14th annual SCAMC, Los Alamitos, CA, November 1990. p. 126–30.
[4] McCray AT. An upper-level ontology for the biomedical domain. Comp Func Genom 2003;4:80–4.
[5] McCray AT, Nelson SJ. The representation of meaning in the UMLS. Methods Inf Med 1995;34:193–201.
[6] Geller J, Gu H, Perl Y, Halper M. Semantic refinement and error correction in large terminological knowledge bases. Data Knowl Eng 2003;45(1):1–32.
[7] Gu H, Perl Y, Geller J, Halper M, Liu LM, Cimino JJ. Representing the UMLS as an object-oriented database: modeling issues and advantages. JAMIA 2000;7(1):66–80.
[8] Cimino JJ. Auditing the Unified Medical Language System with semantic methods. JAMIA 1998;5:41–51.
[9] Chen Y, Perl Y, Geller J, Cimino JJ. Analysis of a study of the users, uses and future agenda of the UMLS. JAMIA 2007;14(2):221–31.
[10] Johnsonbaugh R. Discrete mathematics, 6th ed., Englewood Cliffs (NJ): Pearson Prentice-Hall, 2005.
[11] Gu H, Perl Y, Elhanan G, Min H, Zhang L, Peng Y. Auditing concept categorizations in the UMLS. Artif Intell Med 2004;31(1):29–44.
[12] Reeve LH, Han H, Brooks AD. Biomedical text summarisation using concept chains. Int J Data Min Bioinform 2007;1(4):389–407.
[13] Vizenor L, Bodenreider O, Peters L, McCray AT. Enhancing biomedical ontologies through alignment of semantic relationships: exploratory approaches. In: Proceedings of 2006 AMIA annual symposium. p. 840–48, 2006.
[14] Delbecque T, Jacquemart P, Zweigenbaum P. Indexing UMLS semantic types for medical question-answering. Stud Health Technol Inform 2005;116:805–10.
[15] Sadeghi S, Barzi A, Smith JW. Ontology driven construction of a knowledgebase for Bayesian decision models based on UMLS. Stud Health Technol Inform 2005;116:223–8.
[16] Burgun A, Bodenreider O. Methods for exploring the semantics of the relationships between co-occurring UMLS concepts. In: Proceedings of 2001 Medinfo. p. 171–5, 2001.
[17] Fan J, Friedman C. Semantic classification of biomedical concepts using distributional similarity. JAMIA 2005;14(4):467–77.

[18] Fan JW, Xu H, Friedman C. Using distributional analysis to semantically classify UMLS concepts. In: Proceedings of 2007 Medinfo. p. 519–23, 2007.
[19] Cimino JJ. Battling Scylla and Charybdis: the search for redundancy and ambiguity in the 2001 UMLS metathesaurus. In: Overhage JM, editor. Proceedings of 2001 AMIA annual symposium, 2001. p. 120–4.
[20] Bodenreider O. Strength in numbers: exploring redundancy in hierarchical relations across biomedical terminologies. In: Proceedings of 2003 AMIA annual symposium, 2003. p. 101–5.
[21] Bodenreider O. Circular hierarchical relationships in the UMLS: etiology, diagnosis, treatment, complications and prevention. In: Proceedings of AMIA symposium, 2001. p. 57–61.
[22] Mougin F, Bodenreider O. Approaches to eliminating cycles in the UMLS metathesaurus: naive vs. formal. In: Proceedings of 2005 AMIA annual symposium, 2005. p. 550–4.
[23] Peng Y, Halper M, Perl Y, Geller J. Auditing the UMLS for redundant classifications. In: Proceedings of 2002 AMIA annual symposium, San Antonio, TX, November 2002. p. 612–6.
[24] Hole WT, Srinivasan S. Discovering missed synonymy in a large concept-oriented metathesaurus. In: Overhage JM, editor. Proceedings of 2000 AMIA annual symposium, Los Angeles, CA, November 2000. p. 354–8.
[25] Bodenreider O. An object-oriented model for representing semantic locality in the UMLS. In: Proceedings of Medinfo 2001, London, UK, September 2001. p. 161–5.
[26] Schulze-Kremer S, Smith B, Kumar A. Revising the UMLS Semantic Network. In: Proceedings of Medinfo 2004, San Francisco, CA, September 2004. p. 1700.
[27] Min H, Perl Y, Chen Y, Halper M, Geller J, Wang Y. Auditing as part of the terminology design life cycle. JAMIA 2006;13(6):676–90.
[28] Ceusters W, Smith B, Kumar A, Dhaen C. Mistakes in medical ontologies: where do they come from and how can they be detected? In: Pisanelli DM, editor. Ontologies in medicine: proceedings of workshop on medical ontologies, Rome, October 2003. p. 145–64.
[29] Ceusters W, Smith B, Kumar A, Dhaen C. Ontology-based error detection in SNOMED-CT. In: Fieschi M, Coiera E, Li Y-C, editors. Proceedings of Medinfo 2004, San Francisco, CA, September 2004. p. 482–6.
[30] Ceusters W, Smith B, Goldberg L. A terminological and ontological analysis of the NCI thesaurus. Methods Inf Med 2005;44:498–507.
[31] Ceusters W, Spackman KA, Smith B. Would SNOMED-CT benefit from realism-based ontology evolution? In: Teich JM, Suermondt J, Hripcsak G, editors. Proceedings of 2007 AMIA annual symposium, Chicago, IL, November 2007. p. 105–9.
[32] Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption in DL-based terminologies: a case study in SNOMED CT. In: Hahn U, Schulz S, Cornet R, editors. Proceedings of first international workshop on formal biomedical knowledge representation (KR-MED 2004), Whistler, Canada, 2004. p. 12–20.
[33] Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF, editors. The description logic handbook: theory, implementation, and applications. Cambridge (MA): Cambridge University Press; 2003.
[34] Schlobach S, Huang Z, Cornet R, Van Harmelen F. Debugging incoherent terminologies. J Autom Reasoning 2007;39:317–49.
[35] Cornet R, Abu-Hanna A. Auditing description-logic-based medical terminological systems by detecting equivalent concept definitions. Int J Med Inform 2008;77(5):336–45.
[36] Cornet R, Abu-Hanna A. Description-logic based methods for auditing frame-based medical terminology systems. Artif Intell Med 2005;34(3):201–17.
[37] De Keizer NF, Abu-Hanna A, Cornet R, Zwersloot-Schonk JH, Stoutenbeek CP. Analysis and design of an ontology for intensive care diagnoses. Methods Inf Med 1999;38(2):102–12.
[38] Kohler J, Munn K, Rüegg A, Skusa A, Smith B. Quality control for terms and definitions in ontologies and taxonomies. BMC Bioinformatics 2006;7:212.
[39] Kumar A, Smith B. The Unified Medical Language System and the Gene Ontology: some critical reflections. In: Günter A, Kruse R, Neumann B, editors. KI 2003, advances in artificial intelligence. Lecture notes in artificial intelligence 2821, Springer, 2003. p. 135–48.
[40] Smith B, Williams J, Schulze-Kremer S. The ontology of the Gene Ontology. In: Musen MA, editor. Proceedings of 2003 AMIA annual symposium, Washington, DC, November 2003. p. 609–13.
[41] Smith B, Köhler J, Kumar A. On the application of formal principles to life science data: a case study in the Gene Ontology. In: Proceedings of DILS 2004 (data integration in the life sciences). Lecture notes in bioinformatics 2994, Springer, 2004. p. 79–94.
[42] Kumar A, Smith B, Borgelt C. Dependence relationships between Gene Ontology terms based on TIGR Gene Product Annotations. In: Proceedings of third international workshop on computational terminology, 2004. p. 31–8.
[43] Simon J, Dos Santos M, Fielding J, Smith B. Formal ontology for natural language processing and the integration of biomedical databases. Int J Med Inform 2006;75(3):224–31.
[44] US Department of Health and Human Services, National Institutes of Health, National Library of Medicine. Unified Medical Language System (UMLS). Available from: www.nlm.nih.gov/research/umls, 2008.
[45] Cimino JJ, Min H, Perl Y. Consistency across the hierarchies of the UMLS Semantic Network and Metathesaurus. JBI 2003;36(6):450–61.
[46] The Future of the UMLS Semantic Network Workshop, April 2005. Available from: http://mor.nlm.nih.gov/snw/.
[47] Gu H, Perl Y, Geller J, Halper M, Singh M. A methodology for partitioning a vocabulary hierarchy into trees. Artif Intell Med 1999;17:77–98.

[48] McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. In: Proceedings of Medinfo 2001, London, UK, September 2001. p. 171–5.
[49] Bodenreider O, McCray AT. Exploring semantic groups through visual approaches. JBI 2003;36(6):414–32.
[50] Chen Z, Perl Y, Halper M, Geller J, Gu H. Partitioning the UMLS Semantic Network. IEEE Trans Inf Technol Biomed 2002;6(2):102–8.
[51] Perl Y, Chen Z, Halper M, Geller J, Zhang L, Peng Y. The cohesive metaschema: a higher-level abstraction of the UMLS Semantic Network. JBI 2003;35(3):194–212.
[52] Zhang L, Perl Y, Halper M, Geller J, Hripcsak G. A lexical metaschema for the UMLS Semantic Network. Artif Intell Med 2005;33:41–59.
[53] Chen Y, Perl Y, Geller J, Hripcsak G, Zhang L. Comparing and consolidating two heuristic metaschemas. JBI 2008;42(2):293–317.
