KD4v: Comprehensible Knowledge Discovery System for Missense Variants
Tien-Dao Luu, Vincent Walter, Hoan Nguyen and Olivier Poch
Institute of Genetics and Molecular and Cellular Biology (IGBMC), Illkirch, France
http://decrypthon.igbmc.fr/kd4v
Introduction
A major challenge in the post-genomic era is a better understanding of how human genetic alterations involved in disease affect the gene products.
The KD4v server characterizes and predicts the phenotypic effect (deleterious or neutral) of missense variants.
Each variant is described by 16 predicates annotated from the MSV3d database, covering conservation, physico-chemical, functional and 3D structural properties.
The server provides a set of rules learned by Inductive Logic Programming (ILP).
These rules are interpretable by non-expert humans and are used to accurately predict the deleterious/neutral status of an unknown mutation.
Method & Implementation
Generating knowledge: applying learning methods
Comprehensive results: Inductive Logic Programming (ILP)
ILP: machine learning + logic programming
Schema: positive examples + negative examples + background knowledge => hypothesis
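As a rough illustration of this schema, the sketch below counts how many positive and negative examples a candidate hypothesis covers, given background knowledge. The variant identifiers, the annotation values and the candidate rule are hypothetical toy data, not taken from KD4v or MSV3d, and this is not how Aleph searches for rules.

```python
# Toy illustration of the ILP schema:
# positive examples + negative examples + background knowledge => hypothesis.
# All identifiers and annotation values below are hypothetical.

# Background knowledge: annotations known for each variant.
background = {
    "v1": {"stability": "decrease", "gain_contact": 2},
    "v2": {"stability": "decrease", "gain_contact": 1},
    "v3": {"stability": "no_change", "gain_contact": 0},
    "v4": {"stability": "decrease", "gain_contact": 0},
}

positives = ["v1", "v2"]  # variants labelled deleterious
negatives = ["v3", "v4"]  # variants labelled neutral


def candidate(facts):
    """Candidate hypothesis: stability decreases and at least one contact is gained."""
    return facts["stability"] == "decrease" and facts["gain_contact"] >= 1


def coverage(hypothesis, examples):
    """Count the examples whose background facts satisfy the hypothesis."""
    return sum(1 for e in examples if hypothesis(background[e]))


pos_covered = coverage(candidate, positives)
neg_covered = coverage(candidate, negatives)
print("positives covered:", pos_covered, "negatives covered:", neg_covered)
```

A hypothesis is useful when it covers many positive examples and few negative ones; an ILP system explores many such candidates rather than the single one hard-coded here.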
[Figure: KD4v workflow. Labels: annotation service; missense variant; features (physicochemical, conservation, localisation, accessibility, stability, contacts); training with selection of structural mutations; Aleph/Prolog; selected rules; prediction service (Web, SOAP API); new mutation; human-interpretable rules (if … then …) for biologists.]
Some induced rules obtained by ILP, as Prolog code:
deleterious(A) :-
    conservation_class(A, sub_family_conservation),
    secondary_struc(A, no_helix_no_sheet),
    gain_contact(A, B), B >= 1,
    stability(A, decrease).
The ILP rules are transformed into English sentences, so a prediction reports the status (neutral or deleterious) together with its decision rules.
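The translation of a rule into English can be sketched with one sentence template per literal. The tuple encoding of literals, the template table and the function name below are assumptions made for illustration; the poster does not describe how KD4v performs this step.

```python
# Sketch: turn the body literals of an induced rule into English sentences.
# The encoding of literals and the templates are hypothetical.

TEMPLATES = {
    ("conservation_class", "sub_family_conservation"):
        "The mutated residue belongs to the subfamily conservation class.",
    ("secondary_struc", "no_helix_no_sheet"):
        "The residue is found in neither an alpha-helix nor a beta-sheet.",
    ("stability", "decrease"):
        "The stability of the protein after point mutation is decreased.",
}

COMPARATORS = {">=": "larger than or equal to", "=<": "smaller than or equal to"}


def literal_to_english(literal):
    """Translate (predicate, value) or ("gain_contact", operator, threshold)."""
    if literal[0] == "gain_contact":
        _, op, threshold = literal
        return ("The number of contacts gained after point mutation is "
                f"{COMPARATORS[op]} {threshold}.")
    return TEMPLATES[literal]


# Body of the example rule deleterious(A), one tuple per literal.
rule_body = [
    ("conservation_class", "sub_family_conservation"),
    ("secondary_struc", "no_helix_no_sheet"),
    ("gain_contact", ">=", 1),
    ("stability", "decrease"),
]

sentences = [literal_to_english(lit) for lit in rule_body]
print("A mutation is deleterious if:")
for sentence in sentences:
    print(" -", sentence)
```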
Dataset (UniProt/PolyPhen-2): 8000 variants with 3D structure, cross-validation

           SIFT    PP2     KD4v
TP         398     576     487
FP         38      111     94
FN         260     77      171
TN         260     184     204
POS        658     658     658
NEG        298     298     298
Pre        0.91    0.84    0.84
Recall     0.60    0.88    0.74
Acc        0.69    0.80    0.72
F-m (1)    0.73    0.86    0.79
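The Pre, Recall, Acc and F-m (1) values follow from the TP, FP, FN and TN counts. The poster does not state the formulas, so the sketch below assumes the standard definitions of precision, recall, accuracy and the F1 measure, and applies them to the KD4v cross-validation counts.

```python
# Sketch: derive the summary rows from the confusion-matrix counts,
# assuming the standard definitions (not stated on the poster).

def metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy, f1) for one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1


# KD4v counts from the cross-validation results.
kd4v = metrics(tp=487, fp=94, fn=171, tn=204)
kd4v_rounded = [round(value, 2) for value in kd4v]
print(kd4v_rounded)
```

Rounded to two decimals, these values match the KD4v figures reported for Pre, Recall, Acc and F-m (1).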
The example rule deleterious(A) states that a mutation A is deleterious if:
• The mutated residue belongs to the “subfamily conservation class”.
• The residue is found in neither an α-helix nor a β-sheet.
• The number of contacts gained after point mutation is larger than or equal to 1.
• The stability of the protein after point mutation is decreased.
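The four conditions can be checked directly against the annotations of a single variant. The following is a minimal sketch that assumes annotations are available as a dictionary keyed by predicate name; the dictionary layout and the two example variants are hypothetical.

```python
# Sketch: apply the example rule to one annotated variant.
# The dictionary layout and the example values are hypothetical.

def is_deleterious(variant):
    """Return True when all four conditions of the example rule hold."""
    return (
        variant["conservation_class"] == "sub_family_conservation"
        and variant["secondary_struc"] == "no_helix_no_sheet"
        and variant["gain_contact"] >= 1
        and variant["stability"] == "decrease"
    )


matching = {
    "conservation_class": "sub_family_conservation",
    "secondary_struc": "no_helix_no_sheet",
    "gain_contact": 2,
    "stability": "decrease",
}
# Same variant, but with no contact gained after the mutation.
non_matching = dict(matching, gain_contact=0)

print(is_deleterious(matching), is_deleterious(non_matching))
```

A variant that fails this rule is not necessarily neutral: it is simply not covered by this particular rule, and other induced rules may still apply.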
Cancer-associated gene: MSH2 variants with 3D structure

           SIFT    PP2     KD4v
TP         33      47      46
FP         2       4       3
FN         39      25      26
TN         10      8       9
POS        72      72      72
NEG        12      12      12
Pre        0.94    0.92    0.93
Recall     0.45    0.65    0.64
Acc        0.51    0.65    0.65
F-m (1)    0.62    0.76    0.76
The prediction performance of KD4v is comparable with that of the other methods (SIFT and PolyPhen-2).
These ILP rules can be used, for example, to uncover the relationships between the deleterious effect of a mutation and the multi-class conservation pattern or the type of physico-chemical alteration (e.g., size, charge and hydrophobicity) introduced by the substitution.
Future work:
• Prediction with 3D structure: adding structural surface topology descriptions of the proteins.
• Prediction without 3D structure.
• SVM + ILP.
References
[1] Luu, T.-D., Rusu, A.-M., Walter, V., et al. (2012b). MSV3d: database of human MisSense Variants mapped to 3D protein structure. Database (Oxford) 2012, bas018.
[2] Luu, T.-D., Rusu, A.-M., Walter, V., et al. (2012a). KD4v: Comprehensible Knowledge Discovery System for Missense Variant. Nucleic Acids Res. 40, W71–75.
[3] Friedrich, A., et al. (2010). SM2PH-db: an interactive system for the integrated analysis of phenotypic consequences, Human Mutation.
Acknowledgements: This work was funded by the Association Française contre les Myopathies (AFM), the Vietnam Ministry of Education and Training, the Institut National de la Santé et de la Recherche Médicale (INSERM), the Centre National de la Recherche Scientifique (CNRS), and the Université de Strasbourg.