Clinical code set engineering for reusing EHR data for research: A review.
J Biomed Inform. 2017 Apr 22.
Authors: Williams R, Kontopantelis E, Buchan I, Peek N
INTRODUCTION: The construction of reliable, reusable clinical code sets is essential when reusing Electronic Health Record (EHR) data for research. Yet code set definitions are rarely transparent and their sharing is almost non-existent. There is a lack of methodological standards for the management (construction, sharing, revision and reuse) of clinical code sets, a gap that needs to be addressed to ensure the reliability and credibility of studies that use code sets.
OBJECTIVE: To review methodological literature on the management of sets of clinical codes used in research on clinical databases and to provide a list of best practice recommendations for future studies and software tools.
METHODS: We performed an exhaustive search for methodological papers about clinical code set engineering for reusing EHR data in research. This was supplemented with papers identified by snowball sampling. In addition, a list of e-phenotyping systems was constructed by merging references from several systematic reviews on this topic, and the processes adopted by those systems for code set management were reviewed.
RESULTS: Thirty methodological papers were reviewed. Common approaches included: creating an initial list of synonyms for the condition of interest (n=20); making use of the hierarchical nature of coding terminologies during searching (n=23); reviewing sets with clinician input (n=20); and reusing and updating an existing code set (n=20). Three open source software tools were discovered.
DISCUSSION: There is a need for software tools that enable users to easily and quickly create, revise, extend, review and share code sets and we provide a list of recommendations for their design and implementation.
CONCLUSION: Research re-using EHR data could be improved through the further development, more widespread use and routine reporting of the methods by which clinical codes were selected.
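The hierarchy-based searching that the reviewed papers commonly use can be sketched as follows. This is an illustrative toy, not one of the reviewed tools; the prefix-structured terminology entries below are invented, Read-code-style examples in which child codes share the parent's leading characters.

```python
# Invented Read-code-style terminology: trailing dots pad shorter codes,
# and descendants share the parent's leading characters.
terminology = {
    "C10..": "Diabetes mellitus",
    "C108.": "Insulin dependent diabetes mellitus",
    "C1081": "Insulin dependent diabetes mellitus with renal complications",
    "H33..": "Asthma",
}

def expand_by_hierarchy(seed_codes, terminology):
    """Return every terminology code that is a hierarchical descendant
    (prefix match after stripping the dot padding) of a seed code."""
    expanded = set()
    for seed in seed_codes:
        stem = seed.rstrip(".")
        for code in terminology:
            if code.rstrip(".").startswith(stem):
                expanded.add(code)
    return expanded

# Starting from a synonym-matched seed code, pull in its descendants.
codes = expand_by_hierarchy({"C10.."}, terminology)
```

In practice a tool would also support clinician review and versioned sharing of the resulting set, per the recommendations above.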
PMID: 28442434 [PubMed - as supplied by publisher]
Electronic Health Record as a Research Tool: Frequency of Exposure to Targeted Clinical Problems and Health Care Providers' Clinical Proficiency.
J Biomed Inform. 2017 Apr 22.
Authors: Wysocki T, Diaz MCG, Crutchfield JH, Franciosi JP, Werk LN
OBJECTIVES: The Electronic Health Record (EHR) could provide insight into possible decay in health care providers' (HCP) clinical knowledge and cognitive performance. Analyses of the contributions of variables such as frequency of exposure to targeted clinical problems could inform the development and testing of appropriate individualized interventions to mitigate these threats to quality and safety of care.
MATERIALS/METHODS: Nine targeted clinical problems (TCP) were selected for further study, and de-identified, aggregated study data were obtained for one calendar year. Task analysis interviews of subspecialty physicians defined optimal management of each TCP and guided the specification of quality of care metrics that could be extracted from the EHR. The Δ-t statistic, defined as the number of days since the provider's prior encounter with a given TCP, quantified frequency of exposure.
RESULTS: Frequency of patient encounters ranged from 1,566 to 220,774 visits across conditions. Mean Δ-t ranged from 1.72 to 30.79 days, and maximum Δ-t from 285 to 497 days. The distribution of Δ-t for the TCPs generally fit a Gamma distribution (P < 0.001), indicating that Δ-t conforms to a Poisson process. A quality of care metric derived for each TCP declined progressively with increasing Δ-t for 8 of the 9 TCPs, affirming that knowledge decay was detectable from EHR data.
DISCUSSION/CONCLUSIONS: This project demonstrates the utility of the EHR as a research tool in studies of health care delivery in association with frequency of exposure of HCPs to TCPs. Subsequent steps in our research include multivariate modeling of clinical knowledge decay and randomized trials of pertinent preventive interventions.
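The Δ-t statistic described above can be computed from nothing more than a per-provider, per-TCP list of encounter dates. A minimal sketch, with invented dates rather than study data:

```python
from datetime import date

# Invented example data: encounter dates for one provider and one TCP.
encounters = {
    ("provider_a", "croup"): [date(2016, 1, 3), date(2016, 1, 5), date(2016, 2, 4)],
}

def delta_t(visit_dates):
    """Δ-t: days elapsed between consecutive encounters with the same
    TCP, one value per repeat encounter."""
    ordered = sorted(visit_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

gaps = delta_t(encounters[("provider_a", "croup")])  # [2, 30]
```

The study's Gamma-distribution fit would then be estimated over the pooled collection of such gaps for each TCP.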
PMID: 28442433
Towards Generalizable Entity-Centric Clinical Coreference Resolution.
J Biomed Inform. 2017 Apr 21.
Authors: Miller T, Dligach D, Bethard S, Lin C, Savova G
OBJECTIVE: This work investigates the problem of clinical coreference resolution in a model that explicitly tracks entities, and aims to measure the performance of that model in both traditional in-domain train/test splits and cross-domain experiments that measure the generalizability of learned models.
METHODS: The two methods we compare are a baseline mention-pair coreference system that operates over pairs of mentions with best-first conflict resolution and a mention-synchronous system that incrementally builds coreference chains. We develop new features that incorporate distributional semantics, discourse features, and entity attributes. We use two new coreference datasets with similar annotation guidelines - the THYME colon cancer dataset and the DeepPhe breast cancer dataset.
RESULTS: The mention-synchronous system performs similarly to the mention-pair baseline on in-domain data but much better on new data. Part-of-speech tag features prove superior to other word representations in the feature generalizability experiments. Our methods improve generalization, but a performance gap remains when testing in new domains.
DISCUSSION: Generalizability of clinical NLP systems is important and under-studied, so future work should attempt to perform cross-domain and cross-institution evaluations and explicitly develop features and training regimens that favor generalizability. A performance-optimized version of the mention-synchronous system will be included in the open source Apache cTAKES software.
PMID: 28438706
Using classification models for the generation of disease-specific medications from biomedical literature and clinical data repository.
J Biomed Inform. 2017 Apr 20.
Authors: Wang L, Haug PJ, Del Fiol G
OBJECTIVE: Mining disease-specific associations from existing knowledge resources can be useful for building disease-specific ontologies and supporting knowledge-based applications. Many association mining techniques have been exploited; however, the extracted associations often contain considerable noise. Determining the relevance of an association by setting arbitrary cut-off points on multiple relevance scores is unreliable, and asking human experts to manually review a large number of associations would be expensive. We propose that machine-learning-based classification can separate the signal from the noise, and can provide a feasible approach to creating and maintaining disease-specific vocabularies.
METHOD: We initially focused on disease-medication associations for simplicity. For a disease of interest, we extracted potentially treatment-related drug concepts from biomedical literature citations and from a local clinical data repository. Each concept was associated with multiple measures of relevance (i.e., features), such as frequency of occurrence. For the purpose of machine learning, we formed nine datasets covering three diseases, with each disease having two single-source datasets and one dataset combining the two. All the datasets were labeled using existing reference standards. Thereafter, we conducted two experiments: (1) to test whether adding features from the clinical data repository would improve the classification performance achieved using features from the biomedical literature alone, and (2) to determine whether classifiers trained on known medication-disease datasets would generalize to new diseases.
RESULTS: Simple logistic regression and LogitBoost were identified as the preferred classifiers for the biomedical-literature datasets and the combined datasets, respectively. Classification using the combined features provided a significant improvement over classification using biomedical-literature features alone (p-value<0.001). A classifier built from known diseases to predict associated concepts for new diseases showed no significant difference in performance from a classifier built and tested on the new disease's own dataset.
CONCLUSION: It is feasible to use classification approaches to automatically predict the relevance of a concept to a disease of interest. It is useful to combine features from disparate sources for the task of classification. Classifiers built from known diseases were generalizable to new diseases.
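The paper's exact pipeline is not reproduced here; the sketch below only illustrates the general idea of classifying candidate disease-medication associations as relevant or noise from relevance features, using a plain logistic regression trained by gradient descent. The two features (literature frequency, EHR frequency) and all values are invented for illustration.

```python
import math

# Invented toy data: each row is [literature_frequency, ehr_frequency]
# for one candidate drug concept; label 1 = relevant, 0 = noise.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.05, 0.1]]
y = [1, 1, 1, 0, 0, 0]

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted relevance
            err = p - yi                         # gradient of log-loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    """Probability that a candidate association is relevant."""
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = train_logistic(X, y)
```

A frequently-mentioned candidate such as `[0.85, 0.9]` scores high, while a rare one such as `[0.05, 0.05]` scores low, mirroring the signal-versus-noise separation the study targets.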
PMID: 28435015
A Distributed Framework for Health Information Exchange Using Smartphone Technologies.
J Biomed Inform. 2017 Apr 19.
Authors: Abdulnabi M, Al-Haiqi A, Kiah M, Zaidan AA, Zaidan BB, Hussein M
Nationwide health information exchange (NHIE) continues to be a persistent concern for government agencies, despite the many efforts towards it and the conceived benefits of sharing patient data among healthcare providers. Difficulties in ensuring global connectivity and interoperability, along with security concerns, have always hampered governments from successfully deploying NHIE. Looking at NHIE from a fresh perspective, and bearing in mind the pervasiveness and power of modern mobile platforms, this paper proposes a new approach to NHIE that builds on the notion of consumer-mediated HIE, albeit without the focus on central health record banks. With the growing acceptance of smartphones as reliable, indispensable, and highly personal devices, we suggest taking the concept of mobile personal health records (mPHRs: PHRs installed on smartphones) to the next level. We envision mPHRs that act as distributed storage units for health information, under the full control and direct possession of patients, who can have ready access to their personal data whenever needed. For the actual exchange of data, however, the health information systems managed by healthcare providers must be interoperable with patient-carried mPHRs. The computer industry long ago solved a similar problem of interoperability between peripheral devices and operating systems; we borrow from that solution the idea of providing special interfaces between mPHRs and provider systems, enabling the two entities to communicate with no change to either end. The design and operation of the proposed approach are explained, additional pointers on potential implementations are provided, and issues that pertain to any solution implementing NHIE are discussed.
PMID: 28433825
Automated annotation and classification of BI-RADS assessment from radiology reports.
J Biomed Inform. 2017 Apr 17.
Authors: Castro SM, Tseytlin E, Medvedeva O, Mitchell K, Visweswaran S, Bekhuis T, Jacobson RS
The Breast Imaging Reporting and Data System (BI-RADS) was developed to reduce variation in the descriptions of findings. Manual analysis of breast radiology report data is challenging but is necessary for clinical and healthcare quality assurance activities. The objective of this study is to develop a natural language processing (NLP) system for automated extraction of BI-RADS categories from breast radiology reports. We evaluated an existing rule-based NLP algorithm, and then we developed and evaluated our own method using a supervised machine learning approach. We divided the BI-RADS category extraction task into two specific tasks: (1) annotation of all BI-RADS category values within a report, and (2) classification of the laterality of each BI-RADS category value. We used one algorithm for task 1 and evaluated three algorithms for task 2. Across all evaluations and model training, we used a total of 2159 radiology reports from 18 hospitals, spanning 2003 to 2015. Performance with the existing rule-based algorithm was not satisfactory. Conditional random fields showed high performance on task 1, with an F1 measure of 0.95. The rules-from-partial-decision-trees (PART) algorithm showed the best performance across classes for task 2, with a weighted F1 measure of 0.91 for BI-RADS 0-6 and 0.93 for BI-RADS 3-5. Classification performance by class improved for all classes from Naïve Bayes to Support Vector Machine (SVM), and again from SVM to PART. Our system annotates and classifies all BI-RADS mentions present in a single radiology report and can serve as the foundation for future studies that leverage automated BI-RADS annotation to provide feedback to radiologists as part of a learning health system loop.
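The study's CRF and PART models are not reproduced here; the toy sketch below only illustrates the two subtasks themselves, mention annotation and laterality classification, using a regex over an invented report snippet. A rule-based approach like this is what the study found unsatisfactory, which motivated its machine learning methods.

```python
import re

# Invented report snippet for illustration only.
report = ("RIGHT BREAST: Suspicious mass. BI-RADS category 4. "
          "LEFT BREAST: No abnormality. BI-RADS category 1.")

def extract_birads(text):
    """Task 1: find each BI-RADS category mention.
    Task 2: assign laterality from the nearest preceding RIGHT/LEFT."""
    results = []
    for match in re.finditer(r"BI-RADS category (\d)", text):
        preceding = text[:match.start()]
        sides = re.findall(r"\b(RIGHT|LEFT)\b", preceding)
        side = sides[-1].lower() if sides else "unknown"
        results.append((side, match.group(1)))
    return results

mentions = extract_birads(report)
```

Real reports vary far more in phrasing than this snippet, which is precisely why sequence models such as conditional random fields outperform fixed patterns on task 1.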
PMID: 28428140
Toward better public health reporting using existing off the shelf approaches: The value of medical dictionaries in automated cancer detection using plaintext medical data.
J Biomed Inform. 2017 Apr 11.
Authors: Kasthurirathne SN, Dixon BE, Gichoya J, Xu H, Xia Y, Mamlin B, Grannis SJ
OBJECTIVES: Existing approaches to deriving decision models from plaintext clinical data frequently depend on medical dictionaries as the sources of potential features. Prior research suggests that decision models developed using non-dictionary based feature sourcing approaches and "off the shelf" tools could predict cancer with performance metrics between 80% and 90%. We sought to compare non-dictionary based models to models built using features derived from medical dictionaries.
MATERIALS AND METHODS: We evaluated the detection of cancer cases from free text pathology reports using decision models built with combinations of dictionary or non-dictionary based feature sourcing approaches, 4 feature subset sizes, and 5 classification algorithms. Each decision model was evaluated using the following performance metrics: sensitivity, specificity, accuracy, positive predictive value, and area under the receiver operating characteristics (ROC) curve.
RESULTS: Decision models parameterized using dictionary and non-dictionary feature sourcing approaches produced performance metrics between 70% and 90%. Neither the source of features nor the feature subset size had an impact on the performance of a decision model.
CONCLUSION: Our study suggests there is little value in leveraging medical dictionaries for extracting features for decision model building. Decision models built using features extracted from the plaintext reports themselves achieve comparable results to those built using medical dictionaries. Overall, this suggests that existing "off the shelf" approaches can be leveraged to perform accurate cancer detection using less complex Named Entity Recognition (NER) based feature extraction, automated feature selection and modeling approaches.
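The performance metrics reported above all derive from a confusion matrix. A minimal sketch, with invented counts purely for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, positive predictive value and accuracy
    from raw true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # detected fraction of true cases
        "specificity": tn / (tn + fp),   # rejected fraction of non-cases
        "ppv": tp / (tp + fp),           # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Invented counts for a hypothetical cancer-detection model.
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Area under the ROC curve, the study's remaining metric, additionally requires the classifier's ranked scores rather than counts alone, so it is omitted from this sketch.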
PMID: 28410983
EHR-Based Phenotyping: Bulk Learning and Evaluation.
J Biomed Inform. 2017 Apr 11.
Authors: Chiu PH, Hripcsak G
In data-driven phenotyping, a core computational task is to identify medical concepts and their variations from sources of electronic health records (EHR) to stratify phenotypic cohorts. A conventional analytic framework for phenotyping largely uses a manual knowledge engineering approach or a supervised learning approach where clinical cases are represented by variables encompassing diagnoses, medicinal treatments and laboratory tests, among others. In such a framework, the tasks of feature engineering and data annotation remain a tedious and expensive exercise, resulting in poor scalability. In addition, certain clinical conditions, such as those that are rare and acute in nature, may never accumulate sufficient data over time, which poses a challenge to establishing accurate and informative statistical models. In this paper, we use infectious diseases as the domain of study to demonstrate a hierarchical learning method based on ensemble learning that attempts to address these issues through feature abstraction. We use a sparse annotation set to train and evaluate many phenotypes at once, which we call bulk learning. In this batch-phenotyping framework, disease cohort definitions can be learned from within the abstract feature space established by using multiple diseases as a substrate and diagnostic codes as surrogates. In particular, using surrogate labels for model training makes its subsequent evaluation possible using only a sparse annotated sample. Moreover, statistical models can be trained and evaluated, using the same sparse annotation, from within the abstract feature space of low dimensionality that encapsulates the shared clinical traits of these target diseases, collectively referred to as the bulk learning set.
PMID: 28410982
Predicting healthcare trajectories from medical records: A deep learning approach.
J Biomed Inform. 2017 Apr 11.
Authors: Pham T, Tran T, Phung D, Venkatesh S
Personalized predictive medicine necessitates the modeling of patient illness and care processes, which inherently have long-term temporal dependencies. Healthcare observations, stored in electronic medical records, are episodic and irregular in time. We introduce DeepCare, an end-to-end deep dynamic neural network that reads medical records, stores previous illness history, infers current illness states and predicts future medical outcomes. At the data level, DeepCare represents care episodes as vectors and models patient health state trajectories by the memory of historical records. Built on Long Short-Term Memory (LSTM), DeepCare introduces methods to handle irregularly timed events by moderating the forgetting and consolidation of memory. DeepCare also explicitly models medical interventions that change the course of illness and shape future medical risk. Moving up to the health state level, historical and present health states are then aggregated through multiscale temporal pooling, before passing through a neural network that estimates future outcomes. We demonstrate the efficacy of DeepCare for disease progression modeling, intervention recommendation, and future risk prediction. On two important cohorts with heavy social and economic burden - diabetes and mental health - the results show improved prediction accuracy.
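DeepCare itself is not reproduced here; the toy sketch below only illustrates the single idea of time-moderated forgetting: a past episode's influence on the current health-state memory decays with the irregular gap since that episode, a crude scalar stand-in for the LSTM forget/input gating the abstract describes. The half-life and episode values are invented.

```python
import math

def update_memory(memory, episode_value, days_since_last, half_life=30.0):
    """Decay the stored state by elapsed time, then consolidate the new
    episode: distant events are forgotten more than recent ones."""
    decay = math.exp(-math.log(2) * days_since_last / half_life)
    return decay * memory + episode_value

# Three invented episodes with identical values but growing time gaps.
state = 0.0
for value, gap_days in [(1.0, 0), (1.0, 30), (1.0, 300)]:
    state = update_memory(state, value, gap_days)
```

After the 300-day gap, almost nothing of the earlier accumulated state survives, which is exactly the behavior that motivates modulating memory by inter-event time rather than treating records as evenly spaced.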
PMID: 28410981
Building a comprehensive syntactic and semantic corpus of Chinese clinical texts.
J Biomed Inform. 2017 Apr 09.
Authors: He B, Dong B, Guan Y, Yang J, Jiang Z, Yu Q, Cheng J, Qu C
OBJECTIVE: To build a comprehensive corpus covering syntactic and semantic annotations of Chinese clinical texts with corresponding annotation guidelines and methods as well as to develop tools trained on the annotated corpus, which supplies baselines for research on Chinese texts in the clinical domain.
MATERIALS AND METHODS: An iterative annotation method was proposed to train annotators and to develop annotation guidelines. Then, by using annotation quality assurance measures, a comprehensive corpus was built, containing annotations of part-of-speech (POS) tags, syntactic tags, entities, assertions, and relations. Inter-annotator agreement (IAA) was calculated to evaluate the annotation quality and a Chinese clinical text processing and information extraction system (CCTPIES) was developed based on our annotated corpus.
RESULTS: The syntactic corpus consists of 138 Chinese clinical documents with 47,426 tokens and 2612 full parsing trees, while the semantic corpus includes 992 documents in which 39,511 entities were annotated, together with their assertions and 7693 relations. IAA evaluation shows that this comprehensive corpus is of good quality and that the system modules are effective.
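The paper's IAA computation is not reproduced here; as one common agreement measure, Cohen's kappa can be sketched as below, on invented label sequences from two hypothetical annotators.

```python
def cohens_kappa(labels_a, labels_b):
    """Agreement beyond chance between two annotators' label sequences:
    (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each annotator's marginal frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Invented entity/non-entity labels for four tokens.
kappa = cohens_kappa(["ENT", "ENT", "O", "O"], ["ENT", "O", "O", "O"])
```

Raw percent agreement here is 0.75, but kappa corrects for the agreement expected by chance from each annotator's label distribution.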
DISCUSSION: The annotated corpus makes a considerable contribution to natural language processing (NLP) research into Chinese texts in the clinical domain. However, this corpus has a number of limitations. Some additional types of clinical text should be introduced to improve corpus coverage and active learning methods should be utilized to promote annotation efficiency.
CONCLUSIONS: In this study, several annotation guidelines and an annotation method for Chinese clinical texts were proposed, and a comprehensive corpus and its NLP modules were constructed, providing a foundation for further study of applying NLP techniques to Chinese texts in the clinical domain.
PMID: 28404537