NCBI: db=pubmed; Term=("journal of biomedical informatics"[Jour])



 



Predicting Biomedical Metadata in CEDAR: a Study of Gene Expression Omnibus (GEO).

J Biomed Inform. 2017 Jun 15;:

Authors: Panahiazar M, Dumontier M, Gevaert O

Abstract
A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the number of unique values of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
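The idea of predicting one metadata element from another and comparing against a majority vote baseline can be illustrated with a minimal sketch. It is not PART, Apriori, Predictive Apriori, or Decision Table: it is a simple single-antecedent rule lookup, the records are a hypothetical toy sample, the function names are invented for illustration, and accuracy is measured on the same records the rules were mined from.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; real GEO metadata has far more fields and values.
records = [
    {"platform": "GPL570", "organism": "Homo sapiens"},
    {"platform": "GPL570", "organism": "Homo sapiens"},
    {"platform": "GPL570", "organism": "Homo sapiens"},
    {"platform": "GPL1261", "organism": "Mus musculus"},
    {"platform": "GPL1261", "organism": "Mus musculus"},
    {"platform": "GPL96", "organism": "Homo sapiens"},
]


def mine_rules(rows, antecedent, consequent):
    """Return {antecedent value: (predicted consequent, support, confidence)}."""
    pair_counts = defaultdict(Counter)
    for row in rows:
        pair_counts[row[antecedent]][row[consequent]] += 1
    rules = {}
    for value, counts in pair_counts.items():
        best, hits = counts.most_common(1)[0]
        # support: share of all rows matching the rule;
        # confidence: share of rows with this antecedent that match it.
        rules[value] = (best, hits / len(rows), hits / sum(counts.values()))
    return rules


def majority_value(rows, field):
    """Most frequent value of a field (the majority vote baseline)."""
    return Counter(row[field] for row in rows).most_common(1)[0][0]


def accuracy(predictions, rows, field):
    correct = sum(p == row[field] for p, row in zip(predictions, rows))
    return correct / len(rows)


rules = mine_rules(records, "platform", "organism")
fallback = majority_value(records, "organism")
# Fall back to the majority value when no rule covers the platform.
rule_preds = [rules.get(r["platform"], (fallback,))[0] for r in records]
baseline_preds = [fallback] * len(records)

rule_accuracy = accuracy(rule_preds, records, "organism")
baseline_accuracy = accuracy(baseline_preds, records, "organism")
print(rule_accuracy, baseline_accuracy)
```

A fair evaluation would hold out records for testing; the sketch only shows why rules conditioned on another element can beat the majority vote.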

PMID: 28625880 [PubMed - as supplied by publisher]




Psychiatric symptom recognition without labeled data using distributional representations of phrases and on-line knowledge.

J Biomed Inform. 2017 Jun 14;:

Authors: Zhang Y, Zhang O, Wu Y, Lee HJ, Xu J, Xu H, Roberts K

Abstract
OBJECTIVE: Mental health is becoming an increasingly important topic in healthcare. Psychiatric symptoms, which consist of subjective descriptions of the patient's experience, as well as the nature and severity of mental disorders, are critical to support the phenotypic classification for personalized prevention, diagnosis, and intervention of mental disorders. However, few automated approaches have been proposed to extract psychiatric symptoms from clinical text, mainly due to (a) the lack of annotated corpora, which are time-consuming and costly to build, and (b) the inherent linguistic difficulties that symptoms present as they are not well-defined clinical concepts like diseases. The goal of this study is to investigate techniques for recognizing psychiatric symptoms in clinical text without labeled data. Instead, external knowledge in the form of publicly available "seed" lists of symptoms is leveraged using unsupervised distributional representations.
MATERIALS AND METHODS: First, psychiatric symptoms are collected from three online repositories of healthcare knowledge for consumers (MedlinePlus, Mayo Clinic, and the American Psychiatric Association) for use as seed terms. Candidate symptoms in psychiatric notes are automatically extracted using phrasal syntax patterns. In particular, the 2016 CEGS N-GRID challenge data serves as the psychiatric note corpus. Second, three corpora (psychiatric notes, psychiatric forum data, and MIMIC II) are adopted to generate distributional representations with paragraph2vec. Finally, semantic similarity between the distributional representations of the seed symptoms and candidate symptoms is calculated to assess the relevance of a phrase. Experiments were performed on a set of psychiatric notes from the CEGS N-GRID 2016 Challenge.
RESULTS & CONCLUSION: Our method demonstrates good performance at extracting symptoms from an unseen corpus, including symptoms with no word overlap with the provided seed terms. Semantic similarity based on the distributional representation outperformed baseline methods. Our experiment yielded two interesting results. First, distributional representations built from social media data outperformed those built from clinical data. Second, the distributional representation model built from sentences resulted in better representations of phrases than the model built from phrases alone.
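The similarity step described in the methods can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the vectors are made-up three-dimensional stand-ins for paragraph2vec output, the phrases are invented, and scoring a candidate by its maximum cosine similarity to any seed is an assumed aggregation that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_candidates(seed_vectors, candidate_vectors):
    """Rank candidates by their best cosine similarity to any seed (assumed rule)."""
    scores = {
        phrase: max(cosine(vec, seed) for seed in seed_vectors.values())
        for phrase, vec in candidate_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical vectors; real ones would come from a trained embedding model.
seeds = {
    "depressed mood": [0.9, 0.1, 0.0],
    "trouble sleeping": [0.1, 0.9, 0.0],
}
candidates = {
    "feeling hopeless": [0.8, 0.2, 0.1],
    "cannot fall asleep": [0.2, 0.8, 0.1],
    "blood pressure cuff": [0.0, 0.1, 0.9],
}

ranked = score_candidates(seeds, candidates)
for phrase, score in ranked:
    print(f"{phrase}: {score:.3f}")
```

A threshold on the score would then separate likely symptoms from other phrases; the threshold value would have to be tuned on real data.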

PMID: 28624644 [PubMed - as supplied by publisher]




Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature.

J Biomed Inform. 2017 Jun 14;:

Authors: Bouadjenek MR, Verspoor K, Zobel J

Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with the literature, they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
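To put the reported figures in context, the arithmetic below compares a 10% precision with a 0.25% base rate of known faulty records. It is a rough sketch: the abstract gives these values as upper bounds ("fewer than", "up to") that may not hold at the same operating point, and the total of 100,000 records is a hypothetical number chosen only to make the counts concrete.

```python
def precision_lift(precision, base_rate):
    """How many times better than random flagging a given precision is."""
    return precision / base_rate


def flagged_breakdown(total_records, base_rate, precision, recall):
    """Expected counts when a detector reaches the given precision and recall."""
    faulty = total_records * base_rate
    true_positives = faulty * recall
    flagged = true_positives / precision
    return {
        "faulty": faulty,
        "true_positives": true_positives,
        "flagged": flagged,
    }


# Figures from the abstract, treated as simultaneous for illustration only.
lift = precision_lift(precision=0.10, base_rate=0.0025)

# Hypothetical collection size, not taken from the source.
counts = flagged_breakdown(100_000, base_rate=0.0025, precision=0.10, recall=0.30)
print(round(lift), {k: round(v) for k, v in counts.items()})
```

Under these assumptions a curator reviewing flagged records would find a faulty one about 40 times more often than by random sampling, even though nine out of ten flags are false alarms.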

PMID: 28624643 [PubMed - as supplied by publisher]




DrugSemantics: a corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics.

J Biomed Inform. 2017 Jun 14;:

Authors: Moreno I, Boldrini E, Moreda P, Teresa Romá-Ferri M

Abstract
For the healthcare sector, it is critical to exploit the vast amount of textual health-related information. Nevertheless, healthcare providers have difficulty benefiting from such a quantity of data during pharmacotherapeutic care. The problem is that such information is stored in different sources and the time available to consult them is limited. In this context, Natural Language Processing techniques can be applied to efficiently transform textual data into structured information so that it can be used in critical healthcare applications that help physicians with their daily workload, such as decision support systems, cohort identification, and patient management. Any development of these techniques requires annotated corpora. However, there is a lack of such resources in this domain and, in most cases, the few available are in English. This paper presents the definition and creation of the DrugSemantics corpus, a collection of Summaries of Product Characteristics in Spanish. It was manually annotated with pharmacotherapeutic named entities, detailed in the DrugSemantics annotation scheme. Annotators were a Registered Nurse (RN) and two students from the Degree in Nursing. The quality of the DrugSemantics corpus has been assessed by measuring its annotation reliability (overall F=79.33% [95%CI: 78.35-80.31]), as well as its annotation precision (overall P=94.65% [95%CI: 94.11-95.19]). In addition, the gold-standard construction process is described in detail. In total, our corpus contains more than 2,000 named entities, 780 sentences and 226,729 tokens. Finally, a Named Entity Classification module trained on DrugSemantics is presented to demonstrate the quality of our corpus and to serve as an example of how to use it.
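Annotation reliability expressed as an F-measure is commonly computed by treating one annotator as the reference and the other as the system, then matching entity spans exactly. The sketch below follows that common convention; whether the DrugSemantics authors used exact matching is an assumption, and the spans and labels ("Drug", "Unit", "Disease", "Route") are hypothetical and not taken from the DrugSemantics annotation scheme.

```python
def pairwise_f_measure(reference, other):
    """F-measure between two annotators using exact (start, end, label) matches."""
    reference, other = set(reference), set(other)
    matched = len(reference & other)
    precision = matched / len(other) if other else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: (start offset, end offset, entity label).
annotator_a = {(0, 11, "Drug"), (20, 26, "Unit"), (40, 52, "Disease")}
annotator_b = {
    (0, 11, "Drug"),
    (20, 26, "Unit"),
    (40, 50, "Disease"),  # boundary disagreement counts as a miss
    (60, 66, "Route"),
}

agreement = pairwise_f_measure(annotator_a, annotator_b)
print(f"{agreement:.4f}")
```

Because the F-measure is symmetric in precision and recall, swapping the two annotators gives the same value.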

PMID: 28624642 [PubMed - as supplied by publisher]




Text Mining Applied to Electronic Cardiovascular Procedure Reports to Identify Patients with Trileaflet Aortic Stenosis and Coronary Artery Disease.

J Biomed Inform. 2017 Jun 14;:

Authors: Small AM, Kiss DH, Zlatsin Y, Birtwell DL, Williams H, Guerraty MA, Han Y, Anwaruddin S, Holmes JH, Chirinos JA, Wilensky RL, Giri J, Rader DJ

Abstract
BACKGROUND: Interrogation of the electronic health record (EHR) using billing codes as a surrogate for diagnoses of interest has been widely used for clinical research. However, the accuracy of this methodology is variable, as it reflects billing codes rather than severity of disease, and depends on the disease and the accuracy of the coding practitioner. Systematic application of text mining to the EHR has had variable success for the detection of cardiovascular phenotypes. We hypothesize that the application of text mining algorithms to cardiovascular procedure reports may be a superior method to identify patients with cardiovascular conditions of interest.
METHODS: We adapted the Oracle product Endeca, which utilizes text mining to identify terms of interest from a NoSQL-like database, for purposes of searching cardiovascular procedure reports and termed the tool "PennSeek". We imported 282,569 echocardiography reports representing 81,164 individuals and 27,205 cardiac catheterization reports representing 14,567 individuals from non-searchable databases into PennSeek. We then applied clinical criteria to these reports in PennSeek to identify patients with trileaflet aortic stenosis (TAS) and coronary artery disease (CAD). Accuracy of patient identification by text mining through PennSeek was compared with ICD-9 billing codes.
RESULTS: Text mining identified 7,115 patients with TAS and 9,247 patients with CAD. ICD-9 codes identified 8,272 patients with TAS and 6,913 patients with CAD. 4,346 patients with TAS and 6,024 patients with CAD were identified by both approaches. A randomly selected sample of 200-250 patients uniquely identified by text mining was compared with 200-250 patients uniquely identified by billing codes for both diseases. We demonstrate that text mining was superior, with a positive predictive value (PPV) of 0.95 compared to 0.53 by ICD-9 for TAS, and a PPV of 0.97 compared to 0.86 for CAD.
CONCLUSION: These results highlight the superiority of text mining algorithms applied to electronic cardiovascular procedure reports in the identification of phenotypes of interest for cardiovascular research.
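The patient counts in the results imply how many patients each approach identified uniquely, which is the population the validation samples were drawn from. The sketch below derives those numbers, assuming the 4,346 overlap figure refers to TAS. The PPV helper is shown with a hypothetical chart review (190 confirmed out of 200), since the abstract reports PPVs but not the underlying counts.

```python
def overlap_summary(text_mining_total, icd9_total, both):
    """Split two overlapping cohorts into unique and combined counts."""
    return {
        "text_mining_only": text_mining_total - both,
        "icd9_only": icd9_total - both,
        "union": text_mining_total + icd9_total - both,
    }


def ppv(true_positives, false_positives):
    """Positive predictive value: confirmed cases over all flagged cases."""
    return true_positives / (true_positives + false_positives)


# Counts taken from the abstract.
tas = overlap_summary(text_mining_total=7115, icd9_total=8272, both=4346)
cad = overlap_summary(text_mining_total=9247, icd9_total=6913, both=6024)
print("TAS:", tas)
print("CAD:", cad)

# Hypothetical review outcome, chosen only to illustrate the calculation.
example_ppv = ppv(true_positives=190, false_positives=10)
print(f"example PPV: {example_ppv:.2f}")
```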

PMID: 28624641 [PubMed - as supplied by publisher]