Subscribe: IEEE/ACM Transactions on Computational Biology and Bioinformatics
http://csdl.computer.org/rss/tcbb.xml
Language: English



IEEE/ACM Transactions on Computational Biology and Bioinformatics



The IEEE/ACM Transactions on Computational Biology and Bioinformatics is a new quarterly that will publish archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.



 



Learning Parameter-Advising Sets for Multiple Sequence Alignment

10/06/2017 2:00 pm PST

While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy. In this paper, we consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.
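The greedy approximation the abstract mentions can be sketched roughly as follows. This is a hypothetical reading, not the paper's implementation: each training example maps a candidate parameter choice to an (estimated accuracy, true accuracy) pair, and the objective is the average true accuracy of the advisor's per-benchmark picks.

```python
def greedy_advisor_set(candidates, examples, k):
    """Greedily grow a parameter-advising set of size at most k.

    Each element of `examples` maps a candidate parameter choice to a
    (estimated_accuracy, true_accuracy) pair for one training benchmark.
    The advisor picks, per benchmark, the choice in the set with the
    highest *estimated* accuracy; the objective rewarded here is the
    average *true* accuracy of those picks (data layout is hypothetical).
    """
    def objective(chosen):
        total = 0.0
        for ex in examples:
            pick = max(chosen, key=lambda p: ex[p][0])  # advisor's argmax
            total += ex[pick][1]                        # its true accuracy
        return total / len(examples)

    chosen, remaining = [], list(candidates)
    while len(chosen) < k and remaining:
        # add the candidate that most improves the advisor's objective
        best = max(remaining, key=lambda c: objective(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Note that a parameter choice with mediocre true accuracy can still be worth adding if its estimated accuracy is high only on benchmarks where it genuinely helps, which is why the set is learned jointly with the estimator rather than by ranking choices individually.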



Pathway Analysis with Signaling Hypergraphs

10/06/2017 2:01 pm PST

Signaling pathways play an important role in the cell’s response to its environment. Signaling pathways are often represented as directed graphs, which are not adequate for modeling reactions such as complex assembly and dissociation, combinatorial regulation, and protein activation/inactivation. More accurate representations such as directed hypergraphs remain underutilized. In this paper, we present an extension of a directed hypergraph that we call a signaling hypergraph. We formulate a problem that asks what proteins and interactions must be involved in order to stimulate a specific response downstream of a signaling pathway. We relate this problem to computing the shortest acyclic $B$-hyperpath in a signaling hypergraph—an NP-hard problem—and present a mixed integer linear program to solve it. We demonstrate that the shortest hyperpaths computed in signaling hypergraphs are far more informative than shortest paths, Steiner trees, and subnetworks containing many short paths found in corresponding graph representations. Our results illustrate the potential of signaling hypergraphs as an improved representation of signaling pathways and motivate the development of novel hypergraph algorithms.
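What distinguishes a B-hyperpath from an ordinary path is that a hyperedge can be traversed only once its entire tail set has been reached, mirroring a reaction that fires only when all of its reactants are present. A minimal sketch of that forward-reachability rule (illustrative only; the paper solves the much harder shortest acyclic B-hyperpath problem with a mixed integer linear program):

```python
def b_reachable(hyperedges, sources):
    """Forward B-reachability in a directed hypergraph.

    `hyperedges` is a list of (tail_set, head) pairs. A hyperedge
    'fires' only once *all* of its tail nodes are reached -- the rule
    that distinguishes B-hyperpaths from ordinary graph paths. Returns
    the set of nodes reachable from `sources` under that rule.
    """
    reached = set(sources)
    changed = True
    while changed:
        changed = False
        for tail, head in hyperedges:
            if head not in reached and set(tail) <= reached:
                reached.add(head)  # all reactants present: reaction fires
                changed = True
    return reached
```

For example, a complex-assembly edge ({A, B} -> C) cannot be crossed from A alone, which is exactly the information an ordinary directed-graph representation loses.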



A Sparse Learning Framework for Joint Effect Analysis of Copy Number Variants

10/06/2017 2:01 pm PST

Copy number variants (CNVs), including large deletions and duplications, represent an unbalanced change of DNA segments. Abundant in human genomes, CNVs contribute to a large proportion of human genetic diversity, with impact on many human phenotypes. Although recent advances in genetic studies have shed light on the impact of individual CNVs on different traits, the analysis of the joint effect of multiple interacting CNVs lags behind in many respects. A primary reason is that the large number of CNV combinations and interactions in the human genome makes such joint analysis computationally challenging. To address this challenge, we developed a novel framework that combines sparse learning with biological networks to identify interacting CNVs with a joint effect on particular traits. We showed that our approach performs well in identifying CNVs with joint phenotypic effect using simulated data. Applied to a real human genomic dataset from the 1000 Genomes Project, our approach identified multiple CNVs that collectively contribute to population differentiation. We found a set of multiple CNVs that have a joint effect in different populations and affect gene expression differently in distinct populations. These results provide a collection of CNVs that likely have downstream biomedical implications in individuals from diverse population backgrounds.



Improving Identification of Key Players in Aging via Network De-Noising and Core Inference

10/06/2017 2:01 pm PST

Current “ground truth” knowledge about human aging has been obtained by transferring aging-related knowledge from well-studied model species via sequence homology or by studying human gene expression data. Since proteins function by interacting with each other, analyzing protein-protein interaction (PPI) networks in the context of aging is promising. Because cellular functioning is dynamic, unlike existing static network studies of aging, we recently integrated the static human PPI network with aging-related gene expression data to form dynamic, age-specific networks. Then, we predicted as key players in aging those proteins whose network topologies significantly changed with age. Since current networks are noisy, here we use link prediction to de-noise the human network and predict improved key players in aging from the de-noised data. Indeed, de-noising gives more significant overlap between the predictions and the “ground truth” aging-related data. Yet, we obtain novel predictions, which we validate in the literature. Also, we improve the predictions by an alternative strategy: removing “redundant” edges from the age-specific networks and using the resulting age-specific network “cores” to study aging. We produce new knowledge from dynamic networks encompassing multiple data types, via network de-noising or core inference, complementing the existing knowledge obtained from sequence or expression data.



Predicting nsSNPs that Disrupt Protein-Protein Interactions Using Docking

10/06/2017 2:01 pm PST

The human genome contains a large number of protein polymorphisms due to individual genome variation. How many of these polymorphisms lead to altered protein-protein interaction is unknown. We have developed a method to address this question. The intersection of the SKEMPI database (of affinity constants among interacting proteins) and the CAPRI 4.0 docking benchmark was docked using HADDOCK, leading to a training set of 166 mutant pairs. A random forest classifier based on the differences in resulting docking scores between the 166 mutant pairs and their wild-types was used to distinguish between variants that have either completely or partially lost binding ability. Fifty percent of non-binders were correctly predicted with a false discovery rate of only 2 percent. The model was tested on a set of 15 HIV-1–human, as well as seven human–human glioblastoma-related, mutant protein pairs: 50 percent of combined non-binders were correctly predicted with a false discovery rate of 10 percent. The model was also used to identify 10 protein-protein interactions between human proteins and their HIV-1 partners that are likely to be abolished by rare non-synonymous single-nucleotide polymorphisms (nsSNPs). These nsSNPs may represent novel and potentially therapeutically valuable targets for anti-viral therapy by disruption of viral binding.



An Annotation Agnostic Algorithm for Detecting Nascent RNA Transcripts in GRO-Seq

10/06/2017 2:01 pm PST

We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq). GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady-state RNA levels, which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, presents unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without a priori need for annotation uncovers surprising new insights into several aspects of the transcription process.
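As an illustration of the HMM half of the idea, the sketch below segments a coverage track into transcribed/untranscribed regions with two-state Viterbi decoding. All names and the emission table are hypothetical assumptions; FStitch itself couples the HMM with logistic-regression-trained emissions rather than the fixed probabilities used here.

```python
import math

def viterbi_segments(coverage, p_emit, p_stay=0.9):
    """Two-state Viterbi decoding over per-position read coverage.

    States: 'off' = untranscribed, 'on' = transcribed. The coverage is
    binarized (any reads vs. none), and `p_emit[state][signal]` are
    hypothetical emission probabilities. Transitions favor staying in
    the current state (p_stay), which smooths over short gaps.
    """
    states = ('off', 'on')
    signal = [1 if c > 0 else 0 for c in coverage]
    # log-probabilities: uniform start, symmetric stay/switch transitions
    trans = {(a, b): math.log(p_stay if a == b else 1 - p_stay)
             for a in states for b in states}
    score = {s: math.log(0.5) + math.log(p_emit[s][signal[0]]) for s in states}
    back = []
    for x in signal[1:]:
        new, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + trans[(p, s)])
            ptr[s] = prev
            new[s] = score[prev] + trans[(prev, s)] + math.log(p_emit[s][x])
        score, back = new, back + [ptr]
    # traceback from the best final state
    last = max(states, key=score.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With sticky transitions, isolated zero-coverage positions inside a transcribed run stay labeled 'on', which is the practical benefit of an HMM over simple per-position thresholding.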



Unconstrained Diameters for Deep Coalescence

10/06/2017 2:01 pm PST

The minimizing-deep-coalescence (MDC) approach infers a median (species) tree for a given set of gene trees under the deep coalescence cost. This cost accounts for the minimum number of deep coalescences needed to reconcile a gene tree with a species tree where the leaf-genes are mapped to the leaf-species through a function called a leaf labeling. In order to better understand the MDC approach, we investigate here the diameter of a gene tree, which is an important property of the deep coalescence cost. This diameter is the maximal deep coalescence cost for a given gene tree under all leaf labelings for each possible species tree topology. While we prove that this diameter is generally infinite, this result relies on the diameter’s unrealistic assumption that species trees can be of infinite size. Providing a more practical definition, we introduce a natural extension of the gene tree diameter that constrains the species tree size by a given constant. For this new diameter, we describe an exact formula, present a complete classification of the trees yielding this diameter, derive formulas for its mean and variance, and demonstrate its utility in comparative studies.



IsAProteinDB: An Indexed Database of Trypsinized Proteins for Fast Peptide Mass Fingerprinting

10/06/2017 2:01 pm PST

In peptide mass fingerprinting, an unknown protein is fragmented into smaller peptides whose masses are accurately measured; the obtained list of masses is then compared with a reference database to obtain a set of matching proteins. The exponential growth of known proteins discourages the use of brute-force methods, in which the mass list is compared with each protein in the reference collection; fortunately, the database literature shows that well-designed search algorithms, coupled with a proper data organization, make it possible to solve the identification problem quickly even on standard desktop computers. In this paper, IsAProteinDB, an indexed database of trypsinized proteins, is presented. The corresponding search algorithm shows a time complexity that does not significantly depend on the size of the reference protein database.
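The indexing idea the abstract alludes to can be sketched with a single sorted list of peptide masses plus binary search, so a query mass is resolved in logarithmic rather than linear time. The data layout and tolerance below are illustrative assumptions, not IsAProteinDB's actual schema.

```python
import bisect

def build_index(proteins):
    """Index every (simulated) tryptic peptide mass once, sorted.

    `proteins` maps a protein id to its list of peptide masses. A
    sorted global index lets a query be answered by binary search
    instead of scanning every protein (the brute-force approach).
    """
    entries = sorted((m, pid) for pid, masses in proteins.items()
                     for m in masses)
    masses = [m for m, _ in entries]  # parallel key list for bisect
    return masses, entries

def lookup(index, query, tol=0.5):
    """Return ids of proteins having a peptide mass within `tol` Da."""
    masses, entries = index
    lo = bisect.bisect_left(masses, query - tol)
    hi = bisect.bisect_right(masses, query + tol)
    return {pid for _, pid in entries[lo:hi]}
```

A full fingerprinting search would issue one such lookup per measured mass and score proteins by how many of their peptides were hit; the point of the index is that each lookup costs O(log n) regardless of database size.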



An Eigen-Binding Site Based Method for the Analysis of Anti-EGFR Drug Resistance in Lung Cancer Treatment

10/06/2017 2:01 pm PST

We explore the drug resistance mechanism in non-small cell lung cancer treatment by characterizing the drug-binding site of a protein mutant based on local surface and energy features. These features are transformed to an eigen-binding site space and used for drug resistance level prediction and analysis.



An Algorithm for Motif-Based Network Design

10/06/2017 2:01 pm PST

A defining property of the structure of a biological network is the distribution of local connectivity patterns, i.e., network motifs. In this work, a method for creating directed, unweighted networks while promoting a certain combination of motifs is presented. This motif-based network algorithm starts with an empty graph and randomly connects the nodes while advancing or discouraging the formation of chosen motifs. The in- or out-degree distribution of the generated networks can be explicitly chosen. The algorithm is shown to perform well in producing networks with high occurrences of the targeted motifs, both three-node and four-node ones. Moreover, the algorithm can also be tuned to bring about global network characteristics found in many natural networks, such as small-worldness and modularity.
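A toy version of such motif-promoting growth: start from an empty edge set and, at each step, pick among a few random candidate edges the one that creates the most feed-forward loops (one common three-node motif). This is purely illustrative; the paper's algorithm also controls degree distributions and can discourage motifs.

```python
import random

def count_ffl(edges):
    """Count feed-forward loops (a->b, b->c, a->c) in a directed edge set."""
    out = {}
    for a, b in edges:
        out.setdefault(a, set()).add(b)
    return sum(1 for a in out for b in out[a]
               for c in out.get(b, ()) if c in out[a])

def grow_network(nodes, n_edges, tries=10, seed=0):
    """Grow a directed graph edge by edge, greedily favoring edges that
    create feed-forward loops. Assumes n_edges is achievable for the
    given node set; candidate proposal is random, so the result varies
    with `seed`.
    """
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        candidates = []
        for _ in range(tries):
            a, b = rng.sample(nodes, 2)  # two distinct endpoints
            if (a, b) not in edges:
                candidates.append((a, b))
        if not candidates:
            continue
        # keep the candidate that maximizes the motif count
        best = max(candidates, key=lambda e: count_ffl(edges | {e}))
        edges.add(best)
    return edges
```

Replacing `count_ffl` with a counter for a different motif, or negating the objective, changes which local pattern the growth process promotes or discourages.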



Deep Conditional Random Field Approach to Transmembrane Topology Prediction and Application to GPCR Three-Dimensional Structure Modeling

10/06/2017 2:01 pm PST

Transmembrane proteins play important roles in cellular energy production, signal transmission, and metabolism. Many shallow machine learning methods have been applied to transmembrane topology prediction, but their performance has been limited by the large size of membrane proteins and the complex biological evolution information behind the sequence. In this paper, we propose a novel deep approach based on conditional random fields, named dCRF-TM, for predicting the topology of transmembrane proteins. Conditional random fields take into account more complicated interrelations between residue labels over the full-length sequence than HMM- and SVM-based methods. Three widely-used datasets were employed in the benchmark. dCRF-TM achieved 95 percent accuracy in helix location prediction and 78 percent accuracy in helix number prediction. dCRF-TM demonstrated more robust performance on large proteins (>350 residues) against 11 state-of-the-art predictors. Further, dCRF-TM was applied to ab initio modeling of the three-dimensional structures of seven-transmembrane receptors, also known as G protein-coupled receptors. The predictions on 24 solved G protein-coupled receptors and the unsolved vasopressin V2 receptor illustrate that dCRF-TM helped abGPCR-I-TASSER improve TM-score by 34.3 percent over using a random transmembrane definition. Two out of five predicted models captured the experimentally verified disulfide bonds in the vasopressin V2 receptor.



hMuLab: A Biomedical Hybrid MUlti-LABel Classifier Based on Multiple Linear Regression

10/06/2017 2:01 pm PST

Many biomedical classification problems are multi-label by nature, e.g., a gene involved in a variety of functions or a patient with multiple diseases. The majority of existing classification algorithms assume that each sample has only one class label, and the multi-label classification problem remains a challenge for biomedical researchers. This study proposes a novel multi-label learning algorithm, hMuLab, that integrates both feature-based and neighbor-based similarity scores. The multiple linear regression modeling techniques make hMuLab capable of producing multiple label assignments for a query sample. Comparison results over six commonly-used multi-label performance measurements suggest that hMuLab performs accurately and stably on the biomedical datasets, and may serve as a complement to the existing literature.
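The hybrid scoring idea can be caricatured as a convex combination of a feature-based score and a neighbor-based label frequency, assigning every label whose combined score clears a cutoff. This is a loose sketch with hypothetical inputs; hMuLab fits the combination by multiple linear regression rather than using a fixed `alpha`.

```python
def multi_label_predict(feature_score, neighbors, labels,
                        alpha=0.5, cutoff=0.5):
    """Multi-label prediction from two evidence sources.

    `feature_score[label]` is a feature-based score in [0, 1] for the
    query sample (assumed precomputed by some upstream model), and
    `neighbors` is a list of label sets of the sample's nearest
    neighbors. A label is assigned when the weighted combination of
    feature score and neighbor label frequency reaches `cutoff`, so a
    single sample can receive several labels.
    """
    predicted = []
    for lab in labels:
        neigh = sum(1 for n in neighbors if lab in n) / len(neighbors)
        if alpha * feature_score[lab] + (1 - alpha) * neigh >= cutoff:
            predicted.append(lab)
    return predicted
```

The essential departure from single-label classifiers is the per-label thresholding: labels are scored independently, so the output is a set rather than a single argmax.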



Identifying Stages of Kidney Renal Cell Carcinoma by Combining Gene Expression and DNA Methylation Data

10/06/2017 2:01 pm PST

In this study, in order to take advantage of complementary information from different types of data for better disease status diagnosis, we combined gene expression with DNA methylation data and generated a fused network, based on which the stages of Kidney Renal Cell Carcinoma (KIRC) can be better identified. It is well recognized that a network is important for investigating the connectivity of disease groups. We exploited the potential of network-based features to identify the KIRC stage. We first constructed a patient network from each type of data. We then built a fused network using a network fusion method. Based on the link weights between patients, we used a generalized linear model to predict the group of KIRC subjects. Finally, the group prediction method was applied to test the power of the network-based features. The performance (e.g., the accuracy of identifying cancer stages) when using the fused network from two types of data is shown to be superior to that when using either patient network from a single data type. The work provides a good example of using network-based features from multiple data types for a more comprehensive diagnosis.



Classification of Protein Structure Classes on Flexible Neutral Tree

10/06/2017 2:01 pm PST

Accurate classification of protein structural classes plays an important role in bioinformatics, and increasing evidence shows that a variety of classification methods have been employed in this field. In this research, features based on amino acid composition, secondary structure, and the correlation coefficients of amino acid dimers and triplets are used. A flexible neutral tree (FNT), a particular tree-structured neural network, is employed as the classification model in the protein structure classification framework. Because different feature groups play diverse roles in the model, impact factors for the different groups are introduced. To evaluate these impact factors, an Impact Factor Scaling (IFS) algorithm, which aims to reduce redundant information among the selected features, is put forward. To examine the performance of this framework, the 640, 1189, and ASTRAL datasets are employed as low-homology protein structure benchmarks. Experimental results demonstrate that the proposed method outperforms the other methods on low-homology protein tertiary structures.



Nonconvex Penalty Based Low-Rank Representation and Sparse Regression for eQTL Mapping

10/06/2017 2:01 pm PST

This paper addresses the problem of accounting for confounding factors and expression quantitative trait loci (eQTL) mapping in the study of SNP-gene associations. Existing convex penalty based algorithms have limited capacity to preserve the main information of a matrix while reducing its rank. We present an algorithm, NCLRS, which uses nonconvex penalty based low-rank representation to account for confounding factors and sparse regression for eQTL mapping. The efficiency of the presented algorithm is evaluated by comparing the results of NCLRS and an existing convex penalty based algorithm on 18 synthetic datasets. Experimental results on a biological dataset show that our approach is a more effective tool for accounting for non-genetic effects than currently existing methods.



Cancer Subtype Discovery Based on Integrative Model of Multigenomic Data

10/06/2017 2:00 pm PST

One major goal of large-scale cancer omics studies is to understand the molecular mechanisms of cancer and find new biomedical targets. To deal with high-dimensional, multi-source cancer omics data (DNA methylation, mRNA expression, etc.), which can offer new insight into identifying cancer subtypes, clustering methods are usually used to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and big data volume, few methods can integrate these data and map them into an effective low-dimensional subspace. In this paper, we develop a dimension-reduction and data-integration method for identifying cancer subtypes, named Scluster. First, Scluster projects each type of original data into its principal subspace by an adaptive sparse reduced-rank regression method. Then, a fused patient-by-patient network is obtained for these subgroups through a scaled exponential similarity kernel method. Finally, candidate cancer subtypes are identified using a spectral clustering method. We demonstrate the efficiency of our Scluster method on three cancers by jointly analyzing mRNA expression, miRNA expression, and DNA methylation data. The evaluation results and analyses show that Scluster is effective for predicting survival and identifies novel cancer subtypes from large-scale multi-omics data.



Exploring Consensus RNA Substructural Patterns Using Subgraph Mining

10/06/2017 2:00 pm PST

Frequently recurring RNA structural motifs play important roles in the RNA folding process and in interaction with other molecules. Traditional index-based and shape-based schemas are useful for modeling RNA secondary structures but ignore the structural discrepancies of individual RNA family members. Further, in-depth analysis of the underlying substructure patterns is insufficient due to varied and unnormalized substructure data. This prevents us from understanding RNA functions and their inherent synergistic regulation networks. This article thus proposes a novel labeled graph-based algorithm, RnaGraph, to uncover frequent RNA substructure patterns. Attribute data and graph data are combined to characterize diverse substructures and their correlations, respectively. Further, a top-k graph pattern mining algorithm is developed to extract interesting substructure motifs by integrating frequency and similarity. The experimental results show that our methods assist not only in modeling complex RNA secondary structures but also in identifying hidden but interesting RNA substructure patterns.



PSPEL: In Silico Prediction of Self-Interacting Proteins from Amino Acids Sequences Using Ensemble Learning

10/06/2017 2:01 pm PST

Self-interacting proteins (SIPs) play an important role in various aspects of the structural and functional organization of the cell. Detecting SIPs is one of the most important issues in current molecular biology. Although a large amount of SIP data has been generated by experimental methods, wet laboratory approaches are both time-consuming and costly. In addition, they yield high false negative and false positive rates. Thus, there is a great need for in silico methods to predict SIPs accurately and efficiently. In this study, a new sequence-based method is proposed to predict SIPs. The evolutionary information contained in the Position-Specific Scoring Matrix (PSSM) is extracted from proteins of known sequence. Then, the features are fed to an ensemble classifier to distinguish self-interacting from non-self-interacting proteins. When performed on Saccharomyces cerevisiae and human SIP data sets, the proposed method achieves high accuracies of 86.86 and 91.30 percent, respectively. Our method also shows good performance when compared with the SVM classifier and previous methods. Consequently, the proposed method can be considered a novel and promising tool to predict SIPs.



IPED2: Inheritance Path Based Pedigree Reconstruction Algorithm for Complicated Pedigrees

10/06/2017 2:01 pm PST

Reconstruction of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. The problem is known to be NP-hard even for datasets known to contain only siblings. Some recent methods have been developed to accurately and efficiently reconstruct pedigrees. These methods, however, still consider relatively simple pedigrees; for example, they are not able to handle half-sibling situations in which a pair of individuals share only one parent. In this work, we propose an efficient method, IPED2, based on our previous work, which specifically targets reconstruction of complicated pedigrees that include half-siblings. We note that the presence of half-siblings makes the reconstruction problem significantly more challenging, which is why previous methods exclude the possibility of half-siblings. We propose a novel model as well as an efficient graph algorithm, and experiments show that our algorithm achieves relatively accurate reconstruction. To our knowledge, this is the first method that is able to handle pedigree reconstruction from genotype data when half-siblings exist in any generation of the pedigree.









cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU

08/07/2017 2:05 pm PST

BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with traceback) on a multicore CPU and overlaps their execution with the phases running on the GPU.



Omics Informatics: From Scattered Individual Software Tools to Integrated Workflow Management Systems

08/09/2017 2:02 pm PST

Omic data analyses pose great informatics challenges. As an emerging subfield of bioinformatics, omics informatics focuses on analyzing multi-omic data efficiently and effectively, and is gaining momentum. There are two underlying trends in the expansion of the omics informatics landscape: the explosion of scattered individual omics informatics tools, each of which focuses on a specific task in both single- and multi-omic settings, and the fast-evolving integrated software platforms such as workflow management systems that can assemble multiple tools into pipelines and streamline integrative analysis for complicated tasks. In this survey, we give a holistic view of omics informatics, from scattered individual informatics tools to integrated workflow management systems. We not only outline the landscape and challenges of omics informatics, but also sample a number of widely used and cutting-edge algorithms in omics data analysis to give readers a fine-grained view. We survey various workflow management systems (WMSs), classify them into three levels from simple software toolkits to integrated multi-omic analytical platforms, and point out the emerging need for intelligent workflow management systems. We also discuss the challenges, strategies, and some existing work in the systematic evaluation of omics informatics tools. We conclude by providing future perspectives on emerging fields and new frontiers in omics informatics.



An IR-Based Approach Utilizing Query Expansion for Plagiarism Detection in MEDLINE

08/07/2017 2:06 pm PST

The identification of duplicated and plagiarized passages of text has become an increasingly active area of research. In this paper, we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection—the identification of a subset of potential source documents given a suspicious text—from MEDLINE. Query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.
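The candidate-selection step can be sketched as expanding each query term with synonyms and ranking documents by overlap with the expanded term set. The `synonyms` dictionary below is a hypothetical stand-in for UMLS Metathesaurus lookups, and the overlap score stands in for a proper retrieval model.

```python
def expand_query(terms, synonyms):
    """Expand each query term with its synonyms (a stand-in for UMLS
    Metathesaurus concept lookups). Expansion lets retrieval match
    passages whose words were replaced during plagiarism."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(s for s in synonyms.get(t, []) if s not in expanded)
    return expanded

def rank_candidates(query_terms, documents, synonyms):
    """Rank candidate source documents by overlap with the expanded
    query; `documents` maps a doc id to its token list. A toy scoring
    function in place of a real IR model."""
    q = set(expand_query(query_terms, synonyms))
    scored = sorted(documents.items(),
                    key=lambda kv: -len(q & set(kv[1])))
    return [doc_id for doc_id, _ in scored]
```

The point of the expansion is visible in the example below: the suspicious text says "heart attack" while the source document says "cardiac infarction", and only the expanded query recovers the match.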



Super-Thresholding: Supervised Thresholding of Protein Crystal Images

08/07/2017 2:05 pm PST

In general, a single thresholding technique is developed or enhanced to separate foreground objects from background for a domain of images. This idea may not generate satisfactory results for all images in a dataset, since different images may require different types of thresholding methods for proper binarization or segmentation. To overcome this limitation, in this study, we propose a novel approach called “super-thresholding” that utilizes a supervised classifier to decide the appropriate thresholding method for a specific image. This method provides a generic framework that allows selection of the best thresholding method among different thresholding techniques that are beneficial for the problem domain. A classifier model is built using features extracted a priori from the original image only, or a posteriori by analyzing the outputs of the thresholding methods and the original image. This model is then applied to identify the thresholding method for new images of the domain. We applied our method to protein crystallization images and compared our results with six thresholding techniques. Numerical results are provided using four different correctness measurements. Super-thresholding outperforms the best single thresholding method by around 10 percent, and it gives the best performance for the protein crystallization dataset in our experiments.
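The core dispatch of super-thresholding, selecting a thresholding function per image, can be sketched as follows. The `classifier` argument is a stand-in for the trained supervised model, and the two toy thresholding methods (global mean and median) stand in for the six techniques compared in the paper.

```python
def mean_threshold(image):
    """Global threshold at the mean pixel intensity."""
    pixels = [px for row in image for px in row]
    return sum(pixels) / len(pixels)

def median_threshold(image):
    """Global threshold at the median pixel intensity."""
    pixels = sorted(px for row in image for px in row)
    return pixels[len(pixels) // 2]

def super_threshold(image, classifier):
    """Binarize an image with the thresholding method a classifier picks.

    `classifier` stands in for the trained model: it maps simple image
    features to the name of the thresholding method to use for this
    particular image, which is the essence of super-thresholding.
    """
    methods = {'mean': mean_threshold, 'median': median_threshold}
    features = {'mean': mean_threshold(image),
                'median': median_threshold(image)}  # a posteriori features
    t = methods[classifier(features)](image)
    return [[1 if px >= t else 0 for px in row] for row in image]
```

The framework's value is that the per-image choice is learned: images for which the mean threshold over-segments can be routed to a different method without changing any of the thresholding implementations themselves.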



Prediction of Novel Drugs for Hepatocellular Carcinoma Based on Multi-Source Random Walk

08/07/2017 2:06 pm PST

Computational approaches for predicting drug-disease associations by integrating gene expression and biological networks provide great insight into the complex relationships among drugs, targets, disease genes, and diseases at a system level. Hepatocellular carcinoma (HCC) is one of the most common malignant tumors, with a high rate of morbidity and mortality. We provide an integrative framework to predict novel drugs for HCC based on multi-source random walk (PD-MRW). First, based on gene expression and a protein interaction network, we construct a gene-gene weighted interaction network (GWIN). Then, based on multi-source random walk in GWIN, we build a drug-drug similarity network. Finally, based on the known drugs for HCC, we score all drugs in the drug-drug similarity network. The robustness of our predictions, their overlap with those reported in the Comparative Toxicogenomics Database (CTD) and the literature, and their enriched KEGG pathways demonstrate that our approach can effectively identify new drug indications. Specifically, regorafenib (rank 9 in the top-20 list) is proven to be effective in Phase I and II clinical trials of HCC, and the Phase III trial is ongoing; it shares 11 enriched pathways with HCC, with low p-values. By focusing on a particular disease, we believe our approach is more accurate and possesses better scalability.



SuperMIC: Analyzing Large Biological Datasets in Bioinformatics with Maximal Information Coefficient

08/07/2017 2:06 pm PST

The maximal information coefficient (MIC) has been proposed to discover relationships and associations between pairs of variables. Accelerating the MIC calculation poses significant challenges for bioinformatics scientists, especially in genome sequencing and biological annotation. In this paper, we explore a parallel approach that uses the MapReduce framework to improve the computing efficiency and throughput of the MIC computation. The acceleration system includes biological data storage on HDFS, preprocessing algorithms, a distributed memory cache mechanism, and the partitioning of MapReduce jobs. Based on this acceleration approach, we extend the traditional two-variable algorithm to a multiple-variable algorithm. The experimental results show that our parallel solution provides a linear speedup compared with the original algorithm without affecting correctness or sensitivity.



Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources

08/07/2017 2:06 pm PST

Since the discovery of the regulatory function of microRNA (miRNA), increased attention has focused on identifying the relationship between miRNA and disease. It has been suggested that computational methods are an efficient way to identify potential disease-related miRNAs for further confirmation using biological experiments. In this paper, we first highlighted three limitations commonly associated with previous computational methods. To resolve these limitations, we established a disease similarity subnetwork and a miRNA similarity subnetwork by integrating multiple data sources, where the disease similarity is composed of disease semantic similarity and disease functional similarity, and the miRNA similarity is calculated using the miRNA-target gene and miRNA-lncRNA (long non-coding RNA) associations. Then, a heterogeneous network was constructed by connecting the disease similarity subnetwork and the miRNA similarity subnetwork using the known miRNA-disease associations. We extended random walk with restart to predict miRNA-disease associations in the heterogeneous network. Leave-one-out cross-validation achieved an average area under the curve (AUC) of 0.8049 across 341 diseases and 476 miRNAs. For five-fold cross-validation, our method achieved an AUC from 0.7970 to 0.9249 for 15 human diseases. Case studies further demonstrated the feasibility of our method to discover potential miRNA-disease associations. An online service for prediction is freely available at http://ifmda.aliapp.com.
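The random walk with restart at the core of this approach can be sketched as follows (a generic numpy illustration, not the authors' code; the chain network, seed node, and restart probability below are invented for the example):

```python
import numpy as np

def random_walk_with_restart(A, seed_idx, restart=0.5, tol=1e-8, max_iter=1000):
    """Random walk with restart on adjacency matrix A.

    Returns steady-state visiting probabilities; higher scores mean
    stronger association with the seed nodes."""
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # avoid division by zero
    W = A / col_sums                         # column-stochastic transitions
    p0 = np.zeros(A.shape[0])
    p0[seed_idx] = 1.0 / len(seed_idx)       # restart vector over the seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next

# Toy 4-node chain network, seeded at node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
p = random_walk_with_restart(A, [0])
```

Nodes closer to the seed receive higher scores, which is the ranking signal used for prioritization.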



Identifying Cell Populations in Flow Cytometry Data Using Phenotypic Signatures

08/07/2017 2:05 pm PST

Single-cell flow cytometry is a technology that measures the expression of several cellular markers simultaneously for a large number of cells. Identification of homogeneous cell populations, currently done by manual biaxial gating, is highly subjective and time-consuming. To overcome the shortcomings of manual gating, automatic algorithms have been proposed. However, the performance of these methods highly depends on the shape of the populations and the dimension of the data. In this paper, we have developed a time-efficient method that accurately identifies cellular populations. This is done based on a novel technique that estimates the initial number of clusters in high dimension and identifies the final clusters by merging clusters using their phenotypic signatures in low dimension. The proposed method is called SigClust. We have applied SigClust to four public datasets and compared it with five well-known methods in the field. The results are promising and indicate higher performance and accuracy compared to similar approaches reported in the literature.



ISEA: Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution

08/07/2017 2:06 pm PST

The purpose of de novo assembly is to produce contigs that are more contiguous and complete and less error-prone. Thanks to the advent of next-generation sequencing (NGS) technologies, the cost of producing high-depth reads has been greatly reduced. However, due to the disadvantages of NGS, de novo assembly has to face the difficulties brought by repeat regions, sequencing errors, and low coverage in some regions. Although many de novo algorithms have been proposed to solve these problems, de novo assembly still remains a challenge. In this article, we developed an iterative seed-extension algorithm for de novo assembly, called ISEA. To avoid the negative impact of sequencing errors, ISEA utilizes read overlaps and paired-end information to correct erroneous reads before assembly. While extending seeds in a De Bruijn graph, ISEA uses an elaborately designed score function based on paired-end information and the insert size distribution to resolve repeat regions. By employing the insert size distribution, the score function can also reduce the influence of erroneous reads. In scaffolding, ISEA adopts a relaxed strategy to join contigs that were terminated for low coverage during the extension. The performance of ISEA was compared with six popular assemblers on four real datasets. The experimental results demonstrate that ISEA can effectively obtain longer and more accurate scaffolds.



Search for a Minimal Set of Parameters by Assessing the Total Optimization Potential for a Dynamic Model of a Biochemical Network

08/07/2017 2:06 pm PST

Selecting an efficient small set of adjustable parameters to improve the metabolic features of an organism is important for reducing implementation costs and the risk of unpredicted side effects. In practice, to avoid analyzing a huge combinatorial space of possible sets of adjustable parameters, experience- and intuition-based subsets of parameters are often chosen, possibly leaving some interesting counter-intuitive combinations of parameters unrevealed. A combinatorial scan of possible adjustable parameter combinations at the model optimization level is possible; however, the number of analyzed combinations is still limited. The total optimization potential (TOP) approach is proposed to assess the full potential for increasing the value of the objective function by optimizing all possible adjustable parameters. This seemingly impractical combination of adjustable parameters allows assessing the maximum attainable value of the objective function and stopping the combinatorial space scan when the desired fraction of TOP is reached and any further increase in the number of adjustable parameters cannot bring any reasonable improvement. The relation between the number of adjustable parameters and the reachable fraction of TOP is a valuable guideline in choosing a rational solution for industrial implementation. The TOP approach is demonstrated on the basis of two case studies.



Finite-Time Stability Analysis of Reaction-Diffusion Genetic Regulatory Networks with Time-Varying Delays

08/07/2017 2:05 pm PST

This paper is concerned with the finite-time stability problem of delayed genetic regulatory networks (GRNs) with reaction-diffusion terms under Dirichlet boundary conditions. By constructing a Lyapunov–Krasovskii functional including quad-slope integrations, we establish delay-dependent finite-time stability criteria by employing the Wirtinger-type integral inequality, the Gronwall inequality, the convex technique, and the reciprocally convex technique. In addition, the obtained criteria are also reaction-diffusion-dependent. Finally, a numerical example is provided to illustrate the effectiveness of the theoretical results.



A New Feature Vector Based on Gene Ontology Terms for Protein-Protein Interaction Prediction

08/07/2017 2:05 pm PST

Protein-protein interaction (PPI) plays a key role in understanding cellular mechanisms in different organisms. Many supervised classifiers, such as Random Forest (RF) and Support Vector Machine (SVM), have been used for intra- or inter-species interaction prediction. To improve prediction performance, in this paper we propose a novel set of features to represent a protein pair using their annotated Gene Ontology (GO) terms, including their ancestors. In our approach, a protein pair is treated as a document (bag of words), where the terms annotating the two proteins represent the words. The feature value of each word is calculated as the information content of the corresponding term multiplied by a coefficient representing the weight of that term inside a document (i.e., a protein pair). We have tested the performance of the classifier using the proposed features on well-known data sets from several species, including S. cerevisiae, H. sapiens, E. coli, and D. melanogaster. We compare it with another GO-based feature representation technique and demonstrate its competitive performance.
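The bag-of-words view of a protein pair can be illustrated roughly as follows (a toy sketch: the protein names, GO identifiers, and annotation corpus are placeholders, and the weight here is a simple within-pair term frequency, which may differ from the authors' coefficient):

```python
import math
from collections import Counter

# Toy annotation corpus: protein -> set of GO terms (ancestors included).
# The GO identifiers here are placeholders, not real annotations.
annotations = {
    "P1": {"GO:0001", "GO:0002", "GO:0003"},
    "P2": {"GO:0001", "GO:0004"},
    "P3": {"GO:0002", "GO:0003", "GO:0004"},
    "P4": {"GO:0001", "GO:0002"},
}

# Information content: IC(t) = -log(fraction of proteins annotated with t).
n_proteins = len(annotations)
term_counts = Counter(t for terms in annotations.values() for t in terms)
ic = {t: -math.log(c / n_proteins) for t, c in term_counts.items()}

def pair_feature_vector(prot_a, prot_b, vocabulary):
    """Treat the pair as a 'document': each annotating term is a word.
    Feature value = IC(term) * term frequency within the pair."""
    bag = Counter(annotations[prot_a]) + Counter(annotations[prot_b])
    total = sum(bag.values())
    return [ic[t] * bag[t] / total if t in bag else 0.0 for t in vocabulary]

vocab = sorted(term_counts)
v = pair_feature_vector("P1", "P2", vocab)
```

Terms shared by both proteins get a larger within-pair frequency, so they dominate the feature vector fed to the classifier.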



Detection of Copy Number Variants from NGS with Sparse and Smooth Constraints

08/07/2017 2:06 pm PST

It is known that copy number variations (CNVs) are associated with complex diseases and particular tumor types; thus, reliable identification of CNVs is of great potential value. Recent advances in next-generation sequencing (NGS) data analysis have helped manifest the richness of CNV information. However, the performance of these methods is not consistent. Reliably finding CNVs in NGS data in an efficient way remains a challenging topic worthy of further investigation. Accordingly, we tackle the problem by formulating CNV identification as a quadratic optimization problem involving two constraints. By imposing the constraints of sparsity and smoothness, the reconstructed read-depth signal from NGS is anticipated to fit the CNV patterns more accurately. An efficient numerical solution tailored from the alternating direction minimization (ADM) framework is elaborated. We demonstrate the advantages of the proposed method, named ADM-CNV, by comparing it with six popular CNV detection methods using synthetic, simulated, and empirical sequencing data. It is shown that the proposed approach can successfully reconstruct CNV patterns from raw data and achieve superior or comparable performance in CNV detection compared to existing counterparts.
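The sparsity-plus-smoothness objective can be illustrated with a toy solver (plain subgradient descent on an invented signal, standing in for the paper's ADM solver, which is not reproduced here):

```python
import numpy as np

def objective(x, y, lam1, lam2):
    """0.5*||y - x||^2 + lam1*||x||_1 + lam2 * total variation of x."""
    return (0.5 * np.sum((y - x) ** 2)
            + lam1 * np.sum(np.abs(x))
            + lam2 * np.sum(np.abs(np.diff(x))))

def sparse_smooth_fit(y, lam1=0.1, lam2=1.0, steps=2000, lr=0.05):
    """Toy subgradient descent on the sparse + smooth objective,
    keeping the best iterate seen (subgradient steps are not monotone)."""
    x = y.copy()
    best, best_f = x.copy(), objective(x, y, lam1, lam2)
    for k in range(1, steps + 1):
        g = x - y + lam1 * np.sign(x)
        d = np.sign(np.diff(x))
        g[:-1] -= lam2 * d             # subgradient of |x[i+1]-x[i]| w.r.t. x[i]
        g[1:] += lam2 * d              # ... and w.r.t. x[i+1]
        x = x - (lr / np.sqrt(k)) * g  # diminishing step size
        f = objective(x, y, lam1, lam2)
        if f < best_f:
            best, best_f = x.copy(), f
    return best

# Noisy piecewise-constant read-depth-like signal (synthetic).
rng = np.random.default_rng(1)
truth = np.concatenate([np.zeros(40), 2.0 * np.ones(30), np.zeros(30)])
y = truth + 0.3 * rng.standard_normal(100)
x_hat = sparse_smooth_fit(y)
```

The two penalties jointly favor signals that are mostly zero (sparse CNV calls) and piecewise constant (smooth segments).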



Impact of Synaptic Localization and Subunit Composition of Ionotropic Glutamate Receptors on Synaptic Function: Modeling and Simulation Studies

08/07/2017 2:06 pm PST

Ionotropic NMDA and AMPA glutamate receptors (iGluRs) play important roles in synaptic function under physiological and pathological conditions. iGluR sub-synaptic localization and subunit composition are dynamically regulated by activity-dependent insertion and internalization. However, the impact of changes in iGluR composition and localization on synaptic transmission is difficult to assess experimentally. To address this question, we developed a detailed computational model of glutamatergic synapses, including spine and dendritic compartments, elementary models of subtypes of NMDA and AMPA receptors, glial glutamate transporters, intracellular calcium, and a calcium-dependent signaling cascade underlying the development of long-term potentiation (LTP). These synapses were distributed on a neuron model, and numerical simulations were performed to assess the impact of changes in composition and localization (synaptic versus extrasynaptic) of iGluRs on synaptic transmission and plasticity following various patterns of presynaptic stimulation. In addition, the effects of various pharmacological compounds targeting NMDARs or AMPARs were determined. Our results showed that changes in NMDAR localization have a greater impact on synaptic plasticity than changes in AMPARs. Moreover, the results suggest that modulators of AMPA and NMDA receptors have differential effects on restoring synaptic plasticity under different experimental situations mimicking various human diseases.



A Novel Adaptive Penalized Logistic Regression for Uncovering Biomarker Associated with Anti-Cancer Drug Sensitivity

08/07/2017 2:06 pm PST

We propose a novel adaptive penalized logistic regression modeling strategy based on the Wilcoxon rank sum test (WRST) to effectively uncover driver genes in classification. To incorporate the significance of each gene in classification, we first measure it with a WRST-based gene ranking method, and then impose the adaptive L1-type penalty on each gene according to its measured importance. Incorporating the significance of genes into adaptive logistic regression enables us to impose a large penalty on low-ranking genes, so noise genes are easily deleted from the model and driver genes can be identified effectively. Monte Carlo experiments and a real-world example are conducted to investigate the effectiveness of the proposed approach. In the Sanger data analysis, we introduce a strategy to identify expression modules indicating gene regulatory mechanisms via principal component analysis (PCA), and perform logistic regression modeling based not on single genes but on gene expression modules. The Monte Carlo experiments and real-world example show that the proposed adaptive penalized logistic regression outperforms existing L1-type regularization in feature selection and classification. The discriminately imposed penalty based on WRST effectively performs crucial gene selection, and thus our method can improve classification accuracy without interference from noise genes. Furthermore, the Sanger data analysis shows that the method based on gene expression modules, derived from principal components and their loading scores, provides results that are interpretable from a biological viewpoint.
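The rank-weighted adaptive L1 penalty can be sketched with a proximal-gradient (ISTA) toy on synthetic data (the weighting 1/(score + eps) is an invented stand-in for the paper's scheme, and ties in the rank statistic are ignored):

```python
import numpy as np

def rank_sum_score(x, y01):
    """Absolute standardized Wilcoxon rank-sum statistic for one feature;
    larger means the feature better separates the two classes.
    (Ties ignored for simplicity -- fine for continuous data.)"""
    ranks = x.argsort().argsort() + 1.0
    n1 = int((y01 == 1).sum())
    n0 = int((y01 == 0).sum())
    w = ranks[y01 == 1].sum()
    mu = n1 * (n1 + n0 + 1) / 2.0
    sigma = np.sqrt(n1 * n0 * (n1 + n0 + 1) / 12.0)
    return abs(w - mu) / sigma

def adaptive_l1_logistic(X, y, lam=0.05, steps=500, lr=0.1):
    """Proximal-gradient logistic regression with a per-gene adaptive L1
    penalty: low-ranking (uninformative) genes get a heavier penalty
    and are shrunk to zero more aggressively."""
    n, p = X.shape
    scores = np.array([rank_sum_score(X[:, j], y) for j in range(p)])
    wts = 1.0 / (scores + 1e-3)          # adaptive penalty weights
    beta = np.zeros(p)
    for _ in range(steps):
        z = np.clip(X @ beta, -30, 30)
        grad = X.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n
        b = beta - lr * grad
        thr = lr * lam * wts
        beta = np.sign(b) * np.maximum(np.abs(b) - thr, 0.0)  # soft-threshold
    return beta

# Synthetic expression data: only gene 0 drives the class label.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = (X[:, 0] + 0.3 * rng.standard_normal(200) > 0).astype(float)
beta = adaptive_l1_logistic(X, y)
```

Because the informative gene earns a high rank-sum score, its penalty weight is small and its coefficient survives, while noise genes are thresholded away.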



Pattern Classification of Instantaneous Cognitive Task-load Through GMM Clustering, Laplacian Eigenmap, and Ensemble SVMs

08/07/2017 2:06 pm PST

The identification of temporal variations in human operator cognitive task-load (CTL) is crucial for preventing possible accidents in human-machine collaborative systems. Recent literature has shown that changes in discrete CTL level during human-machine system operations can be objectively recognized using neurophysiological data and supervised learning techniques. The objective of this work is to design a subject-specific multi-class CTL classifier that reveals the complex unknown relationship between an operator's task performance and neurophysiological features by combining target class labeling, physiological feature reduction and selection, and ensemble classification techniques. The psychophysiological data acquisition experiments were performed under multiple human-machine process control tasks. Four or five target classes of CTL were determined by using a Gaussian mixture model and three human performance variables. Using the Laplacian eigenmap, a few salient EEG features were extracted and, together with heart rate, used as the input features of the CTL classifier. Then, multiple support vector machines were aggregated via majority voting to create an ensemble classifier for recognizing the CTL classes. Finally, the obtained CTL classification results were compared with those of several existing methods. The results showed that the proposed methods are capable of deriving a reasonable number of target classes and low-dimensional optimal EEG features for individual human operator subjects.
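The majority-voting aggregation step can be sketched in a few lines (a generic illustration; the vote matrix below is invented and stands in for the per-SVM predictions):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate per-classifier predictions (one label sequence per
    classifier) into a single ensemble prediction by majority voting."""
    ensemble = []
    for labels in zip(*predictions):
        ensemble.append(Counter(labels).most_common(1)[0][0])
    return ensemble

# Three hypothetical SVMs voting CTL classes for four samples.
votes = [[1, 2, 3, 1],
         [1, 3, 3, 2],
         [2, 3, 3, 1]]
fused = majority_vote(votes)
```

Each ensemble label is simply the most frequent class among the member classifiers for that sample.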



Derivative-Free Optimization of Rate Parameters of Capsid Assembly Models from Bulk in Vitro Data

08/07/2017 2:06 pm PST

The assembly of virus capsids proceeds by a complicated cascade of association and dissociation steps, the great majority of which cannot be directly experimentally observed. This has made capsid assembly a rich field for computational models, but there are substantial obstacles to model inference for such systems. Here, we describe progress on fitting kinetic rate constants defining capsid assembly models to experimental data, a difficult data-fitting problem because of the high computational cost of simulating assembly trajectories, the stochastic noise inherent to the models, and the limited and noisy data available for fitting. We evaluate the merits of data-fitting methods based on derivative-free optimization (DFO) relative to gradient-based methods used in prior work. We further explore the advantages of alternative data sources through simulation of a model of time-resolved mass spectrometry data, a technology for monitoring bulk capsid assembly that can be expected to provide much richer data than previously used static light scattering approaches. The results show that advances in both the data and the algorithms can improve model inference. More informative data sources lead to high-quality fits for all methods, but DFO methods show substantial advantages on less informative data sources that better represent current experimental practice.
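A derivative-free fit can be illustrated with a minimal pattern search on invented kinetics data (this generic coordinate-probing routine is a stand-in for the DFO methods evaluated in the paper, and the one-parameter model below is not the capsid model):

```python
import numpy as np

def pattern_search(f, x0, step=0.5, shrink=0.5, tol=1e-6, max_iter=200):
    """Minimal derivative-free pattern search: probe +/- step along each
    coordinate; shrink the step when no probe improves the objective."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                trial = x.copy()
                trial[i] += d
                ft = f(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
        if not improved:
            step *= shrink
            if step < tol:
                break
    return x, fx

# Synthetic "bulk assembly" readout: fit a single rate constant k to
# noiseless y = 1 - exp(-k t); the hidden truth here is k = 0.7.
t = np.linspace(0, 10, 50)
y = 1 - np.exp(-0.7 * t)
loss = lambda p: np.sum((1 - np.exp(-p[0] * t) - y) ** 2)
k_hat, _ = pattern_search(loss, [0.1])
```

No gradients of the (in practice stochastic, expensive) simulator are needed: only objective evaluations, which is the appeal of DFO here.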



A Generalized Lattice Based Probabilistic Approach for Metagenomic Clustering

08/07/2017 2:06 pm PST

Metagenomics involves the analysis of genomes of microorganisms sampled directly from their environment. Next Generation Sequencing allows a high-throughput sampling of small segments from genomes in the metagenome to generate reads. To study the properties and relationships of the microorganisms present, clustering can be performed based on the inherent composition of the sampled reads for unknown species. We propose a two-dimensional lattice based probabilistic model for clustering metagenomic datasets. The occurrence of a species in the metagenome is estimated using a lattice of probabilistic distributions over small sized genomic sequences. The two dimensions denote distributions for different sizes and groups of words, respectively. The lattice structure allows for additional support for a node from its neighbors when the probabilistic support for the species using the parameters of the current node is deemed insufficient. We also show convergence for our algorithm. We test our algorithm on simulated metagenomic data containing bacterial species and observe more than 85 percent precision. We also evaluate our algorithm on an in vitro-simulated bacterial metagenome and on human patient data, and show a better clustering than other algorithms even for short reads and varied abundance. The software and datasets can be downloaded from https://github.com/lattclus/lattice-metage.



Brain Modulyzer: Interactive Visual Analysis of Functional Brain Connectivity

08/07/2017 2:05 pm PST

We present Brain Modulyzer, an interactive visual exploration tool for functional magnetic resonance imaging (fMRI) brain scans, aimed at analyzing the correlation between different brain regions when resting or when performing mental tasks. Brain Modulyzer combines multiple coordinated views—such as heat maps, node-link diagrams, and anatomical views—using brushing and linking to provide an anatomical context for brain connectivity data. Integrating methods from graph theory and analysis, e.g., community detection and derived graph measures, makes it possible to explore the modular and hierarchical organization of functional brain networks. Providing immediate feedback by displaying analysis results instantaneously while changing parameters gives neuroscientists a powerful means to comprehend complex brain structure more effectively and efficiently and supports forming hypotheses that can then be validated via statistical analysis. To demonstrate the utility of our tool, we present two case studies—exploring progressive supranuclear palsy, as well as memory encoding and retrieval.



Circular Order Aggregation and Its Application to Cell-Cycle Genes Expressions

08/07/2017 2:05 pm PST

The aim of circular order aggregation is to find a circular order on a set of n items using angular values from p heterogeneous data sets. This problem is new in the literature and has been motivated by the biological question of finding the order among the peak expression of a group of cell cycle genes. In this paper, two very different approaches to solve the problem that use pairwise and triplewise information are proposed. Both approaches are analyzed and compared using theoretical developments and numerical studies, and applied to the cell cycle data that motivated the problem.



Calcium Ion Fluctuations Alter Channel Gating in a Stochastic Luminal Calcium Release Site Model

06/02/2017 2:01 pm PST

Stochasticity and small system size effects in complex biochemical reaction networks can greatly alter transient and steady-state system properties. A common approach to modeling reaction networks, which accounts for system size, is the chemical master equation that governs the dynamics of the joint probability distribution for molecular copy number. However, calculation of the stationary distribution is often prohibitive, due to the large state-space associated with most biochemical reaction networks. Here, we analyze a network representing a luminal calcium release site model and investigate to what extent small system size effects and calcium fluctuations, driven by ion channel gating, influx, and diffusion, alter steady-state ion channel properties including open probability. For a physiological ion channel gating model and number of channels, the state-space may contain approximately 10^6 to 10^8 elements, and a novel modified block power method is used to solve the associated dominant eigenvector problem required to calculate the stationary distribution. We demonstrate that both small local cytosolic domain volume and a small number of ion channels drive calcium fluctuations that result in deviation from the corresponding model that neglects small system size effects.
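The power-method idea behind the stationary-distribution computation can be shown on a toy two-state channel (plain power iteration on an invented transition matrix; the paper's modified block power method for 10^6 to 10^8 states is not reproduced here):

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Power iteration for the stationary distribution of a Markov chain
    with column-stochastic transition matrix P (dominant eigenvector)."""
    n = P.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_next = P @ p
        p_next /= p_next.sum()           # renormalize against drift
        if np.abs(p_next - p).max() < tol:
            break
        p = p_next
    return p_next

# Toy channel: open <-> closed with asymmetric switching probabilities.
P = np.array([[0.9, 0.2],    # P[i, j] = Pr(next = i | current = j)
              [0.1, 0.8]])
pi = stationary_distribution(P)
```

For this chain the balance condition 0.1*pi_open = 0.2*pi_closed gives pi = (2/3, 1/3), which power iteration recovers; quantities like open probability are then read directly off the stationary vector.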



On the Complexity of Duplication-Transfer-Loss Reconciliation with Non-Binary Gene Trees

06/02/2017 2:01 pm PST

Duplication-Transfer-Loss (DTL) reconciliation has emerged as a powerful technique for studying gene family evolution in the presence of horizontal gene transfer. DTL reconciliation takes as input a gene family phylogeny and the corresponding species phylogeny, and reconciles the two by postulating speciation, gene duplication, horizontal gene transfer, and gene loss events. Efficient algorithms exist for finding optimal DTL reconciliations when the gene tree is binary. However, gene trees are frequently non-binary. With such non-binary gene trees, the reconciliation problem seeks to find a binary resolution of the gene tree that minimizes the reconciliation cost. Given the prevalence of non-binary gene trees, many efficient algorithms have been developed for this problem in the context of the simpler Duplication-Loss (DL) reconciliation model. Yet, no efficient algorithms exist for DTL reconciliation with non-binary gene trees and the complexity of the problem remains unknown. In this work, we resolve this open question by showing that the problem is, in fact, NP-hard. Our reduction applies to both the dated and undated formulations of DTL reconciliation. By resolving this long-standing open problem, this work will spur the development of both exact and heuristic algorithms for this important problem.



Sorting Circular Permutations by Super Short Reversals

06/02/2017 2:02 pm PST

We consider the problem of sorting a circular permutation by super short reversals (i.e., reversals of length at most 2), a problem that finds application in comparative genomics. Polynomial-time solutions to the unsigned version of this problem are known, but the signed version remained open. In this paper, we present the first polynomial-time solution to the signed version of this problem. Moreover, we perform experiments for inferring phylogenies of two different groups of bacterial species and compare our results with the phylogenies presented in previous works. Finally, to facilitate phylogenetic studies based on the methods studied in this paper, we present a web tool for rearrangement-based phylogenetic inference using short operations, such as super short reversals.



Searching Genome-Wide Multi-Locus Associations for Multiple Diseases Based on Bayesian Inference

06/02/2017 2:02 pm PST

Taking advantage of high-throughput single nucleotide polymorphism (SNP) genotyping technology, large genome-wide association studies (GWASs) hold promise for unraveling complex relationships between genotypes and phenotypes. Current multi-locus-based methods are insufficient to detect interactions with diverse genetic effects on multifarious diseases. Also, statistical tests for high-order epistasis (two or more SNPs) raise huge computational and analytical challenges, because the computation grows exponentially with the cardinality of SNP combinations. In this paper, we provide a simple, fast, and powerful method, named DAM, using Bayesian inference to detect genome-wide multi-locus epistatic interactions in multiple diseases. Experimental results on simulated data demonstrate that our method is powerful and efficient. We also apply DAM to two GWAS datasets from WTCCC, i.e., Rheumatoid Arthritis and Type 1 Diabetes, and identify some novel findings. Therefore, we believe that our method is suitable and efficient for the full-scale analysis of multi-disease-related interactions in GWASs.



Prediction and Validation of Disease Genes Using HeteSim Scores

06/02/2017 2:02 pm PST

Deciphering gene-disease associations is an important goal in biomedical research. In this paper, we use a novel relevance measure, called HeteSim, to prioritize candidate disease genes. Two methods based on heterogeneous networks, constructed using protein-protein interactions, gene-phenotype associations, and phenotype-phenotype similarity, are presented. In HeteSim_MultiPath (HSMP), HeteSim scores of different paths are combined with a constant that dampens the contributions of longer paths. In HeteSim_SVM (HSSVM), HeteSim scores are combined with a machine learning method. Three-fold cross-validation experiments show that our non-machine-learning method HSMP performs better than existing non-machine-learning methods, while our machine learning method HSSVM obtains accuracy similar to the best existing machine learning method, CATAPULT. From the analysis of the top 10 predicted genes for different diseases, we found that HSSVM avoids a disadvantage of existing machine-learning-based methods, which tend to predict similar genes for different diseases. The data sets and Matlab code for the two methods are freely available for download at http://lab.malab.cn/data/HeteSim/index.jsp.



A Modified Multiple Alignment Fast Fourier Transform with Higher Efficiency

06/02/2017 2:01 pm PST

Multiple sequence alignment (MSA) is one of the most common tasks in bioinformatics. Multiple alignment fast Fourier transform (MAFFT) is the fastest MSA program among those whose alignment accuracy is comparable to that of the most accurate MSA programs. In this paper, we modify the correlation computation scheme of MAFFT to further improve its efficiency in three aspects. First, novel complex-number-based amino acid and nucleotide encodings are utilized in the modified correlation. Second, linear convolution with a limitation is proposed for computing the correlation of amino acid and nucleotide sequences. Third, we devise a fast Fourier transform (FFT) algorithm for computing the linear convolution. The FFT algorithm is based on the conjugate-pair split-radix FFT, does not require permutation of the order, and is new in that only the real parts of the final outputs are required. Simulation results show that the modified scheme is 107.58 to 365.74 percent faster than the original MAFFT for one execution of the function Falign() of MAFFT, indicating its faster realization.
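The FFT-correlation idea underlying MAFFT-style anchor finding can be illustrated with simple indicator encodings (a toy: MAFFT's actual amino acid/nucleotide encodings and the paper's complex-number scheme differ from this, and the sequences below are invented):

```python
import numpy as np

def fft_correlation(a, b):
    """Linear cross-correlation via FFT, zero-padded so nothing wraps."""
    n = len(a) + len(b) - 1
    fa = np.fft.rfft(a, n)
    fb = np.fft.rfft(b, n)
    return np.fft.irfft(fa * np.conj(fb), n)

def encode(seq, base):
    """Indicator signal: 1.0 where seq has this base, else 0.0."""
    return np.array([1.0 if c == base else 0.0 for c in seq])

def match_correlation(s1, s2):
    """Sum per-base indicator correlations; entry k counts character
    matches when s2 is laid over s1 at offset k."""
    n = len(s1) + len(s2) - 1
    total = np.zeros(n)
    for base in "ACGT":
        total += fft_correlation(encode(s1, base), encode(s2, base))
    return total

m = match_correlation("AAAACGTA", "ACGT")
```

The correlation peak marks the offset of a strongly matching segment (here "ACGT" at position 3), computed in O(n log n) rather than O(n^2).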



Efficient Constant-Time Complexity Algorithm for Stochastic Simulation of Large Reaction Networks

06/02/2017 2:02 pm PST

Exact stochastic simulation is an indispensable tool for the quantitative study of biochemical reaction networks. The simulation realizes the time evolution of the model by randomly choosing a reaction to fire, with probability proportional to the reaction propensity, and updating the system state accordingly. Two computationally expensive tasks in simulating large biochemical networks are the selection of next reaction firings and the update of reaction propensities due to state changes. We present in this work a new exact algorithm that optimizes both of these simulation bottlenecks. Our algorithm employs composition-rejection sampling on the propensity bounds of reactions to select the next reaction firing. The selection of next reaction firings is independent of the number of reactions, while the update of propensities is skipped and performed only when necessary. The algorithm therefore scales favorably in computational complexity when simulating large reaction networks. We benchmark our new algorithm against state-of-the-art algorithms from the literature to demonstrate its applicability and efficiency.
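For contrast with the composition-rejection approach, Gillespie's classical direct method, whose per-step cost grows with the number of reactions, can be sketched as follows (a generic illustration on an invented one-reaction system, not the paper's implementation):

```python
import random

def gillespie_direct(propensities, updates, x0, t_end, rng):
    """Gillespie's direct method. Every step rescans all propensities,
    so each step costs O(number of reactions) -- the bottleneck that
    composition-rejection sampling avoids."""
    x, t = list(x0), 0.0
    trajectory = [(0.0, list(x))]
    while t < t_end:
        a = [f(x) for f in propensities]
        a0 = sum(a)
        if a0 == 0.0:
            break                        # no reaction can fire any more
        t += rng.expovariate(a0)         # exponential waiting time
        r, acc = rng.random() * a0, 0.0
        for j, aj in enumerate(a):       # choose reaction j w.p. a[j]/a0
            acc += aj
            if r < acc:
                x = [xi + d for xi, d in zip(x, updates[j])]
                break
        trajectory.append((t, list(x)))
    return trajectory

# Irreversible isomerization A -> B, rate constant 1.0, 50 molecules of A.
rng = random.Random(42)
traj = gillespie_direct([lambda s: 1.0 * s[0]], [(-1, 1)], (50, 0), 100.0, rng)
```

Both the linear scan over propensities and the full propensity recomputation are exactly the per-step costs the paper's algorithm reduces.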



Drug-Target Interaction Prediction with Graph Regularized Matrix Factorization

06/02/2017 2:01 pm PST

Experimental determination of drug-target interactions is expensive and time-consuming. Therefore, there is a continuous demand for more accurate predictions of interactions using computational techniques. Algorithms have been devised to infer novel interactions on a global scale, where the input to these algorithms is a drug-target network (i.e., a bipartite graph where edges connect pairs of drugs and targets that are known to interact). However, these algorithms have difficulty predicting interactions involving new drugs or targets for which there are no known interactions (i.e., “orphan” nodes in the network). Since data usually lie on or near low-dimensional non-linear manifolds, we propose two matrix factorization methods that use graph regularization in order to learn such manifolds. In addition, considering that many of the non-occurring edges in the network are actually unknown or missing cases, we developed a preprocessing step to enhance predictions in the “new drug” and “new target” cases by adding edges with intermediate interaction likelihood scores. In our cross-validation experiments, our methods achieved better results than three other state-of-the-art methods in most cases. Finally, we simulated some “new drug” and “new target” cases and found that GRMF predicted the left-out interactions reasonably well.
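The graph-regularized factorization can be sketched with plain gradient descent (a toy: the interaction and similarity matrices below are invented, and the authors' actual GRMF updates and preprocessing are not reproduced):

```python
import numpy as np

def grmf(Y, Sd, St, k=2, lam=0.1, mu=0.1, steps=2000, lr=0.01, seed=0):
    """Toy gradient-descent GRMF: Y ~ A @ B.T, with Tikhonov terms and
    graph-Laplacian penalties pulling similar drugs/targets together."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    A = 0.1 * rng.standard_normal((n, k))
    B = 0.1 * rng.standard_normal((m, k))
    Ld = np.diag(Sd.sum(axis=1)) - Sd    # drug-similarity graph Laplacian
    Lt = np.diag(St.sum(axis=1)) - St    # target-similarity graph Laplacian
    for _ in range(steps):
        R = A @ B.T - Y                  # reconstruction residual
        gA = R @ B + lam * A + mu * Ld @ A
        gB = R.T @ A + lam * B + mu * Lt @ B
        A -= lr * gA
        B -= lr * gB
    return A, B

# Tiny invented interaction matrix: drugs {0,1} hit target block {0,1}.
Y = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Sd = np.array([[0.0, 1.0, 0.0],
               [1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])
A, B = grmf(Y, Sd, Sd)
err = float(np.abs(A @ B.T - Y).mean())
```

The Laplacian terms penalize factor differences between similar drugs (or targets), which is what lets orphan nodes borrow information from their neighbors.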



RNA Visualization: Relevance and the Current State-of-the-Art Focusing on Pseudoknots

06/02/2017 2:02 pm PST

RNA visualization is crucial for understanding the relationship between RNA structure and function, as well as for developing better RNA structure prediction algorithms. However, in the context of RNA visualization, one key structure remains difficult to visualize: pseudoknots. Pseudoknots occur in RNA folding when two secondary structural components form base pairs between them. The three-dimensional nature of these components makes them challenging to visualize in two-dimensional media, such as print or screens. In this review, we focus on the advancements made in the field of RNA visualization in two-dimensional media over the past two decades. The review aims to present all relevant aspects of pseudoknot visualization. We start with an overview of several pseudoknotted structures and their relevance to RNA function. Next, we discuss the theoretical basis for RNA structural topology classification and present RNA classification systems for both pseudoknotted and non-pseudoknotted RNAs. Each description of an RNA classification system is followed by a discussion of the software tools and algorithms developed to date to visualize RNA, comparing the different tools’ strengths and shortcomings.



Leveraging FPGAs for Accelerating Short Read Alignment

06/02/2017 2:02 pm PST

One of the key challenges facing genomics today is how to efficiently analyze the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialized processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper, we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to two mismatches. Our design is based on the FM-index, with optimizations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with eight Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and nine times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
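The basic FM-index backward search that such designs build on can be shown in pure Python (the n-step and oversampled variants from the paper are hardware optimizations not reflected here; the text and patterns are toy examples):

```python
from collections import Counter

def bwt(text):
    """Burrows-Wheeler transform of text (must end with the sentinel '$')."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def fm_index(text):
    L = bwt(text)
    # C[c]: number of characters in text strictly smaller than c.
    counts = Counter(L)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: occurrences of c in L[:i] (prefix counts).
    occ = {c: [0] * (len(L) + 1) for c in counts}
    for i, ch in enumerate(L):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ

def backward_search(pattern, C, occ, n):
    """Count exact occurrences of pattern via FM-index backward search."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo

text = "GATTACA$"
C, occ = fm_index(text)
hits = backward_search("TA", C, occ, len(text))
```

Each pattern character costs only two rank lookups, which is precisely the memory-access pattern the FPGA design optimizes.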



An Effective Computational Method Incorporating Multiple Secondary Structure Predictions in Topology Determination for Cryo-EM Images

06/02/2017 2:02 pm PST

A key idea in de novo modeling of a medium-resolution density image obtained from cryo-electron microscopy is to compute the optimal mapping between the secondary structure traces observed in the density image and those predicted on the protein sequence. When secondary structures are not determined precisely, either from the image or from the amino acid sequence of the protein, the computational problem becomes more complex. We present an efficient method that addresses the secondary structure placement problem in the presence of multiple secondary structure predictions and computes the optimal mapping. We tested the method using 12 simulated images from α-proteins and two Cryo-EM images of α-β proteins. We observed that the rank of the true topologies is consistently improved by using multiple secondary structure predictions instead of a single prediction. The results show that the algorithm is robust and works well even when errors/misses in the predicted secondary structures are present in the image or the sequence. The results also show that the algorithm is efficient and is able to handle proteins with as many as 33 helices.



Classical Mechanics Approach Applied to Analysis of Genetic Oscillators

06/02/2017 2:02 pm PST

Biological oscillators are a fundamental part of several regulatory mechanisms that control the response of various biological systems. Several analytical approaches for their analysis have been reported recently. They are, however, limited to specific oscillator topologies and/or give only qualitative answers, i.e., whether the dynamics of an oscillator are oscillatory for a given parameter space. Here, we present a general analytical approach that can be applied to the analysis of biological oscillators. It relies on projecting biological systems onto classical mechanics systems. The approach provides relatively accurate results regarding the type of behavior a system exhibits (i.e., oscillatory or not) and the periods of potential oscillations, without the need to conduct expensive numerical simulations. We demonstrate and verify the proposed approach on three different implementations of an amplified negative feedback oscillator.



PCID: A Novel Approach for Predicting Disease Comorbidity by Integrating Multi-Scale Data

06/02/2017 2:02 pm PST

Disease comorbidity is the presence of one or more diseases along with a primary disorder, which causes additional suffering for patients and makes standard treatments more likely to fail than in single diseases. Therefore, the identification of potential comorbidity can help prevent comorbid diseases when treating a primary disease. Unfortunately, most currently known disease comorbidities are discovered incidentally in the clinic, and our knowledge about comorbidity is far from complete. Despite the fact that many efforts have been made to predict disease comorbidity, the prediction accuracy of existing computational approaches needs to be improved. By investigating the factors underlying disease comorbidity, e.g., mutated genes and rewired protein-protein interactions (PPIs), we here present a novel algorithm to predict disease comorbidity by integrating multi-scale data ranging from genes to phenotypes. Benchmark results on real data show that our approach outperforms existing algorithms, and some of our novel predictions are validated against those reported in the literature, indicating the effectiveness and predictive power of our approach. In addition, we identify some pathway and PPI patterns that underlie the co-occurrence between a primary disease and certain disease classes, which helps explain how comorbidity is initiated from a molecular perspective.



Protein Complex Detection via Effective Integration of Base Clustering Solutions and Co-Complex Affinity Scores

06/02/2017 2:02 pm PST

With the increasing availability of protein interaction data, various computational methods have been developed to predict protein complexes. However, different computational methods have their own advantages and limitations. Ensemble clustering has thus been studied to minimize the potential bias and risk of individual methods and generate prediction results with better coverage and accuracy. In this paper, we extend traditional ensemble clustering by taking into account co-complex affinity scores and present an Ensemble Hierarchical Clustering framework (EnsemHC) to detect protein complexes. First, we construct co-cluster matrices by integrating the clustering results with the co-complex evidence. Second, we sum up the constructed co-cluster matrices to derive a final ensemble matrix via a novel iterative weighting scheme. Finally, we apply hierarchical clustering to generate protein complexes from the final ensemble matrix. Experimental results demonstrate that EnsemHC performs better than its base clustering methods and various existing integrative methods. In addition, we observed that integrating clusters and co-complex affinity scores from different data sources improves prediction performance; e.g., integrating the clusters from TAP data and co-complex affinities from binary PPI data achieved the best performance in our experiments.
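The first step of such an ensemble can be sketched as building a co-cluster (consensus) matrix from several base clustering solutions. This simplified version weights all base clusterings equally, whereas the paper's EnsemHC additionally applies an iterative weighting scheme and folds in co-complex affinities:

```python
# Consensus (co-cluster) matrix from multiple base clusterings.
# clusterings: list of label vectors, one per base method.

def co_cluster_matrix(clusterings, n_items):
    """M[i][j] = fraction of base clusterings placing items i and j together."""
    M = [[0.0] * n_items for _ in range(n_items)]
    for labels in clusterings:
        for i in range(n_items):
            for j in range(n_items):
                if labels[i] == labels[j]:
                    M[i][j] += 1.0
    k = float(len(clusterings))
    return [[v / k for v in row] for row in M]
```

Hierarchical clustering on the resulting matrix (treating 1 - M[i][j] as a distance) then yields the final complexes.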



Robust Reachability of Boolean Control Networks

06/02/2017 2:02 pm PST

Boolean networks serve as a powerful tool in the analysis of genetic regulatory networks, since they emphasize fundamental principles and establish a natural framework for capturing the dynamics of the regulation of cellular states. In this paper, the robust reachability of Boolean control networks is investigated by means of the semi-tensor product. Necessary and sufficient conditions for the robust reachability of Boolean control networks are provided, considering control inputs that do and do not depend on the disturbances, respectively. Besides, the corresponding control algorithms are developed for these two cases. A reduced model of the lac operon in Escherichia coli is presented to show the effectiveness of the presented results.
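The paper's analysis is algebraic, via the semi-tensor product; as a purely illustrative contrast, plain (non-robust) reachability of a toy Boolean control network can be checked by brute-force search over the state space, which is feasible only for very small networks. The update rules below are invented for illustration, not the paper's lac operon model:

```python
# Brute-force reachability in a small Boolean control network:
# enumerate all control inputs at each step and BFS over states.
from itertools import product

def reachable_states(update, n_control_vars, start):
    """All states reachable from `start` under some control sequence."""
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for u in product((0, 1), repeat=n_control_vars):
            nxt = update(state, u)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Toy 2-gene network with one control input u:
#   x1' = x2 AND u,   x2' = NOT x1
def toy_update(state, u):
    x1, x2 = state
    return (x2 & u[0], 1 - x1)
```

Robust reachability additionally requires the target to be reached for every admissible disturbance sequence, which is where the paper's necessary and sufficient conditions come in.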



A Heterogeneous Network Based Method for Identifying GBM-Related Genes by Integrating Multi-Dimensional Data

06/02/2017 2:02 pm PST

The emergence of multi-dimensional data offers opportunities for more comprehensive analysis of the molecular characteristics of human diseases and therefore for improving diagnosis, treatment, and prevention. In this study, we propose a heterogeneous network based method that integrates multi-dimensional data (HNMD) to identify GBM-related genes. The novelty of the method lies in combining the multi-dimensional GBM data from the TCGA dataset, which provide comprehensive information about genes, with protein-protein interactions to construct a weighted heterogeneous network that reflects both the general and the disease-specific relationships between genes. In addition, a propagation algorithm with resistance is introduced to precisely score and rank GBM-related genes. The results of a comprehensive performance evaluation show that the proposed method significantly outperforms network based methods with single-dimensional data and other existing approaches. Subsequent analysis of the top ranked genes suggests they may be functionally implicated in GBM, which further corroborates the superiority of the proposed method. The source code and the results of HNMD can be downloaded from the following URL: http://bioinformatics.ustc.edu.cn/hnmd/ .
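The "propagation with resistance" idea belongs to the broader family of network propagation methods. A minimal sketch of the standard variant, random walk with restart on a toy adjacency list, is shown below (the paper's scoring scheme differs in its resistance term; all names here are illustrative):

```python
# Random walk with restart (RWR) by power iteration on an
# unweighted adjacency list. Seeds carry the prior evidence.

def random_walk_with_restart(adj, seeds, restart=0.3, iters=100):
    """adj: {node: [neighbors]}; seeds: nodes with prior evidence.
    Returns a steady-state score per node."""
    nodes = list(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {v: restart * p0[v] for v in nodes}
        for v in nodes:
            deg = len(adj[v])
            for w in adj[v]:
                nxt[w] += (1 - restart) * p[v] / deg
        p = nxt
    return p
```

Ranking candidate genes by their steady-state score is then the prediction step; in HNMD the walk runs over the weighted heterogeneous network instead of a single PPI network.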



Genomic Distance with High Indel Costs

06/02/2017 2:02 pm PST

We determine the complexity of computing the DCJ-indel distance when DCJ and indel operations have distinct constant costs, by deriving an exact formula that can be computed in linear time for any choice of (constant) costs for DCJ and indel operations. We additionally consider the problem of triangular inequality disruption and propose an algorithmically efficient correction for each member of the DCJ-indel family of distances.



txCoords: A Novel Web Application for Transcriptomic Peak Re-Mapping

06/02/2017 2:01 pm PST

Since the development of new technologies such as RIP-Seq and m6A-seq, peak calling has become an important step in transcriptomic sequencing data analysis. However, many of the reported genomic coordinates of transcriptomic peaks are incorrect because introns are neglected, and there is currently no convenient tool to address this problem. Here, we present txCoords, a novel and easy-to-use web application for transcriptomic peak re-mapping. txCoords can be used to correct incorrectly reported transcriptomic peaks and retrieve the true sequences. It also supports visualization of the re-mapped peaks in a schematic figure or from the UCSC Genome Browser. Our web server is freely available at http://www.bioinfo.tsinghua.edu.cn/txCoords.
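The core of such re-mapping is projecting a transcript coordinate onto the genome through the transcript's exon blocks, so that intronic gaps are accounted for. A minimal sketch (forward strand only; the function name is illustrative, not taken from txCoords):

```python
# Project a 0-based transcript position onto the genome given the
# transcript's exon blocks as 0-based, half-open genomic intervals
# ordered along the transcript.

def transcript_to_genome(pos, exons):
    offset = pos
    for start, end in exons:
        length = end - start
        if offset < length:
            return start + offset
        offset -= length
    raise ValueError("position beyond transcript length")
```

A peak reported at transcript position 10 of a transcript with exons (100, 110) and (200, 220) maps to genomic position 200, not 110, which is exactly the intron-neglect error the tool corrects.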



Bi-convex Optimization to Learn Classifiers from Multiple Biomedical Annotations

06/02/2017 2:02 pm PST

The problem of constructing classifiers from multiple annotators who provide inconsistent training labels is important and occurs in many application domains. Many existing methods focus on the understanding and learning of crowd behaviors. Several probabilistic algorithms consider the construction of classifiers for specific tasks using a consensus of multiple labelers' annotations. These methods impose a prior on the consensus and develop an expectation-maximization algorithm based on the logistic regression loss. We extend the discussion to the hinge loss commonly used by support vector machines. Our formulations form bi-convex programs that construct classifiers and estimate the reliability of each labeler simultaneously. Each labeler is associated with a reliability parameter, which can be constant, class-dependent, or varying across examples. The hinge loss is modified by replacing the true labels with a weighted combination of the labelers' labels, with the reliabilities as weights. A statistical justification is discussed to motivate the use of a linear combination of labels. In parallel to the expectation-maximization algorithm for logistic-based methods, efficient alternating algorithms are developed to solve the proposed bi-convex programs. Experimental results on benchmark datasets and three real-world biomedical problems demonstrate that the proposed methods either outperform or are competitive with the state of the art.



Modeling Healthcare Quality via Compact Representations of Electronic Health Records

06/02/2017 2:02 pm PST

Increased availability of Electronic Health Record (EHR) data provides unique opportunities for improving the quality of health services. In this study, we couple EHRs with advanced machine learning tools to predict three important parameters of healthcare quality. More specifically, we describe how to learn low-dimensional vector representations of patient conditions and clinical procedures in an unsupervised manner, and generate feature vectors of hospitalized patients useful for predicting their length of stay, total incurred charges, and mortality rates. To learn the vector representations, we propose to employ state-of-the-art language models specifically designed for modeling the co-occurrence of diseases and applied clinical procedures. The proposed model is trained on a large-scale EHR database comprising more than 35 million hospitalizations in California over a period of nine years. We compared the proposed approach to several alternatives and evaluated their effectiveness by measuring the accuracy of regression and classification models used for the three predictive tasks considered in this study. Our model outperformed the baseline models on all tasks, indicating a strong potential of the proposed approach for advancing the quality of the healthcare system.



From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis

06/02/2017 2:02 pm PST

Sequence describes the primary structure of a protein, which contains important structural, characteristic, and genetic information and thereby motivates many sequence-based computational approaches to infer protein function. Among them, feature-based approaches attract increasing attention because they make predictions from a set of transformed and more biologically meaningful sequence features. However, features extracted from a sequence are usually high-dimensional and often compromised by irrelevant patterns; therefore, dimension reduction is necessary prior to classification for efficient and effective protein function prediction. A protein usually performs several different functions within an organism, which makes protein function prediction a multi-label classification problem. In machine learning, multi-label classification deals with problems where each object may belong to more than one class. As a well-known feature reduction method, linear discriminant analysis (LDA) has been successfully applied in many practical applications. It, however, is by nature designed for single-label classification, in which each object belongs to exactly one class. Because directly applying LDA in multi-label classification causes ambiguity when computing scatter matrices, we apply a new Multi-label Linear Discriminant Analysis (MLDA) approach to address this problem while preserving the powerful classification capability inherited from classical LDA. We further extend MLDA by $\ell_1$-normalization to overcome the problem of over-counting data points with multiple labels. In addition, we incorporate biological network data using Laplacian embedding into our method, and assess the reliability of predicted putative functions. Extensive empirical evaluations demonstrate promising results of our methods.



Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates

06/02/2017 2:01 pm PST

RNA-Sequencing has been the leading technology for quantifying the expression of thousands of genes simultaneously. The data analysis of an RNA-Seq experiment starts from aligning short reads to the reference genome/transcriptome or a reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of a genome. One group of these homologies is the paralogous pseudogenes. Pseudogenes arise from the duplication of protein-coding genes and have been considered degraded paralogs in the genome due to their loss of functionality. Recent studies have provided evidence to support their novel regulatory roles in biological processes. With growing interest in quantifying the expression levels of pseudogenes in different tissues or cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate expression levels among homologous genes. Previously, in PseudoLasso, we proposed a linear regression approach to learn read alignment behaviors, and to leverage this knowledge for abundance estimation and alignment correction. In this paper, we extend the work of PseudoLasso by grouping the homologous genomic regions into different communities using a community detection algorithm, followed by building a linear regression model separately for each community. The results show that this approach retains the same accuracy as PseudoLasso. By breaking the genome into smaller homologous communities, the running time is improved from quadratic to linear with respect to the number of genes.



Prognosis of Clinical Outcomes with Temporal Patterns and Experiences with One Class Feature Selection

06/07/2017 2:02 pm PST

Accurate prognosis of outcome events, such as clinical procedures or disease diagnoses, is central in medicine. The emergence of longitudinal clinical data, like the Electronic Health Record (EHR), represents an opportunity to develop automated methods for predicting patient outcomes. However, these data are high-dimensional and very sparse, complicating the application of predictive modeling techniques. Further, their temporal nature is not fully exploited by current methods; temporal abstraction, which produces a symbolic time-interval representation, was recently introduced to address this. We present Maitreya, a framework for the prediction of outcome events that leverages these symbolic time intervals. Using Maitreya, we mine temporal patterns in the clinical records that serve as prognostic markers and use these markers to train predictive models for eight clinical procedures. To decrease the number of patterns used as features, we propose three one-class feature selection methods. We evaluate the performance of Maitreya under several parameter settings, including the one-class feature selection, and compare our results to those of atemporal approaches. In general, we found that the use of temporal patterns, represented by the number of pattern occurrences, outperformed the atemporal methods.



Applications of Transductive Spectral Clustering Methods in a Military Medical Concussion Database

06/02/2017 2:02 pm PST

Traumatic brain injury (TBI) is one of the most common forms of neurotrauma and has affected more than 250,000 military service members over the last decade alone. While in battle, service members who experience TBI are at significant risk for the development of typical TBI symptoms, as well as for psychological disorders such as Post-Traumatic Stress Disorder (PTSD). As such, these service members often require intense bouts of medication and therapy in order to resume full return-to-duty status. The primary aim of this study is to identify the relationship between the administration of specific medications and reductions in symptomology such as headaches, dizziness, or light-headedness. Service members diagnosed with mild TBI (mTBI) and seen at the Concussion Restoration Care Center (CRCC) in Afghanistan were analyzed according to prescribed medications and symptomology. Here, we demonstrate that in such situations with sparse labels and small feature sets, classic analytic techniques such as logistic regression, support vector machines, naïve Bayes, random forest, decision trees, and k-nearest neighbor are not well suited for the prediction of outcomes. We attribute our findings to several issues inherent to this problem setting and discuss several advantages of spectral graph methods.



Towards Unsupervised Gene Selection: A Matrix Factorization Framework

06/02/2017 2:02 pm PST

The recent development of microarray gene expression techniques has made it possible to offer phenotype classification of many diseases. However, in gene expression data analysis, each sample is represented by quite a large number of genes, many of which are redundant or insignificant for clarifying the disease problem. Therefore, how to efficiently select the most useful genes has become one of the hottest research topics in gene expression data analysis. In this paper, a novel unsupervised two-stage coarse-fine gene selection method is proposed. In the first stage, we apply the k-means algorithm to over-cluster the genes and discard some redundant ones. In the second stage, we select the most representative genes from the remaining ones based on matrix factorization. Finally, experimental results on several data sets are presented to show the effectiveness of our method.
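The coarse-fine idea can be sketched as follows: over-cluster the gene expression profiles with k-means, then keep one representative per cluster. Note that the paper's second stage uses matrix factorization; the simpler "gene closest to its cluster centroid" rule is substituted here purely for illustration, and the deterministic initialization is a toy choice:

```python
# Stage 1: plain Lloyd's k-means over gene profiles (pure Python).
# Stage 2 (simplified stand-in): pick the member nearest each centroid.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    centroids = points[:k]  # deterministic init, for the sketch only
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids

def representatives(points, labels, centroids):
    reps = {}
    for i, (p, l) in enumerate(zip(points, labels)):
        best = reps.get(l)
        if best is None or dist2(p, centroids[l]) < dist2(points[best], centroids[l]):
            reps[l] = i
    return sorted(reps.values())
```

Over-clustering (choosing k larger than the expected number of gene groups) is what lets the first stage discard redundant genes without merging distinct expression patterns.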



Guest Editorial: Special Section on Biological Data Mining and Its Applications in Healthcare

06/02/2017 2:01 pm PST

Biologists are stepping up their efforts in understanding the biological processes that underlie disease pathways in clinical contexts. This has resulted in a flood of biological and clinical data—genomic sequences, DNA microarrays, protein interactions, biomedical images, disease pathways, etc. The rapid adoption of Electronic Health Records (EHRs) across healthcare systems, coupled with the capability of linking EHRs to research biorepositories, provides a unique opportunity for conducting large-scale Precision Medicine research. As a result, data mining techniques for knowledge discovery and for deriving data-driven insights from various data sources are increasingly important in modern biology and healthcare. The purpose of this special section is to bring together researchers in bioinformatics, healthcare informatics, and data mining to share their current research and their visions of future directions.



NovoExD: De novo Peptide Sequencing for ETD/ECD Spectra

03/27/2017 3:35 pm PST

De novo peptide sequencing using tandem mass spectrometry (MS/MS) data has become a major computational method for sequence identification in recent years. With the development of new instruments and technology, novel computational methods have emerged with enhanced performance. However, there are only a few methods focusing on ECD/ETD spectra, which mainly contain variants of $c$-ions and $z$-ions. Here, a de novo sequencing method for ECD/ETD spectra, NovoExD, is presented. NovoExD applies a new form of spectrum graph with multiple edge types (called a GMET), considers multiple peptide tags, and integrates amino acid combination (AAC) and fragment ion charge information. Its performance is compared with that of another successful de novo sequencing method, pNovo+, which has an option for ECD/ETD spectra. Experiments conducted on three different datasets show that the average full-length peptide identification accuracy of NovoExD is as high as 88.70 percent, and that NovoExD's average accuracy is more than 20 percent greater on all datasets than that of pNovo+.



Predicting Protein Functions by Using Unbalanced Random Walk Algorithm on Three Biological Networks

03/27/2017 3:30 pm PST

As the gap between sequence data and their functional annotations grows ever wider, many computational methods have been proposed to annotate the functions of unknown proteins. However, designing effective methods that make good use of various biological resources remains a big challenge for researchers due to the functional diversity of proteins. In this work, we propose a new method named ThrRW, which takes several steps of random walking on three different biological networks: the protein interaction network (PIN), the domain co-occurrence network (DCN), and the functional interrelationship network (FIN), so as to infer functional information from neighbors in the corresponding networks. Owing to the topological and structural differences among the three networks, the number of walking steps differs between them. During the walks, functional information is transferred from one network to another according to the associations between nodes in the different networks. Experimental results on S. cerevisiae data show that our method achieves better prediction performance than both the methods that consider PIN data together with GO term similarities and the methods using PIN data together with protein domain information, which verifies the effectiveness of our method in integrating multiple biological data sources.



United Complex Centrality for Identification of Essential Proteins from PPI Networks

03/27/2017 3:36 pm PST

Essential proteins are indispensable for the survival or reproduction of an organism. Identification of essential proteins is not only necessary for understanding the minimal requirements for cellular life, but also important for disease study and drug design. With the development of high-throughput techniques, a large amount of protein-protein interaction data has become available, which promotes the study of essential proteins at the network level. Up to now, though a series of computational methods have been proposed, the prediction precision still needs to be improved. In this paper, we propose a new method, United complex Centrality (UC), to identify essential proteins by integrating protein complexes with the topological features of protein-protein interaction (PPI) networks. By analyzing the relationship between essential proteins and the known protein complexes of S. cerevisiae and human, we find that proteins in complexes are more likely to be essential than proteins not included in any complex, and that proteins appearing in multiple complexes are more inclined to be essential than those appearing in only a single complex. Considering that some protein complexes generated by computational methods are inaccurate, we also provide a modified version of UC with a parameter alpha, named UC-P. The experimental results show that protein complex information helps identify essential proteins more accurately in both the S. cerevisiae and the human PPI network. The proposed method UC performs markedly better than eight previously proposed methods (DC, IC, EC, SC, BC, CC, NC, and LAC) for identifying essential proteins.
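The observation above, that complex membership correlates with essentiality, suggests scoring proteins by mixing a topological term with a complex-participation term. The sketch below uses a deliberately simplified combination (normalized degree mixed with normalized complex count via a weight alpha); the actual UC formula differs, so treat this only as an illustration of the idea:

```python
# Rank proteins by mixing normalized degree with the normalized
# number of complexes each protein appears in.

def complex_centrality(adj, complexes, alpha=0.5):
    """adj: {protein: [neighbors]}; complexes: list of sets of proteins."""
    max_deg = max(len(nbrs) for nbrs in adj.values())
    counts = {v: sum(v in cpx for cpx in complexes) for v in adj}
    max_cnt = max(counts.values()) or 1
    return {v: alpha * len(adj[v]) / max_deg
               + (1 - alpha) * counts[v] / max_cnt
            for v in adj}
```

The parameter alpha plays the same role as in UC-P: it discounts the complex term when the complexes come from noisy computational predictions.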



Identifying Spurious Interactions in the Protein-Protein Interaction Networks Using Local Similarity Preserving Embedding

03/27/2017 3:37 pm PST

In recent years, a remarkable amount of protein-protein interaction (PPI) data has become available owing to advances in experimental high-throughput technologies. However, experimentally detected PPI data usually contain a large number of spurious links, which can contaminate analyses of the biological significance of protein links and lead to incorrect biological discoveries, posing new challenges to both computational and biological scientists. In this paper, we develop a new embedding algorithm called local similarity preserving embedding (LSPE) to rank the interaction possibility of protein links. By going beyond the limitations of current geometric embedding methods for network denoising and emphasizing the local information of PPI networks, LSPE avoids the instability of previous methods. We present experimental results on benchmark PPI networks and show that LSPE was the overall leader, outperforming state-of-the-art methods on the topological false-link elimination problem.



An Approach for Peptide Identification by De Novo Sequencing of Mixture Spectra

03/27/2017 3:39 pm PST

Mixture spectra, which result from the concurrent fragmentation of multiple precursors, occur quite frequently in a typical wet-lab mass spectrometry experiment. The ability to efficiently and confidently identify mixture spectra is essential to alleviate the existing bottleneck of low mass spectrum identification rates. However, most traditional computational methods are not suitable for interpreting mixture spectra, because they still assume that an acquired spectrum comes from the fragmentation of a single precursor. In this manuscript, we formulate the mixture spectra de novo sequencing problem mathematically and propose a dynamic programming algorithm for it. Additionally, we use both simulated and real mixture spectra data sets to verify the merits of the proposed algorithm.



A Two-Phase Improved Correlation Method for Automatic Particle Selection in Cryo-EM

03/27/2017 3:33 pm PST

Particle selection from cryo-electron microscopy (Cryo-EM) images is very important for high-resolution reconstruction of macromolecular structure. The methods of particle selection can be roughly grouped into two classes: template-matching methods and feature-based methods. In general, template-matching methods usually generate better results than feature-based methods. However, the accuracy of template-matching methods is restricted by the noise and low contrast of Cryo-EM images. Moreover, the processing speed of template-matching methods, restricted by the random orientation of particles, further limits their practical application. In this paper, combining the advantages of feature-based and template-matching methods, we present a two-phase improved correlation method for automatic, fast particle selection. In Phase I, we generate a preliminary particle set using rotation-invariant features of particles. In Phase II, we filter the preliminary particle set using a correlation method to reduce the interference of the high-noise background and improve the precision of particle selection. We apply several optimization strategies, including a modified AdaBoost algorithm, a divide-and-conquer technique, a cascade strategy, and GPU parallelization, to improve feature recognition ability and reduce processing time. In addition, we develop two correlation score functions for different correlation situations. Experimental results on a benchmark of Cryo-EM images show that our method can improve the accuracy and processing speed of particle selection significantly.



Microbiome Data Representation by Joint Nonnegative Matrix Factorization with Laplacian Regularization

03/27/2017 3:27 pm PST

Microbiome datasets often comprise different representations or views that provide complementary information for understanding microbial communities, such as metabolic pathways, taxonomic assignments, and gene families. Data integration methods, including approaches based on nonnegative matrix factorization (NMF), combine multi-view data to create a comprehensive view of a given microbiome study. In this paper, we propose a novel variant of NMF called Laplacian regularized joint non-negative matrix factorization (LJ-NMF) for integrating functional and phylogenetic profiles from the Human Microbiome Project (HMP). We compare the performance of this method to other variants of NMF. The experimental results indicate that the proposed method offers an efficient framework for microbiome data analysis.



Optimizing Analytical Depth and Cost Efficiency of IEF-LC/MS Proteomics

03/27/2017 3:32 pm PST

IEF LC-MS/MS is an analytical method that incorporates a two-step sample separation prior to MS identification of proteins. When analyzing complex samples, this preparatory separation allows for greater analytical depth and improved protein quantification accuracy. However, cost and analysis time are greatly increased, as each analyzed IEF fraction is separately profiled using LC-MS/MS. We propose an approach that selects a subset of IEF fractions for LC-MS/MS analysis that is highly informative with respect to a group of proteins of interest. Specifically, our method allows a significant reduction in cost and instrument time compared to the standard protocol of running all fractions, with little compromise in coverage. We develop algorithms to optimize the selection of the IEF fractions on which to run LC-MS/MS. We translate the fraction optimization task to Minimum Set Cover, a well-studied NP-hard problem. We develop heuristic solutions and compare them in terms of effectiveness and running time. We provide examples to demonstrate the advantages and limitations of each algorithmic approach. Finally, we test our methodology by applying it to experimental data obtained from IEF LC-MS/MS analysis of yeast and human samples, demonstrating the benefit of this approach for analyzing complex samples with a focus on different protein sets of interest.
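Once the task is cast as Minimum Set Cover (fractions are the subsets, proteins of interest the universe), the classic greedy heuristic, repeatedly taking the fraction that covers the most still-uncovered proteins, gives a ln(n)-approximation. A sketch (identifiers are illustrative, not the paper's):

```python
# Greedy Minimum Set Cover: at each step pick the subset with the
# largest number of still-uncovered elements.

def greedy_set_cover(universe, subsets):
    """subsets: {fraction_id: set_of_proteins}. Returns chosen fraction ids."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda f: len(subsets[f] & uncovered))
        gain = subsets[best] & uncovered
        if not gain:
            raise ValueError("universe not coverable by given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen
```

Whether the paper's heuristics follow this exact greedy rule is not stated in the abstract; the sketch only illustrates the reduction it describes.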



Muscle Tissue Labeling of Human Lower Limb in Multi-Channel mDixon MR Imaging: Concepts and Applications

03/27/2017 3:28 pm PST

With increasing resolutions and numbers of acquisitions, medical imaging increasingly requires computer support for interpretation, as currently not all imaging data are fully used. In our work, we show how multi-channel images can be used for robust air masking and reliable muscle tissue detection in the human lower limb. We exploit additional channels that are usually discarded in clinical routine, using the common mDixon acquisition protocol for MR imaging. A series of thresholding, morphological, and connectivity operations is used for processing. We demonstrate our fully automated approach on four subjects and present a comparison with manual labeling. We discuss how this work is used for advanced and intuitive visualization, the quantification of tissue types, pose estimation, and the initialization of further segmentation methods, and how it could be used in clinical environments.



Analysis of Organization of the Interactome Using Dominating Sets: A Case Study on Cell Cycle Interaction Networks

03/27/2017 3:36 pm PST

In this study, a minimum dominating set based approach was developed and implemented as a Cytoscape plugin to identify critical and redundant proteins in a protein interaction network. We focused on the investigation of the properties associated with critical proteins in the context of the analysis of interaction networks specific to the cell cycle in both yeast and human. A total of 132 yeast genes and 129 human proteins have been identified as critical nodes, while 950 proteins in yeast and 980 in human have been categorized as redundant nodes. A clear distinction between critical and redundant proteins was observed when examining their topological parameters, including betweenness centrality, suggesting a central role of critical proteins in the control of a network. Significant differences in gene coexpression and functional similarity were observed between the two sets of proteins in yeast. Critical proteins were found to be enriched with essential genes in both networks and to have a more deleterious effect on network integrity than their redundant counterparts. Furthermore, we obtained statistically significant enrichments of proteins that govern human diseases, including cancer-related and virus-targeted genes, in the corresponding set of critical proteins.
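
Computing a minimum dominating set exactly is NP-hard; the standard greedy heuristic illustrates the notion of domination that separates critical from redundant nodes. This sketch is illustrative only and is not the plugin's algorithm:

```python
def greedy_dominating_set(adj):
    """Greedy heuristic for a small dominating set of an undirected graph.

    adj: dict mapping each node to the set of its neighbors.
    Returns a set D such that every node is in D or adjacent to a node in D.
    (Illustrative only; a minimum dominating set requires exact methods.)
    """
    nodes = set(adj)
    dominated, dom_set = set(), set()
    while dominated != nodes:
        # pick the node that newly dominates the most nodes
        best = max(nodes - dom_set,
                   key=lambda v: len(({v} | adj[v]) - dominated))
        dom_set.add(best)
        dominated |= {best} | adj[best]
    return dom_set
```

For a star-shaped interaction neighborhood, the hub alone dominates the graph, matching the intuition that highly central proteins end up in the dominating set.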



Improving Recognition of Antimicrobial Peptides and Target Selectivity through Machine Learning and Genetic Programming

03/27/2017 3:34 pm PST

Growing bacterial resistance to antibiotics is spurring research on utilizing naturally-occurring antimicrobial peptides (AMPs) as templates for novel drug design. While experimentalists mainly focus on systematic point mutations to measure the effect on antibacterial activity, the computational community seeks to understand, in a machine learning setting, what determines such activity, that is, to identify the biological signals or features that govern it. In this paper, we advance research in this direction through a novel method that constructs and selects complex sequence-based features which capture information about distal patterns within a peptide. Comparative analysis with state-of-the-art methods in AMP recognition reveals our method is not only among the top performers, but also provides transparent summarizations of antibacterial activity at the sequence level. Moreover, this paper demonstrates for the first time the capability not only to recognize whether a peptide is an AMP but also to predict its target selectivity based on models of activity against only Gram-positive, only Gram-negative, or both types of bacteria. The work described in this paper is a step forward in computational research seeking to facilitate AMP design or modification in the wet laboratory.



Discovering Protein-DNA Binding Cores by Aligned Pattern Clustering

03/27/2017 3:29 pm PST

Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and gene regulation. Because the relevant experiments are expensive, it is promising to discover binding cores, together with their variations, directly from sequence data. Although existing computational methods have produced satisfactory results, they yield one-to-one mappings with no site-specific information on residue/nucleotide variations, even though such variations in binding cores may impact binding specificity. This study presents a new representation for modeling binding cores by incorporating variations, and an algorithm to discover them from sequence data alone. Our algorithm takes protein and DNA sequences from TRANSFAC (a Protein-DNA Binding Database) as input; discovers from both sets of sequences conserved regions in Aligned Pattern Clusters (APCs); associates them as Protein-DNA Co-Occurring APCs; ranks the Protein-DNA Co-Occurring APCs according to their co-occurrence; and, among the top ones, finds three-dimensional structures to support each binding core candidate. If successful, candidates are verified as binding cores. Otherwise, homology modeling is applied to their close matches in PDB to attain new chemically feasible binding cores. Our algorithm obtains binding cores with higher precision and much faster runtime ($\geq$ 1,600x) than its contemporaries, discovering candidates that do not co-occur as one-to-one associated patterns in the raw data. Availability: http://www.pami.uwaterloo.ca/~ealee/files/tcbbPnDna2015/Release.zip.



Multi-View Clustering of Microbiome Samples by Robust Similarity Network Fusion and Spectral Clustering

03/27/2017 3:27 pm PST

Microbiome datasets often comprise different representations, or views, which provide complementary information, such as genes, functions, and taxonomic assignments. Integrating multi-view information for clustering microbiome samples can create a comprehensive view of a given microbiome study. Similarity network fusion (SNF) can efficiently integrate similarities built from each view of the data into a unique network that represents the full spectrum of the underlying data. Based on this method, we develop a Robust Similarity Network Fusion (RSNF) approach which combines the strength of random forests with the advantage of SNF at data aggregation. Experimental results indicate the strength of the proposed strategy: the method substantially improves clustering performance compared to several state-of-the-art methods on several datasets.
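
The fusion-then-cluster pipeline can be sketched in a much-simplified form: row-normalize each view's similarity matrix, average them (a crude stand-in for SNF's iterative cross-view diffusion), then embed the fused network spectrally before running k-means. This illustrates the general idea only, not RSNF itself:

```python
import numpy as np

def fuse_similarities(sims):
    """Average row-normalized similarity matrices from several views.
    (A much-simplified stand-in for SNF's iterative cross-view diffusion.)"""
    norm = [s / s.sum(axis=1, keepdims=True) for s in sims]
    return sum(norm) / len(norm)

def spectral_embedding(W, k):
    """Embed samples with the k smallest eigenvectors of the graph Laplacian,
    ready for k-means in the reduced space."""
    S = (W + W.T) / 2                    # symmetrize the fused network
    L = np.diag(S.sum(axis=1)) - S       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigh returns ascending eigenvalues
    return vecs[:, :k]
```

Because each normalized view is row-stochastic, the fused matrix is too, so it can be read as transition probabilities over the sample network.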



A New Scheme to Characterize and Identify Protein Ubiquitination Sites

03/27/2017 3:39 pm PST

Protein ubiquitination, involving the conjugation of ubiquitin on a lysine residue, serves as an important modulator of many cellular functions in eukaryotes. Recent advancements in proteomic technology have stimulated increasing interest in identifying ubiquitination sites. However, most computational tools for predicting ubiquitination sites are focused on small-scale data. With an increasing number of experimentally verified ubiquitination sites, we were motivated to design a predictive model for identifying lysine ubiquitination sites on large-scale proteome datasets. This work assessed not only single features, such as amino acid composition (AAC), amino acid pair composition (AAPC), and evolutionary information, but also the effectiveness of incorporating two or more features into a hybrid approach to model construction. The support vector machine (SVM) was applied to generate the prediction models for ubiquitination site identification. Evaluation by five-fold cross-validation showed that the SVM models learned from the combination of hybrid features delivered better prediction performance. Additionally, a motif discovery tool, MDDLogo, was adopted to characterize the potential substrate motifs of ubiquitination sites. The SVM models integrating the MDDLogo-identified substrate motifs could yield an average accuracy of 68.70 percent. Furthermore, independent testing showed that the MDDLogo-clustered SVM models could provide a promising accuracy (78.50 percent) and perform better than other prediction tools. Two case studies demonstrate the effective prediction of ubiquitination sites with their corresponding substrate motifs.
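
The AAC and AAPC encodings mentioned above are straightforward to compute from a sequence window around a candidate lysine. A hedged sketch of both feature extractors (the window size and the exact encoding used by the authors may differ):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(window):
    """Amino acid composition (AAC): the frequency of each of the 20
    standard residues in the window (a common SVM input encoding)."""
    n = len(window) or 1
    return [window.count(a) / n for a in AMINO_ACIDS]

def aapc_features(window):
    """Amino acid pair composition (AAPC): frequencies of all 400
    ordered residue pairs occurring at adjacent positions."""
    pairs = [a + b for a in AMINO_ACIDS for b in AMINO_ACIDS]
    n = max(len(window) - 1, 1)
    counts = {}
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(p, 0) / n for p in pairs]
```

A hybrid feature vector, as assessed in the paper, would simply concatenate such encodings (20 + 400 dimensions here) before training the SVM.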



Optimal Landmark Selection for Registration of 4D Confocal Image Stacks in Arabidopsis

03/27/2017 3:30 pm PST

Technologically advanced imaging techniques have allowed us to generate and study the internal part of a tissue over time by capturing serial optical images that contain spatio-temporal slices of hundreds of tightly packed cells. Image registration of such live-imaging datasets of developing multicellular tissues is an essential component of all image analysis pipelines. In this paper, we present a fully automated 4D (X-Y-Z-T) registration method for live imaging stacks that corrects both temporal and spatial misalignments. We present a novel landmark selection methodology for cases where the shape features of individual cells are neither of high quality nor highly distinguishable. The proposed registration method finds the best image slice correspondence from consecutive image stacks to account for vertical growth in the tissue and the discrepancy in the choice of the starting focal point. Then, it uses a local graph-based approach to automatically find corresponding landmark pairs, and finally the registration parameters are used to register the entire image stack. The proposed registration algorithm, combined with an existing tracking method, is tested on multiple image stacks of tightly packed cells of the Arabidopsis shoot apical meristem, and the results show that it significantly improves the accuracy of cell lineages and division statistics.



Extending and Applying Spartan to Perform Temporal Sensitivity Analyses for Predicting Changes in Influential Biological Pathways in Computational Models

03/27/2017 3:31 pm PST

Through integrating real-time imaging, computational modelling, and statistical analysis approaches, previous work has suggested that the induction of and response to cell adhesion factors is the key initiating pathway in early lymphoid tissue development, in contrast to the previously accepted view that the process is triggered by chemokine-mediated cell recruitment. These model-derived hypotheses were developed using spartan, an open-source sensitivity analysis toolkit designed to establish and understand the relationship between a computational model and the biological system that the model captures. Here, we extend the functionality available in spartan to permit statistical analyses that contrast the behavior exhibited by a computational model at various simulated time-points, enabling a temporal analysis that can suggest whether the influence of biological mechanisms changes over time. We exemplify this extended functionality by using the computational model of lymphoid tissue development as a time-lapse tool. By generating results at twelve-hour intervals, we show how the extensions to spartan have been used to suggest that lymphoid tissue development could be biphasic, and to predict the time-point when a switch in the influence of biological mechanisms might occur.



Algorithms and Complexity Results for Genome Mapping Problems

03/27/2017 3:33 pm PST

Genome mapping algorithms aim at computing an ordering of a set of genomic markers based on local ordering information such as adjacencies and intervals of markers. In most genome mapping models, markers are assumed to occur uniquely in the resulting map. We introduce algorithmic questions that consider repeats, i.e., markers that can have several occurrences in the resulting map. We show that, provided with an upper bound on the copy number of repeated markers and with intervals that span full repeat copies, called repeat spanning intervals, the problem of deciding if a set of adjacencies and repeat spanning intervals admits a genome representation is tractable if the target genome can contain linear and/or circular chromosomal fragments. We also show that extracting a maximum cardinality or weight subset of repeat spanning intervals given a set of adjacencies that admits a genome realization is NP-hard but fixed-parameter tractable in the maximum copy number and the number of adjacent repeats, and tractable if intervals contain a single repeated marker.
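
For the repeat-free special case, realizability as linear and/or circular fragments reduces to a degree condition on the adjacency graph, since any graph with maximum degree two decomposes into simple paths and cycles. A sketch of this marker-level check (the paper's full model additionally handles repeats and repeat spanning intervals):

```python
from collections import Counter

def realizable_as_fragments(adjacencies):
    """Simplified check: with unique (non-repeated) markers, a set of
    marker adjacencies is realizable as linear and/or circular chromosomal
    fragments iff every marker occurs in at most two adjacencies,
    so the adjacency graph decomposes into simple paths and cycles.
    (Illustrative only; the paper's model covers repeats and intervals.)"""
    deg = Counter()
    for a, b in adjacencies:
        deg[a] += 1
        deg[b] += 1
    return all(d <= 2 for d in deg.values())
```

For example, the chain m1-m2-m3 is realizable as a single linear fragment, whereas three adjacencies all incident to one marker cannot be laid out on any mix of linear or circular chromosomes.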



Genome-Wide Semi-Automated Annotation of Transporter Systems

03/27/2017 3:30 pm PST

Usually, transport reactions are added to genome-scale metabolic models (GSMMs) based on experimental data and literature. This approach does not allow associating specific genes with transport reactions, which impairs the ability of the model to predict effects of gene deletions. Novel methods for systematic genome-wide transporter functional annotation and their integration into GSMMs are therefore necessary. In this work, an automatic system to detect and classify all potential membrane transport proteins for a given genome and integrate the related reactions into GSMMs is proposed, based on the identification and classification of genes that encode transmembrane proteins. The Transport Reactions Annotation and Generation (TRIAGE) tool identifies the metabolites transported by each transmembrane protein and its transporter family. The localization of the carriers is also predicted and, consequently, their action is confined to a given membrane. The integration of the data provided by TRIAGE with highly curated models allowed the identification of new transport reactions. TRIAGE is included in the new release of merlin, a software tool previously developed by the authors, which expedites the GSMM reconstruction processes.



A Resolution of the Static Formulation Question for the Problem of Computing the History Bound

03/27/2017 3:36 pm PST

Evolutionary data has been traditionally modeled via phylogenetic trees; however, branching alone cannot model conflicting phylogenetic signals, so networks are used instead. Ancestral recombination graphs (ARGs) are used to model the evolution of incompatible sets of SNP data, allowing each site to mutate only once. The model often aims to minimize the number of recombinations. Similarly, incompatible cluster data can be represented by a reticulation network that minimizes reticulation events. The ARG literature has traditionally been disjoint from the reticulation network literature. By building on results from the reticulation network literature, we resolve an open question of interest to the ARG community. We explicitly prove that the History Bound, a lower bound on the number of recombinations in an ARG for a binary matrix, which was previously only defined procedurally, is equal to the minimum number of reticulation nodes in a network for the corresponding cluster data. To facilitate the proof, we give an algorithm that constructs this network using intermediate values from the procedural History Bound definition. We then develop a top-down algorithm for computing the History Bound, which has the same worst-case runtime as the known dynamic program, and show that it is likely to run faster in typical cases.



A Flexible Computational Framework Using R and Map-Reduce for Permutation Tests of Massive Genetic Analysis of Complex Traits

03/27/2017 3:38 pm PST

In quantitative trait locus (QTL) mapping, the significance of a putative QTL is often determined using permutation testing. The computational needs for calculating the significance level are immense: $10^4$ up to $10^8$ or even more permutations can be needed. We have previously introduced the PruneDIRECT algorithm for multiple QTL scans with epistatic interactions. This algorithm has specific strengths for permutation testing. Here, we present a flexible, parallel computing framework for identifying multiple interacting QTL using the PruneDIRECT algorithm, which uses the map-reduce model as implemented in Hadoop. The framework is implemented in R, a widely used software tool among geneticists. This enables users to rearrange algorithmic steps to adapt genetic models, search algorithms, and parallelization steps to their needs in a flexible way. Our work underlines the maturity of accessing distributed parallel computing for computationally demanding bioinformatics applications through building workflows within existing scientific environments. We investigate the PruneDIRECT algorithm, comparing its performance to exhaustive search and the DIRECT algorithm using our framework on a public cloud resource. We find that PruneDIRECT is vastly superior for permutation testing, performing $2 \times 10^5$ permutations for a 2D QTL problem in $15$ hours using $100$ cloud processes. We show that our framework scales out almost linearly for a 3D QTL search.[...]
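
The core of a permutation test is simple to sketch: shuffle phenotypes to break any genotype-phenotype association and count how often the permuted statistic matches or exceeds the observed one. The statistic below (absolute difference of group means) is a placeholder for illustration, not the framework's actual QTL scan:

```python
import random

def permutation_pvalue(stat, genotype, phenotype, n_perm=2000, seed=42):
    """Empirical permutation p-value for an association statistic.
    Phenotypes are shuffled to break the genotype-phenotype link;
    the +1 correction keeps the estimate valid (never exactly zero)."""
    observed = stat(genotype, phenotype)
    rng = random.Random(seed)
    perm = list(phenotype)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if stat(genotype, perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

def mean_difference(genotype, phenotype):
    """Toy statistic: absolute difference in mean phenotype between
    the two genotype groups (coded 0 and 1)."""
    g1 = [p for g, p in zip(genotype, phenotype) if g == 1]
    g0 = [p for g, p in zip(genotype, phenotype) if g == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))
```

Each permutation is independent of the others, which is exactly why the workload maps cleanly onto a map-reduce model: permutations are distributed in the map phase and the exceedance counts are summed in the reduce phase.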



D-Map: Random Walking on Gene Network Inference Maps Towards Differential Avenue Discovery

03/27/2017 3:34 pm PST

Differential rewiring of cellular interaction networks between disease and healthy states is of great importance. Through a systems-level approach, malfunctioning mechanisms that are absent in normal cases may highlight the key players, in terms of genes and their interaction chains, related to disease. We have developed D-Map, a publicly available, user-friendly web application capable of generating and manipulating advanced differential networks by combining state-of-the-art inference reconstruction methods with random walk simulations. The inputs are expression profiles obtained from the Gene Expression Omnibus and a gene list under investigation. Differential networks may be visualized and interpreted through the D-Map interface, where the disease, normal, and common states can be displayed interactively. A case study concerning Alzheimer's disease, as well as breast, lung, and bladder cancer, was conducted in order to demonstrate the usefulness of the proposed methodology for different disease types. Findings were consistent with the current literature, and the provided interaction lists may be further explored towards novel biological insights into the investigated diseases. The D-Map web application is available at: http://bioserver-3.bioacademy.gr/Bioserver/DMap/index.php.



Building Ancestral Recombination Graphs for Whole Genomes

03/27/2017 3:32 pm PST

We propose a heuristic algorithm, called ARG4WG, to build plausible ancestral recombination graphs (ARGs) from thousands of whole genome samples. By using the longest shared end for recombination inference, ARG4WG constructs ARGs with small numbers of recombination events that perform well in association mapping on genome-wide association studies.



A Linear Bound on the Number of States in Optimal Convex Characters for Maximum Parsimony Distance

03/27/2017 3:29 pm PST

Given two phylogenetic trees on the same set of taxa $X$ , the maximum parsimony distance $d_\mathrm{MP}$ is defined as the maximum, ranging over all characters $\chi$ on $X$ , of the absolute difference in parsimony score induced by $\chi$ on the two trees. In this note, we prove that for binary trees there exists a character achieving this maximum that is convex on one of the trees (i.e., the parsimony score induced on that tree is equal to the number of states in the character minus 1) and such that the number of states in the character is at most $7d_\mathrm{MP}-5$ . This is the first non-trivial bound on the number of states required by optimal characters, convex or otherwise. The result potentially has algorithmic significance because, unlike general characters, convex characters with a bounded number of states can be enumerated in polynomial time.[...]
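
The parsimony score induced by a character on a tree, which $d_\mathrm{MP}$ compares across the two trees, can be computed with Fitch's small-parsimony algorithm. A sketch on rooted binary trees encoded as nested pairs (an assumed encoding chosen for illustration):

```python
def fitch_parsimony(tree, states):
    """Fitch's algorithm: minimum number of state changes needed to
    explain a character on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping leaf name -> character state."""
    def post_order(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (ls, lc), (rs, rc) = post_order(node[0]), post_order(node[1])
        common = ls & rs
        if common:                     # child sets agree: no extra change
            return common, lc + rc
        return ls | rs, lc + rc + 1    # disagreement: charge one change
    return post_order(tree)[1]
```

Running it with the same character on two conflicting trees gives two scores whose absolute difference lower-bounds $d_\mathrm{MP}$; the theorem above bounds how many states an optimal (convex) character needs.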



Metabolic Flux Analysis in Isotope Labeling Experiments Using the Adjoint Approach

03/27/2017 3:35 pm PST

Comprehension of metabolic pathways is considerably enhanced by metabolic flux analysis in isotope labeling experiments (MFA-ILE). The balance equations are given by hundreds of algebraic (stationary MFA) or ordinary differential equations (nonstationary MFA), and reducing the number of operations is therefore a crucial part of reducing the computation cost. The main bottleneck for deterministic algorithms is the computation of derivatives, particularly for nonstationary MFA. In this article, we explain how the overall identification process may be sped up by using the adjoint approach to compute the gradient of the residual sum of squares. The proposed approach shows significant improvements in terms of complexity and computation time when compared with the usual (direct) approach. Numerical results are obtained for the central metabolic pathways of Escherichia coli and are validated against reference software in the stationary case. The methods and algorithms described in this paper are included in the sysmetab software package distributed under an Open Source license at http://forge.scilab.org/index.php/p/sysmetab/.



Pubcast and Genecast: Browsing and Exploring Publications and Associated Curated Content in Biology Through Mobile Devices

03/27/2017 3:37 pm PST

Services such as Facebook, Amazon, and eBay were once solely accessed from stationary computers. These web services are now being used increasingly on mobile devices. We acknowledge this new reality by providing users a way to access publications and a curated cancer mutation database on their mobile device with daily automated updates. Availability: http://hive.biochemistry.gwu.edu/tools/HivePubcast.



A Characterization of Minimum Spanning Tree-Like Metric Spaces

03/27/2017 3:36 pm PST

Recent years have witnessed a surge of biological interest in the minimum spanning tree (MST) problem for its relevance to automatic model construction using the distances between data points. Despite the increasing use of MST algorithms for this purpose, the goodness-of-fit of an MST to the data is often elusive because no quantitative criteria have been developed to measure it. Motivated by this, we provide a necessary and sufficient condition to ensure that a metric space on $n$ points can be represented by a fully labeled tree on $n$ vertices, and thereby determine when an MST preserves all pairwise distances between points in a finite metric space.
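
The goodness-of-fit question can be probed directly: build an MST and test whether tree path lengths reproduce every pairwise distance. This brute-force check (Prim's algorithm plus path sums) illustrates the property being characterized; it is not the paper's closed-form condition:

```python
def prim_mst(points, d):
    """MST edges over `points` with distance function d(u, v) (Prim)."""
    points = list(points)
    in_tree = {points[0]}
    edges = []
    while len(in_tree) < len(points):
        u, v = min(((u, v) for u in in_tree
                    for v in points if v not in in_tree),
                   key=lambda e: d(*e))
        edges.append((u, v))
        in_tree.add(v)
    return edges

def mst_preserves_distances(points, d):
    """Check whether path lengths in the MST reproduce every pairwise
    distance, i.e., whether the metric is MST-like in this brute-force sense."""
    points = list(points)
    edges = prim_mst(points, d)
    adj = {p: [] for p in points}
    for u, v in edges:
        adj[u].append((v, d(u, v)))
        adj[v].append((u, d(u, v)))
    for src in points:
        dist = {src: 0.0}
        stack = [src]
        while stack:             # tree traversal: paths are unique
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        if any(abs(dist[t] - d(src, t)) > 1e-9 for t in points):
            return False
    return True
```

Points on a line pass the check (the MST is the line itself), while three mutually equidistant points fail it: any spanning tree forces one pair onto a path of length two.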



A Novel Cluster Head Selection Algorithm Based on Fuzzy Clustering and Particle Swarm Optimization

02/14/2017 10:40 am PST

An important objective of a wireless sensor network is to prolong the network life cycle, and topology control is of great significance for extending it. Building on previous work, we propose a solution for cluster head selection in hierarchical topology control based on fuzzy clustering preprocessing and particle swarm optimization. More specifically, first, a fuzzy clustering algorithm is used for the initial clustering of sensor nodes according to geographical location, where a sensor node belongs to a cluster with a determined probability, and the number of initial clusters is analyzed and discussed. Furthermore, the fitness function is designed considering both the energy consumption and distance factors of the wireless sensor network. Finally, the cluster head nodes in the hierarchical topology are determined based on the improved particle swarm optimization. Experimental results show that, compared with traditional methods, the proposed method reduces the mortality rate of nodes and extends the network life cycle.
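
A fitness function combining energy and distance might look like the following sketch. The weighting `alpha`, the inverse-energy term, and the distance terms are all illustrative assumptions, not the paper's formula:

```python
import math

def cluster_head_fitness(heads, nodes, energy, base_station, alpha=0.5):
    """Hypothetical PSO fitness for a candidate set of cluster heads:
    (i) penalize heads with low residual energy, and
    (ii) penalize total distance from each node to its nearest head
    plus head-to-base-station distances. Lower is better.
    All terms and the alpha weighting are illustrative assumptions."""
    energy_term = sum(1.0 / (energy[h] + 1e-9) for h in heads)
    dist_term = (sum(min(math.dist(nodes[n], nodes[h]) for h in heads)
                     for n in nodes)
                 + sum(math.dist(nodes[h], base_station) for h in heads))
    return alpha * energy_term + (1 - alpha) * dist_term
```

Each PSO particle would encode one candidate set of heads, and the swarm would minimize this fitness over iterations.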



Fireworks Algorithm with Enhanced Fireworks Interaction

02/14/2017 10:40 am PST

As a relatively new metaheuristic in swarm intelligence, fireworks algorithm (FWA) has exhibited promising performance on a wide range of optimization problems. This paper aims to improve FWA by enhancing fireworks interaction in three aspects: 1) Developing a new Gaussian mutation operator to make sparks learn from more exemplars; 2) Integrating the regular explosion operator of FWA with the migration operator of biogeography-based optimization (BBO) to increase information sharing; 3) Adopting a new population selection strategy that enables high-quality solutions to have high probabilities of entering the next generation without incurring high computational cost. The combination of the three strategies can significantly enhance fireworks interaction and thus improve solution diversity and suppress premature convergence. Numerical experiments on the CEC 2015 single-objective optimization test problems show the effectiveness of the proposed algorithm. The application to a high-speed train scheduling problem also demonstrates its feasibility in real-world optimization problems.
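
Strategy 1), a Gaussian spark that learns from an exemplar, might be sketched as follows. The step form is an illustrative simplification, not the paper's exact operator:

```python
import random

def gaussian_spark(firework, exemplar, rng):
    """One 'exemplar-learning' Gaussian mutation spark: the firework is
    pulled toward (or pushed past) an exemplar solution by a single
    Gaussian-scaled step applied to every dimension.
    (Illustrative simplification of the enhanced FWA operator.)"""
    g = rng.gauss(0.0, 1.0)
    return [x + g * (e - x) for x, e in zip(firework, exemplar)]

rng = random.Random(7)
spark = gaussian_spark([0.0, 0.0], [1.0, 1.0], rng)
```

Because one Gaussian coefficient scales the whole step, the spark always lies on the line through the firework and its exemplar, which is what lets sparks share information with better solutions.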