Preview: IEEE/ACM Transactions on Computational Biology and Bioinformatics
IEEE/ACM Transactions on Computational Biology and Bioinformatics
The IEEE/ACM Transactions on Computational Biology and Bioinformatics is a new quarterly that will publish archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and co
Published: Mon, 3 Nov 2014 15:35:50 GMT
PrePrint: Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware
Multiple sequence alignment (MSA) constitutes an extremely powerful tool for many biological applications including phylogenetic tree estimation, secondary structure prediction, and critical residue identification. However, aligning large biological sequences with popular tools such as MAFFT requires long runtimes on sequential architectures. Due to the ever increasing sizes of sequence databases, there is increasing demand to accelerate this task. In this paper, we demonstrate how Graphic Processing Units (GPUs), powered by the Compute Unified Device Architecture (CUDA), can be used as an efficient computational platform to accelerate the MAFFT algorithm. To fully exploit the GPU’s capabilities for accelerating MAFFT, we have optimized the sequence data organization to eliminate the bandwidth bottleneck of memory access, designed a memory allocation and reuse strategy to make full use of limited memory of GPUs, proposed a new modified-run-length encoding (MRLE) scheme to reduce memory consumption, and used high-performance shared memory to speed up I/O operations. Our implementation tested in three NVIDIA GPUs achieves speedup up to 11.28 on a Tesla K20m GPU compared to the sequential MAFFT 7.015.
PrePrint: Tractable cases of (*,2)-bounded parsimony haplotyping
Parsimony haplotyping is the problem of finding a set of haplotypes of minimum cardinality that explains a given set of genotypes, where a genotype is explained by two haplotypes if it can be obtained as a combination of the two. This problem is NP-complete in the general case, but polynomially solvable for (k; l)-bounded instances for certain k and l. Here, k denotes the maximum number of ambiguous sites in any genotype, and l is the maximum number of genotypes that are ambiguous at the same site. Only the complexity of the (; 2)-bounded problem is still unknown, where denotes no restriction. It has been proved that (; 2)-bounded instances have compatibility graphs that can be constructed from cliques and circuits by pasting along an edge. In this paper we give a constructive proof of the fact that (; 2)- bounded instances are polynomially solvable if the compatibility graph is constructed by pasting cliques, trees and circuits along a bounded number of edges. We obtain this proof by solving a slightly generalized problem on circuits, trees and cliques respectively, and arguing that all possible combinations of optimal solutions for these graphs that are pasted along a bounded number of edges can be enumerated efficiently.
PrePrint: An Improved Integral Inequality to Stability Analysis of Genetic Regulatory Networks With Interval Time-Varying Delays
This paper focuses on stability analysis for a class of genetic regulatory networks with interval time-varying delays. An improved integral inequality concerning on double-integral items is first established. Then we use the improved integral inequality to deal with the resultant double-integral items in the derivative of the involved Lyapunov-Krasovskii functional. As a result, a delay-range-dependent and delay-rate-dependent asymptotical stability criterion is established for genetic regulato- ry networks with differential time-varying delays. Furthermore, it is theoretically proven that the stability criterion proposed here is less conservative than the corresponding one in [Neurocomputing, 2012, 93: 19–26]. Based on the obtained result, another stability criterion is given under the case that the information of the derivatives of delays is unknown. Finally, the effectiveness of the approach proposed in this paper is illustrated by a pair of numerical examples which give the comparisons of stability criteria proposed in this paper and some literature
PrePrint: Multiple 3D RNA Structure Superposition Using Neighbor Joining
Recent advances in RNA research and the steady growth of available RNA structures call for bioinformatics methods for handling and analyzing RNA structural data. Recently, we introduced SETTER—a fast and accurate method for RNA pairwise structure alignment. In this paper we describe MultiSETTER, SETTER extension for multiple RNA structure alignment. MultiSETTER combines SETTER’s decomposition of RNA structures into non-overlapping structural subunits with the multiple sequence alignment algorithm ClustalW adapted for the structure alignment. The accuracy of MultiSETTER was assessed by the automatic classification of RNA structures and its comparison to SCOR annotations. In addition, MultiSETTER classification was also compared to multiple sequence alignment-based and secondary structure alignment-based classifications provided by LocARNA and RNADistance tools, respectively. MultiSETTER precompiled Windows libraries, as well as the C++ source code, are freely available from http://siret.cz/multisetter.
PrePrint: A Comparative Assessment of Predictive Accuracies of Conventional and Machine Learning Scoring Functions for Protein-Ligand Binding Affinity Prediction
Accurately predicting the binding affinities of large diverse sets of protein-ligand complexes efficiently is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Since a scoring function (SF) is used to score, rank, and identify potential drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein’s binding site has a significant bearing on the accuracy of virtual screening. Despite intense efforts in developing conventional SFs, which are either force-field based, knowledge-based, or empirical, their limited predictive accuracy has been a major roadblock toward cost-effective drug discovery. Therefore, in this work, we explore a range of novel SFs employing different machine-learning (ML) approaches in conjunction with a variety of physicochemical and geometrical features characterizing protein-ligand complexes. We assess the scoring accuracies of these new ML SFs as well as those of conventional SFs in the context of the 2007 and 2010 PDBbind benchmark datasets on both diverse and protein-family-specific test sets. We also investigate the influence of the size of the training dataset and the type and number of features used on scoring accuracy. We find that the best performing ML SF has a Pearson correlation coefficient of 0.806 between predicted and measured binding affinities compared to 0.644 achieved by a state-of-the-art conventional SF. We also find that ML SFs benefit more than their conventional counterparts from increases in the number of features and the size of training dataset. In addition, they perform better on novel proteins that they were never trained on before.
PrePrint: Quantitative Measurement of Split of the Second Heart Sound (S2)
This study proposes a quantitative measurement of split of the second heart sound (S2) based on nonstationary signal decomposition to deal with overlaps and energy modeling of the subcomponents of S2. The second heart sound includes aortic (A2) and pulmonic (P2) closure sounds. However, the split detection is obscured due to A2–P2 overlap and low energy of P2. To identify such split, HVD method is used to decompose the S2 into a number of components while preserving the phase information. Further, A2s and P2s are localized using smoothed pseudo Wigner-Ville distribution followed by reassignment method. Finally, the split is calculated by taking the differences between the means of time indices of A2s and P2s. Experiments on total 33 clips of S2 signals are performed for evaluation of the method. The mean±standard deviation of the split is 34.7±4.6ms. The method measures the split efficiently, even when A2–P2 overlap is ≤ 20ms and the normalized peak temporal ratio of P2 to A2 is low (≥0.22). This proposed method thus, demonstrates its robustness by defining split detectability (SDT), the split detection aptness through detecting P2s, by measuring upto 96%. Such findings reveal the effectiveness of the method as competent against the other baselines, especially for A2–P2 overlaps and low energy P2.
PrePrint: Predicting Protein Functions using Multiple Kernels
High-throughput experimental techniques provide a wide variety of heterogeneous proteomic data sources. To exploit the information spread across multiple sources for protein function prediction, these data sources are transformed into kernels and then integrated into a composite kernel. Several methods firstly optimize the weights on these kernels to produce a composite kernel, and then train a classifier on the composite kernel. As such, these approaches result in an optimal composite kernel, but not necessarily in an optimal classifier. On the other hand, some approaches optimize the loss of binary classifiers and learn weights for the different kernels iteratively. For multi-class or multi-label data, these methods have to solve the problem of optimizing weights on these kernels for each of the labels, which are computationally expensive and ignore the correlation among labels. In this paper, we propose a method called Predicting Protein Function using Multiple Kernels (ProMK). ProMK iteratively optimizes the phases of learning optimal weights and reduces the empirical loss of multi-label classifier for each of the labels simultaneously. ProMK can integrate kernels selectively and downgrade the weights on noisy kernels. We investigate the performance of ProMK on several publicly available protein function prediction benchmarks and synthetic datasets. We show that the proposed approach performs better than previously proposed protein function prediction approaches that integrate multiple data sources and multi-label multiple kernel learning methods. The codes of our proposed method are available at https://sites.google.com/site/guoxian85/promk
PrePrint: Identification of functionally-related enzymes by learning-to-rank methods
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structurebased measures under certain conditions.We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
PrePrint: On representing protein folding patterns using non-linear parametric curves.
Proteins fold into complex three-dimensional shapes. Simplified representations of their shapes are central to rationalise, compare, classify, and interpret protein structures. Traditional methods to abstract protein folding patterns rely on representing their standard secondary structural elements (helices and strands of sheet) using line segments. This results in ignoring a significant proportion of structural information. The motivation of this research is to derive mathematically rigorous and biologically meaningful abstractions of protein folding patterns that maximize the economy of structural description and minimize the loss of structural information. We report on a novel method to describe a protein as a non-overlapping set of parametric three dimensional curves of varying length and complexity. Our approach to this problem is supported by information theory and uses the statistical framework of minimum message length (MML) inference. We demonstrate the effectiveness of our non-linear abstraction to support efficient and effective comparison of protein folding patterns on a large scale.
PrePrint: Computing elementary flux modes involving a set of target reactions
Elementary flux mode (EM) computation is an important tool in the constraint-based analysis of genome-scale metabolic networks. Due to the combinatorial complexity of these networks, as well as the advances in the level of detail to which they can be reconstructed, an exhaustive enumeration of all EMs is often not practical. Therefore, in recent years interest has shifted towards searching EMs with specific properties. We present a novel method that allows computing EMs containing a given set of target reactions. This generalizes previous algorithms where the set of target reactions consists of a single reaction. In the one-reaction case, our method compares favorably to the previous approaches. In addition, we present several applications of our algorithm for computing EMs containing two target reactions in genome-scale metabolic networks. A software tool implementing the algorithms described in this paper is available at https://sourceforge.net/projects/caefm.