Subscribe: IEEE/ACM Transactions on Computational Biology and Bioinformatics
http://csdl.computer.org/rss/tcbb.xml
Added By: Feedage Forager Feedage Grade B rated
Language: English
Tags:
algorithm  approach  based  data  gene  genes  information  method  methods  network  networks  proposed  protein  results 

IEEE/ACM Transactions on Computational Biology and Bioinformatics



The IEEE/ACM Transactions on Computational Biology and Bioinformatics is a new quarterly that will publish archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and co



 



Machine Learned Replacement of N-Labels for Basecalled Sequences in DNA Barcoding

02/05/2018 2:01 pm PST

This study presents a machine learning method that increases the number of identified bases in Sanger sequencing. The system post-processes a KB basecalled chromatogram. It selects a recoverable subset of N-labels in the KB-called chromatogram to replace with basecalls (A, C, G, T). An N-label correction is defined given an additional read of the same sequence and a human-finished sequence. Corrections are added to the dataset when an alignment determines that the additional read and the human-finished sequence agree on the identity of the N-label. KB must also rate the replacement with a quality value of $>60$ in the additional read. Corrections are only available during system training. To develop the system, nearly 850,000 N-labels were obtained from Barcode of Life Datasystems, the premier database of genetic markers called DNA Barcodes. Increasing the number of correct bases improves reference sequence reliability, increases sequence identification accuracy, and assures analysis correctness. In keeping with barcoding standards, our system maintains an error rate of $<1$ percent. Our system only applies corrections when it estimates a low rate of error. Tested on this data, our automation selects and recovers 79 percent of N-labels from COI (animal barcode), 80 percent from matK and rbcL (plant barcodes), and 58 percent from non-protein-coding sequences (across eukaryotes).



A Graphical Model of Smoking-Induced Global Instability in Lung Cancer

02/07/2018 2:03 pm PST

Smoking is the major cause of lung cancer and the leading cause of cancer-related death in the world. The current view of lung cancer is no longer limited to individual genes being mutated by carcinogenic insults from smoking. Instead, tumorigenesis is a phenotype conferred by many systematic and global alterations, leading to extensive heterogeneity and variation in both the genotypes and phenotypes of individual cancer cells. Strategically, it is therefore of foremost importance to develop a methodology that captures consistent, global alterations presumably shared by most of the cancerous cells in a given population. This is particularly true because almost all of the data collected from solid cancers (including lung cancers) are usually widely separated in temporal or even spatial context. Here, we report a multiple non-Gaussian graphical model to reconstruct the gene interaction network using two previously published gene expression datasets. Our graphical model aims to selectively detect gross structural changes at the level of gene interaction networks. Our methodology is extensively validated, demonstrating good robustness, as well as the selectivity and specificity expected based on our biological insights. In summary, gene regulatory networks are still relatively stable during what is presumably the early stage of neoplastic transformation. However, drastic structural differences can be found between lung cancer and its normal control, including the gain of functional modules for cellular proliferation such as EGFR and PDGFRA, as well as the loss of the important IL6 module, supporting their roles as potential drug targets. Interestingly, our method can also detect early modular changes, with ALDH3A1 and its associated interactions being strongly implicated as a potential early marker, whose activation appears to alter the LCN2 module as well as its interactions with the important TP53-MDM2 circuitry.
Our strategy of using the graphical model to reconstruct the gene interaction network with biologically inspired constraints exemplifies the importance and beauty of biology in developing any bio-computational approach.



Optimal Objective-Based Experimental Design for Uncertain Dynamical Gene Networks with Experimental Error

02/07/2018 2:03 pm PST

In systems biology, network models are often used to study interactions among cellular components, a salient aim being to develop drugs and therapeutic mechanisms to change the dynamical behavior of the network to avoid undesirable phenotypes. Owing to limited knowledge, model uncertainty is commonplace and network dynamics can be updated in different ways, thereby giving multiple dynamic trajectories, that is, dynamics uncertainty. In this manuscript, we propose an experimental design method that can effectively reduce the dynamics uncertainty and improve performance in an interaction-based network. Both dynamics uncertainty and experimental error are quantified with respect to the modeling objective, herein, therapeutic intervention. The aim of experimental design is to select among a set of candidate experiments the experiment whose outcome, when applied to the network model, maximally reduces the dynamics uncertainty pertinent to the intervention objective.



Structural Class Classification of 3D Protein Structure Based on Multi-View 2D Images

02/07/2018 2:03 pm PST

Computing similarity or dissimilarity between protein structures is an important task in structural biology. A conventional method to compute protein structure dissimilarity requires structural alignment of the proteins. However, defining one best alignment is difficult, especially when the structures are very different. In this paper, we propose a new similarity measure for protein structure comparisons using a set of multi-view 2D images of 3D protein structures. In this approach, each protein structure is represented by a subspace from the image set. The similarity between two protein structures is then characterized by the canonical angles between the two subspaces. The primary advantage of our method is that precise alignment is not needed. We employed Grassmann Discriminant Analysis (GDA) as the subspace-based learning method in the classification framework. We applied our method to the classification problem of seven SCOP structural classes of protein 3D structures. The proposed method outperformed the k-nearest neighbor method (k-NN) based on the conventional alignment-based methods CE, FATCAT, and TM-align. Our method was also applied to the classification of SCOP folds of membrane proteins, where the proposed method could recognize the fold HEM-binding four-helical bundle (f.21) much better than TM-align.



Species Tree Inference from Gene Splits by Unrooted STAR Methods

02/07/2018 2:01 pm PST

The $\text{NJ}_{st}$ method was proposed by Liu and Yu to infer a species tree topology from unrooted topological gene trees. While its statistical consistency under the multispecies coalescent model was established only for a four-taxon tree, simulations demonstrated its good performance on gene trees inferred from sequences for many taxa. Here, we prove the statistical consistency of the method for an arbitrarily large species tree. Our approach connects $\text{NJ}_{st}$ to a generalization of the STAR method of Liu, Pearl, and Edwards, and a previous theoretical analysis of it. We further show $\text{NJ}_{st}$ utilizes only the distribution of splits in the gene trees, and not their individual topologies. Finally, we discuss how multiple samples per taxon per gene should be handled for statistical consistency.



Filter Design with Adaptation to Time-Delay Parameters for Genetic Regulatory Networks

02/07/2018 2:02 pm PST

In existing works, the filters designed for delayed genetic regulatory networks (GRNs) contain the time delay. If the time delay is unknown, such filters cannot be used in practical applications. To overcome this shortcoming, this paper studies the filter design problem for GRNs with an unknown constant time delay and introduces a novel adaptive filter in which all unknown network parameters and the unknown time delay can be estimated online. Using the Lyapunov approach, it is shown that the estimation errors asymptotically converge to the origin. Finally, simulation results are presented to illustrate the effectiveness of the proposed method.



Sampled-Data Stabilization for Fuzzy Genetic Regulatory Networks with Leakage Delays

02/07/2018 2:02 pm PST

This paper deals with the sampled-data stabilization problem for Takagi-Sugeno (T-S) fuzzy genetic regulatory networks with leakage delays. A novel Lyapunov-Krasovskii functional (LKF) is established by the non-uniform division of the delay intervals with triplex and quadruplex integral terms. Using such LKFs for constant and time-varying delay cases, new stability conditions are obtained in the T-S fuzzy framework. Based on this, a new condition for the sampled-data controller design is proposed using a linear matrix inequality representation. A numerical result is provided to show the effectiveness and potential of the developed design method.



ACID: Association Correction for Imbalanced Data in GWAS

02/07/2018 2:02 pm PST

Genome-wide association studies (GWAS) are widely recognized as a powerful tool for revealing suspicious loci for various diseases. However, real-world GWAS tasks always suffer from data imbalance, with ample control samples but limited case samples. This imbalance can seriously bias the results and thus lead to a loss of significance for true causal markers. To tackle this problem, we propose a computational framework to perform association correction for imbalanced data (ACID) that could potentially improve the performance of GWAS under imbalanced conditions. ACID is inspired by imbalanced learning theory but is specifically modified to address the task of association discovery from sequential genomic data. Simulation studies demonstrate that ACID can dramatically improve the power of traditional GWAS methods on datasets with severe imbalance. We further applied ACID to two imbalanced datasets (gastric cancer and bladder cancer) to conduct genome-wide association analysis. Experimental results indicate that our method is better at identifying suspicious loci than the regression approach and is consistent with existing discoveries.



Algorithms for the Majority Rule (+) Consensus Tree and the Frequency Difference Consensus Tree

02/07/2018 2:03 pm PST

This article presents two new deterministic algorithms for constructing consensus trees. Given an input of $k$ phylogenetic trees with identical leaf label sets and $n$ leaves each, the first algorithm constructs the majority rule (+) consensus tree in $O(kn)$ time, which is optimal since the input size is $\Omega(kn)$, and the second one constructs the frequency difference consensus tree in $\min\lbrace O(kn^{2}), O(kn(k + \log^{2} n))\rbrace$ time.
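To make the notion of a consensus tree concrete, here is a minimal Python sketch of the plain majority rule: keep every cluster that appears in more than half of the input trees. It is not the article's algorithm. It ignores the extra clusters admitted by the (+) variant, it does not attempt the $O(kn)$ bound, and the encoding of a tree as a list of leaf-label sets is an assumption made for illustration.

```python
from collections import Counter


def majority_clusters(trees):
    """Return the clusters found in more than half of the input trees.

    Each tree is given as a collection of clusters, where a cluster is the
    set of leaf labels below one internal node (an assumed encoding).
    """
    counts = Counter()
    for tree in trees:
        # Deduplicate within a tree so that each tree votes at most once.
        for cluster in {frozenset(c) for c in tree}:
            counts[cluster] += 1
    threshold = len(trees) / 2
    return {cluster for cluster, n in counts.items() if n > threshold}


# Three toy trees on the leaves a, b, c, d (hypothetical data).
trees = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"c", "d"}],
    [{"a", "b"}, {"a", "b", "c"}],
]
consensus = majority_clusters(trees)
```

In this toy input the cluster {a, b} occurs in all three trees and {a, b, c} in two of them, so both survive, while {c, d} occurs only once and is dropped.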



A Novel Approach to Identify the miRNA-mRNA Causal Regulatory Modules in Cancer

02/07/2018 2:02 pm PST

MicroRNAs (miRNAs) play an essential role in many biological processes by regulating their target genes, especially in the initiation and development of cancers. Therefore, the identification of miRNA-mRNA regulatory modules is important for understanding the regulatory mechanisms. Most computational methods use only statistical correlations in predicting miRNA-mRNA modules and neglect the fact that there are causal relationships between miRNAs and their target genes. In this paper, we propose a novel approach called CALM (the causal regulatory modules) to identify the miRNA-mRNA regulatory modules by integrating the causal interactions and statistical correlations between the miRNAs and their target genes. Our algorithm consists of three main steps: it first infers the causal regulatory relationships between miRNAs and genes from gene expression profiles, then detects miRNA clusters according to the GO function information of their target genes, and finally expands each miRNA cluster by greedily adding (or discarding) target genes to maximize the modularity score. To evaluate the performance of our method, we apply CALM to four datasets including EMT, breast, ovarian, and thyroid cancer and validate our results. The experimental results show that our method not only outperforms the compared method but also achieves good overall performance in terms of functional enrichment.



Random Sets of Stadiums in Square and Collective Behavior of Bacteria

02/07/2018 2:03 pm PST

Collective motion of swimmers can be detected by hydrodynamic interactions through the effective (macroscopic) viscosity. It follows from general hydrodynamics that the effective viscosity of non-dilute random suspensions depends on the shape of the particles and on their spatial probabilistic distribution. Therefore, a comparative analysis of disordered and collectively interacting particles of bacterial shape can be done in terms of the probabilistic geometric parameters that determine the effective viscosity. In this paper, we develop a quantitative criterion to detect the collective behavior of bacteria. This criterion is based on the basic statistical moments ($e$-sums, or generalized Eisenstein-Rayleigh sums) that characterize the high-order correlation functions. The locations and shapes of bacteria are modeled by stadiums randomly embedded in a medium without overlapping. These shape models can be considered an improvement on the previous segment model. We calculate the $e$-sums of the simulated disordered sets and of the observed experimental locations of Bacillus subtilis bacteria. The obtained results show a difference between these two sets, which demonstrates the collective motion of bacteria.



Index-Based Network Aligner of Protein-Protein Interaction Networks

02/07/2018 2:03 pm PST

Network alignment over graph-structured data has received considerable attention in many recent applications. Global network alignment seeks the best mapping of each node in one network to exactly one node in another network. The mapping is performed according to matching criteria that depend on the nature of the data. In molecular biology, functional orthologs, protein complexes, and evolutionarily conserved pathways are some examples of information uncovered by global network alignment. Current techniques for global network alignment suffer from several drawbacks, e.g., poor performance and high memory requirements. We address these problems by proposing IBNAL, Indexes-Based Network ALigner, for better alignment quality and faster results. To accelerate the alignment step, IBNAL makes use of a novel clique-based index and is able to align large networks in seconds. IBNAL produces alignments of higher topological quality and comparable biological match relative to other state-of-the-art aligners, even though topological fit is primarily used to match nodes. IBNAL’s results provide further evidence that homology information is more likely to be encoded in network topology than in sequence information.



Combinatorics of Tandem Duplication Random Loss Mutations on Circular Genomes

02/07/2018 2:03 pm PST

The tandem duplication random loss operation (TDRL) is an important genome rearrangement operation in metazoan mitochondrial genomes. A TDRL consists of a duplication of a contiguous set of genes in tandem followed by a random loss of one copy of each duplicated gene. This paper presents an analysis of the combinatorics of TDRLs on circular genomes, e.g., the mitochondrial genome. In particular, results on TDRLs for circular genomes and their linear representatives are established. Moreover, the distance between gene orders with respect to linear TDRLs and circular TDRLs is studied. An analysis of the available animal mitochondrial gene orders shows the practical relevance of the theoretical results.
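The operation defined above can be illustrated on a linear gene order with a short Python sketch. The function name `tdrl`, the half-open index convention, and the `keep_first` argument (the genes whose first copy survives) are illustrative assumptions, and each gene is assumed to occur exactly once in the genome.

```python
def tdrl(genome, start, end, keep_first):
    """Apply one tandem duplication random loss to a linear gene order.

    genome     -- list of distinct gene labels
    start, end -- half-open range of the contiguous segment to duplicate
    keep_first -- genes whose FIRST copy is kept; for every other gene in
                  the segment the SECOND copy is kept instead
    """
    segment = genome[start:end]
    doubled = segment + segment  # tandem duplication of the segment
    n = len(segment)
    # Loss step: exactly one of the two copies of each gene survives.
    survivors = [
        gene
        for i, gene in enumerate(doubled)
        if (i < n) == (gene in keep_first)
    ]
    return genome[:start] + survivors + genome[end:]


# Duplicate genes 2..5 and keep the first copies of 2 and 4 (toy example).
result = tdrl([1, 2, 3, 4, 5, 6], 1, 5, keep_first={2, 4})
```

The effect is that the genes kept from the first copy move ahead of those kept from the second copy while each group preserves its internal order, so the toy call turns 1 2 3 4 5 6 into 1 2 4 3 5 6.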



Tanglegrams: A Reduction Tool for Mathematical Phylogenetics

02/05/2018 2:01 pm PST

Many discrete mathematics problems in phylogenetics are defined in terms of the relative labeling of pairs of leaf-labeled trees. These relative labelings are naturally formalized as tanglegrams, which have previously been an object of study in coevolutionary analysis. Although there has been considerable work on planar drawings of tanglegrams, they have not been fully explored as combinatorial objects until recently. In this paper, we describe how many discrete mathematical questions on trees “factor” through a problem on tanglegrams, and how understanding that factoring can simplify analysis. Depending on the problem, it may be useful to consider an unordered version of tanglegrams, and/or their unrooted counterparts. For all of these definitions, we show how the isomorphism types of tanglegrams can be understood in terms of double cosets of the symmetric group, and we investigate their automorphisms. Understanding tanglegrams better will isolate the distinct problems on leaf-labeled pairs of trees and reveal natural symmetries of spaces associated with such problems.



Petri Net Siphon Analysis and Graph Theoretic Measures for Identifying Combination Therapies in Cancer

02/07/2018 2:03 pm PST

Epidermal Growth Factor Receptor (EGFR) signaling to the Ras-MAPK pathway is implicated in the development and progression of cancer and is a major focus of targeted combination therapies. Physiochemical models have been used for identifying and testing the signal-inhibiting potential of targeted therapies; however, their application to larger multi-pathway networks is limited by the availability of experimentally-determined rate and concentration parameters. An alternate strategy for identifying and evaluating drug-targetable nodes is proposed. A physiochemical model of EGFR-Ras-MAPK signaling is implemented and calibrated to experimental data. Essential topological features of the model are converted into a Petri net and nodes that behave as siphons—a structural property of Petri nets—are identified. Siphons represent potential drug-targets since they are unrecoverable if their values fall below a threshold. Centrality measures are then used to prioritize siphons identified as candidate drug-targets. Single and multiple drug-target combinations are identified which correspond to clinically relevant drug targets and exhibit inhibition synergy in physiochemical simulations of EGF-induced EGFR-Ras-MAPK signaling. Taken together, these studies suggest that siphons and centrality analyses are a promising computational strategy to identify and rank drug-targetable nodes in larger networks as they do not require knowledge of the dynamics of the system, but rely solely on topology.
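The structural property mentioned above can be checked directly from the topology of a net. In standard Petri net terminology, a non-empty set of places is a siphon when every transition that puts a token into the set also takes a token from the set, so a siphon that has been emptied can never be refilled. The Python sketch below tests that condition. The dictionary encoding of transitions and the place names A, B, and C are hypothetical and are not taken from the EGFR-Ras-MAPK model.

```python
def is_siphon(places, transitions):
    """Check whether a set of places is a siphon.

    transitions maps a transition name to a pair (inputs, outputs), each a
    collection of place names (an assumed encoding).
    """
    places = set(places)
    if not places:
        return False
    for inputs, outputs in transitions.values():
        feeds_set = places & set(outputs)
        drains_set = places & set(inputs)
        # A transition that can refill the set without consuming from it
        # violates the siphon condition.
        if feeds_set and not drains_set:
            return False
    return True


# Toy net: A and B convert into each other, and A also produces C.
toy_net = {
    "t1": ({"A"}, {"B"}),
    "t2": ({"B"}, {"A"}),
    "t3": ({"A"}, {"C"}),
}
ab_is_siphon = is_siphon({"A", "B"}, toy_net)
c_is_siphon = is_siphon({"C"}, toy_net)
```

Here {A, B} is a siphon because both transitions that output into it also consume from it, whereas {C} is not, since t3 refills C using only A.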



The Intrinsic Pepsin Resistance of Interleukin-8 Can Be Explained from a Combined Bioinformatical and Experimental Approach

02/07/2018 2:02 pm PST

Interleukin-8 (IL-8, CXCL8) is a neutrophil chemotactic factor belonging to the family of chemokines. IL-8 has been shown to resist pepsin cleavage, displaying high resistance to this protease. However, the molecular mechanisms underlying this resistance are not fully understood. Using our in-house database containing data on the three-dimensional arrangements of secondary structure elements from the whole Protein Data Bank, we found a striking structural similarity between IL-8 and pepsin inhibitor-3. Such similarity could play a key role in understanding IL-8 resistance to the protease pepsin. To support this hypothesis, we applied pepsin assays, confirming that intact IL-8 is not degraded by pepsin, in contrast to IL-8 in a denatured state. Using 1H-15N Heteronuclear Single Quantum Coherence NMR measurements, we determined the putative regions of IL-8 that are potentially responsible for interactions with pepsin. The results obtained in this work contribute to the understanding of the resistance of IL-8 to pepsin proteolysis in terms of its structural properties.



HEMEsPred: Structure-Based Ligand-Specific Heme Binding Residues Prediction by Using Fast-Adaptive Ensemble Learning Scheme

02/07/2018 2:03 pm PST

Heme is an essential biomolecule that exists widely in numerous extant organisms. Accurately identifying heme binding residues (HEMEs) is of great importance in the study of disease progression and in drug development. In this study, a novel predictor named HEMEsPred was proposed for predicting HEMEs. First, several sequence- and structure-based features, including amino acid composition, motifs, surface preferences, and secondary structure, were collected to construct feature matrices. Second, a novel fast-adaptive ensemble learning scheme was designed to overcome the serious class-imbalance problem as well as to enhance the prediction performance. Third, we further developed ligand-specific models, considering that different heme ligands vary significantly in their roles, sizes, and distributions. Statistical tests confirmed the effectiveness of the ligand-specific models. Experimental results on benchmark datasets demonstrated good robustness of our proposed method. Furthermore, our method also showed good generalization capability and outperformed many state-of-the-art predictors on two independent testing datasets. The HEMEsPred web server was made available at http://www.inforstation.com/HEMEsPred/ for free academic use.



Region Growing for Segmenting Green Microalgae Images

02/07/2018 2:01 pm PST

We describe a specialized methodology for segmenting 2D microscopy digital images of freshwater green microalgae. The goal is to obtain representative algae shapes to extract morphological features to be employed in a posterior step of taxonomical classification of the species. The proposed methodology relies on the seeded region growing principle and on a fine-tuned filtering preprocessing stage to smooth the input image. A contrast enhancement process then takes place to highlight algae regions on a binary pre-segmentation image. This binary image is also employed to determine where to place the seed points and to estimate the statistical probability distributions that characterize the target regions, i.e., the algae areas and the background, respectively. These preliminary stages produce the required information to set the homogeneity criterion for region growing. We evaluate the proposed methodology by comparing its resulting segmentations with a set of corresponding ground-truth segmentations (provided by an expert biologist) and also with segmentations obtained with existing strategies. The experimental results show that our solution achieves highly accurate segmentation rates with greater efficiency, as compared with the performance of standard segmentation approaches and with an alternative previous solution, based on level-sets, also specialized to handle this particular problem.
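The seeded region growing principle on which the methodology relies can be sketched in a few lines of Python. This is a generic textbook version, not the authors' pipeline: the homogeneity criterion used here (absolute difference from the seed intensity within a fixed `tolerance`) is a stand-in for the statistically estimated criterion described above, and the image is a hypothetical grid of grey levels.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from one seed pixel over 4-connected neighbours.

    image     -- 2D list of intensities
    seed      -- (row, col) of the starting pixel
    tolerance -- maximum allowed absolute difference from the seed value
                 (an assumed, simplified homogeneity criterion)
    """
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


# Toy image: a dark 2x2 patch in the top-left corner on a bright background.
image = [
    [10, 11, 50],
    [12, 10, 52],
    [51, 49, 50],
]
segment = region_grow(image, seed=(0, 0), tolerance=5)
```

With a tolerance of 5 the region stops at the bright pixels, so the segment is exactly the dark 2x2 patch.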



Inferring the Functions of Proteins from the Interrelationships between Functional Categories

02/07/2018 2:03 pm PST

This study proposes a new method to determine the functions of an unannotated protein. The proteins and amino acid residues mentioned in biomedical texts associated with an unannotated protein $p$ can be considered characteristic terms for $p$, which are highly predictive of the potential functions of $p$. Similarly, proteins and amino acid residues mentioned in biomedical texts associated with proteins annotated with a functional category $f$ can be considered characteristic terms of $f$. We introduce in this paper an information extraction system called IFP_IFC that predicts the functions of an unannotated protein $p$ by representing $p$ and each functional category $f$ by a vector of weights. Each weight reflects the degree of association between a characteristic term and $p$ (or a characteristic term and $f$). First, IFP_IFC constructs a network whose nodes represent the different functional categories and whose edges represent the interrelationships between the nodes. Then, it determines the functions of $p$ by employing random walks with restarts on this network. The walker is the vector of $p$. Finally, $p$ is assigned to the functional categories of the nodes in the network that are visited most by the walker. We evaluated the quality of IFP_IFC by comparing it experimentally with two other systems. Results showed marked improvement.
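The random walk with restarts step can be illustrated with a small Python sketch of the generic iteration $s \leftarrow (1 - r)\,W s + r\,s_0$, where $W$ is the column-normalised adjacency matrix, $s_0$ is the starting distribution, and $r$ is the restart probability. How IFP_IFC builds its network and starting vector is not given in the abstract, so the three-node path graph, the start vector, and the value $r = 0.3$ below are purely illustrative assumptions.

```python
def random_walk_with_restart(adjacency, start, restart=0.3, iterations=100):
    """Iterate a random walk with restart and return the visiting scores.

    adjacency[i][j] is the weight of the edge from node j to node i.
    start is the restart distribution and should sum to 1.
    """
    n = len(adjacency)
    # Column sums are used to turn edge weights into transition probabilities.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    scores = list(start)
    for _ in range(iterations):
        scores = [
            (1 - restart) * sum(
                adjacency[i][j] / col_sums[j] * scores[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            + restart * start[i]
            for i in range(n)
        ]
    return scores


# Toy network: a path 0 - 1 - 2, with the walker restarting at node 0.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
start = [1.0, 0.0, 0.0]
scores = random_walk_with_restart(adjacency, start)
```

Nodes are then ranked by their scores. In this toy graph the score decreases with distance from the restart node, which is the behaviour that lets the most visited nodes be read off as the best candidates.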



Detecting Essential Proteins Based on Network Topology, Gene Expression Data, and Gene Ontology Information

02/07/2018 2:01 pm PST

The identification of essential proteins in protein-protein interaction (PPI) networks is of great significance for understanding cellular processes. With the increasing availability of large-scale PPI data, numerous centrality measures based on network topology have been proposed to detect essential proteins from PPI networks. However, most of the current approaches focus mainly on the topological structure of PPI networks, and largely ignore the gene ontology annotation information. In this paper, we propose a novel centrality measure, called TEO, for identifying essential proteins by combining network topology, gene expression profiles, and GO information. To evaluate the performance of the TEO method, we compare it with five other methods (degree, betweenness, NC, Pec, and CowEWC) in detecting essential proteins from two different yeast PPI datasets. The simulation results show that adding GO information can effectively improve the predicted precision and that our method outperforms the others in predicting essential proteins.



Inferring Unknown Biological Function by Integration of GO Annotations and Gene Expression Data

02/05/2018 2:01 pm PST

Characterizing genes with semantic information is an important part of describing gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of many of their genes remain unknown. Since experimentally studying the functions of those genes one by one would be infeasible, new computational methods for inferring gene function are needed. We present here a novel computational approach for inferring the biological function of a set of genes with previously unknown function, given a set of genes with well-known information. This approach is based on the premise that genes with similar behaviour should be grouped together. This is known as the guilt-by-association principle. Thus, it is possible to take advantage of clustering techniques to obtain groups of unknown genes that are co-clustered with genes that have well-known semantic information (GO annotations). Meaningful knowledge to infer unknown semantic information can therefore be provided by these well-known genes. We provide a method to explore the potential function of new genes according to those currently annotated. The results obtained indicate that the proposed approach could be a useful and effective tool when used by biologists to guide the inference of biological functions for recently discovered genes. Our work sets an important landmark in the field of identifying unknown gene functions through clustering, using an external source of biological input. A simple web interface to this proposal can be found at http://fich.unl.edu.ar/sinc/webdemo/gamma-am/.



Nonbinary Tree-Based Phylogenetic Networks

02/07/2018 2:03 pm PST

Rooted phylogenetic networks are used to describe evolutionary histories that contain non-treelike evolutionary events such as hybridization and horizontal gene transfer. In some cases, such histories can be described by a phylogenetic base-tree with additional linking arcs, which can, for example, represent gene transfer events. Such phylogenetic networks are called tree-based. Here, we consider two possible generalizations of this concept to nonbinary networks, which we call tree-based and strictly-tree-based nonbinary phylogenetic networks. We give simple graph-theoretic characterizations of tree-based and strictly-tree-based nonbinary phylogenetic networks. Moreover, we show for each of these two classes that it can be decided in polynomial time whether a given network is contained in the class. Our approach also provides a new view on tree-based binary phylogenetic networks. Finally, we discuss two examples of nonbinary phylogenetic networks in biology and show how our results can be applied to them.



Classification of State Trajectories in Gene Regulatory Networks

02/07/2018 2:02 pm PST

Gene-expression-based phenotype classification is used for disease diagnosis and prognosis relating to treatment strategies. The present paper considers classification based on sequential measurements of multiple genes using gene regulatory network (GRN) modeling. There are two networks, original and mutated, and observations consist of trajectories of network states. The problem is to classify an observation trajectory as coming from either the original or mutated network. GRNs are modeled via probabilistic Boolean networks, which incorporate stochasticity at both the gene and network levels. Mutation affects the regulatory logic. Classification is based upon observing a trajectory of states of some given length. We characterize the Bayes classifier and find the Bayes error for a general PBN and the special case of a single Boolean network affected by random perturbations (BNp). The Bayes error is related to network sensitivity, meaning the extent of alteration in the steady-state distribution of the original network owing to mutation. Using standard methods to calculate steady-state distributions is cumbersome and sometimes impossible, so we provide an efficient algorithm and approximations. Extensive simulations are performed to study the effects of various factors, including approximation accuracy. We apply the classification procedure to a p53 BNp and a mammalian cell cycle PBN.
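The classification task described above can be illustrated with a toy likelihood comparison in Python. The sketch treats each network as a Markov chain over states and assigns a trajectory to the model with the larger posterior score, which is the usual form of a Bayes classifier for this setting. The two-state chains, their probabilities, and the equal priors are invented for illustration and do not come from the paper, whose PBN models and steady-state approximations are not reproduced here.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def log_likelihood(trajectory, initial, transition):
    """Log-probability of a state trajectory under one Markov chain."""
    total = safe_log(initial[trajectory[0]])
    for current, nxt in zip(trajectory, trajectory[1:]):
        total += safe_log(transition[current][nxt])
    return total


def classify(trajectory, models, priors):
    """Return the model name with the highest prior-weighted likelihood."""
    scores = {
        name: safe_log(priors[name]) + log_likelihood(trajectory, init, trans)
        for name, (init, trans) in models.items()
    }
    return max(scores, key=scores.get)


# Hypothetical two-state chains standing in for the two networks.
models = {
    "original": ([0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]]),
    "mutated": ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]]),
}
priors = {"original": 0.5, "mutated": 0.5}
label_stable = classify([0, 0, 0, 0], models, priors)
label_switching = classify([0, 1, 0, 1], models, priors)
```

A trajectory that stays in one state is more probable under the "sticky" chain, while one that keeps switching is more probable under the uniform chain, so the two toy trajectories receive different labels.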



Attention Recognition in EEG-Based Affective Learning Research Using CFS+KNN Algorithm

02/07/2018 2:02 pm PST

The research detailed in this paper focuses on the processing of electroencephalography (EEG) data to identify attention during the learning process. The identification of affect using our procedures is integrated into a simulated distance learning system that provides feedback to the user with respect to attention and concentration. The authors propose a classification procedure that combines correlation-based feature selection (CFS) and a k-nearest-neighbor (KNN) data mining algorithm. To evaluate the CFS+KNN algorithm, it was tested against the CFS+C4.5 algorithm and other classification algorithms. The classification performance was measured 10 times with different 3-fold cross validation data. The data were derived from 10 subjects while they were attempting to learn material in a simulated distance learning environment. A self-report self-assessment model was used with a single valence to evaluate attention on 3 levels (high, neutral, low). It was found that CFS+KNN performed much better, giving the highest correct classification rate (CCR) of $80.84 \pm 3.0$% for the valence dimension divided into three classes.



Complexity and Algorithms for Finding a Perfect Phylogeny from Mixed Tumor Samples

02/07/2018 2:02 pm PST

Hajirasouliha and Raphael (WABI 2014) proposed a model for deconvoluting mixed tumor samples measured from a collection of high-throughput sequencing reads. This is related to understanding tumor evolution and critical cancer mutations. In short, their formulation asks to split each row of a binary matrix so that the resulting matrix corresponds to a perfect phylogeny and has the minimum number of rows among all matrices with this property. In this paper, we disprove several claims about this problem, including an NP-hardness proof of it. However, we show that the problem is indeed NP-hard, by providing a different proof. We also prove NP-completeness of a variant of this problem proposed in the same paper. On the positive side, we propose an efficient (though not necessarily optimal) heuristic algorithm based on coloring co-comparability graphs, and a polynomial time algorithm for solving the problem optimally on matrix instances in which no column is contained in both columns of a pair of conflicting columns. Implementations of these algorithms are freely available at https://github.com/alexandrutomescu/MixedPerfectPhylogeny.
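The notion of conflicting columns can be made concrete with a short Python sketch. It uses the standard criterion for a rooted perfect phylogeny on a binary matrix: two columns conflict when their sets of rows containing a 1 overlap without one containing the other. The sketch only detects such pairs. It does not implement the row-splitting problem, the heuristic, or the polynomial-time algorithm from the paper, and the example matrix is hypothetical.

```python
from itertools import combinations


def conflicting_pairs(matrix):
    """List the pairs of columns that conflict in a binary matrix.

    Columns a and b conflict when some row has a 1 in both, some row has a
    1 only in a, and some row has a 1 only in b.
    """
    n_cols = len(matrix[0])
    ones = [
        {r for r, row in enumerate(matrix) if row[c]} for c in range(n_cols)
    ]
    return [
        (a, b)
        for a, b in combinations(range(n_cols), 2)
        if ones[a] & ones[b] and ones[a] - ones[b] and ones[b] - ones[a]
    ]


# Toy matrix: rows are samples, columns are mutations.
matrix = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
conflicts = conflicting_pairs(matrix)
```

In the toy matrix only columns 0 and 1 conflict. Column 2 is contained in column 1 and is disjoint from column 0, so neither of those pairs is a conflict. A matrix with an empty conflict list corresponds to a perfect phylogeny under this criterion.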



RAFP-Pred: Robust Prediction of Antifreeze Proteins Using Localized Analysis of n-Peptide Compositions

02/07/2018 2:02 pm PST

In extreme cold weather, living organisms produce Antifreeze Proteins (AFPs) to counter the otherwise lethal intracellular formation of ice. Structures and sequences of various AFPs exhibit a high degree of heterogeneity; consequently, the prediction of the AFPs is considered to be a challenging task. In this research, we propose to handle this arduous manifold learning task using the notion of localized processing. In particular, an AFP sequence is segmented into two sub-segments, each of which is analyzed for amino acid and di-peptide compositions. We propose to use only the most significant features using the concept of information gain (IG) followed by a random forest classification approach. The proposed RAFP-Pred achieved an excellent performance on a number of standard datasets. We report a high Youden’s index (sensitivity+specificity-1) value of 0.75 on the standard independent test data set, outperforming the AFP-PseAAC, AFP_PSSM, AFP-Pred, and iAFP by a margin of 0.05, 0.06, 0.14, and 0.68, respectively. The verification rate on the UniProtKB dataset is found to be 83.19 percent, which is substantially superior to the 57.18 percent reported for the iAFP method.
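
Youden's index, as defined in the abstract (sensitivity + specificity - 1), can be computed directly from a confusion matrix. This is a minimal sketch; the function name and the counts are invented for illustration and do not come from the paper.

```python
def youden_index(tp, fn, tn, fp):
    """Youden's index J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


# Hypothetical confusion-matrix counts, not results from the paper.
j = youden_index(tp=90, fn=10, tn=70, fp=30)
print(f"Youden's index: {j:.2f}")
```

The index ranges from -1 to 1, with 0 corresponding to a classifier that performs no better than chance.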



GSEH: A Novel Approach to Select Prostate Cancer-Associated Genes Using Gene Expression Heterogeneity

02/07/2018 2:02 pm PST

When a gene shows varying levels of expression among normal people but similar levels in disease patients or shows similar levels of expression among normal people but different levels in disease patients, we can assume that the gene is associated with the disease. By utilizing this gene expression heterogeneity, we can obtain additional information that abets discovery of disease-associated genes. In this study, we used collaborative filtering to calculate the degree of gene expression heterogeneity between classes and then scored the genes on the basis of the degree of gene expression heterogeneity to find “differentially predicted” genes. Through the proposed method, we discovered more prostate cancer-associated genes than 10 comparable methods. The genes prioritized by the proposed method are potentially significant to biological processes of a disease and can provide insight into them.



Calculating the Expected Time to Eradicate HIV-1 Using a Markov Chain

02/07/2018 2:02 pm PST

In this study, the expected time required to eradicate HIV-1 completely was found as the conditional absorbing time in a finite state space continuous-time Markov chain model. The Markov chain has two absorbing states: one corresponds to HIV eradication and the other represents the possible disaster. This method allowed us to calculate the expected eradication time by solving systems of linear equations. To overcome the challenge of the huge dimension of the problem, we applied a novel stop and resume technique. This technique also helped to stop the numerical computation whenever we wanted and continue later from that point until the final result was obtained. Our numerical study showed the dependence of the expected eradication time of HIV on the half-life of the latently infected cells and agreed with the previous studies. The study predicted that when the half-life of the latent cells varied from 4.6 to 60 months, it took a mean of 4.97 to 31.04 years with a corresponding standard deviation of 0.64 to 3.99 years to eradicate the latent cell reservoir. It also revealed the crucial dependence of eradication time on the initial number of latently infected cells.
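
The idea of obtaining a conditional absorbing time by solving linear systems can be sketched on a toy chain. The code below is an illustration under stated assumptions, not the authors' model or their stop and resume technique. It assumes a generator restricted to the transient states, Q, and a vector r of rates into the target absorbing state. It solves (-Q) h = r for the absorption probabilities h, then (-Q) g = h, and returns g/h. All names and the two-state example are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are left untouched.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def conditional_absorption_time(q_transient, rate_to_target):
    """Expected absorption time from each transient state, given that the
    chain is absorbed in the target state.

    q_transient[i][j]: generator entries among transient states; the diagonal
        holds minus the total exit rate (including rates to absorbing states).
    rate_to_target[i]: rate from transient state i into the target state.
    Assumes the target is reachable from every transient state (h[i] > 0).
    """
    neg_q = [[-v for v in row] for row in q_transient]
    h = solve_linear(neg_q, rate_to_target)  # P(absorbed in target)
    g = solve_linear(neg_q, h)               # E[T * 1{absorbed in target}]
    return [gi / hi for gi, hi in zip(g, h)]


# Hypothetical two-state chain: state 0 moves to the target or to state 1,
# state 1 moves to the other absorbing state or back to state 0 (all rates 1).
q = [[-2.0, 1.0],
     [1.0, -2.0]]
r_target = [1.0, 0.0]
print(conditional_absorption_time(q, r_target))
```

For a realistic state space the dense elimination above would be replaced by a sparse or iterative solver; the sketch only shows which linear systems are involved.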



Efficient Algorithms for Sequence Analysis with Entropic Profiles

02/07/2018 2:01 pm PST

Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, besides being faster, make possible the analysis of longer sequences, even for high degrees of resolution, than state-of-the-art algorithms.



Application of Genetic Programming (GP) Formalism for Building Disease Predictive Models from Protein-Protein Interactions (PPI) Data

02/05/2018 2:01 pm PST

Protein-protein interactions (PPIs) play a vital role in the biological processes involved in the cell functions and disease pathways. The experimental methods known to predict PPIs require tremendous effort and the results are often hindered by the presence of a large number of false positives. Herein, we demonstrate the use of a new Genetic Programming (GP) based Symbolic Regression (SR) approach for predicting PPIs related to a disease. In this case study, a dataset consisting of 135 PPI complexes related to cancer was used to construct a generic PPI predicting model with good PPI prediction accuracy and generalization ability. A high correlation coefficient (CC) magnitude of 0.893 and low root mean square error (RMSE) and mean absolute percentage error (MAPE) values of 478.221 and 0.239, respectively, were achieved for both the training and test set outputs. To validate the discriminatory nature of the model, it was applied to a dataset of diabetes complexes, where it yielded significantly low CC values. Thus, the GP model developed here serves a dual purpose: (a) a predictor of the binding energy of cancer related PPI complexes, and (b) a classifier for discriminating PPI complexes related to cancer from those of other diseases.
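
The three figures of merit quoted above (CC, RMSE, MAPE) follow standard definitions, sketched here for reference. This assumes the Pearson correlation coefficient and a MAPE expressed as a fraction rather than a percentage; the abstract does not state which conventions the authors used. The function names and sample values are illustrative only.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; targets must be nonzero."""
    n = len(y_true)
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


# Hypothetical target and predicted values, not data from the paper.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(pearson_cc(observed, predicted), rmse(observed, predicted),
      mape(observed, predicted))
```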



Bi-level and Bi-objective p-Median Type Problems for Integrative Clustering: Application to Analysis of Cancer Gene-Expression and Drug-Response Data

02/07/2018 2:02 pm PST

Recent advances in high-throughput technologies have given rise to collecting large amounts of multidimensional heterogeneous data that provide diverse information on the same biological samples. Integrative analysis of such multisource datasets may reveal new biological insights into complex biological mechanisms and therefore remains an important research field in systems biology. Most of the modern integrative clustering approaches rely on independent analysis of each dataset and consensus clustering, probabilistic or statistical modeling, while flexible distance-based integrative clustering techniques are sparsely covered. We propose two distance-based integrative clustering frameworks based on bi-level and bi-objective extensions of the p-median problem. A hybrid branch-and-cut method is developed to find global optimal solutions to the bi-level p-median model. As to the bi-objective problem, an $\varepsilon$-constraint algorithm is proposed to generate an approximation to the Pareto optimal set. Every solution found by any of the frameworks corresponds to an integrative clustering. We present an application of our approaches to integrative analysis of NCI-60 human tumor cell lines characterized by gene expression and drug activity profiles. We demonstrate that the proposed mathematical optimization-based approaches outperform some state-of-the-art and traditional distance-based integrative and non-integrative clustering techniques.
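
For readers unfamiliar with the underlying model, the classic single-objective p-median problem can be sketched as follows: choose p items as medians so that the total distance from every item to its nearest median is minimal, which yields a clustering. The code is a brute-force illustration for tiny inputs only. It is not the bi-level or bi-objective extension, nor the branch-and-cut or $\varepsilon$-constraint method described above, and the names and toy data are hypothetical.

```python
from itertools import combinations


def p_median_brute_force(dist, p):
    """Exhaustively solve the p-median problem on a square distance matrix.

    Returns (medians, total_cost, assignment), where assignment[i] is the
    median that item i is clustered with.
    """
    n = len(dist)
    best_cost, best_medians = float("inf"), None
    for medians in combinations(range(n), p):
        # Each item contributes its distance to the closest chosen median.
        cost = sum(min(dist[i][m] for m in medians) for i in range(n))
        if cost < best_cost:
            best_cost, best_medians = cost, medians
    assignment = [min(best_medians, key=lambda m: dist[i][m]) for i in range(n)]
    return best_medians, best_cost, assignment


# Hypothetical toy data: four points on a line forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
medians, cost, assignment = p_median_brute_force(dist, p=2)
print(medians, cost, assignment)
```

The exhaustive search examines every combination of p medians, so it is only practical for very small instances; this is why exact methods such as branch-and-cut are needed at realistic scale.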



Introducing a Stable Bootstrap Validation Framework for Reliable Genomic Signature Extraction

02/07/2018 2:01 pm PST

The application of machine learning methods for the identification of candidate genes responsible for phenotypes of interest, such as cancer, is a major challenge in the field of bioinformatics. These lists of genes are often called genomic signatures and their linkage to phenotype associations may form a significant step in discovering the causation between genotypes and phenotypes. Traditional methods that produce genomic signatures from DNA Microarray data tend to extract significantly different lists under relatively small variations of the training data. That instability hinders the validity of research findings and raises skepticism about the reliability of such methods. In this study, a complete framework for the extraction of stable and reliable lists of candidate genes is presented. The proposed methodology enforces stability of results at the validation step and as a result, it is independent of the feature selection and classification methods used. Furthermore, two different statistical tests are performed in order to assess the statistical significance of the observed results. Moreover, the consistency of the signatures extracted by independent executions of the proposed method is also evaluated. The results of this study highlight the importance of stability issues in genomic signatures, beyond their prediction capabilities.



2017 Index IEEE/ACM Transactions on Computational Biology and Bioinformatics Vol. 14

02/07/2018 2:02 pm PST

This index covers all technical items - papers, correspondence, reviews, etc. - that appeared in this periodical during the year, and items from previous years that were commented upon or corrected in this year. Departments and other items may also be covered if they have been judged to have archival value. The Author Index contains the primary entry for each item, listed under the first author's name. The primary entry includes the co-authors' names, the title of the paper or other item, and its location, specified by the publication abbreviation, year, month, and inclusive pagination. The Subject Index contains entries describing the item under all appropriate subject headings, plus the first author's name, the publication abbreviation, month, and year, and inclusive pages. Note that the item title is found only under the primary entry in the Author Index.



Normalizing Kernels in the Billera-Holmes-Vogtmann Treespace

12/07/2017 2:02 pm PST

As costs of genome sequencing have dropped precipitously, development of efficient bioinformatic methods to analyze genome structure and evolution has become ever more urgent. For example, most published phylogenomic studies involve either massive concatenation of sequences, or informal comparisons of phylogenies inferred on a small subset of orthologous genes, neither of which provides a comprehensive overview of evolution or systematic identification of genes with unusual and interesting evolution (e.g., horizontal gene transfers, gene duplication, and subsequent neofunctionalization). We are interested in identifying such “outlying” gene trees from the set of gene trees and estimating the distribution of trees over the “tree space”. This paper describes an improvement to the kdetrees algorithm, an adaptation of classical kernel density estimation to the metric space of phylogenetic trees (Billera-Holmes-Vogtmann treespace), whereby the kernel normalizing constants are estimated through the use of the novel holonomic gradient methods. As in the original kdetrees paper, we have applied kdetrees to a set of Apicomplexa genes. The analysis identified several unreliable sequence alignments that had escaped previous detection, as well as a gene independently reported as a possible case of horizontal gene transfer. The updated version of the kdetrees software package is available both from CRAN (the official R package system) and from the official development repository on GitHub (github.com/grady/kdetrees).



Enhancing Protein Conformational Space Sampling Using Distance Profile-Guided Differential Evolution

12/07/2017 2:02 pm PST

De novo protein structure prediction aims to search for low-energy conformations as it follows the thermodynamics hypothesis that places native conformations at the global minimum of the protein energy surface. However, the native conformation is not necessarily located in the lowest-energy regions owing to the inaccuracies of the energy model. This study presents a differential evolution algorithm using a distance profile-based selection strategy to sample conformations with reasonable structure effectively. In the proposed algorithm, besides energy, the residue-residue distance is considered another measure of the conformation. The average distance errors of decoys between the distance of each residue pair and the corresponding distance in the distance profiles are first calculated when the trial conformation yields a larger energy value than that of the target. Then, the distance acceptance probability of the trial conformation is designed based on distance profiles if the trial conformation obtains a lower average distance error compared with that of the target conformation. The trial conformation is accepted to the next generation in accordance with its distance acceptance probability. By using the dual constraints of energy and distance in guiding sampling, the algorithm can sample conformations with lower energies and more reasonable structures. Experimental results of 28 benchmark proteins show that the proposed algorithm can effectively predict near-native protein structures.



Benchmark Dataset for Whole Genome Sequence Compression

12/07/2017 2:02 pm PST

The research in DNA data compression lacks a standard dataset to test out compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of such a scientifically compiled whole genome sequence dataset and proposes a benchmark dataset using a multistage sampling procedure. Considering the genome sequences of organisms available in the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses are evident only with a comparison based on the scientifically compiled benchmark dataset. Availability: The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.



Improving Biochemical Named Entity Recognition Using PSO Classifier Selection and Bayesian Combination Methods

12/07/2017 2:02 pm PST

Named Entity Recognition (NER) is a basic step for a large number of consequent text mining tasks in the biochemical domain. Increasing the performance of such recognition systems is of high importance and always poses a challenge. In this study, a new community based decision making system is proposed which aims at increasing the efficiency of NER systems in the chemical/drug name context. The Particle Swarm Optimization (PSO) algorithm is chosen as the expert selection strategy along with the Bayesian combination method to merge the outputs of the selected classifiers as well as evaluate the fitness of the selected candidates. The proposed system performs in two steps. The first step focuses on creating various numbers of baseline classifiers for NER with different feature sets using Conditional Random Fields (CRFs). The second step involves the selection and efficient combination of the classifiers using PSO and Bayesian combination. Two comprehensive corpora from BioCreative events, namely ChemDNER and CEMP, are used for the experiments conducted. Results show that the ensemble of classifiers selected by means of the proposed approach performs better than the single best classifier as well as ensembles formed using other popular selection/combination strategies for both corpora. Furthermore, the proposed method outperforms the best performing system at the BioCreative IV ChemDNER track by achieving an F-score of 87.95 percent.



Data Management for Heterogeneous Genomic Datasets

12/07/2017 2:01 pm PST

Next Generation Sequencing (NGS), a family of technologies for reading DNA and RNA, is changing biological research, and will soon change medical practice, by quickly providing sequencing data and high-level features of numerous individual genomes in different biological and clinical conditions. The availability of millions of whole genome sequences may soon become the biggest and most important “big data” problem of mankind. In this exciting framework, we recently proposed a new paradigm to raise the level of abstraction in NGS data management, by introducing a GenoMetric Query Language (GMQL) and demonstrating its usefulness through several biological query examples. Leveraging on that effort, here we motivate and formalize GMQL operations, especially focusing on the most characteristic and domain-specific ones. Furthermore, we address their efficient implementation and illustrate the architecture of the new software system that we have developed for their execution on big genomic data in a cloud computing environment, providing the evaluation of its performance. The new system implementation is available for download at the GMQL website (http://www.bioinformatics.deib.polimi.it/GMQL/); GMQL can also be tested through a set of predefined queries on ENCODE and Roadmap Epigenomics data at http://www.bioinformatics.deib.polimi.it/GMQL/queries/.



Unsupervised Binning of Metagenomic Assembled Contigs Using Improved Fuzzy C-Means Method

12/07/2017 2:01 pm PST

Metagenomic contigs binning is a necessary step of metagenome analysis. After assembly, the number of contigs belonging to different genomes is usually unequal. So a metagenomic contigs dataset is a kind of imbalanced dataset and the traditional fuzzy c-means method (FCM) fails to handle it very well. In this paper, we introduce an improved version of the fuzzy c-means method (IFCM) into metagenomic contigs binning. First, tetranucleotide frequencies are calculated for every contig. Second, the number of bins is roughly estimated by the distribution of genome lengths of a complete set of non-draft sequenced microbial genomes from NCBI. Then, IFCM is used to cluster DNA contigs with the estimated result. Finally, a clustering validity function is utilized to determine the binning result. We tested this method on one synthetic and two real datasets, and experimental results have shown the effectiveness of this method compared with other tools.
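
The first step of the pipeline, computing tetranucleotide frequencies for a contig, can be sketched as below. This is a minimal illustration under assumptions the abstract does not specify: overlapping windows, a fixed lexicographic ordering of the 256 tetramers, windows containing ambiguous bases (such as N) skipped, and no merging of reverse complements. The function name and the example sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# All 256 tetranucleotides in a fixed lexicographic order (AAAA, AAAC, ...).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(contig):
    """Return a 256-element vector of tetranucleotide frequencies."""
    seq = contig.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made purely of A, C, G, T contribute to the total.
    total = sum(counts[k] for k in TETRAMERS)
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


# Hypothetical toy contig; real contigs are thousands of bases long.
freqs = tetranucleotide_frequencies("ACGTACGT")
print(len(freqs), freqs[TETRAMERS.index("ACGT")])
```

Each contig then becomes a point in a 256-dimensional space, which is the representation a clustering method such as fuzzy c-means operates on.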



High Class-Imbalance in pre-miRNA Prediction: A Novel Approach Based on deepSOM

12/07/2017 2:01 pm PST

The computational prediction of novel microRNA within a full genome involves identifying sequences having the highest chance of being a miRNA precursor (pre-miRNA). These sequences are usually named candidates to miRNA. The well-known pre-miRNAs are usually only a few in comparison to the hundreds of thousands of potential candidates to miRNA that have to be analyzed, which makes this task a high class-imbalance classification problem. The classical way of approaching it has been training a binary classifier in a supervised manner, using well-known pre-miRNAs as positive class and artificially defining the negative class. However, although the selection of positive labeled examples is straightforward, it is very difficult to build a set of negative examples in order to obtain a good set of training samples for a supervised method. In this work, we propose a novel and effective way of approaching this problem using machine learning, without the definition of negative examples. The proposal is based on clustering unlabeled sequences of a genome together with well-known miRNA precursors for the organism under study, which allows for the quick identification of the best candidates to miRNA as those sequences clustered with known precursors. Furthermore, we propose a deep model to overcome the problem of having very few positive class labels. They are always maintained in the deep levels as positive class while less likely pre-miRNA sequences are filtered level after level. Our approach has been compared with other methods for pre-miRNAs prediction in several species, showing effective predictivity of novel miRNAs. Additionally, we will show that our approach has a lower training time and allows for a better graphical navigability and interpretation of the results. A web-demo interface to try deepSOM is available at http://fich.unl.edu.ar/sinc/web-demo/deepsom/.



Copy Number Variations Detection: Unravelling the Problem in Tangible Aspects

12/07/2017 2:02 pm PST

In the midst of the important genomic variants associated with the susceptibility and resistance to complex diseases, Copy Number Variations (CNV) have emerged as a prevalent class of structural variation. Following the flood of next-generation sequencing data, numerous publicly available tools have been developed to provide computational strategies to identify CNV at improved accuracy. This review goes beyond scrutinizing the main approaches widely used for structural variants detection in general, including Split-Read, Paired-End Mapping, Read-Depth, and Assembly-based. In this paper, (1) we characterize the relevant technical details around the detection of CNV, which can affect the estimation of breakpoints and number of copies, (2) we pinpoint the most important insights related to GC-content and mappability biases, and (3) we discuss the paramount caveats in the tools evaluation process. The points brought out in this study emphasize common assumptions, a variety of possible limitations, valuable insights, and directions for desirable contributions to the state-of-the-art in CNV detection tools.



Reframed Genome-Scale Metabolic Model to Facilitate Genetic Design and Integration with Expression Data

12/07/2017 2:01 pm PST

Genome-scale metabolic network models (GEMs) have played important roles in the design of genetically engineered strains and helped biologists to decipher metabolism. However, due to the complex gene-reaction relationships that exist in model systems, most algorithms have limited capabilities with respect to directly predicting accurate genetic design for metabolic engineering. In particular, methods that predict reaction knockout strategies leading to overproduction are often impractical in terms of gene manipulations. Recently, we proposed a method named logical transformation of model (LTM) to simplify the gene-reaction associations by introducing intermediate pseudo reactions, which makes it possible to generate genetic design. Here, we propose an alternative method to relieve researchers from deciphering complex gene-reactions by adding pseudo gene controlling reactions. In comparison to LTM, this new method introduces fewer pseudo reactions and generates a much smaller model system named gModel. We showed that gModel allows two seldom reported applications: identification of minimal genomes and design of minimal cell factories within a modified OptKnock framework. In addition, gModel could be used to integrate expression data directly and improve the performance of the E-Fmin method for predicting fluxes. In conclusion, the model transformation procedure will facilitate genetic research based on GEMs, extending their applications.



Detecting Pairwise Interactive Effects of Continuous Random Variables for Biomarker Identification with Small Sample Size

12/07/2017 2:02 pm PST

Aberrant changes to interactions among cellular components have been conjectured to be potential causes of abnormalities in cellular functions. By systematic analysis of high-throughput-omics data, researchers hope to detect potential associations among measured variables for better biomarker identification and phenotype prediction. In this paper, we focus on the methods to measure pairwise interactive effects among continuous random variables, representing molecular expressions, with respect to a given categorical outcome. Together with a comprehensive review on the existing measures, we further propose new measures that better estimate interactive effects, especially in small sample size scenarios. We first evaluate the performance of the existing and new methods for both small and large sample sizes based on simulated datasets, which show that our proposed methods outperform previous methods in general. The best performing method for small sample size scenarios suggested by simulation experiments is then implemented to estimate interactive effects among genes with respect to the metastasis outcome in two breast cancer studies based on micro-array gene expression datasets. Our results further demonstrate that integrating detected interactive effects together with individual effects can help in finding more accurate biomarkers for breast cancer metastasis, which are indeed involved in important pathways related to cancer metastasis based on gene set enrichment analysis.



Extending the Applicability of Graphlets to Directed Networks

12/07/2017 2:02 pm PST

With recent advances in high-throughput cell biology, the amount of cellular biological data has grown drastically. Such data is often modeled as graphs (also called networks) and studying them can lead to new insights into molecule-level organization. A possible way to understand their structure is by analyzing the smaller components that constitute them, namely network motifs and graphlets. Graphlets are particularly well suited to compare networks and to assess their level of similarity due to the rich topological information that they offer but are almost always used as small undirected graphs of up to five nodes, thus limiting their applicability in directed networks. However, a large set of interesting biological networks such as metabolic, cell signaling, or transcriptional regulatory networks are intrinsically directional, and using metrics that ignore edge direction may gravely hinder information extraction. Our main purpose in this work is to extend the applicability of graphlets to directed networks by considering their edge direction, thus providing a powerful basis for the analysis of directed biological networks. We tested our approach on two network sets, one composed of synthetic graphs and another of real directed biological networks, and verified that they were more accurately grouped using directed graphlets than undirected graphlets. It is also evident that directed graphlets offer substantially more topological information than simple graph metrics such as degree distribution or reciprocity. However, enumerating graphlets in large networks is a computationally demanding task. Our implementation addresses this concern by using a state-of-the-art data structure, the g-trie, which is able to greatly reduce the necessary computation. We compared our tool to other state-of-the-art methods and verified that it is the fastest general tool for graphlet counting.



Pluribus—Exploring the Limits of Error Correction Using a Suffix Tree

12/07/2017 2:02 pm PST

Next generation sequencing technologies enable efficient and cost-effective genome sequencing. However, sequencing errors increase the complexity of the de novo assembly process, and reduce the quality of the assembled sequences. Many error correction techniques utilizing substring frequencies have been developed to mitigate this effect. In this paper, we present a novel and effective method called Pluribus, for correcting sequencing errors using a generalized suffix trie. Pluribus utilizes multiple manifestations of an error in the trie to accurately identify errors and suggest corrections. We show that Pluribus produces the fewest false positives across a diverse set of real sequencing datasets when compared to other methods. Furthermore, Pluribus can be used in conjunction with other contemporary error correction methods to achieve higher levels of accuracy than either tool alone. These increases in error correction accuracy are also realized in the quality of the contigs that are generated during assembly. We explore, in depth, the behavior of Pluribus to explain the observed improvement in accuracy and assembly performance. Pluribus is freely available at http://compbio.case.edu/pluribus/.



A Survey of Software and Hardware Approaches to Performing Read Alignment in Next Generation Sequencing

12/07/2017 2:02 pm PST

Computational genomics is an emerging field that is enabling us to reveal the origins of life and the genetic basis of diseases such as cancer. Next Generation Sequencing (NGS) technologies have unleashed a wealth of genomic information by producing immense amounts of raw data. Before any functional analysis can be applied to this data, read alignment is applied to find the genomic coordinates of the produced sequences. Alignment algorithms have evolved rapidly with the advancement in sequencing technology, striving to achieve biological accuracy at the expense of increasing space and time complexities. Hardware approaches have been proposed to accelerate the computational bottlenecks created by the alignment process. Although several hardware approaches have achieved remarkable speedups, most have overlooked important biological features, which have hampered their widespread adoption by the genomics community. In this paper, we provide a brief biological introduction to genomics and NGS. We discuss the most popular next generation read alignment tools and algorithms. Furthermore, we provide a comprehensive survey of the hardware implementations used to accelerate these algorithms.



Modeling and Identification of Amnioserosa Cell Mechanical Behavior by Using Mass-Spring Lattices

12/07/2017 2:01 pm PST

Various mechanical models of live amnioserosa cells during Drosophila melanogaster’s dorsal closure are proposed. Such models account for specific biomechanical oscillating behaviors and depend on a different set of parameters. The identification of the parameters for each of the proposed models is accomplished according to a least-squares approach so as to best fit the cellular dynamics extracted from live images. For the purpose of comparison, the resulting models after identification are validated to allow for the selection of the most appropriate description of such cell dynamics. The proposed methodology is general and it may be applied to other planar biological processes.



Strategies for Comparing Metabolic Profiles: Implications for the Inference of Biochemical Mechanisms from Metabolomics Data

12/07/2017 2:01 pm PST

Background: Large amounts of metabolomics data have been accumulated in recent years and await analysis. Previously, we had developed a systems biology approach to infer biochemical mechanisms underlying metabolic alterations observed in cancers and other diseases. The method utilized the typical Euclidean distance for comparing metabolic profiles. Here, we ask whether any of the numerous alternative metrics might serve this purpose better. Methods and Findings: We used enzymatic alterations in purine metabolism that were measured in human renal cell carcinoma to test various metrics with the goal of identifying the best metrics for discerning metabolic profiles of healthy and diseased individuals. The results showed that several metrics have similarly good performance, but that some are unsuited for comparisons of metabolic profiles. Furthermore, the results suggest that relative changes in metabolite levels, which reduce bias toward large metabolite concentrations, are better suited for comparisons of metabolic profiles than absolute changes. Finally, we demonstrate that a sequential search for enzymatic alterations, ranked by importance, is not always valid. Conclusions: We identified metrics that are appropriate for comparisons of metabolic profiles. In addition, we constructed strategic guidelines for the algorithmic identification of biochemical mechanisms from metabolomics data.
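
The contrast between absolute and relative changes can be illustrated with two simple distances between metabolic profiles. This is a sketch only: the relative-change distance below is one illustrative choice and is not claimed to be one of the metrics evaluated in the paper, and the metabolite concentrations are invented.

```python
import math


def euclidean_distance(reference, observed):
    """Euclidean distance on absolute metabolite levels."""
    return math.sqrt(sum((o - r) ** 2 for r, o in zip(reference, observed)))


def relative_change_distance(reference, observed):
    """Euclidean distance on changes relative to the reference levels.

    Reference levels must be nonzero.
    """
    return math.sqrt(sum(((o - r) / r) ** 2 for r, o in zip(reference, observed)))


# Hypothetical profiles: one abundant and one scarce metabolite.
healthy = [100.0, 1.0]
diseased = [110.0, 2.0]  # +10 percent and +100 percent, respectively

print(euclidean_distance(healthy, diseased))
print(relative_change_distance(healthy, diseased))
```

In this toy case the absolute distance is dominated by the 10 percent shift in the abundant metabolite, while the relative distance is dominated by the doubling of the scarce one, which is the bias toward large concentrations that the abstract refers to.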



Novel Methods for Microglia Segmentation, Feature Extraction, and Classification

12/07/2017 2:01 pm PST

Segmentation and analysis of histological images provide a valuable tool to gain insight into the biology and function of microglial cells in health and disease. Common image segmentation methods are not suitable for inhomogeneous histology image analysis, and accurate classification of microglial activation states has remained a challenge. In this paper, we introduce an automated image analysis framework capable of efficiently segmenting microglial cells from histology images and analyzing their morphology. The framework makes use of variational methods and the fast-split Bregman algorithm for image denoising and segmentation, and of multifractal analysis for feature extraction to classify microglia by their activation states. Experiments show that the proposed framework is accurate and scalable to large datasets and provides a useful tool for the study of microglial biology.



Multi-Block Bipartite Graph for Integrative Genomic Analysis

12/07/2017 2:02 pm PST

Human diseases involve a sequence of complex interactions between multiple biological processes. In particular, multiple genomic data such as Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), DNA Methylation (DM), and their interactions simultaneously play an important role in human diseases. However, despite the widely known complex multi-layer biological processes and increased availability of the heterogeneous genomic data, most research has considered only a single type of genomic data. Furthermore, recent integrative genomic studies for the multiple genomic data have also been facing difficulties due to the high-dimensionality and complexity, especially when considering their intra- and inter-block interactions. In this paper, we introduce a novel multi-block bipartite graph and its inference methods, MB2I and sMB2I, for the integrative genomic study. The proposed methods not only integrate multiple genomic data but also incorporate intra/inter-block interactions by using a multi-block bipartite graph. In addition, the methods can be used to predict quantitative traits (e.g., gene expression, survival time) from the multi-block genomic data. The performance was assessed by simulation experiments that implement practical situations. We also applied the method to the human brain data of psychiatric disorders. The experimental results were analyzed by maximum edge biclique and biclustering, and biological findings were discussed.



Soft Ngram Representation and Modeling for Protein Remote Homology Detection

12/07/2017 2:01 pm PST

Remote homology detection represents a central problem in bioinformatics, where the challenge is to detect functionally related proteins when their sequence similarity is low. Recent solutions employ representations derived from the sequence profile, obtained by replacing each amino acid of the sequence by the corresponding most probable amino acid in the profile. However, the information contained in the profile could be exploited more deeply, provided that there is a representation able to capture and properly model such crucial evolutionary information. In this paper, we propose a novel profile-based representation for sequences, called soft Ngram. This representation, which extends the traditional Ngram scheme (obtained by grouping N consecutive amino acids), permits considering all of the evolutionary information in the profile: this is achieved by extracting Ngrams from the whole profile, equipping them with a weight directly computed from the corresponding evolutionary frequencies. We illustrate two different approaches to model the proposed representation and to derive a feature vector, which can be effectively used for classification using a support vector machine (SVM). A thorough evaluation on three benchmarks demonstrates that the new approach outperforms other Ngram-based methods, and shows very promising results also in comparison with a broader spectrum of techniques.
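A minimal sketch of the soft Ngram idea follows. It assumes the profile is given as one frequency table per sequence position and that the weight of an Ngram is the product of the frequencies of its amino acids; the abstract only says the weight is computed from the evolutionary frequencies, so the product rule and the function name soft_ngrams are assumptions.

```python
from collections import defaultdict
from itertools import product

def soft_ngrams(profile, n):
    """Accumulate weighted Ngrams over every window of n profile columns.

    profile: list of dicts mapping amino acid -> frequency at that position.
    Weight of an Ngram = product of its per-position frequencies (assumed).
    """
    features = defaultdict(float)
    for start in range(len(profile) - n + 1):
        window = profile[start:start + n]
        # Enumerate every combination of amino acids present in the window.
        for combo in product(*(column.items() for column in window)):
            gram = "".join(aa for aa, _ in combo)
            weight = 1.0
            for _, freq in combo:
                weight *= freq
            features[gram] += weight
    return dict(features)

# Toy three-position profile (hypothetical frequencies).
profile = [{"A": 0.5, "G": 0.5}, {"L": 1.0}, {"K": 0.75, "R": 0.25}]
feats = soft_ngrams(profile, 2)
print(feats)
```

A traditional (hard) Ngram would keep only the most probable amino acid per position; here every amino acid with nonzero frequency contributes in proportion to its weight.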



Triangular Alignment (TAME): A Tensor-Based Approach for Higher-Order Network Alignment

12/07/2017 2:02 pm PST

Network alignment has extensive applications in comparative interactomics. Traditional approaches aim to simultaneously maximize the number of conserved edges and the underlying similarity of aligned entities. We propose a novel formulation of the network alignment problem that extends topological similarity to higher-order structures and provides a new objective function that maximizes the number of aligned substructures. This objective function corresponds to an integer programming problem, which is NP-hard. Consequently, we identify a closely related surrogate function whose maximization results in a tensor eigenvector problem. Based on this formulation, we present an algorithm called Triangular AlignMEnt (TAME), which attempts to maximize the number of aligned triangles across networks. Using a case study on the NAPAbench dataset, we show that triangular alignment is capable of producing mappings with high node correctness. We further evaluate our method by aligning yeast and human interactomes. Our results indicate that TAME outperforms state-of-the-art alignment methods in terms of conserved triangles. In addition, we show that the number of conserved triangles is more significantly correlated with node correctness and co-expression of edges than the number of conserved edges is. Our formulation and resulting algorithms can be easily extended to arbitrary motifs.
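The quantity being maximized, the number of triangles conserved under a node mapping, can be counted directly for small graphs. The sketch below only evaluates a given alignment; it is not the tensor eigenvector method of TAME, and the function names and toy graphs are hypothetical.

```python
from itertools import combinations

def triangles(edges):
    """Return the set of triangles (as frozensets of 3 nodes) in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                found.add(frozenset((u, v, w)))
    return found

def conserved_triangles(edges_g, edges_h, mapping):
    """Count triangles of G whose image under the mapping is a triangle of H."""
    tri_h = triangles(edges_h)
    count = 0
    for tri in triangles(edges_g):
        if all(node in mapping for node in tri):
            image = frozenset(mapping[node] for node in tri)
            if len(image) == 3 and image in tri_h:
                count += 1
    return count

g = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
h = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
good = {"a": "x", "b": "y", "c": "z", "d": "w"}
bad = {"a": "x", "b": "y", "c": "w", "d": "z"}
print(conserved_triangles(g, h, good), conserved_triangles(g, h, bad))
```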



Batch Mode TD($\lambda$) for Controlling Partially Observable Gene Regulatory Networks

12/07/2017 2:01 pm PST

External control of gene regulatory networks (GRNs) has received much attention in recent years. The aim is to find a series of actions to apply to a gene regulation system so that it avoids its diseased states. In this work, we propose a novel method for controlling partially observable GRNs that combines batch mode reinforcement learning (Batch RL) and TD($\lambda$) algorithms. Unlike existing studies, which infer a computational model from gene expression data and obtain a control policy over the constructed model, our idea is to interpret the time series gene expression data as a sequence of observations that the system produced, and to obtain an approximate stochastic policy directly from the gene expression data without estimating the internal states of the partially observable environment. We thereby eliminate the most time-consuming phases of the existing studies: inferring a model and running the model for control. Results show that our method is able to provide control solutions for regulation systems of several thousand genes in only seconds, whereas existing studies cannot solve control problems of even a few dozen genes. Results also show that our approximate stochastic policies are almost as good as the policies generated by the existing studies.



Significance and Functional Similarity for Identification of Disease Genes

12/07/2017 2:01 pm PST

One of the most significant research issues in functional genomics is in silico identification of disease-related genes. In this regard, the paper presents a new gene selection algorithm, termed SiFS, for identification of disease genes. It integrates the information obtained from the interaction network of proteins and from gene expression profiles. The proposed SiFS algorithm selects a subset of genes from microarray data as disease genes by maximizing both the significance and the functional similarity of the selected gene subset. Based on the gene expression profiles, the significance of a gene with respect to another gene is computed using mutual information. In addition, a new measure of similarity is introduced to compute the functional similarity between two genes. Information derived from the protein-protein interaction network forms the basis of the proposed SiFS algorithm. The performance of the proposed gene selection algorithm and of the new similarity measure is compared with that of other related methods and similarity measures, using several cancer microarray data sets.
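For readers unfamiliar with mutual information, the following sketch computes it for two discretized expression profiles. The discretization into levels and the function name are assumptions made for illustration; the abstract does not state how SiFS estimates mutual information.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

# Hypothetical expression levels discretized to 0 (low) and 1 (high) over four samples.
gene_1 = [0, 0, 1, 1]
gene_2 = [0, 0, 1, 1]  # fully dependent on gene_1
gene_3 = [0, 1, 0, 1]  # independent of gene_1

print(mutual_information(gene_1, gene_2), mutual_information(gene_1, gene_3))
```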



ML-Space: Hybrid Spatial Gillespie and Particle Simulation of Multi-Level Rule-Based Models in Cell Biology

12/07/2017 2:02 pm PST

Spatio-temporal dynamics of cellular processes can be simulated at different levels of detail, from (deterministic) partial differential equations via the spatial Stochastic Simulation Algorithm to tracking Brownian trajectories of individual particles. We present a spatial simulation approach for multi-level rule-based models, which includes dynamically hierarchically nested cellular compartments and entities. Our approach, ML-Space, combines discrete compartmental dynamics, stochastic spatial approaches in discrete space, and particles moving in continuous space. The rule-based specification language of ML-Space supports concise and compact descriptions of models and makes it easy to adapt their spatial resolution.



Effect of Aggregation Operators on Network-Based Disease Gene Prioritization: A Case Study on Blood Disorders

12/07/2017 2:01 pm PST

Owing to the innate noise in biological data sources, a single source or a single measure does not suffice for effective disease gene prioritization, so multiple data sources must be integrated or multiple measures aggregated. Aggregation operators combine multiple related data values into a single value such that the combined value reflects all the individual values. In this paper, we apply fuzzy aggregation to network-based disease gene prioritization and investigate its effect under noisy conditions. The study was conducted for a set of 15 blood disorders by fusing four different network measures, computed from the protein interaction network, using a selected set of aggregation operators and ranking the genes on the basis of the aggregated value. The aggregation operator-based rankings were compared with the “Random walk with restart” gene prioritization method. The impact of noise was also investigated by adding varying proportions of noise to the seed set. The results reveal that, for all the selected blood disorders, the Mean of Maximal operator outperformed the other aggregation operators for noisy as well as non-noisy data.



Collective Prediction of Disease-Associated miRNAs Based on Transduction Learning

12/07/2017 2:02 pm PST

The discovery of human disease-related miRNAs is a challenging problem in complex disease biology research. It is difficult for existing computational methods to achieve excellent performance when the known, experimentally verified miRNA-disease associations are sparse. Here, we develop CPTL, a Collective Prediction based on Transduction Learning, to systematically prioritize miRNAs related to disease. By combining disease similarity and miRNA similarity with known miRNA-disease associations, we construct a miRNA-disease network for predicting miRNA-disease associations. CPTL then calculates relevance scores and updates the network structure iteratively until a convergence criterion is reached. The relevance score of each node (miRNA or disease) is calculated by transduction learning based on its neighbors. The network structure is updated using the relevance scores, which increases the weight of important links. To show the effectiveness of our method, we compared CPTL with existing methods on HMDD datasets. Experimental results indicate that CPTL outperforms existing approaches in terms of AUC, precision, recall, and F1-score. Moreover, experiments performed with different numbers of iterations verify that CPTL converges well. We also analyze how varying the weighting parameters affects the predicted results. A case study on breast cancer further confirms the identification ability of CPTL.



Protein Inference from the Integration of Tandem MS Data and Interactome Networks

12/07/2017 2:02 pm PST

Since proteins are digested into a mixture of peptides in the preprocessing step of tandem mass spectrometry (MS), it is difficult to determine which specific protein a shared peptide belongs to. In recent studies, besides tandem MS data and peptide identification information, other information has been exploited to infer proteins. Unlike methods that first use only tandem MS data to infer proteins and then use network information to refine them, this study proposes a protein inference method named TMSIN, which uses interactome networks directly. As two interacting proteins should co-exist, it is reasonable to assume that if one of the interacting proteins is confidently inferred in a sample, its interacting partners should also have a high probability of being present in the same sample. Therefore, we can use the neighborhood information of a protein in an interactome network to adjust the probability that a shared peptide belongs to the protein. In TMSIN, a multi-weighted graph is constructed by incorporating interactome network information into the bipartite graph, which is built from the peptide identification information. Based on multi-weighted graphs, TMSIN adopts an iterative workflow to infer proteins. At each iterative step, the probability that a shared peptide belongs to a specific protein is calculated using Bayes’ law, based on the neighbor protein support scores of each protein to which the shared peptide maps. We carried out experiments on yeast data and human data to evaluate the performance of TMSIN in terms of ROC, q-value, and accuracy. The experimental results show that the AUC scores yielded by TMSIN are 0.742 and 0.874 on the yeast dataset and the human dataset, respectively, and that TMSIN yields the maximum number of true positives when the q-value is less than or equal to 0.05. The overlap analysis shows that TMSIN is an effective complementary approach for protein inference.



Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs

12/07/2017 2:02 pm PST

Protein-DNA interactions are ubiquitous in a wide variety of biological processes. Correctly locating DNA-binding residues solely from protein sequences is an important but challenging task for protein function annotations and drug discovery, especially in the post-genomic era where large volumes of protein sequences have quickly accumulated. In this study, we report a new predictor, named TargetDNA, for targeting protein-DNA binding residues from primary sequences. TargetDNA uses a protein's evolutionary information and its predicted solvent accessibility as two base features and employs a centered linear kernel alignment algorithm to learn the weights for weightedly combining the two features. Based on the weightedly combined feature, multiple initial predictors with SVM as classifiers are trained by applying a random under-sampling technique to the original dataset, the purpose of which is to cope with the severe imbalance phenomenon that exists between the number of DNA-binding and non-binding residues. The final ensembled predictor is obtained by boosting the multiple initially trained predictors. Experimental simulation results demonstrate that the proposed TargetDNA achieves a high prediction performance and outperforms many existing sequence-based protein-DNA binding residue predictors. The TargetDNA web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/TargetDNA/ for academic use.



Learning Parameter-Advising Sets for Multiple Sequence Alignment

10/06/2017 2:00 pm PST

While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy. In this paper, we consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.
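The advisor described above reduces to an argmax over the estimator, and a candidate set is scored by the true accuracy of the choices the advisor makes. The sketch below shows only this evaluation step; the parameter names and accuracy values are invented, and the learning of the set itself (the approximation algorithm in the paper) is not reproduced.

```python
def advise(param_set, estimated):
    """Pick the parameter choice whose alignment has the highest estimated accuracy."""
    return max(param_set, key=lambda p: estimated[p])

def advisor_accuracy(param_set, benchmarks):
    """Average true accuracy of the advisor's picks over a collection of benchmarks.

    benchmarks: list of (estimated, true) pairs, each a dict keyed by parameter choice.
    """
    total = 0.0
    for estimated, true in benchmarks:
        total += true[advise(param_set, estimated)]
    return total / len(benchmarks)

# Hypothetical training data for three parameter choices on two benchmarks.
benchmarks = [
    ({"default": 0.6, "p1": 0.8, "p2": 0.5}, {"default": 0.5, "p1": 0.9, "p2": 0.4}),
    ({"default": 0.7, "p1": 0.4, "p2": 0.9}, {"default": 0.6, "p1": 0.3, "p2": 0.7}),
]

only_default = advisor_accuracy(["default"], benchmarks)
two_choices = advisor_accuracy(["default", "p1"], benchmarks)
three_choices = advisor_accuracy(["default", "p1", "p2"], benchmarks)
print(only_default, two_choices, three_choices)
```

Because the advisor sees only estimated accuracy, a larger set helps only insofar as the estimator ranks the added choices well on the inputs where they are truly better.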



Pathway Analysis with Signaling Hypergraphs

10/06/2017 2:01 pm PST

Signaling pathways play an important role in the cell’s response to its environment. Signaling pathways are often represented as directed graphs, which are not adequate for modeling reactions such as complex assembly and dissociation, combinatorial regulation, and protein activation/inactivation. More accurate representations such as directed hypergraphs remain underutilized. In this paper, we present an extension of a directed hypergraph that we call a signaling hypergraph. We formulate a problem that asks what proteins and interactions must be involved in order to stimulate a specific response downstream of a signaling pathway. We relate this problem to computing the shortest acyclic $B$-hyperpath in a signaling hypergraph, an NP-hard problem, and present a mixed integer linear program to solve it. We demonstrate that the shortest hyperpaths computed in signaling hypergraphs are far more informative than shortest paths, Steiner trees, and subnetworks containing many short paths found in corresponding graph representations. Our results illustrate the potential of signaling hypergraphs as an improved representation of signaling pathways and motivate the development of novel hypergraph algorithms.



A Sparse Learning Framework for Joint Effect Analysis of Copy Number Variants

10/06/2017 2:01 pm PST

Copy number variants (CNVs), including large deletions and duplications, represent an unbalanced change of DNA segments. Abundant in human genomes, CNVs contribute to a large proportion of human genetic diversity, with impact on many human phenotypes. Although recent advances in genetic studies have shed light on the impact of individual CNVs on different traits, the analysis of the joint effect of multiple interactive CNVs lags behind in many respects. A primary reason is that the large number of CNV combinations and interactions in the human genome makes it computationally challenging to perform such joint analysis. To address this challenge, we developed a novel framework that combines sparse learning with biological networks to identify interacting CNVs with a joint effect on particular traits. We showed that our approach performs well in identifying CNVs with a joint phenotypic effect using simulated data. Applied to a real human genomic dataset from the 1000 Genomes Project, our approach identified multiple CNVs that collectively contribute to population differentiation. We found a set of multiple CNVs that have a joint effect in different populations and affect gene expression differently in distinct populations. These results provide a collection of CNVs that likely have downstream biomedical implications in individuals from diverse population backgrounds.



Improving Identification of Key Players in Aging via Network De-Noising and Core Inference

10/06/2017 2:01 pm PST

Current “ground truth” knowledge about human aging has been obtained by transferring aging-related knowledge from well-studied model species via sequence homology or by studying human gene expression data. Since proteins function by interacting with each other, analyzing protein-protein interaction (PPI) networks in the context of aging is promising. Unlike existing static network research on aging, and because cellular functioning is dynamic, we recently integrated the static human PPI network with aging-related gene expression data to form dynamic, age-specific networks. Then, we predicted as key players in aging those proteins whose network topologies significantly changed with age. Since current networks are noisy, here we use link prediction to de-noise the human network and predict improved key players in aging from the de-noised data. Indeed, de-noising gives more significant overlap between the predicted data and the “ground truth” aging-related data. Yet, we obtain novel predictions, which we validate in the literature. Also, we improve the predictions by an alternative strategy: removing “redundant” edges from the age-specific networks and using the resulting age-specific network “cores” to study aging. We produce new knowledge from dynamic networks encompassing multiple data types, via network de-noising or core inference, complementing the existing knowledge obtained from sequence or expression data.



Predicting nsSNPs that Disrupt Protein-Protein Interactions Using Docking

10/06/2017 2:01 pm PST

The human genome contains a large number of protein polymorphisms due to individual genome variation. How many of these polymorphisms lead to altered protein-protein interaction is unknown. We have developed a method to address this question. The intersection of the SKEMPI database (of affinity constants among interacting proteins) and the CAPRI 4.0 docking benchmark was docked using HADDOCK, leading to a training set of 166 mutant pairs. A random forest classifier based on the differences in resulting docking scores between the 166 mutant pairs and their wild-types was used to distinguish between variants that have either completely or partially lost binding ability. Fifty percent of non-binders were correctly predicted with a false discovery rate of only 2 percent. The model was tested on a set of 15 HIV-1–human, as well as seven human-human glioblastoma-related, mutant protein pairs: 50 percent of combined non-binders were correctly predicted with a false discovery rate of 10 percent. The model was also used to identify 10 protein-protein interactions between human proteins and their HIV-1 partners that are likely to be abolished by rare non-synonymous single-nucleotide polymorphisms (nsSNPs). These nsSNPs may represent novel and potentially therapeutically valuable targets for anti-viral therapy by disruption of viral binding.



An Annotation Agnostic Algorithm for Detecting Nascent RNA Transcripts in GRO-Seq

10/06/2017 2:01 pm PST

We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq). GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady state RNA levels which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, presents unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without a priori need for annotation uncovers surprising new insights into several aspects of the transcription process.



Unconstrained Diameters for Deep Coalescence

10/06/2017 2:01 pm PST

The minimizing-deep-coalescence (MDC) approach infers a median (species) tree for a given set of gene trees under the deep coalescence cost. This cost accounts for the minimum number of deep coalescences needed to reconcile a gene tree with a species tree, where the leaf-genes are mapped to the leaf-species through a function called leaf labeling. In order to better understand the MDC approach, we investigate here the diameter of a gene tree, which is an important property of the deep coalescence cost. This diameter is the maximal deep coalescence cost for a given gene tree under all leaf labelings for each possible species tree topology. While we prove that this diameter is generally infinite, this result relies on the diameter’s unrealistic assumption that species trees can be of infinite size. Providing a more practical definition, we introduce a natural extension of the gene tree diameter that constrains the species tree size by a given constant. For this new diameter, we describe an exact formula, present a complete classification of the trees yielding this diameter, derive formulas for its mean and variance, and demonstrate its ability using comparative studies.



IsAProteinDB: An Indexed Database of Trypsinized Proteins for Fast Peptide Mass Fingerprinting

10/06/2017 2:01 pm PST

In peptide mass fingerprinting, an unknown protein is fragmented into smaller peptides whose masses are accurately measured; the obtained list of weights is then compared with a reference database to obtain a set of matching proteins. The exponential growth in the number of known proteins discourages the use of brute force methods, where the list of weights is compared with each protein in the reference collection; fortunately, the scientific literature in the database field shows that well-designed search algorithms, coupled with proper data organization, make it possible to solve the identification problem quickly even on standard desktop computers. In this paper, IsAProteinDB, an indexed database of trypsinized proteins, is presented. The corresponding search algorithm has a time complexity that does not depend significantly on the size of the reference protein database.
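The general idea of indexed peptide mass lookup can be sketched as follows. This is not the IsAProteinDB index; it assumes the common trypsin rule (cleave after K or R unless the next residue is P), standard monoisotopic residue masses, and a sorted list searched by bisection. The protein names, sequences, and tolerance are hypothetical.

```python
import bisect
from collections import Counter

# Monoisotopic residue masses in daltons; a peptide adds one water.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def trypsinize(seq):
    """Split after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER

def build_index(proteins):
    """Sorted list of (peptide mass, protein name) pairs."""
    return sorted((peptide_mass(p), name)
                  for name, seq in proteins.items() for p in trypsinize(seq))

def match(index, masses, tol=0.01):
    """Count, per protein, how many measured masses hit one of its peptides."""
    hits = Counter()
    for m in masses:
        i = bisect.bisect_left(index, (m - tol,))  # O(log n) per measured mass
        matched = set()
        while i < len(index) and index[i][0] <= m + tol:
            matched.add(index[i][1])
            i += 1
        hits.update(matched)
    return hits

proteins = {"P1": "AKGLR", "P2": "GGKAAR"}
index = build_index(proteins)
measured = [peptide_mass("AK"), peptide_mass("GLR")]
hits = match(index, measured)
print(hits.most_common(1))
```

Each lookup costs a binary search plus the number of hits, so the per-query time grows only logarithmically with the number of indexed peptides, in contrast to scanning every protein.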



An Eigen-Binding Site Based Method for the Analysis of Anti-EGFR Drug Resistance in Lung Cancer Treatment

10/06/2017 2:01 pm PST

We explore the drug resistance mechanism in non-small cell lung cancer treatment by characterizing the drug-binding site of a protein mutant based on local surface and energy features. These features are transformed to an eigen-binding site space and used for drug resistance level prediction and analysis.



An Algorithm for Motif-Based Network Design

10/06/2017 2:01 pm PST

A determinant property of the structure of a biological network is the distribution of local connectivity patterns, i.e., network motifs. In this work, a method for creating directed, unweighted networks while promoting a certain combination of motifs is presented. This motif-based network algorithm starts with an empty graph and randomly connects the nodes by advancing or discouraging the formation of chosen motifs. The in- or out-degree distribution of the generated networks can be explicitly chosen. The algorithm is shown to perform well in producing networks with high occurrences of the targeted motifs, both ones consisting of three nodes as well as ones consisting of four nodes. Moreover, the algorithm can also be tuned to bring about global network characteristics found in many natural networks, such as small-worldness and modularity.



Deep Conditional Random Field Approach to Transmembrane Topology Prediction and Application to GPCR Three-Dimensional Structure Modeling

10/06/2017 2:01 pm PST

Transmembrane proteins play important roles in cellular energy production, signal transmission, and metabolism. Many shallow machine learning methods have been applied to transmembrane topology prediction, but their performance has been limited by the large size of membrane proteins and the complex biological evolution information behind the sequence. In this paper, we propose a novel deep approach based on conditional random fields, named dCRF-TM, for predicting the topology of transmembrane proteins. Conditional random fields take into account more complicated interrelations between residue labels in the full-length sequence than HMM- and SVM-based methods. Three widely used datasets were employed in the benchmark. dCRF-TM achieved an accuracy of 95 percent for helix location prediction and 78 percent for helix number prediction. dCRF-TM demonstrated more robust performance on large proteins (>350 residues) than 11 state-of-the-art predictors. Further, dCRF-TM was applied to ab initio modeling of the three-dimensional structures of seven-transmembrane receptors, also known as G protein-coupled receptors. The predictions on 24 solved G protein-coupled receptors and the unsolved vasopressin V2 receptor illustrated that dCRF-TM helped abGPCR-I-TASSER improve the TM-score by 34.3 percent compared with using a random transmembrane definition. Two out of five predicted models captured the experimentally verified disulfide bonds in the vasopressin V2 receptor.



hMuLab: A Biomedical Hybrid MUlti-LABel Classifier Based on Multiple Linear Regression

10/06/2017 2:01 pm PST

Many biomedical classification problems are multi-label by nature, e.g., a gene involved in a variety of functions or a patient with multiple diseases. The majority of existing classification algorithms assume that each sample has only one class label, and the multi-label classification problem remains a challenge for biomedical researchers. This study proposes a novel multi-label learning algorithm, hMuLab, that integrates both feature-based and neighbor-based similarity scores. The multiple linear regression modeling techniques make hMuLab capable of producing multiple label assignments for a query sample. Comparison results over six commonly used multi-label performance measurements suggest that hMuLab performs accurately and stably on the biomedical datasets and may serve as a complement to the existing literature.



Identifying Stages of Kidney Renal Cell Carcinoma by Combining Gene Expression and DNA Methylation Data

10/06/2017 2:01 pm PST

In this study, in order to take advantage of complementary information from different types of data for better disease status diagnosis, we combined gene expression with DNA methylation data and generated a fused network, based on which the stages of Kidney Renal Cell Carcinoma (KIRC) can be better identified. It is well recognized that a network is important for investigating the connectivity of disease groups. We exploited the potential of the network's features to identify the KIRC stage. We first constructed a patient network from each type of data. We then built a fused network using a network fusion method. Based on the link weights of patients, we used a generalized linear model to predict the group of KIRC subjects. Finally, the group prediction method was applied to test the power of network-based features. The performance (e.g., the accuracy of identifying cancer stages) when using the fused network from two types of data is shown to be superior to that when using either patient network built from only one data type. The work provides a good example of using network-based features from multiple data types for a more comprehensive diagnosis.



Classification of Protein Structure Classes on Flexible Neutral Tree

10/06/2017 2:01 pm PST

Accurate classification of protein structures plays an important role in bioinformatics. Increasing evidence shows that a variety of classification methods have been employed in this field. In this research, the features used are amino acid composition, secondary structure features, and the correlation coefficients of amino acid dimers and amino acid triplets. Flexible neutral tree (FNT), a particular tree-structured neural network, is employed as the classification model in the protein structure classification framework. Because different feature groups play different roles in the model, impact factors for the different groups are put forward in this research. To evaluate the different impact factors, the Impact Factors Scaling (IFS) algorithm, which aims to reduce redundant information in the selected features to some degree, is put forward. To examine the performance of this framework, the 640, 1189, and ASTRAL datasets are employed as low-homology protein structure benchmark datasets. Experimental results demonstrate that the proposed method performs better than the other methods on low-homology protein tertiary structures.



Nonconvex Penalty Based Low-Rank Representation and Sparse Regression for eQTL Mapping

10/06/2017 2:01 pm PST

This paper addresses the problem of accounting for confounding factors and expression quantitative trait loci (eQTL) mapping in the study of SNP-gene associations. The existing convex penalty based algorithm has limited capacity to keep the main information of a matrix while reducing its rank. We present an algorithm, NCLRS, which uses nonconvex penalty based low-rank representation to account for confounding factors and sparse regression for eQTL mapping. The efficiency of the presented algorithm is evaluated by comparing the results of 18 synthetic datasets given by NCLRS and presented algorithm, respectively. The experimental results on a biological dataset show that our approach is a more effective tool to account for non-genetic effects than currently existing methods.



Cancer Subtype Discovery Based on Integrative Model of Multigenomic Data

10/06/2017 2:00 pm PST

One major goal of large-scale cancer omics studies is to understand the molecular mechanisms of cancer and find new biomedical targets. To deal with high-dimensional, multidimensional cancer omics data (DNA methylation, mRNA expression, etc.), which can yield new insight into cancer subtypes, clustering methods are usually used to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and large data volume, few methods can integrate these data and map them into an effective low-dimensional subspace. In this paper, we develop a dimension-reduction and data-integration method for identifying cancer subtypes, named Scluster. First, Scluster projects each type of original data into its principal subspace by an adaptive sparse reduced-rank regression method. Then, a fused patient-by-patient network is obtained for these subgroups through a scaled exponential similarity kernel method. Finally, candidate cancer subtypes are identified using a spectral clustering method. We demonstrate the efficiency of our Scluster method on three cancers by jointly analyzing mRNA expression, miRNA expression, and DNA methylation data. The evaluation results and analyses show that Scluster is effective for predicting survival and identifies novel cancer subtypes in large-scale multi-omics data.
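
The scaled exponential similarity kernel mentioned above can be sketched as follows. This is a minimal illustration that follows the kernel as it is commonly defined in the similarity network fusion literature; whether Scluster uses exactly this form is an assumption, and the function name, the parameters `k` and `mu`, and the toy distances are all hypothetical.

```python
import math

def scaled_exponential_similarity(dist, k=2, mu=0.5):
    """Turn a patient-by-patient distance matrix into a similarity matrix.

    Assumes k <= n - 1 and that not all distances are zero.
    """
    n = len(dist)
    # Mean distance from each patient to its k nearest neighbours (self excluded).
    knn_mean = []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        knn_mean.append(sum(others[:k]) / k)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Local scale: average of both neighbourhood scales and the pair distance.
            eps = (knn_mean[i] + knn_mean[j] + dist[i][j]) / 3.0
            sim[i][j] = math.exp(-dist[i][j] ** 2 / (mu * eps))
    return sim

# Toy data: four patients described by one value each (two tight pairs).
values = [0.0, 1.0, 5.0, 6.0]
toy_dist = [[abs(a - b) for b in values] for a in values]
toy_sim = scaled_exponential_similarity(toy_dist)
```

Because the scale `eps` adapts to each pair's neighbourhood, patients in the same tight pair receive a much higher similarity than patients in different pairs, which is the property a downstream spectral clustering step relies on.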



Exploring Consensus RNA Substructural Patterns Using Subgraph Mining

10/06/2017 2:00 pm PST

Frequently recurring RNA structural motifs play important roles in the RNA folding process and in interactions with other molecules. Traditional index-based and shape-based schemas are useful in modeling RNA secondary structures but ignore the structural discrepancies of individual RNA family members. Further, in-depth analysis of underlying substructure patterns is insufficient due to varied and unnormalized substructure data. This prevents us from understanding RNA functions and their inherent synergistic regulation networks. This article thus proposes a novel labeled graph-based algorithm, RnaGraph, to uncover frequent RNA substructure patterns. Attribute data and graph data are combined to characterize diverse substructures and their correlations, respectively. Further, a top-k graph pattern mining algorithm is developed to extract interesting substructure motifs by integrating frequency and similarity. The experimental results show that our methods assist not only in modeling complex RNA secondary structures but also in identifying hidden but interesting RNA substructure patterns.



PSPEL: In Silico Prediction of Self-Interacting Proteins from Amino Acids Sequences Using Ensemble Learning

10/06/2017 2:01 pm PST

Self-interacting proteins (SIPs) play an important role in various aspects of the structural and functional organization of the cell. Detecting SIPs is one of the most important issues in current molecular biology. Although a large amount of SIP data has been generated by experimental methods, wet laboratory approaches are both time-consuming and costly. In addition, they yield high false negative and false positive rates. Thus, there is a great need for in silico methods to predict SIPs accurately and efficiently. In this study, a new sequence-based method is proposed to predict SIPs. The evolutionary information contained in the Position-Specific Scoring Matrix (PSSM) is extracted from proteins with known sequence. Then, features are fed to an ensemble classifier to distinguish self-interacting from non-self-interacting proteins. When applied to Saccharomyces cerevisiae and Human SIPs data sets, the proposed method achieves high accuracies of 86.86 and 91.30 percent, respectively. Our method also shows good performance when compared with the SVM classifier and previous methods. Consequently, the proposed method can be considered a novel and promising tool to predict SIPs.



IPED2: Inheritance Path Based Pedigree Reconstruction Algorithm for Complicated Pedigrees

10/06/2017 2:01 pm PST

Reconstruction of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. The problem is known to be NP-hard even for datasets known to contain only siblings. Some recent methods have been developed to accurately and efficiently reconstruct pedigrees. These methods, however, still consider relatively simple pedigrees; for example, they are not able to handle half-sibling situations where a pair of individuals share only one parent. In this work, we propose an efficient method, IPED2, based on our previous work, which specifically targets reconstruction of complicated pedigrees that include half-siblings. We note that the presence of half-siblings makes the reconstruction problem significantly more challenging, which is why previous methods exclude the possibility of half-siblings. We propose a novel model as well as an efficient graph algorithm, and experiments show that our algorithm achieves relatively accurate reconstruction. To our knowledge, this is the first method that is able to handle pedigree reconstruction from genotype data when half-siblings exist in any generation of the pedigree.









cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU

08/07/2017 2:05 pm PST

BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speed up the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.



Omics Informatics: From Scattered Individual Software Tools to Integrated Workflow Management Systems

08/09/2017 2:02 pm PST

Omic data analyses pose great informatics challenges. As an emerging subfield of bioinformatics, omics informatics focuses on analyzing multi-omic data efficiently and effectively, and is gaining momentum. There are two underlying trends in the expansion of the omics informatics landscape: the explosion of scattered individual omics informatics tools, each of which focuses on a specific task in single- or multi-omic settings, and the fast-evolving integrated software platforms, such as workflow management systems, that can assemble multiple tools into pipelines and streamline integrative analysis for complicated tasks. In this survey, we give a holistic view of omics informatics, from scattered individual informatics tools to integrated workflow management systems. We not only outline the landscape and challenges of omics informatics, but also sample a number of widely used and cutting-edge algorithms in omics data analysis to give readers a fine-grained view. We survey various workflow management systems (WMSs), classify them into three levels of WMSs from simple software toolkits to integrated multi-omic analytical platforms, and point out the emerging needs for developing intelligent workflow management systems. We also discuss the challenges, strategies, and some existing work in systematic evaluation of omics informatics tools. We conclude by providing future perspectives of emerging fields and new frontiers in omics informatics.



An IR-Based Approach Utilizing Query Expansion for Plagiarism Detection in MEDLINE

08/07/2017 2:06 pm PST

The identification of duplicated and plagiarized passages of text has become an increasingly active area of research. In this paper, we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection (the identification of a subset of potential source documents given a suspicious text) from MEDLINE. Query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.



Super-Thresholding: Supervised Thresholding of Protein Crystal Images

08/07/2017 2:05 pm PST

In general, a single thresholding technique is developed or enhanced to separate foreground objects from background for a domain of images. This idea may not generate satisfactory results for all images in a dataset, since different images may require different types of thresholding methods for proper binarization or segmentation. To overcome this limitation, in this study, we propose a novel approach called “super-thresholding” that utilizes a supervised classifier to decide an appropriate thresholding method for a specific image. This method provides a generic framework that allows selection of the best thresholding method among different thresholding techniques that are beneficial for the problem domain. A classifier model is built using features extracted a priori from the original image only or a posteriori by analyzing the outputs of thresholding methods and the original image. This model is applied to identify the thresholding method for new images of the domain. We applied our method to protein crystallization images, and then compared our results with six thresholding techniques. Numerical results are provided using four different correctness measurements. Super-thresholding outperforms the best single thresholding method by around 10 percent, and it gives the best performance on the protein crystallization dataset in our experiments.
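
The idea of letting a supervised classifier pick the thresholding method per image can be sketched as below. This is a minimal illustration, not the authors' implementation: the two candidate thresholders, the mean/standard-deviation features, the 1-nearest-neighbour classifier, and the toy pixel lists are all assumptions chosen to keep the example short.

```python
from statistics import mean, pstdev

# Two simple candidate thresholding methods (stand-ins for real techniques).
THRESHOLDERS = {
    "mean": lambda px: mean(px),
    "midrange": lambda px: (min(px) + max(px)) / 2,
}

def image_features(pixels):
    """A priori features, computed from the original image only."""
    return (mean(pixels), pstdev(pixels))

def choose_method(pixels, training_set):
    """1-nearest-neighbour classifier over (features, best_method) pairs."""
    fx = image_features(pixels)

    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(fx, item[0]))

    return min(training_set, key=sq_dist)[1]

def super_threshold(pixels, training_set):
    """Pick a method for this image, then binarize with it."""
    method = choose_method(pixels, training_set)
    t = THRESHOLDERS[method](pixels)
    return method, [1 if p > t else 0 for p in pixels]

# Hypothetical training images, each labelled with the method that worked best.
training_set = [
    (image_features([10, 12, 11, 200]), "midrange"),
    (image_features([50, 60, 150, 160]), "mean"),
]
method, mask = super_threshold([12, 11, 13, 210], training_set)
```

In practice the labels in `training_set` would come from comparing each method's output against ground-truth segmentations, and any classifier could replace the nearest-neighbour rule.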



Prediction of Novel Drugs for Hepatocellular Carcinoma Based on Multi-Source Random Walk

08/07/2017 2:06 pm PST

Computational approaches for predicting drug-disease associations by integrating gene expression and biological networks provide great insight into the complex relationships among drugs, targets, disease genes, and diseases at a system level. Hepatocellular carcinoma (HCC) is one of the most common malignant tumors, with a high rate of morbidity and mortality. We provide an integrative framework to predict novel drugs for HCC based on multi-source random walk (PD-MRW). First, based on gene expression and a protein interaction network, we construct a gene-gene weighted interaction network (GWIN). Then, based on multi-source random walk in GWIN, we build a drug-drug similarity network. Finally, based on the known drugs for HCC, we score all drugs in the drug-drug similarity network. The robustness of our predictions, their overlap with those reported in the Comparative Toxicogenomics Database (CTD) and the literature, and their enriched KEGG pathways demonstrate that our approach can effectively identify new drug indications. Specifically, regorafenib (Rank = 9 in the top-20 list) is proven to be effective in Phase I and II clinical trials of HCC, and the Phase III trial is ongoing. It also has 11 overlapping pathways with HCC with lower p-values. Focusing on a particular disease, we believe our approach is more accurate and possesses better scalability.



SuperMIC: Analyzing Large Biological Datasets in Bioinformatics with Maximal Information Coefficient

08/07/2017 2:06 pm PST

The maximal information coefficient (MIC) has been proposed to discover relationships and associations between pairs of variables. Accelerating the MIC calculation poses significant challenges for bioinformatics scientists, especially in genome sequencing and biological annotations. In this paper, we explore a parallel approach which uses the MapReduce framework to improve the computing efficiency and throughput of the MIC computation. The acceleration system includes biological data storage on HDFS, preprocessing algorithms, a distributed memory cache mechanism, and the partition of MapReduce jobs. Based on the acceleration approach, we extend the traditional two-variable algorithm to a multiple-variable algorithm. The experimental results show that our parallel solution provides a linear speedup compared with the original algorithm without affecting correctness and sensitivity.



Inferring MicroRNA-Disease Associations by Random Walk on a Heterogeneous Network with Multiple Data Sources

08/07/2017 2:06 pm PST

Since the discovery of the regulatory function of microRNA (miRNA), increased attention has focused on identifying the relationship between miRNA and disease. It has been suggested that computational methods are an efficient way to identify potential disease-related miRNAs for further confirmation using biological experiments. In this paper, we first highlight three limitations commonly associated with previous computational methods. To resolve these limitations, we established a disease similarity subnetwork and a miRNA similarity subnetwork by integrating multiple data sources, where the disease similarity is composed of disease semantic similarity and disease functional similarity, and the miRNA similarity is calculated using the miRNA-target gene and miRNA-lncRNA (long non-coding RNA) associations. Then, a heterogeneous network was constructed by connecting the disease similarity subnetwork and the miRNA similarity subnetwork using the known miRNA-disease associations. We extended random walk with restart to predict miRNA-disease associations in the heterogeneous network. The leave-one-out cross-validation achieved an average area under the curve (AUC) of $0.8049$ across $341$ diseases and $476$ miRNAs. For five-fold cross-validation, our method achieved an AUC from $0.7970$ to $0.9249$ for $15$ human diseases. Case studies further demonstrated the feasibility of our method to discover potential miRNA-disease associations. An online service for prediction is freely available at http://ifmda.aliapp.com.[...]
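
For readers unfamiliar with random walk with restart, the basic (unextended) iteration can be sketched as follows. This is the textbook form on a single adjacency matrix, not the authors' extended version for heterogeneous networks; the function name, the restart probability of 0.7, and the four-node toy network are assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node."""
    n = len(adjacency)
    # Column-normalize so each column holds transition probabilities out of a node.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Initial distribution: uniform over the seed nodes.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # One step: walk along an edge with prob. (1 - restart), else jump back to a seed.
        nxt = [
            (1.0 - restart) * sum(transition[i][j] * p[j] for j in range(n))
            + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy network: nodes 0-1 stand for miRNAs, nodes 2-3 for diseases (hypothetical).
toy_adjacency = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
scores = random_walk_with_restart(toy_adjacency, seeds={2})
```

Starting from disease node 2, the resulting scores rank the remaining nodes by network proximity to the seed; candidate miRNAs would be prioritized by their score.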



Identifying Cell Populations in Flow Cytometry Data Using Phenotypic Signatures

08/07/2017 2:05 pm PST

Single-cell flow cytometry is a technology that measures the expression of several cellular markers simultaneously for a large number of cells. Identification of homogeneous cell populations, currently done by manual biaxial gating, is highly subjective and time consuming. To overcome the shortcomings of manual gating, automatic algorithms have been proposed. However, the performance of these methods highly depends on the shape of populations and the dimension of the data. In this paper, we have developed a time-efficient method that accurately identifies cellular populations. This is done based on a novel technique that estimates the initial number of clusters in high dimension and identifies the final clusters by merging clusters using their phenotypic signatures in low dimension. The proposed method is called SigClust. We have applied SigClust to four public datasets and compared it with five well known methods in the field. The results are promising and indicate higher performance and accuracy compared to similar approaches reported in literature.



ISEA: Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution

08/07/2017 2:06 pm PST

The purpose of de novo assembly is to report more contiguous, more complete, and less error-prone contigs. Thanks to the advent of next generation sequencing (NGS) technologies, the cost of producing high-depth reads has been greatly reduced. However, due to the disadvantages of NGS, de novo assembly has to face the difficulties brought by repeat regions, error rate, and low sequencing coverage in some regions. Although many de novo algorithms have been proposed to solve these problems, de novo assembly still remains a challenge. In this article, we developed an iterative seed-extension algorithm for de novo assembly, called ISEA. To avoid the negative impact induced by error rate, ISEA utilizes read overlap and paired-end information to correct erroneous reads before assembly. While extending seeds in a De Bruijn graph, ISEA uses an elaborately designed score function based on paired-end information and the distribution of insert size to solve the repeat region problem. By employing the distribution of insert size, the score function can also reduce the influence of erroneous reads. In scaffolding, ISEA adopts a relaxed strategy to join contigs that were terminated for low coverage during the extension. The performance of ISEA was compared with six previous popular assemblers on four real datasets. The experimental results demonstrate that ISEA can effectively obtain longer and more accurate scaffolds.
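
Seed extension in a De Bruijn graph, and the point at which a repeat forces a decision, can be sketched as follows. This is a generic illustration rather than ISEA itself: the function names and toy reads are hypothetical, and the sketch simply stops at a branch, which is where ISEA's score function based on paired-end information and insert size would instead choose a successor.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def extend_seed(graph, seed, max_steps=100):
    """Greedily extend a (k-1)-mer seed while the next node is unambiguous."""
    contig, node = seed, seed
    for _ in range(max_steps):
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch (e.g. a repeat): a scoring rule is needed here
        node = successors.pop()
        contig += node[-1]
    return contig

toy_reads = ["ACGTAC", "GTACGG"]
toy_graph = build_de_bruijn(toy_reads, k=4)
contig = extend_seed(toy_graph, "CGT")
```

In this toy graph the node `ACG` has two distinct successors, so the extension from `CGT` halts when it reaches `ACG`.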



Search for a Minimal Set of Parameters by Assessing the Total Optimization Potential for a Dynamic Model of a Biochemical Network

08/07/2017 2:06 pm PST

Selecting an efficient small set of adjustable parameters to improve metabolic features of an organism is important for reducing implementation costs and the risk of unpredicted side effects. In practice, to avoid the analysis of a huge combinatorial space of possible sets of adjustable parameters, experience- and intuition-based subsets of parameters are often chosen, possibly leaving some interesting counter-intuitive combinations of parameters unrevealed. A combinatorial scan of possible adjustable parameter combinations at the model optimization level is possible; however, the number of analyzed combinations is still limited. The total optimization potential (TOP) approach is proposed to assess the full potential for increasing the value of the objective function by optimizing all possible adjustable parameters. This seemingly impractical combination of adjustable parameters allows assessing the maximum attainable value of the objective function and stopping the combinatorial space scanning when the desired fraction of TOP is reached and any further increase in the number of adjustable parameters cannot bring any reasonable improvement. The relation between the number of adjustable parameters and the reachable fraction of TOP is a valuable guideline in choosing a rational solution for industrial implementation. The TOP approach is demonstrated on the basis of two case studies.
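
The stopping rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the fraction of TOP is taken to be the improvement over a baseline objective value divided by the improvement achieved when all parameters are optimized, and the function name, the 0.9 target, and the toy numbers are hypothetical.

```python
def smallest_sufficient_set(best_by_size, baseline, top_value, target_fraction=0.9):
    """Return (k, fraction) for the smallest k whose best combination reaches the target.

    best_by_size maps the number of adjustable parameters k to the best
    objective value found among combinations of that size.
    """
    potential = top_value - baseline  # total optimization potential
    for k in sorted(best_by_size):
        fraction = (best_by_size[k] - baseline) / potential
        if fraction >= target_fraction:
            return k, fraction
    return None  # target not reached by any scanned size

# Hypothetical scan results: objective value 1.0 unoptimized, 3.0 with all parameters.
scan = {1: 1.8, 2: 2.5, 3: 2.9}
result = smallest_sufficient_set(scan, baseline=1.0, top_value=3.0)
```

With these toy numbers, one and two parameters reach 40 and 75 percent of the potential, so the scan would stop at three parameters.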



Finite-Time Stability Analysis of Reaction-Diffusion Genetic Regulatory Networks with Time-Varying Delays

08/07/2017 2:05 pm PST

This paper is concerned with the finite-time stability problem of the delayed genetic regulatory networks (GRNs) with reaction-diffusion terms under Dirichlet boundary conditions. By constructing a Lyapunov–Krasovskii functional including quad-slope integrations, we establish delay-dependent finite-time stability criteria by employing the Wirtinger-type integral inequality, Gronwall inequality, convex technique, and reciprocally convex technique. In addition, the obtained criteria are also reaction-diffusion-dependent. Finally, a numerical example is provided to illustrate the effectiveness of the theoretical results.



A New Feature Vector Based on Gene Ontology Terms for Protein-Protein Interaction Prediction

08/07/2017 2:05 pm PST

Protein-protein interaction (PPI) plays a key role in understanding cellular mechanisms in different organisms. Many supervised classifiers like Random Forest (RF) and Support Vector Machine (SVM) have been used for intra- or inter-species interaction prediction. To improve prediction performance, in this paper we propose a novel set of features to represent a protein pair using their annotated Gene Ontology (GO) terms, including their ancestors. In our approach, a protein pair is treated as a document (bag of words), where the terms annotating the two proteins represent the words. The feature value of each word is calculated as the information content of the corresponding term multiplied by a coefficient, which represents the weight of that term inside a document (i.e., a protein pair). We have tested the performance of the classifier using the proposed feature on different well-known data sets of different species like S. cerevisiae, H. sapiens, E. coli, and D. melanogaster. We compare it with the other GO-based feature representation technique, and demonstrate its competitive performance.
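
The bag-of-words representation described above can be sketched as follows. The abstract does not specify the coefficient, so the weighting used here (1.0 when a term annotates both proteins, 0.5 when it annotates only one) is purely a placeholder assumption, as are the function name and the toy annotation counts; the term sets are assumed to already include ancestor terms.

```python
import math

def go_pair_features(terms_a, terms_b, term_counts, total_annotations, vocabulary):
    """Build one feature vector for a protein pair from its GO term sets."""
    features = []
    for term in vocabulary:
        in_a, in_b = term in terms_a, term in terms_b
        if not (in_a or in_b):
            features.append(0.0)  # term absent from this "document"
            continue
        # Information content: rarer terms are more informative.
        ic = -math.log(term_counts[term] / total_annotations)
        # Hypothetical coefficient (NOT from the paper): shared terms count fully.
        weight = 1.0 if (in_a and in_b) else 0.5
        features.append(ic * weight)
    return features

# Toy corpus statistics (hypothetical counts out of 100 annotated proteins).
vocabulary = ["GO:0008150", "GO:0006412", "GO:0006950"]
term_counts = {"GO:0008150": 100, "GO:0006412": 10, "GO:0006950": 20}
pair_vector = go_pair_features(
    {"GO:0008150", "GO:0006412", "GO:0006950"},  # protein A terms (with ancestors)
    {"GO:0008150", "GO:0006412"},                # protein B terms (with ancestors)
    term_counts,
    total_annotations=100,
    vocabulary=vocabulary,
)
```

The resulting vectors can be fed to any standard classifier such as RF or SVM; the root term contributes nothing because its information content is zero.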



Detection Copy Number Variants from NGS with Sparse and Smooth Constraints

08/07/2017 2:06 pm PST

It is known that copy number variations (CNVs) are associated with complex diseases and particular tumor types; thus, reliable identification of CNVs is of great potential value. Recent advances in next generation sequencing (NGS) data analysis have helped manifest the richness of CNV information. However, the performance of these methods is not consistent. Reliably finding CNVs in NGS data in an efficient way remains a challenging topic, worthy of further investigation. Accordingly, we tackle the problem by formulating CNV identification as a quadratic optimization problem involving two constraints. By imposing the constraints of sparsity and smoothness, the reconstructed read depth signal from NGS is anticipated to fit the CNV patterns more accurately. An efficient numerical solution tailored from the alternating direction minimization (ADM) framework is elaborated. We demonstrate the advantages of the proposed method, namely ADM-CNV, by comparing it with six popular CNV detection methods using synthetic, simulated, and empirical sequencing data. It is shown that the proposed approach can successfully reconstruct CNV patterns from raw data, and achieve superior or comparable performance in detection of CNVs compared to the existing counterparts.
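
One common way to write a read-depth reconstruction with both sparsity and smoothness terms is a fused-lasso-style objective. The abstract does not give the authors' exact formulation, so the form below is an assumption; the symbols $y$ (observed read depth signal, centered so that a normal copy number is near zero), $x$ (reconstructed signal), and the weights $\lambda_1, \lambda_2$ are introduced here for illustration only.

```latex
\min_{x \in \mathbb{R}^{n}} \; \frac{1}{2}\,\lVert y - x \rVert_2^2
  \;+\; \lambda_1 \lVert x \rVert_1
  \;+\; \lambda_2 \sum_{i=1}^{n-1} \lvert x_{i+1} - x_i \rvert
```

The quadratic term keeps $x$ close to the data, the $\ell_1$ term favors a signal that is zero outside variant regions (sparsity), and the difference term favors piecewise-constant segments (smoothness). Alternating direction methods typically handle such objectives by introducing auxiliary variables for the two non-smooth terms and updating each block in turn.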



Impact of Synaptic Localization and Subunit Composition of Ionotropic Glutamate Receptors on Synaptic Function: Modeling and Simulation Studies

08/07/2017 2:06 pm PST

Ionotropic NMDA and AMPA glutamate receptors (iGluRs) play important roles in synaptic function under physiological and pathological conditions. iGluRs sub-synaptic localization and subunit composition are dynamically regulated by activity-dependent insertion and internalization. However, understanding the impact on synaptic transmission of changes in composition and localization of iGluRs is difficult to address experimentally. To address this question, we developed a detailed computational model of glutamatergic synapses, including spine and dendritic compartments, elementary models of subtypes of NMDA and AMPA receptors, glial glutamate transporters, intracellular calcium, and a calcium-dependent signaling cascade underlying the development of long-term potentiation (LTP). These synapses were distributed on a neuron model and numerical simulations were performed to assess the impact of changes in composition and localization (synaptic versus extrasynaptic) of iGluRs on synaptic transmission and plasticity following various patterns of presynaptic stimulation. In addition, the effects of various pharmacological compounds targeting NMDARs or AMPARs were determined. Our results showed that changes in NMDAR localization have a greater impact on synaptic plasticity than changes in AMPARs. Moreover, the results suggest that modulators of AMPA and NMDA receptors have differential effects on restoring synaptic plasticity under different experimental situations mimicking various human diseases.



A Novel Adaptive Penalized Logistic Regression for Uncovering Biomarker Associated with Anti-Cancer Drug Sensitivity

08/07/2017 2:06 pm PST

We propose a novel adaptive penalized logistic regression modeling strategy based on the Wilcoxon rank sum test (WRST) to effectively uncover driver genes in classification. To incorporate the significance of each gene in classification, we first measure the significance of each gene with a gene ranking method based on WRST, and then the adaptive $L_{1}$-type penalty is discriminately imposed on each gene depending on its measured degree of importance. Incorporating the significance of genes into adaptive logistic regression enables us to impose a large penalty on low-ranking genes, so noise genes are easily deleted from the model and driver genes can be effectively identified. Monte Carlo experiments and a real-world example are conducted to investigate the effectiveness of the proposed approach. In the Sanger data analysis, we introduce a strategy to identify expression modules indicating gene regulatory mechanisms via principal component analysis (PCA), and perform logistic regression modeling based not on single genes but on gene expression modules. The Monte Carlo experiments and real-world example show that the proposed adaptive penalized logistic regression outperforms existing $L_{1}$-type regularization in feature selection and classification. The discriminately imposed penalty based on WRST effectively performs crucial gene selection, and thus our method can improve classification accuracy without interference from noise genes. Furthermore, it can be seen through Sanger da[...]



Pattern Classification of Instantaneous Cognitive Task-load Through GMM Clustering, Laplacian Eigenmap, and Ensemble SVMs

08/07/2017 2:06 pm PST

The identification of temporal variations in human operator cognitive task-load (CTL) is crucial for preventing possible accidents in human-machine collaborative systems. Recent literature has shown that changes in discrete CTL level during human-machine system operations can be objectively recognized using neurophysiological data and supervised learning techniques. The objective of this work is to design a subject-specific multi-class CTL classifier to reveal the complex unknown relationship between the operator's task performance and neurophysiological features by combining target class labeling, physiological feature reduction and selection, and ensemble classification techniques. The psychophysiological data acquisition experiments were performed under multiple human-machine process control tasks. Four or five target classes of CTL were determined by using a Gaussian mixture model and three human performance variables. By using Laplacian eigenmap, a few salient EEG features were extracted, and heart rates were used as the input features of the CTL classifier. Then, multiple support vector machines were aggregated via majority voting to create an ensemble classifier for recognizing the CTL classes. Finally, the obtained CTL classification results were compared with those of several existing methods. The results showed that the proposed methods are capable of deriving a reasonable number of target classes and low-dimensional optimal EEG features for individual hum[...]
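
The majority-voting step that aggregates several classifiers into one ensemble prediction can be sketched as follows. The member predictions below are made-up labels standing in for the outputs of already trained SVMs; the function name and the tie handling (arbitrary in this sketch) are assumptions.

```python
from collections import Counter

def majority_vote(member_predictions):
    """Combine per-member label lists into one ensemble label per sample.

    member_predictions holds one list of class labels per ensemble member,
    all of the same length (one entry per sample).
    """
    ensemble = []
    for votes in zip(*member_predictions):
        # Most frequent label among the members; ties are broken arbitrarily here.
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble

# Hypothetical CTL class predictions from three members over four time segments.
member_predictions = [
    [1, 2, 3, 4],
    [1, 2, 2, 4],
    [2, 2, 3, 1],
]
ensemble_labels = majority_vote(member_predictions)
```

Each segment receives the label that at least two of the three members agree on, so an isolated error by one member does not change the ensemble output.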



Derivative-Free Optimization of Rate Parameters of Capsid Assembly Models from Bulk in Vitro Data

08/07/2017 2:06 pm PST

The assembly of virus capsids proceeds by a complicated cascade of association and dissociation steps, the great majority of which cannot be directly experimentally observed. This has made capsid assembly a rich field for computational models, but there are substantial obstacles to model inference for such systems. Here, we describe progress on fitting kinetic rate constants defining capsid assembly models to experimental data, a difficult data-fitting problem because of the high computational cost of simulating assembly trajectories, the stochastic noise inherent to the models, and the limited and noisy data available for fitting. We evaluate the merits of data-fitting methods based on derivative-free optimization (DFO) relative to gradient-based methods used in prior work. We further explore the advantages of alternative data sources through simulation of a model of time-resolved mass spectrometry data, a technology for monitoring bulk capsid assembly that can be expected to provide much richer data than previously used static light scattering approaches. The results show that advances in both the data and the algorithms can improve model inference. More informative data sources lead to high-quality fits for all methods, but DFO methods show substantial advantages on less informative data sources that better represent current experimental practice.



A Generalized Lattice Based Probabilistic Approach for Metagenomic Clustering

08/07/2017 2:06 pm PST

Metagenomics involves the analysis of genomes of microorganisms sampled directly from their environment. Next Generation Sequencing allows a high-throughput sampling of small segments from genomes in the metagenome to generate reads. To study the properties and relationships of the microorganisms present, clustering can be performed based on the inherent composition of the sampled reads for unknown species. We propose a two-dimensional lattice based probabilistic model for clustering metagenomic datasets. The occurrence of a species in the metagenome is estimated using a lattice of probabilistic distributions over small sized genomic sequences. The two dimensions denote distributions for different sizes and groups of words, respectively. The lattice structure allows for additional support for a node from its neighbors when the probabilistic support for the species using the parameters of the current node is deemed insufficient. We also show convergence for our algorithm. We test our algorithm on simulated metagenomic data containing bacterial species and observe more than 85 percent precision. We also evaluate our algorithm on an in vitro-simulated bacterial metagenome and on human patient data, and show a better clustering than other algorithms even for short reads and varied abundance. The software and datasets can be downloaded from https://github.com/lattclus/lattice-metage.[...]



Brain Modulyzer: Interactive Visual Analysis of Functional Brain Connectivity

08/07/2017 2:05 pm PST

We present Brain Modulyzer, an interactive visual exploration tool for functional magnetic resonance imaging (fMRI) brain scans, aimed at analyzing the correlation between different brain regions when resting or when performing mental tasks. Brain Modulyzer combines multiple coordinated views—such as heat maps, node link diagrams, and anatomical views—using brushing and linking to provide an anatomical context for brain connectivity data. Integrating methods from graph theory and analysis, e.g., community detection and derived graph measures, makes it possible to explore the modular and hierarchical organization of functional brain networks. Providing immediate feedback by displaying analysis results instantaneously while changing parameters gives neuroscientists a powerful means to comprehend complex brain structure more effectively and efficiently and supports forming hypotheses that can then be validated via statistical analysis. To demonstrate the utility of our tool, we present two case studies—exploring progressive supranuclear palsy, as well as memory encoding and retrieval.



Circular Order Aggregation and Its Application to Cell-Cycle Genes Expressions

08/07/2017 2:05 pm PST

The aim of circular order aggregation is to find a circular order on a set of $n$ items using angular values from $p$ heterogeneous data sets. This problem is new in the literature and has been motivated by the biological question of finding the order among the peak expression of a group of cell cycle genes. In this paper, two very different approaches to solve the problem that use pairwise and triplewise information are proposed. Both approaches are analyzed and compared using theoretical developments and numerical studies, and applied to the cell cycle data that motivated the problem.