Subscribe: Sandrine Dudoit
http://works.bepress.com/sandrine_dudoit/recent.rss
Language: English
Tags: cross-validation, data, gene, multiple testing, parameter, test statistics

Sandrine Dudoit



Recent works by Sandrine Dudoit



Last Build Date: Mon, 01 Jan 2007 00:00:00 +0000

Copyright: (c) 2018. All rights reserved.
 



Prognosis of stage II colon cancer by non-neoplastic mucosa gene expression profiling

Mon, 01 Jan 2007 00:00:00 +0000

We have assessed the possibility of building a prognosis predictor (PP), based on non-neoplastic mucosa microarray gene expression measures, for stage II colon cancer patients. Non-neoplastic colonic mucosa mRNA samples from 24 patients (10 with a metachronous metastasis, 14 with no recurrence) were profiled using the Affymetrix HGU133A GeneChip. Patients were repeatedly and randomly divided into 1000 training sets (TSs) of size 16 and validation sets (VSs) of size 8. For each TS/VS split, a 70-gene PP, identified on the TS by selecting the 70 most differentially expressed genes and applying diagonal linear discriminant analysis, was used to predict the prognoses of VS patients. Mean prognosis prediction performances of the 70-gene PP were 81.8% for accuracy, 73.0% for sensitivity, and 87.1% for specificity. The informative genes suggested branching signal-transduction pathways, with possibly extensive networks between individual pathways. They also included genes coding for proteins involved in immune surveillance. In conclusion, our study suggests that one can build an accurate PP for stage II colon cancer patients based on non-neoplastic mucosa microarray gene expression measures.
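
As a rough illustration of the scheme just described, the R sketch below simulates the repeated training/validation splits, the 70-gene t-statistic filter, and diagonal linear discriminant analysis (DLDA). The data, dimensions, and number of splits are invented stand-ins for the actual Affymetrix measures; this is a minimal sketch, not the authors' implementation.

# Sketch of repeated TS/VS splits with a t-statistic gene filter and DLDA.
# Simulated data stand in for the microarray measures (assumed dimensions).
set.seed(1)
n.genes <- 500; n.pat <- 24
y <- rep(c(0, 1), c(14, 10))            # 0 = no recurrence, 1 = metastasis
X <- matrix(rnorm(n.genes * n.pat), n.genes, n.pat)
X[1:20, y == 1] <- X[1:20, y == 1] + 1  # a few informative genes

dlda <- function(Xtr, ytr, Xte, n.top = 70) {
  # Gene selection by two-sample t-statistics, on the training set only
  t.stat <- apply(Xtr, 1, function(g) t.test(g[ytr == 1], g[ytr == 0])$statistic)
  top <- order(abs(t.stat), decreasing = TRUE)[1:n.top]
  mu0 <- rowMeans(Xtr[top, ytr == 0]); mu1 <- rowMeans(Xtr[top, ytr == 1])
  pool.var <- function(g) {             # pooled within-class gene variance
    n0 <- sum(ytr == 0); n1 <- sum(ytr == 1)
    ((n0 - 1) * var(g[ytr == 0]) + (n1 - 1) * var(g[ytr == 1])) / (n0 + n1 - 2)
  }
  s2 <- apply(Xtr[top, , drop = FALSE], 1, pool.var)
  # Assign each VS sample to the closer centroid, coordinates weighted by
  # 1/s2 (the "diagonal" covariance assumption in DLDA)
  d0 <- colSums((Xte[top, , drop = FALSE] - mu0)^2 / s2)
  d1 <- colSums((Xte[top, , drop = FALSE] - mu1)^2 / s2)
  as.integer(d1 < d0)
}

acc <- replicate(100, {                 # the paper uses 1000 splits
  vs <- sample(n.pat, 8)
  mean(dlda(X[, -vs], y[-vs], X[, vs, drop = FALSE]) == y[vs])
})
mean(acc)                               # mean prediction accuracy over splits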



Test statistics null distributions in multiple testing: Simulation studies and applications to genomics

Tue, 01 Nov 2005 00:00:00 +0000

Multiple hypothesis testing problems arise frequently in biomedical and genomic research, for instance, when identifying differentially expressed and co-expressed genes in microarray experiments. We have developed generally applicable resampling-based single-step and stepwise multiple testing procedures (MTPs) for controlling a broad class of Type I error rates, defined as tail probabilities and expected values for arbitrary functions of the numbers of false positives and rejected null hypotheses. A key feature of the methodology is the general characterization and explicit construction of a test statistics null distribution (rather than a data generating null distribution), which provides Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses defined in terms of submodels, and test statistics. This article presents simulation studies comparing test statistics null distributions in two testing scenarios of great relevance to biomedical and genomic data analysis: tests for regression coefficients in linear models where covariates and error terms are allowed to be dependent, and tests for correlation coefficients. The simulation studies demonstrate that the choice of null distribution can have a substantial impact on the Type I error properties of a given multiple testing procedure. Procedures based on our proposed non-parametric bootstrap test statistics null distribution typically control the Type I error rate "on target" at the nominal level, while comparable procedures, based on parameter-specific bootstrap data generating null distributions, can be severely anti-conservative or conservative. The analysis of microRNA expression data from cancerous and non-cancerous tissues (Lu et al., 2005), using tests for logistic regression coefficients and correlation coefficients, illustrates the flexibility and power of our proposed methodology.
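
The R sketch below illustrates the core idea of a test statistics null distribution, as opposed to a data generating null distribution: two-sample t-statistics are bootstrapped by resampling within groups, the resulting joint distribution is centered, and single-step maxT adjusted p-values are read off. All data and settings are simulated assumptions, and the centering step is a simplification of the null transformations developed in the paper.

# Sketch of a non-parametric bootstrap test statistics null distribution
# for m two-sample t-statistics (simulated data, assumed dimensions).
set.seed(1)
m <- 100; n <- 30
y <- rep(0:1, each = n / 2)
X <- matrix(rnorm(m * n), m, n)

t.stats <- function(X, y) apply(X, 1, function(g)
  t.test(g[y == 1], g[y == 0])$statistic)

obs <- t.stats(X, y)

B <- 500
boot <- replicate(B, {
  idx <- c(sample(which(y == 0), replace = TRUE),  # resample within groups
           sample(which(y == 1), replace = TRUE))
  t.stats(X[, idx], y)
})
null <- boot - rowMeans(boot)   # center each statistic's bootstrap distribution

# Single-step maxT adjusted p-values from the joint null distribution
max.null <- apply(abs(null), 2, max)
adj.p <- sapply(abs(obs), function(t0) mean(max.null >= t0))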



Optimization of the Architecture of Neural Networks Using a Deletion/Substitution/Addition Algorithm
Neural networks are a popular machine learning tool, particularly in applications such as the prediction of protein secondary structure. However, overfitting poses an obstacle to their effective use for this and other problems. Due to the large number of parameters in a typical neural network, one may obtain a network fit that perfectly predicts the learning data yet fails to generalize to other data sets. One way of reducing the size of the parameter space is to alter the network topology so that some edges are removed; however, it is often not immediately apparent which edges should be eliminated. We propose a data-adaptive method of selecting an optimal network architecture using the Deletion/Substitution/Addition algorithm introduced in Sinisi and van der Laan (2004) and Molinaro and van der Laan (2004). Results of this approach in the regression case are presented on two simulated data sets and the diabetes data of Efron et al. (2002).
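
The deletion, substitution, and addition moves are easy to state in code. The sketch below applies them to the simpler problem of selecting a subset of linear regression terms, closer to the setting of Sinisi and van der Laan (2004), with 5-fold cross-validated squared error as the selection criterion; the network-edge version described above follows the same pattern with an edge mask in place of the term subset. Data and dimensions are simulated assumptions.

# Sketch of a D/S/A-style greedy search over subsets of candidate
# regressors, scored by 5-fold cross-validated squared error.
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 3] + rnorm(n)
V <- 5; fold <- sample(rep(1:V, length.out = n))  # folds fixed once

cv.risk <- function(active) {
  if (length(active) == 0) return(mean((y - mean(y))^2))
  mean(sapply(1:V, function(v) {
    fit <- lm(y[fold != v] ~ X[fold != v, active, drop = FALSE])
    pred <- cbind(1, X[fold == v, active, drop = FALSE]) %*% coef(fit)
    mean((y[fold == v] - pred)^2)
  }))
}

active <- integer(0); best <- cv.risk(active)
repeat {
  moves <- list()
  for (j in active) moves <- c(moves, list(setdiff(active, j)))          # deletion
  for (j in active) for (k in setdiff(1:p, active))
    moves <- c(moves, list(c(setdiff(active, j), k)))                    # substitution
  for (j in setdiff(1:p, active)) moves <- c(moves, list(c(active, j)))  # addition
  risks <- sapply(moves, cv.risk)
  if (min(risks) >= best) break   # stop when no move improves the CV risk
  best <- min(risks); active <- moves[[which.min(risks)]]
}
active   # indices of the selected regressors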



Multiple Testing Procedures: R multtest Package and Applications to Genomics
The Bioconductor R package multtest implements widely applicable resampling-based single-step and stepwise multiple testing procedures (MTPs) for controlling a broad class of Type I error rates, in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics. The current version of multtest provides MTPs for tests concerning means, differences in means, and regression parameters in linear and Cox proportional hazards models. Procedures are provided to control Type I error rates defined as tail probabilities for arbitrary functions of the numbers of false positives and rejected hypotheses. These error rates include tail probabilities for the number of false positives (generalized family-wise error rate, gFWER) and for the proportion of false positives among the rejected hypotheses (TPPFP). Single-step and step-down common-cut-off (maxT) and common-quantile (minP) procedures, which take into account the joint distribution of the test statistics, are proposed to control the family-wise error rate (FWER), or chance of at least one Type I error. In addition, augmentation multiple testing procedures are provided to control the gFWER and TPPFP, based on any initial FWER-controlling procedure. The results of a multiple testing procedure can be summarized using rejection regions for the test statistics, confidence regions for the parameters of interest, or adjusted p-values. A key ingredient of our proposed MTPs is the test statistics null distribution (and estimator thereof) used to derive rejection regions and corresponding confidence regions and adjusted p-values. Both bootstrap and permutation estimators of the test statistics null distribution are available. The S4 class/method object-oriented programming approach was adopted to summarize the results of an MTP. The modular design of multtest allows interested users to readily extend the package's functionality. Typical testing scenarios are illustrated by applying various MTPs implemented in multtest to the Acute Lymphoblastic Leukemia (ALL) dataset of Chiaretti et al. (2004), with the aim of identifying genes whose expression measures are associated with (possibly censored) biological and clinical outcomes.
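
For readers who want to try the package, a minimal session might look like the sketch below. Function and argument names (MTP, fwer2gfwer, the golub example data) are quoted from memory of the multtest documentation and should be checked against the current Bioconductor manual.

# Hedged sketch of a basic multtest session; verify the interface against
# the current Bioconductor documentation before relying on it.
library(multtest)
data(golub)   # example leukemia expression data shipped with multtest

mtp <- MTP(X = golub, Y = golub.cl,
           test = "t.twosamp.unequalvar",  # Welch two-sample t-statistics
           typeone = "fwer", alpha = 0.05,
           B = 1000, method = "sd.maxT")   # bootstrap null, step-down maxT
summary(mtp)
sum(mtp@adjp <= 0.05)   # number of rejected null hypotheses

# Augmentation procedure: FWER-adjusted p-values to gFWER(k = 5) control
gfwer.adjp <- fwer2gfwer(adjp = mtp@adjp, k = 5)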



Loss-Based Estimation with Cross-Validation: Applications to Microarray Data Analysis and Motif Finding
Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators, (ii) an approach for selecting an optimal estimator among these candidates, and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated using this (or possibly another) loss function. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. This general estimation framework encompasses a number of problems which have traditionally been treated separately in the statistical literature, including multivariate outcome prediction and density estimation based on either uncensored or censored data. This article provides an overview of the methodology and describes its application to two problems in genomic data analysis: the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures and the identification of regulatory motifs (i.e., transcription factor binding sites) in DNA sequences.
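
A minimal R sketch of the selection step: the parameter of interest is the risk minimizer for squared-error loss, the candidates are histogram regression estimators indexed by bin count, and V-fold cross-validation selects among them. The candidate class and all settings are illustrative assumptions, not the estimators used in the article.

# Sketch of loss-based estimator selection by V-fold cross-validation.
set.seed(1)
n <- 300
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Candidate estimators: piecewise-constant regression fits with k bins
fit.hist <- function(x, y, k) {
  br <- seq(0, 1, length.out = k + 1)
  means <- tapply(y, cut(x, br, include.lowest = TRUE), mean)
  function(x0) means[as.integer(cut(x0, br, include.lowest = TRUE))]
}

V <- 5; fold <- sample(rep(1:V, length.out = n))
ks <- 2:20
cv.risk <- sapply(ks, function(k) {
  mean(sapply(1:V, function(v) {
    f <- fit.hist(x[fold != v], y[fold != v], k)  # fit on training folds
    mean((y[fold == v] - f(x[fold == v]))^2, na.rm = TRUE)  # validation risk
  }))
})
k.star <- ks[which.min(cv.risk)]   # the cross-validation selector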



IBD Configuration Transition Matrices and Linkage Score Tests for Unilineal Relative Pairs
Properties of transition matrices between IBD configurations are derived for four general classes of unilineal relative pairs obtained from the grandparent/grandchild, half-sib, avuncular, and cousin relationships. In this setting, IBD configurations are defined as orbits of groups acting on a set of inheritance vectors. Properties of the transition matrix between IBD configurations at two linked loci are derived by relating its infinitesimal generator to the adjacency matrix of a quotient graph. The second largest eigenvalue of the infinitesimal generator and its multiplicity are key in determining the form of the transition matrix and of likelihood-based linkage tests such as score tests.
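
For concreteness, the two simplest cases can be written down directly; the following is a standard calculation stated as a hedged illustration, not a formula quoted from the paper. With recombination fraction $\theta$ between the two loci and rows/columns indexing the IBD states $\{1, 0\}$, the grandparent/grandchild and half-sib transition matrices are

\[
T_{\mathrm{gp}}(\theta) =
\begin{pmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{pmatrix},
\qquad
T_{\mathrm{hs}}(\theta) =
\begin{pmatrix} \psi & 1-\psi \\ 1-\psi & \psi \end{pmatrix},
\quad \psi = \theta^{2} + (1-\theta)^{2},
\]

since a grandparent/grandchild pair changes IBD state only through recombination in the single connecting meiosis, while half-sibs change state when exactly one of the two meioses through the common parent recombines. The second-largest eigenvalues, $1-2\theta$ and $2\psi - 1 = (1-2\theta)^{2}$ respectively, illustrate the role of the second eigenvalue described above.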



Asymptotic Optimality of Likelihood-Based Cross-Validation
Likelihood-based cross-validation is a statistical tool for selecting a density estimate, based on n i.i.d. observations from the true density, among a collection of candidate density estimators. General examples are the selection of a model indexing a maximum likelihood estimator and the selection of a bandwidth indexing a non-parametric (e.g. kernel) density estimator. In this article, we establish the asymptotic optimality of a general class of likelihood-based cross-validation procedures (as indexed by the type of sample splitting used, e.g. V-fold cross-validation), in the sense that the cross-validation selector performs asymptotically as well (w.r.t. the Kullback-Leibler distance to the true density) as an optimal benchmark model selector which depends on the true density. Crucial conditions of our theorem are that the size of the validation sample converges to infinity, which excludes leave-one-out cross-validation, and that the candidate density estimates are bounded away from zero and infinity. We illustrate these asymptotic results and the practical performance of likelihood-based cross-validation for the purpose of bandwidth selection with a simulation study.
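
As an illustration of the bandwidth-selection application, the R sketch below performs V-fold likelihood-based cross-validation over a grid of kernel bandwidths; note the explicit bounding of the fitted density away from zero, echoing the condition of the theorem. The data and the bandwidth grid are simulated assumptions.

# Sketch of V-fold likelihood-based cross-validation for kernel density
# bandwidth selection: fit on V-1 folds, score the held-out fold by its
# log-likelihood under the fitted density.
set.seed(1)
n <- 500
x <- c(rnorm(n / 2, -2), rnorm(n / 2, 2))   # simulated bimodal sample

V <- 5; fold <- sample(rep(1:V, length.out = n))
bws <- seq(0.05, 1.5, by = 0.05)            # candidate bandwidths (assumed grid)
cv.loglik <- sapply(bws, function(h) {
  sum(sapply(1:V, function(v) {
    d <- density(x[fold != v], bw = h, n = 2048,
                 from = min(x) - 3 * h, to = max(x) + 3 * h)
    f <- approx(d$x, d$y, xout = x[fold == v])$y  # density at held-out points
    sum(log(pmax(f, 1e-12)))                      # bound away from zero
  }))
})
h.star <- bws[which.max(cv.loglik)]   # likelihood-based CV selector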



Multiple Tests of Association with Biological Annotation Metadata
We propose a general and formal statistical framework for the multiple tests of associations between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known fixed gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating genome-wide transcript levels or DNA copy numbers to possibly censored biological and clinical outcomes and covariates. A generic question of great interest in current genomic research, regarding the detection of associations between biological annotation metadata and genome-wide expression measures, may then be translated into the multiple tests of hypotheses concerning association measures between gene-annotation and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple testing methodology developed in Dudoit and van der Laan (2006) and related articles, to control a broad class of Type I error rates, in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics. Resampling-based single-step and stepwise multiple testing procedures, which take into account the joint distribution of the test statistics, are provided to control Type I error rates defined as tail probabilities for arbitrary functions of the numbers of false positives and rejected hypotheses. The proposed statistical and computational methods are illustrated using the acute lymphoblastic leukemia (ALL) microarray dataset of Chiaretti et al. (2004), with the aim of relating GO annotation to differential gene expression between B-cell ALL with the BCR/ABL fusion and cytogenetically normal NEG B-cell ALL. The sensitivity of the identified lists of GO terms to the choice of association parameter between GO annotation and differential gene expression demonstrates the importance of translating the biological question into suitable gene-annotation profiles, gene-parameter profiles, and association measures. In particular, the results show the limitations of binary gene-parameter profiles of differential expression indicators, which are still the norm for combined GO annotation and microarray data analyses. Procedures based on such binary gene-parameter profiles tend to be conservative and lack robustness with respect to the estimator for the set of differentially expressed genes. WWW companion: www.stat.berkeley.edu/~sandrine/Docs/Papers/DFF06/DFF.html
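
A toy version of the setup can clarify the two profiles. In the R sketch below, the gene-annotation profiles are rows of a binary matrix (one row per GO term), the gene-parameter profile is a vector of per-gene statistics, and the association measure is a simple correlation; a permutation null stands in for the resampling-based joint null distributions of the paper. All data and parameter choices are simulated assumptions.

# Sketch of multiple tests of association between binary gene-annotation
# profiles and a quantitative gene-parameter profile (simulated data).
set.seed(1)
m.genes <- 2000; m.terms <- 50
theta <- rnorm(m.genes)                          # gene-parameter profile
A <- matrix(rbinom(m.terms * m.genes, 1, 0.05),  # gene-annotation profiles
            m.terms, m.genes)
theta[A[1, ] == 1] <- theta[A[1, ] == 1] + 1     # one truly associated term

obs <- apply(A, 1, cor, y = theta)               # association measures

# Permutation null: permuting genes breaks the annotation/parameter link
B <- 500
null <- replicate(B, apply(A, 1, cor, y = sample(theta)))
max.null <- apply(abs(null), 2, max)             # joint (maxT-style) null
adj.p <- sapply(abs(obs), function(t0) mean(max.null >= t0))
which(adj.p <= 0.05)                             # terms declared associated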



Asymptotically Optimal Model Selection Method with Right Censored Outcomes
Over the last two decades, non-parametric and semi-parametric approaches that adapt well-known techniques such as regression methods to the analysis of right-censored data, e.g. right-censored survival data, have become popular in the statistics literature. However, the problem of choosing the best model (predictor) among a set of proposed models (predictors) in the right-censored data setting has not gained much attention. In this paper, we develop a new cross-validation-based model selection method to select among predictors of right-censored outcomes such as survival times. The proposed method considers the risk of a given predictor based on the training sample as a parameter of the full data distribution in a right-censored data model. Then, the doubly robust locally efficient estimation method or an ad hoc inverse probability of censoring weighting method, as presented in Robins and Rotnitzky (1992) and van der Laan and Robins (2002), is used to estimate this conditional risk parameter based on the validation sample. We prove that, under general conditions, the proposed cross-validated selector is asymptotically equivalent to an oracle benchmark selector based on the true data generating distribution. The presented method covers model selection with right-censored data in prediction (univariate and multivariate) and density/hazard estimation problems.
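
The R sketch below illustrates the simpler of the two risk estimators mentioned above: an ad hoc inverse probability of censoring weighted (IPCW) estimate of the validation-sample risk of a candidate predictor, with the censoring survivor function estimated by Kaplan-Meier. The predictor, the data generating mechanism, and the weight truncation are illustrative assumptions, not the paper's procedure.

# Sketch of IPCW risk estimation for a predictor of a right-censored
# outcome: uncensored validation observations are weighted by 1/G(T),
# with G the Kaplan-Meier estimate of the censoring survivor function.
library(survival)
set.seed(1)
n <- 400
x <- rnorm(n)
t.true <- rexp(n, rate = exp(-x))   # survival times depend on x
c.time <- rexp(n, rate = 0.25)      # independent censoring times
time <- pmin(t.true, c.time); status <- as.integer(t.true <= c.time)

tr <- sample(n, n / 2); va <- setdiff(seq_len(n), tr)

# A candidate predictor of log survival time, naively fit on the
# uncensored training observations (just something to validate)
d.tr <- data.frame(lt = log(time), x = x)[tr, ][status[tr] == 1, ]
fit <- lm(lt ~ x, data = d.tr)
pred <- predict(fit, newdata = data.frame(x = x[va]))

# Kaplan-Meier estimate of the censoring survivor function G
# (note the reversed status indicator)
G <- survfit(Surv(time[va], 1 - status[va]) ~ 1)
G.fun <- stepfun(G$time, c(1, G$surv))

# IPCW risk: only uncensored observations contribute, weighted by 1/G(T);
# truncating the weights is an ad hoc guard against near-zero G
w <- ifelse(status[va] == 1, 1 / pmax(G.fun(time[va]), 0.05), 0)
ipcw.risk <- mean(w * (log(time[va]) - pred)^2)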



A General Framework for Statistical Performance Comparison of Evolutionary Computation Algorithms
This paper proposes a statistical methodology for comparing the performance of evolutionary computation algorithms. A two-fold sampling scheme for collecting performance data is introduced, and these data are analyzed using bootstrap-based multiple hypothesis testing procedures. The proposed method is sufficiently flexible to allow the researcher to choose how performance is measured, does not rely upon distributional assumptions, and can be extended to analyze many other randomized numeric optimization routines. As a result, this approach offers a convenient, flexible, and reliable technique for comparing algorithms in a wide variety of applications.
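
The R sketch below illustrates the flavor of the approach: per-problem performance samples for two algorithms, a bootstrap null distribution for the difference in mean performance, and maxT-style adjusted p-values across test problems. The performance data are simulated stand-ins for actual algorithm runs, and the details are assumptions rather than the paper's exact procedure.

# Sketch of a bootstrap-based comparison of two randomized optimizers
# across several test problems (simulated run results, assumed sizes).
set.seed(1)
n.runs <- 30; n.probs <- 5
perf.A <- matrix(rnorm(n.runs * n.probs, mean = 1.0), n.runs, n.probs)
perf.B <- matrix(rnorm(n.runs * n.probs, mean = 1.2), n.runs, n.probs)

stat <- function(a, b) colMeans(a) - colMeans(b)  # per-problem differences
obs <- stat(perf.A, perf.B)

B <- 2000
null <- replicate(B, {
  # Resample runs within each algorithm, centering at the observed values
  a <- perf.A[sample(n.runs, replace = TRUE), , drop = FALSE]
  b <- perf.B[sample(n.runs, replace = TRUE), , drop = FALSE]
  stat(a, b) - obs
})
max.null <- apply(abs(null), 2, max)              # joint (maxT-style) null
adj.p <- sapply(abs(obs), function(t0) mean(max.null >= t0))
adj.p   # one adjusted p-value per test problem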