Bayesian Integrative Genomics

Phase I: Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

Phase II: A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson.

The Teams

Team 1: Department of Epidemiology and Public Health, Imperial College School of Medicine

Team 2: Statistics group, School of Mathematics, University of Bristol

Team 3: Clinical Sciences Centre / Imperial College Microarray Centre

The research undertaken in this project has been productive, leading to 7 publications in high-quality journals, with 2 further publications currently under review. Software has been accepted by Bioconductor.

Estimating gene expression from probe-level data

This project forms the main part of Task M4 (Level of modelling) and was carried out as a close collaboration between Teams 1, 2 and 3. It led to a main publication

in which we present the BGX signal extraction model, a Bayesian hierarchical approach for the analysis of Affymetrix GeneChip data. The approach we take differs from other available approaches in two fundamental aspects. Firstly, we integrate all processing steps of the raw data in a common, statistically coherent framework, allowing all components and thus associated errors to be considered simultaneously. Secondly, inference is based on the full posterior distribution of gene expression indices and derived quantities, such as fold changes or ranks, rather than on single point estimates. Measures of uncertainty on these quantities are thus available. The models developed represent the first building block for integrated Bayesian analysis of Affymetrix GeneChip data: they take into account additive as well as multiplicative error, and gene expression levels are estimated using perfect match and a fraction of mismatch probes and are modelled on the log scale. Background correction is incorporated by modelling true signal and cross-hybridization explicitly, and the need for further normalization is considerably reduced by allowing for array-specific distributions of nonspecific hybridization. When replicate arrays are available for a condition, posterior distributions of condition-specific gene expression indices are estimated directly, by a simultaneous consideration of replicate probe sets, avoiding averaging over estimates obtained from individual replicate arrays. The performance of the Bayesian model is compared to that of standard available point estimate methods on subsets of the well-known GeneLogic and Affymetrix spike-in data. The Bayesian model is found to perform well and the integrated procedure appears to hold considerable promise for further development.

In this subsequent publication, the benefits of using the BGX signal extraction model were further evaluated; in particular, a statistical procedure for performing differential expression analysis without replicates was presented. The procedure relies on the posterior distributions of expression that are obtained in BGX regardless of the number of replicates available. We exploit these posterior distributions to create ranked gene lists that take into account the estimated expression difference as well as its associated uncertainty. We propose a new procedure to estimate the proportion of non-differentially expressed genes empirically, adapting an approach proposed by Efron, that allows an informed choice of cut-off for the ranked gene list. We assess the performance of the method on publicly available spike-in data sets, as well as in a real biological setting. The method presented is thus found to be a powerful tool for analysing GeneChip expression studies with limited or no replicates.
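As a rough illustration of how posterior draws can be turned into a ranked gene list that reflects both the size of a difference and its uncertainty, the Python sketch below scores each gene by the posterior probability that its absolute log fold change exceeds a threshold. It is not the BGX procedure itself: the function name `rank_genes`, the threshold of 1 and the synthetic normal draws standing in for MCMC output are all assumptions made for this example.

```python
import random

def rank_genes(draws_a, draws_b, threshold=1.0):
    """Rank genes by the fraction of posterior draws with |difference| > threshold.

    draws_a, draws_b: dict mapping gene name -> list of posterior draws of
    log expression under conditions A and B (hypothetical input format).
    """
    scores = {}
    for gene in draws_a:
        diffs = [a - b for a, b in zip(draws_a[gene], draws_b[gene])]
        scores[gene] = sum(abs(d) > threshold for d in diffs) / len(diffs)
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Synthetic stand-ins for posterior samples (assumption: not real BGX output).
random.seed(0)

def fake_draws(mean, sd, n=2000):
    return [random.gauss(mean, sd) for _ in range(n)]

draws_a = {"g_shift": fake_draws(8.0, 0.2),
           "g_noisy": fake_draws(8.0, 1.5),
           "g_null": fake_draws(7.0, 0.2)}
draws_b = {"g_shift": fake_draws(6.0, 0.2),
           "g_noisy": fake_draws(6.0, 1.5),
           "g_null": fake_draws(7.0, 0.2)}

ranking, scores = rank_genes(draws_a, draws_b)
print(ranking)
```

Here `g_shift` and `g_noisy` have the same mean difference, but the tighter posterior of `g_shift` places it higher in the list.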

Comparison of gene expression vectors under multiple conditions

This is Task M1 of the original proposal. The work was carried out by Team 1, in active collaboration with our French collaborator named in the grant proposal, Dr P. Broët. The main developments are summarized in three papers (two published and one submitted) addressing different aspects of the analysis of differential expression and multiclass experiments.

presents a Bayesian hierarchical model for detecting differentially expressed genes that includes simultaneous estimation of array effects. This complements the work on the BGX signal extraction model, as the hierarchical models developed here can take as their starting point any chosen measure of gene expression, for example derived from RMA or MAS5 processing. We first give empirical evidence that expression-level dependent array effects are needed, and explore different non-linear functions as part of our model-based approach to normalization. The model includes gene-specific variances but imposes some necessary shrinkage through a hierarchical structure. Model criticism via posterior predictive checks is discussed. Modelling the array effects (normalization) simultaneously with differential expression gives fewer false positive results. Secondly, we show how to use the model output for selecting lists of genes for further investigation. To choose a list of genes, we propose to combine various criteria (for instance, fold change and overall expression) into a single indicator variable for each gene. The posterior distribution of these variables is used to pick the list of genes, thereby taking into account uncertainty in parameter estimates. The method is illustrated on an analysis of mouse knockout expression data generated by Tim Aitman at Hammersmith (Team 3) as part of a series of experiments investigating the genetic origin of the metabolic syndrome. We found that Gene Ontology annotations over- and under-represented amongst the genes on the derived lists are consistent with biological expectations, highlighting in particular the role of inflammation-related genes in adipocyte biology.
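The idea of combining several criteria into one indicator per gene, and then using its posterior distribution, can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the thresholds (`min_abs_lfc`, `min_expr`, `prob_cutoff`) and the hand-written draws are hypothetical and are not taken from the paper.

```python
def indicator_probability(draws, min_abs_lfc=1.0, min_expr=6.0):
    """Posterior probability that a gene meets all criteria at once.

    draws: list of (log_fold_change, overall_expression) pairs, one per
    posterior sample (hypothetical input format).
    """
    hits = sum(1 for lfc, expr in draws
               if abs(lfc) >= min_abs_lfc and expr >= min_expr)
    return hits / len(draws)

def select_genes(all_draws, prob_cutoff=0.8, **criteria):
    """Keep genes whose indicator is 1 in at least prob_cutoff of the draws."""
    return [gene for gene, draws in all_draws.items()
            if indicator_probability(draws, **criteria) >= prob_cutoff]

# Tiny hand-made example (assumption: real use would have thousands of draws).
all_draws = {
    # Large fold change and high expression in 4 of 5 draws.
    "geneA": [(1.5, 8.0), (1.2, 7.9), (1.8, 8.1), (1.1, 8.0), (0.9, 8.2)],
    # Large fold change but mostly low overall expression.
    "geneB": [(2.0, 5.0), (2.1, 5.5), (1.9, 6.1), (2.2, 5.2), (2.0, 5.8)],
    # High expression but no fold change.
    "geneC": [(0.1, 9.0), (-0.2, 9.1), (0.0, 8.9), (0.3, 9.0), (-0.1, 9.2)],
}

selected = select_genes(all_draws)
print(selected)
```

Because the indicator is evaluated within each posterior draw, a gene is selected only when the criteria hold jointly with high posterior probability, rather than when separate point estimates happen to pass each threshold.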

Following this work, we considered how to extend the hierarchical model described above to include a mixture prior for the differential expression parameter. These developments are under review for publication.

In this paper, we present a new, readily-interpretable mixture model for classifying genes into different classes of differential expression. The model is implemented in a fully Bayesian manner, including the estimation of the proportion of differentially expressed genes. This is in contrast to empirical Bayes approaches that input known values for this proportion. We show in a range of simulations that this proportion can be estimated well, and that it leads to good estimates of the false discovery rate. Another key contribution of the paper is to show how predictive model checks can be used to investigate which of several possible parametric families is most appropriate for modelling differential expression in any given data set. For this purpose, we have developed specific model checking tools suited to hierarchical mixture models.

We illustrate the usefulness of the model checks on a data set for which we show two interesting features: (a) we find that non-differentially expressed genes are not well modelled using a point mass at zero for the differential expression parameter and that a `nugget' null is more appropriate; (b) we find that model fit is affected by the pre-processing method used. Thus, model checks need to be performed in order to make meaningful classifications based on mixture models.

In parallel, we developed strategies for analysing more complex experimental designs used in microarray experiments: multiclass response (MCR) experiments. We again used the flexible framework of mixtures of distributions, applied here to the marginal distribution of a summary statistic for MCR.

discusses Bayesian inference for multiclass response experiments, in which more than two classes are compared. In these experiments, though the null hypothesis is simple, there are typically many patterns of gene expression changes across the different classes that lead to complex alternatives. The paper proposes a new strategy for selecting genes in multiclass experiments that is based on a flexible mixture model for the marginal distribution of a modified F statistic. Using this model, false positive and negative discovery rates can be estimated and combined to produce a rule for selecting a subset of genes. Moreover, the method allows calculation of these rates for any predefined subset of genes. The performance of the proposed approach is illustrated using simulated datasets and a real breast cancer microarray dataset from Hedenfalk et al. (2001). In this latter study, we investigate predefined subsets of genes and point out interesting differences between three distinct biological pathways.
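One common way to estimate such rates from a fitted mixture model is to average posterior classification probabilities over the selected and unselected genes. The Python sketch below does this and picks a cut-off by minimising the sum of the two rates. The probabilities are invented, and the equal-weight sum used as the selection rule is an assumption for illustration; it is not necessarily the combination used in the paper.

```python
def fdr_fnr(probs, cutoff):
    """Estimated false discovery and false non-discovery rates for one cut-off.

    probs: posterior probabilities that each gene is differentially
    expressed (hypothetical output of a fitted mixture model).
    """
    selected = [p for p in probs if p >= cutoff]
    rejected = [p for p in probs if p < cutoff]
    # Among selected genes, 1 - p is the probability of a false discovery.
    fdr = sum(1 - p for p in selected) / len(selected) if selected else 0.0
    # Among rejected genes, p is the probability of a missed discovery.
    fnr = sum(rejected) / len(rejected) if rejected else 0.0
    return fdr, fnr

def best_cutoff(probs, candidates):
    """Choose the candidate cut-off minimising FDR + FNR (assumed rule)."""
    return min(candidates, key=lambda c: sum(fdr_fnr(probs, c)))

probs = [0.99, 0.95, 0.90, 0.60, 0.30, 0.10, 0.05, 0.02]
fdr, fnr = fdr_fnr(probs, 0.5)
chosen = best_cutoff(probs, [0.2, 0.5, 0.8])
print(fdr, fnr, chosen)
```

The same `fdr_fnr` calculation can be applied to any predefined subset of genes by passing only the probabilities for that subset.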

Bayesian profile clustering in a decision theory framework

is a key part of Task M2, underpinning the more specifically genomics-focussed work in later sub-projects. It establishes a general formulation for Bayesian model-based clustering, in which subset labels are exchangeable, and items are also exchangeable, possibly up to covariate effects. The notational framework is rich enough to encompass a variety of existing procedures, including some recently discussed methods involving stochastic search or hierarchical clustering; more importantly it allows formulation of clustering procedures that are optimal with respect to a specified loss function. We propose loss functions based on pairwise coincidences, i.e., whether pairs of items are clustered into the same subset or not.

Optimisation of the posterior expected loss function is a binary integer programming problem, readily solved by standard software when clustering a modest number of items, but which quickly becomes impractical as scale increases. To combat this, a new heuristic item-swapping algorithm is introduced. This performs well in our numerical experiments, on both simulated and real data. The paper includes a comparison of the performance of this (approximate) optimal clustering with earlier methods that are model-based but ad hoc in detail.
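To make the pairwise-coincidence idea concrete, the Python sketch below estimates the posterior probability that each pair of items shares a cluster from a set of sampled partitions, and then searches for a partition with low posterior expected loss. For a loss that charges fixed costs a and b for the two kinds of pairwise error, minimising the expected loss is equivalent to maximising the sum of (rho_ij - K) over co-clustered pairs, with K = b/(a+b). The search used here is a simple greedy one-item reassignment, chosen for brevity; it is a stand-in and not the item-swapping algorithm of the paper. All function names, the value K = 0.5 and the four sampled partitions are assumptions.

```python
from itertools import combinations

def coincidence_matrix(partitions):
    """rho[i][j]: fraction of sampled partitions in which i and j co-cluster."""
    n = len(partitions[0])
    rho = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    rho[i][j] += 1.0 / len(partitions)
    return rho

def score(labels, rho, K=0.5):
    """Sum of (rho_ij - K) over co-clustered pairs; larger means lower expected loss."""
    return sum(rho[i][j] - K
               for i, j in combinations(range(len(labels)), 2)
               if labels[i] == labels[j])

def move(labels, i, c):
    """Copy of labels with item i assigned to cluster c."""
    return labels[:i] + [c] + labels[i + 1:]

def greedy_reassign(labels, rho, K=0.5):
    """Greedy local search: move single items while the score strictly improves."""
    labels = list(labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            # Candidate targets: every existing cluster plus one new singleton.
            options = sorted(set(labels) | {max(labels) + 1})
            best = max(options, key=lambda c: score(move(labels, i, c), rho, K))
            if score(move(labels, i, best), rho, K) > score(labels, rho, K) + 1e-12:
                labels = move(labels, i, best)
                improved = True
    return labels

# Four sampled partitions of four items (invented for illustration).
samples = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 2, 2]]
rho = coincidence_matrix(samples)
estimate = greedy_reassign([0, 1, 2, 3], rho)
print(estimate, score(estimate, rho))
```

Like any local search, this can stop at a partition that is not globally optimal, which is why exact integer programming remains attractive when the number of items is small.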

we present a Bayesian mixture model allowing us to express a gene expression profile across different experimental conditions as a linear combination of condition-specific covariates, plus error. It commonly occurs that some genes are not influenced by the covariates, but fall into a `background' class. This calls for an extension to standard models in which the clusters are a priori exchangeable. We thus extend the model to allow for a background cluster that is not exchangeable with the others, and we also build regression on covariates characterising experimental conditions into the expectation structure.

We define a particular heterogeneous Dirichlet process as a mixture of a random point mass and a Dirichlet process. The location of the point mass has a partially degenerate distribution, so that regression coefficients can be fixed at zero for the background cluster. Standard posterior sampling methods for DP models can be extended to make use of this heterogeneous prior model. In the case of conjugacy, we generalise the partition Gibbs sampler to this situation. The background or `top-table' cluster can be identified in the posterior sample. We use a loss function approach following Lau and Green (2007) to derive a point estimate of the remaining clusters. Furthermore, we adapt the approach to study differential gene expression, where we calculate the posterior probabilities of differential expression and posterior expected false discovery rates. Simulation studies and real data analysis are presented.

Two-way (gene by condition) clustering

This sub-project forms Task M3, and was the first activity undertaken by Team 2.

Methodologies for finding biologically meaningful groupings or clusters within data, for example gene expression measurements, have a long history. New statistical procedures for clustering, specifically designed for gene expression data, have also been proposed, such as gene shaving. In

we concentrated instead on approaches based on explicit statistical modelling of the data, following the Bayesian paradigm, which can extract much richer information from the data. Examples of such approaches include those of MacKay and Miskin (2001), and the Plaid model of Lazzeroni and Owen (2000).

Taking a fully Bayesian approach, we fit a simplified version of MacKay and Miskin's model using reversible jump MCMC simulation, which allows the model to choose the number of clusters based on the data. It has features in common with the method for mixture analysis proposed by Richardson and Green (1997). The simplifications are based on reasonable assumptions about the underlying biological processes and lead to a model which is far easier both to fit and to interpret than the original. We discuss methods for ensuring that the sampler mixes well using a variety of different types of moves.

In the second half of the project, this research has been further developed by the second Bristol RA, Dr John Lau, with work including some improvements in computational efficiency and further examples. It is planned to write up the extended work as a three-author paper in the near future.