- September 2005, One Day Meeting on Data Fusion in Genomic
- July 2003 Workshop, Wye, Imperial: Statistical Analysis of Gene Expression Data

- Tim Aitman
- Laurence Game
- Helen Causton
- Mahendra Navarange

The research undertaken in this project has been very productive, leading to 7 publications in high-quality journals and 2 publications currently under review. Software has been accepted by Bioconductor.

This project forms the main part of Task M4 (*Level of
modelling*) and was carried out as a close collaboration between
Teams 1, 2 and 3. It led to the main publication

- Hein, A.-M.K., Richardson, S., Causton, H.C., Ambler, G.K.,
Green, P.J. (2005) BGX: a fully Bayesian gene expression index for
Affymetrix GeneChip data.
*Biostatistics*, **6**, 349-373; doi:10.1093/biostatistics/kxi016.

in which we present the BGX signal extraction model, a Bayesian hierarchical approach for the analysis of Affymetrix GeneChip data. The approach we take differs from other available approaches in two fundamental aspects. Firstly, we integrate all processing steps of the raw data in a common statistically coherent framework, allowing all components and thus associated errors to be considered simultaneously. Secondly, inference is based on the full posterior distribution of gene expression indices and derived quantities, such as fold changes or ranks, rather than on single point estimates. Measures of uncertainty on these quantities are thus available. The models developed represent the first building block for integrated Bayesian analysis of Affymetrix GeneChip data: they take into account additive as well as multiplicative error, and gene expression levels are estimated using perfect match and a fraction of mismatch probes and are modelled on the log scale. Background correction is incorporated by modelling true signal and cross-hybridization explicitly, and the need for further normalization is considerably reduced by allowing for array-specific distributions of nonspecific hybridization. When replicate arrays are available for a condition, posterior distributions of condition-specific gene expression indices are estimated directly, by a simultaneous consideration of replicate probe sets, avoiding averaging over estimates obtained from individual replicate arrays. The performance of the Bayesian model is compared to that of standard available point estimate methods on subsets of the well-known GeneLogic and Affymetrix spike-in data. The Bayesian model is found to perform well and the integrated procedure appears to hold considerable promise for further development.
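As a schematic illustration of the structure described above, the single-array core of such a model can be written as follows. This is a simplified sketch, not the full specification: the symbols ($S$ for specific signal, $H$ for nonspecific hybridization, $\phi$ for the fraction of signal picked up by mismatch probes, $\mathcal{TN}$ for a normal truncated at zero) are notation assumed here for illustration, and array and condition indices are suppressed. The published paper should be consulted for the exact model and priors.

```latex
\begin{align*}
PM_{gj} &\sim \mathcal{N}\!\left(S_{gj} + H_{gj},\ \tau^2\right), \\
MM_{gj} &\sim \mathcal{N}\!\left(\phi\, S_{gj} + H_{gj},\ \tau^2\right), \\
\log(S_{gj} + 1) &\sim \mathcal{TN}\!\left(\mu_g,\ \sigma_g^2\right), \\
\log(H_{gj} + 1) &\sim \mathcal{TN}\!\left(\lambda,\ \eta^2\right).
\end{align*}
```

Here $g$ indexes genes and $j$ probes within a probe set. The additive error enters through $\tau^2$, the multiplicative error through the log-scale variances, and $\mu_g$ plays the role of the gene expression index whose posterior distribution is reported.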

- Hein, A.-M.K., Richardson, S. (2006). A powerful method for
detecting differentially expressed genes from GeneChip arrays that
does not require replicates.
*BMC Bioinformatics*, **7**, 353.

In this subsequent publication, the benefits of using the BGX signal extraction model were further evaluated; in particular, a statistical procedure for performing differential expression analysis without replicates was presented. The procedure relies on the posterior distributions of expression that are obtained in BGX regardless of the number of replicates available. We exploit these posterior distributions to create ranked gene lists that take into account the estimated expression difference as well as its associated uncertainty. We propose a new procedure to estimate the proportion of non-differentially expressed genes empirically, adapting an approach proposed by Efron, that allows an informed choice of cut-off for the ranked gene list. We assess the performance of the method on publicly available spike-in data sets, as well as in a proper biological setting. The method presented is thus found to be a powerful tool for analysing GeneChip expression studies with limited or no replicates.
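The idea of ranking genes from posterior distributions, rather than from point estimates, can be sketched in a few lines. The example below is a minimal illustration and not the procedure of the paper: it assumes posterior draws of the expression index are available as arrays of shape (draws, genes) for each condition, and the ranking statistic (the posterior probability that the absolute difference exceeds a threshold) and the function name are assumptions made for this sketch.

```python
import numpy as np

def rank_by_posterior_difference(samples_a, samples_b, threshold=0.0):
    """Rank genes by P(|mu_a - mu_b| > threshold | data).

    samples_a, samples_b: arrays of shape (n_draws, n_genes) holding
    posterior draws of the expression index in each condition
    (hypothetical inputs for this sketch).
    """
    diff = np.asarray(samples_a) - np.asarray(samples_b)
    # Fraction of draws in which the difference is large: this reflects
    # both the size of the difference and its uncertainty.
    prob = np.mean(np.abs(diff) > threshold, axis=0)
    # Genes with the highest probability come first.
    order = np.argsort(-prob, kind="stable")
    return order, prob

samples_a = np.array([[2.0, 1.0], [2.2, 0.9], [1.8, 1.1]])
samples_b = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8]])
order, prob = rank_by_posterior_difference(samples_a, samples_b, threshold=0.5)
```

A gene with a large but very uncertain difference receives a lower probability than a gene with a moderate but well-determined difference, which is the behaviour a point-estimate ranking cannot provide.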

This is Task M1 of the original proposal. The work was carried out by Team 1, in active collaboration with our French collaborator named in the grant proposal, Dr P. Broët. The main developments are summarized in three papers (two published and one submitted) addressing different aspects of the analysis of differential expression and multiclass experiments.

- Lewin, A., Richardson, S., Marshall, C., Glazier, A. and
Aitman, T. (2006) Bayesian modelling of differential gene
expression.
*Biometrics*, **62**, 1-9.

presents a Bayesian hierarchical model for detecting differentially expressed genes that includes simultaneous estimation of array effects. This complements the work on the BGX signal extraction model, as the hierarchical models developed here can take as their starting point any chosen measure of gene expression, for example derived from RMA or MAS5 processing. We first give empirical evidence that expression-level dependent array effects are needed, and explore different non-linear functions as part of our model-based approach to normalization. The model includes gene-specific variances but imposes some necessary shrinkage through a hierarchical structure. Model criticism via posterior predictive checks is discussed. Modelling the array effects (normalization) simultaneously with differential expression gives fewer false positive results. Secondly, we show how to use the model output for selecting lists of genes for further investigation. To choose a list of genes, we propose to combine various criteria (for instance, fold change and overall expression) into a single indicator variable for each gene. The posterior distribution of these variables is used to pick the list of genes, thereby taking into account uncertainty in parameter estimates. The method is illustrated on the analysis of mouse knockout expression data generated by Tim Aitman at Hammersmith (Team 3) as part of a series of experiments investigating the genetic origin of the metabolic syndrome. We found that Gene Ontology annotations over- and under-represented amongst the genes on the derived lists are consistent with biological expectations, highlighting in particular the role of inflammation-related genes in adipocyte biology.
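Combining several criteria into one indicator per gene, and then selecting on its posterior probability, can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are taken to be MCMC draws of a log fold change and an overall expression level per gene, and the function names, thresholds and the 0.7 probability cut are hypothetical choices, not those of the paper.

```python
import numpy as np

def posterior_indicator_probability(delta_draws, alpha_draws, fold_cut, expr_cut):
    """Posterior probability that a gene meets both criteria.

    delta_draws: (n_draws, n_genes) draws of the log fold change.
    alpha_draws: (n_draws, n_genes) draws of the overall expression level.
    """
    # The indicator is evaluated within each draw, so uncertainty in
    # both parameters carries through to the final probability.
    indicator = (np.abs(delta_draws) > fold_cut) & (alpha_draws > expr_cut)
    return indicator.mean(axis=0)

def select_genes(prob, prob_cut):
    """Indices of genes whose posterior probability reaches prob_cut."""
    return np.flatnonzero(prob >= prob_cut)

delta = np.array([[1.5, 0.2], [1.2, 1.4], [1.3, 0.1], [0.4, 0.3]])
alpha = np.array([[8.0, 8.0], [9.0, 3.0], [7.0, 8.0], [8.0, 8.0]])
prob = posterior_indicator_probability(delta, alpha, fold_cut=1.0, expr_cut=5.0)
chosen = select_genes(prob, prob_cut=0.7)
```

Evaluating the joint criterion draw by draw differs from thresholding posterior means separately: a gene is selected only if the criteria hold together in a large share of the posterior sample.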

Following this work, we considered how to extend the hierarchical model described above to include a mixture prior for the differential expression parameter. These developments are under review for publication.

- Lewin, A., Bochkina, N. and Richardson, S. (2007) Fully
Bayesian mixture model for differential gene expression:
simulations and model checks,
*submitted to Biostatistics*, available at www.bgx.org.uk

In this paper, we present a new, readily interpretable mixture model for classifying genes into different classes of differential expression. The model is implemented in a fully Bayesian manner, including the estimation of the proportion of differentially expressed genes. This is in contrast to empirical Bayes approaches that input known values for this proportion. We show in a range of simulations that this proportion can be estimated well, and that it leads to good estimates of the false discovery rate. Another key contribution of the paper is to show how predictive model checks can be used to investigate which of several possible parametric families is most appropriate for modelling differential expression in any given data set. For this purpose, we have developed specific model checking tools suited to hierarchical mixture models.
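The link between posterior classification probabilities and the false discovery rate can be made concrete with a short sketch. It uses the standard estimate in which the expected FDR of a gene list is the average posterior probability of being non-differentially expressed over the genes on the list. The function name and the example values are assumptions for illustration, not output from the model in the paper.

```python
import numpy as np

def estimated_fdr(post_prob_de, cutoff):
    """Estimate the FDR of the list of genes with P(DE | data) >= cutoff.

    post_prob_de: array of posterior probabilities of differential
    expression, one per gene (hypothetical input for this sketch).
    Returns the estimated FDR and a boolean mask of selected genes.
    """
    post_prob_de = np.asarray(post_prob_de, dtype=float)
    selected = post_prob_de >= cutoff
    if not selected.any():
        # An empty list contains no false discoveries by convention.
        return 0.0, selected
    # Each selected gene contributes its probability of being null.
    fdr = float(np.mean(1.0 - post_prob_de[selected]))
    return fdr, selected

fdr, selected = estimated_fdr([0.95, 0.9, 0.6, 0.1], cutoff=0.8)
```

Because the posterior probabilities depend on the estimated proportion of differentially expressed genes, a poor estimate of that proportion feeds directly into a poor FDR estimate, which is why estimating it within the model matters.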

We illustrate the usefulness of the model checks on a data set for which we show two interesting features: (a) we find that non-differentially expressed genes are not well modelled using a point mass at zero for the differential expression parameter and that a 'nugget' null is more appropriate; (b) we find that model fit is affected by the pre-processing method used. Thus, model checks need to be performed in order to make meaningful classifications based on mixture models.

In parallel, we developed strategies for analysing more complex experimental designs used in microarray studies: multiclass response (MCR) experiments. We again used the flexible framework of mixtures of distributions, applied here to the marginal distribution of a summary statistic for MCR.

- Broët, P., Lewin, A., Richardson, S. and Dalmasso, C.
(2004). A mixture model-based strategy for selecting sets of genes
in multiclass response microarray experiments.
*Bioinformatics*, **20**, 2562-71.

discusses Bayesian inference for
multiclass response experiments, in which more than two classes are
compared. In these experiments, though the null hypothesis is
simple, there are typically many patterns of gene expression changes
across the different classes that lead to complex alternatives. The
paper
proposes a new strategy for selecting genes in multiclass experiments that
is based on a flexible mixture model for the marginal distribution
of a modified F statistic. Using this model, false positive and
negative discovery rates can be estimated and combined to produce a
rule for selecting a subset of genes. Moreover, the method allows
calculation of these rates for any predefined subset of genes. The
performance of the proposed approach is illustrated using simulated
datasets and a real breast cancer microarray dataset from Hedenfalk
*et al.* (2001). In this latter study, we investigate
predefined subsets of genes and point out interesting differences
between three distinct biological pathways.

The work in Task M2 described below was carried out by Team 2. The paper

- Lau, J. W. and Green, P. J. (2007), Bayesian Model Based
Clustering Procedures.
*Journal of Computational and Graphical Statistics*, *in press*.

is a key part of Task M2, underpinning the more specifically genomics-focussed work in later sub-projects. It establishes a general formulation for Bayesian model-based clustering, in which subset labels are exchangeable, and items are also exchangeable, possibly up to covariate effects. The notational framework is rich enough to encompass a variety of existing procedures, including some recently discussed methods involving stochastic search or hierarchical clustering; more importantly it allows formulation of clustering procedures that are optimal with respect to a specified loss function. We propose loss functions based on pairwise coincidences, i.e., whether pairs of items are clustered into the same subset or not.

Optimisation of the posterior expected loss function is a binary integer programming problem, readily solved by standard software when clustering a modest number of items, but which quickly becomes impractical as scale increases. To combat this, a new heuristic item-swapping algorithm is introduced. This performs well in our numerical experiments, on both simulated and real data. The paper includes a comparison of the performance of this (approximate) optimal clustering with earlier methods that are model-based but ad hoc in detail.
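The pairwise-coincidence loss can be illustrated with a small sketch. It estimates the posterior probability that each pair of items is clustered together from sampled partitions, and evaluates the posterior expected loss of a candidate clustering. To stay short, it then picks the best partition among those sampled, which is a simple stand-in and not the binary integer programming formulation or the item-swapping algorithm of the paper. The penalty weights `a` and `b` and all function names are assumptions for this sketch.

```python
import numpy as np

def coincidence_matrix(partitions):
    """Estimate P(items i and j share a cluster) from sampled partitions.

    partitions: sequence of label vectors, one per MCMC draw.
    """
    parts = np.asarray(partitions)
    same = parts[:, :, None] == parts[:, None, :]
    return same.mean(axis=0)

def expected_pairwise_loss(labels, p, a=1.0, b=1.0):
    """Posterior expected loss of a candidate clustering.

    A pair placed apart costs a * p_ij (they may belong together);
    a pair placed together costs b * (1 - p_ij).
    """
    labels = np.asarray(labels)
    together = labels[:, None] == labels[None, :]
    loss = np.where(together, b * (1.0 - p), a * p)
    upper = np.triu_indices(len(labels), k=1)  # count each pair once
    return float(loss[upper].sum())

def best_sampled_partition(partitions, a=1.0, b=1.0):
    """Return the sampled partition with the smallest expected loss."""
    p = coincidence_matrix(partitions)
    losses = [expected_pairwise_loss(c, p, a, b) for c in partitions]
    return partitions[int(np.argmin(losses))], p

draws = [[0, 0, 1], [0, 0, 1], [0, 1, 1]]
best, p = best_sampled_partition(draws)
```

Restricting the search to sampled partitions keeps the example tractable but can miss the true optimum. The number of possible partitions grows very quickly with the number of items, which is the scaling problem the heuristic algorithm addresses.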

As a second major part of Task M2, in

- Lau, J. W. and Green, P. J. (2007), Bayesian clustering using
a heterogeneous Dirichlet process, with application to parametric
gene expression profiles,
*in final preparation*.

we present a Bayesian mixture model allowing us to express a gene expression profile across different experimental conditions as a linear combination of condition-specific covariates, plus error. It commonly occurs that some genes are not influenced by the covariates, but fall into a 'background' class. This calls for an extension to standard models in which the clusters are a priori exchangeable. We thus extend the model to allow for a background cluster that is not exchangeable with the others, and we also build regression on covariates characterising experimental conditions into the expectation structure.

We define a particular heterogeneous Dirichlet process as a mixture of a random point mass and a Dirichlet process. The location of the point mass has a partially degenerate distribution, so that regression coefficients can be fixed at zero for the background cluster. Standard posterior sampling methods for DP models can be extended to make use of this heterogeneous prior model. In the case of conjugacy, we generalise the partition Gibbs sampler to this situation. The background or 'top-table' cluster can be identified in the posterior sample. We use a loss function approach following Lau and Green (2007) to derive a point estimate of the remaining clusters. Furthermore, we adapt the approach to study differential gene expression, where we calculate the posterior probabilities of differential expression and posterior expected false discovery rates. Simulation studies and real data analysis are presented.
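The construction described in words above can be written schematically as follows. The notation is assumed for illustration and is not taken from the paper: $w$ is the weight of the background component, $\theta_0$ the location of the point mass, $\beta_0$ its regression coefficients, $\psi_0$ its remaining parameters (for example a variance), and $F_0$ their distribution. The prior on $w$ is left unspecified here.

```latex
\begin{align*}
P &= w\,\delta_{\theta_0} + (1 - w)\,G, \\
G &\sim \mathrm{DP}(\alpha, G_0), \\
\theta_0 &= (\beta_0, \psi_0), \qquad \beta_0 \equiv 0, \qquad \psi_0 \sim F_0 .
\end{align*}
```

The distribution of $\theta_0$ is partially degenerate in the sense that $\beta_0$ is fixed at zero while $\psi_0$ remains random, so genes allocated to the point mass form a background cluster unaffected by the covariates.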

This sub-project forms Task M3, and was the first activity undertaken by Team 2.

Methodologies for finding biologically meaningful groupings or clusters within data, for example gene expression measurements, have a long history. New statistical procedures for clustering, specifically designed for gene expression data, have also been proposed, such as gene shaving. In

- Ambler, G. K. and Green, P. J. (2007), Bayesian two-way
clustering with applications to gene expression microarray data,
*Technical report, School of Mathematics, University of Bristol*.

we concentrated instead on approaches based on explicit statistical modelling of the data, following the Bayesian paradigm, which can extract much richer information from the data. Examples of such approaches include those of MacKay and Miskin (2001), and the Plaid model of Lazzeroni and Owen (2000).

Taking a fully Bayesian approach, we fit a simplified version of MacKay and Miskin's model using reversible jump MCMC simulation, which allows the model to choose the number of clusters based on the data. It has features in common with the method for mixture analysis proposed by Richardson and Green (1997). The simplifications are based on reasonable assumptions about the underlying biological processes and lead to a model which is far easier both to fit and to interpret than the original. We discuss methods for ensuring that the sampler mixes well using a variety of different types of moves.

In the second half of the project, this research has been further developed by the second Bristol RA, Dr John Lau, with work including some improvements in computational efficiency and further examples. It is planned to write up the extended work as a three-author paper in the near future.