bgx logo

BGX - Flexible Bayesian clustering and partition models for gene expression data

Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

BIG - Bayesian Integrative Genomics

A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson



Multi-Mapping Bayesian Gene eXpression for Affymetrix whole-transcript arrays

MMBGX works on Gene array data at the gene level and Exon array data at the Ensembl transcript or gene level, including alternative isoforms. The software has been tested on Ubuntu and Mac OS X only but should work on any Unix-like system. It has not been tested on Windows (though it has been known to work). The software is computationally intensive so you should try to run it on a powerful computer with lots of CPU cores (at least 4 preferably). If you are analysing lots of arrays, you might want to run the standalone binary on a cluster to speed things up. The package uses the same version of Ensembl as X:Map (currently, version 56). The method is described in Turro et al. 2010. Note that unlike the description in the paper, MMBGX now runs separately on each array and condition-specific expression estimates are generated afterwards as described below.

Getting started

  1. Make sure you have installed R and the standard Bioconductor packages.
  2. Download and install the latest version of the MMBGX R package.
  3. Place your CEL files into condition-specific directories (suggested).
  4. Load the MMBGX package and analyse your CEL files in turn. Suppose you have three CEL files in each of two conditions and they are in directories cond1 and cond2 respectively.
    library(mmbgx)
    setwd("cond1")
    d <- ReadAffy()
    mmbgx(d, geneLevel=FALSE, outputDir="transcriptRuns")
    mmbgx(d, geneLevel=TRUE, outputDir="geneRuns")
    setwd("../cond2")
    d <- ReadAffy()
    mmbgx(d, geneLevel=FALSE, outputDir="transcriptRuns")
    mmbgx(d, geneLevel=TRUE, outputDir="geneRuns")
    setwd("..")
    
    This will analyse each Exon array CEL file in directory cond1 at the transcript and gene level. The output of the transcript-level analysis will go in directories named run.1, run.2, run.3, etc. inside cond1/transcriptRuns/. The output of the gene-level analysis will end up in cond1/geneRuns/ The same will happen in directory cond2. Type ?mmbgx for documentation. Note that each array may be analysed simultaneously. If you have many arrays and a computing cluster, we recommend using the standalone binary.
  5. When the runs are finished, combine the output directories within conditions:
    combineRuns.mmbgx(c(dir("cond1/transcriptRuns", full.names=TRUE), dir("cond2/transcriptRuns", full.names=TRUE)), sampleSets=c(3,3), outputDir="combinedTranscriptRun")
    combineRuns.mmbgx(c(dir("cond1/geneRuns", full.names=TRUE), dir("cond2/geneRuns", full.names=TRUE)), sampleSets=c(3,3), outputDir="combinedGeneRun")
    
    This will loess-normalise all the arrays and average the posterior distributions of the expression parameter "mu" within conditions.
  6. Finally, read in the combined directories for analysis:
    resG <- readSingle.mmbgx("combinedGeneRun")
    resT <- readSingle.mmbgx("combinedTranscriptRun")
    
    You may then access several data structures. E.g.:

Differential expression analysis

It is useful to plot a histogram of the probability of up-regulation of each transcript or gene (cf. Hein et al. 2006, Figure 4). The peaks near 0 and 1 in the example below represent differentially expressed genes/transcripts while the hump around 0.5 represents the non-differentially expressed set. If the spline fit (the black line) is inadequate, try setting a lower value for the df argument. Type ?analysis.mmbgx for documentation.

plotDEHistogram(resT)
dehist

The genes/transcripts may be ranked by differential expression.

> ra <- rankByDE(resT, absolute=TRUE)
> head(ra)
                                Position DiffExpression
ENST00000382159                    33819       424.4531
ENST00000381340+ENST00000451599    33387       412.2417
ENST00000261637                     3963       384.4219
ENST00000371079                    26935       373.2364
ENST00000334258                    15571       365.8463
ENST00000358117                    21671       365.2377
The DiffExpression statistic is described in Turro et al. 2007, the greater its value the greater the probability of differential expression. The Position column tells you the index of the transcript/gene.
> plotDEDensity(resT, "ENST00000382159")
diffdens
> tail(ra)
                Position DiffExpression
ENST00000466277    88301   0.0004721673
ENST00000360019    22569   0.0002682017
ENST00000474642    95186   0.0002639391
ENST00000461757    84570   0.0001762428
ENST00000395730    38539   0.0001606146
ENST00000328642    14258   0.0001176768

> plotDEDensity(resT, 88301)
nondiffdens

Differential splicing analysis

The main goal of differential splicing analysis is to identify transcripts whose expression change between conditions is significantly different from the overall expression change of its gene:

ds <- differentialSplicing(resG, resT)
ds is a data frame with four columns: Genes of interest have g_maxpg0 near 0 or 1 and a relatively high abs(g_maxdiff). This plot may be useful:
plot(ds$g_maxpg0, ds$g_maxdiff)
maxpg0maxdiff
You can obtain a list of candidate genes:
> head(ds[(ds$g_maxpg0 > 0.95 | ds$g_maxpg0 < 0.05) & abs(ds$g_maxdiff) > 2.5,])
40       66 1.00000000  2.976158  111067                 8
104     168 0.02929688 -2.646944   76121                 2
128     218 0.00000000 -3.558733   57454                 7
192     304 0.00000000 -2.710747   28648                14
280     449 1.00000000  2.710265  106233                 8
645    1003 0.00781250 -3.058512  109931                18

> resT$geneIDs[ds$g_index[2]]
[1] "ENSG00000000457"

>  resT$PSids[ds$t_index[2]]
[1] "ENST00000367772"
You can plot the posterior densities of expression parameter "mu" for the various transcripts under each condition:
plotTranscriptDensities(resT, ds$g_index[2])
transdens

Standalone binary

To run MMBGX as a standalone binary, you must first compile it. Then, from R, run the mmbgx function with inputDirs set to a character vector of length equal to the number of arrays in the AffyBatch object and standalone=TRUE (each element of inputDirs is a temporary directory of your choice, one per array). You can then run the standalone binary once per array, passing path/to/inputdir/infile.txt as a single argument each time. E.g.

From the command line (just do this once):


tar -xzf mmbgx_0.99.9.tar.gz
cd mmbgx/src/
make -f Makefile.standalone

If you encounter problems during make, you might need to tinker with the Makefile.standalone file.

Then in R, prepare the input directories:


library(mmbgx)
setwd("cond1")
d <- ReadAffy()
# create input directories transcriptInput.1, transcriptInput.2, ..., geneInput.1, geneInput.2...
mmbgx(d, geneLevel=FALSE, inputDirs=paste("transcriptInput", c(1:length(d)), sep="."), standalone=TRUE)
mmbgx(d, geneLevel=TRUE, inputDirs=paste("geneInput", c(1:length(d)), sep="."), standalone=TRUE)
setwd("../cond2")
d <- ReadAffy()
mmbgx(d, geneLevel=FALSE, inputDirs=paste("transcriptInput", c(1:length(d)), sep="."), standalone=TRUE)
mmbgx(d, geneLevel=TRUE, inputDirs=paste("geneInput", c(1:length(d)), sep="."), standalone=TRUE)

From the command line run MMBGX:


cd cond1; mkdir transcriptRuns; mkdir geneRuns
cd transcriptRuns
for i in `ls .. | grep transcriptInput.`; do 
  ../../mmbgx/src/mmbgx $i/infile.txt
done
cd ../geneRuns
for i in `ls .. | grep geneInput.`; do 
  ../../mmbgx/src/mmbgx $i/infile.txt
done
cd ../../cond2; mkdir transcriptRuns; mkdir geneRuns
# etc

Notes: