bgx logo

Bayesian Integrative Genomics

Phase I: Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

Phase II: A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson.



MMSEQ: haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads

pipeline

The flowchart to the right depicts the MMSEQ pipeline for obtaining expression estimates from RNA-seq data. There are two routes, with starting points labelled A and B. Route A is quite fast and straightforward to run and uses pre-existing transcript sequences for alignment. Route B requires more time, as it involves the creation of custom transcript sequences based on the data. The two routes are described below. Basic knowledge of how to work in a unix-based environment is required. The routeB.sh example bash script included in the MMSEQ package implements an exhaustive execution of the pipeline (route B) and may be straightforwardly adapted to suit your needs. Route A may be run through the R-based ArrayExpressHTS pipeline. Please cite Turro et al. 2011 (Genome Biology) if you use MMSEQ in your work.

Recent news:

ROUTE A: use a ready-made transcript FASTA file consisting of isoform or haplo-isoform sequences

  1. Software requirements:

    MMSEQ consists of two binaries (bam2hits and mmseq) and several Ruby, Perl and bash scripts. Make sure the directories containing the Bowtie, SAMtools and MMSEQ binaries and scripts are in your PATH, otherwise some of the commands in this manual will not work (see Note 8).

  2. Input files needed:
  3. Running the pipeline:
    1. Create Bowtie indexes for your FASTA file using bowtie-build. It is advisable to use a lower-than-default value for --offrate (such as 2 or 3) as long as the resulting index fits in memory. E.g. if you're using the ready-made filtered file for Human, try:
      gunzip Homo_sapiens.GRCh37.64.ref_transcripts.fa.gz
      bowtie-build --offrate 3 Homo_sapiens.GRCh37.64.ref_transcripts.fa Homo_sapiens.GRCh37.64.ref_transcripts
      
    2. Generate BAM file of alignments using Bowtie. Mandatory flags are -a --best --strata -S. Suppress alignments for reads that map to a huge number of transcripts with the -m option (e.g. -m 100). Other flags/options may be needed depending on your data (e.g. -I and -X for the minimum and maximum insert size and/or --norc if you are using a forward-stranded protocol). If your reference FASTA file doesn't use the vertebrate Ensembl naming convention, then you must specify --fullref. If you are using barcoding and the read names of your second mates end with /3, you need to change your FASTQ files so that they end with /2 instead because bowtie only trims read names ending with /1 or /2 (sed -e "s/\(@.*\)\/3$/\\1\/2/" secondreads.fq will do the trick). If you have multiple FASTQ files from the same library, feed them all together to Bowtie in one go (separate the FASTQ file names with commas). The output BAM file must be sorted by read name. With paired-end data, only pairs where both reads have been aligned are used so you can use the samtools 0xC filtering flag to reduce the size of your output. E.g. to align the paired files 3125_7_1.fastq and 3125_7_2.fastq (from the Montgomery et al. HapMap data):
      bowtie -a --best --strata -S -m 100 -X 400 -p 8 Homo_sapiens.GRCh37.64.ref_transcripts \
        -1 3125_7_1.fastq -2 3125_7_2.fastq | samtools view -F 0xC -bS - | samtools sort -n - 3125_7.namesorted
      
    3. Generate mappings from reads to transcript sets with bam2hits. Basic usage: bam2hits transcript_fasta namesorted_bam > hits_file (see Notes 3, 4, 5, 6 for advanced usage). E.g.:
      bam2hits Homo_sapiens.GRCh37.64.ref_transcripts.fa 3125_7.namesorted.bam > 3125_7.hits
      
    4. Run the mmseq program. Basic usage: mmseq hits_file output_base. E.g.:
      mmseq 3125_7.hits 3125_7
      
  4. Description of output: MMSEQ operates on a sample-by-sample basis, and thus the expression estimates are proportional to the RNA concentrations in each sample. Some scaling of the estimates may be required to make them comparable across biological replicates and conditions. The Monte Carlo standard errors (MCSEs) capture the uncertainty due to Poisson counting noise and due to the ambiguity in the mappings between reads and transcripts. The biological variance between samples can only be discerned with the use of biological replicates (see section on differential expression below).

ROUTE B: create new transcript FASTA files using your data

  1. Software requirements:

    MMSEQ consists of three binaries (bam2hits, mmseq and extract_transcripts) and several Ruby, R, Perl and bash scripts. Make sure the directories containing the Bowtie, TopHat, SAMtools and MMSEQ binaries and scripts are in your PATH, otherwise some of the commands in this manual will not work. Also, add the path to polyHap2.jar to the CLASSPATH, otherwise the Java commands in this manual will not work (see Note 8).

  2. Input files needed:
  3. Running the pipeline:
    1. Create Bowtie indexes for your genome FASTA file using bowtie-build. E.g. bowtie-build -f Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa Homo_sapiens.GRCh37.64.dna.toplevel.ref. Alternatively, download ready-made Bowtie indexes for Human (Ensembl v64 (GRCh37/hg19), gzipped).
    2. For each sample, align your reads to the genome using TopHat and the GFF file with options --no-novel-juncs --min-isoform-fraction 0.0 --min-anchor-length 3. The output directory should be specified with the -o option. This example is based on the Montgomery et al. HapMap dataset:
      tophat -G Homo_sapiens.GRCh37.64.ref.gff --no-novel-juncs --min-isoform-fraction 0.0 --min-anchor-length 3 \
             -r 192 -p 8 -o 3125_7.tophat Homo_sapiens.GRCh37.64.dna.toplevel.ref 3125_7_1.fastq 3125_7_2.fastq 
      
    3. Call SNPs using SAMtools mpileup and save as a filtered VCF file (see documentation). E.g. from the main directory:
      SBAMS=( `find . -wholename "*.tophat/accepted_hits.bam"` )
      samtools mpileup -ugf Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa ${SBAMS[@]} | bcftools view -bvcg - > var.raw.bcf
      bcftools view var.raw.bcf | vcfutils.pl varFilter > var.flt.vcf
      
    4. Import SNPs for use with polyHap2. You need to specify the path to the GFF annotation file used in the alignment, the filtered VCF file produced above and a temporary directory in which to save files. E.g.:
      java dataFormat.ImportSNPs Homo_sapiens.GRCh37.64.ref.gff var.flt.vcf ph2temp 
      
      If you are working with a small number of human samples (< 10), it may be worth improving your phasing by combining the VCF file [coming soon] obtained from the 60 CEU samples in the Montgomery et al. HapMap dataset with your own VCF file, instead of using only your own data (see merge-vcf). However, if you intend to do arbitrary phasing just to reduce allelic mapping biases, then there is no need for this (see below).
    5. To phase the genotypes, run the phase.sh script in the temporary directory. E.g.:
      ./ph2temp/phase.sh
      Note that commands listed in phase.sh may safely be run simultaneously if you have access to many CPUs.
      Alternatively, if you are only interested in reducing allelic mapping biases and not in haplo-isoform expression, run the phase.sh script with the -r option to assign the alleles to the two haplotypes without phasing (REF alleles are assigned to the _A haplotype and ALT alleles to the _B haplotype). This option may be used even when the number of samples is as low as 1.
    6. Create the ATCG-based phased haplotypes for each transcript, saving the files in a polyHap2 output directory. You need to specify the path to the GFF annotation file used in the alignment, the path to the temporary directory created above and an output directory. E.g.:
      java dataFormat.ConvertABtoATGC Homo_sapiens.GRCh37.64.ref.gff ph2temp ph2output
      The files produced are a SNPinfo position file and sample-specific haplotype files containing the two haplotypes (suffixed _A and _B) of each transcript.
    7. Enter the polyHap2 output directory and construct custom transcriptome reference FASTA files, one for each sample, using the SNPinfo position file and the sample-specific haplotype files. This is done with the haploref.rb script. Usage: haploref.rb transcript_fasta gff_file pos_file hap_file > hapiso_fasta. E.g. for sample 3125_7:
      haploref.rb ../Homo_sapiens.GRCh37.64.ref_transcripts.fa ../Homo_sapiens.GRCh37.64.ref.gff SNPinfo 3125_7.hap > 3125_7.fa
      
      Now you have a separate haplo-isoform FASTA file for each sample. For each sample, follow the Route A pipeline using the new customised FASTA file (in the example above, 3125_7.fa) as the reference. If you just did arbitrary phasing to reduce allelic mapping biases, bear in mind Note 6.

Differential expression analysis with edgeR or DESeq

The MMSEQ expression estimates are in FPKM units (fragments per kilobase of transcript per million mapped reads or read pairs), which makes different samples broadly comparable. You may read in the output from multiple samples using the readmmseq.R R script included in the MMSEQ package. The readmmseq function takes four arguments:

  1. mmseq_files: vector of *.mmseq output files to read in
  2. sample_names: vector of strings containing the sample names
  3. normalize: logical scalar specifying whether to normalise the values for each sample by their median deviation from the mean (like DESeq)
  4. counts: logical scalar indicating whether the expression values should be converted to count equivalents
If counts=FALSE, this function returns a list with an element named log_mu_gibbs containing the expression estimates and an element named mcse containing the standard errors. If counts=TRUE, this function returns a single table with the count equivalents. To test for differential expression with edgeR or DESeq, the estimates first need to be converted to count equivalents. E.g., to test for DE between two groups of two samples, you might run the following code in R from the directory containing the mmseq output files:

source("/path/to/readmmseq.R")
library(edgeR)
library(DESeq) # version >=1.5

ms = readmmseq(counts=TRUE, normalize=TRUE)
tab = ms[apply(ms,1,function(x) any(x>0)),] # only keep features with counts

# edgeR 
d = DGEList(counts = tab, group = c("group1", "group1", "group2","group2"))
d = estimateCommonDisp(d)
de.edgeR = exactTest(d)

# DESeq
cds = newCountDataSet(round(tab), c("group1", "group1", "group2", "group2") )
sizeFactors(cds) = rep(1, dim(cds)[2])
cds = estimateDispersions(cds)
de.DESeq = nbinomTest(cds,"group1", "group2")

Notes

  1. fastagrep.sh is similar to unix grep except it works on multi-line sequence entries in a FASTA file. E.g., to remove alternative haplotype/supercontig entries from the Ensembl DNA, cDNA and ncRNA FASTA files and combine the filtered cDNA and ncRNA files:

    fastagrep.sh -v 'supercontig|GRCh37:Y:1:10000|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.dna.toplevel.fa > Homo_sapiens.GRCh37.64.dna.toplevel.ref.fa
    fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.cdna.all.fa > Homo_sapiens.GRCh37.64.cdna.ref.fa
    fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.64.ncrna.fa > Homo_sapiens.GRCh37.64.ncrna.ref.fa
    cp Homo_sapiens.GRCh37.64.cdna.ref.fa Homo_sapiens.GRCh37.64.ref_transcripts.fa
    cat Homo_sapiens.GRCh37.64.ncrna.ref.fa >> Homo_sapiens.GRCh37.64.ref_transcripts.fa

    (note how this gets rid of the pesky 'GRCh37:Y:1:10000' entry in Ensembl human DNA FASTA files, which only contains Ns and breaks samtools indexing)

  2. It is possible to use reference transcript FASTA files with entries that are formatted differently than the cDNA and ncRNA files provided by Ensembl. To do this, tell bam2hits how to access the transcript IDs and the gene IDs in the reference entries using the -m tg_regexp t_ind g_ind option. tg_regexp specifies a regular expression that matches FASTA entries (excluding the >) where pairs of brackets are used to capture transcript and gene IDs. t_ind and g_ind are indexes for which pair of brackets capture the transcript and gene IDs respectively. E.g. for the human Ensembl file, entries look like this:

    >ENST00000416067 cdna:known chromosome:GRCh37:21:9907194:9968585:-1 gene:ENSG00000227328

    so (assuming you ran bowtie with --fullref) you can tell bam2hits how to extract the transcript and gene IDs like this:

    bam2hits -m "(E\S+).*gene:(E\S+)" 1 2 Homo_sapiens.GRCh37.64.ref_transcripts.fa 3125_7.sam > 3125_7.hits

    If you are unsure what regular expression to use, try out by trial and error using the testregexp.rb script. Usage: testregexp.rb -m tg_regexp t_ind g_ind transcript_fasta. If you run bam2hits with the -m option, double-check the IDs in the header and main lines of the output hits file before feeding it to mmseq.

  3. In cases where multiple alignments exist for a read, you can choose to keep only those with an insert size that is within x bp from the expected insert size (unless all alignments are beyond the threshold, in which case they are all retained). This is done with the -i expected_isize deviation_thres option in bam2hits. expected_isize is the expected insert size and deviation_thres is the threshold (in bp) away from the expected insert size. E.g. bam2hits -i 200 50 ....

  4. You may correct the transcript lengths for non-uniformity effects with Poisson regression method of Li et al. by specifying the -c alt_lengths option when running bam2hits, where alt_lengths is the name of an output file in which to save the adjusted lengths for future use. This is quite a time-consuming task, so if you have multiple runs with similar non-uniformity biases, you may choose to use the adjusted lengths calculated from one run when processing a subsequent run. This can be achieved by specifying -u alt_lengths in bam2hits, where alt_lengths is the name of a previously produced file containing the adjusted lengths.

  5. If you are not interested in haplo-isoform estimation and have constructed a new transcript reference with arbitrary-phase haplo-isoform sequences solely to reduce allelic mapping biases, then the alignments to the "_A" and "_B" haplo-isoforms should be merged. This can be done by specifying the -b flag when running bam2hits.

  6. Once you have downloaded a GTF file from Ensembl, run the filterGTF.rb and ensembl_gtf_to_gff.pl scripts to remove GTF entries which are not in the transcript FASTA and convert to GFF. E.g. for human:

    filterGTF.rb Homo_sapiens.GRCh37.64.ref_transcripts.fa Homo_sapiens.GRCh37.64.gtf > Homo_sapiens.GRCh37.64.ref.gtf
    ensembl_gtf_to_gff.pl Homo_sapiens.GRCh37.64.ref.gtf > Homo_sapiens.GRCh37.64.ref.gff
    
  7. Examples of how you might set the PATH and CLASSPATH environment variables to use the programs in this manual:
    # rename bundled Mac OS X binaries
    unzip mmseq_0.9.18.zip && cd mmseq_0.9.17
    mv bam2hits-0.9.18-Darwin-x86_64 bam2hits
    mv mmseq-0.9.18-Darwin-x86_64 mmseq 
    
    # set paths for mmseq and samtools binaries and scripts
    export PATH=$PATH:/home/bob/mmseq_0.9.18:/home/bob/samtools:/home/bob/samtools/misc
    
    # set java path for polyHap2
    export CLASSPATH=$CLASSPATH:/home/bob/polyHap2_20110607/polyHap2.jar
    
  8. You may control the number of threads spawned by mmseq with the OMP_NUM_THREADS environment variable.