bgx logo

BGX - Flexible Bayesian clustering and partition models for gene expression data

Funded by the BBSRC Exploiting Genomics initiative, from May 2002 to February 2007.

BIG - Bayesian Integrative Genomics

A collaborative programme started in 2008, coordinated by Prof Sylvia Richardson



MMSEQ: fast, scalable isoform expression estimation using multi-mapping RNA-seq reads

(Manuscript under review)

  1. Software requirements:
  2. Data downloads:
  3. Basic instructions:
    1. Create Bowtie indexes using bowtie-build on the filtered cDNA FASTA file. This must only be done once per species. E.g. once you've downloaded the ready-made filtered file for Human, and assuming you have paired-end FASTQ files 1181_1_1.fastq and 1181_1_2.fastq:
      gunzip Homo_sapiens.GRCh37.56.cdna.ref.fa.gz
      bowtie-build -f Homo_sapiens.GRCh37.56.cdna.ref.fa Homo_sapiens.GRCh37.56.cdna.ref
      
    2. Run Bowtie on your fastq file(s) using the -a --best --strata -S flags. Set other options as necessary (e.g. -I and -X for the minimum and maximum insert size). E.g.
      bowtie -a --best --strata -S -p 8 Homo_sapiens.GRCh37.56.cdna.ref -1 1184_1_1.fastq -2 1184_1_2.fastq 1184_1.sam
      
    3. Generate transcript mappings using the sam2hits.rb script. Usage: sam2hits.rb sam_file cdna_file > hits_file†. E.g.:
      ./mmseq/sam2hits.rb 1184_1.sam Homo_sapiens.GRCh37.56.cdna.ref.fa > 1184_1.hits
      
    4. Run the mmseq programme. Usage: mmseq hits_file. E.g.:
      ./mmseq/src/mmseq 1184_1.hits
      
  4. Description of output:

Notes

* fastagrep.sh is similar to Unix grep except it works on multi-line sequence entries in a FASTA file. E.g.:

./fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.56.cdna.all.fa > Homo_sapiens.GRCh37.56.cdna.ref.fa

† It is possible to use reference transcript FASTA files with entries that are formatted differently than the cDNA files provided by Ensembl. To do this, you must run Bowtie with the --fullref flag. Then, tell sam2hits.rb how to access the transcript IDs and the gene IDs in the reference entries using the -m tg_regexp t_ind g_ind flag. tg_regexp specifies a regular expression that matches entire FASTA entries where pairs of brackets are used to capture transcript and gene IDs. t_ind and g_ind are indexes for which pair of brackets capture the transcript and gene IDs respectively. E.g. for the default Ensembl files, entries look like this:

>ENST00000416067 cdna:known chromosome:GRCh37:21:9907194:9968585:-1 gene:ENSG00000227328

so you can tell sam2hits.rb how to extract the transcript and gene IDs like this:

sam2hits.rb -m ">(E\S+).*gene:(E\S+)" 1 2 1184_1.sam Homo_sapiens.GRCh37.56.cdna.ref.fa > 1184_1.hits

If you are unsure what regular expression to use, try out by trial and error using the testregexp.sh script. Usage: testregexp.sh -m tg_regexp t_ind g_ind cdna_file. If you run sam2hits.rb with the -m option, double-check the IDs in the header and main lines of the output hits file before feeding it to mmseq.