port install boostapt-get install libboost-devyum install boostport install gslapt-get install libgsl0-devyum install gslsrc/ subdirectory and run make. Edit the Makefile if you wish to use a different compiler (e.g. the excellent ICC). If the Boost and GSL libraries aren't found by the compiler, try setting the LIBRARY_PATH and CPLUS_INCLUDE_PATH environmental variables. If you use Mac OS X or GNU/Linux on the x86_64 architecture and are having trouble compiling, try the pre-compiled binaries in the bin/ subdirectory. If all fails, please contact the author.fastagrep.sh script included in the MMSEQ package*.bowtie-build on the filtered cDNA FASTA file. This must only be done once per species. E.g. once you've downloaded the ready-made filtered file for Human, and assuming you have paired-end FASTQ files 1181_1_1.fastq and 1181_1_2.fastq:
gunzip Homo_sapiens.GRCh37.56.cdna.ref.fa.gz
bowtie-build -f Homo_sapiens.GRCh37.56.cdna.ref.fa Homo_sapiens.GRCh37.56.cdna.ref
-a --best --strata -S flags. Set other options as necessary (e.g. -I and -X for the minimum and maximum insert size). E.g.
bowtie -a --best --strata -S -p 8 Homo_sapiens.GRCh37.56.cdna.ref -1 1184_1_1.fastq -2 1184_1_2.fastq 1184_1.sam
sam2hits.rb script. Usage: sam2hits.rb sam_file cdna_file > hits_file†. E.g.:
./mmseq/sam2hits.rb 1184_1.sam Homo_sapiens.GRCh37.56.cdna.ref.fa > 1184_1.hits
mmseq programme. Usage: mmseq hits_file. E.g.:
./mmseq/src/mmseq 1184_1.hits
hits.mmseq:: transcript-level EM estimates, Gibbs posterior mean estimates (use this as your expression measure), Monte Carlo Standard Errors and Integrated Autocorrelation Timeshits.amalgamated.mmseq: as above for sets of identical transcriptshits.gene.mmseq: as above for genes* fastagrep.sh is similar to Unix grep except it works on multi-line sequence entries in a FASTA file. E.g.:
./fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.56.cdna.all.fa > Homo_sapiens.GRCh37.56.cdna.ref.fa
† It is possible to use reference transcript FASTA files with entries that are formatted differently than the cDNA files provided by Ensembl. To do this, you must run Bowtie with the --fullref flag. Then, tell sam2hits.rb how to access the transcript IDs and the gene IDs in the reference entries using the -m tg_regexp t_ind g_ind flag. tg_regexp specifies a regular expression that matches entire FASTA entries where pairs of brackets are used to capture transcript and gene IDs. t_ind and g_ind are indexes for which pair of brackets capture the transcript and gene IDs respectively. E.g. for the default Ensembl files, entries look like this:
>ENST00000416067 cdna:known chromosome:GRCh37:21:9907194:9968585:-1 gene:ENSG00000227328
so you can tell sam2hits.rb how to extract the transcript and gene IDs like this:
sam2hits.rb -m ">(E\S+).*gene:(E\S+)" 1 2 1184_1.sam Homo_sapiens.GRCh37.56.cdna.ref.fa > 1184_1.hits
If you are unsure what regular expression to use, try out by trial and error using the testregexp.sh script. Usage: testregexp.sh -m tg_regexp t_ind g_ind cdna_file. If you run sam2hits.rb with the -m option, double-check the IDs in the header and main lines of the output hits file before feeding it to mmseq.