INSTALLATION
The package distributed consists of:
main source file gmx.f
auxiliary routines in
fortran and C algama.f dgauss4.f pnorm.f dgamma.f xflush.f sd.c
makefile Makefile_gmx
example data files enz.dat galx.dat lnacid.dat (used in Richardson and Green)
AACRresiduals.dat (used in Broët et al)
this file readme.txt
All files are text files, so can be run under Unix or Windows (subject to correct format for line endings).
---------------------------------------------
RUNNING
Gmix is a Fortran program for Unix-like systems, implementing
the methodology for univariate normal mixture
analysis described in Richardson and Green
(J. R. Statist. Soc. B, 1997, 59, 731-792; see also the
correction in J. R. Statist. Soc. B, 1998, 60, 661).
These notes assume familiarity with this paper.
The model described in the above papers has been extended
so that one component has fixed mean and possible fixed
variance. Instead of one parameter k for the number of components,
there are now two parameters kleft and kright for the numbers of
components to the left and right respectively of the fixed mean
component. Either of these parameters can be zero.
The program reads a single main data file, and produces
multiple output text files summarising the posterior
distribution of the model, for subsequent analysis
and display (R or Splus are good choices for these
tasks). (Only very limited summary information
is computed by this program itself.)
The run of the program is controlled by options, switches
and parameter settings that have sensible default values,
and can be set by command line arguments in Unix shell
style.
For example, the run on the Breast Cancer dataset in Broët et al,
with run-length of 100000, burn-in of 100000, and minimal output
options could be replicated by:
gmix -range -empty -dfix1 -fixl -kil0 -kpois2 -nb100000 -n100000 AACRresiduals.dat
The principal options are as follows, listed in
several main groups. In this list, N denotes
a numerical value passed in as a parameter of the model
or sampler, other options are logical settings
that are by default off, turned on by specifying the
option.
1. Model options
-prior disregard the likelihood and the data (except possibly
in setting prior settings, etc.), thus computing the
prior instead of the posterior.
-fixl suppress dimension-changing moves, thus fixing the
-fixr number of mixture components (to the value set by the
-kil and -kir options below).
2. Parameter settings
-range use range-based hyperparameters (default: don't use them)
-kpoisN set kleft,kright poisson(N) (default Uniform, default for N=2)
-dN set delta (hyperparameter for weights) (default 1)
-spN set s (spacing parameter in prior for means) (default 1)
(see rejoinder to discussion of Richardson and Green, page 786)
-aN set alpha (hyperparameter for variances) (default 2)
-bN set beta (hyperparameter for variances) (default Gamma(g,h))
-gN set g (hyperparameter for beta) (default 0.2)
-hN set h (hyperparameter for beta) (default 10)
The upper limit kmax of the number of components
is hard-wired equal to 15 for both kleft and kright in the program - it can be changed by editing line 1 of gmx.f (where they are termed ncmaxleft,ncmaxright)
3. Monte Carlo sampler options
-nN number of sweeps (default 10000)
-nbN length of burn-in (default 0)
-kilN initial number of components (default 1) similar for right
-empty use empty-component birth-death moves
-seedN random number seed (default 0, meaning use clock time to
initialise). In the log file for the run, the seed
actually used is printed, and the run can be repeated
with the same random numbers by specifying that value
in this option on the subsequent run.
4. Output options
-pLETTERS additional output files, if LETTERS includes
d: .avn .den .dev
c: .pcl .scl
w: .wms.*
a: .z.*
(default: none of these)
See below for contents of these files.
-nsN sets nsamp: number of (equi-spaced) Monte Carlo samples
dumped in .out file (default 100)
-nspN sets nspace: spacing in sweeps between successive records
in trace files .ent, .bk, .dev, .wms*, .z* (default 20)
-debug turn on verbose output
Input and output file names
The input data should be in a file with extension .dat,
with the sample size n on the first line, and the data
values y_1, y_2, ..., y_n following in free format, with arbitrary
spaces and newlines.
All output files are given names of the
form N.ext, where N is a unique integer assigned by the program
to label successive runs (it chooses the smallest positive integer
such that N.log does not already exist), and .ext signifies the
type of output the file contains.
The extensions and their meanings are:
(a) normal output:
.log logs information about run, parameter settings,
acceptance rates, etc.
.kl posterior for kleft, Bayes factors, prior on kleft,
posterior for kposleft, empty-component statistics
and, optionally, average deviances, etc. (.kr similar for right)
.pe parameter estimates: posterior means for weights, means
and variances for each component, conditional on each
pair of values of kleft,kright
.out [*] sample of output states: k, kpos,
weight, mean, variance and number of observations
assigned, for each component
.bdlogl [*] changes to kleft or kposleft: number of sweep (file not
produced when -fix option set) .bdlogr same for right
.ent [nsp] k, kpos, entropy
(b) additionally, if beta or kappa are random:
.bk [nsp] beta, kappa, k, kpos
(c) optionally, if -debug specified
.db verbose records of all moves of the sampler
(d) optionally, as determined by -p... options (see above)
.avn for each (kleft,kright), number of visits to (kleft,kright), average numbers
of observations assigned to each component
.den posterior expected density, on a grid of 200 equi-spaced
y values (spanning the range of the data, expanded by
5% in each direction): output is grid, density, conditional
on k=1,2,...,6, and unconditionally
.dev [nsp] k, kpos, deviance, modified deviance
.pcl predictive classification: for each (kleft,kright), for each j=1,2,..,k-1,
for each y* on grid (see .den above), posterior probability
that z*=j
.scl within-sample classification: for each (kleft,kright), for each j=1,2,..,k-1,
for each i = 1,2,...,n, posterior probability that z(i)=j
.wms.K [nsp]in a separate file for each k, current values for weights,
means and variances of each component j=1,2,...,k (thus each
sample record is of k consecutive lines)
.z.K [nsp] in a separate file for each k, current values for allocations
z(i),i=1,...,n
key: kpos = number of non-empty components
[nsp] - trace files with output dumped every nspace sweeps (see
-nsp option above)
[*] - other trace files
other output files contain information aggregated over the run.
Gmix is free of charge for educational and non-commercial research purposes.