SCONE (Sequence CONservation Evaluation) reports position-specific measures of
conservation. There are six required inputs:
1) --maf <file>
An alignment file in MAF format. MAF is a standard, widely used format described
here: http://genome.ucsc.edu/FAQ/FAQformat#format5
2) --alias <file>
An alias file. This defines the set of species to be considered for scoring.
Each line in the file has two space-separated fields: the species name and
the alias - the string identifier specifying it in the MAF file. For example:
human hg17
This means whenever SCONE encounters the string 'hg17' in the MAF file, it
should consider it to be 'human' sequence. You may define multiple aliases
for each species in a single file, and lines may be easily commented out by
prepending them with a '#'. Note that SCONE will ONLY consider species
present in the alias file. This property may be used to restrict scoring to
a subset of the species in a MAF file (primates only, or invertebrates only).
3) --phylo <file>
A phylogenetic tree file. SCONE employs the Newick format popularized by Phylip:
http://evolution.genetics.washington.edu/phylip/newicktree.html
Species names are specified according to the first (species) field in the
alias file (e.g., in the tree file we would write 'human' rather than 'hg17'
using the above alias example). Time MUST be specified in
substitutions/site, NOT in years.
4) --matrix <file>
A substitution rate matrix file. This is a 64x64 matrix of the instantaneous
mutation rates between nucleotide triplets (e.g. the rate of mutation
ATG -> ATA). Please consult the source code for details on how triplets
are ordered if you wish to construct your own matrix.
5 & 6) --ins <file> --del <file>
A vector of insertions and a vector of deletions. Each is a 5-dimensional
vector describing the frequency of insertions (or deletions) of size 0, 1, 2,
3, or larger than 3 in unit time. A vector file consists of a line with the
size of the vector, followed by the values, one per line.
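The alias and vector file layouts described above can be sketched in Python.
This is a minimal illustration, not part of SCONE: the function names are
hypothetical, the aliases other than 'hg17' are only examples, and the vector
values are placeholders rather than measured frequencies.

```python
import io

def parse_alias_file(lines):
    """Map each MAF identifier (alias) to its species name."""
    alias_to_species = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a line commented out with '#'
        species, alias = line.split()[:2]
        alias_to_species[alias] = species
    return alias_to_species

def format_vector(values):
    """Return vector-file text: the size first, then one value per line."""
    return "\n".join([str(len(values))] + [str(v) for v in values]) + "\n"

# Illustrative alias file: two aliases for mouse, chimp commented out.
alias_text = """\
human hg17
# chimp panTro1
mouse mm5
mouse mm6
"""
aliases = parse_alias_file(io.StringIO(alias_text))

# Placeholder frequencies for sizes 0, 1, 2, 3, and larger than 3.
insertion_vector = format_vector([0.9, 0.05, 0.03, 0.01, 0.01])
print(aliases)
print(insertion_vector, end="")
```

Because the commented-out chimp line is skipped, sequences labelled
'panTro1' would not be scored with this alias file.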
Some recommended inputs:
--scale <num>
This defines the length of a unit time. For the given matrix file, this value
should be the length of the human-chimp branch in substitutions per site.
--reference <species>
Define the reference species in your alignment. This is the species whose
coordinates will be used to report conservation scores. Set this explicitly
if your reference is not human, since the default is 'human'.
--use-reference
By default SCONE excludes the reference species from conservation score
computations. Sometimes you may want to include the reference species
sequence in conservation scoring; this option will include it in scoring
calculations.
--iterations <num>
If you are using an option that computes p-values, this option will specify
how many rounds of Monte Carlo simulation should be used to estimate the
distribution of scores for p-value computation. More iterations give a more
accurate estimate, but run time grows linearly with <num>.
--gapless
Don't use the gap model. This may be useful for unreliable alignments, or if
gap distributions are not available.
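As a rough illustration of how the required and recommended options fit
together, the following Python sketch assembles an argument list. The
executable name 'scone', the helper name, all file names, and the scale value
are assumptions made for the example; only the option names come from this
document.

```python
def build_scone_args(maf, alias, phylo, matrix, ins, deletions, *,
                     scale=None, reference=None, use_reference=False,
                     iterations=None, gapless=False, executable="scone"):
    """Assemble an argument list from the required and recommended options."""
    args = [executable,
            "--maf", maf, "--alias", alias, "--phylo", phylo,
            "--matrix", matrix, "--ins", ins, "--del", deletions]
    if scale is not None:
        args += ["--scale", str(scale)]
    if reference is not None:
        args += ["--reference", reference]
    if use_reference:
        args.append("--use-reference")
    if iterations is not None:
        args += ["--iterations", str(iterations)]
    if gapless:
        args.append("--gapless")
    return args

# Hypothetical file names; 0.01 is a placeholder, not a recommended scale.
args = build_scone_args("align.maf", "species.alias", "tree.nwk",
                        "rates.matrix", "ins.vec", "del.vec",
                        scale=0.01, reference="human", iterations=1000)
print(" ".join(args))
```

A list built this way could be handed to subprocess.run, or simply printed
and pasted into a shell.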
Some debugging options:
--debug=<level>
Not necessary, of course, but it may be indispensable in tracing strange
behaviors. Levels range from 0 to 5.
--select-position <pos>
Useful if you're only interested in one position in your alignment file. But
beware SCONE's startup overhead! Running it once per position may not be
advisable.
--count <num>
Only compute <num> positions. Useful if you just want to collect a certain
sample size of data.
Some other useful options:
--select-position <pos>
Runs SCONE only for reference position <pos> (no chromosome is specified;
SCONE assumes one reference chromosome per file, unfortunately).
--start-position <pos>
Skips outputting until position <pos> is reached.
--count <num>
Compute scores for exactly <num> positions. Combined with --start-position,
this may be used to easily divide your large MAF file into bite-sized chunks.
--bytemode <file>
Runs SCONE in bytewise output mode. This writes ONLY the position number and
p-value in C-style binary numbers (an int and a float, respectively). For
very large runs, this produces a more compact output than SCONE normally
does. A simple C script for unpacking this output is included in the
'dist/files' directory.
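For readers who would rather not compile the C unpacker, the same idea can be
sketched in Python. The record layout used here is an assumption (a 4-byte
int followed by a 4-byte float, native byte order, no padding between
records); the unpacker shipped in 'dist/files' is the authoritative
description of the format.

```python
import struct

# Assumed layout: one 4-byte int followed by one 4-byte float, native byte
# order, no padding between records. Check the bundled C unpacker.
RECORD = struct.Struct("=if")

def unpack_bytemode(data):
    """Return a list of (position, p_value) tuples from raw bytemode output."""
    usable = len(data) - len(data) % RECORD.size  # ignore a trailing partial record
    return list(RECORD.iter_unpack(data[:usable]))

# Round-trip two synthetic records to show the layout; real input would be
# read from the --bytemode file with open(path, "rb").read().
sample = RECORD.pack(100, 0.5) + RECORD.pack(101, 0.25)
records = unpack_bytemode(sample)
print(records)
```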
SCONE has three modes of operation:
--parsimony
Uses a simple parsimony-based model (counting the number of substitutions in
the tree) to score conservation. The output columns for each position read:
<chr>:<pos> [<base>] <substitutions in tree> <p-value> <length of tree>
--likelihood
Uses a maximum-likelihood (ML) estimate of the rate at which a site is
evolving to score conservation. Output:
<chr>:<pos> [<base>] <rate of site evolution> <p-value> <length of tree>
--bayes
Like --likelihood, but corrected according to a prior distribution (by
default a uniform prior between 0 and 1, but any tabulated distribution may
be supplied using --prior <vector file> and --priori <interval size>). Output:
<chr>:<pos> [<base>] <rate of site evolution> <p-value> <length of tree>
We recommend using --bayes.
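The output layout shared by the three modes can be read back with a small
parser. The sketch below assumes that fields are whitespace-separated and
that the base is wrapped in literal square brackets; neither detail is
confirmed here, and the sample line is invented for illustration. The names
SconeRecord and parse_scone_line are hypothetical.

```python
from typing import NamedTuple, Optional

class SconeRecord(NamedTuple):
    chrom: str
    pos: int
    base: Optional[str]
    score: float        # substitution count (--parsimony) or site rate
    p_value: float
    tree_length: float

def parse_scone_line(line):
    """Parse '<chr>:<pos> [<base>] <score> <p-value> <length of tree>'."""
    fields = line.split()
    chrom, _, pos = fields[0].rpartition(":")
    # Assumption: the base field may be absent; strip brackets if present.
    base = fields[1].strip("[]") if len(fields) == 5 else None
    score, p_value, tree_length = (float(x) for x in fields[-3:])
    return SconeRecord(chrom, int(pos), base, score, p_value, tree_length)

# Invented sample line, not real SCONE output.
record = parse_scone_line("chr1:12345 [A] 0.12 0.003 1.85")
print(record)
```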
A note on overhead: SCONE pre-computes several matrices when it first starts.
This step takes a fair amount of computing time. After this step is completed,
the program runs much more efficiently. A great deal of time may therefore be
saved by minimizing the number of times SCONE is run. Fortunately, MAF files may
be easily concatenated to avoid this issue. Blocks may also be trimmed from
MAF files.
Questions or comments should be e-mailed to Saurabh Asthana (sasthana at fas . harvard. edu).