Index of /scone

Name                  Last modified      Size

dist/                 05-Feb-2008 21:02  -
scone-0.5.1.tar.bz2   10-Dec-2007 13:53  103K
scone-0.6.1.tar.bz2   02-Jun-2008 14:03  53K

SCONE (Sequence CONservation Evaluation) reports position-specific measures of
conservation. There are six required inputs:
1) --maf <file>
   An alignment file in MAF format. MAF is a standard, widely used format described
   here: http://genome.ucsc.edu/FAQ/FAQformat#format5
2) --alias <file>
   An alias file. This defines the set of species to be considered for scoring.
   Each line in the file has two space-separated fields: the species name and
   the alias - the string identifier specifying it in the MAF file. For example:
   human hg17
   This means whenever SCONE encounters the string 'hg17' in the MAF file, it
   should consider it to be 'human' sequence. You may define multiple aliases
   for each species in a single file, and lines may be easily commented out by
   prepending them with a '#'.  Note that SCONE will ONLY consider species
   present in the alias file. This property may be used to restrict scoring to
   a subset of the species in a MAF file (primates only, or invertebrates only).
3) --phylo <file>
   A phylogenetic tree file. SCONE employs the Newick format popularized by Phylip:
   http://evolution.genetics.washington.edu/phylip/newicktree.html
   Species names are specified according to the first (species) field in the
   alias file (e.g., in the tree file we would write 'human' rather than 'hg17'
   using the above alias example). Branch lengths MUST be specified in
   substitutions/site, NOT in years.
4) --matrix <file>
   A substitution rate matrix file. This is a 64x64 matrix of the instantaneous
   mutation rates between nucleotide triplets (e.g. the rate of mutation
   ATG -> ATA). Please consult the source code for details on how triplets
   are ordered if you wish to construct your own matrix.
5 & 6) --ins <file> --del <file>
   A vector of insertions and a vector of deletions. These are 5-dimensional
   vectors describing the frequency of insertions (or deletions) of size 0, 1,
   2, 3, or larger than 3 in unit time.  A vector file consists of a line with
   the size of the vector, followed by one value per line.
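
The alias and vector file layouts described above can be sketched in a few
lines of Python. This is only an illustration of the formats as this README
words them, not code taken from SCONE: the function names are made up, and
any values you pass in are your own.

```python
from pathlib import Path


def write_alias_file(path, pairs):
    """Write one 'species alias' pair per line, separated by a single space."""
    lines = [f"{species} {alias}" for species, alias in pairs]
    Path(path).write_text("\n".join(lines) + "\n")


def read_alias_file(path):
    """Return {alias: species}, skipping blank lines and '#' comment lines."""
    mapping = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        species, alias = line.split()
        mapping[alias] = species
    return mapping


def write_vector_file(path, values):
    """Write the vector size on the first line, then one value per line."""
    lines = [str(len(values))] + [str(value) for value in values]
    Path(path).write_text("\n".join(lines) + "\n")
```

For example, write_alias_file("alias.txt", [("human", "hg17")]) produces the
single line 'human hg17' shown above, and write_vector_file called with five
values produces a six-line file whose first line is '5'.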

Some recommended inputs:
--scale <num>      
  This defines the length of a unit time. For the given matrix file, this value
  should be the length of the human-chimp branch in substitutions per site.
--reference <species>
  Define the reference species in your alignment. This is the species whose
  coordinates will be used to report conservation scores. Be sure to set this
  if your reference is not human, since the default is 'human'.
--use-reference
  By default SCONE excludes the reference species from conservation score
  computations.  Sometimes you may want to include the reference species
  sequence in conservation scoring. This option will include it in scoring
  calculations.
--iterations <num>
  If you are using an option that computes p-values, this option will specify
  how many rounds of Monte Carlo simulation should be used to estimate the
  distribution of scores for p-value computation.  Higher is more accurate but
  slower; run time grows linearly with the number of iterations.
--gapless
  Don't use the gap model. Might be useful for unreliable alignments, or if gap
  distributions are not available.
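
As a rough sketch of how the required and recommended options fit together,
the following Python assembles a command line from them. It uses only flags
documented in this README. The program name 'scone', the helper function, and
every file name and number in the example are assumptions to be replaced with
your own.

```python
import shlex

REQUIRED_FLAGS = ("maf", "alias", "phylo", "matrix", "ins", "del")
MODES = ("--parsimony", "--likelihood", "--bayes")


def build_scone_command(required, mode="--bayes", scale=None, reference=None,
                        use_reference=False, iterations=None, gapless=False,
                        program="scone"):
    """Return an argument list; 'program' is an assumed executable name."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    argv = [program]
    for flag in REQUIRED_FLAGS:
        # A missing required input raises KeyError here.
        argv += [f"--{flag}", str(required[flag])]
    if scale is not None:
        argv += ["--scale", str(scale)]
    if reference is not None:
        argv += ["--reference", reference]
    if use_reference:
        argv.append("--use-reference")
    if iterations is not None:
        argv += ["--iterations", str(iterations)]
    if gapless:
        argv.append("--gapless")
    argv.append(mode)
    return argv


# Placeholder file names and numbers, chosen only for illustration.
example = build_scone_command(
    {"maf": "aln.maf", "alias": "alias.txt", "phylo": "tree.nwk",
     "matrix": "rates.mat", "ins": "ins.vec", "del": "del.vec"},
    scale=0.01, reference="human", iterations=1000)
print(shlex.join(example))
```

The list can be passed directly to subprocess.run, which avoids shell quoting
problems with unusual file names.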


Some debugging options:
--debug=<level>
  Not necessary, of course, but it may be indispensable in tracing strange
  behaviors. Levels range from 0 to 5.
--select-position <pos>
  Useful if you're only interested in one position in your alignment file. But
  beware SCONE's overhead time! It may not be advisable.
--count <num>
  Only compute <num> positions. Useful if you just want to collect a certain
  sample size of data.

Some other useful options:
--select-position <pos>
  Runs SCONE only for reference position <pos> (no chromosome is specified;
  SCONE assumes one reference chromosome per file, unfortunately).
--start-position <pos>
  Skips outputting until position <pos> is reached.
--count <num>
  Compute scores for exactly <num> positions. Combined with --start-position,
  this may be used to easily divide your large maf file into bite-sized chunks.
--bytemode <file>
  Runs SCONE in bytewise output mode. This writes ONLY the position number and
  p-value in C-style binary numbers (an int and a float, respectively). For
  very large runs, this produces a more compact output than SCONE normally
  does. A simple C script for unpacking this output is included in the
  'dist/files' directory.
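
If you would rather read --bytemode output from Python than from the bundled
C script, something like the following may work. It is a sketch under
assumptions that this README does not confirm: records are taken to be a
4-byte int followed by a 4-byte float, packed back to back with no padding,
in the byte order of the machine that wrote them. Check these against the C
script before relying on it.

```python
import struct

# Assumed record layout: 4-byte int, 4-byte float, native byte order, no padding.
RECORD = struct.Struct("=if")


def read_bytemode(path):
    """Return a list of (position, p_value) tuples from a bytemode file."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Ignore a trailing partial record, e.g. from an interrupted run.
    usable = len(data) - (len(data) % RECORD.size)
    return list(RECORD.iter_unpack(data[:usable]))
```

Under these assumptions each record occupies 8 bytes, so a file's size divided
by 8 gives the number of positions it holds.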

SCONE has three modes of operation:
--parsimony
  Uses a simple parsimony-based model (counting the number of substitutions in the tree) to score
  conservation. The output columns for each position read:
  <chr>:<pos> [<base>] <substitutions in tree> <p-value> <length of tree>
--likelihood
  Uses an ML estimate of the rate at which a site is evolving to score conservation. Output:
  <chr>:<pos> [<base>] <rate of site evolution> <p-value> <length of tree>
--bayes
  Like --likelihood, but corrected according to a prior distribution (by
  default a uniform prior between 0 and 1, but any tabulated distribution may
  be supplied using --prior <vector file> and --priori <interval size>). Output:
  <chr>:<pos> [<base>] <rate of site evolution> <p-value> <length of tree>
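
The output lines above can be read back with a small parser such as the one
below. This is a sketch, not part of SCONE. It assumes the columns are
separated by whitespace, and because this README does not say whether the
brackets around <base> are printed literally or mark an optional column, it
accepts both forms. The record type and function name are hypothetical.

```python
from typing import NamedTuple, Optional


class SconeRecord(NamedTuple):
    chrom: str
    pos: int
    base: Optional[str]
    score: float        # substitution count or rate, depending on the mode
    p_value: float
    tree_length: float


def parse_scone_line(line):
    """Parse '<chr>:<pos> [<base>] <score> <p-value> <length of tree>'."""
    fields = line.split()
    if len(fields) not in (4, 5):
        raise ValueError(f"unexpected number of columns: {line!r}")
    chrom, pos = fields[0].rsplit(":", 1)
    # The base column is present only when there are five fields.
    base = fields[1].strip("[]") if len(fields) == 5 else None
    score, p_value, tree_length = (float(x) for x in fields[-3:])
    return SconeRecord(chrom, int(pos), base, score, p_value, tree_length)
```

A made-up line such as 'chr7:115404 [A] 0.25 0.5 3.5' parses to chromosome
'chr7', position 115404, base 'A', and the three numeric columns as floats.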

We recommend using --bayes.

A note on overhead: SCONE pre-computes several matrices when it first starts.
This step takes a fair amount of computing time. After this step is completed,
the program runs much more efficiently. A great deal of time may therefore be
saved by minimizing the number of times SCONE is run. Fortunately, MAF files may
be easily concatenated to avoid this issue. Blocks may also be trimmed from
MAF files.
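
One way to concatenate MAF files is sketched below in Python. It keeps the
'##maf' header line from the first file only and separates the files with a
blank line. Whether SCONE tolerates repeated headers or extra blank lines is
not stated here, so treat this as an assumption-laden illustration; the
function name is made up. Since SCONE assumes one reference chromosome per
file, the inputs should all share the same reference chromosome.

```python
def concatenate_maf(input_paths, output_path):
    """Join MAF files, keeping the '##maf' header from the first file only."""
    with open(output_path, "w") as out:
        for index, path in enumerate(input_paths):
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    if index > 0 and line.startswith("##maf"):
                        continue  # drop repeated headers from later files
                    out.write(line)
                    last = line
            if not last.endswith("\n"):
                out.write("\n")  # finish an unterminated final line
            out.write("\n")      # blank line so the last block is terminated
```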

Questions or comments should be e-mailed to Saurabh Asthana (sasthana at fas . harvard. edu).