Original GeneEvolve

What is it?

GeneEvolve was originally an R script written to accurately simulate genetically informative data and evolutionary genetic processes. It has since been supplanted by a version written in C++ that does everything the original version does and much more, including simulating realistic whole-genome molecular data. Find the page for the new version here. Below is information on the old version, which is no longer supported.


Downloads (original version):

Program: GeneEvolve73.zip
Manual: GeneEvolveManual
Requires: R statistical program (http://cran.r-project.org)
Optional: Mx statistical program (http://www.vcu.edu/mx)

You’ll need R on your system to run it, but beyond that, you do not have to have proficiency in R to use GeneEvolve.


Uses of GeneEvolve (original version)

Why simulate genetically informative (e.g., pedigree or twin) data? With complicated models, it is difficult or impossible to find expected equilibrium parameter values analytically (e.g., the increase in epistatic variation due to assortative mating). Doing so through simulation, however, is straightforward: simulate the effects of genes, alleles, environments, etc. on individuals in a population, and allow this population to evolve (meet, mate, and have offspring, who in turn meet, mate, and have offspring, and so on) for many generations, until the parameters reach equilibrium.
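The generational loop described above can be sketched in a few lines. This is a toy illustration in Python, not GeneEvolve's actual code: the function name, the 0/1 allele coding, the rank-based mate pairing, and all default values are assumptions made for the example.

```python
import random
import statistics

def simulate_generations(pop_size=500, n_loci=20, n_generations=10,
                         env_sd=2.0, assortative=True, seed=1):
    """Return the additive genetic variance at generation 0, 1, ..., n.

    Hypothetical toy model: each person carries two 0/1 alleles per locus,
    the genetic value is the allele count, and the phenotype adds normally
    distributed environmental noise. pop_size is assumed to be even.
    """
    rng = random.Random(seed)

    def new_person():
        return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n_loci)]

    def genetic_value(person):
        return sum(a + b for a, b in person)

    def phenotype(person):
        return genetic_value(person) + rng.gauss(0.0, env_sd)

    def gamete(person):
        # Pass on one randomly chosen allele per locus (free recombination).
        return [rng.choice(pair) for pair in person]

    mothers = [new_person() for _ in range(pop_size // 2)]
    fathers = [new_person() for _ in range(pop_size // 2)]
    variances = [statistics.pvariance(
        [genetic_value(p) for p in mothers + fathers])]

    for _ in range(n_generations):
        if assortative:
            # Pair mates by phenotype rank: a deliberately strong assortment.
            mothers.sort(key=phenotype)
            fathers.sort(key=phenotype)
        else:
            rng.shuffle(fathers)
        children = []
        for mom, dad in zip(mothers, fathers):
            for _ in range(2):  # two offspring per couple keeps size constant
                children.append(list(zip(gamete(mom), gamete(dad))))
        rng.shuffle(children)
        half = len(children) // 2
        mothers, fathers = children[:half], children[half:]
        variances.append(statistics.pvariance(
            [genetic_value(p) for p in children]))
    return variances
```

Calling `simulate_generations(assortative=True)` and `simulate_generations(assortative=False)` and comparing the two trajectories is one way to watch a variance component drift toward its equilibrium value; the real script tracks many more parameters than this sketch does.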

Thus, GeneEvolve is a tool for predicting complex dynamics that arise in evolutionary genetics. By simulating such processes, creating datasets, and running them through statistical models, it also offers an independent check on a wide range of human and animal statistical models.

GeneEvolve is written in R and is open-source. This not only allows you to look into the black box and figure out how it works, it also allows you to modify the script. If you alter it, please send modifications of the script to me so that alternative versions can be found at one central location.

Modeling aid: GeneEvolve allows you to simulate any of the flavors of genetically informative designs—from the Classical Twin Design to Extended Twin Family Designs and everything in between—given user input parameters (the canonical A, D, C, and E parameters, plus many more). As of Oct. 2007, GeneEvolve also allows you to simulate A*age, A*sex, A*E, and A*C interaction effects as well as repeated-measures data but not yet other types of multivariate data. By simulating such data, the script can help:

  • Check model bias: Feed GeneEvolve values of parameters that are in your model, simulate the data, and check whether your model is actually recovering what you know is there. If it is not, this might indicate identification issues, biases, etc…
  • Check model sensitivity to assumptions: Simulate violations of assumptions (e.g., simulate a little bit of epistasis and run it through an ACE model) and note the effects on parameter estimates. This allows us to better understand how a model’s conclusions are affected by its assumptions.
  • Estimate power & sampling distributions: Run GeneEvolve multiple times given the same parameter values and create sampling distributions of whatever parameters you are interested in. This is a more general method of finding sampling distributions than alternative approaches such as bootstrapping because it is not dependent on the correctness of our assumptions (e.g., we can accurately characterize sampling distributions given violations of assumptions).
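The bias check and sampling-distribution ideas above can be illustrated with a small stand-alone sketch. It is written in Python rather than R, and it swaps in Falconer's formula, h2 = 2(rMZ - rDZ), as a stand-in for a fitted ACE model; it is not how GeneEvolve or Mx estimates parameters. All function names and default values are hypothetical.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_twin_pairs(n_pairs, a2, c2, shared_a, rng):
    """Simulate unit-variance phenotypes for n_pairs of twins.

    a2 and c2 are the additive and common-environment variance shares;
    shared_a is the fraction of additive variance the twins share
    (1.0 for MZ pairs, 0.5 for DZ pairs).
    """
    e2 = 1.0 - a2 - c2
    twin1, twin2 = [], []
    for _ in range(n_pairs):
        a_shared = rng.gauss(0, 1)
        c = rng.gauss(0, 1)
        pair = []
        for _ in range(2):
            a = (math.sqrt(shared_a) * a_shared
                 + math.sqrt(1.0 - shared_a) * rng.gauss(0, 1))
            pair.append(math.sqrt(a2) * a + math.sqrt(c2) * c
                        + math.sqrt(e2) * rng.gauss(0, 1))
        twin1.append(pair[0])
        twin2.append(pair[1])
    return twin1, twin2

def falconer_replicates(n_reps=200, n_pairs=500, a2=0.5, c2=0.2, seed=1):
    """Repeat the simulation and collect one h2 estimate per replicate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        r_mz = pearson(*simulate_twin_pairs(n_pairs, a2, c2, 1.0, rng))
        r_dz = pearson(*simulate_twin_pairs(n_pairs, a2, c2, 0.5, rng))
        estimates.append(2.0 * (r_mz - r_dz))
    return estimates
```

The mean of the returned list can be compared with the `a2` that was fed in (a bias check), and its spread gives an empirical sampling distribution. Simulating a violated assumption, for instance by adding a term the estimator ignores, and rerunning would show how far the estimates move.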

Predictor of population genetics & evolutionary genetics:

  • Find changes in variance parameters and covariances between relatives across time: GeneEvolve can help us predict these statistics, which are otherwise analytically intractable in complex situations (see the “GeneEvolveResults.pdf” graphs produced at the end).
  • Simulate the effects of various population genetics issues: If you have some proficiency in R, it is simple to have GeneEvolve calculate several additional statistics of interest in population genetics. For example, you could write a bit of code at the end of the script that graphs the allele frequencies and changes in genetic variance as a function of numbers of generations and population size. In the future, I may incorporate some of these types of things into the main script.
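As a rough illustration of the kind of add-on statistic mentioned above, the following Python sketch tracks a single allele's frequency across generations under pure random drift (a basic Wright-Fisher model). It is independent of the GeneEvolve script; the function name and parameters are assumptions chosen for the example.

```python
import random

def allele_frequency_trajectory(pop_size, n_generations,
                                start_freq=0.5, seed=1):
    """Allele frequency per generation in a diploid population under drift.

    Each generation, all 2 * pop_size allele copies are drawn at random
    from the previous generation's frequency (no selection or mutation).
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    freq = start_freq
    freqs = [freq]
    for _ in range(n_generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        freqs.append(freq)
    return freqs
```

Running it for several values of `pop_size` and plotting the returned lists against generation number would give the sort of graph described in the bullet; smaller populations would be expected to wander further from the starting frequency.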


How to cite it:

GeneEvolve will be written up in a paper next Winter (2008), so if you can wait to cite it, do so. If you need to cite it now, I presented it at the 2007 BGA conference (note the name change):

Keller, M. C. (2007). PedEvolve: A simulator of genetically informative data implemented in R. Paper presented at the 2007 Annual Meeting of the Behavior Genetics Association, Amsterdam, NL.


Update History

GeneEvolve42: April 26, 2007. Corrected mistake in DZ & MZ relatedness that was caused by incorrect calculation of E in twins.

GeneEvolve47: May 27, 2007. Simulates repeated-measures data.

GeneEvolve65: Oct 14, 2007. Name changed. Redundant parts of scripts turned into functions. Added AxSex, AxS, and AxU interactions and A,S, A,U correlations.

GeneEvolve73: Aug 8, 2008. Runs faster. Allows binary phenotypes.


Known Problems

  • Outdated versions of R: GeneEvolve was written on R version 2.4.1 (2006). It is known to be incompatible with R version 2.1 (2005), and may not work on other earlier versions. If you have an outdated version of R and are having problems running the script, please install the newest version of R.
  • Population sizes and/or number of genes too large: R stores all objects created during a session in RAM rather than on the hard drive. This makes R fast but is problematic for scripts that take up a lot of memory. Thus, you need to be aware of your computer’s RAM. If you receive an error saying something like “cannot allocate vector of size 1003 bytes”, this indicates that your system has run out of RAM. Simply make the population size smaller, specify fewer genes, and/or set the option save.objects to “min” or “none” (save.objects should be set to one of these values anyway, unless there is some reason you want to look at all the internal objects created during a session; e.g., doing so is useful for debugging). Note that newer versions of GeneEvolve run into fewer problems like this. A system with 1GB RAM should easily be able to handle population sizes of ~50,000 and 10 genes or so.
  • Operating platform issues: This script is intended to run on UNIX, Windows, and Mac systems. However, problems related to the computer’s operating system may arise. One common (and easily avoided) issue stems from attempting to run the GeneEvolve and/or Mx scripts on UNIX when the scripts have been saved in DOS format. To get around this, simply type, e.g., “dos2unix GeneEvolve65.R” in a UNIX command window.
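If dos2unix is not installed, the same line-ending fix can be approximated with a few lines of Python. This is a minimal sketch that only replaces DOS line endings (CRLF) with UNIX ones (LF); the helper names are hypothetical, and the sketch overwrites the file in place, so keep a backup copy.

```python
from pathlib import Path

def dos_to_unix(data: bytes) -> bytes:
    """Replace DOS line endings (CRLF) with UNIX line endings (LF)."""
    return data.replace(b"\r\n", b"\n")

def convert_file(path):
    """Rewrite the file at `path` in place with UNIX line endings."""
    p = Path(path)
    p.write_bytes(dos_to_unix(p.read_bytes()))
```

For example, `convert_file("GeneEvolve65.R")` would convert the script in the current directory.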

