Methods to infer population history

| Software | Class of method | Purpose | Specifics | Issues and warnings | Link | Reference |
|---|---|---|---|---|---|---|
| Dsuite | ABBA-BABA | Identifying past events of admixture between populations | Fast, handles VCF format. Suited for low-depth sequencing (handles genotype uncertainty). Provides a set of summary statistics that are useful to investigate complex admixture events. | Requires an outgroup sequence. The method cannot estimate the direction of gene flow. | https://github.com/millanek/Dsuite | (Malinsky et al., 2020) |
| RENT+ | Ancestral Recombination Graphs/coalescence | Retracing the whole process of recombination and coalescence along a genome | Faster than the first version of ARGWeaver. | Requires phased haplotypes. Specific input format. No built-in functions to extract information from genealogies. | https://github.com/SajadMirzaei/RentPlus | (Mirzaei and Wu, 2017) |
| TREEMIX | Clustering and characterizing admixture | Admixture graph, infers most likely admixture events in a tree | Based on allele frequencies and can be used for pooled data. | Requires multiple runs to properly assess the likelihood of each model. | https://bitbucket.org/nygcresearch/treemix/src | (Pickrell and Pritchard, 2012) |
| G-PhoCS | Coalescence/Bayesian | Estimating population divergence and migration parameters using a coalescent framework | Bayesian + MCMC, handles ancient samples. | Parameters are scaled by mutation rate. No admixture. | http://compgen.cshl.edu/GPhoCS/ | (Gronau et al., 2011) |
| IMa3 | Coalescence/Bayesian | Inferring parameters from an isolation with migration (IM) model | Fully Bayesian approach, can perform joint estimates of parameters in L-mode and test for nested models. Can estimate phylogenetic relationships and migration rates. | IM model is the only one available. Discrete admixture cannot be tested. Can only use subsets of whole-genome resequencing data. Recent splits lead to overestimated migration rates. | https://github.com/jodyhey/IMa3 | (Hey and Nielsen, 2007) |
| ABLE | Coalescence/Composite Likelihood | Model comparison and parameter estimation | Uses both allele frequency spectrum and linkage disequilibrium within blocks of a pre-specified size. | Relies on ms syntax. Determining the most informative size for blocks requires performing pilot runs. | https://github.com/champost/ABLE | (Beeravolu et al., 2016) |
| Stairway2 | Coalescence/Composite Likelihood | Inferring change in Ne with time | User-friendly. Fast. Suitable for pools or low-depth sequencing. | Cannot handle migration or population splits. | https://github.com/xiaoming-liu/stairway-plot-v2 | (Liu and Fu, 2020) |
| fastsimcoal2 | Coalescence/Likelihood | Model comparison and parameter estimation | Performs coalescent simulations, parameter estimation and model testing using a fast likelihood method. Can handle arbitrarily complex scenarios for any type of marker. | The maximum-likelihood method only uses the allele frequency spectrum. Several runs (20-100) are needed to explore the likelihood space. | http://cmpg.unibe.ch/software/fastsimcoal2/ | (Excoffier et al., 2013) |
| ∂a∂i | Diffusion approximation of the AFS | Model comparison and parameter estimation | Run time does not depend on the number of SNPs included, does not require coalescent simulations, handles arbitrarily complex scenarios. Fast estimation of confidence intervals around parameter estimates (Godambe method). Suitable for pools/low-depth sequencing. | Requires some knowledge of Python. Limited to 3 populations. Several runs (20-100) are needed to explore the likelihood space. | https://bitbucket.org/gutenkunstlab/dadi | (Gutenkunst et al., 2009) |
| moments | Diffusion approximation of the AFS | Model comparison and parameter estimation | Based on Python, syntax similar to ∂a∂i. Can handle selection. Can use VCF files as input. | Requires some knowledge of Python. Limited to 5 populations. Several runs (20-100) are needed to explore the likelihood space. | https://bitbucket.org/simongravel/moments/src/master/ | (Jouganous et al., 2017) |
| momi2 | Diffusion approximation of the AFS | Model comparison and parameter estimation | Can scale to ten populations. Can simulate and read data in the VCF format. Detailed tutorials available. | Does not handle continuous gene flow. | https://github.com/popgenmethods/momi2 | (Kamm et al., 2020) |
| KIMTree | Diffusion approximation/Bayesian | Estimating divergence time between populations and testing for topologies. Estimates divergence times and past effective sex-ratio along branches of a population tree. | Fast and user-friendly. R scripts to obtain plots are available. Suitable for pools/low-depth sequencing. The method is conditional on a prior topology provided by the user. It computes DIC for a given topology, so that candidate topologies can be compared. | Strong selection on the sex chromosome can produce male-biased sex-ratios. Times are given on a diffusion time scale, and can be converted to demographic times using independent estimates of Ne. | http://www1.montpellier.inra.fr/CBGP/software/kimtree/download.html | (Clemente et al., 2018) |
| GADMA | Genetic algorithm | Model comparison and parameter estimation | Based on moments and ∂a∂i. Automates the search for the best set of models explaining a given frequency spectrum. | Limited to three populations at the moment. | https://github.com/ctlab/GADMA | (Noskova et al., 2020) |
| DoRIS | Identity by Descent (IBD) tract | Testing various demographic scenarios | Uses variation in IBD tract length to test for various demographic models. | IBD must be inferred first with, e.g., BEAGLE. Handles a limited set of demographic scenarios. Modifying the code is required for more complex scenarios. | https://github.com/pierpal/DoRIS | (Palamara and Pe’er, 2013) |
| Unnamed | Identity by state (IBS) tract | Predicting observed patterns of identity by state along a genome by fitting an appropriate, arbitrarily complex demographic model | Allows bootstrapping and estimating confidence over parameter estimates with ms. | Specific input format (similar to MSMC or ARGWeaver). | https://github.com/kelleyharris/Inferring-demography-from-IBS | (Harris and Nielsen, 2013) |
| ASTRAL-2 | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS). | More reliable under high incomplete lineage sorting than SVDQuartets and NJst (Chou et al. 2015). | https://github.com/smirarab/ASTRAL | (Mirarab and Warnow, 2015) |
| BEAST2 | Phylogeny | Network reconstruction and phylogenetic relationships | User-friendly. Can be used to track changes in effective population sizes (Bayesian Skyline Plots). Possible to estimate divergence times. | Slow for large datasets. Requires sequence data that can be produced by, e.g., Stacks for RAD-seq data. | http://beast2.org/ | (Drummond and Rambaut, 2007; Bouckaert et al., 2014) |
| IQ-Tree 2 | Phylogeny | Divergence time estimation and phylogenetic relationships | User-friendly, can be run locally or on a webserver, very detailed tutorials. Fast and accurate. | Still no tutorial for analyzing big data (last checked December 2020). | http://www.iqtree.org/ | (Minh et al., 2020) |
| MCMCTree and MCMCTreeR | Phylogeny | Divergence time estimation and phylogenetic relationships | Included in PAML. An R program helps with choosing relevant priors and interpreting results (https://github.com/PuttickMacroevolution/MCMCtreeR). | Bayesian, sensitive to priors. Requires a resolved phylogeny and an alignment. Slow for large datasets. Not suited for recent divergence and high gene flow. | http://abacus.gene.ucl.ac.uk/software/paml.html | (Yang, 2007; Puttick, 2019) |
| NJst | Phylogeny | Builds species trees using short non-recombining sequences | Available in the R package phybase. Estimates populations/species tree from gene trees. | Requires splitting part of the genome into non-recombining "loci". | https://github.com/bomeara/phybase/ | (Liu and Yu, 2010, 2011) |
| PHRAPL | Phylogeny | Admixture graph, reticulated evolution | Uses trees in the NEWICK format as an input to infer topology, migration rates, divergence times. Similar to ABC in spirit, using tree topology as a summary statistic. | Cannot handle more than 16 taxa at a time, and requires subsetting larger datasets. | http://www.phrapl.org/ | (Jackson et al., 2017) |
| PhyML | Phylogeny | Phylogenetic relationships | Maximum Likelihood inference of phylogenetic relationships. An online version is available. | Should be used on species complexes or divergent populations with little migration. Can be run on genomic windows to detect introgression (with e.g. TWISST, Dsuite). | http://www.atgc-montpellier.fr/phyml/binaries.php | (Guindon et al., 2010) |
| RAxML | Phylogeny | Network reconstruction and phylogenetic relationships | Maximum Likelihood inference of phylogenetic relationships. | Should be used on species complexes or divergent populations with little migration. | http://sco.h-its.org/exelixis/web/software/raxml/index.html | (Stamatakis, 2014) |
| SNAPP | Phylogeny | Phylogenetic relationships | Handles SNP data. | Remains slow for medium to large datasets (>1,000 SNPs). | http://beast2.org/snapp/ | (Bryant et al., 2012) |
| SNPhylo | Phylogeny | Network reconstruction and phylogenetic relationships | Complete pipeline from SNP filtering to tree reconstruction. | Should be used on species complexes or divergent populations with little migration. | http://chibba.pgml.uga.edu/snphylo/ | (Lee et al., 2014) |
| SVDQuartets | Phylogeny | Phylogenetic relationships | Estimates populations/species tree from gene trees. | Remains slow for large datasets. Requires PAUP*. | https://www.asc.ohio-state.edu/kubatko.2/software/SVDquartets/ | (Chifman and Kubatko, 2014) |
| SVDQuest | Phylogeny | Phylogenetic relationships | Estimates populations/species tree from gene trees. | Faster than SVDQuartets. | https://github.com/pranjalv123/SVDquest | (Vachaspati and Warnow, 2018) |
| *BEAST | Phylogeny and species tree inference | Divergence time estimation and phylogenetic relationships | Outputs a species tree instead of a concatenated gene tree. Allows for testing consistency between phylogenetic signals at different loci. | Slow for large datasets. Requires sequence data. Not suited for situations where gene flow/admixture is important. | http://beast2.org/ | (Heled and Drummond, 2010) |
| Splitstree | Phylogeny/Network | Network reconstruction and phylogenetic relationships | User-friendly interface, offers a variety of methods for network reconstruction. | Mostly descriptive. | http://www.splitstree.org/ | (Huson and Bryant, 2006) |
| diCal2 | Sequentially Markovian coalescent | Testing any arbitrary demographic scenario | Works with smaller, more fragmented datasets than PSMC. Handles more complex demographic models than MSMC (including admixture). | Requires phased whole genome data and a model to be defined. | https://sourceforge.net/projects/dical2/ | (Sheehan et al., 2013) |
| MSMC and MSMC-IM | Sequentially Markovian coalescent | Inferring change in Ne and migration rates with time between two populations | Tracks population size changes through time without an a priori model. Allows estimating variation in cross-coalescence rate between two populations. | Limited to the study of 8 diploid individuals from 2 populations at once. Requires whole genome phased data and masking regions with insufficient sequencing depth. | https://github.com/stschiff/msmc and https://github.com/wangke16/MSMC-IM | (Schiffels and Durbin, 2014) |
| PSMC | Sequentially Markovian coalescent | Inferring change in effective population sizes (Ne) with time using a single diploid genome | Tracks population size changes through time without an a priori model. | Limited to one population and one diploid individual. Better used within MSMC. Requires phased whole genome data and masking regions with insufficient sequencing depth. | https://github.com/lh3/psmc | (Li and Durbin, 2011) |
| SMC++ | Sequentially Markovian coalescent | Inferring change in Ne with time and splitting time between two populations | Can analyze hundreds of individuals at a time and does not require phasing. | Requires masking regions as in MSMC. The ancestral allele is assumed to be the reference allele by default. Assumes a clean split for population divergence. Future versions should allow gene flow inference. | https://github.com/popgenmethods/smcpp | (Terhorst et al., 2016) |
| TWISST | Topology weighting | Chromosome painting, clustering and branching between populations | Retrieves the most likely coalescence pattern between several taxa along the genome. Can be seen as an extension of the ABBA/BABA test. | Needs a priori grouping of individuals into taxa. Requires at least 4 taxa. Impractical for more than 6 taxa. Windows must include enough SNPs to retrieve the correct topology, at the risk of including regions with different histories. | https://github.com/simonhmartin/twisst | (Martin and Van Belleghem, 2016) |
| BAYPASS/Bayenv | Variance/covariance matrix | Building a population covariance matrix across population allele frequencies, similar to TREEMIX | Can handle pooled data. | Matrices are mostly designed to provide a neutral model for assessing selection, but can be used to infer population structure. | http://www1.montpellier.inra.fr/CBGP/software/baypass/ ; https://bitbucket.org/tguenther/bayenv2_public/src | (Günther and Coop, 2013; Gautier, 2015) |
| ETEToolkit | Phylogeny and species tree inference | Phylogenetic relationships | Well-documented suite of Python commands to perform phylogenetic analyses. | Species trees can be biased by substantial gene flow or admixture (general issue, not specific to ETEToolkit). | http://etetoolkit.org/ | (Huerta-Cepas et al., 2016) |
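The ABBA-BABA class of methods listed for Dsuite (and extended by TWISST) rests on Patterson's D statistic. The sketch below illustrates the general frequency-based form of that statistic for a four-taxon tree (((P1, P2), P3), Outgroup). It is not Dsuite's code: the function name `patterson_d` and the toy frequencies are hypothetical, the inputs are assumed to be derived-allele frequencies at biallelic sites, and significance testing is omitted.

```python
from typing import Sequence


def patterson_d(p1: Sequence[float], p2: Sequence[float],
                p3: Sequence[float], p_out: Sequence[float]) -> float:
    """Patterson's D for the topology (((P1, P2), P3), Outgroup).

    Each argument lists per-site derived-allele frequencies (0 to 1)
    for one population, in the same site order.
    """
    if not (len(p1) == len(p2) == len(p3) == len(p_out)):
        raise ValueError("all populations must cover the same sites")

    abba = 0.0
    baba = 0.0
    for f1, f2, f3, fo in zip(p1, p2, p3, p_out):
        # Expected ABBA pattern: P2 and P3 share the derived allele.
        abba += (1 - f1) * f2 * f3 * (1 - fo)
        # Expected BABA pattern: P1 and P3 share the derived allele.
        baba += f1 * (1 - f2) * f3 * (1 - fo)

    if abba + baba == 0:
        raise ValueError("no informative sites (ABBA + BABA = 0)")
    return (abba - baba) / (abba + baba)


if __name__ == "__main__":
    # Toy frequencies for three sites (invented for illustration only).
    d = patterson_d(p1=[0.0, 0.0, 1.0], p2=[1.0, 1.0, 0.0],
                    p3=[1.0, 1.0, 1.0], p_out=[0.0, 0.0, 0.0])
    print(f"D = {d:.3f}")  # prints D = 0.333
```

A positive D indicates an excess of derived alleles shared between P2 and P3, a negative D an excess shared between P1 and P3. In line with the warning in the table, the sign alone does not reveal the direction of gene flow.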

References

Beeravolu, C. R., Hickerson, M. J., Frantz, L. A. F., & Lohse, K. (2016). Approximate Likelihood Inference of Complex Population Histories and Recombination from Multiple Genomes. bioRxiv, 1–31. doi: 10.1101/077958

Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C. H., Xie, D., … Drummond, A. J. (2014). BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Computational Biology, 10(4), 1–6. doi: 10.1371/journal.pcbi.1003537

Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A., & Roychoudhury, A. (2012). Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution, 29(8), 1917–1932. doi: 10.1093/molbev/mss086

Chifman, J., & Kubatko, L. (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics, 30(23), 3317–3324. doi: 10.1093/bioinformatics/btu530

Clemente, F., Gautier, M., & Vitalis, R. (2018). Inferring sex-specific demographic history from SNP data. PLoS Genetics, 14(1), 1–32. doi: 10.1371/journal.pgen.1007191

Drummond, A. J., & Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7, 214. doi: 10.1186/1471-2148-7-214

Excoffier, L., Dupanloup, I., Huerta-Sanchez, E., Sousa, V. C., & Foll, M. (2013). Robust Demographic Inference from Genomic and SNP Data. PLoS Genetics, 9(10). doi: 10.1371/journal.pgen.1003905

Gautier, M. (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics, 201(September), 1555–1579. doi: 10.1534/genetics.115.181453

Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G., & Siepel, A. (2011). Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10), 1031–1034. doi: 10.1038/ng.937

Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology, 59(3), 307–321. doi: 10.1093/sysbio/syq010

Günther, T., & Coop, G. (2013). Robust identification of local adaptation from allele frequencies. Genetics, 195(1), 205–220. doi: 10.1534/genetics.113.152462

Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., & Bustamante, C. D. (2009). Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics, 5(10). doi: 10.1371/journal.pgen.1000695

Harris, K., & Nielsen, R. (2013). Inferring Demographic History from a Spectrum of Shared Haplotype Lengths. PLoS Genetics, 9(6). doi: 10.1371/journal.pgen.1003521

Heled, J., & Drummond, A. J. (2010). Bayesian Inference of Species Trees from Multilocus Data. Molecular Biology and Evolution, 27(3), 570–580. doi: 10.1093/molbev/msp274

Hey, J., & Nielsen, R. (2007). Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences of the United States of America, 104(8), 2785–2790. doi: 10.1073/pnas.0611164104

Huson, D. H., & Bryant, D. (2006). Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution, 23(2), 254–267. doi: 10.1093/molbev/msj030

Jackson, N. D., Morales, A. E., Carstens, B. C., & O’Meara, B. C. (2017). PHRAPL: Phylogeographic Inference Using Approximate Likelihoods. Systematic Biology, 66(6), 1045–1053. doi: 10.1093/sysbio/syx001

Jouganous, J., Long, W., Ragsdale, A. P., & Gravel, S. (2017). Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation. Genetics, 206(3), 1549–1567. doi: 10.1534/genetics.117.200493

Kamm, J., Terhorst, J., Durbin, R., & Song, Y. S. (2020). Efficiently Inferring the Demographic History of Many Populations With Allele Count Data. Journal of the American Statistical Association, 115(531), 1472–1487. doi: 10.1080/01621459.2019.1635482

Lee, T.-H., Guo, H., Wang, X., Kim, C., & Paterson, A. H. (2014). SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics, 15(1), 162. doi: 10.1186/1471-2164-15-162

Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357), 493–496. doi: 10.1038/nature10231

Liu, L., & Yu, L. (2010). Phybase: An R package for species tree analysis. Bioinformatics, 26(7), 962–963. doi: 10.1093/bioinformatics/btq062

Liu, L., & Yu, L. (2011). Estimating species trees from unrooted gene trees. Systematic Biology, 60(5), 661–667. doi: 10.1093/sysbio/syr027

Liu, X., & Fu, Y. X. (2020). Stairway Plot 2: demographic history inference with folded SNP frequency spectra. Genome Biology, 21(1), 1–9. doi: 10.1186/s13059-020-02196-9

Malinsky, M., Matschiner, M., & Svardal, H. (2020). Dsuite – Fast D-statistics and related admixture evidence from VCF files. Molecular Ecology Resources. doi: 10.1111/1755-0998.13265

Martin, S. H., & Van Belleghem, S. M. (2016). Exploring evolutionary relationships across the genome using topology weighting. bioRxiv, 069112. doi: 10.1101/069112

Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., Von Haeseler, A., … Teeling, E. (2020). IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37(5), 1530–1534. doi: 10.1093/molbev/msaa015

Mirarab, S., & Warnow, T. (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12), i44–i52. doi: 10.1093/bioinformatics/btv234

Mirzaei, S., & Wu, Y. (2017). RENT+: An improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics, 33(7), 1021–1030. doi: 10.1093/bioinformatics/btw735

Noskova, E., Ulyantsev, V., Koepfli, K. P., O’Brien, S. J., & Dobrynin, P. (2020). GADMA: Genetic algorithm for inferring demographic history of multiple populations from allele frequency spectrum data. GigaScience, 9(3), 1–18. doi: 10.1093/gigascience/giaa005

Palamara, P. F., & Pe’er, I. (2013). Inference of historical migration rates via haplotype sharing. Bioinformatics, 29(13), 180–188. doi: 10.1093/bioinformatics/btt239

Pickrell, J. K., & Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics, 8(11), e1002967. doi: 10.1371/journal.pgen.1002967

Puttick, M. N. (2019). MCMCtreeR: Functions to prepare MCMCtree analyses and visualize posterior ages on trees. Bioinformatics, 35(24), 5321–5322. doi: 10.1093/bioinformatics/btz554

Schiffels, S., & Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics, 46(8), 919–925. doi: 10.1038/ng.3015

Sheehan, S., Harris, K., & Song, Y. S. (2013). Estimating Variable Effective Population Sizes from Multiple Genomes: A Sequentially Markov Conditional Sampling Distribution Approach. Genetics, 194, 647–662. doi: 10.1534/genetics.112.149096

Stamatakis, A. (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312–1313. doi: 10.1093/bioinformatics/btu033

Terhorst, J., Kamm, J. A., & Song, Y. S. (2016). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics, 49(2), 303–309. doi: 10.1038/ng.3748

Vachaspati, P., & Warnow, T. (2018). SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Molecular Phylogenetics and Evolution, 124, 122–136. doi: 10.1016/j.ympev.2018.03.006

Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8), 1586–1591. doi: 10.1093/molbev/msm088