Methods to infer population history – Methods in population genomics

Software	Class of method	Purpose	Specifics	Issues and warnings	Link	Reference
Dsuite	ABBA-BABA	Identifying past events of admixture between populations	Fast, handles VCF format. Suited for low-depth sequencing (handles uncertainties on genotypes). Provides a set of summary statistics that are useful to investigate complex admixture events	Requires an outgroup sequence. The method cannot estimate the direction of gene flow.	https://github.com/millanek/Dsuite	(Malinsky et al., 2020)
RENT+	Ancestral Recombination Graphs/coalescence	Retracing the whole process of recombination and coalescence along a genome	Faster than the first version of ARGWeaver.	Requires phased haplotypes. Specific input format. No built-in functions to extract information from genealogies.	https://github.com/SajadMirzaei/RentPlus	(Mirzaei and Wu, 2017)
TREEMIX	Clustering and characterizing admixture	Admixture graph, infers most likely admixture events in a tree	Based on allele frequencies and can be used for pooled data.	Requires multiple runs to properly assess the likelihood of each model	https://bitbucket.org/nygcresearch/treemix/src	(Pickrell and Pritchard, 2012)
G-PhoCS	Coalescence/Bayesian	Estimating population divergence and migration parameters using a coalescent framework	Bayesian + MCMC, handles ancient samples	Parameters scaled by mutation rate, no admixture	http://compgen.cshl.edu/GPhoCS/	(Gronau et al., 2011)
IMa3	Coalescence/Bayesian	Inferring parameters from an isolation with migration (IM) model	Fully Bayesian approach, can perform joint estimates of parameters in L-mode and test for nested models. Can estimate phylogenetic relationships and migration rates	IM model is the only one available. Discrete admixture cannot be tested. Can only use subsets of whole-genome resequencing data. Recent splits lead to overestimated migration rates	https://github.com/jodyhey/IMa3	(Hey and Nielsen, 2007)
ABLE	Coalescence/Composite Likelihood	Model comparison and parameter estimation	Uses both allele frequency spectrum and linkage disequilibrium within blocks of a pre-specified size.	Relies on ms syntax. Determining the most informative size for blocks requires performing pilot runs.	https://github.com/champost/ABLE	(Beeravolu et al., 2016)
Stairway2	Coalescence/Composite Likelihood	Inferring change in Ne with time	User-friendly. Fast. Suitable for pools or low-depth sequencing.	Cannot handle migration or population splits.	https://github.com/xiaoming-liu/stairway-plot-v2	(Liu and Fu, 2020)
fastsimcoal2	Coalescence/Likelihood	Model comparison and parameter estimation	Performs coalescent simulations, parameter estimation and model testing using a fast likelihood method. Can handle arbitrarily complex scenarios for any type of marker	The maximum-likelihood method only uses the allele frequency spectrum. Several runs (20-100) are needed to explore the likelihood space.	http://cmpg.unibe.ch/software/fastsimcoal2/	(Excoffier et al., 2013)
∂a∂i	Diffusion approximation of the AFS	Model comparison and parameter estimation	Run time does not depend on the number of SNPs included, does not require coalescent simulations, handles arbitrarily complex scenarios. Fast estimation of confidence intervals around parameter estimates (Godambe method). Suitable for pools/low-depth sequencing	Requires some knowledge of Python. Limited to 3 populations. Several runs (20-100) are needed to explore the likelihood space.	https://bitbucket.org/gutenkunstlab/dadi	(Gutenkunst et al., 2009)
moments	Diffusion approximation of the AFS	Model comparison and parameter estimation	Based on Python, syntax similar to ∂a∂i. Can handle selection. Can use VCF files as input.	Requires some knowledge of Python. Limited to 5 populations. Several runs (20-100) are needed to explore the likelihood space.	https://bitbucket.org/simongravel/moments/src/master/	(Jouganous et al., 2017)
momi2	Diffusion approximation of the AFS	Model comparison and parameter estimation	Can scale to ten populations. Can simulate and read data in the VCF format. Detailed tutorials available	Does not handle continuous gene flow	https://github.com/popgenmethods/momi2	(Kamm et al., 2020)
KIMTree	Diffusion approximation/Bayesian	Estimating divergence times between populations and past effective sex ratios along the branches of a population tree, and testing for topologies.	Fast and user-friendly. R scripts to obtain plots are available. Suitable for pools/low-depth sequencing. The method is conditional on a prior topology provided by the user. It computes the DIC for a given topology, which makes it possible to compare topologies and select the best one.	Strong selection on the sex chromosome can produce male-biased sex ratios. Times are given on a diffusion time scale and can be converted into demographic times using independent estimates of Ne.	http://www1.montpellier.inra.fr/CBGP/software/kimtree/download.html	(Clemente et al., 2018)
GADMA	Genetic algorithm	Model comparison and parameter estimation	Based on moments and ∂a∂i. Automates the search for the best set of models explaining a given frequency spectrum.	Limited to three populations at the moment.	https://github.com/ctlab/GADMA	(Noskova et al., 2020)
DoRIS	Identity by Descent (IBD) tract	Testing various demographic scenarios	Uses variation in IBD tract length to test for various demographic models.	IBD must be inferred first with, e.g., BEAGLE. Handles a limited set of demographic scenarios. More complex scenarios require modifying the code	https://github.com/pierpal/DoRIS	(Palamara and Pe’er, 2013)
Unnamed	Identity by state (IBS) tract	Predicting observed patterns of identity by state along a genome by fitting an appropriate, arbitrarily complex demographic model	Allows bootstrapping and estimating confidence over parameter estimates with ms	Specific input format (similar to MSMC or ARGWeaver)	https://github.com/kelleyharris/Inferring-demography-from-IBS	(Harris and Nielsen, 2013)
ASTRAL-2	Phylogeny	Builds species trees using short non-recombining sequences	Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS)	More reliable under high incomplete lineage sorting than SVDQuartets and NJst (Chou et al. 2015)	https://github.com/smirarab/ASTRAL	(Mirarab and Warnow, 2015)
BEAST2	Phylogeny	Network reconstruction and phylogenetic relationships	User-friendly. Can be used to track changes in effective population sizes (Bayesian Skyline Plots). Possible to estimate divergence times	Slow for large datasets. Requires sequence data that can be produced by, e.g., Stacks for RAD-seq data	http://beast2.org/	(Drummond and Rambaut, 2007; Bouckaert et al., 2014)
IQ-Tree 2	Phylogeny	Divergence time estimation and phylogenetic relationships	User-friendly, can be run locally or on a webserver, very detailed tutorials. Fast and accurate.	Still no tutorial for analyzing big data (last checked December 2020).	http://www.iqtree.org/	(Minh et al., 2020)
MCMCTree and MCMCTreeR	Phylogeny	Divergence time estimation and phylogenetic relationships	Included in PAML. An R package is designed to help choose relevant priors and interpret results: https://github.com/PuttickMacroevolution/MCMCtreeR	Bayesian, sensitive to priors. Requires a resolved phylogeny and an alignment. Slow for large datasets. Not suited for recent divergence and high gene flow.	http://abacus.gene.ucl.ac.uk/software/paml.html	(Yang, 2007; Puttick, 2019)
NJst	Phylogeny	Builds species trees using short non-recombining sequences	Available in the R package phybase. Estimates population/species trees from gene trees	Requires splitting part of the genome into non-recombining "loci".	https://github.com/bomeara/phybase/	(Liu and Yu, 2010, 2011)
PHRAPL	Phylogeny	Admixture graph, reticulated evolution	Uses trees in the NEWICK format as an input to infer topology, migration rates, divergence times. Similar to ABC in spirit, using tree topology as a summary statistic.	Cannot handle more than 16 taxa at a time, and requires subsetting larger datasets	http://www.phrapl.org/	(Jackson et al., 2017)
PhyML	Phylogeny	Phylogenetic relationships	Maximum Likelihood inference of phylogenetic relationships. An online version is available	Should be used on species complexes or divergent populations with little migration. Can be run on genomic windows to detect introgression (with e.g. TWISST, Dsuite)	http://www.atgc-montpellier.fr/phyml/binaries.php	(Guindon et al., 2010)
RAxML	Phylogeny	Network reconstruction and phylogenetic relationships	Maximum Likelihood inference of phylogenetic relationships	Should be used on species complexes or divergent populations with little migration	http://sco.h-its.org/exelixis/web/software/raxml/index.html	(Stamatakis, 2014)
SNAPP	Phylogeny	Phylogenetic relationships	Handles SNP data	Remains slow for medium to large datasets (>1,000 SNPs)	http://beast2.org/snapp/	(Bryant et al., 2012)
SNPhylo	Phylogeny	Network reconstruction and phylogenetic relationships	Complete pipeline from SNP filtering to tree reconstruction	Should be used on species complexes or divergent populations with little migration	http://chibba.pgml.uga.edu/snphylo/	(Lee et al., 2014)
SVDQuartets	Phylogeny	Phylogenetic relationships	Estimates population/species trees from gene trees	Remains slow for large datasets. Requires PAUP*.	https://www.asc.ohio-state.edu/kubatko.2/software/SVDquartets/	(Chifman and Kubatko, 2014)
SVDQuest	Phylogeny	Phylogenetic relationships	Estimates population/species trees from gene trees	Faster than SVDQuartets	https://github.com/pranjalv123/SVDquest	(Vachaspati and Warnow, 2018)
*BEAST	Phylogeny and species tree inference	Divergence time estimation and phylogenetic relationships	Outputs a species tree instead of a concatenated gene tree. Allows for testing consistency between phylogenetic signals at different loci	Slow for large datasets. Requires sequence data. Not suited for situations where gene flow/admixture is important	http://beast2.org/	(Heled and Drummond, 2010)
Splitstree	Phylogeny/Network	Network reconstruction and phylogenetic relationships	User-friendly interface, offers a variety of methods for network reconstruction	Mostly descriptive	http://www.splitstree.org/	(Huson and Bryant, 2006)
diCal2	Sequentially Markovian coalescent	Testing any arbitrary demographic scenario	Works with smaller, more fragmented datasets than PSMC. Handles more complex demographic models than MSMC (including admixture).	Requires phased whole genome data and a model to be defined	https://sourceforge.net/projects/dical2/	(Sheehan et al., 2013)
MSMC and MSMC-IM	Sequentially Markovian coalescent	Inferring change in Ne and migration rates with time between two populations	Tracks population size changes through time without a priori assumptions. Estimates variation in cross-coalescence rate between two populations	Limited to the study of 8 diploid individuals from 2 populations at once. Requires phased whole-genome data and masking regions with insufficient sequencing depth	https://github.com/stschiff/msmc and https://github.com/wangke16/MSMC-IM	(Schiffels and Durbin, 2014)
PSMC	Sequentially Markovian coalescent	Inferring change in effective population sizes (Ne) with time using a single diploid genome	Tracks population size changes through time without a priori assumptions.	Limited to one population and one diploid individual. Better used within MSMC. Requires whole-genome data (phasing is not needed for a single diploid genome) and masking regions with insufficient sequencing depth	https://github.com/lh3/psmc	(Li and Durbin, 2011)
SMC++	Sequentially Markovian coalescent	Inferring change in Ne with time and splitting time between two populations	Can analyze hundreds of individuals at a time and does not require phasing	Masking regions as in MSMC. The ancestral allele is assumed to be the reference allele by default. Assumes a clean split for population divergence. Future versions should allow gene flow inference.	https://github.com/popgenmethods/smcpp	(Terhorst et al., 2016)
TWISST	Topology weighting	Chromosome painting, clustering and branching between populations	Retrieves the most likely coalescence pattern between several taxa along the genome. Can be seen as an extension of the ABBA/BABA test	Needs a priori grouping of individuals into taxa. Requires at least 4 taxa. Impractical for more than 6 taxa. Windows must include enough SNPs to retrieve the correct topology, at the risk of including regions with different histories	https://github.com/simonhmartin/twisst	(Martin and Van Belleghem, 2016)
BAYPASS/Bayenv	Variance/covariance matrix	Building a population covariance matrix across population allele frequencies, similar to TREEMIX	Can handle pooled data	Matrices are mostly designed to provide a neutral model for assessing selection, but can be used to infer population structure	http://www1.montpellier.inra.fr/CBGP/software/baypass/ ; https://bitbucket.org/tguenther/bayenv2_public/src	(Günther and Coop, 2013; Gautier, 2015)
ETEToolkit	Phylogeny and species tree inference	Phylogenetic relationships	Well-documented suite of Python commands to perform phylogenetic analyses	Species trees can be biased by substantial gene flow or admixture (general issue, not specific to ETEToolkit)	http://etetoolkit.org/	(Huerta-Cepas et al., 2016)
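
As a rough illustration of the ABBA-BABA class of methods listed in the table, the sketch below computes Patterson's D for a trio of populations (P1, P2, P3) from per-site derived-allele frequencies. It is not Dsuite's implementation: the function name `patterson_d` and its inputs are hypothetical, sites are assumed to be biallelic, and the outgroup is assumed to be fixed for the ancestral allele at every site. It returns only the point estimate and performs no significance test.

```python
from typing import Sequence


def patterson_d(p1: Sequence[float], p2: Sequence[float], p3: Sequence[float]) -> float:
    """Patterson's D from derived-allele frequencies in P1, P2 and P3.

    Assumption (for illustration only): the outgroup carries the ancestral
    allele at every site, so each frequency is a derived-allele frequency.
    """
    if not (len(p1) == len(p2) == len(p3)):
        raise ValueError("frequency vectors must have the same length")

    abba = 0.0  # expected count of sites where P2 and P3 share the derived allele
    baba = 0.0  # expected count of sites where P1 and P3 share the derived allele
    for f1, f2, f3 in zip(p1, p2, p3):
        abba += (1.0 - f1) * f2 * f3
        baba += f1 * (1.0 - f2) * f3

    total = abba + baba
    if total == 0.0:
        raise ValueError("no informative sites (ABBA + BABA is zero)")
    # D > 0 suggests excess allele sharing between P2 and P3,
    # D < 0 between P1 and P3; neither sign gives the direction of gene flow.
    return (abba - baba) / total


if __name__ == "__main__":
    # Made-up frequencies for two sites, for demonstration only.
    print(patterson_d([0.0, 0.1], [0.5, 0.1], [1.0, 0.8]))
```

In practice, the significance of D is usually assessed across genomic blocks (e.g. with a block jackknife), which this sketch leaves out.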

References

Beeravolu, C. R., Hickerson, M. J., Frantz, L. A. F., & Lohse, K. (2016). Approximate Likelihood Inference of Complex Population Histories and Recombination from Multiple Genomes. bioRxiv, 1–31. doi: 10.1101/077958

Bouckaert, R., Heled, J., Kühnert, D., Vaughan, T., Wu, C. H., Xie, D., … Drummond, A. J. (2014). BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Computational Biology, 10(4), 1–6. doi: 10.1371/journal.pcbi.1003537

Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A., & Roychoudhury, A. (2012). Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Molecular Biology and Evolution, 29(8), 1917–1932. doi: 10.1093/molbev/mss086

Chifman, J., & Kubatko, L. (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics, 30(23), 3317–3324. doi: 10.1093/bioinformatics/btu530

Clemente, F., Gautier, M., & Vitalis, R. (2018). Inferring sex-specific demographic history from SNP data. PLoS Genetics, 14(1), 1–32. doi: 10.1371/journal.pgen.1007191

Drummond, A. J., & Rambaut, A. (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 7, 214. doi: 10.1186/1471-2148-7-214

Excoffier, L., Dupanloup, I., Huerta-Sanchez, E., Sousa, V. C., & Foll, M. (2013). Robust Demographic Inference from Genomic and SNP Data. PLoS Genetics, 9(10). doi: 10.1371/journal.pgen.1003905

Gautier, M. (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics, 201(4), 1555–1579. doi: 10.1534/genetics.115.181453

Gronau, I., Hubisz, M. J., Gulko, B., Danko, C. G., & Siepel, A. (2011). Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10), 1031–1034. doi: 10.1038/ng.937

Guindon, S., Dufayard, J. F., Lefort, V., Anisimova, M., Hordijk, W., & Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology, 59(3), 307–321. doi: 10.1093/sysbio/syq010

Günther, T., & Coop, G. (2013). Robust identification of local adaptation from allele frequencies. Genetics, 195(1), 205–220. doi: 10.1534/genetics.113.152462

Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., & Bustamante, C. D. (2009). Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics, 5(10). doi: 10.1371/journal.pgen.1000695

Harris, K., & Nielsen, R. (2013). Inferring Demographic History from a Spectrum of Shared Haplotype Lengths. PLoS Genetics, 9(6). doi: 10.1371/journal.pgen.1003521

Heled, J., & Drummond, A. J. (2010). Bayesian Inference of Species Trees from Multilocus Data. Molecular Biology and Evolution, 27(3), 570–580. doi: 10.1093/molbev/msp274

Hey, J., & Nielsen, R. (2007). Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences of the United States of America, 104(8), 2785–2790. doi: 10.1073/pnas.0611164104

Huson, D. H., & Bryant, D. (2006). Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution, 23(2), 254–267. doi: 10.1093/molbev/msj030

Jackson, N. D., Morales, A. E., Carstens, B. C., & O’Meara, B. C. (2017). PHRAPL: Phylogeographic Inference Using Approximate Likelihoods. Systematic Biology, 66(6), 1045–1053. doi: 10.1093/sysbio/syx001

Jouganous, J., Long, W., Ragsdale, A. P., & Gravel, S. (2017). Inferring the joint demographic history of multiple populations: Beyond the diffusion approximation. Genetics, 206(3), 1549–1567. doi: 10.1534/genetics.117.200493

Kamm, J., Terhorst, J., Durbin, R., & Song, Y. S. (2020). Efficiently Inferring the Demographic History of Many Populations With Allele Count Data. Journal of the American Statistical Association, 115(531), 1472–1487. doi: 10.1080/01621459.2019.1635482

Lee, T.-H., Guo, H., Wang, X., Kim, C., & Paterson, A. H. (2014). SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics, 15(1), 162. doi: 10.1186/1471-2164-15-162

Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357), 493–496. doi: 10.1038/nature10231

Liu, L., & Yu, L. (2010). Phybase: An R package for species tree analysis. Bioinformatics, 26(7), 962–963. doi: 10.1093/bioinformatics/btq062

Liu, L., & Yu, L. (2011). Estimating species trees from unrooted gene trees. Systematic Biology, 60(5), 661–667. doi: 10.1093/sysbio/syr027

Liu, X., & Fu, Y. X. (2020). Stairway Plot 2: demographic history inference with folded SNP frequency spectra. Genome Biology, 21(1), 1–9. doi: 10.1186/s13059-020-02196-9

Malinsky, M., Matschiner, M., & Svardal, H. (2020). Dsuite – Fast D-statistics and related admixture evidence from VCF files. Molecular Ecology Resources. doi: 10.1111/1755-0998.13265

Martin, S. H., & Van Belleghem, S. M. (2016). Exploring evolutionary relationships across the genome using topology weighting. bioRxiv, 069112. doi: 10.1101/069112

Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A., & Lanfear, R. (2020). IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37(5), 1530–1534. doi: 10.1093/molbev/msaa015

Mirarab, S., & Warnow, T. (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12), i44–i52. doi: 10.1093/bioinformatics/btv234

Mirzaei, S., & Wu, Y. (2017). RENT+: An improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics, 33(7), 1021–1030. doi: 10.1093/bioinformatics/btw735

Noskova, E., Ulyantsev, V., Koepfli, K. P., O’Brien, S. J., & Dobrynin, P. (2020). GADMA: Genetic algorithm for inferring demographic history of multiple populations from allele frequency spectrum data. GigaScience, 9(3), 1–18. doi: 10.1093/gigascience/giaa005

Palamara, P. F., & Pe’er, I. (2013). Inference of historical migration rates via haplotype sharing. Bioinformatics, 29(13), i180–i188. doi: 10.1093/bioinformatics/btt239

Pickrell, J. K., & Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics, 8(11), e1002967. doi: 10.1371/journal.pgen.1002967

Puttick, M. N. (2019). MCMCtreeR: Functions to prepare MCMCtree analyses and visualize posterior ages on trees. Bioinformatics, 35(24), 5321–5322. doi: 10.1093/bioinformatics/btz554

Schiffels, S., & Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics, 46(8), 919–925. doi: 10.1038/ng.3015

Sheehan, S., Harris, K., & Song, Y. S. (2013). Estimating Variable Effective Population Sizes from Multiple Genomes: A Sequentially Markov Conditional Sampling Distribution Approach. Genetics, 194, 647–662. doi: 10.1534/genetics.112.149096

Stamatakis, A. (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312–1313. doi: 10.1093/bioinformatics/btu033

Terhorst, J., Kamm, J. A., & Song, Y. S. (2016). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics, 49(2), 303–309. doi: 10.1038/ng.3748

Vachaspati, P., & Warnow, T. (2018). SVDquest: Improving SVDquartets species tree estimation using exact optimization within a constrained search space. Molecular Phylogenetics and Evolution, 124, 122–136. doi: 10.1016/j.ympev.2018.03.006

Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8), 1586–1591. doi: 10.1093/molbev/msm088