It has become clear in the last few years that analyses of genetic variation must account for the interactions between mutation and recombination rates, introgression, demography, selective sweeps and background selection (Li et al., 2012; Andrew et al., 2013; Ravinet et al., 2017). This complexity might nevertheless be tackled through simulation-based approaches. Two non-exclusive methodologies are promising: Approximate Bayesian Computation (ABC) and supervised machine learning (including deep learning). Both rely on pseudo-observed data simulated under a range of parameter values, and use these simulations to identify the combination of parameter values most likely to have generated the observed data. Both are flexible and powerful, and have become increasingly popular in population genomics (Csilléry et al., 2010; Schrider and Kern, 2018). These methods can handle any type of marker and arbitrarily complex models. By measuring the distance between carefully chosen summary statistics computed on each simulation and those computed on the observed dataset, it is possible to infer which combination of selective and demographic parameters best explains the data. These methods also provide a way to estimate the rate of false positives, for example by counting how often simulations of neutral sequences are classified as selected.
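The ABC logic described above (simulate under parameters drawn from a prior, summarize, keep the simulations closest to the observed data) can be illustrated with a minimal rejection sampler. This is only a sketch under strong assumptions, not the procedure of any package listed below: the model is a single panmictic population of constant size, the only parameter is the scaled mutation rate `theta`, the only summary statistic is the mean number of segregating sites across loci, and all function names, the prior bounds and the tolerance are hypothetical choices made for illustration.

```python
import random

def tree_length(n, rng):
    """Total branch length of a neutral coalescent tree for n samples
    (time in units of 2N generations; standard constant-size model)."""
    return sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))

def poisson(lam, rng):
    """Poisson draw obtained by counting exponential arrivals in [0, 1]."""
    count, t = 0, rng.expovariate(lam)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count

def simulate_mean_s(theta, n, n_loci, rng):
    """Summary statistic: mean number of segregating sites over independent loci."""
    total = sum(poisson(theta * tree_length(n, rng) / 2, rng) for _ in range(n_loci))
    return total / n_loci

def abc_rejection(observed, n, n_loci, n_sims, tol, prior, rng):
    """Keep the fraction `tol` of simulations whose summary is closest to `observed`."""
    sims = []
    for _ in range(n_sims):
        theta = rng.uniform(*prior)                  # draw from a uniform prior (assumption)
        stat = simulate_mean_s(theta, n, n_loci, rng)
        sims.append((abs(stat - observed), theta))   # distance on one summary statistic
    sims.sort()
    keep = max(1, int(tol * n_sims))
    return [theta for _, theta in sims[:keep]]

rng = random.Random(1)
true_theta = 5.0  # value used to generate the pseudo-observed dataset
observed = simulate_mean_s(true_theta, n=10, n_loci=20, rng=rng)
posterior = abc_rejection(observed, n=10, n_loci=20, n_sims=2000,
                          tol=0.05, prior=(0.1, 20.0), rng=rng)
posterior_mean = sum(posterior) / len(posterior)
print(f"observed mean S = {observed:.2f}, posterior mean theta = {posterior_mean:.2f}")
```

In a real analysis the toy simulator would be replaced by a dedicated coalescent or forward-in-time simulator, several summary statistics would be combined, and the accepted values would usually be adjusted by regression or analysed with random forests rather than simply averaged.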
Software | Class of method | Purpose | Specifics | Issues and warnings | Link | Reference |
---|---|---|---|---|---|---|
abc/abcrf | ABC | Performs all steps for model checking and parameter estimation in ABC analyses. abcrf adds random forest methods (a type of supervised machine learning) | Informative vignette, allows graphical representation, complete and robust | Does not perform coalescent simulations (but can be used in combination with coala) | https://cran.r-project.org/web/packages/abc/index.html https://cran.r-project.org/web/packages/abcrf/index.html | (Csilléry et al., 2012; Raynal et al., 2019) |
ABCToolbox | ABC | Complete ABC analysis, from simulations to model checking and parameter estimation | Modular, facilitates the computation of summary statistics | NA | https://bitbucket.org/wegmannlab/abctoolbox/wiki/Home | (Wegmann et al., 2010) |
DIYABC | ABC | Complete ABC analysis, from simulations to model checking and parameter estimation | User-friendly. Many ways to check goodness-of-fit. Good introduction to ABC models. | Does not model continuous gene flow. | http://www1.montpellier.inra.fr/CBGP/diyabc/ | (Cornuet et al., 2008) |
PopSizeABC | ABC | Inferring changes in Ne using whole-genome data | Intended to better assess recent events. Uses a set of summary statistics for the AFS and LD between markers. Handles multiple individuals | Approximate Bayesian approaches do not retrieve all the information in the data | https://forge-dga.jouy.inra.fr/projects/popsizeabc/ | (Boistard et al., 2016) |
coala | ABC/coalescent simulations | Combining coalescent simulators within a single framework | Facilitates the building of scenarios and computes summary statistics for simulations. Can be easily combined with the abc or abc-rf packages in R. | Includes so far ms, msms and scrm | https://cran.r-project.org/web/packages/coala/index.html | (Staab and Metzler, 2016) |
FacSexCoalescent | coalescent simulations | Simulate demographic scenarios for asexual/facultatively sexual species | Can handle varying levels of sexual reproduction, inbreeding, selfing and cloning. | Does not handle population size changes nor selection yet. | https://github.com/MattHartfield/FacSexCoalescent https://github.com/MattHartfield/FacSexCoalescentR | (Hartfield et al., 2016) |
fastsimcoal2 | coalescent simulations | Building any arbitrary scenario using a coalescent framework | Any arbitrary scenario can be implemented. Handles SNP, microsatellites and sequence data. | Does not handle selection. Slower than ms with no recombination, much faster with recombination (see manual) | http://cmpg.unibe.ch/software/fastsimcoal2/ | (Excoffier and Foll, 2011) |
ms, msms, msABC | coalescent simulations | Building any arbitrary scenario using a coalescent framework | Any arbitrary scenario can be implemented. Handles SNP, microsatellites and sequence data. msms can include selection in the model. | Syntax can be difficult to handle for new users compared to, e.g., fastsimcoal2 (but see coala) | http://www.bio.lmu.de/~pavlidis/home/?Software:msABC | (Hudson, 2002; Ewing and Hermisson, 2010; Pavlidis et al., 2010) |
msms | coalescent simulations | Simulate demographic scenarios including selection | Flexible, syntax similar to ms, handles arbitrarily complex models. Can be used in an ABC framework to include selection as a parameter to be estimated | Syntax can be difficult to handle for new users (but see coala) | http://www.mabs.at/ewing/msms/index.shtml | (Ewing and Hermisson, 2010) |
scrm | coalescent simulations | Fast simulation of chromosome-scale sequences | Syntax similar to ms, handles any arbitrary scenario | Does not handle gene conversion or a fixed number of segregating sites (unlike ms) | https://scrm.github.io/ | (Staab et al., 2015) |
SPLATCHE3 | coalescent simulations | Simulating demographic scenarios in their spatial context | Coalescent simulator for genetic data, forward-in-time for demography in space. Spatially explicit. | Simulations can be slow (>1 hour) for large datasets (>100,000 SNPs) over more than 1,000 generations. Does not incorporate selection | http://www.splatche.com/splatche3 | (Currat et al., 2019) |
discoal | coalescent simulations | Simulate selective sweeps under arbitrary demographic scenarios | Relatively fast for short genomic fragments. Designed to simulate "hard" and "soft" sweeps. | Mostly used with diploS/HIC. Other simulators such as msms may be more suited for some scenarios. | https://github.com/kr-colab/discoal | (Kern and Schrider, 2016) |
QuantiNemo2 | Forward-in-time simulations | Simulating demographic and selection scenarios in their spatial context | Comprehensive simulator. Designed for the study of selection in a spatially explicit context. Simulates quantitative traits, fitness landscapes and underlying genetic variation with migration. Includes both population-based and individual-based simulations | Can be slow for large/complex models | https://www2.unil.ch/popgen/softwares/quantinemo/ | (Neuenschwander et al., 2019) |
SLiM3 | Forward-in-time simulations | Simulating genomic sequences with intrinsic and extrinsic factors | One of the most comprehensive simulators. Can simulate genetic data in their spatio-temporal context, the effects of selection at linked sites, coding and non-coding variation, inbreeding and selfing. Supports tree-sequence recording for faster simulations. Large community. | Slow for large genomic regions/large populations | https://messerlab.org/slim/ | (Haller and Messer, 2019) |
diploS/HIC | Supervised Machine Learning | Detecting selective sweeps | Classifies genomic windows as neutral, selected, or impacted by selection at linked sites. Also distinguishes between selection on standing and de novo variation. Uses a set of summary statistics describing frequency spectrum and LD, does not require phasing. Good tutorial explaining the pipeline. | Good performance depends on the parameters used to simulate sweeps (window size, selective coefficient, demography). Requires some trial and error for new model species. Interpretation of "soft" and "hard" sweeps remains debated. | https://github.com/kr-colab/diploSHIC | (Schrider and Kern, 2016; Kern and Schrider, 2018) |
evoNet | Supervised Machine Learning | Detecting selective sweeps and balancing selection, and estimating demographic history | Uses deep-learning algorithms to classify genomic regions as selected or neutral, and estimate effective population sizes. Flexible (any number of summary statistics can be provided by the investigator). | Requires summary statistics as an input. Difficult for a naïve user. | https://sourceforge.net/projects/evonet/?source=typ_redirect | (Sheehan and Song, 2016) |
FastEPRR | Supervised Machine Learning | Estimating effective recombination rates | Uses regression to estimate effective recombination rates from SNP alignments. Can use the VCF format. No clear bias due to phasing errors observed. Can incorporate demographic history (using ms command line). | Requires phased data. | https://www.picb.ac.cn/evolgen/softwares/index.html | (Gao et al., 2016) |
FILET | Supervised Machine Learning | Detecting introgression | Uses Extra Trees classifiers and dedicated summary statistics to classify genomic windows as being introgressed or not. Identifies the direction of introgression. | Targets pulses of introgression rather than continuous gene flow, but can detect the latter. Requires phased data in fasta format. | https://github.com/kr-colab/FILET | (Schrider et al., 2018) |
genomatnn | Supervised Machine Learning | Detecting adaptive introgression | Uses convolutional neural networks to identify adaptive introgression. Trained using the tree-sequence records obtained from SLiM3. Can handle VCF files and unphased data. | Strong computational bottleneck with SLiM simulations. | https://github.com/grahamgower/genomatnn | (Gower et al., 2020) |
ImaGene | Supervised Machine Learning | Detecting selective sweeps | Uses convolutional neural networks to classify genomic windows in bins of distinct selection coefficients. Directly uses the image of the alignment, avoiding compression (i.e. using summary statistics) | Can be slow for large datasets. | https://github.com/mfumagalli/ImaGene | (Torada et al., 2019) |
ReLERNN | Supervised Machine Learning | Estimating recombination rates | Uses recurrent neural networks to estimate recombination rates from SNP alignments. Handles unphased and pooled data. Uses msprime (Python implementation of ms) to generate simulations upon which the algorithm is trained. Can incorporate known demographic history provided by the user. | Can be computationally intensive for large effective population sizes. Accuracy on pooled data is modest for low depth of coverage. Absolute estimates of recombination rates depend on the accuracy of the mutation rate used for simulations. | https://github.com/kr-colab/ReLERNN | (Adrion, Galloway, et al., 2020) |
SWiFr | Supervised Machine Learning | Detecting selective sweeps | Uses an averaged one-dependence estimator to classify genomic regions as selected or neutral. Flexible in terms of which summary statistics are used. Can incorporate demographic history. | Requires summary statistics as an input. Only distinguishes between selective sweeps and neutral regions. | https://github.com/ramachandran-lab/SWIFr/blob/master/README.md | (Sugden et al., 2018) |
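
For the supervised classifiers listed in the table, the false-positive check mentioned in the introduction amounts to classifying independent neutral simulations and counting how many are called selected. The sketch below is a deliberately simplified illustration and does not reproduce any of the tools above: the "classifier" is a one-dimensional decision stump on the number of segregating sites, and a sweep is mimicked by shrinking the genealogy by an arbitrary factor (`reduction`), which is a hypothetical shortcut rather than a proper sweep model. All names and parameter values are assumptions.

```python
import random

def tree_length(n, rng):
    """Total branch length of a neutral coalescent tree for n samples."""
    return sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))

def poisson(lam, rng):
    """Poisson draw obtained by counting exponential arrivals in [0, 1]."""
    count, t = 0, rng.expovariate(lam)
    while t <= 1.0:
        count += 1
        t += rng.expovariate(lam)
    return count

def simulate_window(theta, n, sweep, rng, reduction=0.2):
    """Toy genomic window summarized by its number of segregating sites.
    A 'sweep' is crudely mimicked by scaling the tree length (hypothetical)."""
    length = tree_length(n, rng)
    if sweep:
        length *= reduction
    return poisson(theta * length / 2, rng)

def fit_threshold(neutral, swept):
    """Decision stump: choose cutoff c so that 'S <= c means sweep'
    maximizes accuracy on the labelled training simulations."""
    best_c, best_correct = 0, -1
    for c in range(max(neutral) + 1):
        correct = sum(s <= c for s in swept) + sum(s > c for s in neutral)
        if correct > best_correct:
            best_c, best_correct = c, correct
    return best_c

rng = random.Random(7)
theta, n = 10.0, 20
train_neutral = [simulate_window(theta, n, False, rng) for _ in range(1000)]
train_swept = [simulate_window(theta, n, True, rng) for _ in range(1000)]
cutoff = fit_threshold(train_neutral, train_swept)

# Independent neutral simulations: any window called 'sweep' is a false positive.
test_neutral = [simulate_window(theta, n, False, rng) for _ in range(1000)]
false_positive_rate = sum(s <= cutoff for s in test_neutral) / len(test_neutral)
print(f"cutoff S <= {cutoff}, estimated false-positive rate = {false_positive_rate:.3f}")
```

The same check carries over to richer classifiers: the neutral test simulations should be generated under the demographic model assumed for the study species, since a misspecified demography would change the estimated rate.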
References
Adrion, J. R., Galloway, J. G., & Kern, A. D. (2020). Predicting the landscape of recombination using deep learning. Molecular Biology and Evolution, 37(6), 1790–1808. doi: 10.1093/molbev/msaa038
Boistard, S., Rodriguez, W., Jay, F., Mona, S., & Austerlitz, F. (2016). Inferring Population Size History from Large Samples of Genome-Wide Molecular Data – An Approximate Bayesian Computation Approach. PLoS Genetics, 858–865. doi: 10.1371/journal.pgen.1005877
Cornuet, J.-M., Santos, F., Beaumont, M. A., Robert, C. P., Marin, J.-M., Balding, D. J., … Estoup, A. (2008). Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24(23), 2713–2719. doi: 10.1093/bioinformatics/btn514
Csilléry, K., François, O., & Blum, M. G. B. (2012). abc: an R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution, 3(3), 475–479. doi: 10.1111/j.2041-210X.2011.00179.x
Currat, M., Arenas, M., Quilodràn, C. S., Excoffier, L., & Ray, N. (2019). SPLATCHE3: Simulation of serial genetic data under spatially explicit evolutionary scenarios including long-distance dispersal. Bioinformatics, 35(21), 4480–4483. doi: 10.1093/bioinformatics/btz311
Ewing, G., & Hermisson, J. (2010). MSMS: A coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 26(16), 2064–2065. doi: 10.1093/bioinformatics/btq322
Excoffier, L., & Foll, M. (2011). Fastsimcoal: a Continuous-Time Coalescent Simulator of Genomic Diversity Under Arbitrarily Complex Evolutionary Scenarios. Bioinformatics, 27(9), 1332–1334. doi: 10.1093/bioinformatics/btr124
Gao, F., Ming, C., Hu, W., & Li, H. (2016). New software for the fast estimation of population recombination rates (FastEPRR) in the genomic era. G3: Genes, Genomes, Genetics, 6(6), 1563–1571. doi: 10.1534/g3.116.028233
Gower, G., Picazo, P. I., Fumagalli, M., & Racimo, F. (2020). Detecting adaptive introgression in human evolution using convolutional neural networks. BioRxiv, 2020.09.18.301069. Retrieved from https://doi.org/10.1101/2020.09.18.301069
Haller, B. C., & Messer, P. W. (2019). SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Molecular Biology and Evolution, 36(3), 632–637. doi: 10.1093/molbev/msy228
Hartfield, M., Wright, S. I., & Agrawal, A. F. (2016). Coalescent times and patterns of genetic diversity in species with facultative sex: Effects of gene conversion, population structure, and heterogeneity. Genetics, 202(1), 297–312. doi: 10.1534/genetics.115.178004
Hudson, R. R. (2002). Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18(2), 337–338. doi: 10.1093/bioinformatics/18.2.337
Kern, A. D., & Schrider, D. R. (2016). Discoal: flexible coalescent simulations with selection. Bioinformatics, 32(24), 3839–3841.
Kern, A. D., & Schrider, D. R. (2018). diploS/HIC: An Updated Approach to Classifying Selective Sweeps. G3: Genes, Genomes, Genetics, g3.200262.2018. doi: 10.1534/g3.118.200262
Neuenschwander, S., Michaud, F., & Goudet, J. (2019). QuantiNemo 2: A Swiss knife to simulate complex demographic and genetic scenarios, forward and backward in time. Bioinformatics, 35(5), 886–888. doi: 10.1093/bioinformatics/bty737
Pavlidis, P., Laurent, S., & Stephan, W. (2010). MsABC: A modification of Hudson’s ms to facilitate multi-locus ABC analysis. Molecular Ecology Resources, 10(4), 723–727. doi: 10.1111/j.1755-0998.2010.02832.x
Raynal, L., Marin, J. M., Pudlo, P., Ribatet, M., Robert, C. P., & Estoup, A. (2019). ABC random forests for Bayesian parameter inference. Bioinformatics, 35(10), 1720–1728. doi: 10.1093/bioinformatics/bty867
Schrider, D. R., & Kern, A. D. (2016). S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning. PLoS Genetics, 12(3), 1–31. doi: 10.1371/journal.pgen.1005928
Sheehan, S., & Song, Y. S. (2016). Deep Learning for Population Genetic Inference. PLoS Computational Biology, 12(3), 1–28. doi: 10.1371/journal.pcbi.1004845
Staab, P. R., & Metzler, D. (2016). Coala: An R framework for coalescent simulation. Bioinformatics, 32(12), 1903–1904. doi: 10.1093/bioinformatics/btw098
Staab, P. R., Zhu, S., Metzler, D., & Lunter, G. (2015). Scrm: Efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10), 1680–1682. doi: 10.1093/bioinformatics/btu861
Sugden, L. A., Atkinson, E. G., Fischer, A. P., Rong, S., Henn, B. M., & Ramachandran, S. (2018). Localization of adaptive variants in human genomes using averaged one-dependence estimation. Nature Communications, 9(1). doi: 10.1038/s41467-018-03100-7
Torada, L., Lorenzon, L., Beddis, A., Isildak, U., Pattini, L., Mathieson, S., & Fumagalli, M. (2019). ImaGene: A convolutional neural network to quantify natural selection from genomic data. BMC Bioinformatics, 20(Suppl 9), 1–12. doi: 10.1186/s12859-019-2927-x
Wegmann, D., Leuenberger, C., Neuenschwander, S., & Excoffier, L. (2010). ABCtoolbox: a versatile toolkit for approximate Bayesian computations. BMC Bioinformatics, 11, 116. doi: 10.1186/1471-2105-11-116