Special section: Approximate Bayesian Computation, Machine learning and simulators

It has become clear in the last few years that the interactions between mutation and recombination rates, introgression, demography, selective sweeps and background selection have to be integrated into analyses of genetic variation (Li et al., 2012; Andrew et al., 2013; Ravinet et al., 2017). This complexity can nevertheless be tackled through simulation-based approaches. Two non-exclusive methodologies are promising: Approximate Bayesian Computation (ABC) and supervised machine learning/deep learning approaches. Both rely on pseudo-observed data simulated under a range of parameter values, and use these simulations to identify the combination of parameter values most likely to have generated the observed data (see Box 3 for a discussion on simulators). Both are flexible and powerful, and have become increasingly popular in population genomics (Csilléry et al., 2010; Schrider and Kern, 2018). These methods can handle any type of marker and arbitrarily complex models. By measuring the distance between carefully chosen summary statistics computed on each simulation and those computed on the observed dataset, it is possible to infer which combination of selective and demographic parameters best explains the data. These methods also provide a way to estimate the rate of false positives, for example by counting how often simulations of neutral sequences are classified as selected.
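The distance-based logic described above can be illustrated with a minimal rejection-ABC sketch in Python. This is a toy illustration, not the procedure implemented by any of the packages listed below: the simulator (draws from a normal distribution whose mean plays the role of the unknown parameter), the uniform prior, the two summary statistics and the function names (`simulate`, `summarize`, `abc_rejection`) are all assumptions made for the example. In a real analysis, `simulate` would call a coalescent or forward-in-time simulator, and the summary statistics would describe, for instance, the allele frequency spectrum or linkage disequilibrium.

```python
import math
import random
import statistics


def simulate(theta, n=50, rng=random):
    """Toy simulator (assumption): n draws from Normal(theta, 1).

    Stands in for a real population-genetic simulator.
    """
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summarize(data):
    """Summary statistics used here: sample mean and sample variance."""
    return (statistics.fmean(data), statistics.variance(data))


def distance(s1, s2):
    """Euclidean distance between two vectors of summary statistics."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))


def abc_rejection(observed, n_sims=5000, keep_fraction=0.02,
                  prior=(-5.0, 5.0), seed=1):
    """Return the parameter values of the simulations closest to the data.

    1. Draw theta from a uniform prior (assumption).
    2. Simulate a pseudo-observed dataset under theta.
    3. Compute the distance between simulated and observed summaries.
    4. Keep the fraction of draws with the smallest distances.
    """
    rng = random.Random(seed)
    s_obs = summarize(observed)
    draws = []
    for _ in range(n_sims):
        theta = rng.uniform(*prior)
        s_sim = summarize(simulate(theta, n=len(observed), rng=rng))
        draws.append((distance(s_sim, s_obs), theta))
    draws.sort()  # smallest distances first
    n_keep = max(1, int(n_sims * keep_fraction))
    return [theta for _, theta in draws[:n_keep]]


if __name__ == "__main__":
    # "Observed" data generated with a known parameter value (2.0),
    # so that the accepted values can be compared with the truth.
    observed = simulate(2.0, n=50, rng=random.Random(42))
    posterior = abc_rejection(observed)
    print("accepted draws:", len(posterior))
    print("posterior mean:", round(statistics.fmean(posterior), 2))
```

The accepted values approximate the posterior distribution of the parameter. The summary statistics are not rescaled here, which is acceptable for this toy case but usually not for real data, where statistics on different scales are typically standardized before computing distances. Dedicated packages also add regression adjustments and cross-validation steps that this sketch omits.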

 

| Software | Class of method | Purpose | Specifics | Issues and warnings | Link | Reference |
|---|---|---|---|---|---|---|
| abc/abcrf | ABC | Performs all steps of model checking and parameter estimation for ABC analyses. abcrf includes random forest methods (a type of supervised machine learning) | Informative vignette, allows graphical representation, complete and robust | Does not perform coalescent simulations (but can be used in combination with coala) | https://cran.r-project.org/web/packages/abc/index.html ; https://cran.r-project.org/web/packages/abcrf/index.html | (Csilléry et al., 2012; Raynal et al., 2019) |
| ABCToolbox | ABC | Complete ABC analysis, from simulations to model checking and parameter estimation | Modular, facilitates the computation of summary statistics | NA | https://bitbucket.org/wegmannlab/abctoolbox/wiki/Home | (Wegmann et al., 2010) |
| DIYABC | ABC | Complete ABC analysis, from simulations to model checking and parameter estimation | User-friendly. Many ways to check goodness-of-fit. Good introduction to ABC models. | Does not model continuous gene flow. | http://www1.montpellier.inra.fr/CBGP/diyabc/ | (Cornuet et al., 2008) |
| PopSizeABC | ABC | Inferring changes in Ne using whole-genome data | Intended to better assess recent events. Uses a set of summary statistics for the AFS and LD between markers. Handles multiple individuals. | Approximate Bayesian approaches do not use the full information in the data. | https://forge-dga.jouy.inra.fr/projects/popsizeabc/ | (Boistard et al., 2016) |
| coala | ABC/coalescent simulations | Combining coalescent simulators within a single framework | Facilitates the building of scenarios and computes summary statistics for simulations. Can be easily combined with the abc or abcrf packages in R. | So far supports ms, msms and scrm. | https://cran.r-project.org/web/packages/coala/index.html | (Staab and Metzler, 2016) |
| FacSexCoalescent | Coalescent simulations | Simulating demographic scenarios for asexual/facultatively sexual species | Can handle varying levels of sexual reproduction, inbreeding, selfing and cloning. | Does not yet handle population size changes or selection. | https://github.com/MattHartfield/FacSexCoalescent ; https://github.com/MattHartfield/FacSexCoalescentR | (Hartfield et al., 2016) |
| fastsimcoal2 | Coalescent simulations | Building any arbitrary scenario using a coalescent framework | Any arbitrary scenario can be implemented. Handles SNP, microsatellite and sequence data. | Does not handle selection. Slower than ms with no recombination, much faster with recombination (see manual). | http://cmpg.unibe.ch/software/fastsimcoal2/ | (Excoffier and Foll, 2011) |
| ms, msms, msABC | Coalescent simulations | Building any arbitrary scenario using a coalescent framework | Any arbitrary scenario can be implemented. Handles SNP, microsatellite and sequence data. msms can include selection in the model. | Syntax can be difficult to handle for new users compared to, e.g., fastsimcoal2 (but see coala). | http://www.bio.lmu.de/~pavlidis/home/?Software:msABC | (Hudson, 2002; Ewing and Hermisson, 2010; Pavlidis et al., 2010) |
| msms | Coalescent simulations | Simulating demographic scenarios including selection | Flexible, syntax similar to ms, handles arbitrarily complex models. Can be used in an ABC framework to include selection as a parameter to be estimated. | Syntax can be difficult to handle for new users (but see coala). | http://www.mabs.at/ewing/msms/index.shtml | (Ewing and Hermisson, 2010) |
| scrm | Coalescent simulations | Fast simulation of chromosome-scale sequences | Syntax similar to ms, handles any arbitrary scenario. | Does not handle gene conversion or a fixed number of segregating sites (unlike ms). | https://scrm.github.io/ | (Staab et al., 2015) |
| SPLATCHE3 | Coalescent simulations | Simulating demographic scenarios in their spatial context | Coalescent simulator for genetic data, forward-in-time for demography in space. Spatially explicit. | Simulations can be slow (>1 hour) for large datasets (>100,000 SNPs) over more than 1,000 generations. Does not incorporate selection. | http://www.splatche.com/splatche3 | (Currat et al., 2019) |
| discoal | Coalescent simulations | Simulating selective sweeps under arbitrary demographic scenarios | Relatively fast for short genomic fragments. Designed to simulate "hard" and "soft" sweeps. | Mostly used with diploS/HIC. Other simulators such as msms may be better suited to some scenarios. | https://github.com/kr-colab/discoal | (Kern and Schrider, 2016) |
| QuantiNemo2 | Forward-in-time simulations | Simulating demographic and selection scenarios in their spatial context | Comprehensive simulator. Designed for the study of selection in a spatially explicit context. Simulates quantitative traits, fitness landscapes and underlying genetic variation with migration. Includes both population-based and individual-based simulations. | Can be slow for large/complex models. | https://www2.unil.ch/popgen/softwares/quantinemo/ | (Neuenschwander et al., 2019) |
| SLiM3 | Forward-in-time simulations | Simulating genomic sequences with intrinsic and extrinsic factors | One of the most comprehensive simulators. Can simulate genetic data in their spatio-temporal context, the effects of selection at linked sites, coding and non-coding variation, inbreeding and selfing. Supports tree-sequence recording for faster simulations. Large community. | Slow for large genomic regions/large populations. | https://messerlab.org/slim/ | (Haller and Messer, 2019) |
| diploS/HIC | Supervised machine learning | Detecting selective sweeps | Classifies genomic windows as neutral, selected, or impacted by selection at linked sites. Also distinguishes between selection on standing and de novo variation. Uses a set of summary statistics describing the frequency spectrum and LD, does not require phasing. Good tutorial explaining the pipeline. | Good performance depends on the parameters used to simulate sweeps (window size, selection coefficient, demography). Requires some trial and error for new model species. The interpretation of "soft" and "hard" sweeps is still debated. | https://github.com/kr-colab/diploSHIC | (Schrider and Kern, 2016; Kern and Schrider, 2018) |
| evoNet | Supervised machine learning | Detecting selective sweeps and balancing selection, and estimating demographic history | Uses deep-learning algorithms to classify genomic regions as selected or neutral, and to estimate effective population sizes. Flexible (any number of summary statistics can be provided by the investigator). | Requires summary statistics as an input. Difficult for new users. | https://sourceforge.net/projects/evonet/?source=typ_redirect | (Sheehan and Song, 2016) |
| FastEPRR | Supervised machine learning | Estimating effective recombination rates | Uses regression to estimate effective recombination rates from SNP alignments. Can use the VCF format. No clear bias due to phasing errors observed. Can incorporate demographic history (using ms command line). | Requires phased data. | https://www.picb.ac.cn/evolgen/softwares/index.html | (Gao et al., 2016) |
| FILET | Supervised machine learning | Detecting introgression | Uses Extra Trees classifiers and dedicated summary statistics to classify genomic windows as being introgressed or not. Identifies the direction of introgression. | Targets pulses of introgression rather than continuous gene flow, but can detect the latter. Requires phased data in fasta format. | https://github.com/kr-colab/FILET | (Gao et al., 2016) |
| genomatnn | Supervised machine learning | Detecting adaptive introgression | Uses convolutional neural networks to identify adaptive introgression. Trained using the tree-sequence records obtained from SLiM3. Can handle VCF files and unphased data. | Strong computational bottleneck with SLiM simulations. | https://github.com/grahamgower/genomatnn | (Gower et al., 2020) |
| ImaGene | Supervised machine learning | Detecting selective sweeps | Uses convolutional neural networks to classify genomic windows in bins of distinct selection coefficients. Directly uses the image of the alignment, avoiding compression into summary statistics. | Can be slow for large datasets. | https://github.com/mfumagalli/ImaGene | (Torada et al., 2019) |
| ReLERNN | Supervised machine learning | Estimating recombination rates | Uses recurrent neural networks to estimate recombination rates from SNP alignments. Handles unphased and pooled data. Uses msprime (Python implementation of ms) to generate the simulations upon which the algorithm is trained. Can incorporate known demographic history provided by the user. | Can be computationally intensive for large effective population sizes. Accuracy on pooled data is modest for low depth of coverage. Absolute estimates of recombination rates depend on the accuracy of the mutation rate used for simulations. | https://github.com/kr-colab/ReLERNN | (Adrion, Galloway, et al., 2020) |
| SWiFr | Supervised machine learning | Detecting selective sweeps | Uses an averaged one-dependence estimator to classify genomic regions as selected or neutral. Flexible in terms of which summary statistics are used. Can incorporate demographic history. | Requires summary statistics as an input. Only distinguishes between selective sweeps and neutral regions. | https://github.com/ramachandran-lab/SWIFr/blob/master/README.md | (Sugden et al., 2018) |

References

 

Adrion, J. R., Galloway, J. G., & Kern, A. D. (2020). Predicting the landscape of recombination using deep learning. Molecular Biology and Evolution, 37(6), 1790–1808. doi: 10.1093/molbev/msaa038

Boistard, S., Rodriguez, W., Jay, F., Mona, S., & Austerlitz, F. (2016). Inferring Population Size History from Large Samples of Genome-Wide Molecular Data – An Approximate Bayesian Computation Approach. PLoS Genetics, 858–865. doi: 10.1371/journal.pgen.1005877

Cornuet, J.-M., Santos, F., Beaumont, M. A., Robert, C. P., Marin, J.-M., Balding, D. J., … Estoup, A. (2008). Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24(23), 2713–2719. doi: 10.1093/bioinformatics/btn514

Csilléry, K., François, O., & Blum, M. G. B. (2012). abc: an R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution, 3(3), 475–479. doi: 10.1111/j.2041-210X.2011.00179.x

Currat, M., Arenas, M., Quilodrán, C. S., Excoffier, L., & Ray, N. (2019). SPLATCHE3: Simulation of serial genetic data under spatially explicit evolutionary scenarios including long-distance dispersal. Bioinformatics, 35(21), 4480–4483. doi: 10.1093/bioinformatics/btz311

Ewing, G., & Hermisson, J. (2010). MSMS: A coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics, 26(16), 2064–2065. doi: 10.1093/bioinformatics/btq322

Excoffier, L., & Foll, M. (2011). Fastsimcoal: a Continuous-Time Coalescent Simulator of Genomic Diversity Under Arbitrarily Complex Evolutionary Scenarios. Bioinformatics, 27(9), 1332–1334. doi: 10.1093/bioinformatics/btr124

Gao, F., Ming, C., Hu, W., & Li, H. (2016). New software for the fast estimation of population recombination rates (FastEPRR) in the genomic era. G3: Genes, Genomes, Genetics, 6(6), 1563–1571. doi: 10.1534/g3.116.028233

Gower, G., Picazo, P. I., Fumagalli, M., & Racimo, F. (2020). Detecting adaptive introgression in human evolution using convolutional neural networks. BioRxiv, 2020.09.18.301069. Retrieved from https://doi.org/10.1101/2020.09.18.301069

Haller, B. C., & Messer, P. W. (2019). SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model. Molecular Biology and Evolution, 36(3), 632–637. doi: 10.1093/molbev/msy228

Hartfield, M., Wright, S. I., & Agrawal, A. F. (2016). Coalescent times and patterns of genetic diversity in species with facultative sex: Effects of gene conversion, population structure, and heterogeneity. Genetics, 202(1), 297–312. doi: 10.1534/genetics.115.178004

Hudson, R. R. (2002). Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18(2), 337–338. doi: 10.1093/bioinformatics/18.2.337

Kern, A. D., & Schrider, D. R. (2016). Discoal: flexible coalescent simulations with selection. Bioinformatics, 32(24), 3839–3841.

Kern, A. D., & Schrider, D. R. (2018). diploS/HIC: An Updated Approach to Classifying Selective Sweeps. G3: Genes, Genomes, Genetics, g3.200262.2018. doi: 10.1534/g3.118.200262

Neuenschwander, S., Michaud, F., & Goudet, J. (2019). QuantiNemo 2: A Swiss knife to simulate complex demographic and genetic scenarios, forward and backward in time. Bioinformatics, 35(5), 886–888. doi: 10.1093/bioinformatics/bty737

Pavlidis, P., Laurent, S., & Stephan, W. (2010). MsABC: A modification of Hudson’s ms to facilitate multi-locus ABC analysis. Molecular Ecology Resources, 10(4), 723–727. doi: 10.1111/j.1755-0998.2010.02832.x

Raynal, L., Marin, J. M., Pudlo, P., Ribatet, M., Robert, C. P., & Estoup, A. (2019). ABC random forests for Bayesian parameter inference. Bioinformatics, 35(10), 1720–1728. doi: 10.1093/bioinformatics/bty867

Schrider, D. R., & Kern, A. D. (2016). S/HIC: Robust Identification of Soft and Hard Sweeps Using Machine Learning. PLoS Genetics, 12(3), 1–31. doi: 10.1371/journal.pgen.1005928

Sheehan, S., & Song, Y. S. (2016). Deep Learning for Population Genetic Inference. PLoS Computational Biology, 12(3), 1–28. doi: 10.1371/journal.pcbi.1004845

Staab, P. R., & Metzler, D. (2016). Coala: An R framework for coalescent simulation. Bioinformatics, 32(12), 1903–1904. doi: 10.1093/bioinformatics/btw098

Staab, P. R., Zhu, S., Metzler, D., & Lunter, G. (2015). Scrm: Efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10), 1680–1682. doi: 10.1093/bioinformatics/btu861

Sugden, L. A., Atkinson, E. G., Fischer, A. P., Rong, S., Henn, B. M., & Ramachandran, S. (2018). Localization of adaptive variants in human genomes using averaged one-dependence estimation. Nature Communications, 9(1). doi: 10.1038/s41467-018-03100-7

Torada, L., Lorenzon, L., Beddis, A., Isildak, U., Pattini, L., Mathieson, S., & Fumagalli, M. (2019). ImaGene: A convolutional neural network to quantify natural selection from genomic data. BMC Bioinformatics, 20(Suppl 9), 1–12. doi: 10.1186/s12859-019-2927-x

Wegmann, D., Leuenberger, C., Neuenschwander, S., & Excoffier, L. (2010). ABCtoolbox: a versatile toolkit for approximate Bayesian computations. BMC Bioinformatics, 11, 116. doi: 10.1186/1471-2105-11-116