Methods to detect selection

Checking for population structure is an essential step when analysing genome-level datasets. Neglecting it can bias demographic inferences (Chikhi et al., 2010; Heller et al., 2013) or the detection of loci under selection (e.g. Nielsen et al., 2007); outlier individuals and global structure should therefore be assessed before any more sophisticated analysis. Conversely, selection affects correlations i) between alleles and the environment at selected loci and ii) between alleles at different loci, whether or not those loci are directly under selection. These effects show up respectively as i) variation in polymorphism within and between populations and ii) linkage disequilibrium (LD) between loci. If selection is widespread in the genome, inferences of population history can therefore be biased, which makes the joint study of selection and population structure necessary.
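As a minimal illustration of the LD signal mentioned above, the sketch below computes the classical r² statistic between two biallelic loci from phased haplotypes. The function name `r_squared` and the 0/1 allele coding are assumptions made for this example; it is not the implementation used by any of the tools listed on this page.

```python
def r_squared(haplotypes):
    """Return r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) tuples, where a and b are the alleles
    (coded 0 or 1) carried by one phased haplotype at locus A and locus B.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    # Allele frequencies at each locus and frequency of the 1-1 haplotype.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # D measures the departure from independence between the two loci.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom


# Alleles always co-occur: complete LD.
print(r_squared([(1, 1), (1, 1), (0, 0), (0, 0)]))  # 1.0
# All four haplotypes equally frequent: no LD.
print(r_squared([(1, 1), (1, 0), (0, 1), (0, 0)]))  # 0.0
```

Unphased genotype data require a different estimator, which is one reason several LD-based tools in the table below ask for phased input.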
The table below summarizes current methods to detect selection.

| Software | Class of method | Purpose | Specifics | Issues and warnings | Reference |
|---|---|---|---|---|---|
| ARGWeaver/ARGweaver-D | Ancestral Recombination Graphs/coalescence | Retracing the whole process of recombination and coalescence along a genome | Provides quantitative estimates for TMRCA and topologies at each locus. ARGWeaver-D can estimate introgression. Estimates effective population size. Provides tools to extract summary statistics for the topologies retrieved. Does not require phasing (but slower). | High computing cost. Slower on unphased or low-depth data. ARGWeaver-D is not part of the Anaconda (Python) distribution (ARGweaver can be installed via conda: conda install -c genomedk argweaver). | (Rasmussen et al., 2014; Hubisz et al., 2020) |
| GAPIT3 | Association | Detecting association with environmental/phenotypic features | Includes most methods for GWAS, including procedures for fast computation, mixed linear models, efficient mixed model association, Bayesian methods such as BLINK, diagnostics such as QQ plots, and genotype filtering. | May be slow for very large datasets. | (Wang and Zhang, 2020) |
| GEMMA | Association | Detecting association with environmental/phenotypic features | Computationally efficient for large-scale datasets. | Imports data from PLINK format. | (Zhou and Stephens, 2012) |
| GenABEL | Association | Detecting association with environmental/phenotypic features | Modular; facilitates correction for population structure/relatedness. | Imports data from PLINK format. No longer supported! | (Aulchenko et al., 2007) |
| PLINK | Association | Detecting association with environmental/phenotypic features | Handles a variety of tests for population structure and relatedness. | Population structure/kinship need to be assessed prior to association analysis. | (Purcell et al., 2007) |
| Trinculo | Association | Detecting association with environmental/phenotypic features | Specifically designed to handle categorical variables with more than 2 categories. Performs multinomial logistic regression and provides frequentist and Bayesian frameworks. | Requires the LAPACK library on Unix. Allows fine-mapping by testing for correlations between adjacent markers. | (Jostins and McVean, 2016) |
| SAMBADA | Association/Environmental association | Detecting association with environmental/phenotypic features | Designed to be fast; underlying models have been kept simple. Allows conversion from PLINK format. Takes into account spatial autocorrelation of individual genotypes. Allows correction for population structure. | Does not work with pooled data. Possibly high levels of false positives. Relatedness between samples should be assessed independently. Should be used in combination with LFMM or BayPass. | (Stucki et al., 2016) |
| Relate | Coalescence with recombination | Reconstructing genome-wide genealogies for hundreds of samples | Provides quantitative estimates for TMRCA and topologies at each locus. Infers past demography (similar to PSMC methods). Infers changes in mutation rates. Performs scans for positive selection over discrete time periods. | Requires an outgroup to polarize alleles as ancestral/derived. Requires a recombination map. Does not reconstruct the ARG sensu stricto, and does not estimate the uncertainty of the local genealogies. | (Speidel et al., 2019) |
| diCal-IBD | Coalescent with recombination/IBD | Predicting IBD tracts from demographic models | High IBD sharing suggests recent positive selection. | Uses diCal output to obtain expectations based on demographic scenarios. | (Tataru et al., 2014) |
| VolcanoFinder | Composite likelihood test | Adaptive introgression | Detects a specific signature of an increase then a drop in diversity near a selected locus brought into a population through introgression. | Tool-specific input format. Computationally intensive; needs to be run in parallel. | (Setter et al., 2020) |
| SCCT | Conditional coalescent tree | Detecting positive selection | Designed for detecting recent positive selection. Claims to be more precise at identifying selected sites. | The ancestral state of alleles must be obtained through an outgroup. | (Wang et al., 2014) |
| LFMM | Environmental association | Detecting adaptation to environmental features | Corrects for population structure using latent factors; faster than Bayenv for large datasets. | Only performs association with environment. | (Frichot et al., 2013) |
| CLUES | Genealogies at selected loci | Estimating the time at which a beneficial allele rises in frequency | Previous version used ARGWeaver output, current version uses Relate. Provides scripts to plot the trajectory of selected alleles. | Assumes a panmictic population; neglects the effects of selection at linked sites. | (Stern et al., 2019) |
| PALM | Genealogies at selected loci | Estimating the strength and timing of selection on polygenic traits | Uses genealogies estimated from Relate and results from GWAS to estimate timing and strength of selection for polygenic traits. Should be robust to pleiotropy and residual structure in GWAS. | May overestimate selection for older events. Only tested in humans. | (Stern et al., 2020) |
| startmrca | Genealogies at selected loci | Estimating the time at which a beneficial allele rises in frequency | Compares genealogies between carriers and non-carriers of an advantageous mutation, assuming a star genealogy at selected loci. Can handle VCF files. | Requires a reference panel of non-carrier haplotypes. Sensitive to local diversity before the sweep, and to migration events during a sweep. Better suited to recent sweeps. | (Smith, Coop, Stephens, & Novembre, 2018) |
| Ancestry_HMM-S | Identity-by-state tracts | Adaptive introgression | Estimates the selection coefficient of the introgressed loci through a hidden Markov chain approach. | Requires the time and extent of introgression to be defined by the user. | (Svedberg et al., 2020) |
| H12 test | LD | Detecting selection using signatures of high LD | Does not require phased data. Designed for detecting soft sweeps. | Coalescent simulations are recommended to evaluate the likelihood of selection. | (Garud et al., 2015) |
| LDna | LD | Detecting selection using signatures of high LD | Can be used to address population structure or detect large inversions or indel polymorphism through LD. | The user needs to explore parameter values to ensure that the sets of significantly linked SNPs are robust. | (Kemppainen et al., 2015) |
| rehh | LD | Detecting selection using signatures of high LD | Can compute both XP-EHH and Rsb. Handles several input formats. | Requires phased data and a high density of markers. | (Gautier and Vitalis, 2012) |
| Scan for epistatic interaction (based on LD) | LD | Polygenic selection/Epistatic interactions | Uses genome-wide LD between a candidate locus and the rest of the genome to identify epistatic interactions. Can test SNP-SNP interactions, or interactions between genomic windows (summarizes genotypes through PCA). | Lacks a detailed tutorial. | (Boyrie et al., 2020) |
| Selscan | LD | Detecting selection using signatures of high LD | Includes the nSL statistic, dedicated to soft sweep detection. | Does not include utilities to specify the ancestral state of alleles. Requires phased data and a high density of markers. | (Szpiech and Hernandez, 2014) |
| BALLET | Likelihood test for balancing selection | Detecting balancing selection | Designed for detecting ancient balancing selection. Does not require phasing. | Requires whole-genome data and a recombination map. The ancestral state of alleles must be obtained through an outgroup. | (DeGiorgio et al., 2014) |
| Betascan2 | Local associations of allele frequencies | Detecting balancing selection | Uses correlations in frequencies between genomically proximate SNPs to compute a score. Can incorporate information about ancestral/derived alleles and fixed derived variants, and normalizes the statistic depending on the number of sites in a given genomic window. Very detailed tutorial and utilities. | Requires estimating the length distribution of ancestral fragments on each side of the selected site. The 95th percentile can be estimated with the formula L = -log(0.05)/(T*rho), with T the time since selection in generations and rho the effective recombination rate per generation. | (Siewert and Voight, 2017, 2020) |
| NCD statistics | Local associations of allele frequencies | Detecting balancing selection | Examines the observed and expected frequency spectra of polymorphisms in genomic windows to test for selection. Can incorporate fixed differences with an outgroup (NCD2), but this is not mandatory (NCD1). | Tool-specific input format; requires simulations to calibrate the statistics. Requires defining the expected equilibrium frequency of alleles (usually between 0.3 and 0.5). Low sensitivity below these frequencies. | (Bitarello et al., 2018) |
| Bayescan | Population differentiation | Detecting positive selection and local adaptation | Incorporates uncertainty on allele frequencies due to low sample sizes. | Sensitive to priors on the ratio of selected/neutral sites. False positive rates can be high under scenarios of demographic expansion, admixture and isolation by distance. | (Foll and Gaggiotti, 2008) |
| FDIST2 | Population differentiation | Detecting positive selection and local adaptation | Allows controlling for hierarchical population structure. | False positive rate is high when an island model cannot be assumed. | (Beaumont and Balding, 2004) |
| PCAdapt | Population differentiation | Detecting positive selection and local adaptation | Does not require populations to be defined. Handles admixed populations and pooled datasets. | False positive rate can be high. | (Duforet-Frebourg et al., 2016) |
| SelEstim | Population differentiation | Detecting positive selection and local adaptation | Can estimate the coefficients of selection. Calibration using a pseudo-observed dataset (can be used in combination with the R function simulate.baypass() in BayPass). | Assumes a Wright-Fisher island model. | (Vitalis et al., 2014) |
| Bayenv, BayPass | Population differentiation/Association | Detecting positive selection and adaptation to environmental features | Less sensitive to population demographic history than previous methods. Handle pooled datasets. | Significance thresholds need to be determined from pseudo-observed datasets. Calibration with neutral SNPs is recommended. BayPass better estimates the kinship matrix. | (Günther and Coop, 2013; Gautier, 2015) |
| FLK | Population differentiation/Association | Detecting positive selection and local adaptation | Less sensitive to population demographic history than previous methods. | Requires an outgroup population. | (Bonhomme et al., 2010) |
| LSD | Population differentiation/Population-branch test | Detecting positive selection and local adaptation | Compares the level of exclusively shared differences between internal and external branches of a population tree. Allows testing for selection on the ancestral branch leading to two populations. | Requires several populations to perform the test. May be less sensitive to selection on standing variation. | (Librado and Orlando, 2018) |
| POPBAM | Summary statistics | Detecting selection using AFS, differentiation | Extracts summary statistics directly from BAM files. | Does not allow for sophisticated filtering and SNP calling. | (Garrigan, 2013) |
| VCFtools | Summary statistics | Detecting selection using AFS, differentiation | Extracts summary statistics from VCF files. Also allows VCF filtering and conversion. | Set of summary statistics not as extensive as PopGenome. | (Danecek et al., 2011) |
| RAiSD | Summary statistics/Allele frequency spectrum + LD | Detecting positive selection and local adaptation | Scans the genome for composite signals of selective sweeps summarized by the μ statistic. Corrects for the effects of background selection by estimating a threshold value for the statistic based on simulations with background selection. | Uses a single population of interest. | (Alachiotis and Pavlidis, 2018) |
| TASSEL | Summary statistics/Association | Detecting association with phenotype | User-friendly (Java interface), corrects for relatedness, allows computing summary statistics (LD, diversity). | Requires relatedness to be assessed externally (with e.g. STRUCTURE). | (Bradbury et al., 2007) |
| ANGSD | Summary statistics/Association/Population Branch test | Detecting selection using AFS, differentiation, association with functional traits | Allows for association using generalized linear models. | Descriptive statistics. P-values need to be evaluated through coalescent simulations. | (Korneliussen et al., 2014) |
| SweeD | Summary statistics/Composite Likelihood test | Designed for whole-genome data (or large continuous regions) | Supports FASTA and VCF formats. Estimates selection coefficients. | NA | (DeGiorgio et al., 2016) |
| selectionTools | Summary statistics/LD | Detecting selection using AFS, differentiation and LD statistics | Allows combining several tools in a single pipeline. Includes phasing tools. | Set of available summary statistics remains limited (same as VCFtools + Fay and Wu's H). | (Cadzow et al., 2014) |
| PAML/CODEML | Summary statistics/phylogeny | Distribution of fitness effects/selection on coding variation | Estimates selection along branches in a phylogeny for genes of interest, contrasting patterns of synonymous and non-synonymous substitutions. A detailed tutorial is available. | Slow for large datasets. Needs to be parallelised. | (Yang, 2007) |
| polyDFE2.0 | Summary statistics/phylogeny | Distribution of fitness effects/selection on coding variation | Can test for invariance of DFEs across datasets (genomic regions within species, or different species). No need for divergence estimates (does not assume that the same DFE is shared between species and outgroup). A very detailed tutorial is available. | May require a large number of SNPs for each dataset for comparisons to be meaningful. | (Tataru and Bataillon, 2019) |
| PopGenome | Summary statistics/Population Branch test | Detecting selection using AFS, differentiation | Fast, embedded in R, allows using annotation files (GFF/GTF format). | Does not perform association, but can be used in combination with GenABEL within R. | (Pfeifer et al., 2014) |
| ETEToolkit | Summary statistics/phylogeny | Distribution of fitness effects/selection on coding variation | ETEToolkit can call CODEML from Python and can streamline phylogenetic analyses of selection. Estimates selection along branches in a phylogeny for genes of interest, contrasting patterns of synonymous and non-synonymous substitutions. | Slow for large datasets. Needs to be parallelised. | (Huerta-Cepas et al., 2016) |
| HaplotypeDFEStandingVariation (no official name for the pipeline) | Summary statistics/phylogeny | Distribution of fitness effects/selection on coding variation | Uses variation in the length of identity-by-state tracts to infer the distribution of fitness effects. | The pipeline relies on in-house simulators and ABC to obtain more robust estimates of the DFE. Comprehensive, but may be difficult to deploy for an inexperienced user. | (Ortega-Del Vecchyo et al., 2022) |
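The fragment-length formula quoted in the Betascan2 row, L = -log(0.05)/(T*rho), can be evaluated with a few lines of Python. This is only a sketch: the function name `ancestral_fragment_length` is hypothetical, the logarithm is assumed to be the natural logarithm, and rho is assumed to be expressed per base pair per generation so that L comes out in base pairs. The parameter values in the usage line are illustrative, not recommendations.

```python
import math


def ancestral_fragment_length(t_generations, rho, tail_prob=0.05):
    """Evaluate L = -log(tail_prob) / (T * rho).

    t_generations: time since selection, in generations (T in the formula).
    rho: effective recombination rate per generation (assumed per base pair).
    tail_prob: 0.05 reproduces the 95th percentile quoted in the table.
    """
    if t_generations <= 0 or rho <= 0:
        raise ValueError("T and rho must be positive")
    if not 0 < tail_prob < 1:
        raise ValueError("tail_prob must lie strictly between 0 and 1")
    return -math.log(tail_prob) / (t_generations * rho)


# Illustrative values only: T = 100,000 generations, rho = 1e-8 per bp.
length_bp = ancestral_fragment_length(100_000, 1e-8)
print(f"{length_bp:.0f} bp")
```

Because L is inversely proportional to both T and rho, older selection or a higher recombination rate implies shorter ancestral fragments on each side of the selected site.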


Alachiotis, N., & Pavlidis, P. (2018). RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors. Communications Biology, 1(1). doi: 10.1038/s42003-018-0085-8

Aulchenko, Y. S., Ripke, S., Isaacs, A., & van Duijn, C. M. (2007). GenABEL: An R library for genome-wide association analysis. Bioinformatics, 23(10), 1294–1296. doi: 10.1093/bioinformatics/btm108

Beaumont, M. A., & Balding, D. J. (2004). Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology, 13(4), 969–980. doi: 10.1111/j.1365-294X.2004.02125.x

Bitarello, B. D., De Filippo, C., Teixeira, J. C., Schmidt, J. M., Kleinert, P., Meyer, D., & Andres, A. M. (2018). Signatures of long-term balancing selection in human genomes. Genome Biology and Evolution, 10(3), 939–955. doi: 10.1093/gbe/evy054

Bonhomme, M., Chevalet, C., Servin, B., Boitard, S., Abdallah, J. M., Blott, S., & San Cristobal, M. (2010). Detecting Selection in Population Trees: The Lewontin and Krakauer Test Extended. Genetics, 186, 241–262. doi: 10.1534/genetics.110.117275

Boyrie, L., Moreau, C., Frugier, F., Jacquet, C., & Bonhomme, M. (2020). A linkage disequilibrium-based statistical test for Genome-Wide Epistatic Selection Scans in structured populations. Heredity. doi: 10.1038/s41437-020-0349-1

Bradbury, P. J., Zhang, Z., Kroon, D. E., Casstevens, T. M., Ramdoss, Y., & Buckler, E. S. (2007). TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19), 2633–2635. doi: 10.1093/bioinformatics/btm308

Cadzow, M., Boocock, J., Nguyen, H. T., Wilcox, P., Merriman, T. R., & Black, M. A. (2014). A bioinformatics workflow for detecting signatures of selection in genomic data. Frontiers in Genetics, 5(AUG), 1–8. doi: 10.3389/fgene.2014.00293

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., … Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. doi: 10.1093/bioinformatics/btr330

DeGiorgio, M., Huber, C. D., Hubisz, M. J., Hellmann, I., & Nielsen, R. (2016). SweepFinder2: Increased sensitivity, robustness and flexibility. Bioinformatics, 32(12), 1895–1897. doi: 10.1093/bioinformatics/btw051

DeGiorgio, M., Lohmueller, K. E., & Nielsen, R. (2014). A model-based approach for identifying signatures of ancient balancing selection in genetic data. PLoS Genetics, 10(8), e1004561. doi: 10.1371/journal.pgen.1004561

Duforet-Frebourg, N., Luu, K., Laval, G., Bazin, E., & Blum, M. G. B. (2016). Detecting genomic signatures of natural selection with principal component analysis: Application to the 1000 genomes data. Molecular Biology and Evolution, 33(4), 1082–1093. doi: 10.1093/molbev/msv334

Foll, M., & Gaggiotti, O. (2008). A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics, 180(2), 977–993. doi: 10.1534/genetics.108.092221

Frichot, E., Schoville, S. D., Bouchard, G., & François, O. (2013). Testing for associations between loci and environmental gradients using latent factor mixed models. Molecular Biology and Evolution, 30(7), 1687–1699. doi: 10.1093/molbev/mst063

Garrigan, D. (2013). POPBAM: Tools for evolutionary analysis of short read sequence alignments. Evolutionary Bioinformatics, 9, 343–353. doi: 10.4137/EBO.S12751

Garud, N. R., Messer, P. W., Buzbas, E. O., & Petrov, D. A. (2015). Recent Selective Sweeps in North American Drosophila melanogaster Show Signatures of Soft Sweeps. PLoS Genetics, 11(2), 1–32. doi: 10.1371/journal.pgen.1005004

Gautier, M. (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics, 201(4), 1555–1579. doi: 10.1534/genetics.115.181453

Gautier, M., & Vitalis, R. (2012). rehh: An R package to detect footprints of selection in genome-wide SNP data from haplotype structure. Bioinformatics, 28(8), 1176–1177. doi: 10.1093/bioinformatics/bts115

Günther, T., & Coop, G. (2013). Robust identification of local adaptation from allele frequencies. Genetics, 195(1), 205–220. doi: 10.1534/genetics.113.152462

Hubisz, M. J., Williams, A. L., & Siepel, A. (2020). Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph. PLoS Genetics, 16(8), 1–24. doi: 10.1371/journal.pgen.1008895

Jostins, L., & McVean, G. (2016). Trinculo: Bayesian and frequentist multinomial logistic regression for genome-wide association studies of multi-category phenotypes. Bioinformatics, 32(12), 1898–1900. doi: 10.1093/bioinformatics/btw075

Kemppainen, P., Knight, C. G., Sarma, D. K., Hlaing, T., Prakash, A., Maung Maung, Y. N., … Walton, C. (2015). Linkage disequilibrium network analysis (LDna) gives a global view of chromosomal inversions, local adaptation and geographic structure. Molecular Ecology Resources, 15, 1031–1045. doi: 10.1111/1755-0998.12369

Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. doi: 10.1186/s12859-014-0356-4

Librado, P., & Orlando, L. (2018). Detecting signatures of positive selection along defined branches of a population tree using LSD. Molecular Biology and Evolution, 35, 1520–1535. doi: 10.1093/molbev/msy053

Ortega-Del Vecchyo, D., Lohmueller, K. E., & Novembre, J. (2022). Haplotype-based inference of the distribution of fitness effects. Genetics. doi: 10.1093/genetics/iyac002

Pfeifer, B., Wittelsburger, U., Ramos-Onsins, S. E., & Lercher, M. J. (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Molecular Biology and Evolution, 31(7), 1929–1936. doi: 10.1093/molbev/msu136

Rasmussen, M. D., Hubisz, M. J., Gronau, I., & Siepel, A. (2014). Genome-Wide Inference of Ancestral Recombination Graphs. PLoS Genetics, 10(5). doi: 10.1371/journal.pgen.1004342

Setter, D., Mousset, S., Cheng, X., Nielsen, R., DeGiorgio, M., & Hermisson, J. (2020). VolcanoFinder: Genomic scans for adaptive introgression. PLoS Genetics, 16(6), 1–44. doi: 10.1371/journal.pgen.1008867

Siewert, K. M., & Voight, B. F. (2017). Detecting Long-Term Balancing Selection Using Allele Frequency Correlation. Molecular Biology and Evolution, 34(11), 2996–3005. doi: 10.1093/molbev/msx209

Siewert, K. M., & Voight, B. F. (2020). BetaScan2: Standardized Statistics to Detect Balancing Selection Utilizing Substitution Data. Genome Biology and Evolution, 12(2), 3873–3877. doi: 10.1093/gbe/evaa013

Speidel, L., Forest, M., Shi, S., & Myers, S. R. (2019). A method for genome-wide genealogy estimation for thousands of samples. Nature Genetics, 51(9), 1321–1329. doi: 10.1038/s41588-019-0484-x

Stern, A. J., Wilton, P. R., & Nielsen, R. (2019). An approximate full-likelihood method for inferring selection and allele frequency trajectories from DNA sequence data. PLoS Genetics, 15. doi: 10.1371/journal.pgen.1008384

Stern, A., Speidel, L., Zaitlen, N., & Nielsen, R. (2020). Disentangling selection on genetically correlated polygenic traits using whole-genome genealogies. BioRxiv, 1–30. doi: 10.1101/2020.05.07.083402

Stucki, S., Orozco-Terwengel, P., Bruford, M. W., Colli, L., Masembe, C., Negrini, R., … Consortium, N. (2016). High performance computation of landscape genomic models integrating local indices of spatial association. Molecular Ecology Resources, 17(5), 1072–1089. doi: 10.1111/j.1540-8191.2009.00972.x

Svedberg, J., Shchur, V., Reinman, S., Nielsen, R., & Corbett-Detig, R. (2020). Inferring Adaptive Introgression Using Hidden Markov Models. BioRxiv.

Szpiech, Z. A., & Hernandez, R. D. (2014). selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Molecular Biology and Evolution, 31(10), 2824–2827. doi: 10.1093/molbev/msu211

Tataru, P., & Bataillon, T. (2019). PolyDFEv2.0: Testing for invariance of the distribution of fitness effects within and across species. Bioinformatics, 35(16), 2868–2869. doi: 10.1093/bioinformatics/bty1060

Tataru, P., Nirody, J. A., & Song, Y. S. (2014). DiCal-IBD: Demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23), 3430–3431. doi: 10.1093/bioinformatics/btu563

Vitalis, R., Gautier, M., Dawson, K. J., & Beaumont, M. A. (2014). Detecting and measuring selection from gene frequency data. Genetics, 196(3), 799–817. doi: 10.1534/genetics.113.152991

Wang, J., & Zhang, Z. (2020). GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction. BioRxiv.

Wang, M., Huang, X., Li, R., Xu, H., Jin, L., & He, Y. (2014). Detecting recent positive selection with high accuracy and reliability by conditional coalescent tree. Molecular Biology and Evolution, 31(11), 3068–3080. doi: 10.1093/molbev/msu244

Yang, Z. (2007). PAML 4: Phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8), 1586–1591. doi: 10.1093/molbev/msm088

Zhou, X., & Stephens, M. (2012). Genome-wide efficient mixed model analysis for association studies. Nature Genetics, 44(7), 821–824. doi: 10.1038/ng.2310