Methods to detect selection

Checking for population structure is an essential step when analysing genome-level datasets. Neglecting it can bias demographic inferences (Chikhi et al., 2010; Heller et al., 2013) or the detection of loci under selection (e.g. Nielsen et al., 2007); checking for outlier individuals and assessing the global structure is therefore required before any more sophisticated analysis. Conversely, selection affects correlations i) between alleles and environment at selected loci and ii) between alleles at different loci, whether or not these are directly under selection. These effects are reflected respectively in i) variation in polymorphism within and between populations and ii) linkage disequilibrium (LD) between loci. If selection is widespread in the genome, inferences of population history can therefore be biased, which makes it necessary to study selection and population structure jointly.
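
The two signals mentioned above, differentiation between populations and LD between loci, can be illustrated with a minimal Python sketch. It assumes a single biallelic locus with known allele frequencies in two equally weighted populations (Wright's F_ST computed from heterozygosities) and phased haplotypes coded 0/1 at two loci (the r² measure of LD). The function names `fst_two_pops` and `ld_r2` are hypothetical and are not taken from any of the programs listed below, which use more elaborate estimators.

```python
from typing import Sequence, Tuple


def fst_two_pops(p1: float, p2: float) -> float:
    """F_ST = (H_T - H_S) / H_T for one biallelic locus.

    p1, p2: frequency of the same allele in two populations that are
    given equal weight (an assumption of this sketch).
    """
    # Mean expected heterozygosity within populations
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity in the pooled population
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    # A locus fixed for the same allele everywhere carries no information
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def ld_r2(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """r^2 between two biallelic loci from phased haplotypes.

    haplotypes: one (allele_at_locus_A, allele_at_locus_B) pair per
    chromosome, alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    # D measures the departure from independent association of alleles
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic; return 0 here
    return 0.0 if denom == 0 else d * d / denom
```

With these definitions, two populations fixed for alternative alleles give an F_ST of 1, and two loci whose alleles always co-occur on the same haplotype give an r² of 1.
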
This table summarizes current methods to detect selection.

| Software | Class of methods | Purpose | Specifics | Issues and warnings | Link | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| Bayescan | Population differentiation | Detecting positive selection and local adaptation | Incorporates uncertainty on allele frequencies due to low sample sizes | Sensitive to priors on the ratio of selected:neutral sites. False positive rates can be high under scenarios of demographic expansion, admixture and isolation by distance | http://cmpg.unibe.ch/software/BayeScan/ | (Foll and Gaggiotti, 2008) |
| FDIST2 | Population differentiation | Detecting positive selection and local adaptation | Allows controlling for hierarchical population structure | False positive rate is high when an island model cannot be assumed | http://datadryad.org/resource/doi:10.5061/dryad.v8d05 | (Beaumont and Balding, 2004) |
| FLK | Population differentiation/Association | Detecting positive selection and local adaptation | Less sensitive to population demographic history than previous methods | Requires an outgroup population | https://qgsp.jouy.inra.fr/index.php?option=com_content&view=article&id=50&Itemid=55 | (Bonhomme et al., 2010) |
| PCAdapt | Population differentiation | Detecting positive selection and local adaptation | Does not require populations to be defined. Handles admixed populations and pooled datasets | False positive rate can be high | http://membres-timc.imag.fr/Michael.Blum/PCAdapt.html | (Duforet-Frebourg et al., 2016) |
| Bayenv, BayPass | Population differentiation/Association | Detecting positive selection and adaptation to environmental features | Less sensitive to population demographic history than previous methods. Handle pooled datasets | Significance thresholds need to be determined from pseudo-observed datasets. Calibration with neutral SNPs is recommended. BayPass better estimates the kinship matrix | http://www1.montpellier.inra.fr/CBGP/software/baypass/ ; https://bitbucket.org/tguenther/bayenv2_public/src | (Günther and Coop, 2013; Gautier, 2015) |
| LFMM | Environmental association | Detecting adaptation to environmental features | Corrects for population structure using latent factors, faster than BAYENV for large datasets | Only performs association with environment | http://membres-timc.imag.fr/Olivier.Francois/lfmm/software.htm | (Frichot et al., 2013) |
| PLINK | Association | Detecting association with environmental/phenotypic features | Handles a variety of tests for population structure and relatedness | Population structure/kinship need to be assessed prior to association analysis | http://pngu.mgh.harvard.edu/~purcell/plink/ | (Purcell et al., 2007) |
| GENABEL | Association | Detecting association with environmental/phenotypic features | Modularity, facilitates correction for population structure/relatedness. | Imports data from PLINK format | http://www.genabel.org/ | (Aulchenko et al., 2007) |
| GEMMA | Association | Detecting association with environmental/phenotypic features | Computationally efficient for large-scale datasets. Specifically designed for polygenic modeling in genome-wide association studies | Imports data from PLINK format | http://www.xzlab.org/software.html | (Zhou and Stephens, 2012) |
| ANGSD | Summary statistics/Association | Detecting selection using AFS, differentiation, association with functional traits | Allows for association using generalized linear models | Descriptive statistics. P-values need to be evaluated through coalescent simulations. | http://www.popgen.dk/angsd/index.php/ANGSD | (Korneliussen et al., 2014) |
| TASSEL | Summary statistics/Association | Detecting association with phenotype | User friendly (Java interface), corrects for relatedness, allows computing summary statistics (LD, diversity) | Requires relatedness to be assessed externally (with e.g. STRUCTURE) | http://www.maizegenetics.net/tassel | (Korneliussen et al., 2014) |
| selectionTools | Summary statistics/LD | Detecting selection using AFS, differentiation and LD statistics | Allows combining several tools in a single pipeline. Includes phasing tools. | Set of available summary statistics remains limited (same as VCFtools + Fay and Wu's H) | https://github.com/MerrimanLab/selectionTools | (Cadzow et al., 2014) |
| POPGenome | Summary statistics | Detecting selection using AFS, differentiation | Fast, embedded in R, allows using annotation files (GFF/GTF format). | Does not perform association, but can be used in combination with GENABEL within R | https://cran.r-project.org/web/packages/PopGenome/index.html | (Pfeifer et al., 2014) |
| POPBAM | Summary statistics | Detecting selection using AFS, differentiation | Extracts summary statistics directly from BAM files | Does not allow for sophisticated filtering and SNP calling | http://popbam.sourceforge.net/ | (Garrigan, 2013) |
| VCFTOOLS | Summary statistics | Detecting selection using AFS, differentiation | Extracts summary statistics from VCF files. Also allows VCF filtering and conversion | Set of summary statistics not as extensive as PopGenome | http://vcftools.sourceforge.net/ | (Danecek et al., 2011) |
| SweeD | Composite likelihood test | Designed for whole genome data (or large continuous regions) | Supports FASTA and VCF formats. Estimates selection coefficients. | Better suited for whole genome datasets | http://pop-gen.eu/wordpress/software/sweed | (DeGiorgio et al., 2016) |
| Selscan | LD | Detecting selection using signatures of high LD | Includes the nSL statistic dedicated to soft sweep detection | Does not include utilities to specify the ancestral state of alleles. Requires phased data and high density of markers | https://github.com/szpiech/selscan | (Szpiech and Hernandez, 2014) |
| rehh | LD | Detecting selection using signatures of high LD | Can compute both XP-EHH and Rsb. Handles several input formats | Requires phased data and high density of markers | https://cran.r-project.org/web/packages/rehh/index.html | (Gautier et al., 2017) |
| H12 test | LD | Detecting selection using signatures of high LD | Does not require phased data. Designed for detecting soft sweeps | Coalescent simulations are recommended to evaluate the likelihood of selection | https://github.com/ngarud/SelectionHapStats/ | (Garud et al., 2015) |
| LDna | LD | Detecting selection using signatures of high LD | Can be used to address population structure or detect large inversions or indel polymorphism through LD | The user needs to explore parameter values to ensure that the sets of significantly linked SNPs are robust | https://github.com/petrikemppainen/LDna | (Kemppainen et al., 2015) |
| ARGWeaver | Ancestral recombination graphs | Detecting selection by screening for variation in topology and age of alleles | Provides quantitative estimates for TMRCA and topologies at each locus. Can be used to infer demographic history. Especially useful to identify signatures of long-term balancing selection (older coalescence times) | High computing cost. Requires phased whole-genome data. | https://github.com/mdrasmus/argweaver | (Rasmussen et al., 2014) |
| msms | Coalescence | Simulate demographic scenarios including selection | Flexible, syntax similar to ms, handles arbitrarily complex models. Can be used in an ABC framework to include selection as a parameter to be estimated | Syntax can be difficult to handle for the naive user (but see coala) | http://www.mabs.at/ewing/msms/index.shtml | (Ewing and Hermisson, 2010) |
| discoal | Coalescence | Simulate selective sweeps under arbitrary demographic scenarios | More specifically designed for studying soft and hard sweeps | Redundant with msms | https://github.com/kern-lab/discoal | Publication embargoed (Kern and Schrider, 2016) |
| diCal-IBD | Coalescent with recombination/IBD | Predicting IBD tracts from demographic models | High IBD sharing suggests recent positive selection. Uses diCal output to obtain expectations based on demographic scenarios | | https://sourceforge.net/projects/dical-ibd/ | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296155/ |
| BALLET | Likelihood test for balancing selection | Detecting balancing selection | Designed for detecting ancient balancing selection. Does not require phasing | Requires whole-genome data and recombination map. The ancestral state of alleles must be obtained through an outgroup | http://www.personal.psu.edu/mxd60/ballet.html | (DeGiorgio et al., 2014) |
| SCCT | Conditional coalescent tree | Detecting positive selection | Designed for detecting recent positive selection. Claims to be more precise at identifying selected sites | Requires whole-genome data. The ancestral state of alleles must be obtained through an outgroup | https://github.com/wavefancy/scct | (Wang et al., 2014) |
| Trinculo | Association | Detecting association with environmental/phenotypic features | Specifically designed to handle categorical variables with more than 2 categories. Performs multinomial logistic regression and provides frequentist and Bayesian frameworks. | Requires the LAPACK library on Unix. Allows fine-mapping by testing for correlations between adjacent markers. | https://sourceforge.net/projects/trinculo/ | (Jostins and McVean, 2016) |
| SAMBADA | Association/Environmental association | Detecting association with environmental/phenotypic features | Designed to be fast, underlying models have been kept simple. Allows conversion from PLINK format. Takes into account spatial autocorrelation of individual genotypes. Allows correction for population structure | Does not work with pooled data. Possibly high levels of false positives. Relatedness between samples should be assessed independently. Should be used in combination with LFMM or BayPass. | http://lasig.epfl.ch/sambada | (Stucki et al., 2016) |
| SelEstim | Population differentiation | Detecting positive selection and local adaptation | Can estimate the coefficients of selection. Calibration using a pseudo-observed dataset to obtain (can be used in combination with the R function simulate.baypass() in BayPass). Assumes an island model. | | http://www1.montpellier.inra.fr/CBGP/software/selestim/ | (Vitalis et al., 2014) |
| BetaScan | Test for long-term balancing selection | Detecting balancing selection | Uses the allele frequency spectrum across genomic windows to detect balancing selection. Can use both folded and unfolded (with outgroup) spectra | It is advised to compare results from both folded and unfolded spectra if possible. The size of independent windows to consider should be defined according to some (even rough) estimate of recombination rate. | https://github.com/ksiewert/BetaScan | (Siewert and Voight, 2017) |
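
Several entries above recommend calibrating significance thresholds with pseudo-observed datasets or neutral simulations. The following is a minimal Python sketch of that idea, assuming the statistic of interest has already been computed both on the observed loci and on data simulated without selection. The functions `empirical_threshold` and `flag_outliers` are hypothetical names used for illustration only; they do not reproduce the calibration procedure of any specific program in the table.

```python
import math
from typing import List, Sequence


def empirical_threshold(neutral_stats: Sequence[float], alpha: float = 0.01) -> float:
    """Return a cutoff exceeded by at most a fraction alpha of neutral values.

    neutral_stats: statistic values computed on pseudo-observed (neutral) data.
    alpha: tolerated false positive rate per locus (assumed 0 < alpha < 1).
    """
    if not neutral_stats:
        raise ValueError("neutral_stats must not be empty")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(neutral_stats)
    n = len(ordered)
    # Number of neutral values allowed to lie above the cutoff
    n_above = min(math.floor(alpha * n), n - 1)
    return ordered[n - 1 - n_above]


def flag_outliers(observed_stats: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of observed loci whose statistic exceeds the cutoff."""
    return [i for i, value in enumerate(observed_stats) if value > threshold]


# Usage with made-up numbers: neutral values 1..100 and a 5% error rate
neutral = [float(i) for i in range(1, 101)]
cutoff = empirical_threshold(neutral, alpha=0.05)
candidates = flag_outliers([10.0, 96.0, 95.0, 100.0], cutoff)
```

The cutoff is taken directly from the sorted neutral values rather than from a fitted distribution, so its precision depends on how many neutral loci were simulated.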

References

Aulchenko YS, Ripke S, Isaacs A, van Duijn CM (2007). GenABEL: An R library for genome-wide association analysis. Bioinformatics 23: 1294–1296.

Beaumont MA, Balding DJ (2004). Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol 13: 969–980.

Beeravolu CR, Hickerson MJ, Frantz LAF, Lohse K (2016). Approximate Likelihood Inference of Complex Population Histories and Recombination from Multiple Genomes. bioRxiv: 1–31.

Bonhomme M, Chevalet C, Servin B, Boitard S, Abdallah JM, Blott S, et al. (2010). Detecting Selection in Population Trees: The Lewontin and Krakauer Test Extended. Genetics 186: 241–262.

Cadzow M, Boocock J, Nguyen HT, Wilcox P, Merriman TR, Black MA (2014). A bioinformatics workflow for detecting signatures of selection in genomic data. Front Genet 5: 1–8.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. (2011). The variant call format and VCFtools. Bioinformatics 27: 2156–2158.

DeGiorgio M, Huber CD, Hubisz MJ, Hellmann I, Nielsen R (2016). SweepFinder2: increased sensitivity, robustness, and flexibility. Bioinformatics.

DeGiorgio M, Lohmueller KE, Nielsen R (2014). A model-based approach for identifying signatures of ancient balancing selection in genetic data. PLoS Genet 10: e1004561.

Duforet-Frebourg N, Luu K, Laval G, Bazin E, Blum MGB (2016). Detecting genomic signatures of natural selection with principal component analysis: Application to the 1000 genomes data. Mol Biol Evol 33: 1082–1093.

Ewing G, Hermisson J (2010). MSMS: A coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26: 2064–2065.

Foll M, Gaggiotti O (2008). A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977–993.

Frichot E, Schoville SD, Bouchard G, François O (2013). Testing for associations between loci and environmental gradients using latent factor mixed models. Mol Biol Evol 30: 1687–1699.

Garrigan D (2013). POPBAM: Tools for evolutionary analysis of short read sequence alignments. Evol Bioinforma 2013: 343–353.

Garud NR, Messer PW, Buzbas EO, Petrov DA (2015). Recent Selective Sweeps in North American Drosophila melanogaster Show Signatures of Soft Sweeps. PLoS Genet 11: 1–32.

Gautier M (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics 201: 1555–1579.

Gautier M, Klassmann A, Vitalis R (2017). rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Mol Ecol Resour 17: 78–90.

Günther T, Coop G (2013). Robust identification of local adaptation from allele frequencies. Genetics 195: 205–220.

Kemppainen P, Knight CG, Sarma DK, Hlaing T, Prakash A, Maung Maung YN, et al. (2015). Linkage disequilibrium network analysis (LDna) gives a global view of chromosomal inversions, local adaptation and geographic structure. Mol Ecol Resour 15: 1031–1045.

Kern AD, Schrider DR (2016). Discoal: flexible coalescent simulations with selection. Bioinformatics 32: 3839–3841.

Korneliussen TS, Albrechtsen A, Nielsen R (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15: 356.

Pfeifer B, Wittelsburger U, Ramos-Onsins SE, Lercher MJ (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Mol Biol Evol 31: 1929–1936.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am J Hum Genet 81: 559–575.

Rasmussen MD, Hubisz MJ, Gronau I, Siepel A (2014). Genome-Wide Inference of Ancestral Recombination Graphs. PLoS Genet 10.

Stucki S, Orozco-Terwengel P, Bruford MW, Colli L, Masembe C, Negrini R, et al. (2016). High performance computation of landscape genomic models integrating local indices of spatial association. Mol Ecol Resour.

Szpiech ZA, Hernandez RD (2014). selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol 31: 2824–2827.

Vitalis R, Gautier M, Dawson KJ, Beaumont MA (2014). Detecting and measuring selection from gene frequency data. Genetics 196: 799–817.

Wang M, Huang X, Li R, Xu H, Jin L, He Y (2014). Detecting recent positive selection with high accuracy and reliability by conditional coalescent tree. Mol Biol Evol 31: 3068–3080.

Zhou X, Stephens M (2012). Genome-wide efficient mixed model analysis for association studies. Nat Genet 44: 821–824.