Checking for population structure is an essential step when analysing genome-level datasets. Neglecting it can bias demographic inferences (Chikhi et al., 2010; Heller et al., 2013) or the detection of loci under selection (e.g. Nielsen et al., 2007); checking for outlier individuals and assessing the global structure is therefore required before any more sophisticated analysis. Conversely, selection affects correlations i) between alleles and the environment at selected loci and ii) between alleles at different loci, whether or not these loci are directly under selection. These effects are reflected, respectively, in i) variation in polymorphism within and between populations and ii) linkage disequilibrium (LD) between loci. If selection is widespread in the genome, inferences of population history can therefore be biased, which makes the joint study of selection and population structure necessary.
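As an illustration of the second signal, LD between two biallelic loci is commonly summarized by the squared correlation r² between alleles on the same haplotype. The sketch below is a minimal example, assuming phased haplotypes coded 0/1 with no missing data; the function name `ld_r2` is hypothetical and does not correspond to any tool listed in the table.

```python
def ld_r2(hap_a, hap_b):
    """Squared allelic correlation (r^2) between two biallelic loci.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (assumption: no missing data).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at locus A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at locus B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                                # undefined if a locus is monomorphic
    return d * d / denom


# Two loci in complete association, then two independent loci
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))
```

With unphased genotypes, haplotype frequencies would first have to be estimated, which this sketch does not attempt.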
This table summarizes current methods to detect selection.
|Software||Class of methods||Purpose||Specifics||Issues and warnings||Link||Reference|
|Bayescan||Population differentiation||Detecting positive selection and local adaptation||Incorporates uncertainty on allele frequencies due to low sample sizes||Sensitive to priors on the ratio of selected:neutral sites. False positive rates can be high under scenarios of demographic expansion, admixture and isolation by distance||http://cmpg.unibe.ch/software/BayeScan/||(Foll and Gaggiotti, 2008)|
|FDIST2||Population differentiation||Detecting positive selection and local adaptation||Allows controlling for hierarchical population structure||False positive rate is high when an island model cannot be assumed||http://datadryad.org/resource/doi:10.5061/dryad.v8d05||(Beaumont and Balding, 2004)|
|FLK||Population differentiation/Association||Detecting positive selection and local adaptation||Less sensitive to population demographic history than previous methods||Requires an outgroup population||https://qgsp.jouy.inra.fr/index.php?option=com_content&view=article&id=50&Itemid=55||(Bonhomme et al., 2010)|
|PCAdapt||Population differentiation||Detecting positive selection and local adaptation||Does not require populations to be defined. Handles admixed populations and pooled datasets||False positive rate can be high||http://membres-timc.imag.fr/Michael.Blum/PCAdapt.html||(Duforet-Frebourg et al., 2016)|
|Bayenv, BayPass||Population differentiation/Association||Detecting positive selection and adaptation to environmental features||Less sensitive to population demographic history than previous methods. Handle pooled datasets||Significance thresholds need to be determined from pseudo-observed datasets. Calibration with neutral SNPs is recommended. BayPass better estimates the kinship matrix||http://www1.montpellier.inra.fr/CBGP/software/baypass/ ; https://bitbucket.org/tguenther/bayenv2_public/src||(Günther and Coop, 2013; Gautier, 2015)|
|LFMM||Environmental association||Detecting adaptation to environmental features||Corrects for population structure using latent factors, faster than BAYENV for large datasets||Only performs association with environment||http://membres-timc.imag.fr/Olivier.Francois/lfmm/software.htm||(Frichot et al., 2013)|
|PLINK||Association||Detecting association with environmental/phenotypical features||Handles a variety of tests for population structure and relatedness||Population structure/kinship need to be assessed prior to association analysis||http://pngu.mgh.harvard.edu/~purcell/plink/||(Purcell et al., 2007)|
|GENABEL||Association||Detecting association with environmental/phenotypic features||Modularity, facilitates correction for population structure/relatedness.||Imports data from PLINK format||http://www.genabel.org/||(Aulchenko et al., 2007)|
|GEMMA||Association||Detecting association with environmental/phenotypical features||Computationally efficient for large-scale datasets. Specifically designed for polygenic modeling in genome-wide association studies||Imports data from PLINK format||http://www.xzlab.org/software.html||(Zhou and Stephens, 2012)|
|ANGSD||Summary statistics/Association||Detecting selection using AFS, differentiation, association with functional traits||Allows for association using generalized linear models||Descriptive statistics. P-values need to be evaluated through coalescent simulations.||http://www.popgen.dk/angsd/index.php/ANGSD||(Korneliussen et al., 2014)|
|TASSEL||Summary statistics/Association||Detecting association with phenotype||User friendly (Java interface), corrects for relatedness, allows computing summary statistics (LD, diversity)||Requires relatedness to be assessed externally (with e.g. STRUCTURE)||http://www.maizegenetics.net/tassel||(Korneliussen et al., 2014)|
|selectionTools||Summary statistics/LD||Detecting selection using AFS, differentiation and LD statistics||Allows combining several tools in a single pipeline. Includes phasing tools.||Set of available summary statistics remains limited (same as VCFtools + Fay and Wu's H)||https://github.com/MerrimanLab/selectionTools||(Cadzow et al., 2014)|
|POPGenome||Summary statistics||Detecting selection using AFS, differentiation||Fast, embedded in R, allows using annotation files (GFF/GTF format).||Does not perform association, but can be used in combination with GENABEL within R||https://cran.r-project.org/web/packages/PopGenome/index.html||(Pfeifer et al., 2014)|
|POPBAM||Summary statistics||Detecting selection using AFS, differentiation||Extracts summary statistics directly from BAM files||Does not allow for sophisticated filtering and SNP calling||http://popbam.sourceforge.net/||(Garrigan, 2013)|
|VCFTOOLS||Summary statistics||Detecting selection using AFS, differentiation||Extracts summary statistics from VCF files. Also allows VCF filtering and conversion||Set of summary statistics not as extensive as PopGenome||http://vcftools.sourceforge.net/||(Danecek et al., 2011)|
|SweeD||Composite Likelihood test||Designed for whole genome data (or large continuous regions)||Supports Fasta and VCF formats. Estimates for selection coefficients.||Better suited for whole genome datasets||http://pop-gen.eu/wordpress/software/sweed||(DeGiorgio et al., 2016)|
|Selscan||LD||Detecting selection using signatures of high LD||Includes the nSL statistic dedicated to soft sweep detection||Does not include utilities to specify the ancestral state of alleles. Requires phased data and high density of markers||https://github.com/szpiech/selscan||(Szpiech and Hernandez, 2014)|
|rehh||LD||Detecting selection using signatures of high LD||Can compute both XP-EHH and Rsb. Handles several input formats||Requires phased data and high density of markers||https://cran.r-project.org/web/packages/rehh/index.html||(Gautier et al., 2017)|
|H12 test||LD||Detecting selection using signatures of high LD||Does not require phased data. Designed for detecting soft sweeps||Coalescent simulations are recommended to evaluate the likelihood of selection||https://github.com/ngarud/SelectionHapStats/||(Garud et al., 2015)|
|LDna||LD||Detecting selection using signatures of high LD||Can be used to address population structure or detect large inversions or indel polymorphism through LD||The user needs to explore parameter values to ensure that the sets of significantly linked SNPs are robust||https://github.com/petrikemppainen/LDna||(Kemppainen et al., 2015)|
|ARGWeaver||Ancestral recombination graphs||Detecting selection by screening for variation in topology and age of alleles||Provides quantitative estimates for TMRCA and topologies at each locus. Can be used to infer demographic history. Especially useful to identify signatures of long-term balancing selection (older coalescence times)||High computing cost. Requires phased whole-genome data.||https://github.com/mdrasmus/argweaver||(Rasmussen et al., 2014)|
|msms||Coalescence||Simulate demographic scenarios including selection||Flexible, syntax similar to ms, handles arbitrarily complex models. Can be used in an ABC framework to include selection as a parameter to be estimated||Syntax can be difficult to handle for the naive user (but see coala)||http://www.mabs.at/ewing/msms/index.shtml||(Ewing and Hermisson, 2010)|
|discoal||Coalescence||Simulate selective sweeps under arbitrary demographic scenarios||More specifically designed for studying soft and hard sweeps||Redundant with msms||https://github.com/kern-lab/discoal||(Kern and Schrider, 2016)|
|diCal-IBD||Coalescent with recombination/IBD||Predicting IBD tracts from demographic models||High IBD sharing suggests recent positive selection.||Uses diCal output to obtain expectations based on demographic scenarios||https://sourceforge.net/projects/dical-ibd/||https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296155/|
|BALLET||Likelihood test for balancing selection||Detecting balancing selection||Designed for detecting ancient balancing selection. Does not require phasing||Requires whole-genome data and recombination map. The ancestral state of alleles must be obtained through an outgroup||http://www.personal.psu.edu/mxd60/ballet.html||(DeGiorgio et al., 2014)|
|SCCT||Conditional coalescent tree||Detecting positive selection||Designed for detecting recent positive selection. Claims to be more precise at identifying selected sites||Requires whole-genome data. The ancestral state of alleles must be obtained through an outgroup||https://github.com/wavefancy/scct||(Wang et al., 2014)|
|Trinculo||Association||Detecting association with environmental/phenotypical features||Specifically designed to handle categorical variables with more than 2 categories. Performs multinomial logistic regression and provides frequentist and Bayesian frameworks.||Requires lapack library in Unix. Allows fine-mapping by testing for correlations between adjacent markers.||https://sourceforge.net/projects/trinculo/||(Jostins and McVean, 2016)|
|SAMBADA||Association/Environmental association||Detecting association with environmental/phenotypical features||Designed to be fast, underlying models have been kept simple. Allows conversion from PLINK format. Takes into account spatial autocorrelation of individual genotypes. Allows correction for population structure||Does not work with pooled data. Possibly high levels of false positives. Relatedness between samples should be assessed independently. Should be used in combination with LFMM or BayPass.||http://lasig.epfl.ch/sambada||(Stucki et al., 2016)|
|SelEstim||Population differentiation||Detecting positive selection and local adaptation||Can estimate the coefficients of selection. Calibration using a pseudo-observed dataset to obtain (can be used in combination with the R function simulate.baypass() in BayPass).||Assumes an island model.||http://www1.montpellier.inra.fr/CBGP/software/selestim/||(Vitalis et al., 2014)|
|BetaScan||Test for long-term balancing selection||Detecting balancing selection||Uses the allele frequency spectrum across genomic windows to detect balancing selection. Can use both folded and unfolded (with outgroup) spectra||It is advised to compare results from both folded and unfolded spectra if possible. The size of independent windows to consider should be defined according to some (even rough) estimate of recombination rate.||https://github.com/ksiewert/BetaScan||(Siewert and Voight, 2017)|
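To make the "Population differentiation" class in the table more concrete, the sketch below computes a naive per-SNP FST between two populations from allele frequencies and ranks SNPs by it. It is only an illustration under strong assumptions (two equally weighted populations, known allele frequencies, no sample-size correction, an arbitrary empirical cutoff) and is not the method implemented by BayeScan, FDIST2 or any other tool listed here. The function names `fst_two_pops` and `top_fst_outliers` are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Naive FST = (H_T - H_S) / H_T for one biallelic SNP.

    p1, p2: allele frequencies in two populations, weighted equally
    (assumption: no correction for finite sample size).
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                          # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0                                         # SNP fixed for the same allele in both
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2      # mean within-population heterozygosity
    return (h_t - h_s) / h_t


def top_fst_outliers(freqs, fraction=0.01):
    """Return indices of the SNPs in the top `fraction` of the FST distribution.

    freqs: list of (p1, p2) tuples, one per SNP.
    The cutoff is purely empirical and carries no significance level.
    """
    scores = [fst_two_pops(p1, p2) for p1, p2 in freqs]
    n_keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_keep]


snps = [(0.5, 0.5), (0.1, 0.9), (0.4, 0.6)]
print([round(fst_two_pops(p1, p2), 2) for p1, p2 in snps])
print(top_fst_outliers(snps, fraction=0.34))
```

As the warnings in the table indicate, such an empirical ranking can produce many false positives under demographic expansion, admixture or isolation by distance, which is why the listed methods model demography or calibrate thresholds with simulated or neutral data.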
Aulchenko YS, Ripke S, Isaacs A, van Duijn CM (2007). GenABEL: An R library for genome-wide association analysis. Bioinformatics 23: 1294–1296.
Beaumont MA, Balding DJ (2004). Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol 13: 969–980.
Beeravolu CR, Hickerson MJ, Frantz LAF, Lohse K (2016). Approximate Likelihood Inference of Complex Population Histories and Recombination from Multiple Genomes. bioRxiv: 1–31.
Bonhomme M, Chevalet C, Servin B, Boitard S, Abdallah JM, Blott S, et al. (2010). Detecting Selection in Population Trees: The Lewontin and Krakauer Test Extended. Genetics: 241–262.
Cadzow M, Boocock J, Nguyen HT, Wilcox P, Merriman TR, Black MA (2014). A bioinformatics workflow for detecting signatures of selection in genomic data. Front Genet 5: 1–8.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. (2011). The variant call format and VCFtools. Bioinformatics 27: 2156–2158.
DeGiorgio M, Huber CD, Hubisz MJ, Hellmann I, Nielsen R (2016). SweepFinder2: increased sensitivity, robustness and flexibility. Bioinformatics.
DeGiorgio M, Lohmueller KE, Nielsen R (2014). A model-based approach for identifying signatures of ancient balancing selection in genetic data. PLoS Genet 10: e1004561.
Duforet-Frebourg N, Luu K, Laval G, Bazin E, Blum MGB (2016). Detecting genomic signatures of natural selection with principal component analysis: Application to the 1000 genomes data. Mol Biol Evol 33: 1082–1093.
Ewing G, Hermisson J (2010). MSMS: A coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26: 2064–2065.
Foll M, Gaggiotti O (2008). A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics 180: 977–93.
Frichot E, Schoville SD, Bouchard G, François O (2013). Testing for associations between loci and environmental gradients using latent factor mixed models. Mol Biol Evol 30: 1687–1699.
Garrigan D (2013). POPBAM: Tools for evolutionary analysis of short read sequence alignments. Evol Bioinforma 2013: 343–353.
Garud NR, Messer PW, Buzbas EO, Petrov DA (2015). Recent Selective Sweeps in North American Drosophila melanogaster Show Signatures of Soft Sweeps. PLoS Genet 11: 1–32.
Gautier M (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics 201: 1555–1579.
Gautier M, Klassmann A, Vitalis R (2017). rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Mol Ecol Resour 17: 78–90.
Günther T, Coop G (2013). Robust identification of local adaptation from allele frequencies. Genetics 195: 205–220.
Kemppainen P, Knight CG, Sarma DK, Hlaing T, Prakash A, Maung Maung YN, et al. (2015). Linkage disequilibrium network analysis (LDna) gives a global view of chromosomal inversions, local adaptation and geographic structure. Mol Ecol Resour: 1031–1045.
Kern AD, Schrider DR (2016). Discoal: flexible coalescent simulations with selection. Bioinformatics 32: 3839–3841.
Korneliussen TS, Albrechtsen A, Nielsen R (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15: 356.
Pfeifer B, Wittelsburger U, Ramos-Onsins SE, Lercher MJ (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Mol Biol Evol 31: 1929–1936.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am J Hum Genet 81: 559–575.
Rasmussen MD, Hubisz MJ, Gronau I, Siepel A (2014). Genome-Wide Inference of Ancestral Recombination Graphs. PLoS Genet 10.
Stucki S, Orozco-Terwengel P, Bruford MW, Colli L, Masembe C, Negrini R, et al. (2016). High performance computation of landscape genomic models integrating local indices of spatial association. Mol Ecol Resour.
Szpiech ZA, Hernandez RD (2014). selscan: an efficient multithreaded program to perform EHH-based scans for positive selection. Mol Biol Evol 31: 2824–2827.
Vitalis R, Gautier M, Dawson KJ, Beaumont MA (2014). Detecting and measuring selection from gene frequency data. Genetics 196: 799–817.
Wang M, Huang X, Li R, Xu H, Jin L, He Y (2014). Detecting recent positive selection with high accuracy and reliability by conditional coalescent tree. Mol Biol Evol 31: 3068–3080.
Zhou X, Stephens M (2012). Genome-wide efficient mixed model analysis for association studies. Nat Genet 44: 821–824.