Methods to infer population structure and explore datasets

This page summarizes software for inferring population structure and exploring genomic datasets.

| Software | Type of method | Purpose | Specifics | Issues and warnings | Link | Reference |
|---|---|---|---|---|---|---|
| SNPRelate | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as input | Requires careful interpretation (Jombart et al., 2009) | https://bioconductor.org/packages/release/bioc/html/SNPRelate.html | (Zheng et al., 2012) |
| Eigenstrat/smartpca | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as input | Requires careful interpretation (Jombart et al., 2009) | https://github.com/DReichLab/EIG/tree/master/EIGENSTRAT | (Price et al., 2006) |
| DAPC (adegenet) | Multivariate analysis/Clustering | Maximizes divergence between groups identified by PCA | Fast. Less sensitive to HWE assumptions. Claimed to be more efficient than STRUCTURE | Requires careful interpretation (Jombart et al., 2009) | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2010) |
| sPCA (adegenet) | Multivariate analysis/Clustering | Spatially explicit model to assess population structure | Spatially explicit and able to detect cryptic structure. Fast | Does not take into account HW equilibrium or LD | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2008) |
| BEDASSLE | Differentiation and MCMC model testing | Identifies the contributions of environment and geographical distance to population differentiation | Less biased than Mantel tests, provides tools for model testing | Uses population-level data | https://cran.r-project.org/web/packages/BEDASSLE/index.html | (Bradburd et al., 2013) |
| GENELAND | Clustering and characterizing admixture | Grouping individuals in spatially consistent clusters maximizing HW equilibrium | Takes into account spatial variation, supposed to detect weak structure, implemented in R | Immigrant alleles are assumed to be found only in new immigrants | https://cran.r-project.org/web/packages/Geneland/ | (Guillot et al., 2012) |
| sNMF | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW and linkage equilibrium between loci | Fast (30X faster than ADMIXTURE) | Computation is still slow for large datasets | http://membres-timc.imag.fr/Olivier.Francois/snmf/index.htm | (Frichot et al., 2014) |
| STRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW and linkage equilibrium between loci | User-friendly interface. Bayesian inference | Slow for large datasets. Requires a specific input format | http://pritchardlab.stanford.edu/structure.html | (Pritchard et al., 2000) |
| FastSTRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW and linkage equilibrium between loci | ~100X faster than STRUCTURE | Approximate inference of the original STRUCTURE model | http://rajanil.github.io/fastStructure/ | (Raj et al., 2014) |
| ADMIXTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW and linkage equilibrium between loci | Maximum likelihood, claimed to be faster than STRUCTURE. Allows mixed ploidy (e.g. individuals that are haploid or diploid at a chromosome/locus depending on their sex can be analyzed jointly) | Often slower than its counterparts | https://www.genetics.ucla.edu/software/admixture/index.html | (Alexander and Novembre, 2009) |
| FineStructure/GlobeTrotter | Clustering and characterizing admixture | Chromosome painting, admixture and clustering | Estimates time since admixture, fast, specific tools for RAD-seq, set of scripts to facilitate analysis | Relies on STRUCTURE and fastSTRUCTURE assumptions. Requires phased data | http://paintmychromosomes.com/ | (Hellenthal et al., 2014) |
| PCAdmix | Clustering and characterizing admixture | Chromosome painting | Fast, uses an HMM to smooth out windows and limit noise due to low-confidence ancestry | Requires a priori definition of ancestral populations and phased haplotypes | https://sites.google.com/site/pcadmix/ | (Brisbin et al., 2012) |
| Splitstree | Phylogeny/Network | Network reconstruction and phylogenetic relationships | User-friendly interface, offers a variety of methods for network reconstruction | Mostly descriptive | http://www.splitstree.org/ | (Huson and Bryant, 2006) |
| SNPhylo | Phylogeny | Network reconstruction and phylogenetic relationships | Complete pipeline from SNP filtering to tree reconstruction | Should be used on species complexes or divergent populations with little migration | http://chibba.pgml.uga.edu/snphylo/ | (Lee et al., 2014) |
| RAxML | Phylogeny | Network reconstruction and phylogenetic relationships | Maximum likelihood inference of phylogenetic relationships | Should be used on species complexes or divergent populations with little migration | http://sco.h-its.org/exelixis/web/software/raxml/index.html | (Stamatakis, 2014) |
| BEAST2 | Phylogeny | Network reconstruction and phylogenetic relationships | User friendly. Can be used to track changes in effective population size (Bayesian Skyline Plots). Can estimate divergence times | Slow for large datasets. Requires sequence data, which can be produced by, e.g., Stacks for RAD-seq data | http://beast2.org/ | (Drummond and Rambaut, 2007; Bouckaert et al., 2014) |
| PhyML | Phylogeny | Phylogenetic relationships | Maximum likelihood inference of phylogenetic relationships. An online version is available | Should be used on species complexes or divergent populations with little migration | http://www.atgc-montpellier.fr/phyml/binaries.php | (Guindon et al., 2010) |
| SNAPP | Phylogeny | Phylogenetic relationships | Handles SNP data | Remains slow for medium to large datasets (>1,000 SNPs) | http://beast2.org/snapp/ | (Bryant et al., 2012) |
| *BEAST | Phylogeny and species tree inference | Divergence time estimation and phylogenetic relationships | Outputs a species tree instead of a concatenated gene tree. Allows testing consistency between phylogenetic signals at different loci | Slow for large datasets. Requires sequence data. Not suited to situations where gene flow/admixture occurs | http://beast2.org/ | (Heled and Drummond, 2010) |
| TREEMIX | Clustering and characterizing admixture | Admixture graph, infers most likely admixture events in a tree | Based on allele frequencies and can be used for pooled data | Requires multiple runs to properly assess the likelihood of each model | https://bitbucket.org/nygcresearch/treemix/src | (Pickrell and Pritchard, 2012) |
| TWISST | Topology weighting | Chromosome painting, clustering and branching between populations | Retrieves the most likely coalescence pattern between several taxa along the genome. Can be seen as an extension of the ABBA/BABA test | Needs a priori grouping of individuals into taxa. Requires at least 4 taxa. Impractical for more than 6 taxa. Windows must include enough SNPs to retrieve the correct topology, at the risk of including regions with different histories | https://github.com/simonhmartin/twisst | (Martin and Van Belleghem, 2016) |
| LAMP | Pedigree, Identity by descent/state | Chromosome painting, relatedness | LAMP also allows for association and pedigree analyses | Identifies local ancestry in windows (source of noise), requires phased data | http://lamp.icsi.berkeley.edu/lamp/ | (Baran et al., 2012) |
| PLINK | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Allows studying identity by descent and by state. PLINK is a multi-purpose tool, facilitating data analysis within the same software | NA | http://pngu.mgh.harvard.edu/~purcell/plink/ | (Purcell et al., 2007) |
| VCFTOOLS | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Computes unadjusted Ajk and kinship coefficient | NA | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
| KING | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness, multivariate analysis | Mendelian error checking, testing family structure, highly accurate kinship coefficient, association analysis, population structure inference | Kinship coefficient also computed in VCFTOOLS | http://people.virginia.edu/~wc9c/KING/Download.htm | (Manichaikul et al., 2010) |
| BAYPASS/Bayenv | Variance/covariance matrix | Building a population covariance matrix across population allele frequencies, similar to TREEMIX | Can handle pooled data | Matrices are mostly designed to provide a neutral model for assessing selection, but can be used to infer population structure | http://www1.montpellier.inra.fr/CBGP/software/baypass/ ; https://bitbucket.org/tguenther/bayenv2_public/src | (Günther and Coop, 2013; Gautier, 2015) |
| Arlequin | AMOVA (Analysis of MOlecular VAriance) | Characterizing hierarchical population structure | Arlequin allows for a variety of other analyses of diversity | Requires a priori assignment of individuals to populations; data must be formatted prior to analysis | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
| POPTREE2 | Genetic distance | Visualizing a matrix of pairwise differentiation statistics as a tree | Can be used for pooled datasets, several statistics can be used | Differentiation measures alone do not necessarily retrieve the actual history of populations | http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html | (Takezaki et al., 2010) |
| Stacks | Differentiation/Diversity/Phylogeny | Processing RAD-seq data and facilitating their analysis | Designed for RAD-seq data, variety of output formats for downstream analyses. Can retrieve DNA sequences for each locus | NA | http://catchenlab.life.illinois.edu/stacks/ | (Catchen et al., 2011) |
| Popoolation/Popoolation2/Popoolation TE | Differentiation/Diversity | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data | Mostly limited to a few summary statistics. A pipeline dedicated to TE detection is also available | https://sourceforge.net/p/popoolation/wiki/Main/ | (Kofler, Orozco-terWengel, et al., 2011; Kofler, Pandey, et al., 2011) |
| POPGenome | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Accepts VCF and GFF/GTF files, efficient and fast. Tests for admixture available (ABBA BABA test). Includes basic coalescent simulations (ms and msms) | Mostly limited to summary statistics (but coalescent simulations are possible). No built-in SNP calling module | http://catchenlab.life.illinois.edu/stacks/ | (Pfeifer et al., 2014) |
| ANGSD | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Able to process BAM files, built-in procedures for data filtering, admixture analysis | Mostly limited to summary statistics | https://github.com/ANGSD/angsd | (Korneliussen et al., 2014) |
| Arlequin | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Can output AFS for further analysis in fastsimcoal2 | Slower than PopGenome, requires its own input format | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
| VCFTOOLS | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Fast. VCFTOOLS can also be used for SNP filtering | Fewer summary statistics than POPGenome | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
| LDHat | Recombination | Estimating variation in recombination rates along a genome | Handles unphased and missing data, underlying model can be used for organisms such as viruses or bacteria | Limited to 300 sequences, requires its own input format, model for recombination hotspots based on human data | http://ldhat.sourceforge.net/ | (McVean et al., 2002) |
| LDHot | Recombination | Identifying recombination hotspots | Specifically designed for detecting recombination hotspots | Requires phased data, works together with LDHat | https://github.com/auton1/LDhot | (Myers, 2005) |
| Kimtree | Genetic distance | Estimating divergence time between populations and testing for topologies | The method is conditional on a prior topology provided by the user. It computes the DIC for a given topology, so that competing topologies can be compared | Times are given on a diffusion time scale and can be converted to demographic times using independent estimates of Ne | http://www1.montpellier.inra.fr/CBGP/software/kimtree/index.html | (Gautier and Vitalis, 2013) |
| npstat | Differentiation/Diversity | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data. Allows computing tests using an outgroup (MK test, Fay and Wu's H) and characterizing coding mutations | Mostly limited to summary statistics, but more complete than Popoolation | https://github.com/lucaferretti/npstat | (Ferretti et al., 2013) |
| SVDQuartets | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS) | See ASTRAL-2 and Chou et al., 2015 | http://www.stat.osu.edu/~lkubatko/software/SVDquartets/ | (Chifman and Kubatko, 2014) |
| ASTRAL-2 | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS) | More reliable under high incomplete lineage sorting than SVDQuartets and NJst (Chou et al., 2015) | https://github.com/smirarab/ASTRAL | (Mirarab and Warnow, 2015) |
| NJst (in phybase) | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS) | See ASTRAL-2 and Chou et al., 2015 | https://code.google.com/archive/p/phybase/downloads | (Liu and Yu, 2011) |
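
To make the "Multivariate analysis" entries at the top of the table more concrete, here is a minimal Python sketch of a PCA on a genotype matrix. It is not the implementation used by any of the tools listed; it only illustrates the general idea of summarizing variance across loci. The function name `genotype_pca`, the scaling by the binomial standard deviation sqrt(2p(1-p)), the mean imputation of missing calls and the simulated data are all assumptions made for this example.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """PCA of an individuals x SNPs matrix of 0/1/2 allele counts.

    Missing genotypes are expected to be coded as np.nan.
    Returns (scores, explained): PC coordinates of each individual and the
    fraction of total variance carried by each retained component.
    """
    g = np.asarray(genotypes, dtype=float)
    # Allele frequency per SNP, estimated from the non-missing calls.
    p = np.nanmean(g, axis=0) / 2.0
    # Monomorphic SNPs carry no information and would cause division by zero.
    keep = (p > 0) & (p < 1)
    g, p = g[:, keep], p[keep]
    # Centre each SNP and scale it by its expected binomial standard deviation
    # (an assumption of this sketch; real tools may normalize differently).
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    # After centring, setting missing values to 0 is mean imputation.
    z[np.isnan(z)] = 0.0
    # The SVD of the standardized matrix gives the principal components.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]


# Simulated example (hypothetical data): two populations of 20 diploid
# individuals each, with allele frequencies that have drifted apart.
rng = np.random.default_rng(0)
n_snps = 500
freq_pop1 = rng.uniform(0.1, 0.9, n_snps)
freq_pop2 = np.clip(freq_pop1 + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
pop1 = rng.binomial(2, freq_pop1, size=(20, n_snps))
pop2 = rng.binomial(2, freq_pop2, size=(20, n_snps))

scores, explained = genotype_pca(np.vstack([pop1, pop2]))
print("PC1 mean, population 1:", scores[:20, 0].mean())
print("PC1 mean, population 2:", scores[20:, 0].mean())
print("Variance explained by PC1 and PC2:", explained)
```

In this simulated setting the two populations should fall on opposite sides of the first axis. As the "Issues and warnings" column notes, such plots still require careful interpretation, since the axes reflect sampling design and marker choice as well as population history.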

References

Alexander DH, Novembre J (2009). Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Res 19: 1655–1664.

Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. (2012). Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28: 1359–1367.

Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, et al. (2014). BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Comput Biol 10: 1–6.

Bradburd GS, Ralph PL, Coop GM (2013). Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 67: 3258–3273.

Brisbin A, Bryc K, Byrnes J, Zakharia F, Omberg L, Degenhardt J, et al. (2012). PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations. Hum Biol 84: 343–364.

Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, Roychoudhury A (2012). Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Mol Biol Evol 29: 1917–1932.

Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH (2011). Stacks: building and genotyping Loci de novo from short-read sequences. G3 (Bethesda) 1: 171–82.

Chifman J, Kubatko L (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics 30: 3317–3324.

Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, et al. (2015). A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16: S2.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. (2011). The variant call format and VCFtools. Bioinformatics 27: 2156–2158.

Drummond AJ, Rambaut A (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7: 214.

Excoffier L, Lischer HEL (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10: 564–7.

Ferretti L, Ramos-Onsins SE, Pérez-Enciso M (2013). Population genomics from pool sequencing. Mol Ecol 22: 5561–5576.

Frichot E, Mathieu F, Trouillon T, Bouchard G, François O (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics 196: 973–983.

Gautier M, Vitalis R (2013). Inferring population histories using genome-wide allele frequency data. Mol Biol Evol 30: 654–68.

Gautier M (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics 201: 1555–1579.

Guillot G, Renaud S, Ledevin R, Michaux J, Claude J (2012). A unifying model for the analysis of phenotypic, genetic, and geographic data. Syst Biol 61: 897–911.

Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst Biol 59: 307–321.

Günther T, Coop G (2013). Robust identification of local adaptation from allele frequencies. Genetics 195: 205–220.

Heled J, Drummond AJ (2010). Bayesian Inference of Species Trees from Multilocus Data. Mol Biol Evol 27: 570–580.

Hellenthal G, Busby GBJ, Band G, Wilson JF, Capelli C, Falush D, et al. (2014). A Genetic Atlas of Human Admixture History. Science 343: 747–751.

Huson DH, Bryant D (2006). Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23: 254–267.

Jombart T, Devillard S, Balloux F (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet 11: 94.

Jombart T, Devillard S, Dufour A-B, Pontier D (2008). Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity 101: 92–103.

Kofler R, Orozco-terWengel P, De Maio N, Pandey RV, Nolte V, Futschik A, et al. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One 6: e15925.

Kofler R, Pandey RV, Schlötterer C (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics 27: 3435–6.

Korneliussen TS, Albrechtsen A, Nielsen R (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15: 356.

Lee T-H, Guo H, Wang X, Kim C, Paterson AH (2014). SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15: 162.

Liu L, Yu L (2011). Estimating species trees from unrooted gene trees. Syst Biol 60: 661–667.

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.

Martin SH, Van Belleghem SM (2016). Exploring evolutionary relationships across the genome using topology weighting. bioRxiv: 69112.

McVean G, Awadalla P, Fearnhead P (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.

Mirarab S, Warnow T (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31: i44–i52.

Myers S (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Science 310: 321–324.

Pfeifer B, Wittelsburger U, Ramos-Onsins SE, Lercher MJ (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Mol Biol Evol 31: 1929–1936.

Pickrell JK, Pritchard JK (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet 8: e1002967.

Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909.

Pritchard JK, Stephens M, Donnelly P (2000). Inference of population structure using multilocus genotype data. Genetics 155: 945–959.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am J Hum Genet 81: 559–575.

Raj A, Stephens M, Pritchard JK (2014). FastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics 197: 573–589.

Stamatakis A (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.

Takezaki N, Nei M, Tamura K (2010). POPTREE2: Software for constructing population trees from allele frequency data and computing other population statistics with windows interface. Mol Biol Evol 27: 747–752.

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28: 3326–3328.