This page summarizes methods for identifying population structure.
Software | Type of method | Purpose | Specifics | Issues and warnings | Link | Reference |
---|---|---|---|---|---|---|
SNPRelate | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as an input | Requires careful interpretation (Jombart et al., 2009) | https://bioconductor.org/packages/release/bioc/html/SNPRelate.html | (Zheng et al., 2012) |
Eigenstrat/smartpca | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as an input | Requires careful interpretation (Jombart et al., 2009) | https://github.com/DReichLab/EIG/tree/master/EIGENSTRAT | (Price et al., 2006) |
DAPC (adegenet) | Multivariate analysis/Clustering | Maximizes divergence between groups identified by PCA | Fast. Less sensitive to HWE assumptions. Claims to be more efficient than Structure | Requires careful interpretation (Jombart et al., 2009) | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2010) |
sPCA (adegenet) | Multivariate analysis/Clustering | Spatially explicit model to assess population structure | Spatially explicit and able to detect cryptic structure. Fast. | Does not take into account HW equilibrium or LD | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2008) |
BEDASSLE | Differentiation and MCMC model testing | Identifies contribution of environment and geographical distance to population differentiation | Less biased than Mantel tests, provides tools for model testing | Uses population-level data. | https://cran.r-project.org/web/packages/BEDASSLE/index.html | (Bradburd et al., 2013) |
GENELAND | Clustering and characterizing admixture | Grouping individuals in spatially consistent clusters maximizing HW equilibrium | Takes into account spatial variation, supposed to detect weak structure, framed in R | Immigrant alleles are assumed to be found only in new immigrants | https://cran.r-project.org/web/packages/Geneland/ | (Guillot et al., 2012) |
sNMF | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | Fast (30X faster than ADMIXTURE) | Still slow computation time for large datasets | http://membres-timc.imag.fr/Olivier.Francois/snmf/index.htm | (Frichot et al., 2014) |
STRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | User friendly interface. Bayesian inference. | Slow for large datasets. Requires specific input format | http://pritchardlab.stanford.edu/structure.html | (Pritchard et al., 2000) |
FastSTRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | ~100X faster than Structure | Approximate inference of the original Structure model | http://rajanil.github.io/fastStructure/ | (Raj et al., 2014) |
ADMIXTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | Maximum Likelihood, claimed to be faster than Structure. Note that it allows mixed ploidy (e.g. individuals that are haploids or diploids at a chromosome/locus depending on their sex can be analyzed jointly). | Often slower than its counterparts | https://www.genetics.ucla.edu/software/admixture/index.html | (Alexander et al., 2009) |
FineStructure/GlobeTrotter | Clustering and characterizing admixture | Chromosome painting, admixture and clustering | Estimates time since admixture, fast, specific tools for RAD-seq, set of scripts to facilitate analysis | Relies on Structure and fastStructure assumptions. Requires phased data. | http://paintmychromosomes.com/ | (Hellenthal et al., 2014) |
PCAdmix | Clustering and characterizing admixture | Chromosome painting | Fast, uses HMM to smooth out windows and limit noise due to low confidence ancestry | Requires a priori definition of ancestral populations and phased haplotypes | https://sites.google.com/site/pcadmix/ | (Brisbin et al., 2012) |
SplitsTree | Phylogeny/Network | Network reconstruction and phylogenetic relationships | User friendly interface, offers a variety of methods for network reconstruction | Mostly descriptive | http://www.splitstree.org/ | (Huson and Bryant, 2006) |
SNPhylo | Phylogeny | Network reconstruction and phylogenetic relationships | Complete pipeline from SNP filtering to tree reconstruction | Should be used on species complexes or divergent populations with little migration | http://chibba.pgml.uga.edu/snphylo/ | (Lee et al., 2014) |
RAxML | Phylogeny | Network reconstruction and phylogenetic relationships | Maximum Likelihood inference of phylogenetic relationships | Should be used on species complexes or divergent populations with little migration | http://sco.h-its.org/exelixis/web/software/raxml/index.html | (Stamatakis, 2014) |
BEAST2 | Phylogeny | Network reconstruction and phylogenetic relationships | User friendly. Can be used to track changes in effective population sizes (Bayesian Skyline Plots). Possible to estimate divergence times | Slow for large datasets. Requires sequence data that can be produced by, e.g., Stacks for RAD-seq data | http://beast2.org/ | (Drummond and Rambaut, 2007; Bouckaert et al., 2014) |
PhyML | Phylogeny | Phylogenetic relationships | Maximum Likelihood inference of phylogenetic relationships. An online version is available | Should be used on species complexes or divergent populations with little migration | http://www.atgc-montpellier.fr/phyml/binaries.php | (Guindon et al., 2010) |
SNAPP | Phylogeny | Phylogenetic relationships | Handles SNP data | Remains slow for medium to large datasets (>1,000 SNPs) | http://beast2.org/snapp/ | (Bryant et al., 2012) |
*BEAST | Phylogeny and species tree inference | Divergence time estimation and phylogenetic relationships | Outputs a species tree instead of a concatenated gene tree. Allows for testing consistency between phylogenetic signals at different loci | Slow for large datasets. Requires sequence data. Not suited for situations where gene flow/admixture occurs | http://beast2.org/ | (Heled and Drummond, 2010) |
TREEMIX | Clustering and characterizing admixture | Admixture graph, infers most likely admixture events in a tree | Based on allele frequencies and can be used for pooled data. | Requires multiple runs to properly assess the likelihood of each model | https://bitbucket.org/nygcresearch/treemix/src | (Pickrell and Pritchard, 2012) |
TWISST | Topology weighting | Chromosome painting, clustering and branching between populations | Retrieves the most likely coalescence pattern between several taxa along the genome. Can be seen as an extension of the ABBA/BABA test | Needs a priori grouping of individuals into taxa. Requires at least 4 taxa. Impractical for more than 6 taxa. Window size must include enough SNPs to retrieve the correct topology but at the risk that regions with different histories are included | https://github.com/simonhmartin/twisst | (Martin and Van Belleghem, 2016) |
LAMP | Pedigree, Identity by descent/state | Chromosome painting, relatedness | LAMP also allows for association and pedigree analyses | Identifies local ancestry in windows (source of noise), requires phased data | http://lamp.icsi.berkeley.edu/lamp/ | (Baran et al., 2012) |
PLINK | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Allows studying identity by descent and by state. PLINK is a multi-purpose tool, facilitating data analysis within the same software | NA | http://pngu.mgh.harvard.edu/~purcell/plink/ | (Purcell et al., 2007) |
VCFTOOLS | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Computes unadjusted Ajk and kinship coefficient | NA | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
KING | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness, multivariate analysis | Mendelian error checking, testing family structure, highly accurate kinship coefficient, association analysis, population structure inference | Kinship coefficient also computed in VCFTOOLS | http://people.virginia.edu/~wc9c/KING/Download.htm | (Manichaikul et al., 2010) |
BAYPASS/Bayenv | Variance/covariance matrix | Building a population covariance matrix across population allele frequencies, similar to TREEMIX | Can handle pooled data | Matrices are mostly designed to provide a neutral model for assessing selection, but can be used to infer population structure | http://www1.montpellier.inra.fr/CBGP/software/baypass/ ; https://bitbucket.org/tguenther/bayenv2_public/src | (Günther and Coop, 2013; Gautier, 2015) |
Arlequin | AMOVA (Analysis of MOlecular VAriance) | Characterizing hierarchical population structure | Arlequin allows for a variety of other analyses of diversity | Requires a priori assignment of individuals to populations, data formatting is required prior to analysis | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
POPTREE2 | Genetic distance | Visualizing a matrix of pairwise differentiation statistics as a tree | Can be used for pooled datasets, several statistics can be used | Differentiation measures alone do not necessarily retrieve the actual history of populations | http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html | (Takezaki et al., 2010) |
Stacks | Differentiation/Diversity/Phylogeny | Processing RAD-seq data and facilitating their analysis | Designed for RAD-seq data, variety of output formats for downstream analyses. Allows retrieval of DNA sequences for each locus | NA | http://catchenlab.life.illinois.edu/stacks/ | (Catchen et al., 2011) |
Popoolation/Popoolation2/Popoolation TE | Differentiation/Diversity | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data | Mostly limited to a few summary statistics. A pipeline dedicated to TE detection is also available | https://sourceforge.net/p/popoolation/wiki/Main/ | (Kofler, Orozco-terWengel, et al., 2011; Kofler, Pandey, et al., 2011) |
POPGenome | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Accepts VCF and GFF/GTF files, efficient and fast. Tests for admixture available (ABBA BABA test). Includes basic coalescence simulations (ms and msms) | Mostly limited to summary statistics (but coalescent simulations are possible). No built-in SNP calling module | https://cran.r-project.org/web/packages/PopGenome/index.html | (Pfeifer et al., 2014) |
ANGSD | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Able to process BAM files, built-in procedures for data filtering, admixture analysis | Mostly limited to summary statistics | https://github.com/ANGSD/angsd | (Korneliussen et al., 2014) |
Arlequin | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Can output AFS for further analysis in fastsimcoal2 | Slower than PopGenome, requires its own input format | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
VCFTOOLS | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Fast. VCFTOOLS can also be used for SNP filtering | Fewer summary statistics than POPGenome | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
LDHat | Recombination | Estimating variation in recombination rates along a genome | Handles unphased and missing data, underlying model can be used for organisms such as viruses or bacteria | Limited to 300 sequences, requires its own input format, model for recombination hotspots based on human data | http://ldhat.sourceforge.net/ | (McVean et al., 2002) |
LDHot | Recombination | Identifying recombination hotspots | Specifically designed for detecting recombination hotspots | Requires phased data; works in conjunction with LDHat | https://github.com/auton1/LDhot | (Myers et al., 2005) |
Kimtree | Genetic distance | Estimating divergence time between populations and testing for topologies | The method is conditional on a prior topology provided by the user. It computes DIC for a given topology, making it possible to compare candidate topologies and select the best one. | Times are given in diffusion time scale, and can be converted into demographic times using independent estimates of Ne. | http://www1.montpellier.inra.fr/CBGP/software/kimtree/index.html | (Gautier and Vitalis, 2013) |
npstat | Differentiation/Diversity | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data. Allows computing tests using an outgroup (MK test, Fay and Wu's H) and characterizing coding mutations. | Mostly limited to summary statistics, but more complete than Popoolation. | https://github.com/lucaferretti/npstat | (Ferretti et al., 2013) |
SVDQuartets | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS) | See ASTRAL-2 and Chou et al. 2015 | http://www.stat.osu.edu/~lkubatko/software/SVDquartets/ | (Chifman and Kubatko, 2014) |
ASTRAL-2 | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS) | More reliable under high incomplete lineage sorting than SVDQuartets and NJst (Chou et al. 2015) | https://github.com/smirarab/ASTRAL | (Mirarab and Warnow, 2015) |
NJst (in phybase) | Phylogeny | Builds species trees using short non-recombining sequences | Coalescence-based. Suitable for short loci (e.g. RAD-seq and GBS) | See ASTRAL-2 and Chou et al. 2015 | https://code.google.com/archive/p/phybase/downloads | (Liu and Yu, 2011) |
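
The multivariate tools listed at the top of the table (SNPRelate, Eigenstrat/smartpca) rest on a principal component analysis of the genotype matrix. The block below is a minimal sketch of that idea with NumPy, not the implementation used by any of these programs. The function name `genotype_pca`, the toy data, and the choice to scale each SNP by its expected binomial standard deviation are assumptions made for illustration; real datasets also need filtering (missing data, LD pruning, minor allele frequency) that is not shown here.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """PCA of an (individuals x SNPs) matrix of 0/1/2 alternate-allele counts.

    Hypothetical helper for illustration only; assumes diploid,
    biallelic SNPs and no missing genotypes.
    """
    g = np.asarray(genotypes, dtype=float)
    # Allele frequency per SNP (diploid: mean allele count divided by 2)
    p = g.mean(axis=0) / 2.0
    # Drop monomorphic SNPs: they carry no signal and would divide by zero
    keep = (p > 0) & (p < 1)
    g, p = g[:, keep], p[keep]
    # Center each SNP and scale by its expected binomial standard deviation
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    # SVD of the standardized matrix; u * s gives individual coordinates
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    # Proportion of total variance carried by each retained axis
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]


# Toy data: two simulated groups of 10 individuals with different frequencies
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(10, 200))
pop_b = rng.binomial(2, freq_b, size=(10, 200))
coords, explained = genotype_pca(np.vstack([pop_a, pop_b]))
print(coords[:, 0])   # coordinates of the 20 individuals on the first axis
print(explained)      # share of variance on the first two axes
```

Plotting the first two columns of `coords` gives the usual PCA scatter of individuals. As the table warns, such plots require careful interpretation: axes summarize variance and do not by themselves identify discrete populations.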

References
Alexander DH, Novembre J, Lange K (2009). Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Res 19: 1655–1664.
Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. (2012). Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28: 1359–1367.
Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, et al. (2014). BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Comput Biol 10: 1–6.
Bradburd GS, Ralph PL, Coop GM (2013). Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution 67: 3258–3273.
Brisbin A, Bryc K, Byrnes J, Zakharia F, Omberg L, Degenhardt J, et al. (2012). PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations. Hum Biol 84: 343–364.
Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, Roychoudhury A (2012). Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Mol Biol Evol 29: 1917–1932.
Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH (2011). Stacks: building and genotyping Loci de novo from short-read sequences. G3 (Bethesda) 1: 171–82.
Chifman J, Kubatko L (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics 30: 3317–3324.
Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, et al. (2015). A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16: S2.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. (2011). The variant call format and VCFtools. Bioinformatics 27: 2156–2158.
Drummond AJ, Rambaut A (2007). BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7: 214.
Excoffier L, Lischer HEL (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10: 564–7.
Ferretti L, Ramos-Onsins SE, Perez-Enciso M (2013). Population genomics from pool sequencing. Mol Ecol 22: 5561–5576.
Frichot E, Mathieu F, Trouillon T, Bouchard G, François O (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics 196: 973–983.
Gautier M, Vitalis R (2013). Inferring population histories using genome-wide allele frequency data. Mol Biol Evol 30: 654–68.
Gautier M (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics 201: 1555–1579.
Guillot G, Renaud S, Ledevin R, Michaux J, Claude J (2012). A unifying model for the analysis of phenotypic, genetic, and geographic data. Syst Biol 61: 897–911.
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst Biol 59: 307–321.
Günther T, Coop G (2013). Robust identification of local adaptation from allele frequencies. Genetics 195: 205–220.
Heled J, Drummond AJ (2010). Bayesian Inference of Species Trees from Multilocus Data. Mol Biol Evol 27: 570–580.
Hellenthal G, Busby GBJ, Band G, Wilson JF, Capelli C, Falush D, et al. (2014). A Genetic Atlas of Human Admixture History. Science 343: 747–751.
Huson DH, Bryant D (2006). Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23: 254–267.
Jombart T, Devillard S, Balloux F (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet 11: 94.
Jombart T, Devillard S, Dufour A-B, Pontier D (2008). Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity (Edinb) 101: 92–103.
Kofler R, Orozco-terWengel P, De Maio N, Pandey RV, Nolte V, Futschik A, et al. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One 6: e15925.
Kofler R, Pandey RV, Schlötterer C (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics 27: 3435–6.
Korneliussen TS, Albrechtsen A, Nielsen R (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15: 356.
Lee T-H, Guo H, Wang X, Kim C, Paterson AH (2014). SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15: 162.
Liu L, Yu L (2011). Estimating species trees from unrooted gene trees. Syst Biol 60: 661–667.
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen W-M (2010). Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867–2873.
Martin SH, Van Belleghem SM (2016). Exploring evolutionary relationships across the genome using topology weighting. bioRxiv: 69112.
McVean G, Awadalla P, Fearnhead P (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.
Mirarab S, Warnow T (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31: i44–i52.
Myers S, Bottolo L, Freeman C, McVean G, Donnelly P (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Science 310: 321–324.
Pfeifer B, Wittelsburger U, Ramos-Onsins SE, Lercher MJ (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Mol Biol Evol 31: 1929–1936.
Pickrell JK, Pritchard JK (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet 8: e1002967.
Price A, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–9.
Pritchard JK, Stephens M, Donnelly P (2000). Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am J Hum Genet 81: 559–575.
Raj A, Stephens M, Pritchard JK (2014). FastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics 197: 573–589.
Stamatakis A (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.
Takezaki N, Nei M, Tamura K (2010). POPTREE2: Software for constructing population trees from allele frequency data and computing other population statistics with windows interface. Mol Biol Evol 27: 747–752.
Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28: 3326–3328.