Methods to infer population structure and explore datasets

This page summarizes methods for identifying population structure and exploring datasets.

| Software | Class of method | Purpose | Specifics | Issues and warnings | Link | Reference |
|---|---|---|---|---|---|---|
| Arlequin | AMOVA | Characterizing hierarchical population structure | Arlequin allows for a variety of other analyses of diversity | Requires a priori assignment of individuals to populations. Data must be formatted prior to analysis | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
| GENELAND | Clustering and characterizing admixture | Grouping individuals in spatially consistent clusters maximizing HW equilibrium | Takes into account spatial variation, supposed to detect weak structure, implemented in R | Immigrant alleles are assumed to be found only in new immigrants | https://cran.r-project.org/web/packages/Geneland/ | (Guillot et al., 2012) |
| sNMF | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and minimizing LD between loci | Fast (30X faster than ADMIXTURE) | Computation is still slow for very large datasets | http://membres-timc.imag.fr/Olivier.Francois/snmf/index.htm | (Frichot et al., 2014) |
| STRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and minimizing LD between loci | User-friendly interface. Bayesian inference. | Not suited for whole genomes. Requires specific input format. Might be used on a small set of high-quality markers for small genomes. | http://pritchardlab.stanford.edu/structure.html | (Pritchard et al., 2000) |
| FastSTRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and minimizing LD between loci | ~100X faster than Structure | Approximate inference of the original Structure model | http://rajanil.github.io/fastStructure/ | (Raj et al., 2014) |
| ADMIXTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and minimizing LD between loci | Maximum likelihood, claimed to be faster than Structure. Can handle sex-linked markers | Often slower than its counterparts | https://www.genetics.ucla.edu/software/admixture/index.html | (Alexander and Novembre, 2009) |
| FineStructure/GlobeTrotter | Clustering and characterizing admixture | Chromosome painting, admixture and clustering | Estimates time since admixture, fast, set of scripts to facilitate analysis | Relies on Structure and fastStructure assumptions. Requires phased data. | http://paintmychromosomes.com/ | (Hellenthal et al., 2014) |
| PCAdmix | Clustering and characterizing admixture | Chromosome painting | Fast, uses an HMM to smooth out windows and limit noise due to low-confidence ancestry | Requires a priori definition of ancestral populations and phased haplotypes | https://sites.google.com/site/pcadmix/ | (Brisbin et al., 2012) |
| MOSAIC | Clustering and characterizing admixture | Chromosome painting, estimating admixture time and proportions | Can handle several source populations. These populations do not have to be good surrogates of the populations that actually mixed. | Requires phased data, but performs phasing error correction. | https://maths.ucd.ie/~mst/MOSAIC/ | (Salter-Townshend and Myers, 2019) |
| tfa | Clustering and characterizing admixture | Summarizing variance across loci and visualizing inter-individual genetic distance | Uses latent factors to correct for drift and position ancient samples in a PCA-like framework. | NA | https://bcm-uga.github.io/tfa/ | (François and Jay, 2020) |
| Dystruct | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and minimizing LD between loci | Explicitly takes into account the age of samples. Useful when analyzing mixtures of modern and ancient samples. | Requires a genotype matrix in the eigenstrat format. Primarily tested on human data. | https://github.com/tyjo/dystruct | (Joseph and Pe’er, 2019) |
| BEDASSLE | Differentiation and MCMC model testing | Identifies the contribution of environment and geographical distance to population differentiation | Less biased than Mantel tests, provides tools for model testing | Uses population-level data. | https://cran.r-project.org/web/packages/BEDASSLE/index.html | (Bradburd et al., 2013) |
| npstat | Differentiation/Diversity | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data. Allows computing tests using an outgroup (MK test, HKA test, Fay and Wu's H) and characterizing coding mutations. | Mostly limited to summary statistics, but more complete than Popoolation. | https://github.com/lucaferretti/npstat | (Ferretti et al., 2013) |
| Popoolation/Popoolation2/Popoolation TE | Differentiation/Diversity/Recombination | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data. Can be used to detect TE polymorphisms. | Mostly limited to a few summary statistics. A pipeline dedicated to TE detection is also available | https://sourceforge.net/p/popoolation/wiki/Main/ | (Kofler, Orozco-terWengel, et al., 2011; Kofler, Pandey, et al., 2011) |
| POPGenome | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Accepts VCF and GFF/GTF files, efficient and fast. Tests for admixture available (ABBA BABA test). Includes basic coalescent simulations (ms and msms) | Mostly limited to summary statistics (but coalescent simulations are possible). No built-in SNP calling module | http://catchenlab.life.illinois.edu/stacks/ | (Pfeifer et al., 2014) |
| ANGSD | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Able to process BAM files, built-in procedures for data filtering, admixture analysis. Suited for low-depth data. Includes a suite of methods to estimate relatedness (NGSRelate). | Mostly limited to summary statistics. Tutorials not always up to date. | https://github.com/ANGSD/angsd and https://github.com/ANGSD/NgsRelate | (Korneliussen et al., 2014; Hanghøj et al., 2019) |
| Arlequin | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Can output AFS for further analysis in fastsimcoal2 | Slower than PopGenome, requires a specific format and file conversion. | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
| VCFTOOLS | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Fast. VCFTOOLS can also be used for SNP filtering | Fewer summary statistics than POPGenome | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
| ATLAS | Differentiation/Diversity/Recombination | Analysis of low-depth sequencing data and ancient samples | Particularly suited for analyzing ancient samples. Includes sets of tools to call variants and to estimate post-mortem damage, inbreeding, and genetic diversity. Produces the input file for PSMC (demography from a single diploid genome) | Better used in combination with GATK pipelines. Still in development | https://bitbucket.org/wegmannlab/atlas/wiki/Home | (Link et al., 2017) |
| POPTREE2 | Genetic differentiation | Visualizing a matrix of pairwise differentiation statistics as a tree | Can be used for pooled datasets, several statistics can be used | Differentiation measures alone do not necessarily retrieve the actual history of populations | http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html | (Takezaki et al., 2010) |
| EEMS | Landscape genomics | Estimating barriers to gene flow in a spatial context | Estimates pairwise relatedness between all samples, and compares it to isolation-by-distance expectations to identify barriers to gene flow and corridors of higher connectivity. Can handle both haploid and diploid data | Requires converting VCF files into PLINK binary format. Estimates effective migration rates (does not disentangle migration rates and effective population sizes). Setting parameters for the MCMC chain requires some trial and error | https://github.com/dipetkov/eems | (Petkova et al., 2015) |
| MAPS | Landscape genomics | Estimating barriers to gene flow in a spatio-temporal context | Expands on EEMS, but takes into account the phase to reconstruct past changes in connectivity. Can disentangle migration rates and effective population sizes (unlike EEMS) | Relies on identity-by-descent tracks, requiring phasing (for example using BEAGLE). A pipeline to obtain IBD tracks is available, with a few details here: https://github.com/halasadi/ibd_data_pipeline/issues/1 | https://github.com/halasadi/MAPS | (Al-Asadi et al., 2019) |
| TESS3R | Landscape genomics | Grouping individuals in clusters maximizing HW equilibrium and minimizing LD between loci | Incorporates geographic information of samples. Can run genome scans of selection based on contrasting ancestral and modern allele frequencies. | Importing data requires using conversion tools found in the LEA suite | https://bcm-uga.github.io/TESS3_encho_sen/ | (Caye et al., 2016) |
| SNPRelate | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as an input | Requires careful interpretation (Jombart et al. 2009) | https://bioconductor.org/packages/release/bioc/html/SNPRelate.html | (Zheng et al., 2012) |
| Eigenstrat/smartpca | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as an input | Requires careful interpretation (Jombart et al. 2009) | https://github.com/DReichLab/EIG/tree/master/EIGENSTRAT | (Price et al., 2006) |
| DAPC (adegenet) | Multivariate analysis/Clustering | Maximizes divergence between groups identified by PCA | Fast. Less sensitive to HWE assumptions. Claims to be more efficient than Structure | Requires careful interpretation (Jombart et al. 2009) | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2010) |
| sPCA (adegenet) | Multivariate analysis/Clustering | Spatially explicit model to assess population structure | Spatially explicit and able to detect cryptic structure. Fast. | Does not take into account HW equilibrium or LD | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2008) |
| LAMP | Pedigree, Identity by descent/state | Chromosome painting, relatedness | LAMP also allows for association and pedigree analyses | Identifies local ancestry in windows (source of noise), requires phased data | http://lamp.icsi.berkeley.edu/lamp/ | (Baran et al., 2012) |
| PLINK | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Allows studying identity by descent and by state. PLINK is a multi-purpose tool, facilitating data analysis within the same software | NA | http://pngu.mgh.harvard.edu/~purcell/plink/ | (Purcell et al., 2007) |
| VCFTOOLS | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Computes unadjusted Ajk and kinship coefficient | NA | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
| KING | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness, multivariate analysis | Mendelian error checking, testing family structure, highly accurate kinship coefficient, association analysis, population structure inference | Kinship coefficient also computed in VCFTOOLS | http://people.virginia.edu/~wc9c/KING/Download.htm | (Manichaikul et al., 2010) |
| COLONY | Pedigrees | Pedigree inference from SNPs | Robust even with high error rates (e.g. low-depth sequencing). Can handle haplodiploid systems (e.g. ants). Multi-threaded. | Can only simulate genotypes with the Windows version. | https://www.zsl.org/science/software/colony | (Wang, 2019) |
| sequoia | Pedigrees | Pedigree inference from SNPs | Can be applied to large pedigrees (>1000 individuals). Accommodates unknown birth times. | Handles hundreds of SNPs. For whole-genome data, preliminary filtering and LD-pruning may be recommended. Efficient with ~100 SNPs. | https://cran.r-project.org/web/packages/sequoia/index.html | (Huisman, 2017) |
| LDHat | Recombination | Estimating variation in recombination rates along a genome | Handles unphased and missing data, underlying model can be used for organisms such as viruses or bacteria | Limited to 300 sequences, specific format (not VCF), model for recombination hotspots based on human data | http://ldhat.sourceforge.net/ | (McVean et al., 2002) |
| LDHot | Recombination | Identifying recombination hotspots | Specifically designed for detecting recombination hotspots | Requires phased data, works with LDHat | https://github.com/auton1/LDhot | (Myers et al., 2005) |
| iSMC | Recombination | Recombination from a single diploid genome | No phasing needed. Accepts VCF files as input. | Introgression and demographic misspecification may bias results. No detailed tutorial | https://github.com/gvbarroso/iSMC | (Barroso et al., 2018) |
| LDHelmet | Recombination | Estimating variation in recombination rates along a genome | Higher accuracy than LDHat | Requires phased data. Does not handle VCF, only fasta and fastq formats. Requires dividing the genome into short segments to be analysed in parallel. | https://sourceforge.net/projects/ldhelmet/ | (Chan et al., 2012) |
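
Several tools in the table estimate inbreeding by comparing observed homozygosity with its Hardy-Weinberg expectation. The sketch below illustrates that general idea with a simple method-of-moments estimator, F = (O - E) / (N - E), where O is the observed number of homozygous sites for an individual, E the number expected under HW equilibrium, and N the number of sites used. It is not the implementation of PLINK, VCFTOOLS or any other listed tool: the function names, the 0/1/2 genotype coding and the absence of a small-sample correction are all assumptions made for illustration.

```python
from typing import List, Optional

# Assumed coding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def allele_frequencies(genotypes: List[List[Genotype]]) -> List[Optional[float]]:
    """Alternate-allele frequency at each site, ignoring missing calls."""
    if not genotypes:
        return []
    freqs: List[Optional[float]] = []
    for site in range(len(genotypes[0])):
        calls = [ind[site] for ind in genotypes if ind[site] is not None]
        freqs.append(sum(calls) / (2 * len(calls)) if calls else None)
    return freqs


def inbreeding_coefficients(genotypes: List[List[Genotype]]) -> List[Optional[float]]:
    """Per-individual F = (O - E) / (N - E) (hypothetical helper, no correction)."""
    freqs = allele_frequencies(genotypes)
    results: List[Optional[float]] = []
    for individual in genotypes:
        observed_hom = 0.0
        expected_hom = 0.0
        n_sites = 0
        for genotype, p in zip(individual, freqs):
            # Skip missing calls and monomorphic sites (they carry no signal).
            if genotype is None or p is None or p in (0.0, 1.0):
                continue
            n_sites += 1
            observed_hom += 1.0 if genotype != 1 else 0.0
            expected_hom += 1.0 - 2.0 * p * (1.0 - p)  # HW: 1 - 2pq
        denominator = n_sites - expected_hom
        results.append((observed_hom - expected_hom) / denominator
                       if denominator > 0 else None)
    return results


# Toy data: two fully homozygous and two fully heterozygous individuals.
toy = [[0, 0], [2, 2], [1, 1], [1, 1]]
print(inbreeding_coefficients(toy))  # [1.0, 1.0, -1.0, -1.0]
```

Allele frequencies are estimated from the same sample, so the estimate is sensitive to population structure and small sample sizes. This is one reason to filter and LD-prune markers before running the dedicated tools.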

References

Al-Asadi, H., Petkova, D., Stephens, M., & Novembre, J. (2019). Estimating recent migration and population-size surfaces. PLoS Genetics, 15(1), 1–21. doi: 10.1371/journal.pgen.1007908

Alexander, D. H., & Novembre, J. (2009). Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Research, 19, 1655–1664. doi: 10.1101/gr.094052.109

Baran, Y., Pasaniuc, B., Sankararaman, S., Torgerson, D. G., Gignoux, C., Eng, C., … Halperin, E. (2012). Fast and accurate inference of local ancestry in Latino populations. Bioinformatics, 28(10), 1359–1367. doi: 10.1093/bioinformatics/bts144

Barroso, G. V., Puzovic, N., & Dutheil, J. Y. (2018). Inference of recombination maps from a single pair of genomes and its application to archaic samples. BioRxiv, 1–21. doi: 10.1101/452268

Bradburd, G. S., Ralph, P. L., & Coop, G. M. (2013). Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution, 67(11), 3258–3273. doi: 10.1111/evo.12193

Brisbin, A., Bryc, K., Byrnes, J., Zakharia, F., Omberg, L., Degenhardt, J., … Bustamante, C. D. (2012). PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations. Human Biology, 84(4), 343–364. doi: 10.3378/027.084.0401

Caye, K., Deist, T. M., Martins, H., Michel, O., & François, O. (2016). TESS3: Fast inference of spatial population structure and genome scans for selection. Molecular Ecology Resources, 16(2), 540–548. doi: 10.1111/1755-0998.12471

Chan, A. H., Jenkins, P. A., & Song, Y. S. (2012). Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster. PLoS Genetics, 8(12). doi: 10.1371/journal.pgen.1003090

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., … Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. doi: 10.1093/bioinformatics/btr330

Excoffier, L., & Lischer, H. E. L. (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3), 564–567. doi: 10.1111/j.1755-0998.2010.02847.x

Ferretti, L., Ramos-Onsins, S. E., & Pérez-Enciso, M. (2013). Population genomics from pool sequencing. Molecular Ecology, 22(22), 5561–5576. doi: 10.1111/mec.12522

François, O., & Jay, F. (2020). Factor analysis of ancient population genomic samples. Nature Communications, 11(1). doi: 10.1038/s41467-020-18335-6

Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., & François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics, 196(4), 973–983. doi: 10.1534/genetics.113.160572

Guillot, G., Renaud, S., Ledevin, R., Michaux, J., & Claude, J. (2012). A unifying model for the analysis of phenotypic, genetic, and geographic data. Systematic Biology, 61(6), 897–911. doi: 10.1093/sysbio/sys038

Hanghøj, K., Moltke, I., Andersen, P. A., Manica, A., & Korneliussen, T. S. (2019). Fast and accurate relatedness estimation from high-throughput sequencing data in the presence of inbreeding. GigaScience, 8(5), 1–9. doi: 10.1093/gigascience/giz034

Hellenthal, G., Busby, G. B. J., Band, G., Wilson, J. F., Capelli, C., Falush, D., & Myers, S. (2014). A Genetic Atlas of Human Admixture History. Science, 343(February), 747–751. doi: 10.1126/science.1243518

Huisman, J. (2017). Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond. Molecular Ecology Resources, 17(5), 1009–1024. doi: 10.1111/1755-0998.12665

Jombart, T., Devillard, S., Dufour, A.-B., & Pontier, D. (2008). Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity, 101, 92–103. doi: 10.1038/hdy.2008.34

Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11(1), 94. doi: 10.1186/1471-2156-11-94

Joseph, T. A., & Pe’er, I. (2019). Inference of Population Structure from Time-Series Genotype Data. American Journal of Human Genetics, 105(2), 317–333. doi: 10.1016/j.ajhg.2019.06.002

Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., … Schlötterer, C. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE, 6(1), e15925. doi: 10.1371/journal.pone.0015925

Kofler, R., Pandey, R. V., & Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27(24), 3435–3436. doi: 10.1093/bioinformatics/btr589

Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. doi: 10.1186/s12859-014-0356-4

Link, V., Kousathanas, A., Veeramah, K., Sell, C., Scheu, A., & Wegmann, D. (2017). ATLAS: Analysis Tools for Low-depth and Ancient Samples. BioRxiv. doi: 10.1101/105346

Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867–2873. doi: 10.1093/bioinformatics/btq559

McVean, G., Awadalla, P., & Fearnhead, P. (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics, 160(3), 1231–1241.

Myers, S., Bottolo, L., Freeman, C., McVean, G., & Donnelly, P. (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Science, 310(5746), 321–324. doi: 10.1126/science.1117196

Petkova, D., Novembre, J., & Stephens, M. (2015). Visualizing spatial population structure with estimated effective migration surfaces. Nature Genetics, 48(1), 94–100. doi: 10.1038/ng.3464

Pfeifer, B., Wittelsburger, U., Ramos-Onsins, S. E., & Lercher, M. J. (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Molecular Biology and Evolution, 31(7), 1929–1936. doi: 10.1093/molbev/msu136

Price, A., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8), 904–909. doi: 10.1038/ng1847

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1461096&tool=pmcentrez&rendertype=abstract

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., … Sham, P. C. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics, 81(3), 559–575. doi: 10.1086/519795

Raj, A., Stephens, M., & Pritchard, J. K. (2014). FastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics, 197(2), 573–589. doi: 10.1534/genetics.114.164350

Salter-Townshend, M., & Myers, S. (2019). Fine-Scale Inference of Ancestry Segments Without Prior Knowledge of Admixing Groups. Genetics, 212(July), 869–889.

Takezaki, N., Nei, M., & Tamura, K. (2010). POPTREE2: Software for constructing population trees from allele frequency data and computing other population statistics with windows interface. Molecular Biology and Evolution, 27(4), 747–752. doi: 10.1093/molbev/msp312

Wang, J. (2019). Pedigree reconstruction from poor quality genotype data. Heredity, 122(6), 719–728. doi: 10.1038/s41437-018-0178-7

Zheng, X., Levine, D., Shen, J., Gogarten, S. M., Laurie, C., & Weir, B. S. (2012). A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics, 28(24), 3326–3328. doi: 10.1093/bioinformatics/bts606