Methods to infer population structure and explore datasets

This page summarizes methods for identifying population structure and exploring datasets.

Software	Class of method	Purpose	Specifics	Issues and warnings	Link	Reference
Arlequin	AMOVA	Characterizing hierarchical population structure	Arlequin allows for a variety of other analyses of diversity	Requires a priori assignment of individuals to populations; data must be formatted prior to analysis	http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html	(Excoffier and Lischer, 2010)
GENELAND	Clustering and characterizing admixture	Grouping individuals in spatially consistent clusters maximizing HW equilibrium	Takes into account spatial variation, supposed to detect weak structure, framed in R	Immigrant alleles are assumed to be found only in new immigrants	https://cran.r-project.org/web/packages/Geneland/	(Guillot et al., 2012)
sNMF	Clustering and characterizing admixture	Grouping individuals in clusters maximizing HW equilibrium and LD between loci	Fast (30X faster than ADMIXTURE)	Still slow computation time for very large datasets	http://membres-timc.imag.fr/Olivier.Francois/snmf/index.htm	(Frichot et al., 2014)
STRUCTURE	Clustering and characterizing admixture	Grouping individuals in clusters maximizing HW equilibrium and LD between loci	User friendly interface. Bayesian inference.	Not suited for whole genomes. Requires specific input format. Might be used on a small set of high quality markers for small genomes.	http://pritchardlab.stanford.edu/structure.html	(Pritchard et al., 2000)
FastSTRUCTURE	Clustering and characterizing admixture	Grouping individuals in clusters maximizing HW equilibrium and LD between loci	~100X faster than Structure	Approximate inference of the original Structure model	http://rajanil.github.io/fastStructure/	(Raj et al., 2014)
ADMIXTURE	Clustering and characterizing admixture	Grouping individuals in clusters maximizing HW equilibrium and LD between loci	Maximum Likelihood, claimed to be faster than Structure. Can handle sex-linked markers	Often slower than its counterparts	https://www.genetics.ucla.edu/software/admixture/index.html	(Alexander and Novembre, 2009)
FineStructure/GlobeTrotter	Clustering and characterizing admixture	Chromosome painting, admixture and clustering	Estimates time since admixture, fast, set of scripts to facilitate analysis	Relies on Structure and fastStructure assumptions. Requires phased data.	http://paintmychromosomes.com/	(Hellenthal et al., 2014)
PCAdmix	Clustering and characterizing admixture	Chromosome painting	Fast, uses HMM to smooth out windows and limit noise due to low confidence ancestry	Requires a priori definition of ancestral populations and phased haplotypes	https://sites.google.com/site/pcadmix/	(Brisbin et al., 2012)
MOSAIC	Clustering and characterizing admixture	Chromosome painting, estimating admixture time and proportions.	Can handle several source populations. These populations do not have to be good surrogates of populations that actually mixed.	Requires phased data, but performs phasing error correction.	https://maths.ucd.ie/~mst/MOSAIC/	(Salter-Townshend and Myers, 2019)
tfa	Clustering and characterizing admixture	Summarizing variance across loci and visualizing inter-individual genetic distance	Uses latent factors to correct for drift and position ancient samples in a PCA-like framework.	NA	https://bcm-uga.github.io/tfa/	(François and Jay, 2020)
Dystruct	Clustering and characterizing admixture	Grouping individuals in clusters maximizing HW equilibrium and LD between loci	This method explicitly takes into account the age of samples. Useful when analyzing mixtures of modern and ancient samples.	Requires a genotype matrix in the eigenstrat format. Primarily tested on human data.	https://github.com/tyjo/dystruct	(Joseph and Pe’er, 2019)
BEDASSLE	Differentiation and MCMC model testing	Identifies contribution of environment and geographical distance to populations differentiation	Less biased than Mantel tests, provides tools for model testing	Uses population-level data.	https://cran.r-project.org/web/packages/BEDASSLE/index.html	(Bradburd et al., 2013)
npstat	Differentiation/Diversity	Extracting summary statistics from pooled data	Explicitly corrects for sampling bias in pooled data. Allows computing tests using an outgroup (MK test, HKA test, Fay and Wu's H) and characterizing coding mutations.	Mostly limited to summary statistics, but more complete than Popoolation.	https://github.com/lucaferretti/npstat	(Ferretti et al., 2013)
Popoolation/Popoolation2/Popoolation TE	Differentiation/Diversity/Recombination	Extracting summary statistics from pooled data	Explicitly corrects for sampling bias in pooled data. Can be used to detect TE polymorphisms.	Mostly limited to a few summary statistics. A pipeline dedicated to TE detection is also available	https://sourceforge.net/p/popoolation/wiki/Main/	(Kofler, Orozco-terWengel, et al., 2011; Kofler, Pandey, et al., 2011)
POPGenome	Differentiation/Diversity/Recombination	Computing summary statistics based on AFS and LD along genomes	Accepts VCF and GFF/GTF files, efficient and fast. Tests for admixture available (ABBA BABA test). Includes basic coalescence simulations (ms and msms)	Mostly limited to summary statistics (but coalescent simulations are possible). No built-in SNP calling module	http://catchenlab.life.illinois.edu/stacks/	(Pfeifer et al., 2014)
ANGSD	Differentiation/Diversity/Recombination	Computing summary statistics based on AFS and LD along genomes	Able to process BAM files, built-in procedures for data filtering, admixture analysis. Suited for low-depth data. Includes a suite of methods to estimate relatedness (NGSRelate).	Mostly limited to summary statistics. Tutorials not always up-to-date.	https://github.com/ANGSD/angsd https://github.com/ANGSD/NgsRelate	(Korneliussen et al., 2014; Hanghøj et al., 2019)
Arlequin	Differentiation/Diversity/Recombination	Computing summary statistics based on AFS and LD along genomes	Can output AFS for further analysis in fastsimcoal2	Slower than PopGenome, requires a specific format and file conversion.	http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html	(Excoffier and Lischer, 2010)
VCFTOOLS	Differentiation/Diversity/Recombination	Computing summary statistics based on AFS and LD along genomes	Fast. VCFTOOLS can also be used for SNP filtering	Fewer summary statistics than POPGenome	https://vcftools.github.io/man_latest.html	(Danecek et al., 2011)
ATLAS	Differentiation/Diversity/Recombination	Low depth sequencing/ancient samples analysis	Particularly suited for analyzing ancient samples. Includes sets of tools to call variants, estimate post-mortem damage, inbreeding, genetic diversity. Produces the input file for PSMC (demography from a single diploid genome)	Better used in combination with GATK pipelines. Still in development	https://bitbucket.org/wegmannlab/atlas/wiki/Home	(Link et al., 2017)
POPTREE2	Genetic differentiation	Visualizing a matrix of pairwise differentiation statistics as a tree	Can be used for pooled datasets, several statistics can be used	Differentiation measures alone do not necessarily retrieve the actual history of populations	http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html	(Takezaki et al., 2010)
EEMS	Landscape genomics	Estimating barriers to gene flow in a spatial context	Estimates pairwise relatedness between all samples, and compares it to isolation-by-distance expectations to identify barriers to gene flow and corridors of higher connectivity. Can handle both haploid and diploid data	Requires converting VCF files into PLINK binary format. Estimates effective migration rates (does not disentangle migration rates and effective population sizes). Setting parameters for the MCMC chain requires some trial-and-error	https://github.com/dipetkov/eems	(Petkova et al., 2015)
MAPS	Landscape genomics	Estimating barriers to gene flow in a spatio-temporal context	Expands on EEMS, but takes into account the phase to reconstruct past changes in connectivity. Can disentangle migration rates and effective population sizes (unlike EEMS)	Relies on identity-by-descent tracks, requiring phasing (for example using BEAGLE). A pipeline to obtain IBD tracks is available, with a few details here: https://github.com/halasadi/ibd_data_pipeline/issues/1	https://github.com/halasadi/MAPS	(Al-Asadi et al., 2019)
TESS3R	Landscape genomics	Grouping individuals in clusters maximizing HW equilibrium and LD between loci	Incorporates geographic information of samples. Can run genome scans of selection based on contrasting ancestral and modern allele frequencies.	Importing data requires using conversion tools found in the LEA suite	https://bcm-uga.github.io/TESS3_encho_sen/	(Caye et al., 2016)
SNPRelate	Multivariate analysis	Summarizing variance across loci and visualizing inter-individual genetic distance	Fast. Can use VCF files as an input	Requires careful interpretation (Jombart et al. 2009)	https://bioconductor.org/packages/release/bioc/html/SNPRelate.html	(Zheng et al., 2012)
Eigenstrat/smartpca	Multivariate analysis	Summarizing variance across loci and visualizing inter-individual genetic distance	Fast. Can use VCF files as an input	Requires careful interpretation (Jombart et al. 2009)	https://github.com/DReichLab/EIG/tree/master/EIGENSTRAT	(Price et al., 2006)
DAPC (adegenet)	Multivariate analysis/Clustering	Maximizes divergence between groups identified by PCA	Fast. Less sensitive to HWE assumptions. Claims to be more efficient than Structure	Requires careful interpretation (Jombart et al. 2009)	http://adegenet.r-forge.r-project.org/	(Jombart et al., 2010)
sPCA (adegenet)	Multivariate analysis/Clustering	Spatially explicit model to assess population structure	Spatially explicit and able to detect cryptic structure. Fast.	Does not take into account HW equilibrium or LD	http://adegenet.r-forge.r-project.org/	(Jombart et al., 2008)
LAMP	Pedigree, Identity by descent/state	Chromosome painting, relatedness	LAMP also allows for association and pedigree analyses	Identifies local ancestry in windows (source of noise), requires phased data	http://lamp.icsi.berkeley.edu/lamp/	(Baran et al., 2012)
PLINK	Pedigree, Identity by descent/state	Estimating inbreeding and relatedness	Allows studying identity by descent and by state. PLINK is a multi-purpose tool, facilitating data analysis within the same software	NA	http://pngu.mgh.harvard.edu/~purcell/plink/	(Purcell et al., 2007)
VCFTOOLS	Pedigree, Identity by descent/state	Estimating inbreeding and relatedness	Computes unadjusted Ajk and kinship coefficient	NA	https://vcftools.github.io/man_latest.html	(Danecek et al., 2011)
KING	Pedigree, Identity by descent/state	Estimating inbreeding and relatedness, multivariate analysis	Mendelian error checking, testing family structure, highly accurate kinship coefficient, association analysis, population structure inference	Kinship coefficient also computed in VCFTOOLS	http://people.virginia.edu/~wc9c/KING/Download.htm	(Manichaikul et al., 2010)
COLONY	Pedigrees	Pedigree inference from SNPs	Robust even with high error rates (e.g. low-depth sequencing). Can handle haplo-diploid systems (e.g. ants). Multi-threaded.	Can only simulate genotypes with the Windows version.	https://www.zsl.org/science/software/colony	(Wang, 2019)
sequoia	Pedigrees	Pedigree inference from SNPs	Can be applied to large pedigrees (>1000 individuals). Accommodates unknown birth times.	Handles hundreds of SNPs. For whole-genome data, preliminary filtering and LD-pruning may be recommended. Efficient with ~100 SNPs.	https://cran.r-project.org/web/packages/sequoia/index.html	(Huisman, 2017)
LDHat	Recombination	Estimating variation in recombination rates along a genome	Handles unphased and missing data, underlying model can be used for organisms such as viruses or bacteria	Limited to 300 sequences, specific format (not VCF), model for recombination hotspots based on human data	http://ldhat.sourceforge.net/	(McVean et al., 2002)
LDHot	Recombination	Identifying recombination hotspots	Specifically designed for detecting recombination hotspots	Requires data to be phased, working with LDHat	https://github.com/auton1/LDhot	(Myers et al., 2005)
iSMC	Recombination	Recombination from a single diploid genome	No phasing needed. Accepts VCF files as input.	Introgression and demographic misspecification may bias results. No detailed tutorial	https://github.com/gvbarroso/iSMC	(Barroso et al., 2018)
LDHelmet	Recombination	Estimating variation in recombination rates along a genome	Higher accuracy than LDHat	Requires phased data. Does not handle VCF, only fasta and fastq formats. Requires dividing the genome in short segments to be analysed in parallel.	https://sourceforge.net/projects/ldhelmet/	(Chan et al., 2012)
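
Several clustering methods in the table are described as grouping individuals in clusters maximizing HW equilibrium. The sketch below is not the algorithm of any listed software; it only illustrates, on made-up toy genotypes, the signal such methods rely on: two populations that are each in Hardy-Weinberg proportions show a deficit of heterozygotes once pooled (the Wahlund effect). The function name hw_summary and the sample data are hypothetical and chosen for illustration only.

```python
def hw_summary(genotypes):
    """Summarize one biallelic locus.

    genotypes: list of alternate-allele counts per diploid individual (0, 1 or 2).
    Returns (allele frequency, observed heterozygosity,
             expected heterozygosity under HW, inbreeding coefficient F).
    """
    n = len(genotypes)
    p = sum(genotypes) / (2 * n)        # alternate allele frequency
    h_obs = genotypes.count(1) / n      # observed proportion of heterozygotes
    h_exp = 2 * p * (1 - p)             # HW expectation for heterozygotes
    # F = 1 - Hobs/Hexp; F > 0 indicates a heterozygote deficit
    f = 1 - h_obs / h_exp if h_exp > 0 else 0.0
    return p, h_obs, h_exp, f


# Toy data (assumed, not real): each population is in exact HW proportions.
pop_a = [0] * 81 + [1] * 18 + [2] * 1   # allele frequency 0.1
pop_b = [0] * 1 + [1] * 18 + [2] * 81   # allele frequency 0.9
pooled = pop_a + pop_b                  # ignoring population labels

for name, sample in [("pop_a", pop_a), ("pop_b", pop_b), ("pooled", pooled)]:
    p, h_obs, h_exp, f = hw_summary(sample)
    print(f"{name}: p={p:.2f} Hobs={h_obs:.2f} Hexp={h_exp:.2f} F={f:.2f}")
```

Within each population F is close to zero, whereas the pooled sample shows a strong heterozygote deficit. Splitting the pooled sample back into the two groups removes the deficit, which is the intuition behind assigning individuals to clusters that are each close to HW equilibrium.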

References

Al-Asadi, H., Petkova, D., Stephens, M., & Novembre, J. (2019). Estimating recent migration and population-size surfaces. PLoS Genetics, 15(1), 1–21. doi: 10.1371/journal.pgen.1007908

Alexander, D. H., & Novembre, J. (2009). Fast Model-Based Estimation of Ancestry in Unrelated Individuals. Genome Research, 19(9), 1655–1664. doi: 10.1101/gr.094052.109

Baran, Y., Pasaniuc, B., Sankararaman, S., Torgerson, D. G., Gignoux, C., Eng, C., … Halperin, E. (2012). Fast and accurate inference of local ancestry in Latino populations. Bioinformatics, 28(10), 1359–1367. doi: 10.1093/bioinformatics/bts144

Barroso, G. V., Puzovic, N., & Dutheil, J. Y. (2018). Inference of recombination maps from a single pair of genomes and its application to archaic samples. BioRxiv, 1–21. doi: 10.1101/452268

Bradburd, G. S., Ralph, P. L., & Coop, G. M. (2013). Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution, 67(11), 3258–3273. doi: 10.1111/evo.12193

Brisbin, A., Bryc, K., Byrnes, J., Zakharia, F., Omberg, L., Degenhardt, J., … Bustamante, C. D. (2012). PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations. Human Biology, 84(4), 343–364. doi: 10.3378/027.084.0401

Caye, K., Deist, T. M., Martins, H., Michel, O., & François, O. (2016). TESS3: Fast inference of spatial population structure and genome scans for selection. Molecular Ecology Resources, 16(2), 540–548. doi: 10.1111/1755-0998.12471

Chan, A. H., Jenkins, P. A., & Song, Y. S. (2012). Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster. PLoS Genetics, 8(12). doi: 10.1371/journal.pgen.1003090

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., … Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. doi: 10.1093/bioinformatics/btr330

Excoffier, L., & Lischer, H. E. L. (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3), 564–567. doi: 10.1111/j.1755-0998.2010.02847.x

Ferretti, L., Ramos-Onsins, S. E., & Pérez-Enciso, M. (2013). Population genomics from pool sequencing. Molecular Ecology, 22(22), 5561–5576. doi: 10.1111/mec.12522

François, O., & Jay, F. (2020). Factor analysis of ancient population genomic samples. Nature Communications, 11(1). doi: 10.1038/s41467-020-18335-6

Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., & François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics, 196(4), 973–983. doi: 10.1534/genetics.113.160572

Guillot, G., Renaud, S., Ledevin, R., Michaux, J., & Claude, J. (2012). A unifying model for the analysis of phenotypic, genetic, and geographic data. Systematic Biology, 61(6), 897–911. doi: 10.1093/sysbio/sys038

Hanghøj, K., Moltke, I., Andersen, P. A., Manica, A., & Korneliussen, T. S. (2019). Fast and accurate relatedness estimation from high-throughput sequencing data in the presence of inbreeding. GigaScience, 8(5), 1–9. doi: 10.1093/gigascience/giz034

Hellenthal, G., Busby, G. B. J., Band, G., Wilson, J. F., Capelli, C., Falush, D., & Myers, S. (2014). A Genetic Atlas of Human Admixture History. Science, 343(6172), 747–751. doi: 10.1126/science.1243518

Huisman, J. (2017). Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond. Molecular Ecology Resources, 17(5), 1009–1024. doi: 10.1111/1755-0998.12665

Jombart, T., Devillard, S., Dufour, A.-B., & Pontier, D. (2008). Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity, 101, 92–103. doi: 10.1038/hdy.2008.34

Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11(1), 94. doi: 10.1186/1471-2156-11-94

Joseph, T. A., & Pe’er, I. (2019). Inference of Population Structure from Time-Series Genotype Data. American Journal of Human Genetics, 105(2), 317–333. doi: 10.1016/j.ajhg.2019.06.002

Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., … Schlötterer, C. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PloS One, 6(1), e15925. doi: 10.1371/journal.pone.0015925

Kofler, R., Pandey, R. V., & Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27(24), 3435–3436. doi: 10.1093/bioinformatics/btr589

Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. doi: 10.1186/s12859-014-0356-4

Link, V., Kousathanas, A., Veeramah, K., Sell, C., Scheu, A., & Wegmann, D. (2017). ATLAS: Analysis Tools for Low-depth and Ancient Samples. BioRxiv. doi: 10.1101/105346

Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867–2873. doi: 10.1093/bioinformatics/btq559

McVean, G., Awadalla, P., & Fearnhead, P. (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics, 160(3), 1231–1241.

Myers, S., Bottolo, L., Freeman, C., McVean, G., & Donnelly, P. (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Science, 310(5746), 321–324. doi: 10.1126/science.1117196

Petkova, D., Novembre, J., & Stephens, M. (2015). Visualizing spatial population structure with estimated effective migration surfaces. Nature Genetics, 48(1), 94–100. doi: 10.1038/ng.3464

Pfeifer, B., Wittelsburger, U., Ramos-Onsins, S. E., & Lercher, M. J. (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Molecular Biology and Evolution, 31(7), 1929–1936. doi: 10.1093/molbev/msu136

Price, A., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8), 904–909. doi: 10.1038/ng1847

Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1461096&tool=pmcentrez&rendertype=abstract

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., … Sham, P. C. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics, 81(3), 559–575. doi: 10.1086/519795

Raj, A., Stephens, M., & Pritchard, J. K. (2014). FastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics, 197(2), 573–589. doi: 10.1534/genetics.114.164350

Salter-Townshend, M., & Myers, S. (2019). Fine-Scale Inference of Ancestry Segments Without Prior Knowledge of Admixing Groups. Genetics, 212(3), 869–889.

Takezaki, N., Nei, M., & Tamura, K. (2010). POPTREE2: Software for constructing population trees from allele frequency data and computing other population statistics with windows interface. Molecular Biology and Evolution, 27(4), 747–752. doi: 10.1093/molbev/msp312

Wang, J. (2019). Pedigree reconstruction from poor quality genotype data. Heredity, 122(6), 719–728. doi: 10.1038/s41437-018-0178-7

Zheng, X., Levine, D., Shen, J., Gogarten, S. M., Laurie, C., & Weir, B. S. (2012). A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics, 28(24), 3326–3328. doi: 10.1093/bioinformatics/bts606
