Here you will find a summary of methods aiming at identifying population structure.
Software | Class of method | Purpose | Specifics | Issues and warnings | Link | Reference |
---|---|---|---|---|---|---|
Arlequin | AMOVA | Characterizing hierarchical population structure | Arlequin allows for a variety of other analyses of diversity | Requires a priori assignment of individuals to populations; data must be formatted prior to analysis | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
GENELAND | Clustering and characterizing admixture | Grouping individuals in spatially consistent clusters maximizing HW equilibrium | Takes into account spatial variation, supposed to detect weak structure, framed in R | Immigrant alleles are assumed to be found only in new immigrants | https://cran.r-project.org/web/packages/Geneland/ | (Guillot et al., 2012) |
sNMF | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | Fast (up to 30X faster than ADMIXTURE) | Still slow computation time for very large datasets | http://membres-timc.imag.fr/Olivier.Francois/snmf/index.htm | (Frichot et al., 2014) |
STRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | User-friendly interface. Bayesian inference. | Not suited for whole genomes. Requires specific input format. Might be used on a small set of high quality markers for small genomes. | http://pritchardlab.stanford.edu/structure.html | (Pritchard et al., 2000) |
FastSTRUCTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | ~100X faster than Structure | Approximate inference of the original Structure model | http://rajanil.github.io/fastStructure/ | (Raj et al., 2014) |
ADMIXTURE | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | Maximum likelihood, claimed to be faster than Structure. Can handle sex-linked markers | Often slower than its counterparts | https://www.genetics.ucla.edu/software/admixture/index.html | (Alexander and Novembre, 2009) |
FineStructure/GlobeTrotter | Clustering and characterizing admixture | Chromosome painting, admixture and clustering | Estimates time since admixture, fast, set of scripts to facilitate analysis | Relies on Structure and fastStructure assumptions. Requires phased data. | http://paintmychromosomes.com/ | (Hellenthal et al., 2014) |
PCAdmix | Clustering and characterizing admixture | Chromosome painting | Fast, uses HMM to smooth out windows and limit noise due to low confidence ancestry | Requires a priori definition of ancestral populations and phased haplotypes | https://sites.google.com/site/pcadmix/ | (Brisbin et al., 2012) |
MOSAIC | Clustering and characterizing admixture | Chromosome painting, estimating admixture time and proportions. | Can handle several source populations. These populations do not have to be good surrogates of populations that actually mixed. | Requires phased data, but performs phasing error correction. | https://maths.ucd.ie/~mst/MOSAIC/ | (Salter-Townshend and Myers, 2019) |
tfa | Clustering and characterizing admixture | Summarizing variance across loci and visualizing inter-individual genetic distance | Uses latent factors to correct for drift and position ancient samples in a PCA-like framework. | NA | https://bcm-uga.github.io/tfa/ | (François and Jay, 2020) |
Dystruct | Clustering and characterizing admixture | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | This method explicitly takes into account the age of samples. Useful when analyzing mixtures of modern and ancient samples. | Requires a genotype matrix in the EIGENSTRAT format. Primarily tested on human data. | https://github.com/tyjo/dystruct | (Joseph and Pe’er, 2019) |
BEDASSLE | Differentiation and MCMC model testing | Identifies contribution of environment and geographical distance to populations differentiation | Less biased than Mantel tests, provides tools for model testing | Uses population-level data. | https://cran.r-project.org/web/packages/BEDASSLE/index.html | (Bradburd et al., 2013) |
npstat | Differentiation/Diversity | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data. Allows computing tests using an outgroup (MK test, HKA test, Fay and Wu's H) and characterizing coding mutations. | Mostly limited to summary statistics, but more complete than Popoolation. | https://github.com/lucaferretti/npstat | (Ferretti et al., 2013) |
Popoolation/Popoolation2/Popoolation TE | Differentiation/Diversity/Recombination | Extracting summary statistics from pooled data | Explicitly corrects for sampling bias in pooled data. Can be used to detect TE polymorphisms. | Mostly limited to a few summary statistics. A pipeline dedicated to TE detection is also available | https://sourceforge.net/p/popoolation/wiki/Main/ | (Kofler, Orozco-terWengel, et al., 2011; Kofler, Pandey, et al., 2011) |
POPGenome | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Accepts VCF and GFF/GTF files, efficient and fast. Tests for admixture available (ABBA BABA test). Includes basic coalescent simulations (ms and msms) | Mostly limited to summary statistics (but coalescent simulations are possible). No built-in SNP calling module | http://catchenlab.life.illinois.edu/stacks/ | (Pfeifer et al., 2014) |
ANGSD | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Able to process BAM files, built-in procedures for data filtering, admixture analysis. Suited for low-depth data. Includes a suite of methods to estimate relatedness (NGSRelate). | Mostly limited to summary statistics. Tutorials not always up-to-date. | https://github.com/ANGSD/angsd https://github.com/ANGSD/NgsRelate | (Korneliussen et al., 2014; Hanghøj et al., 2019) |
Arlequin | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Can output AFS for further analysis in fastsimcoal2 | Slower than PopGenome, requires a specific format and file conversion. | http://cmpg.unibe.ch/software/arlequin35/Arl35Downloads.html | (Excoffier and Lischer, 2010) |
VCFTOOLS | Differentiation/Diversity/Recombination | Computing summary statistics based on AFS and LD along genomes | Fast. VCFTOOLS can also be used for SNP filtering | Fewer summary statistics than POPGenome | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
ATLAS | Differentiation/Diversity/Recombination | Low-depth sequencing/ancient samples analysis | Particularly suited for analyzing ancient samples. Includes sets of tools to call variants and to estimate post-mortem damage, inbreeding and genetic diversity. Produces the input file for PSMC (demography from a single diploid genome) | Better used in combination with GATK pipelines. Still in development | https://bitbucket.org/wegmannlab/atlas/wiki/Home | (Link et al., 2017) |
POPTREE2 | Genetic differentiation | Visualizing a matrix of pairwise differentiation statistics as a tree | Can be used for pooled datasets, several statistics can be used | Differentiation measures alone do not necessarily retrieve the actual history of populations | http://www.med.kagawa-u.ac.jp/~genomelb/takezaki/poptree2/index.html | (Takezaki et al., 2010) |
EEMS | Landscape genomics | Estimating barriers to gene flow in a spatial context | Estimates pairwise relatedness between all samples, and compares it to isolation-by-distance expectations to identify barriers to gene flow and corridors of higher connectivity. Can handle both haploid and diploid data | Requires converting the VCF file to PLINK binary format. Estimates effective migration rates (does not disentangle migration rates and effective population sizes). Setting parameters for the MCMC chain requires some trial and error | https://github.com/dipetkov/eems | (Petkova et al., 2015) |
MAPS | Landscape genomics | Estimating barriers to gene flow in a spatio-temporal context | Expands on EEMS, but takes into account the phase to reconstruct past changes in connectivity. Can disentangle migration rates and effective population sizes (unlike EEMS) | Relies on identity-by-descent tracks, requiring phasing (for example using BEAGLE). A pipeline to obtain IBD tracks is available, with a few details here: https://github.com/halasadi/ibd_data_pipeline/issues/1 | https://github.com/halasadi/MAPS | (Al-Asadi et al., 2019) |
TESS3R | Landscape genomics | Grouping individuals in clusters maximizing HW equilibrium and linkage equilibrium between loci | Incorporates geographic information of samples. Can run genome scans of selection based on contrasting ancestral and modern allele frequencies. | Importing data requires using conversion tools found in the LEA suite | https://bcm-uga.github.io/TESS3_encho_sen/ | (Caye et al., 2016) |
SNPRelate | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as an input | Requires careful interpretation (Jombart et al. 2009) | https://bioconductor.org/packages/release/bioc/html/SNPRelate.html | (Zheng et al., 2012) |
Eigenstrat/smartpca | Multivariate analysis | Summarizing variance across loci and visualizing inter-individual genetic distance | Fast. Can use VCF files as an input | Requires careful interpretation (Jombart et al. 2009) | https://github.com/DReichLab/EIG/tree/master/EIGENSTRAT | (Price et al., 2006) |
DAPC (adegenet) | Multivariate analysis/Clustering | Maximizes divergence between groups identified by PCA | Fast. Less sensitive to HWE assumptions. Claims to be more efficient than Structure | Requires careful interpretation (Jombart et al. 2009) | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2010) |
sPCA (adegenet) | Multivariate analysis/Clustering | Spatially explicit model to assess population structure | Spatially explicit and able to detect cryptic structure. Fast. | Does not take into account HW equilibrium or LD | http://adegenet.r-forge.r-project.org/ | (Jombart et al., 2008) |
LAMP | Pedigree, Identity by descent/state | Chromosome painting, relatedness | LAMP also allows for association and pedigree analyses | Identifies local ancestry in windows (source of noise), requires phased data | http://lamp.icsi.berkeley.edu/lamp/ | (Baran et al., 2012) |
PLINK | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Allows studying identity by descent and by state. PLINK is a multi-purpose tool, facilitating data analysis within the same software | NA | http://pngu.mgh.harvard.edu/~purcell/plink/ | (Purcell et al., 2007) |
VCFTOOLS | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness | Computes unadjusted Ajk and kinship coefficient | NA | https://vcftools.github.io/man_latest.html | (Danecek et al., 2011) |
KING | Pedigree, Identity by descent/state | Estimating inbreeding and relatedness, multivariate analysis | Mendelian error checking, testing family structure, highly accurate kinship coefficient, association analysis, population structure inference | Kinship coefficient also computed in VCFTOOLS | http://people.virginia.edu/~wc9c/KING/Download.htm | (Manichaikul et al., 2010) |
COLONY | Pedigrees | Pedigree inference from SNPs | Robust even with high error rates (e.g. low-depth sequencing). Can handle haplodiploid systems (e.g. ants). Multi-threaded. | Can only simulate genotypes with the Windows version. | https://www.zsl.org/science/software/colony | (Wang, 2019) |
sequoia | Pedigrees | Pedigree inference from SNPs | Can be applied to large pedigrees (>1000 individuals). Accommodates unknown birth times. | Handles hundreds of SNPs. For whole-genome data, preliminary filtering and LD-pruning may be recommended. Efficient with ~100 SNPs. | https://cran.r-project.org/web/packages/sequoia/index.html | (Huisman, 2017) |
LDHat | Recombination | Estimating variation in recombination rates along a genome | Handles unphased and missing data, underlying model can be used for organisms such as viruses or bacteria | Limited to 300 sequences, specific format (not VCF), model for recombination hotspots based on human data | http://ldhat.sourceforge.net/ | (McVean et al., 2002) |
LDHot | Recombination | Identifying recombination hotspots | Specifically designed for detecting recombination hotspots | Requires data to be phased, working with LDHat | https://github.com/auton1/LDhot | (Myers et al., 2005) |
iSMC | Recombination | Recombination from a single diploid genome | No phasing needed. Accepts VCF files as input. | Introgression and demographic misspecification may bias results. No detailed tutorial | https://github.com/gvbarroso/iSMC | (Barroso et al., 2018) |
LDHelmet | Recombination | Estimating variation in recombination rates along a genome | Higher accuracy than LDHat | Requires phased data. Does not handle VCF, only fasta and fastq formats. Requires dividing the genome in short segments to be analysed in parallel. | https://sourceforge.net/projects/ldhelmet/ | (Chan et al., 2012) |
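
Several of the multivariate tools listed above (SNPRelate, Eigenstrat/smartpca) summarize variance across loci with a principal component analysis of the genotype matrix. The following is a minimal sketch of that idea in Python with NumPy, not the implementation used by any of these programs. It assumes a complete (no missing data) matrix of 0/1/2 allele counts with individuals in rows and biallelic SNPs in columns; the function name `genotype_pca` and the simulated two-population data are hypothetical and serve only as an illustration.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """PCA of an (individuals x SNPs) matrix of 0/1/2 allele counts.

    Illustrative sketch only: assumes no missing genotypes.
    """
    g = np.asarray(genotypes, dtype=float)
    # Allele frequency of the counted allele at each SNP
    p = g.mean(axis=0) / 2.0
    # Monomorphic SNPs carry no information and would divide by zero
    keep = (p > 0) & (p < 1)
    g, p = g[:, keep], p[keep]
    # Centre each SNP and scale by its binomial standard deviation
    z = (g - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    # Singular value decomposition gives the principal axes
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]


# Simulated example (assumption): two populations of 20 diploid
# individuals whose allele frequencies differ at 500 unlinked SNPs.
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, n_snps))
pop_b = rng.binomial(2, freq_b, size=(20, n_snps))

scores, explained = genotype_pca(np.vstack([pop_a, pop_b]))
print(scores[:3])   # coordinates of the first three individuals
print(explained)    # share of variance carried by PC1 and PC2
```

In this sketch the first axis is expected to separate the two simulated groups. With real data, linked SNPs are usually pruned before such an analysis, and the axes still require careful interpretation, as noted in the table.
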

References
Al-Asadi, H., Petkova, D., Stephens, M., & Novembre, J. (2019). Estimating recent migration and population-size surfaces. PLoS Genetics, 15(1), 1–21. doi: 10.1371/journal.pgen.1007908
Alexander, D. H., & Novembre, J. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Research, 19(9), 1655–1664. doi: 10.1101/gr.094052.109
Baran, Y., Pasaniuc, B., Sankararaman, S., Torgerson, D. G., Gignoux, C., Eng, C., … Halperin, E. (2012). Fast and accurate inference of local ancestry in Latino populations. Bioinformatics, 28(10), 1359–1367. doi: 10.1093/bioinformatics/bts144
Barroso, G. V., Puzovic, N., & Dutheil, J. Y. (2018). Inference of recombination maps from a single pair of genomes and its application to archaic samples. BioRxiv, 1–21. doi: 10.1101/452268
Bradburd, G. S., Ralph, P. L., & Coop, G. M. (2013). Disentangling the effects of geographic and ecological isolation on genetic differentiation. Evolution, 67(11), 3258–3273. doi: 10.1111/evo.12193
Brisbin, A., Bryc, K., Byrnes, J., Zakharia, F., Omberg, L., Degenhardt, J., … Bustamante, C. D. (2012). PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations. Human Biology, 84(4), 343–364. doi: 10.3378/027.084.0401
Caye, K., Deist, T. M., Martins, H., Michel, O., & François, O. (2016). TESS3: Fast inference of spatial population structure and genome scans for selection. Molecular Ecology Resources, 16(2), 540–548. doi: 10.1111/1755-0998.12471
Chan, A. H., Jenkins, P. A., & Song, Y. S. (2012). Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster. PLoS Genetics, 8(12). doi: 10.1371/journal.pgen.1003090
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., … Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. doi: 10.1093/bioinformatics/btr330
Excoffier, L., & Lischer, H. E. L. (2010). Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3), 564–567. doi: 10.1111/j.1755-0998.2010.02847.x
Ferretti, L., Ramos-Onsins, S. E., & Pérez-Enciso, M. (2013). Population genomics from pool sequencing. Molecular Ecology, 22(22), 5561–5576. doi: 10.1111/mec.12522
François, O., & Jay, F. (2020). Factor analysis of ancient population genomic samples. Nature Communications, 11(1). doi: 10.1038/s41467-020-18335-6
Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., & François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics, 196(4), 973–983. doi: 10.1534/genetics.113.160572
Guillot, G., Renaud, S., Ledevin, R., Michaux, J., & Claude, J. (2012). A unifying model for the analysis of phenotypic, genetic, and geographic data. Systematic Biology, 61(6), 897–911. doi: 10.1093/sysbio/sys038
Hanghøj, K., Moltke, I., Andersen, P. A., Manica, A., & Korneliussen, T. S. (2019). Fast and accurate relatedness estimation from high-throughput sequencing data in the presence of inbreeding. GigaScience, 8(5), 1–9. doi: 10.1093/gigascience/giz034
Hellenthal, G., Busby, G. B. J., Band, G., Wilson, J. F., Capelli, C., Falush, D., & Myers, S. (2014). A Genetic Atlas of Human Admixture History. Science, 343(6172), 747–751. doi: 10.1126/science.1243518
Huisman, J. (2017). Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond. Molecular Ecology Resources, 17(5), 1009–1024. doi: 10.1111/1755-0998.12665
Jombart, T., Devillard, S., Dufour, A.-B., & Pontier, D. (2008). Revealing cryptic spatial patterns in genetic variability by a new multivariate method. Heredity, 101, 92–103. doi: 10.1038/hdy.2008.34
Jombart, T., Devillard, S., & Balloux, F. (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics, 11(1), 94. doi: 10.1186/1471-2156-11-94
Joseph, T. A., & Pe’er, I. (2019). Inference of Population Structure from Time-Series Genotype Data. American Journal of Human Genetics, 105(2), 317–333. doi: 10.1016/j.ajhg.2019.06.002
Kofler, R., Orozco-terWengel, P., De Maio, N., Pandey, R. V., Nolte, V., Futschik, A., … Schlötterer, C. (2011). PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS ONE, 6(1), e15925. doi: 10.1371/journal.pone.0015925
Kofler, R., Pandey, R. V., & Schlötterer, C. (2011). PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics, 27(24), 3435–3436. doi: 10.1093/bioinformatics/btr589
Korneliussen, T. S., Albrechtsen, A., & Nielsen, R. (2014). ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics, 15(1), 356. doi: 10.1186/s12859-014-0356-4
Link, V., Kousathanas, A., Veeramah, K., Sell, C., Scheu, A., & Wegmann, D. (2017). ATLAS: Analysis Tools for Low-depth and Ancient Samples. BioRxiv. doi: 10.1101/105346
Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W.-M. (2010). Robust relationship inference in genome-wide association studies. Bioinformatics, 26(22), 2867–2873. doi: 10.1093/bioinformatics/btq559
McVean, G., Awadalla, P., & Fearnhead, P. (2002). A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics, 160(3), 1231–1241.
Myers, S., Bottolo, L., Freeman, C., McVean, G., & Donnelly, P. (2005). A Fine-Scale Map of Recombination Rates and Hotspots Across the Human Genome. Science, 310(5746), 321–324. doi: 10.1126/science.1117196
Petkova, D., Novembre, J., & Stephens, M. (2015). Visualizing spatial population structure with estimated effective migration surfaces. Nature Genetics, 48(1), 94–100. doi: 10.1038/ng.3464
Pfeifer, B., Wittelsburger, U., Ramos-Onsins, S. E., & Lercher, M. J. (2014). PopGenome: An efficient swiss army knife for population genomic analyses in R. Molecular Biology and Evolution, 31(7), 1929–1936. doi: 10.1093/molbev/msu136
Price, A., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8), 904–909. doi: 10.1038/ng1847
Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2), 945–959. Retrieved from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1461096&tool=pmcentrez&rendertype=abstract
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., … Sham, P. C. (2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics, 81(3), 559–575. doi: 10.1086/519795
Raj, A., Stephens, M., & Pritchard, J. K. (2014). FastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics, 197(2), 573–589. doi: 10.1534/genetics.114.164350
Salter-Townshend, M., & Myers, S. (2019). Fine-Scale Inference of Ancestry Segments Without Prior Knowledge of Admixing Groups. Genetics, 212(3), 869–889.
Takezaki, N., Nei, M., & Tamura, K. (2010). POPTREE2: Software for constructing population trees from allele frequency data and computing other population statistics with windows interface. Molecular Biology and Evolution, 27(4), 747–752. doi: 10.1093/molbev/msp312
Wang, J. (2019). Pedigree reconstruction from poor quality genotype data. Heredity, 122(6), 719–728. doi: 10.1038/s41437-018-0178-7
Zheng, X., Levine, D., Shen, J., Gogarten, S. M., Laurie, C., & Weir, B. S. (2012). A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics, 28(24), 3326–3328. doi: 10.1093/bioinformatics/bts606