Detecting ancient balancing selection and multi-epoch model of population size change

Today I am updating the tables by adding two new methods, BetaScan and Stairway.

  • BetaScan: long-term balancing selection

There are not many methods to detect ancient balancing selection. You can use a combination of summary statistics like Tajima’s D or nucleotide diversity, and check for outliers, but these statistics are prone to produce false negatives and false positives. Besides, ancient balancing selection can be hard to detect as its effects do not extend far from the site under selection since recombination had much more time to break associations between alleles. ARGWeaver is an interesting option to detect alleles that are extremely old, but is time-consuming at genomic scales.
BetaScan (Siewert and Voight 2017) is an interesting new method that uses predictions about the frequency spectrum of alleles linked to balanced polymorphisms (see Figure 1 in the paper for more details). In my experience it works rather well (I find many peaks at immune genes in multiple species, which makes sense from a biological perspective). It seems rather robust to demography. If you have a good idea of the demographic history for your species, it may be worth performing coalescent simulations and estimate the rate of false positives. Note that the method has been used on panmictic populations.
Quick tip: If you have an idea of the recombination rate/bp/generation (rho), then for a time of T generations since the balanced polymorphism arose, the 95% percentile of the length distribution of ancestral fragments on each side of the selected site is estimated by L=-log(0.05)/(T*rho). This value corrresponds to half the size of the window that should be used by the algorithm (see Sup. Mat. in the original paper). Adjust the size depending on how old you expect the balanced polymorphisms to be.

  • Stairway: Multi-epoch model of population size changes

Methods that use the allele frequency spectrum to infer demography often require that the number of population size changes be defined a priori. Stairway (Liu and Fu, 2015) provides a way to use SNP frequency data to produce plots that remind skyline plots from, e.g., BEAST. You will need an estimate of the number of invariant sites (e.g. the number of bases covered by RAD-sequencing or the number of genomic regions with high enough depth of coverage) and the allele frequency spectrum of (filtered) SNPs. It is fast, does not require phasing, but should not be trusted for really old events. In my experience it seems to work on RAD-seq data, giving results that were consistent with the ML method implemented in fastsimcoal.

  • New version of fastsimcoal

Note that there is a new version of fastsimcoal (v 2.6). Two major changes include:
1) the possibility to exclude singletons for optimizing demographic parameters (dadi did that already), which should limit the impact of badly called SNPs on inference.
2) the possibility to specify an average inbreeding rate for each population. This should be useful for scientists working on organisms that like selfing…

 

References

Liu X, Fu Y-X (2015). Exploring population size changes using SNP frequency spectra. Nat Genet 47: 555–559.

Siewert KM, Voight BF (2017). Detecting Long-Term Balancing Selection Using Allele Frequency Correlation. Mol Biol Evol 34: 2996–3005.

Scripts for incorporating heterogeneous gene flow in models of divergence (for dadi and ABC)

Today I provide links to scripts and tutorials that should help people interested in incorporating heterogeneous migration rates in their isolation-with-migration models.
It is increasingly acknowledged that gene flow between two populations is heterogeneous across the genome (“semi-permeable genome”), since effective migration rates are reduced around genes under positive selection or involved in reproductive isolation. Besides, speciation and differentiation often involves episodes of secondary contact or progressive interruption of gene flow. Incorporating these aspects improves model fitting to observed data, and can improve detection of genomic region that are selected or introgressed (see for example Christe et al. 2016).
Below you will find scripts for dadi and to perform ABC inference (using either ms, msms, or msnsam). I also provide the links in the tables.

  • dadi

A zip file containing scripts and tutorials to compare various models of divergence with heterogeneous gene flow, kindly provided by Christelle Fraïsse (Merci Christelle !):
tutoriel_dadi and a modified version of dadi that is needed for the annealing step dadi-1.7.0_modif
You can find more scripts in this dryad folder, that correspond to analyses performed in Christe et al. 2016.

And a website that also provides details about how to implement heterogeneous gene flow in dadi:

Dadi inference

  • ABC

This link leads to the github repository for the recent paper by Camille Roux et al. (2016), that used ABC to compare 14 different models of divergence using sequence data. The key script here is priorgen.py, that generates priors for the 14 models and can be piped through ms/msms/msnsam. The script popPhyl2ABC_v2.py converts a phased fasta file into input files required by priorgen.py. Look at the example folder to check the way the fasta file is organized. If I get it right, sequence header names should follow this convention: >Gene_name|Species|Individual|Allele_number

Thanks to Christelle Fraïsse , Camille Roux and P-A Gagnaire for putting their scripts online.

More posts and updates to come, with a new version of fastsimcoal that can incorporate information about inbreeding, tests for ancient balancing selection, inferring population structure from RAD-seq and methods for inferring pedigrees using population genomics data…

 

References

Christe, C., Stolting, K. N., Paris, M., Frayisse, C., Bierne, N., & Lexer, C. (2016). Adaptive evolution and segregating load contribute to the genomic landscape of divergence in two tree species connected by episodic gene flow. Molecular Ecology. https://doi.org/10.1111/mec.13765

Roux, C., Fraïsse, C., Romiguier, J., Anciaux, Y., Galtier, N., & Bierne, N. (2016). Shedding Light on the Grey Zone of Speciation along a Continuum of Genomic Divergence. PLOS Biology. https://doi.org/10.1371/JOURNAL.PBIO.2000234

New methods (species trees with c-genes) and new preprint on bioRxiv

Today I am adding three methods that all aim at inferring species trees from short non-recombining fragments (c-genes), namely SVDQuartets (Chifman and Kubatko, 2014), ASTRAL-2 (Mirarab and Warnow, 2015) and NJst (Liu and Yu, 2011). The first one works in PAUP*, the last one works in R, and all use a coalescence framework. A recent study compared their performance (Chou et al. 2015) and suggests ASTRAL-2 works better in a context of high Incomplete Lineage Sorting. I am not usually a huge fan of phylogenetic methods at the microevolutionary scale, and tend to use population genetics tools such as fastsimcoal or dadi, but this class of methods is relevant in a context where there is sufficient divergence or structure between populations (it does not have to be proper “species”, see the recent paper in PNAS by Sukumaran and Knowles (2017) about this. Another rabbit hole…).
Also, since these methods use short non-recombining fragments, there should work well with RAD-markers or GBS data.

I also added a new version of the paper on BioRxiv. Hopefully you find it useful.
Good luck!

References

Chifman J, Kubatko L (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics 30: 3317–3324.

Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, et al. (2015). A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16: S2.

Liu L, Yu L (2011). Estimating species trees from unrooted gene trees. Syst Biol 60: 661–667.

Mirarab S, Warnow T (2015). ASTRAL-II: Coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics 31: i44–i52.

Sukumaran J, Knowles LL (2017). Multispecies coalescent delimits structure, not species. Proc Natl Acad Sci 114: 1607–1612.

Demographic history: SMC++

A new method has been published last december that I did not see at the time, SMC++ (Terhorst et al. 2016). It works in a way similar to MSMC, allowing to infer variation in population size with time. But instead of using phased data like MSMC, it can use unphased data and many more genomes for inference in a reasonable amount of time. It can also infer divergence time between populations, but assumes a clean split. They plan to include gene flow in next versions. Note that the method assumes that the ancestral allele is the reference allele, so you might have to play a bit with outgroups to perform clean analyses or use the option –polarization-error 0.5 which makes the software aware that the ancestral allele is unknown.
Basically, the method is similar to PSMC in spirit, relying on hidden markov chains (HMM) and coalescence theory to describe the linear structure of DNA variation. It however takes into account the information from other individuals in a population in the form of conditional allele frequency spectrum (emitting for each site in a specific, ‘distinguished’ individual the allele frequency spectrum for other individuals).
I reproduce below the Methods paragraph summarizing the advantages of the method:
” SMC++ stands for ‘sequential Markov coalescent + plenty of unlabeled samples’. SMC++ unites the PRF and coalescent HMM approaches, combining the strengths of each while overcoming several of their limitations. The inclusion of ‘unlabeled samples’ in the standard coalescent HMM is achieved via novel theoretical results on what we term CSFS, the sample frequency spectrum conditioned on the coalescence time and allelic state of a distinguished diploid individual. In comparison to existing methods, the main advantages of SMC++ are as follows.

  1. Scalability. SMC++ can analyze hundreds of individuals at a time while requiring a modest amount of memory and processing time. Analyzing 100 human genomes takes roughly 1 h on a laptop.
  2. Accuracy. By accommodating larger sample sizes, SMC++ has greatly improved power to infer demographic events, particularly in the recent past.
  3. Phase invariance. SMC++ only requires unphased sequence data as input (that is, results do not depend on phasing).
  4. Regularity. SMC++ uses cubic splines to enforce a smoothness constraint on the inferred demographies. In comparison with existing methods, the resulting estimates exhibit far less variance, with only a minimal increase in bias. “

Thanks to Anne Roulin and Joseph Manthey for pointing this to me.

Reference: Terhorst J, Kamm JA, Song YS (2016). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet 49: 303–309

New methods added: KimTree, selEstim, ABLE

Today I am adding a few methods that aim at resolving relationships between populations, estimating selection coefficients and estimate parameters for arbitrarily complex scenarios.

SelEstim and Kimtree both relie on diffusion approximations, which model how alleles diffuse in a set of populations forward in time (whereas coalescent simulations work backward in time, something that has puzzled so many of us at first). They do not take into account mutation and are therefore suited for the study of adaptation in recently diverged populations.  Kimtree (Gautier and Vitalis, 2013) allows to estimate the divergence time between several populations, given a prior tree. It is robust to gene flow and variation in population size . The time is estimated as tau, which depends on the effective population size N and the time in demographic units t. It is therefore necessary to estimate the effective population size using other methods to get t in years/generations. An interesting aspect of this software is that it allows to estimate the support of the topology given as a prior using deviance information criterion (DIC). This is useful when trying to get the order at which populations diverge before using demographic models in, e.g., fastsimcoal or IMa2 that require prior information about the topology.

selEstim (Vitalis et al., 2014) is a method to detect Fst outliers that has been shown to perform better than Bayescan and provides an estimate of the selection coefficient, which I think is its big strength. Note however that it assumes an island model, even if it seems rather robust to false positives if a proper calibration procedure is used. I would suggest using the R function simulate.baypass() in BayPass (Gautier, 2015) to easily produce pseudo-observed datasets in the right format. This way you can also run a BayPass analysis and compare the results.

These two methods have been developed by the same people as BayPass (Gautier, 2015), and I would advise using the three of them in a single pipeline since they use the same input for allele counts. Note that they handle pooled data. This is a nice and powerful toolbox to make sense of new datasets, especially for pooled data and recent divergence. Note also that there is a much faster implementation of REHH now available in R to perform haplotype-based tests of selection (Gautier et al., 2017).

The last method is ABLE (Beeravolu et al., 2016), currently under review but extremely promising since it allows estimating parameters from any arbitrary complex scenario using whole genome data, in a way similar to fastsimcoal, except that it takes into account the linkage disequilibrium within blocks of a given size, which provides more information and can be useful to discriminate between different demographic scenarios. Long story short, it summarizes the data using blockwise site frequency spectrum (SFS), counting haplotypes to feed the SFS instead of using single SNPs. Since it models non recombining blocks of a size specified by the user, it also allows analyzing RAD-seq data. Note that it does NOT require phasing. It estimates an approximate composite likelihood that can be used to estimate models support. I would be personally careful when computing AIC based on this when differences in likelihood are small between models, since it can overestimate the support for the best model (check discussion in Excoffier 2013). One should perform the likelihood estimation 10-100 times and compare the distributions for each model, like what is done in the paper. Note that the syntax is mostly based on ms, and that it is apparently rather sensitive to small mistakes in data format, such as blank lines or white spaces…

Tables have been updated to take these additions into account. Do not hesitate to contact me for any comment or suggestion !

Thanks to Renaud Vitalis (selEstim and Kimtree) and Isaac Overcast (ABLE) for pointing these tools to me!
Enjoy playing around with these new toys…

References

Beeravolu CR, Hickerson MJ, Frantz LAF, Lohse K (2016). Approximate Likelihood Inference of Complex Population Histories and Recombination from Multiple Genomes. bioarXiv: 1–31.

Excoffier L, Dupanloup I, Huerta-Sanchez E, Sousa VC, Foll M (2013). Robust Demographic Inference from Genomic and SNP Data. PLoS Genet 9.

Gautier M (2015). Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates. Genetics 201: 1555–1579.

Gautier M, Klassmann A, Vitalis R (2017). rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Mol Ecol Resour 17: 78–90.

Gautier M, Vitalis R (2013). Inferring population histories using genome-wide allele frequency data. Mol Biol Evol 30: 654–68.

Vitalis R, Gautier M, Dawson KJ, Beaumont MA (2014). Detecting and measuring selection from gene frequency data. Genetics 196: 799–817.

New method for landscape genomics and detection of selection: SAMBADA

This method looks interesting, although in the study presenting it there was little overlap with other classical methods such as LFMM or BayEnv. An interesting feature of this software is that it outputs information about spatial autocorrelation of genotypes (using Moran’s I) to correct for spurious association. Population structure can also be included (multivariate analysis) to correct for relatedness between samples. Also, the software allows for converting data to a suitable format from PLINK input files (this private format is something that would have been nice to avoid but I am picky). Having read the paper, I would suggest to use it in combination with other landscape genomics programs (LFMM, BayPass…) and look at common candidates. It makes a nice addition to the landscape genomics toolbox.

You can check the paper here (open access): http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12629/full

And thanks to Anne Roulin for pointing this software to me 🙂

New method added (Trinculo)

Hi,

I recently performed genome-wide association with phenotypes that were not continuous but fell in categories. I therefore looked for tests that could extend the classical association test for binary phenotypes (like 0/1, resistant/susceptible) to several categories (in my case a combination of binary phenotypes, which could be coded as 0001, 0011, 0111 etc.). While I struggled to compile it on the servers I am using, I have been really satisfied with the results. Note that the software is able to perform fine-mapping and assesses which specific category is associated with SNP variation. It also allows comparing outputs based on frequentist and bayesian approaches. Enjoy !

You can downloadTrinculo for Mac and Linux here : https://sourceforge.net/projects/trinculo/files/?source=navbar


Yann