A new method was published last December that I did not see at the time: SMC++ (Terhorst et al. 2016). It works in a way similar to MSMC, inferring changes in population size through time. But instead of requiring phased data like MSMC, it can use unphased data and many more genomes while still running in a reasonable amount of time. It can also infer the divergence time between populations, but assumes a clean split. The authors plan to include gene flow in future versions. Note that the method assumes that the ancestral allele is the reference allele, so you might have to play a bit with outgroups to perform clean analyses, or use the option --polarization-error 0.5, which tells the software that the ancestral allele is unknown.
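To give an intuition for what a polarization error of 0.5 means, here is a minimal Python sketch. It is not SMC++'s internal code: the function name `mix_polarization` and the toy spectrum values are invented for illustration. The assumption is simply that each entry of a frequency spectrum gets blended with its mirror entry, weighted by the probability that the allele labelled ancestral is in fact the derived one.

```python
def mix_polarization(spectrum, error):
    """Blend each spectrum entry with its mirror entry.

    spectrum[k] is the probability of observing k derived copies among
    n sampled chromosomes (k = 0..n). `error` is the assumed probability
    that the allele labelled ancestral is actually the derived one.
    """
    n = len(spectrum) - 1
    return [
        (1 - error) * spectrum[k] + error * spectrum[n - k]
        for k in range(n + 1)
    ]


# Hypothetical spectrum for n = 4 chromosomes (values invented for illustration).
toy_spectrum = [0.50, 0.25, 0.125, 0.075, 0.05]

unknown_ancestral = mix_polarization(toy_spectrum, 0.5)  # ancestral state unknown
trusted_reference = mix_polarization(toy_spectrum, 0.0)  # reference allele trusted

print(unknown_ancestral)
print(trusted_reference)
```

With an error of 0.5 the blended spectrum becomes symmetric, so swapping the ancestral and derived labels no longer changes anything. With an error of 0 the spectrum is left untouched.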
Basically, the method is similar to PSMC in spirit, relying on hidden Markov models (HMMs) and coalescent theory to describe the linear structure of DNA variation. However, it also takes into account the information from other individuals in the population in the form of a conditional allele frequency spectrum: for each site in a specific, ‘distinguished’ individual, the model emits the allele frequency spectrum of the other individuals.
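To make the idea of a ‘distinguished’ individual more concrete, here is a small Python sketch of the kind of per-site observation such a model works with. This is an illustration only, not SMC++'s actual input format or code: the function `csfs_observations`, the sample names and the genotypes are all hypothetical, and missing data is ignored. Genotypes are unphased counts of derived alleles (0, 1 or 2).

```python
def csfs_observations(genotypes, distinguished):
    """Return one (a, b) pair per site.

    genotypes: dict mapping sample name -> list of unphased genotypes,
               each coded as the number of derived copies (0, 1 or 2).
    a: derived copies carried by the distinguished individual.
    b: derived copies summed over all other individuals.
    """
    others = [name for name in genotypes if name != distinguished]
    n_sites = len(genotypes[distinguished])
    observations = []
    for site in range(n_sites):
        a = genotypes[distinguished][site]
        b = sum(genotypes[name][site] for name in others)
        observations.append((a, b))
    return observations


# Hypothetical data: three diploid individuals, four sites.
toy_genotypes = {
    "ind1": [0, 1, 2, 0],
    "ind2": [0, 1, 0, 0],
    "ind3": [1, 2, 0, 0],
}

observations = csfs_observations(toy_genotypes, "ind1")
print(observations)
```

Because only allele counts are used, no phasing is needed: the distinguished individual contributes its genotype, and everyone else contributes a single count per site.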
I reproduce below the Methods paragraph summarizing the advantages of the method:
“SMC++ stands for ‘sequential Markov coalescent + plenty of unlabeled samples’. SMC++ unites the PRF and coalescent HMM approaches, combining the strengths of each while overcoming several of their limitations. The inclusion of ‘unlabeled samples’ in the standard coalescent HMM is achieved via novel theoretical results on what we term CSFS, the sample frequency spectrum conditioned on the coalescence time and allelic state of a distinguished diploid individual. In comparison to existing methods, the main advantages of SMC++ are as follows.
- Scalability. SMC++ can analyze hundreds of individuals at a time while requiring a modest amount of memory and processing time. Analyzing 100 human genomes takes roughly 1 h on a laptop.
- Accuracy. By accommodating larger sample sizes, SMC++ has greatly improved power to infer demographic events, particularly in the recent past.
- Phase invariance. SMC++ only requires unphased sequence data as input (that is, results do not depend on phasing).
- Regularity. SMC++ uses cubic splines to enforce a smoothness constraint on the inferred demographies. In comparison with existing methods, the resulting estimates exhibit far less variance, with only a minimal increase in bias.”
Thanks to Anne Roulin and Joseph Manthey for pointing this out to me.
Reference: Terhorst J, Kamm JA, Song YS (2016). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat Genet 49: 303–309