## Relaxed phylogenetics and dating with confidence

In phylogenetics, the unrooted model of phylogeny and the strict molecular clock model are two extremes of a continuum. Despite their dominance in phylogenetic inference, it is evident that both are biologically unrealistic and that the real evolutionary process lies between these two extremes. Fortunately, intermediate models employing relaxed molecular clocks have been described. These models open the gate to a new field of "relaxed phylogenetics." Here we introduce a new approach to performing relaxed phylogenetic analysis. We describe how it can be used to estimate phylogenies and divergence times in the face of uncertainty in evolutionary rates and calibration times. Our approach also provides a means for measuring the clocklikeness of datasets and comparing this measure between different genes and phylogenies. We find no significant rate autocorrelation among branches in three large datasets, suggesting that autocorrelated models are not necessarily suitable for these data.

In the accompanying figure, the three nodes with normally distributed calibration priors are indicated by orange bars. For the marsupial dataset, the mean estimate suggests a small degree of autocorrelation, but it is not significantly different from zero mean: 0.

We would expect that larger datasets, particularly of diverse organisms that vary considerably in life-history traits or proofreading mechanisms, might exhibit substantial autocorrelation. Five large datasets were obtained from previous studies: (1) amino acid alignments of genes from eight bacterial species; (2) nucleotide alignments of genes from eight yeast species [ 37 ]; (3) nucleotide alignments of 61 genes from nine plants; (4) amino acid alignments of 99 genes from nine metazoans; and (5) nucleotide alignments of non-coding sequence from nine primates.

The bacterial dataset was a subset of a larger dataset comprising genes representing 45 species of bacteria [ 38 ]. Eight species of Proteobacteria were selected due to their close phylogenetic relationship, as well as representation among the genes.

A total of genes spanned all eight of the species that were retained for analysis. The plant dataset was taken from a larger dataset comprising 61 genes from 12 taxa [ 39 ]. Nine species were selected in accordance with the stipulation that their phylogeny was known with almost complete certainty. The metazoan alignment was a subset of a larger dataset comprising genes from 36 eukaryotes [ 40 ].

Nine metazoan taxa were selected from this dataset so that the tree relating the selected taxa was not in dispute. Genes that were unavailable for one or more of the nine selected taxa were removed, leaving 99 genes in the final dataset used for phylogenetic analysis.

The primate dataset was a subset of a 2, bp alignment of non-coding DNA from 19 mammals [ 41, 42 ]. The non-primates were removed from the alignment, and sites with a gap in any sequence were removed.

The remaining alignment was broken up into alignments of equal length bp. These individual alignments were each intended to represent the data produced by an ordinary phylogenetic study in which a gene fragment has been sequenced from a number of organisms. The question being asked is if we only have one such alignment, how well are we able to reconstruct the phylogenetic relationships of the organisms?

To assess the accuracy of the phylogenetic methods being tested, estimates of the phylogeny need to be tested against the true phylogeny for each dataset. In order to obtain the best possible estimates of the phylogeny for each dataset, the alignments in each of the five datasets were concatenated. The five concatenated alignments were analyzed under the HKY model of nucleotide substitution with gamma-distributed rate variation among sites and a proportion of invariant sites.

Each analysis was run for 5, MCMC steps, with a discarded burn-in of steps. The trees inferred from the plant, metazoan, and primate datasets agree well with the established trees for these groups. However, the bacterial and yeast phylogenies are relatively uncertain [ 37, 43 ], and the trees inferred from the concatenated alignments are probably the best estimates currently available. Even if these trees turn out to be different from the true evolutionary histories of the studied organisms, we can at least assume that the trees used in this analysis are very near in tree space to the truth, and therefore we would expect our results to be little affected.

The yeast tree inferred in this study from concatenated data agreed with that published by Rokas et al. The HKY model of nucleotide substitution was assumed, with gamma-distributed rate variation among sites and a proportion of invariant sites.

Most analyses were run for MCMC steps with 50, burn-in steps, although some datasets required 1, steps with burn-in steps. All analyses were checked for convergence using the program Tracer 1. The terms accuracy and precision have statistical definitions, but we take liberties here to facilitate easier interpretation.

All three methods performed poorly in analyses of the bacterial and metazoan datasets. This result is not surprising, however, considering the substantial time depth of these trees.


The uncorrelated relaxed-clock method produced the most accurate estimates of phylogeny overall (Table 4). It outperformed other methods in analyses of the bacterial, yeast, metazoan, and primate data, but the molecular clock method was the most accurate in the analysis of the plant data. In the case of the primate data, all three methods were similarly accurate in estimating phylogenies.

This is probably because the data were relatively clocklike, with the molecular clock assumption rejected for less than a third of the alignments. For all of the datasets that were analyzed, the phylogenetic estimates made using a strict molecular clock were the most precise. Under conditions in which the data more or less conform to a molecular clock, such as the primate data examined in this study, the molecular clock method should be used due to its superior precision.

The relaxed phylogenetics methods described here co-estimate phylogeny and divergence times under a relaxed molecular clock model, thus providing an integrated framework for biologists interested in reconstructing ancestral divergence dates and phylogenetic relationships. The method presented here naturally incorporates the time-dependent nature of the evolutionary process without assuming a strict molecular clock. One of the byproducts of estimating a phylogeny using a relaxed clock is an estimate of the position of the root of the tree, even in the absence of a non-reversible model of substitution [ 44, 45 ] or a known outgroup.

Recently, a number of authors have begun to investigate the impact of various forms of model misspecification on the accuracy of posterior probabilities of clade support [ 46 - 48 ].

In a Bayesian framework, the absence of a molecular clock assumption (either strict or relaxed) represents a prior belief that the tree topology provides no information about relative branch lengths.


We suggest that this represents a poor prior belief, and that Bayesian estimation of phylogeny from short sequences may be biased when the time-dependency of the evolutionary process is not modeled. We would argue that the complex time-dependency of the evolutionary process should not be ignored a priori as has been common practice, but should instead be carefully modeled. This paper represents a first attempt at incorporating a relaxed-clock model into a Bayesian method of phylogenetic inference.

We have presented a large analysis of bacterial, yeast, 61 plant, 99 metazoan, and primate alignments that overall suggests the relaxed-clock models are both more accurate and more precise at estimating phylogenetic relationships than current unrooted methods implemented in MrBayes and other programs. Overall, these initial results suggest that a relaxed phylogenetic approach may be the most appropriate even when phylogenetic relationships are of primary concern and the rooting and dating of the tree are of less interest.

The molecular clock assumption can be relaxed in a variety of ways [ 13 - 15, 17, 49 - 52 ]. To convert a tree from units of time to molecular evolutionary units, rates are assigned either to branches [ 15, 17 ] or to nodes [ 53, 54 ]. The first such model to be described [ 15 ] assigned rates to the midpoints of branches and assumed a lognormal prior distribution relating the rate at the midpoint of the ancestral branch to the rate at the midpoint of the derived branch.

Another interesting model is the exponential distribution model of Aris-Brosou and Yang [ 17 ], which employed an exponential prior distribution on rate $r$ with a mean (and therefore standard deviation) equal to the ancestral rate $r_A$, and with no dependence on the time between the two rates.

This second model represents a more punctuated view of change in evolutionary rate, so that only the number of branching events, and not the length of time between events, determines the amount of change in evolutionary rate. In all autocorrelated relaxed-clock models, an additional assumption must be made about the rate at the root. For models that assign rates to nodes, it is necessary to treat the root node in a special way, as it does not have a parent node [ 15 ].
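The contrast between the two autocorrelated priors can be illustrated with a small sampling sketch. It is not the implementation of either cited model. The function names and the parameter `nu` are hypothetical, and the lognormal version assumes that the variance of the log-rate grows linearly with the elapsed time, which is one common way of expressing time-dependent autocorrelation.

```python
import math
import random


def sample_lognormal_autocorrelated(r_ancestral, elapsed_time, nu, rng):
    """Draw a descendant rate whose logarithm is normal around log(r_ancestral).

    Assumption (not from the source): the variance of the log-rate is
    nu * elapsed_time, so more time allows more rate change.
    """
    sd = math.sqrt(nu * elapsed_time)
    return math.exp(rng.gauss(math.log(r_ancestral), sd))


def sample_exponential_autocorrelated(r_ancestral, rng):
    """Draw a descendant rate from an exponential with mean r_ancestral.

    The elapsed time is deliberately not an argument: only the branching
    event matters, which is the "punctuated" view of rate change.
    """
    return rng.expovariate(1.0 / r_ancestral)


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_lognormal_autocorrelated(0.01, elapsed_time=5.0, nu=0.1, rng=rng))
    print(sample_exponential_autocorrelated(0.01, rng=rng))
```

In the lognormal sketch a longer interval widens the distribution of the descendant rate, whereas the exponential sketch gives the same spread regardless of how much time separates the two rates.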

For models that assign rates to branches, a branch above the root is implied and must be assigned a rate. In the autocorrelated relaxed-clock models that have been described, including the commonly used lognormal model [ 15, 17, 55 ], it is also necessary to specify the degree of autocorrelation as a prior. Other prior models of rate change, such as the gamma distribution model and the Ornstein-Uhlenbeck process [ 55 ], require more than one hyperparameter to be specified, so that selecting suitable values for a particular dataset may be an even more difficult exercise.

The effects of varying these hyperparameters are poorly understood [ 22 ], but there is likely to be a considerable impact on posterior estimates of rates.

We present an alternative to the autocorrelated prior in which there is, a priori, no correlation of the rates on adjacent branches of the tree.


Instead we propose a model in which the rate on each branch of the tree is drawn independently and identically from an underlying rate distribution. We investigate two candidates for the rate distribution among branches: an exponential distribution and a lognormal distribution (Equations 2 and 3). These uncorrelated priors can be framed in a hierarchical Bayesian framework, as with the autocorrelated priors. In this scenario the exponential version of the uncorrelated relaxed clock would have a prior probability on the rate vector equal to the product of independent exponential densities, one for each branch.

This prior reflects a punctuated view of change in evolutionary rate, so that the prior expectation of the rate at all branches is the same, with no autocorrelation between adjacent branches.
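Because the branch rates are independent and identically distributed under this prior, the log prior of a whole rate vector is simply a sum of per-branch log densities. The following is a minimal sketch under that assumption. The function names and the parameterization of the lognormal (mean `mu` and standard deviation `sigma` of the log-rate) are illustrative choices, not taken from the source.

```python
import math


def log_prior_exponential(rates, mean):
    """Log density of i.i.d. exponential branch rates with the given mean."""
    if mean <= 0 or any(r < 0 for r in rates):
        return float("-inf")
    return sum(-math.log(mean) - r / mean for r in rates)


def log_prior_lognormal(rates, mu, sigma):
    """Log density of i.i.d. lognormal branch rates.

    mu and sigma are the mean and standard deviation of log(rate);
    this parameterization is an assumption made for the sketch.
    """
    total = 0.0
    for r in rates:
        if r <= 0:
            return float("-inf")
        z = (math.log(r) - mu) / sigma
        total += -math.log(r * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
    return total


if __name__ == "__main__":
    branch_rates = [0.8, 1.1, 0.95, 1.3]
    print(log_prior_exponential(branch_rates, mean=1.0))
    print(log_prior_lognormal(branch_rates, mu=0.0, sigma=0.5))
```

Neither function looks at which branches are adjacent, so permuting the rates over the tree leaves the prior unchanged. This is what "no autocorrelation a priori" means in practice.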

Notice that the posterior distribution of rates among branches need not be the same as the prior in this setup and that autocorrelation may exist in the posterior, even though it is not specified in the prior. Instead of framing Equations 2 and 3 as prior distributions in a hierarchical Bayesian framework, they can instead be reformulated as a full likelihood model. In this case, the branch rates are not independent random variables with a prior distribution, but are instead constrained so as to fit one of the distributions in Equations 2 and 3 exactly.

The parameters of the rate distribution are no longer hyperparameters of a prior distribution, but are instead parameters of the likelihood model. This is closely analogous to the common way in which rate heterogeneity among sites is treated [ 28 ]. A particular requirement of Bayesian phylogenetic inference is the responsibility given to users to specify a prior probability distribution on the shape of the phylogeny (node ages and branching order). This can be either a benefit or a burden, largely depending on whether an obvious prior distribution presents itself for the data at hand.

For example, the coalescent prior [ 56, 57 ] is a commonly used prior for population-level data and has been extended to include various forms of demographic functions [ 58, 59 ], sub-divided populations [ 60 ], and other complexities. Traditional speciation models such as the Yule process [ 61 ] and various birth-death models [ 62, 63 ] can also provide useful priors for species-level data.

Such models generally have a number of hyperparameters (for example, effective population size, growth rate, or speciation and extinction rates), which, under a Bayesian framework, can be sampled to provide a posterior distribution of these potentially interesting biological quantities.

In some cases, the choice of prior on the phylogenetic tree can exert a strong influence on inferences made from a given dataset [ 64 ].

The sensitivity of inference results to the prior chosen will be largely dependent on the data analyzed and few general recommendations can be made. It is, however, good practice to perform the MCMC analysis without any data in order to sample wholly from the prior distribution.

This distribution can be compared to the posterior distribution for parameters of interest in order to examine the relative influence of the data and the prior (Figure 3).

The full Bayesian sequence analysis with an uncorrelated relaxed-clock model allows the co-estimation of substitution parameters, relaxed-clock parameters, and the ancestral phylogeny. The posterior distribution (Equation 5) is proportional to the product of the likelihood of the sequence data and the prior distributions on the tree and the model parameters. If, for example, the divergence times are of primary interest then the other sampled parameters can be thought of as nuisance parameters, and vice versa. The formulation in Equation 5 implies that the branch rates could be integrated analytically in the Felsenstein likelihood.

Although this could be accomplished relatively easily by discretizing the rate distribution and averaging the likelihood over the rate categories on each branch, we elected to do the integration using MCMC.

During the calculation of the likelihood the rate category c is converted to a rate by taking the corresponding quantile of the underlying rate distribution. This discretization of the underlying rate distribution is illustrated in Figure 5 for a lognormal distribution with 12 rate categories (sufficient for a tree of seven tips). To integrate the branch rates out, the assignment of rate categories c to branches was sampled via MCMC.
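The discretization can be sketched as follows. A rooted binary tree with n tips has 2n - 2 branches, which is why 12 categories suffice for seven tips. The sketch assumes that category c (counted from 1) maps to the quantile at the midpoint of its probability interval, (c - 1/2) / (number of categories). That midpoint rule, the function names, and the lognormal parameterization by `mu` and `sigma` are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def branch_count(n_tips):
    """Number of branches in a rooted binary tree with n_tips tips."""
    return 2 * n_tips - 2


def lognormal_category_rates(n_categories, mu, sigma):
    """Rates for categories 1..n_categories of a discretized lognormal.

    Category c is assigned the quantile at the midpoint of its probability
    interval (an assumed rule); mu and sigma describe log(rate).
    """
    standard_normal = NormalDist()
    rates = []
    for c in range(1, n_categories + 1):
        quantile = (c - 0.5) / n_categories
        rates.append(math.exp(mu + sigma * standard_normal.inv_cdf(quantile)))
    return rates


if __name__ == "__main__":
    k = branch_count(7)
    for c, rate in enumerate(lognormal_category_rates(k, mu=0.0, sigma=0.5), start=1):
        print(c, round(rate, 4))
```

With one category per branch, an assignment of categories to branches is a permutation, so the set of branch rates always matches the discretized distribution exactly while the MCMC explores which branch gets which rate.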

One issue that remains largely unresolved in this piece of work is the issue of model comparison and model selection. Within a Bayesian framework, Bayes factors are usually regarded as the correct way to deal with model selection. Typically this involves a technique known as reversible-jump MCMC. We have not implemented this, but we do plan on developing a reversible-jump MCMC version of this framework in the future. Typically model selection is easy when one model produces a much better fit.

Because all of the models for rate variation examined here differ by one free parameter at most, a simple comparison of the average log posterior probabilities will usually be revealing.

It is only when the log posteriors are very similar and the results are qualitatively different between the two models that model selection becomes an issue.

This combination of conditions did not occur in any of our real datasets.

The MCMC must sample the tree topology, the divergence times, and the individual parameters of the substitution model and tree prior(s).

For example, some moves propose local changes to the tree topology while keeping the coalescent interval and all the other parameters constant. Some moves propose a change to a single substitution parameter (such as the shape parameter of the gamma distribution) while keeping everything else constant. The general scheme is to (1) choose a random move with a probability proportional to a specified weight, then (2) apply the move to the current state, and (3) assess the relative score of the new state.

The new state is adopted if it has a higher posterior probability; otherwise it is adopted with probability equal to the ratio of its posterior probability to the posterior probability of the previous state.
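The scheme described above can be sketched on a toy one-dimensional target. This is an illustration of the acceptance rule only: the target, the two moves, and all names are hypothetical, and the moves are assumed to be symmetric so that no proposal (Hastings) correction is needed.

```python
import math
import random


def metropolis(log_posterior, moves, weights, state, n_steps, rng):
    """Run a weighted-move Metropolis sampler and return the visited states."""
    current_lp = log_posterior(state)
    samples = []
    for _ in range(n_steps):
        # (1) choose a move with probability proportional to its weight
        move = rng.choices(moves, weights=weights, k=1)[0]
        # (2) apply the move to the current state
        proposal = move(state, rng)
        # (3) score the proposal relative to the current state
        proposal_lp = log_posterior(proposal)
        log_ratio = proposal_lp - current_lp
        # accept if better, otherwise with probability equal to the ratio
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state, current_lp = proposal, proposal_lp
        samples.append(state)
    return samples


def small_step(x, rng):
    """Symmetric local move (hypothetical)."""
    return x + rng.uniform(-0.5, 0.5)


def large_step(x, rng):
    """Symmetric bold move (hypothetical)."""
    return x + rng.uniform(-3.0, 3.0)


def toy_log_posterior(x):
    """Unnormalized log density of a standard normal, used as a stand-in."""
    return -0.5 * x * x


if __name__ == "__main__":
    chain = metropolis(toy_log_posterior, [small_step, large_step],
                       [3.0, 1.0], 0.0, 10000, random.Random(1))
    print(sum(chain) / len(chain))
```

Working with log posteriors avoids numerical underflow, and the weights only change how often each move is tried, not the distribution being sampled.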

The weights allow the researcher to favor certain moves which can help with the performance of the MCMC, but generally the default weights give good results. Most of the moves used in our MCMC implementation have been previously described [ 30 ]. The two new moves involve sampling the rate categories of the branches (a random pair of branches is chosen and their categories are swapped) and dealing with the rate categories of branches when a change to the tree topology is made.
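The category-swap move is simple enough to write down directly. In this sketch the assignment of rate categories to branches is assumed to be stored as a list indexed by branch; that representation and the function name are hypothetical.

```python
import random


def swap_rate_categories(categories, rng):
    """Propose a new assignment by swapping the categories of two random branches.

    The input list is left untouched so the caller can reject the proposal.
    """
    i, j = rng.sample(range(len(categories)), 2)
    proposed = list(categories)
    proposed[i], proposed[j] = proposed[j], proposed[i]
    return proposed


if __name__ == "__main__":
    current = list(range(12))  # one category per branch of a seven-tip tree
    print(swap_rate_categories(current, random.Random(3)))
```

Swapping the same pair again restores the original assignment, so the move is its own reverse and is symmetric as a proposal.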


We implement two alternatives: keeping all the rate categories the same when a subtree is moved or performing a single rate swap simultaneously with a tree topology change. These moves are very simplistic, and we suspect that better proposal distributions exist. We have found a small number of datasets in which our current proposal distribution does not work well.

Nevertheless, for a large number of datasets including the ones presented in this paper, our scheme performs more than adequately as assessed by repeated runs and estimation of integrated autocorrelation times. The output of an MCMC analysis is a set of samples from the posterior distribution.

In the case of the uncorrelated relaxed-clock models described above, the posterior distribution is a distribution over tree topologies, dates of divergence, branch rates, and parameters of the rate and substitution models.

This complex set of samples can be summarized in many ways. For example, the marginal posterior estimate of a parameter is the simple average of its value calculated over all L samples in the estimated posterior distribution. In a similar manner, marginal posterior estimates can be calculated for the other sampled parameters.

Some subtlety in the interpretation of the posterior distribution of rates is required because both the amount of time a branch represents, $t_j$, and the rate of evolution along the branch, $r_j$, are random variables in the MCMC analysis.


For the purposes of this paper, when we refer to the average rate for a set of branches $B$ (such as the set of external branches or the set of internal branches), we define it as the weighted average:

$$\bar{r}_B = \frac{\sum_{j \in B} r_j t_j}{\sum_{j \in B} t_j} \tag{8}$$

In general, this will be different from the mean of the underlying rate distribution because the rate at each branch is weighted by the time represented by the branch.

The justification for this is that the overall rate is best summarized by the total amount of substitutions over the total amount of time, which is what Equation 8 calculates. In the above discussion on rate models, it was assumed that it is possible to estimate absolute rates of evolution and the variance in absolute rates.
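A short sketch makes the difference between the time-weighted average and a plain mean concrete. The data layout (parallel lists of rates and times indexed by branch, and one such pair per posterior sample) and the function names are assumptions made for illustration.

```python
def weighted_average_rate(rates, times, branch_set):
    """Total substitutions divided by total time over the branches in branch_set."""
    total_substitutions = sum(rates[j] * times[j] for j in branch_set)
    total_time = sum(times[j] for j in branch_set)
    return total_substitutions / total_time


def posterior_mean_weighted_rate(samples, branch_set):
    """Average the weighted rate over posterior samples of (rates, times)."""
    values = [weighted_average_rate(r, t, branch_set) for r, t in samples]
    return sum(values) / len(values)


if __name__ == "__main__":
    rates = [1.0, 3.0]   # substitutions per site per unit time
    times = [3.0, 1.0]   # duration of each branch
    print(weighted_average_rate(rates, times, [0, 1]))  # 1.5
    print(sum(rates) / len(rates))                      # 2.0, the unweighted mean
```

The long, slow branch dominates the weighted average, which is why it can differ from the mean of the underlying rate distribution.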

In fact, even under a molecular clock assumption, the divergence times and the overall substitution rate can only be separately estimated if there is a source of external calibration information.

In the framework described here, this information can come from one of three sources: (1) Prior information on the age of internal nodes: In a phylogenetic context, calibration information is often obtained by assigning the age of a known fossil to a particular internal node [ 2 ].

Uncertainty in the association between an internal node and the fossil record can be accommodated by providing a prior probability distribution for the age of the node. Previous studies have used a uniform distribution with upper and lower bounds on the age [ 54 ], although other distributions may be suitable [ 35 ].


In the above Results section, we presented examples in which calibration times are treated with parametric prior distributions (normal and lognormal). Assigning an age to a particular node is only possible when the tree itself is assumed to be known and fixed, a limitation of previous relaxed-clock implementations [ 15, 17, 54 ].

In the framework presented here, the tree itself is being sampled and thus we cannot define the age of a particular internal node.

Instead we specify the age, or the prior distribution of age, for the most recent common ancestor of a set of taxa. Every time a new tree is proposed in the MCMC chain, the most recent common ancestor of the specified taxa is located in the tree, and the prior probability of the age of this node is used to assess the acceptance probability of the proposed tree. Again, there may be uncertainty in calibration dates [ 67 ]. The RNA virus data in this study provide examples of this form of calibration information.
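The calibration step can be sketched as a lookup followed by a density evaluation: for each proposed tree, find the most recent common ancestor of the calibrated taxa and score its age under the prior. The tree representation, the names, and the choice of a normal prior here are illustrative assumptions, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    age: float
    name: str = ""
    children: list = field(default_factory=list)


def find_mrca(node, taxa):
    """Return (number of calibrated taxa below node, MRCA node or None)."""
    if not node.children:
        found = 1 if node.name in taxa else 0
        return found, (node if taxa == {node.name} else None)
    found = 0
    for child in node.children:
        count, mrca = find_mrca(child, taxa)
        if mrca is not None:
            return count, mrca  # the MRCA lies deeper in this subtree
        found += count
    return found, (node if found == len(taxa) else None)


def log_normal_density(x, mean, sd):
    z = (x - mean) / sd
    return -math.log(sd * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def calibration_log_prior(root, taxa, mean, sd):
    """Log prior of the age of the MRCA of taxa in the tree rooted at root."""
    _, mrca = find_mrca(root, set(taxa))
    if mrca is None:
        raise ValueError("calibrated taxa not all present in the tree")
    return log_normal_density(mrca.age, mean, sd)


if __name__ == "__main__":
    ab = Node(2.0, children=[Node(0.0, "A"), Node(0.0, "B")])
    cd = Node(3.0, children=[Node(0.0, "C"), Node(0.0, "D")])
    root = Node(5.0, children=[ab, cd])
    print(calibration_log_prior(root, ["A", "B"], mean=2.0, sd=1.0))
```

Because the node is looked up again for every proposed tree, the calibration follows the taxon set rather than a fixed position in the topology, and the returned log prior would be added to the score used in the acceptance step.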

When prior information about the rate of evolution is available, in the simplest case calibration can be achieved by fixing the rate of evolution to a known value. It is also straightforward to sample the rate from a parametric distribution obtained from a previous independent analysis [ 68, 69 ]. If there is no prior information about the mean substitution rate, then it can be fixed to 1, resulting in time being in units of substitutions per site.

All of these forms of calibration information can be incorporated into our MCMC implementation either on their own or in any combination, as appropriate.

Acknowledgments. The authors would like to thank S. Chaw and H.

Competing interests. The authors have declared that no competing interests exist.

Author contributions. AJD and AR conceived the original idea, developed the software, and performed the marsupial and virus data analyses. SYWH developed the simulation software, performed the simulation analysis, developed the use of prior distributions for calibrating node ages, and performed the analyses on the bacteria, yeast, plant, metazoan, and primate datasets. MJP collected and curated the marsupial dataset and provided expert calibration information.

Funding. AJD was supported by the Wellcome Trust.

Citation: Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. Relaxed Phylogenetics and Dating with Confidence. PLoS Biol 4(5): e

Relaxed Phylogenetics and Dating with Confidence. Alexei J. Drummond, Simon Y. W. Ho, Matthew J. Phillips, Andrew Rambaut. Department of Zoology, University of Oxford, Oxford, United Kingdom. PLoS Biol. Published online Mar. David Penny, Academic Editor. Corresponding author: Andrew Rambaut. Received May 16; Accepted Jan.

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.

See "Model Selection and the Molecular Clock" in volume 4, e

Introduction. From obscure beginnings, phylogenetics has become an essential tool for understanding molecular sequence variation.

Results. Simulations: We generated alignments of nine nucleotide sequences, each 1, nucleotides in length, on the rooted tree in Figure 1. Marsupials: In addition to the viral sequences, we analyzed a marsupial dataset.


## References

Zuckerkandl E, Pauling L. Molecular disease, evolution and genic heterogeneity. In: Kasha M, Pullman B, editors. Horizons in biochemistry. New York: Academic Press;

Evolutionary divergence and convergence in proteins. Evolving genes and proteins.

Rates of DNA sequence evolution differ between taxonomic groups.