01 The Human Mitochondrial Tree

David Reich is a bit dismissive when he speaks of the mitochondrion:

The name [Mitochondrial Eve] captured the collective imagination … [but] has fostered the mistaken impression that all of our DNA comes from precisely two ancestors … The National Geographic Society’s “Genographic Project,” beginning in 2005, collected mitochondrial DNA and Y-chromosome data from close to a million people of diverse ethnic groups. But the project was outdated even before it began. … From the outset, it was clear that most of the information about the human past present in mitochondrial DNA and Y-chromosome data had already been mined…

While this is a bit passé, it’s probably still worth perusing mitochondrial DNA to get a sense of what it can tell us about larger patterns in human history. Let’s get a sense for how this analysis is done by downloading mtDNA genomes of your choice and using R to generate maximum likelihood tree with branch lengths proportional to genetic distance (and therefore approximately proportional to time).

In R, you will need to install a bunch of packages like so:

install.packages("adegenet", dep = TRUE) 
install.packages("expm", dep = TRUE)
install.packages("phangorn", dep = TRUE)

if (!requireNamespace("BiocManager", quietly = TRUE))

And following that, you’ll want to load the libraries:


At this point you’ll want to assemble a set of genomes to analyze. To do this, you’ll need to peruse some Genbank records, starting here:


Record the accession numbers for at least one Chimp, at least one Denisovan, and at least one Neanderthal (You may wish to note down several Denisovans and Neanderthals to get a sense for their within-group divergences). For the remainders, pick about ten or more genomes scattered among the various mtDNA haplotype groups (e.g. L0-L6, M, N, and R). For each genome, check that it’s easy to find the provenance of the specimen. To do that, click on the link and scroll down to the “source” feature. If you see recognizable /isolate (e.g. “Thai22”) or a /country (e.g. “Belarus”) you can be confident of the provenance. If you don’t find this information, either move on to another genome or click the PUBMED link, download the paper, and look up the provenance that way.

Once you have identified a list of Genbank accession numbers that you wish to download, you can create a hyperlink suffixed with a long list of these codes, each comma-separated like so:

http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=portal &sendto=on&dopt=fasta&val=KC867127,EF495220,AP012414

Naturally, you will have substituted the three numbers indicated above (KC867127, EF495220, and AP012414) with your own list of comma-separated accession numbers. This should download an unaligned FASTA file with these sequences.

The next step is to edit the FASTA file so that the headers make sense to you. On a Mac, I recommend using TextWrangler. Equivalent text editors are available on Windows.

For example, a record header that reads like this:


…Is best off edited to something simple like this:


Once you’ve finished editing the record headers, you’re now ready to analyze the data. Import the sequences as follows:

humans.fas <- "sequence.fas"
humans.seqs <- readDNAStringSet(humans.fas)

The sequences are unaligned. To align them, do as follows:

aligned <- AlignSeqs(humans.seqs) writeXStringSet(aligned, file="sequence_aligned.fas") humans.seq <- read.FASTA("sequence_aligned.fas")

We now have a new object called humans.seq, which contains the mitochondrial data that you downloaded, but now aligned. See what a quick neighbor joining tree (based on distance data) looks like:

nj.humans <- nj(dist.dna(humans.seq, model = "Raw")) plot(nj.humans, type = "unr", cex=0.5, no.margin=T)

If it looks good, we can proceed by converting the alignment object for the phangorn library:

humans.pd <- as.phyDat(humans.seq)

We can now measure the tree length in a parsimony context:

parsimony(nj.humans, humans.pd)

Using this tree as a starting point, we can do some branch rearranging to see if we can find a shorter tree – i.e. the maximum parsimony tree:

pars.humans <- optim.parsimony(nj.humans, humans.pd) plot(pars.humans, type = "unr", cex=0.5, no.margin=T)

Did it find a shorter tree? In any case, we want to move on to run a maximum likelihood analysis with a clock-constrained model. First, we need to start with an ultrametric tree. The UPGMA method is good for this:

upgma.humans <- upgma(dist.dna(humans.seq, model="Raw"))

And then compute the likelihood of the data given this UPGMA tree:

mli.humans <- pml(upgma.humans, type="unr", humans.pd, k=4)

Now we can make some rearrangements to the starting tree to see if we can discover a topology that estimates a higher likelihood for the data:

mlt.humans <- optim.pml( mli.humans, optNni=TRUE, optBf=TRUE, optQ=TRUE, optGamma=TRUE, optRooted=TRUE)
mlt.humans.tree <- mlt.humans$tree

Let’s draw this tree and attach labels to the internal nodes that indicate the genetic distance between each internal node and the leaf nodes:

plot(mlt.humans.tree, cex=0.5, no.margin=T)
nodelabels( round( branching.times(mlt.humans.tree), 4) )

Note that the internal nodes are labeled with genetic distances from present time. It would be useful to recalibrate these values by designating the root node as being 8 million years old:

root.humans <- getMRCA( mlt.humans.tree, 
c(mlt.humans.tree$tip.label[grep(".", mlt.humans.tree$tip.label)])
chrono.tree = chronopl(mlt.humans.tree, node = root.humans, lambda = 1, age.min = 8, age.max = NULL)
plot(chrono.tree, cex=0.5, no.margin=T, x.lim = 10) nodelabels( round( branching.times(chrono.tree), 3), cex=0.5)

Can you infer something about human history from this tree?

Leave a Reply

Your email address will not be published. Required fields are marked *