David Reich is a bit dismissive when he speaks of the mitochondrion:
The name [Mitochondrial Eve] captured the collective imagination … [but] has fostered the mistaken impression that all of our DNA comes from precisely two ancestors … The National Geographic Societyâs âGenographic Project,â beginning in 2005, collected mitochondrial DNA and Y-chromosome data from close to a million people of diverse ethnic groups. But the project was outdated even before it began. … From the outset, it was clear that most of the information about the human past present in mitochondrial DNA and Y-chromosome data had already been mined…
While this is a bit passÃ©, itâs probably still worth perusing mitochondrial DNA to get a sense of what it can tell us about larger patterns in human history. Letâs get a sense for how this analysis is done by downloading mtDNA genomes of your choice and using R to generate maximum likelihood tree with branch lengths proportional to genetic distance (and therefore approximately proportional to time).
In R, you will need to install a bunch of packages like so:
install.packages("adegenet", dep = TRUE)
install.packages("expm", dep = TRUE)
install.packages("phangorn", dep = TRUE)
if (!requireNamespace("BiocManager", quietly = TRUE))
And following that, youâll want to load the libraries:
At this point youâll want to assemble a set of genomes to analyze. To do this, youâll need to peruse some Genbank records, starting here:
Record the accession numbers for at least one Chimp, at least one Denisovan, and at least one Neanderthal (You may wish to note down several Denisovans and Neanderthals to get a sense for their within-group divergences). For the remainders, pick about ten or more genomes scattered among the various mtDNA haplotype groups (e.g. L0-L6, M, N, and R). For each genome, check that itâs easy to find the provenance of the specimen. To do that, click on the link and scroll down to the âsourceâ feature. If you see recognizable /isolate (e.g. âThai22â) or a /country (e.g. “Belarus”) you can be confident of the provenance. If you donât find this information, either move on to another genome or click the PUBMED link, download the paper, and look up the provenance that way.
Once you have identified a list of Genbank accession numbers that you wish to download, you can create a hyperlink suffixed with a long list of these codes, each comma-separated like so:
Naturally, you will have substituted the three numbers indicated above (KC867127, EF495220, and AP012414) with your own list of comma-separated accession numbers. This should download an unaligned FASTA file with these sequences.
The next step is to edit the FASTA file so that the headers make sense to you. On a Mac, I recommend using TextWrangler. Equivalent text editors are available on Windows.
For example, a record header that reads like this:
>JX289103.1 Homo sapiens isolate MMR087 mitochondrion, complete genome GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG GTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC
…Is best off edited to something simple like this:
>Myanmar_JX289103 GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG GTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTC
Once youâve finished editing the record headers, youâre now ready to analyze the data. Import the sequences as follows:
humans.fas <- "sequence.fas"
humans.seqs <- readDNAStringSet(humans.fas)
The sequences are unaligned. To align them, do as follows:
aligned <- AlignSeqs(humans.seqs) writeXStringSet(aligned, file="sequence_aligned.fas") humans.seq <- read.FASTA("sequence_aligned.fas")
We now have a new object called humans.seq, which contains the mitochondrial data that you downloaded, but now aligned. See what a quick neighbor joining tree (based on distance data) looks like:
nj.humans <- nj(dist.dna(humans.seq, model = "Raw")) plot(nj.humans, type = "unr", cex=0.5, no.margin=T)
If it looks good, we can proceed by converting the alignment object for the phangorn library:
humans.pd <- as.phyDat(humans.seq)
We can now measure the tree length in a parsimony context:
Using this tree as a starting point, we can do some branch rearranging to see if we can find a shorter tree â i.e. the maximum parsimony tree:
pars.humans <- optim.parsimony(nj.humans, humans.pd) plot(pars.humans, type = "unr", cex=0.5, no.margin=T)
Did it find a shorter tree? In any case, we want to move on to run a maximum likelihood analysis with a clock-constrained model. First, we need to start with an ultrametric tree. The UPGMA method is good for this:
upgma.humans <- upgma(dist.dna(humans.seq, model="Raw"))
And then compute the likelihood of the data given this UPGMA tree:
mli.humans <- pml(upgma.humans, type="unr", humans.pd, k=4)
Now we can make some rearrangements to the starting tree to see if we can discover a topology that estimates a higher likelihood for the data:
mlt.humans <- optim.pml( mli.humans, optNni=TRUE, optBf=TRUE, optQ=TRUE, optGamma=TRUE, optRooted=TRUE)
mlt.humans.tree <- mlt.humans$tree
Letâs draw this tree and attach labels to the internal nodes that indicate the genetic distance between each internal node and the leaf nodes:
plot(mlt.humans.tree, cex=0.5, no.margin=T)
nodelabels( round( branching.times(mlt.humans.tree), 4) )
Note that the internal nodes are labeled with genetic distances from present time. It would be useful to recalibrate these values by designating the root node as being 8 million years old:
root.humans <- getMRCA( mlt.humans.tree,
chrono.tree = chronopl(mlt.humans.tree, node = root.humans, lambda = 1, age.min = 8, age.max = NULL)
plot(chrono.tree, cex=0.5, no.margin=T, x.lim = 10) nodelabels( round( branching.times(chrono.tree), 3), cex=0.5)
Can you infer something about human history from this tree?