02 Database of Simulated Humans

Introduction to SQL

I’ve created two databases that attempt to simulate the relationships among people in small populations after 50 and 100 generations. The following is an “entity relationship” diagram, sometimes known as a database model or schema, which shows the tables (or “relations”) and fields (or “columns”). Each table contains a number of records (or “tuples”):

The crow’s feet lines indicate constraints between the relations. For example, each record in the persons table has zero or more records in the person_path table. This “one-to-many” relationship between persons and person_path is enforced by a persons_id (for which there is only one unique record) matching a child_id (for which there can be zero or more matching records). A person_id can also match to an ancestor_id. Likewise, each persons record (except for the first generation) has a mother_id and a father_id that matches with a person_idin the previous generation.

A persons record also stores the sex of the individual (female = TRUE or FALSE), the number of children produced, and the generation number. A person_path record is the transitive closure mapping between all paths between ancestors and descendants. If all links in the path are female, the lineage = ‘f’; if all links in the path are male, the lineage = ‘m’; if links are a mix of either sex, the lineage = ‘e’. The distance is the number of generations separating ancestor_id and child_id.

I populated this database using a script written in Perl. The script uses the DBI library to interface with the database. Since SQL is a somewhat static and “one-off” language, it’s often helpful to use a procedural language, like Perl or Python, to interface with it. For those curious as to how this is done, see here: script_people.pl To run some queries, log into the VM and connect to a database like so:

psql onehundredbyonehundred

You can get a list of databases with \l. You can connect to a different database with \c different_database You can list the tables with \dt. You can see the structure of a table using \d persons. Try something simple like count up the records in each table:

SELECT count(*) 
FROM persons;
SELECT count(*)
FROM person_path;

The person_path table has all possible paths through the network stored in persons, but only for those paths that end in the most recent generation. To check what the most recent generation is, do as follows:

SELECT max(generation) FROM persons; 

In the case of the onehundredbyonehundred database, the most recent generation is 100.

Let’s take a look at a few records for people in generation 1:

FROM persons
WHERE generation = 1

The WHERE constrains the set of selected records to some criteria, here the generation has to be 1. The LIMIT puts a limit on how many rows to display.

List the men that have more than 3 children:

FROM persons
WHERE generation = 1
AND children > 3
AND female IS FALSE;

Change this query so that it counts how many females have more than 3 children.

Let’s think about how you can list all the grandchildren of an individual. First, make a query that finds the first 20 records for people in generation 97 that have at least two children. From the result of this query, pick three persons_ids (e.g. say “120633”, “120617”, and “120663”) and use these in the following query:

SELECT persons_id, child_id 
FROM persons JOIN person_path ON (persons_id = ancestor_id)
WHERE persons_id IN ( 120633, 120617, 120663 )
AND distance = 3;

Notice that we have now referenced two tables in the FROM clause: persons has been joined to person_path by mapping the persons_idfield in persons to the ancestor_id field in person_path. And rather than make persons_id equal to a single integer, we allow it to match any of three integers.

Suppose we want simply to count up the number of grandchildren for each persons_id. Do as follows, modifying it using your own set of persons_id:

SELECT persons_id AS "Grandparent", count(*) AS "Number of Grandchildren" 
FROM persons JOIN person_path ON (persons_id = ancestor_id)
WHERE persons_id IN ( 120633, 120617, 120663 )
AND distance = 3
GROUP BY persons_id;

You can see that the result set is being grouped into rows, one for each persons_id, and that the count(*) function is counting up the number of records subsumed into each group. Note too that I’m using aliases with the “AS” clause so that I can rename the columns.

Querying For Population Genetic Patterns

We can now do some fancier queries to understand how people are related.

Let’s count how many members of generation 1 have a descendant at the present time. You will recall that each generation has 100 people.

SELECT count(*) 
FROM persons p
WHERE p.generation = 1
AND EXISTS (SELECT 1 FROM person_path pp WHERE pp.ancestor_id = p.persons_id);

Two things to note: the “p” and “pp” are aliases following each table name and preceding each field name. Aliases are useful for keeping track of which table different field names are referencing. The other thing to noice is the use of a subquery: the EXISTS clause reports true if the query in the parentheses has found at least one record indicating that this persons_id in generation 1 has at least one descendant 100 generations later. Do the results surprise you?

Let’s see how this changes over time by making the same calculation for each of the 100 generations. To do this I’m using three nested queries, where the outermost one serves as a loop to run through all 100 generations:

SELECT gen AS "generation", 
( SELECT count(*)
FROM persons p
WHERE p.generation = gen
AND EXISTS ( SELECT 1 FROM person_path pp WHERE pp.ancestor_id = p.persons_id )
) AS "ancestors"
FROM generate_series(1,100) gen;

Do you notice any trends?

Here’s the result I got for the first 10 generations in the onethousandbyfifty database:

  generation | ancestors 
1 | 803
2 | 817
3 | 831
4 | 831
5 | 819
6 | 796
7 | 801
8 | 786
9 | 780
10 | 793
(10 rows)

The lesson here is that pretty much anyone who does reproduce, will be the ancestor of all present day individuals. But is there variance in how many present-day descendants each one is an ancestor to? We will modify the query to only look at persons in generation 1, asking how many great-great-great (etc) grandchildren did they each produce:

SELECT persons_id, (
SELECT count(*)
FROM person_path pp
WHERE pp.ancestor_id = p.persons_id
AND pp.lineage = 'e'
) AS "gggggGrandchildren"
FROM persons p
WHERE p.generation = 1
ORDER BY persons_id;

This query might be a bit slow (e.g. 5 minutes) — you can cancel with -c. But what you should see is that each ancestor either gave rise to all descendants or to none. This means that if Queen Elizabeth II is a descendant of William the Conquer, then every European (and perhaps every human on earth) is also a descendant of William the Conquer.

Let’s see if we can trace how many generations ago the “mitochondrial Eve” existed, using the following query:

SELECT ppa.distance, ppa.ancestor_id 
FROM person_path ppa, person_path ppb
WHERE ppa.ancestor_id = ppb.ancestor_id
AND ppa.lineage = 'f'
AND ppb.lineage = ppa.lineage
GROUP BY ppa.ancestor_id, ppa.distance
HAVING count(*) = (
SELECT count(*) FROM persons
WHERE generation = (SELECT max(generation) FROM persons)
) ^ 2
ORDER BY ppa.distance

This query works by finding all female lineages that share the same ancestor_id. It does this by examining two instances of the table person_path (ppa and ppb) and selecting all combinations that share the same ancestor_id. Since there are 100 persons at the present, we expect 1002 combinations. So if we group by ancestor_id and distance, limiting ourselves to just those rows that have a count(*) of 1002, we have found all persons that are the mitochondrial ancestors of all present-day people. Now we just sort them by distance and pick the first row. This must be the most recent of all the common ancestors — i.e. the mitochondrial Eve.

Modify the above query so that it finds the Y-chromosome Adam.

Clearly there’s a lot of variance as to how recent or ancient is the MRCA, and for tracings that are limited to either male or female lineages, the MRCA may well be older than 100 generations.

Another question of interest is whether you can find the most recent ancestor that is in common with pairs of present day individuals picked at random and regardless of the sex of the lineages, and what is the distribution of these ancestors.

SELECT dist AS "generation", count(*) AS "MRCAs"
( SELECT pa.persons_id, pb.persons_id, MIN(ppa.distance) AS dist
person_path ppa,
(SELECT persons_id
FROM persons
WHERE generation = (SELECT max(generation) FROM persons)
ORDER BY random() LIMIT 5) pa,
person_path ppb,
(SELECT persons_id
FROM persons
WHERE generation = (SELECT max(generation) FROM persons)
ORDER BY random() LIMIT 5) pb
WHERE pa.persons_id > pb.persons_id
AND ppa.child_id = pa.persons_id
AND ppb.child_id = pb.persons_id
AND ppa.ancestor_id = ppb.ancestor_id
AND ppa.lineage = 'e'
AND ppb.lineage = 'e'
GROUP BY pa.persons_id, pb.persons_id
) minDist
ORDER BY dist;

This is not an easy query to figure out. One thing to notice is that “ORDER BY random() LIMIT 5” is one way to generate five random persons_ids. It does this twice, picking out only those pairs where the first persons_id is larger than the second. The result is about 12 pairs (give or take). We then search for the MRCAs for each of these pairs and summarize the findings.

How does this distribution compare with the mitochondrial (or ‘f’) distribution? Note that each time you run it, you should get a different result because it’s based on random pairs of persons_id.

Consider that the onethousandbyfifty database has the following distribution for any sex (i.e. ‘e’):

  onethousandbyfifty, by any sex:

generation | MRCAs
3 | 1
4 | 3
5 | 9
6 | 2
(4 rows)

Are these MRCAs surprisingly shallow? Compare this to just tracking the mitochondrial ancestry or just tracking the y-chromosome ancestry.

More on this:

  • Chang, J. 1999. Recent common ancestors of all present-day individuals. Advances in Applied Probability, 31: 1002-1026: End of the 13th century, around the time of Richard II, lived a man or woman from whom all Europeans can trace their ancestry.
  • Ralph, P. 2013. The geography of recent genetic ancestry across Europe. PLOS Biology, 11:5: DNA for Europeans shows the same pattern.
  • Rohde, DLT 2004. Modeling the recent common ancestry of all living humans. Nature, 431: 562-566: Globally, all living humans have a MRCA only 3,400 – 3,600 years ago.

Richard II(1367-1413)

Leave a Reply

Your email address will not be published. Required fields are marked *