12 SNP Resource: Personal Genome Project




I recently came across the Personal Genome Project, an effort to assemble and share genomes from volunteers. Many of the genomes are SNP genotype files from 23andMe and Ancestry, but some are whole genomes and some come from research projects. The project’s main focus seems to be inferring traits and diseases from participant surveys, but in some cases we may be able to use these data for our own purposes. For some records, the survey also asks for the country and/or ethnicity of the person’s maternal and paternal grandparents. Some portion of these data is available here.

The search features, regrettably, seem inadequate, so I’ve tried to assemble their metadata into a PostgreSQL database I’m calling “pgppy.” Let’s look at the entity-relationship diagram:

As you can see, each pgp_demographics record is associated with one or more data files in the pgp_uploaded_data table. The relevant parts for us are the ethnicity field and the maternal and paternal grandparent fields. Unfortunately, the ethnicity field is the rather uninformative “race” as the US government defines it. Let’s see what the possible values are. As usual, connect to the database with:

psql pgppy 

This query returns every distinct value of ethnicity, along with the number of records that have it:

SELECT ethnicity, count(*) 
FROM pgp_demographics 
GROUP BY ethnicity
ORDER BY ethnicity;

                                                ethnicity                                                | count 
---------------------------------------------------------------------------------------------------------+-------
 American Indian Alaska Native                                                                           |     4
 American Indian Alaska Native Asian Black or African American White                                     |     3
 American Indian Alaska Native Asian White                                                               |     1
 American Indian Alaska Native Black or African American White                                           |     9
 American Indian Alaska Native Hispanic or Latino                                                        |     2
 American Indian Alaska Native Hispanic or Latino Black or African American                              |     1
 American Indian Alaska Native Hispanic or Latino Black or African American White                        |     2
 American Indian Alaska Native Hispanic or Latino White                                                  |    11
 American Indian Alaska Native Native Hawaiian or Other Pacific Islander Black or African American White |     1
 American Indian Alaska Native White                                                                     |    93
 American Indian Alaska Native White No response                                                         |     1
 Asian                                                                                                   |   107
 Asian Black or African American White                                                                   |     1
 Asian Hispanic or Latino                                                                                |     2
 Asian Native Hawaiian or Other Pacific Islander Black or African American White                         |     2
 Asian Native Hawaiian or Other Pacific Islander White                                                   |     4
 Asian White                                                                                             |    36
 Black or African American                                                                               |    33
 Black or African American White                                                                         |    17
 Hispanic or Latino                                                                                      |    52
 Hispanic or Latino Black or African American                                                            |     2
 Hispanic or Latino Black or African American White                                                      |     3
 Hispanic or Latino Native Hawaiian or Other Pacific Islander                                            |     1
 Hispanic or Latino Native Hawaiian or Other Pacific Islander White                                      |     2
 Hispanic or Latino White                                                                                |    60
 Native Hawaiian or Other Pacific Islander                                                               |     2
 Native Hawaiian or Other Pacific Islander White                                                         |     1
 White                                                                                                   |  2817
                                                                                                         |    32
(29 rows)

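As you can see, these metadata don’t look very normalized: most values are several categories run together in a single string. What might be more useful is the country of origin. The following query counts, by country, the individuals whose four grandparents all came from the same country (records with a missing grandparent country drop out, because a comparison with NULL is never true):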

SELECT pd.maternal_grandmother_country, count(*)
FROM pgp_demographics pd 
WHERE pd.maternal_grandmother_country = pd.maternal_grandfather_country 
AND pd.maternal_grandfather_country = pd.paternal_grandmother_country 
AND pd.paternal_grandmother_country = pd.paternal_grandfather_country
GROUP BY pd.maternal_grandmother_country;

     maternal_grandmother_country     | count 
--------------------------------------+-------
 Albania                              |     2
 Antigua and Barbuda                  |     1
 Argentina                            |     1
 Armenia                              |     3
 Australia                            |     1
 Austria                              |     2
 Azerbaijan                           |     1
 Bangladesh                           |     2
 Belarus                              |     2
 Belgium                              |     5
 Brazil                               |     3
 Bulgaria                             |     7
 Canada                               |    15
 China                                |    41
 Colombia                             |     5
 Croatia                              |     2
 Cuba                                 |     4
 Czech Republic                       |     4
 Egypt                                |     2
 El Salvador                          |     1
 Estonia                              |     2
 Ethiopia                             |     1
 Finland                              |     2
 France                               |     1
 Germany                              |    23
 Greece                               |     2
 Guyana                               |     1
 Hungary                              |     4
 India                                |    33
 Iran Islamic Republic of             |     4
 Iraq                                 |     2
 Ireland                              |    12
 Italy                                |    10
 Japan                                |     1
 Korea South (Republic of)            |     1
 Lebanon                              |     1
 Lithuania                            |     2
 Malaysia                             |     1
 Mexico                               |    11
 Netherlands                          |     3
 Nigeria                              |     1
 Norway                               |     2
 Pakistan                             |     1
 Palistinian Territory Occupied       |     1
 Peru                                 |     1
 Philippines                          |     2
 Poland                               |    16
 Portugal                             |     3
 Puerto Rico                          |     8
 Romania                              |     5
 Russian Federation                   |    11
 Singapore                            |     1
 Slovakia                             |     1
 Slovenia                             |     2
 South Africa                         |     2
 Spain                                |     8
 Sri Lanka                            |     1
 Sweden                               |     3
 Syrian Arab Republic                 |     3
 Taiwan Province of China             |     3
 Thailand                             |     1
 Turkey                               |     3
 Ukraine                              |    10
 United Kingdom                       |    31
 United States                        |  1574
 United States Minor Outlying Islands |     1
 Venezuela                            |     2
 Viet Nam                             |     3
(68 rows)

Clearly, these data have a strong US bias. Still, some of these other genomes might be useful. For example, the following finds individuals whose four grandparents are all from Poland and who have at least one uploaded data file with a URL:

SELECT pd.human_id 
FROM pgp_demographics pd JOIN pgp_uploaded_data pu USING (human_id)
WHERE pd.maternal_grandmother_country = pd.maternal_grandfather_country
AND pd.maternal_grandfather_country = pd.paternal_grandmother_country
AND pd.paternal_grandmother_country = pd.paternal_grandfather_country
AND pd.maternal_grandmother_country = 'Poland'
AND pu.url IS NOT NULL
GROUP BY pd.human_id
ORDER BY pd.human_id;
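
To get from a list of individuals to something downloadable, it helps to return the file URLs as well. The following is a minimal sketch rather than anything from pgppy itself: it uses Python’s built-in sqlite3 module with a few made-up rows so that it runs without the real database. The table and column names follow the queries above; the human_id values and example.org URLs are invented for illustration.

```python
import sqlite3

# Toy stand-in for pgppy. Column names follow the queries above;
# the rows and URLs are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pgp_demographics (
    human_id TEXT PRIMARY KEY,
    maternal_grandmother_country TEXT,
    maternal_grandfather_country TEXT,
    paternal_grandmother_country TEXT,
    paternal_grandfather_country TEXT
);
CREATE TABLE pgp_uploaded_data (human_id TEXT, url TEXT);
INSERT INTO pgp_demographics VALUES
    ('hu000001', 'Poland', 'Poland', 'Poland', 'Poland'),
    ('hu000002', 'Poland', 'Poland', 'Germany', 'Poland'),
    ('hu000003', 'Poland', 'Poland', 'Poland', NULL);
INSERT INTO pgp_uploaded_data VALUES
    ('hu000001', 'https://example.org/a.txt'),
    ('hu000001', 'https://example.org/b.txt'),
    ('hu000002', 'https://example.org/c.txt'),
    ('hu000003', NULL);
""")

# Comparing each grandparent column to the same country directly is
# equivalent to the chained equalities plus the final country test.
QUERY = """
SELECT pd.human_id, pu.url
FROM pgp_demographics pd JOIN pgp_uploaded_data pu USING (human_id)
WHERE pd.maternal_grandmother_country = :country
AND pd.maternal_grandfather_country = :country
AND pd.paternal_grandmother_country = :country
AND pd.paternal_grandfather_country = :country
AND pu.url IS NOT NULL
ORDER BY pd.human_id, pu.url;
"""

rows = conn.execute(QUERY, {"country": "Poland"}).fetchall()
for human_id, url in rows:
    print(human_id, url)
```

Only the first toy individual comes back, once per file: the second has a German grandparent, and the third has a missing grandparent country, so the comparison against NULL is never true. To run the same SQL in psql, replace :country with the literal 'Poland'.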

Bulk download of the data from the Personal Genome Project website is difficult because they store all kinds of data (e.g. CAT scans, IQ tests, whole genomes, 23andMe SNPs, etc.) using different formats and different methods of compression. I’ve written a longish Perl script that attempts to handle all these issues, here: download_pgp.pl. It seems to work (more or less) on the Ubuntu machine. If you want it to work on a Mac, you’ll need to edit line 399. Otherwise, all you need to edit is the query. In this example, it’s trying to find all people who have “African” as part of their ethnicity and have at least one grandparent from the United States, the idea being to target African Americans. The script excludes VCF files for whole genomes, as these tend to be huge and a pain to process. To run it, just type “perl download_pgp.pl”.
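
For anyone adapting the query, here is a rough sketch of what a selection like the one described might look like. It is not the query from download_pgp.pl: the pattern match on ethnicity, the IN test across the four grandparent columns, and especially the filter that skips URLs containing “.vcf” are my assumptions about one way to express the same idea. As before, it runs against made-up rows in an in-memory SQLite database, and the human_id values and URLs are hypothetical.

```python
import sqlite3

# Toy stand-in for pgppy; all rows and URLs are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pgp_demographics (
    human_id TEXT PRIMARY KEY,
    ethnicity TEXT,
    maternal_grandmother_country TEXT,
    maternal_grandfather_country TEXT,
    paternal_grandmother_country TEXT,
    paternal_grandfather_country TEXT
);
CREATE TABLE pgp_uploaded_data (human_id TEXT, url TEXT);
INSERT INTO pgp_demographics VALUES
    ('hu00000A', 'Black or African American',
     'United States', 'United States', 'Nigeria', 'United States'),
    ('hu00000B', 'White',
     'United States', 'United States', 'United States', 'United States'),
    ('hu00000C', 'Black or African American White',
     'Nigeria', 'Nigeria', 'Nigeria', 'Nigeria'),
    ('hu00000D', 'Black or African American',
     'United States', NULL, NULL, NULL);
INSERT INTO pgp_uploaded_data VALUES
    ('hu00000A', 'https://example.org/A_genome.vcf.gz'),
    ('hu00000A', 'https://example.org/A_snps.txt'),
    ('hu00000B', 'https://example.org/B_snps.txt'),
    ('hu00000C', 'https://example.org/C_snps.txt'),
    ('hu00000D', 'https://example.org/D_23andme.zip');
""")

QUERY = """
SELECT DISTINCT pd.human_id, pu.url
FROM pgp_demographics pd JOIN pgp_uploaded_data pu USING (human_id)
WHERE pd.ethnicity LIKE '%African%'
-- at least one of the four grandparents is from the United States
AND 'United States' IN (pd.maternal_grandmother_country,
                        pd.maternal_grandfather_country,
                        pd.paternal_grandmother_country,
                        pd.paternal_grandfather_country)
AND pu.url IS NOT NULL
-- assumption: VCF files can be recognized by '.vcf' in the URL
AND lower(pu.url) NOT LIKE '%.vcf%'
ORDER BY pd.human_id, pu.url;
"""

rows = conn.execute(QUERY).fetchall()
for human_id, url in rows:
    print(human_id, url)
```

With these toy rows, the query keeps the SNP file for the first individual (dropping the VCF) and the single file for the fourth, whose one known grandparent is from the United States. Whether the real URLs actually identify VCF files this way would need checking against pgp_uploaded_data.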

