03 Database of SNPs

We will be working with a database of SNPs, which we can mine in order to extract datasets useful for further analyses. The ER diagram for our data model is as follows:



You can read this data model with the following narrative:

  • The datasets table keeps records for each dataset. We will mostly work with just one dataset with about 5,000 individuals and about 130,000 SNPs.
  • Each record in datasets has many records in individuals, one record for each person. Each individual record has a family_id, which we are using to store the ethnicity or country of origin. Also there is an individual_id, which is an identifier to this person, and can often be Googled to locate more metadata about them. Most importantly, the snps column stores a very long string of tab-separated diploid SNP scores. Each diploid score consists of one nucleotide followed by a space followed by the other nucleotide (e.g. “A C”).
  • The groupings table associates each family_id with a group_name which is one of eight categories: Africa, America, CentralSouthAsia, EastAsia, Europe, MiddleEast, Oceania, and Siberia.
  • The snps table keeps a record of each SNP, including the chromosome_id, base_pair location on the chromosome, and snp_id (i.e. in the form rs123456).
  • The datasets_snps is a joining table that lists all the SNPs associated with a particular datasets record. The seq_order column stores the index of the diploid score in the snps column of individuals.

Example Queries

To run some queries, log into the VM and connect to a database like so:

psql snppy

You can list the tables with \dt. You can see the structure of the individuals table using “\d individuals”.

Calculate the number of individuals in each dataset:

SELECT d.dataset_name, count(*) 
FROM datasets d JOIN individuals i ON ( i.datasets_id = d.datasets_id )
GROUP BY d.dataset_name
ORDER BY d.dataset_name;

Calculate the number of SNPs in each dataset:

SELECT d.dataset_name, count(*) 
FROM datasets d JOIN datasets_snps ds ON ( ds.datasets_id = d.datasets_id )
GROUP BY d.dataset_name
ORDER BY d.dataset_name;

Calculate the number of individuals in each regional group:

SELECT g.group_name, count(*) 
FROM groupings g JOIN individuals i ON ( i.family_id = g.family_id )
GROUP BY g.group_name
ORDER BY g.group_name;

List a random set of 20 European men and provide their ethnic/country group:

SELECT i.individual_id, i.family_id 
FROM individuals i NATURAL JOIN groupings g
WHERE g.group_name = 'Europe' AND i.sex = '1'
ORDER BY random() LIMIT 20;

(Note that a NATURAL JOIN allows the software to pick a pair of IDs where the names of the columns exactly match — in this case family_id is matched between the individuals table and the groupings table.)

For each chromosome, count up the number of SNPs, the number of base pairs between the SNPs that are farthest apart, and the average SNP separation:

                              SELECT s.chromosome_id, to_char(count(*), 'FM99,999') AS "number of SNPs", 
to_char( (max(s.base_pair) - min(s.base_pair)), 'FM999,999,999') AS "Max Separation", 
to_char( (max(s.base_pair) - min(s.base_pair))/count(*), 'FM99,999') AS "SNP separation"
FROM snps s 
GROUP BY s.chromosome_id 
ORDER BY s.chromosome_id::int;


Note that the to_char function is simply formatting the number to include commas to make it more readable. Also the “::int” in “ORDER BY s.chromosome_id::int” is to cast the
column as an integer even though it’s stored as a string.

Leave a Reply

Your email address will not be published. Required fields are marked *