The following is an exercise to help introduce us to genomic analysis with PLINK v1.9 — a Swiss-army-knife for a performing a range of large-scale genomic analyses. PLINK is developed by Shaun Purcell at the Broad Institute of Harvard & MIT and is freely available as an open-source distribution.
Our basic approach is the following:
- Log in to the VM computer
- Connect to the snppy database psql interface
- Write a query to select a set of genomes and export these as PED files that PLINK can read
- Convert the PED files into their binary equivalents called BED files
- Use PLINK to clean up the data — e.g. filtering out linked SNPs
- Use PLINK to explore the data
Optional: Installing PLINK
PLINK is already installed on our VM, so you’re free to run all your analyses remotely. But some of you may prefer to run PLINK on your personal computer and just use snppy on the VM to prepare files of raw data. Bioinformatics is best done with a UNIX-based operating system such as LINUX and MacOS. Those of you with Windows computers can run LINUX as a dual-boot, or on a VM (e.g. VirtualBox or VMware Player). For the MacOS or LINUX users, you can download PLINK as follows (check the PLINK website for links to the latest versions):
curl -O http://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20190617.zip
curl -O http://s3.amazonaws.com/plink1-assets/plink_mac_20190617.zip
Then you want to create a directory to decompress the zip file, unzip it, and then move the binaries to “/usr/local/bin” so that these programs can be called from any directory.
unzip plink_linux_x86_64_20190617.zip -d ~/PLINK/
sudo cp plink /usr/local/bin
sudo cp prettify /usr/local/bin
rm -R PLINK
A. Querying to Create PED Files
Let’s start by creating a directory where we will store our files. Let’s call it “FREYUB” to mean our French and Yoruban people project. Then we will connect to our snppy database interface:
I notice that there are 29 French genomes (i.e.: SELECT count(*) FROM individuals WHERE family_id = ‘French’;) and 132 Yoruban genomes. It so happens that if we query for either source but limit the result to 58 records, we will end up with 29 French people and 29 Yoruban people. Let’s test this by writing a query that shows everything except that it leaves off the SNPs (which would otherwise clutter up the interface):
SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype FROM individuals i NATURAL JOIN groupings g WHERE i.family_id LIKE 'French' OR i.family_id LIKE 'Yoruba' ORDER BY i.individuals_id LIMIT 58;
This looks good, so now let’s create the PED files. Let’s try the following:\copy (SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype, i.snps FROM individuals i NATURAL JOIN groupings g WHERE i.family_id LIKE 'French' OR i.family_id LIKE 'Yoruba' ORDER BY i.individuals_id LIMIT 58) TO 'FREYUB.ped' (FORMAT CSV, DELIMITER ' ');
\copy (SELECT i.family_id, i.individual_id, i.family_id FROM individuals i NATURAL JOIN groupings g WHERE i.family_id LIKE 'French' OR i.family_id LIKE 'Yoruba' ORDER BY i.individuals_id LIMIT 58 ) TO 'FREYUB.pop' (FORMAT CSV, DELIMITER ' ');
\copy ( SELECT s.chromosome_id, s.snp_id, s.genetic_distance, s.base_pair FROM datasets d NATURAL JOIN datasets_snps ds NATURAL JOIN snps s WHERE d.dataset_name = 'HGDP' ORDER BY ds.seq_order ) TO 'FREYUB.map' (FORMAT CSV, DELIMITER ' ');
It’s a little annoying that each \copy command has to be a single line, as this makes it harder to read or compose your queries. There is an alternative way of exporting to a text file using \o, but the downside is that it doesn’t allow relative paths — so I think it’s best to use \copy. At any rate, you can now quit psql with \q and return to your shell. The CSV format typically puts quotes around tokens that need to be escaped. Let’s get rid of the quote using this Perl one-liner:perl -i -pe's/"//g' FREYUB.ped
Optional: Transferring Genome Files to Your Personal Computer
Some of you will prefer to run your analyses on your personal computer. This way you don’t have to have a network connection and the software might run faster running natively than running on a virtual machine. For those of you who would rather run your analyses on the remote VM, please skip this step. To transfer your data, I suggest archiving the FREYUB directory and using scp to do the transfer.cd
tar cvfz freyub.tar.gz FREYUB
This creates the compressed archive freyub.tar.gz in your home directory. You have disconnected from the remote shell with “exit,” so you’re now in the home directory of your PC. Transfer freyub.tar.gz (but change “team1” to whatever your login is) and decompress as follows:scp email@example.com:freyub.tar.gz ~/
tar -xzf freyub.tar.gz
You should now find the FREYUB directory in your home directory of your local PC.
B: Converting PED Files to BED Files
It’s less efficient for PLINK to work with PED files, so we will convert them to binary files. Type ls to see a list of the files in your FREYUB directory — you should see FREYUB.ped, FREYUB.map, and FREYUB.pop (recall that you can learn about the PED and MAP files here and here). Do the conversion to binary as follows:plink --file FREYUB --make-bed -out FREYUB
Notice that you should not include the extensions on the file names — PLINK will automatically know to look for the FREYUB.ped and FREYUB.map files. Note that there are various tools you can use to inspect your files to check that everything looks good. For example, head and tail are programs for printing the beginning and ending of the files. To read the first and last n lines, check out the FREYUB.fam and FREYUB.map:head -n 20 FREYUB.fam
Beware that each line in FREYUB.ped file is very long, so best to only show one line. Note too that it will not be informative to inspect the binary files (e.g. FREYUB.bed).
C: Cleansing the dataset of SNPs in LD
Whether you’re doing GAWAS analysis or just inferring population history, you often want to assume that each of your SNPs is accurately scored and each of your SNPs is segregating independently. Rare minor allele frequencies (commonly called MAFs) are more likely to be false scores in the genotyping process, and either way they’re less informative about group relationships. Rare MAFs with frequencies less than 1% can be filtered out with the following switch: –maf 0.01. In population genetics, the jargon term for alleles that segregate independently is known as linkage equilibrium, and therefore the opposite property is linkage disequilibrium or LD. A common parameter for measuring this is r2, where an r2 of 0 is when two neighboring alleles are in perfect equilibrium, while an r2 of 1 is when two neighboring alleles are perfectly linked and provide identical information among the populations under study. To identify and prune the genome of LD SNPs, we can use a “moving window” of 50,000 bases, a step length of 5,000, and an r2 cutoff of 0.2, like so: –indep-pairwise 50 5 0.2. Of course, if your purpose is to use LD blocks to find regions under strong selection, you would not want to filter out LD SNPs. Run the following command:plink --bfile FREYUB --maf 0.01 --indep-pairwise 50 5 0.2 --out FREYUB
Notice that from now on we are using –bfile instead of –file because the binary version of the PED file has already been calculated. The SNPs identified for exclusion are stored in FREYUB.prune.out, and those remaining are in FREYUB.prune.in. If you want to verify how many SNPs remain, try the following:wc -l FREYUB.prune.in
For more information about this command see here.
D: Cleansing the dataset of closely related individuals
While it seems exceedingly unlikely that broadly sampled genomes would happen on closely related individuals, duplicate entries and inclusion of data from studies that target particular families are still possible. Like with LD blocks, closely related individuals interfere with statistics for GWAS and population relatedness studies. One way of estimating relatedness is to use Identical by Descent (IBD) mapping. We can calculate the proportion of IBD using the –genome switch (see for info here).plink --bfile FREYUB --genome --extract FREYUB.prune.in --out FREYUB
Notice that we are using our prune.in file to exclude the SNPs we had filtered out in step C. You can use head FREYUB.genome to inspect the top of the file. As you can see, each individual is compared against every other individual and their relatedness is estimated under the PI_HAT column.
E: Inspecting the Proportion of IBD in R
If you are running PLINK locally on your Mac or PC, you can use R-Studio to graph results as you usually would. Note that you want to start by telling R-Studio what your working directory is. In Terminal, you can get your working directory using the pwd command and then in R-Studio type in: setwd(‘/Users/my/working/directory/for/FREYUB/’).
If you are running PLINK remotely on the VM, you may be able to use R running on the VM too. To do that, you have to allow X11 forwarding so that when graphics is displayed it gets pushed back to your computer. For this to work, connect to the VM in the following way:ssh -X firstname.lastname@example.org
Next, test whether X-windows is working by typing xeyes. If it’s working, you should see a little animated graphic appear on your screen. If it’s not working, try the following:
- If you have a Mac, do you have XQuartz installed? If not install it from here.
- If you have a Mac and XQuartz is installed, but X-forwarding is still not working, create a file called “config” and in it type XAuthLocation /opt/X11/bin/xauth and then move this file to ~/.ssh/ in your home directory.
- If you have Windows, you may need to install a program like Xming.
- If none of these methods work, I’m afraid you’ll just need to use scp to move the relevant files to your local computer and run R-Studio per normal (that, or ask EdTech or Yale-NUS IT to help solve the issue).
Okay, so back to R. If you’re using R on the remote server, you can start it at the shell prompt by simply typing “R”. At the R prompt, type in the following to ingest and graph the IBD results:
ibd <- read.table("FREYUB.genome", hea=T, as.is=T)
hist( ibd$PI_HAT, breaks = 100 )
You can better zoom in to the bar graph by limiting the y-axis: hist( ibd$PI_HAT, breaks = 100, ylim = c(0,300) ). As you can see, most individuals are unrelated, but there are a few with a relatedness close to 0.5 — i.e. either siblings or parent-children. We can exclude these individuals by picking out all those with a relatedness greater than 0.2 and saving them to a file called FREYUB.relatives.txt.
excluded = ibd[ ibd$PI_HAT > 0.2, c('FID2','IID2')]
write.table( excluded, file="FREYUB.relatives.txt", col.names = F, row.names = F, quote = F )
You’re now done with R, so you can type “q()” and then type “n”
F: Create a Cleansed / Reduced Copy of the Data
Now that we have a list of SNPs to exclude and relatives to exclude, we can filter these out and save a new binary file set that has been cleaned up:
plink --bfile FREYUB --extract FREYUB.prune.in --remove FREYUB.relatives.txt --make-bed -out FREYUB_clean
And you should now see a new set of FREYUB_clean files where the FREYUB_clean.bed takes up considerably less disk space than the original FREYUB.bed.
G: Calculate Allele Frequencies
One thing you might want to do with your data is calculate allele frequencies among these individuals. Run the following:
plink --bfile FREYUB_clean --freq --out FREYUB_clean
And use head to examine the top few lines in FREYUB_clean.frq. As you can see, each SNP is listed along with allele A1 and allele A2. The MAF column is giving you the frequency of allele A1.
Recall that previously we had saved a query to a file called FREYUB.pop. This stored a list of individual_id and family_id values, which we can use to recalculate allele frequencies specific to each population:
plink --bfile FREYUB_clean --freq --within FREYUB.pop --out FREYUB_clean
This saves a “cluster-stratified” file called FREYUB_clean.frq.strat. Use head -n 40 to examine it, and see how the frequency of each SNPs is calculated separately for both the French and Yoruba populations. In most cases the allele frequencies are quite similar, but there may be some that differ noticeably.
H: Calculate Heterozygosity
We might want to compare the amount of heterozygosity in the two populations, as this gives some sense of ancestral population size. Run the following:
plink --bfile FREYUB_clean --het -out FREYUB_clean
And use head to examine the top few lines in FREYUB_clean.het. The column “O(HOM)” reports the number of homozygotes and the column “N(NM)” reports the number of scored SNPs. We can use R to ingest the table and infer the proportion of heterozygous SNPs by subtracting the homozygotes from the total and dividing this by the total. We can do a box plot and a t-test to compare the differences between the populations.
het <- read.table("FREYUB_clean.het", hea=T, as.is=T)
boxplot( ((N.NM. - O.HOM.)/O.HOM.) ~ FID,data=het, xlab="Pops",
t.test( ((N.NM. - O.HOM.)/O.HOM.) ~ FID, het)
I: Cluster the Genomes into Groups
For this particular dataset, there’s little doubt that the genetics of French and Yoruba people should cluster separately. Still, a significant proportion of the Yoruba live in Republic of Benin, which used to be a French colony, so there were opportunities for admixture both during colonial times and in contemporary France as a consequence of migration. Let’s ask PLINK to group these genomes into two clusters by specifying the switch: –K 2 as follows:
plink --bfile FREYUB_clean --cluster --K 2 --out FREYUB_clean
And then inspect the results like so:
You should see two self-organized clusters, SOL-0 and SOL-1. Are any of our individuals in the “wrong” group, based on their family_id values?
If you’re completely done with these data, you can delete them like this:
rm -R FREYUB