09 Principal Components Analysis




Principal components analysis (PCA) is frequently used to investigate population structure. As with others, this method is sensitive to relatedness and linkage disequilibrium, so formally you should first filter these with some cleansing steps. But here we will run a PCA on a subset of Europeans as-is.

The following query selects for Europeans but excludes various peoples in the east (ethnicities of Russia, the Caucuses, Turkic people, Jews, etc.):

                              SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype
FROM individuals i NATURAL JOIN groupings g
WHERE g.group_name = 'Europe' 
AND i.family_id NOT IN ('Abkhazians', 'Adygei', 'Altaians', 'Armenians', 'AshkenazyJews', 'AzerbaijaniJews', 
'Azeri', 'Balkar', 'Belorussian', 'Chechens', 'Dolgans', 'Evens', 'FrenchJews', 'Gagauz', 'Georgian', 'GeorgianJews', 
'ItalianJews', 'Karelian', 'Komi', 'Kumyks', 'Lezgins', 'Lithuanians', 'Maris', 'Moldavian', 'Mordovians', 'Ossetian', 
'Russian', 'Selkups', 'Tabassaran', 'Tatar', 'Turks', 'Udmurt', 'Ukranians', 'UtahWhite', 'Veps')
ORDER BY i.individuals_id;
                            
                

As usual, we will remove all return keys from this query so that it can output to a local file. I will be using the prefix ‘EUROPE’:

\copy ( SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype, i.snps FROM individuals i NATURAL JOIN groupings g WHERE g.group_name = 'Europe' AND i.family_id NOT IN ('Abkhazians', 'Adygei', 'Altaians', 'Armenians', 'AshkenazyJews', 'AzerbaijaniJews', 'Azeri', 'Balkar', 'Belorussian', 'Chechens', 'Dolgans', 'Evens', 'FrenchJews', 'Gagauz', 'Georgian', 'GeorgianJews', 'ItalianJews', 'Karelian', 'Komi', 'Kumyks', 'Lezgins', 'Lithuanians', 'Maris', 'Moldavian', 'Mordovians', 'Ossetian', 'Russian', 'Selkups', 'Tabassaran', 'Tatar', 'Turks', 'Udmurt', 'Ukranians', 'UtahWhite', 'Veps') ORDER BY i.individuals_id ) TO 'EUROPE.ped' (FORMAT CSV, DELIMITER ' ');

\copy ( SELECT s.chromosome_id, s.snp_id, s.genetic_distance, s.base_pair FROM datasets d NATURAL JOIN datasets_snps ds NATURAL JOIN snps s WHERE d.dataset_name = 'HGDP' ORDER BY ds.seq_order ) TO 'EUROPE.map' (FORMAT CSV, DELIMITER ' ');

We then exit psql and use Perl to remove quotes, followed by making the PED binary, followed by a PCA analysis that expresses variation in 6 axes:

perl -i -pe's/"//g' EUROPE.ped
plink --file EUROPE --make-bed -out EUROPE
plink --bfile EUROPE --pca 6 --out EUROPE

It’s probably worth looking at the eigenvalues to check that there is not a big drop among the subsequent components beyond the first 2 we plan to plot:

cat EUROPE.eigenval

Those of you who cannot get X-forwarding to work, you will now want to transfer the eigenvector and eigenvalue files to your local computer. You’ll need to have these R libraries on your local machine:

install.packages("ggplot2")
install.packages("reshape2")
install.packages("plyr")
install.packages("ape")
install.packages("igraph")
install.packages("ggplot2")

Create a file called “makePCA.R” and save the following function, courtesy of Razib Kahn:

                              library(ggplot2)
library(reshape2)
library(plyr)

# path to the data directory
theDir=""

plinkPCA = function(PC=NULL,colLabel="Group", subVec=NULL,
                    xLim=NULL,yLim=NULL,labelPlot=NULL,
                    file="plink",theSep=" ") {
   if(!is.null(PC)){
      xAxis=PC[1]
      yAxis=PC[2]
   }else{
      xAxis="PC1"
      yAxis="PC2"
   }
   vecfile = paste(paste(theDir,file,sep=""),"eigenvec",sep=".")
   PCA=read.csv(vecfile,header=F,sep=theSep)

   valfile = paste(paste(theDir,file,sep=""),"eigenval",sep=".")
   EigVal=read.csv(valfile,header=F,sep=theSep)

   getLength=length(PCA[1,])
   HeaderVector=c("Group", "ID")
   start = 1
   while (start <= getLength-2){
      HeaderVector=c(HeaderVector, paste("PC",start,sep=""))
      start=start+1
   }
   colnames(PCA) <- HeaderVector

   p=ggplot(PCA,aes_string(xAxis,yAxis,colour="Group"))+geom_point(size=2,alpha = 6/10)
   p=p+ggtitle(paste(xAxis,yAxis,sep=" vs. "))+ theme(plot.title = element_text(lineheight=.5, face="bold",size=14))

   # p=p+guides(colour=guide_legend(nrow=20,byrow=TRUE)) 

   if(!is.null(subVec)){
      colorList=rainbow(length(subVec))
      start = 1
      for (i in subVec){
         print(i)
         p=p+geom_point(data=subset(PCA,Group==i), aes(shape=Group), size=3, color="black")
         start=start+1    
      }
   }

   if(!is.null(labelPlot)){
      start = 1
      for (i in labelPlot){
         print(i)
         p=p+geom_text(data=subset(PCA,Group==i), aes(label=ID),size=2.5,color='black')
         start=start+1    
      }
   }

   if(!is.null(xLim)){
      p=p+xlim(xLim[1],xLim[2])
   }

   if(!is.null(yLim)){
      p=p+ylim(yLim[1],yLim[2])
   }

   p
}
                            
              

Now you can enter the R environment and load up this function like so:

source("makePCA.R")

And draw a PCA with this:

plinkPCA(file="EUROPE")

Because there are lots of family_id values, the colors become a bit redundant. You can help highlight certain groups by offering a subVec value:

plinkPCA( file="EUROPE", subVec = c("Sicily") )

You will notice that one data point seems out of place. You can investigate this by having the individual_id printed using labelPlot and zooming in using xLim and yLim, like so:

plinkPCA( file="EUROPE", subVec = c("Croat"), labelPlot = c("Sicily"), xLim=c(0,0.03), yLim=c(-0.01,0.03) )

We can now see that one of the data points that is classified as Sicilian, is actually Croatian. This is something we should fix in the database.

See if you can figure out what the PC1 and PC2 best represent. Are you surprised by what you find?




Leave a Reply

Your email address will not be published. Required fields are marked *