08 Inferring Individual Ancestries with ADMIXTURE




A common way to investigate the history of populations is to examine the genetic affinities of the individuals within them. Both the Structure and ADMIXTURE programs take a parameter K, the presumed number of ancestral genetic clusters in the entire gene pool, and then estimate, for each individual, what proportion of that person's genome most likely derives from each of these clusters. This approach allows us to infer the existence of distinct populations, assign individuals to populations, visualize hybrid zones, and identify admixed individuals. Structure uses a Bayesian approach while ADMIXTURE uses maximum likelihood. We will be using the latter because it runs faster.

If you have a Mac or Linux computer, you are welcome to download ADMIXTURE from here and run your analyses locally. Otherwise, run your analyses on our VM.

To begin with, you'll want to assemble a dataset. The following query picks up to 3 individuals at random from each family_id:

SELECT DISTINCT ON (rf.individual_id) rf.*
FROM ( SELECT rank() OVER ( PARTITION BY family_id ORDER BY random() ),
              g.group_name, i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype
       FROM individuals i NATURAL JOIN groupings g
       WHERE i.family_id NOT IN
              (SELECT family_id FROM groupings WHERE family_id LIKE 'African%' OR family_id LIKE '%Jews')
     ) rf
WHERE rf.rank < 4;

You may find this query difficult to decipher because it uses an SQL technique called "windowing." Essentially, the set of individual records (basically everyone except African Americans and Jews, excluded because we can already assume that their ancestry conflicts with their geography) is partitioned by family_id, and within each family_id the records are ordered by a random number and given a rank. We then filter out the records with a rank greater than 3. This has the effect of picking up to 3 individuals at random from each family_id (which we use to mean ethnicity or country).

Let’s start by creating a directory where we will store our files. I’ll call this “MYDATA” — but feel free to name it what you want and modify all further instances of it accordingly. Then we will connect to our snppy database interface:

mkdir MYDATA
cd MYDATA
psql snppy

The \copy method of exporting data requires that the entire query be written on one line. Use the following to export the individual data (which we will turn into a PED file shortly) and to create the MAP file:

\copy ( SELECT DISTINCT ON (rf.individual_id) rf.* FROM ( SELECT rank() OVER ( PARTITION BY family_id ORDER BY random() ), g.group_name, i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype, i.snps FROM individuals i NATURAL JOIN groupings g WHERE i.family_id NOT IN (SELECT family_id FROM groupings WHERE family_id LIKE 'African%' OR family_id LIKE '%Jews') ) rf WHERE rank < 4 ) TO 'MYDATA.txt' (FORMAT CSV, DELIMITER ' ');

\copy ( SELECT s.chromosome_id, s.snp_id, s.genetic_distance, s.base_pair FROM datasets d NATURAL JOIN datasets_snps ds NATURAL JOIN snps s WHERE d.dataset_name = 'HGDP' ORDER BY ds.seq_order ) TO 'MYDATA.map' (FORMAT CSV, DELIMITER ' ');
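
Before moving on, it's worth a quick peek at what was exported. These checks are optional suggestions, not part of the original pipeline (file names as above; the exact numbers depend on your random sample):

wc -l MYDATA.txt MYDATA.map   # one row per sampled individual, one row per SNP
head -c 300 MYDATA.txt        # the first few fields, followed by the start of the double-quoted SNP string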

The CSV file contains double quotes around the string of SNPs, so we need to remove them with the following Perl one-liner:

perl -i -pe's/"//g' MYDATA.txt

Now we need to create the PED file, as well as a file that connects individual_id, family_id, and regional groups.

perl -F"\s" -nlae'print join "\t", ($F[3], $F[2], $F[1])' MYDATA.txt > MYDATA.grp 
perl -F"\s" -nlae'print join "\t", @F[2..@F]' MYDATA.txt > MYDATA.ped

Note that these commands can take a minute or two to execute.
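
Once they finish, a quick sanity check can catch problems before we convert to binary format. Again, these commands are just suggestions:

wc -l MYDATA.ped MYDATA.grp              # both should have one line per sampled individual
awk '{print NF}' MYDATA.ped | sort -u    # every row of the PED file should have the same number of fields
cut -f2 MYDATA.grp | sort | uniq -c | sort -rn | head   # no family_id should appear more than 3 times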

We can now create the binary files:

plink --file MYDATA --make-bed --out MYDATA
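
For reference, if you did want to apply some basic quality filters before running ADMIXTURE, a minimal PLINK sketch might look like the following. The thresholds here are illustrative, not prescribed by this tutorial:

plink --bfile MYDATA --geno 0.05 --mind 0.1 --maf 0.01 --make-bed --out MYDATA_clean
# --geno drops SNPs with more than 5% missing calls, --mind drops individuals with
# more than 10% missing genotypes, and --maf drops SNPs with minor allele frequency below 1%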

Normally we would run through some data-cleansing steps, as we did previously (for example, the filters sketched above), but for the sake of expediency let's just proceed with our data "as is." Our VM has 16 CPUs, which means that each of our teams can safely use 2 threads, so we can set -j to 2. The value of K is up to you to decide. Here I'm using a K of 5:

nohup time admixture -j2 MYDATA.bed 5 > admix.log 2>&1 &
echo $! > save_pid.txt

This analysis will probably take about ten minutes, but depending on the number of individuals and SNPs, it could take many hours. This is why we're beginning with "nohup" and ending with "&", which allows the process to continue working even after you have logged out. What the program would normally write to the screen is now diverted to the file called admix.log. The echo command writes the process ID of the background job to a file called save_pid.txt. Feel free to check on your analysis periodically by typing:

tail admix.log

To get the process number, read the save_pid.txt file with “cat save_pid.txt”. Supposing that the process ID number is 18934, you can check whether the process is still running like so:

ps -p 18934 -o %cpu,%mem,cmd

Also, if you want, you can use this PID to kill the process (for example, kill 18934).
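
If you are not sure that a single K is the right choice, ADMIXTURE also has a cross-validation mode that reports an error estimate for each K (lower is better). A sketch, reusing the same file and flag conventions as above; note that each extra K adds a full run, and each run rewrites the corresponding .Q and .P files:

for K in 3 4 5 6 7; do
    admixture --cv -j2 MYDATA.bed $K > cv${K}.log
done
grep -h "CV error" cv*.log   # pick the K with the lowest cross-validation error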

Once complete, you should find a Q file containing the data that we will graph using R. This is a fairly long script, so you may prefer to save it to a file called "drawbargraph.R", then start up R and execute it with source("drawbargraph.R"). Be sure to modify projectname and KNum to match your parameters:

# Change these parameters to match your needs:
projectname <- "MYDATA"
KNum <- 5
fileType <- "Q"

# Ingest the Q file
Q=read.table( paste(projectname,KNum,fileType,sep="."), sep=" ", header=FALSE )
Qmat=as.matrix(Q)

# We can get the population groups from the GRP file:
clst=read.table( paste( projectname,"grp",sep="."), header=FALSE )

# We also need the FAM file:
fam=read.table(paste(projectname,"fam",sep="."))

# We will use the match function to link the family ids with the cluster id
clst_unord=clst$V3[match(fam$V2,clst$V1)]

# And then re-order alphabetically by region:
ordered_indices=order(clst_unord)
QmatO=Qmat[ordered_indices,]

# We will write a master table matching the alphabetical order:
fams <- fam[c(1,2)]
masterlist = cbind(fams[ordered_indices,], QmatO)
write.table(masterlist,file=paste(projectname,"csv",sep="."), sep=",", col.names=NA)

# Compute where to print the region labels:
n=length(ordered_indices)
clst_ord=clst_unord[ordered_indices]
breaks=c(0,which(clst_ord[1:(n-1)]!=clst_ord[2:n]),n)
nbrks=length(breaks)
midpts=(breaks[1:(nbrks-1)]+breaks[2:nbrks])/2

# Draw the plot:
barplot(t(QmatO),col=c("red","blue","yellow","orange","purple", "brown"),border=NA,space=0,inside=TRUE)
abline(v=breaks,lwd=2)
# Use unique() on the sorted vector so this works whether the grp column was
# read as character (the default in R >= 4.0) or as a factor:
mtext(unique(as.character(clst_ord)),side=1,at=midpts,las=2)
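
If you would rather write the plot straight to an image file instead of viewing it interactively, one option (the output file name is just an example) is to run the script through Rscript with a graphics device opened first:

Rscript -e 'png("MYDATA_admixture.png", width=1200, height=600); source("drawbargraph.R"); dev.off()'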

Here’s what I got from my set of individuals:



