07 Detecting Population Substructure Using Hardy-Weinberg Expectations




Sometimes it’s useful to test whether a population is “stratified” (or contains “structure”), or whether it is panmictic. If a grouping is made up of two separate and smaller, inbred populations, we expect homozygosity to be elevated in the separate populations, thus decreasing the heterozygosity that Hardy-Weinberg equilibrium would otherwise predict.

Let’s run two sets of queries: one that finds a combination of French and Yoruba people, and the other that finds only Yoruba people:

\copy (SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype, i.snps FROM individuals i NATURAL JOIN groupings g WHERE i.family_id LIKE 'French' OR i.family_id LIKE 'Yoruba' ORDER BY i.individuals_id ) TO 'FREYUB.ped' (FORMAT CSV, DELIMITER ' ');

\copy (SELECT s.chromosome_id, s.snp_id, s.genetic_distance, s.base_pair FROM datasets d NATURAL JOIN datasets_snps ds NATURAL JOIN snps s WHERE d.dataset_name = 'HGDP' ORDER BY ds.seq_order ) TO 'FREYUB.map' (FORMAT CSV, DELIMITER ' ');

\copy (SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype, i.snps FROM individuals i NATURAL JOIN groupings g WHERE i.family_id LIKE 'Yoruba' ORDER BY i.individuals_id ) TO 'YORUBA.ped' (FORMAT CSV, DELIMITER ' ');

\copy (SELECT s.chromosome_id, s.snp_id, s.genetic_distance, s.base_pair FROM datasets d NATURAL JOIN datasets_snps ds NATURAL JOIN snps s WHERE d.dataset_name = 'HGDP' ORDER BY ds.seq_order ) TO 'YORUBA.map' (FORMAT CSV, DELIMITER ' ');

Per usual, we will eliminate any quotes added in by the CSV format:

perl -i -pe's/"//g' FREYUB.ped 
perl -i -pe's/"//g' YORUBA.ped

We can now make binaries:

plink --file FREYUB --make-bed -out FREYUB
plink --file YORUBA --make-bed -out YORUBA

And now we can do Hardy-Weinberg calculations:

plink --bfile FREYUB --hardy --out FREYUB
plink --bfile YORUBA --hardy --out YORUBA

And we will only retain lines that have “ALL” in each line:

perl -i -ne 'print if /ALL/' FREYUB.hwe
perl -i -ne 'print if /ALL/' YORUBA.hwe

Now we can plot this in R. Here is a function courtesy of Graham Coop’s lab:

                              plot.geno.vs.HW<-function(file,title=""){
 plink.hwe<-read.table(file,as.is=TRUE)
 names(plink.hwe)<-c("chr","SNP.id","which.inds","a1","a2","genotype","obs.het","exp.het","HWE.pval")
 counts<-sapply(plink.hwe$genotype,function(x){as.numeric(strsplit(x,"/")[[1]])})
 counts<-t(counts)
 tot.counts<-rowSums(counts)
 geno.freq<-counts/tot.counts
 allele.freq<-(geno.freq[,1]+.5*geno.freq[,2])
 these.minor<-sample(1:nrow(geno.freq),3000)
 these.major<-sample(1:nrow(geno.freq),3000)
 ss.allele<-c(allele.freq[these.minor],1-allele.freq[these.major])
 ss.geno<-rbind(geno.freq[these.minor,],geno.freq[these.major,c(3,2,1)])

 # If you have adjustcolor library installed, use the following code:
 # plot(ss.allele,ss.geno[,1],xlim=c(0,1),ylim=c(0,1),col=adjustcolor("red",0.1), 
 #   xlab="allele frequency",ylab="genotype frequency",main=title)
 # points(ss.allele,ss.geno[,3],xlim=c(0,1),ylim=c(0,1),col=adjustcolor("blue",0.1))
 # points(ss.allele,ss.geno[,2],xlim=c(0,1),ylim=c(0,1),col=adjustcolor("green",0.1))

 plot(ss.allele,ss.geno[,1],xlim=c(0,1),ylim=c(0,1),col="red", xlab="allele frequency",
   ylab="genotype frequency",main=title)
 points(ss.allele,ss.geno[,3],xlim=c(0,1),ylim=c(0,1),col="blue")
 points(ss.allele,ss.geno[,2],xlim=c(0,1),ylim=c(0,1),col="green")
 smooth=1/5
 lines(lowess(ss.geno[,1]~ss.allele,f = smooth),col="black")
 lines(lowess(ss.geno[,3]~ss.allele,f = smooth),col="black")
 lines(lowess(ss.geno[,2]~ss.allele,f = smooth),col="black")
 x=1:1000/1000
 lines(x,x^2,lty=2)
 lines(x,2*x*(1-x),lty=2)
 lines(x,(1-x)^2,lty=2)
 legend(x=0.3,y=1,col=c("red","blue","green",rep("black",2)),
 legend=c("Homozygote AA","Homozygote aa","Heterozygote Aa","Mean","Hardy Weinberg Expectation"),
   pch=c(rep(1,3),rep(NA,2)),lty=c(rep(NA,3),1,2))
}
                            

c89ef60c-cfbe-cfbe-3a40-8c8527ca38bb

Try drawing two graphs:

plot.geno.vs.HW(file="FREYUB.hwe",title="French and Yoruba")
plot.geno.vs.HW(file="YORUBA.hwe",title="Yoruba Alone")

… and see if you can tell which dataset is a blend of two populations, and which is one population.




Leave a Reply

Your email address will not be published. Required fields are marked *