Often it is necessary to analyze disparate datasets together. For example, maybe you sequenced a portion of your genome through 23&me and want to analyze it together with a selection of data from snppy. Difficulties include (1) are there enough SNPs in common to create a merged dataset that’s worth analyzing? (2) have the SNPs been sequenced from the same strand, or were they sequenced from different strands?
Let’s go over to the Personal Genome Project site and pick out someone’s 23&me data. I’m going to randomly select someone — in this case a PGP participant named Noel Sarah Dietrich, who has kindly made her genome available to us. She is a 39 year old fitness instructor and communications director at a school in Wilmington, Delaware who has the user name “NSD23andme” and ID number “hu8BF030.”
On PGP, you can find her profile here: https://my.pgp-hms.org/profile/hu8BF030 along with a download link. She has also provided some health and trait information — e.g. she has iron deficiency anemia, she has blue eyes, dark brown hair, weighs 70 kg, and is 172cm tall. None of this is useful to us, and unfortunately her grandparents are all from the United States, so this doesn’t tell us much, other than she identifies as “white.” Dietrich is a German name, so presumably she has some German heritage, likely by way of her paternal line. But that by itself tells us nothing about what proportion of her ancestry is German.
So let’s check out her genome. Create a directory on the VM (e.g. “mkdir NOEL”) and enter the directory (e.g. “cd NOEL”). Using your personal computer, on the PGP webpage for Noel, use control-click to copy the download link. To download her genome on the VM, paste it into a wget command as follows (note that the -O switch is a capital letter “O” not the integer zero “0”):
wget https://my.pgp-hms.org/user_file/download/2863 -O noel_genome.txt
Examine the top 30 lines to check that the file is really a 23&me format:
head -n 30 noel_genome.txt
Now let’s assemble some genomes to merge with Noel so as to make some productive comparisons. Assuming that Noel is of European descent, let’s just download a bunch of Europeans, but excluding “UtahWhite” as they will just muddy the distinction between populations:
SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype
FROM individuals i NATURAL JOIN groupings g
WHERE g.group_name = 'Europe' AND i.family_id <> 'UtahWhite'
ORDER BY i.individuals_id;
As per usual, we will download as single-line COPY commands:
\copy ( SELECT i.family_id, i.individual_id, i.paternal_id, i.maternal_id, i.sex, i.phenotype, i.snps FROM individuals i NATURAL JOIN groupings g WHERE g.group_name = 'Europe' AND i.family_id <> 'UtahWhite' ORDER BY i.individuals_id ) TO 'NWEUROPE.ped' (FORMAT CSV, DELIMITER ' ');
\copy ( SELECT s.chromosome_id, s.snp_id, s.genetic_distance, s.base_pair FROM datasets d NATURAL JOIN datasets_snps ds NATURAL JOIN snps s WHERE d.dataset_name = 'HGDP' ORDER BY ds.seq_order ) TO 'NWEUROPE.map' (FORMAT CSV, DELIMITER ' ');
Followed by our usual fix and BED file formatting:
perl -i -pe's/"//g' NWEUROPE.ped
plink --file NWEUROPE --make-bed -out NWEUROPE
Now let’s do BED file formatting for Noel’s genome. The first parameter is the filename of her 23&me genome (“noel_genome.txt”); the second parameter is the family_id we will give her (“NoelDietrich”); and the third parameter is her individual_id (“hu8BF030”). The fourth parameter is the sex of the person — in this case “i” tells PLINK to guess at the sex based on the presence of sex chromosomes. The fifth parameter is the phenotype (here -9 to indicate missing or not applicable), and the next two parameters are the parental IDs.
plink --23file noel_genome.txt NoelDietrich hu8BF030 i -9 0 0 --alleleACGT --make-bed -out NOELDIETRICH
If we compare NWEUROPE.log with NOELDIETRICH.log, we can see that the former has 135,056 SNPs while the latter has 610,545 SNPs. We will need to reconcile these SNPs.
First, let’s try to merge the two datasets:
plink --bfile NWEUROPE --bmerge NOELDIETRICH.bed NOELDIETRICH.bim NOELDIETRICH.fam --make-bed --out NWEUROPE_ND
If binary merging fails because some SNPs appear to have more than two alleles, a list of offending variant(s) will be written to NWEUROPE_ND-merge.missnp. Count up the number of lines in NWEUROPE_ND-merge.missnp using the following:
wc -l NWEUROPE_ND-merge.missnp
I’m getting 21,799 3+ SNPs, which are problematic (this number is also reported in the NWEUROPE_ND.log file). Human populations are too small to support many tri-allelic SNPs because in small populations it’s unlikely for SNPs to persist for a long time. Typically, biallelic SNPs will have run to fixation or extinction before another mutation has a chance to turn it into a triallelic SNP. So if there are more than two alleles, it’s likely because this method of genotyping used different strands in the genomes that you’re trying to merge. We can test this by “flipping” the problematic alleles in one of the datasets:
plink --bfile NOELDIETRICH --flip NWEUROPE_ND-merge.missnp --make-bed --out NOELDIETRICH_F
I’ve added an “F” to the end of the project name in order to indicate that the problem SNPs have been flipped.
So now we can try again, this time merging NOELDIETRICH_F with PGPAA:
plink --bfile NWEUROPE --bmerge NOELDIETRICH_F.bed NOELDIETRICH_F.bim NOELDIETRICH_F.fam --make-bed --out NWEUROPE_NDF
This time the merge succeeded. Had it not succeeded, but if only a few SNPs were still problematic, you could still force a merge by excluding the remaining problematic SNPs using this: –exclude NWEUROPE_NDF-merge.missnp.
At any rate, this merge worked, so all’s well. At this point our dataset has tons of SNPs from the 23&me data that are not scored in the snppy data. We might as well exclude these by first generating a SNP list for the original smaller file:
plink --bfile NWEUROPE --write-snplist --out NWEUROPE
Filter our merged data so that our SNPs match the smaller dataset:
plink --bfile NWEUROPE_NDF --extract NWEUROPE.snplist --make-bed --out NWEUROPE_NDFS
I’ve now added an “S” to our project name to indicate that it’s now smaller, and we’ve reduced the number of SNPs from 630,451 to 135,056.
At this point I might want to do a PCA analysis, like so:
plink --bfile NWEUROPE_NDFS --pca 4 --out NWEUROPE_NDFS
And the following graph seems to put her somewhere between the British and French populations:
plinkPCA( file="NWEUROPE_NDFS", subVec = c("NoelDietrich", "German", "GreatBritain", "French", "Spaniards", "Orcadian"), xLim=c(-0.02,0.01), yLim=c(0.01,0.03) )
Note that because she comes from mixed ancestry, it’s not necessarily true that she is a blend of British and French ancestry — she could well be a blend of Swedish and Spanish ancestry and she’d still emerge in the same location. What kind of analysis would do a better job of estimating how Noel is blended?