Hi all, I have made a PCA plot from SNP data using SNPRelate with the following code:
library(SNPRelate)
library(gdsfmt)
vcf.fn <- "input.vcf"
snpgdsVCF2GDS(vcf.fn, "test.gds", method="biallelic.only")
genofile <- snpgdsOpen("test.gds")
pop_code <- read.gdsn(index.gdsn(genofile, "genotype"))
set.seed(1000)
snpset <- snpgdsLDpruning(genofile, autosome.only=FALSE, ld.threshold=0.1)
snpset.id <- unlist(snpset)
pca <- snpgdsPCA(genofile, autosome.only=FALSE, snp.id=snpset.id, num.thread=2)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
tab <- data.frame(sample.id = pca$sample.id,
EV1 = pca$eigenvect[,1],
EV2 = pca$eigenvect[,2],
stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1")
I want to color the points by pre-defined groups in the population. However, my input file does not contain that information; I have a separate file with two columns named "sample.id" and "group". I have seen posts on adding color to a SNPRelate PCA plot when the group information is stored in the input file used to generate the PCA, but I could not find a way to use a separate file for this. Is there a way to do it, either with SNPRelate or with some other tool? Thank you for the help!
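For reference, here is a minimal sketch of the kind of thing I am after, not a confirmed solution. It uses only base R and toy data: the sample IDs, the values in `tab`, the `groups` data frame, and the file name `groups.txt` are all made up for illustration. The idea is to look up each sample's group by its ID with `match()`, so the two tables do not need to be in the same order, and then color by the resulting factor.

```r
# Toy stand-in for the `tab` data frame built from the snpgdsPCA result
# (illustrative values only, not real eigenvectors)
tab <- data.frame(sample.id = c("S1", "S2", "S3", "S4"),
                  EV1 = c(0.10, -0.20, 0.05, 0.30),
                  EV2 = c(-0.10, 0.25, 0.00, -0.15),
                  stringsAsFactors = FALSE)

# Toy stand-in for the separate group file. With a real file this might be:
#   groups <- read.table("groups.txt", header = TRUE, stringsAsFactors = FALSE)
# where "groups.txt" is a hypothetical file name.
groups <- data.frame(sample.id = c("S3", "S1", "S4", "S2"),
                     group = c("popB", "popA", "popB", "popA"),
                     stringsAsFactors = FALSE)

# Look up each PCA sample's group by ID; row order in the two tables may differ
tab$group <- factor(groups$group[match(tab$sample.id, groups$sample.id)])

# Color points by the integer code of the group factor and add a legend
plot(tab$EV2, tab$EV1, col = as.integer(tab$group), pch = 19,
     xlab = "eigenvector 2", ylab = "eigenvector 1")
legend("bottomright", legend = levels(tab$group),
       col = seq_along(levels(tab$group)), pch = 19)
```

Samples that are missing from the group file would get `NA` as their group and would not be drawn, so checking `sum(is.na(tab$group))` before plotting seems sensible.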