Hi, I am studying the genetic diversity of a population and I'm using R to filter my genotypic data (single nucleotide polymorphisms, SNPs), build a dendrogram, and estimate the error rate between replicated samples (duplicates/triplicates).
My code:
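# ibsmat is the pairwise IBS matrix; 1 - IBS turns similarity into distance
mat <- 1 - ibsmat
# Drop samples 99 and 100 (rows and columns)
mat1 <- mat[-(99:100), -(99:100)]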
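# ERROR RATE FOR triplicates
# Assumes rows are sorted so that each consecutive block of 3 is one triplicate
x <- seq(1, 274, 3)                 # first row of each block
err.r <- rep(NA, length(x))
for (i in 1:(length(x) - 1)) {      # the last start index is skipped and stays NA
  k <- x[i]
  k1 <- k + 2
  ibx <- mat1[k:k1, k:k1]           # 3 x 3 distance block for this triplicate
  print(ibx)
  err.r[i] <- mean(ibx[lower.tri(ibx)])   # mean of the 3 pairwise distances
}
errorrate <- mean(na.omit(err.r))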
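# ERROR RATE FOR duplicates
# Assumes rows are sorted so that each consecutive block of 2 is one duplicate pair
# Note: this reuses (overwrites) x, err.r and errorrate from the triplicate block
x <- seq(1, 274, 2)                 # first row of each pair
err.r <- rep(NA, length(x))
for (i in 1:(length(x) - 1)) {      # the last start index is skipped and stays NA
  k <- x[i]
  k1 <- k + 1
  ibx <- mat1[k:k1, k:k1]           # 2 x 2 distance block for this pair
  print(ibx)
  err.r[i] <- mean(ibx[lower.tri(ibx)])   # the single pairwise distance
}
errorrate <- mean(na.omit(err.r))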
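To avoid depending on the row order, the same calculation could also be driven by a vector of replicate labels instead of fixed positions. This is only a sketch under my own assumptions: `replicate_error`, `rep_id` and the toy matrix are hypothetical names and values, not part of my real data.

```r
# Hypothetical helper: mean pairwise distance within each replicate group.
# dist_mat: square distance matrix (e.g. 1 - IBS)
# rep_id:   replicate-group label for each row/column (assumed input)
replicate_error <- function(dist_mat, rep_id) {
  stopifnot(nrow(dist_mat) == ncol(dist_mat),
            length(rep_id) == nrow(dist_mat))
  groups <- split(seq_along(rep_id), rep_id)
  # Keep only groups with at least two samples (duplicates, triplicates, ...)
  groups <- groups[lengths(groups) >= 2]
  vapply(groups, function(idx) {
    sub <- dist_mat[idx, idx, drop = FALSE]
    mean(sub[lower.tri(sub)], na.rm = TRUE)
  }, numeric(1))
}

# Toy data: A is a triplicate, B a duplicate, C is not replicated
toy <- matrix(0, 6, 6)
toy[1, 2] <- toy[2, 1] <- 0.02
toy[1, 3] <- toy[3, 1] <- 0.04
toy[2, 3] <- toy[3, 2] <- 0.06
toy[4, 5] <- toy[5, 4] <- 0.10
ids <- c("A", "A", "A", "B", "B", "C")

per_group <- replicate_error(toy, ids)   # one value per replicated sample
overall <- mean(per_group)               # overall error rate across groups
```

Written this way, unreplicated samples are simply ignored, and duplicates and triplicates can sit in the same matrix in any order.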
My questions are:
1) Should my .csv file contain only the sorted triplicates and duplicates, or all of the data (duplicates, triplicates and unreplicated samples)?
2) Should I filter that .csv file before estimating the error rate? By filtering I mean, for example, filtering on the percentage of missing data (%NA) per genotype.
3) If there's an error in my code, please feel free to comment.
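To make clear what I mean by filtering in question 2, here is a minimal sketch on a toy matrix. The object `geno`, its 0/1/2 coding and the 0.5 threshold `max_missing` are assumptions for illustration only, not my actual data or settings.

```r
# Toy genotype matrix: rows = samples, columns = SNPs, NA = missing call
geno <- matrix(c(0, 1,  NA, 2,
                 1, NA, NA, 2,
                 0, 1,  2,  2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("s1", "s2", "s3"), paste0("snp", 1:4)))

max_missing <- 0.5   # assumed threshold, for illustration only

# Fraction of missing calls per SNP, then drop SNPs above the threshold
snp_missing <- colMeans(is.na(geno))
geno_f <- geno[, snp_missing <= max_missing, drop = FALSE]

# Fraction of missing calls per sample on the remaining SNPs
sample_missing <- rowMeans(is.na(geno_f))
geno_f <- geno_f[sample_missing <= max_missing, , drop = FALSE]
```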
Thanks, Meriam