Ok so given that there are more than one way to pick up the SNPs we should do it in a really random fashion. The "memory-keeping initial_position" approach introduces a systematic bias. Imagine for example that you have SNPs at 100 consecutive positions, say at positions 101:200, then you're going to pick up SNPs at 101, 121, 141, 161, and 181, which is not good.
So here is an algo that is super fast and extracts a random subset of SNPS with a space of at least 'min.space' between them. The returned subset is "maximal" in the sense that no SNP can be added to it (i.e. the SNPs that didn't make it into it are too close):
library(IRanges)
clusterSNPs <- function(snp_pos, min.space=20)
{
clusters_start_idx <-
which(c(TRUE, diff(snp_pos) >= min.space))
clusters_end_idx <-
c((clusters_start_idx-1L)[-1L], length(snp_pos))
PartitioningByEnd(clusters_end_idx)
}
selectOneFromEachCluster <- function(x)
{
x_eltlens <- unname(elementLengths(x))
i0 <- sample(max(x_eltlens), length(x_eltlens),
replace=TRUE) %% x_eltlens
end(x) - i0
}
### Assumes that 'snp_pos' is sorted. Returns an index into the
### original set.
extractSpacedSNPs <- function(snp_pos, min.space=20)
{
ans <- integer(0)
remaining_idx <- seq_along(snp_pos)
while (length(remaining_idx) != 0L) {
remaining_pos <- snp_pos[remaining_idx]
## 1. Cluster the SNPs:
clusters <- clusterSNPs(remaining_pos, min.space)
## 2. Pick up one SNP in each cluster and add it to
## the result:
selection <- selectOneFromEachCluster(clusters)
ans <- c(ans, remaining_idx[selection])
## 3. Drop remaining SNPs that are at a distance less
## than 'min.space' from the SNPs we just added to
## the result:
remaining_clusters <- relist(remaining_pos, clusters)
keep <- which(unlist(abs(remaining_clusters -
remaining_pos[selection]) >=
min.space))
remaining_idx <- remaining_idx[keep]
}
sort(ans)
}
Then:
snp_pos <- 101:200
idx <- extractSpacedSNPs(snp_pos)
snp_pos[idx]
# [1] 109 136 163 191
Cheers,
H.
would the snpgdsLDpruning function of SNPRelate package help? This requires genotype data in gds format,
and there is a VCF->gds converter. For some reason package gdsfmt has been dropped from CRAN, but can be
installed from the CRAN archive. it would be nice to know the relative performances of methods described here and that one.