Question

Filter every X position

1

Entering edit mode

remi.tournebize ▴ 10

@remitournebize-7099

Last seen 9.2 years ago

France

Dear all,

I would like to filter the SNPs of a VCF file so to extract only SNPs distant of X kb between each other. I do not see any built-in function to perform this kind of filtering so aimed at writing a filter rule to achieve it and pass it through the filterVcf function of the VariantAnnotation package.

However, the kind of function I was considering would require a for loop which would compare for each SNP entry its position to the last_accepted_position so that the SNP is given a logical TRUE if last_accepted_position - position > X or a FALSE if not. At each iteration of the loop, last_accepted_position would be incremented to last_accepted_position + position if the SNP is TRUE, or would remain the same if SNP is FALSE.

How can I pass such loop using filterRules and filterVcf of the VariantAnnotation package?

Many thanks for your help,

Rémi

variantannotation • 1.8k views

ADD COMMENT • link 9.4 years ago • updated 9.3 years ago remi.tournebize ▴ 10

0

Entering edit mode

would the snpgdsLDpruning function of SNPRelate package help? This requires genotype data in gds format,

and there is a VCF->gds converter. For some reason package gdsfmt has been dropped from CRAN, but can be

installed from the CRAN archive. it would be nice to know the relative performances of methods described here and that one.

ADD REPLY • link 9.4 years ago Vincent J. Carey, Jr. 6.7k

score 1 · Answer 1 · 2014-12-17

1

Entering edit mode

Julian Gehring ★ 1.3k

@julian-gehring-5818

Last seen 5.0 years ago

You should be able to do this without a loop. Assuming that you have your SNPs in a VRanges or VCF object:

pos = start(vr)
inter_snp_distance = diff(pos)
idx = ( inter_snp_distance >= your_threshold )

ADD COMMENT • link 9.4 years ago Julian Gehring ★ 1.3k

score 1 · Answer 2 · 2014-12-17

In the context of using filterVcf(), and using Julian's approach, the idea is to create a 'closure' (in this case, a function that returns a function) that can remember state (last_pos) between invocations. Here's my filter factory (notice the use of <<- to update last_pos in the environment in which the filter function is defined, rather than in the body of the function itself).

factory <- function(threshold) {
    last_pos <- 0L
    function(x) {
        dist <- diff(c(last_pos, start(x)))
        keep <- dist > threshold
        if (any(keep))
            last_pos <<- tail(start(x)[keep], 1)
        keep
    }
}
filt <- FilterRules(list(isDistant = factory(100)))

and application

fl <- system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")
destination <- tempfile()
filtered <- filterVcf(fl, "hg19", destination, filters=filt)

A little more challenging is dealing with changes in chromosome, which requires that the last position be reset.

score 1 · Answer 3 · 2014-12-19

Ok so given that there are more than one way to pick up the SNPs we should do it in a really random fashion. The "memory-keeping initial_position" approach introduces a systematic bias. Imagine for example that you have SNPs at 100 consecutive positions, say at positions 101:200, then you're going to pick up SNPs at 101, 121, 141, 161, and 181, which is not good.

So here is an algo that is super fast and extracts a random subset of SNPS with a space of at least 'min.space' between them. The returned subset is "maximal" in the sense that no SNP can be added to it (i.e. the SNPs that didn't make it into it are too close):

library(IRanges)

clusterSNPs <- function(snp_pos, min.space=20)
{
   clusters_start_idx <-
       which(c(TRUE, diff(snp_pos) >= min.space))
   clusters_end_idx <-
       c((clusters_start_idx-1L)[-1L], length(snp_pos))
   PartitioningByEnd(clusters_end_idx)
}

selectOneFromEachCluster <- function(x)
{
    x_eltlens <- unname(elementLengths(x))
    i0 <- sample(max(x_eltlens), length(x_eltlens),
                 replace=TRUE) %% x_eltlens
    end(x) - i0
}

### Assumes that 'snp_pos' is sorted. Returns an index into the
### original set.
extractSpacedSNPs <- function(snp_pos, min.space=20)
{
    ans <- integer(0)
    remaining_idx <- seq_along(snp_pos)
    while (length(remaining_idx) != 0L) {
        remaining_pos <- snp_pos[remaining_idx]
        ## 1. Cluster the SNPs:
        clusters <- clusterSNPs(remaining_pos, min.space)
        ## 2. Pick up one SNP in each cluster and add it to
        ##    the result:
        selection <- selectOneFromEachCluster(clusters)
        ans <- c(ans, remaining_idx[selection])
        ## 3. Drop remaining SNPs that are at a distance less
        ##    than 'min.space' from the SNPs we just added to
        ##    the result:
        remaining_clusters <- relist(remaining_pos, clusters)
        keep <- which(unlist(abs(remaining_clusters -
                                 remaining_pos[selection]) >=
                                                  min.space))
        remaining_idx <- remaining_idx[keep]
    }
    sort(ans)
}

Then:

snp_pos <- 101:200
idx <- extractSpacedSNPs(snp_pos)
snp_pos[idx]
# [1] 109 136 163 191

Cheers,

H.

score 0 · Answer 4 · 2014-12-18

0

Entering edit mode

remi.tournebize ▴ 10

@remitournebize-7099

Last seen 9.2 years ago

France

Dear Julian and Martin, thank you so much for your messages and propositions.

Julian, I wrote this script initially but the problem of the diff function is that it computes the difference between consecutive pairs of array elements. Thus, if I have for instance an array

0 50 60 70 100

And a threshold of 20, my SNP will be

TRUE FALSE FALSE TRUE

However using a memory-keeping initial_position variable, I would get

TRUE TRUE FALSE TRUE TRUE

I believe using the diff function would underestimate the number of relevant SNPs.

Best regards,

Rémi

ADD COMMENT • link 9.4 years ago remi.tournebize ▴ 10

0

Entering edit mode

Okay, now I understand what you want to do, and why you switched to the loop. Out of curiosity, is there a particular reason why the algorithm should not filter based the difference between neighboring sites? The approach you have outlined can yield different results, depending if you start filtering from the beginning or from the end of the chromosome. Is this intentional?

ADD REPLY • link 9.4 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

or to emphasize Julian's last comment, if you had the following situation

0 45 55 100

your "memory-keeping initial_position" approach would keep the SNP at position 45 and drop the SNP at position 55. Why do you consider the SNP at 45 more relevant than the SNP at 55? I guess I'm questioning the notion of SNP relevance based on the gap between them.

Cheers,

H.

ADD REPLY • link 9.4 years ago Hervé Pagès 16k

0

Entering edit mode

Dear Julian and Hervé,

"Relevant" might be an unappropriate adjective. My objective is to filter a dataset of SNPs for demographic studies. One priority for my models is to consider SNPs assumed to be independent. To do so, I have calculated a mean LD distance and want to extract SNPs which are located outside that threshold. It does not matter however that I take, referring to Hervé's example, 45 instead of 55, or 55 instead of 45. This is why I do not mind that the memory-keeping initial_position variable initialize at an arbitrary site (whether the beginning or the end of the chromosome, or an arbitrary position in the middle). True though that for detecting selection signal, for instance, this arbitrarity would introduce a bias and the algorithm should be different.

Best regards,

Rémi

ADD REPLY • link 9.4 years ago remi.tournebize ▴ 10

0

Entering edit mode

I think there's probably a clever pure R way to do what you want, but that requires a lot of thinking! Here's an Rcpp solution.

library(Rcpp)
runningdiff <- cppFunction("
    LogicalVector runningdiff(IntegerVector x, IntegerVector d) {
        int len = x.length(), diff = d[0], curr = x[0], last=0;
        LogicalVector result = LogicalVector(len);
        result[0] = TRUE;
        for (int i=1; i < len; ++i)
            if (x[i] - last > diff) {
                result[i] = TRUE;
                last = x[i];
            }
        return result;
    }
")

Once compiled, runningdiff can be used like a regular R function, keep = runningdiff(start(vcf), 20) and will be fast.

ADD REPLY • link 9.4 years ago Martin Morgan 25k

0

Entering edit mode

...and perhaps the pure R solution is, for sorted x

!duplicated(findInterval(x, x + diff))

though some tweaking might be required to deal with notions of distance. I think for non-sorted inputs the solution is to replace x + diff with unique(sort(x)) + diff.

ADD REPLY • link 9.4 years ago Martin Morgan 25k

score 0 · Answer 5 · 2015-01-09

0

Entering edit mode

remi.tournebize ▴ 10

@remitournebize-7099

Last seen 9.2 years ago

France

Many thanks Martin and Hervé!

Wishing you a very happy new year,

Rémi

ADD COMMENT • link 9.3 years ago remi.tournebize ▴ 10