Question: parRapply in BiocParallel ?
1
gravatar for kevin.rue
3.3 years ago by
kevin.rue220
University of Oxford
kevin.rue220 wrote:

Hi all,

I have a feeling this is an easy one, but browsing the forum and the BiocParallel vignette, I can't find it.

Is there a BiocParallel equivalent of parRapply?

I have a nice MulticoreParam object to handle parallel computing, and if I could I would like to avoid making a separate cluster using the parallel package.

 

For context, I would like to apply a function to each row of the geno(vcf)[["GT"]] matrix to calculate the allele frequency of each variant (AF = 1*[0/1] + 2*[1/1]). I have ~80k thousand variants that I would like to reduce by filtering those > 1% MAF.

...except if I am missing an existing function that does part or all that I want? The reason why I am calculating it myself is that there are some funky genotypes in there (1/0, ./1, 0/.,  and more) which makes me wonder how the AF field of the VCF is calculated. At this stage, I simply prefer calculating the AF checking genotypes myself.

Cheers!

==== EDIT to add example ===

Thanks for both your comments.

Here is a made-up example:

# Define accepted genotypes
refG = c("0/0")
altHet = c("0/1")
altHom = c("1/1")
# A sample of genotypes observed in the VCF file
allG = c(
  "0/0", "0", "1/1",
  "1/0", "1", "2",
  ".", "./.", "./0", "./1", "0/.", "1/.")

# Make up a matrix like the one obtained using VariantAnnotation
genos <- matrix(
  data = sample(
    x = allG, size = 1E6, replace = TRUE,
    prob = c(rep(0.25, 3), rep(0.25/9, 9))),
  nrow = 1E3,
  ncol = 1E3)
dim(genos)
genos[1:4,1:4]

# My current solution to calculating MAF considering "accepted" genotypes
minorAlleleFreq <- function(genos, refGenos, altHetGenos, altHomGenos){
  tableCounts <- table(factor(
    x = genos,
    levels = c(refGenos, altHetGenos, altHomGenos)
  ))
  alleleCount <- tableCounts[altHetGenos] + 2 * tableCounts[altHomGenos]
  alleleNumber <- 2 * sum(tableCounts)
  return(alleleCount / alleleNumber)
}
mafs <- apply(
  X = genos,
  MARGIN = 1,
  FUN = minorAlleleFreq,
  refGenos = refG,
  altHetGenos = altHet,
  altHomGenos = altHom)

# For your eyes only :)
hist(mafs)

I am just wondering if there was a more efficient way to calculate mafs above.

Cheers!

ADD COMMENTlink modified 3.2 years ago by Michael Lawrence11k • written 3.3 years ago by kevin.rue220

There isn't a parRapply equivalent. But can you provide more detail, maybe with an example using one of the vcf files that comes with VariantAnnotation or a simple matrix constructed 'by hand' of what you want to calculate? I'd guess that there is a vectorized version of the calculation that will be very fast, so no need for parallel evaluation.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by Martin Morgan ♦♦ 23k

Martin is right, there has to be a faster way than looping, and it doesn't make sense to use rapply over a genotype matrix anyway, since there is only a single level of nesting. To avoid the nesting entirely, call expand(vcf) and then use conventional matrix operations.

ADD REPLYlink written 3.2 years ago by Michael Lawrence11k
Answer: parRapply in BiocParallel ?
3
gravatar for Michael Lawrence
3.2 years ago by
United States
Michael Lawrence11k wrote:

Here is one simple thing that is about 8X faster:

    nHet <- rowSums(genos == altHet)
    nHom <- rowSums(genos == altHom)
    nRef <- rowSums(genos == refG)
    af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))

 

 

ADD COMMENTlink written 3.2 years ago by Michael Lawrence11k

Thanks so much.

One thing though is that I am dealing with a more general situation, where refG, altHet and altHom can be vectors of genotypes, such as in the case of altHet = ["0/1", "1/0"] or partially phased genotypes where refHet = ["0|0", "0/0"] and so on.

My problem is now that genos %in% refG "flattens" the matrix to a vector of 1E6 elements, making the following fail :

    nHet <- rowSums(genos %in% altHet)
    nHom <- rowSums(genos %in% altHom)
    nRef <- rowSums(genos %in% refG)
    af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))

How efficient is the following in your opinion? (Actually, I'll run a microbenchmark myself, my question was rather if you had any more efficient in mind)

nHet2 <- rowSums(matrix(genos %in% altHet, nrow = nrow(genos), ncol = ncol(genos)))
nHom2 <- rowSums(matrix(genos %in% altHom, nrow = nrow(genos), ncol = ncol(genos)))
nRef2 <- rowSums(matrix(genos %in% refG, nrow = nrow(genos), ncol = ncol(genos)))
af2 <- (nHet2 + 2 * nHom2) / (2 * (nHet2 + nHom2 + nRef2))

Thanks in advance!

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by kevin.rue220

Your solution (equality to single genotype) is obviously much faster:

> microbenchmark::microbenchmark(
+   {
+     nHet <- rowSums(genos == altHet)
+     nHom <- rowSums(genos == altHom)
+     nRef <- rowSums(genos == refG)
+     af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))
+   },
+   times = 100
+ )
Unit: milliseconds
      min       lq    mean   median       uq      max neval
 25.14072 25.51624 29.9408 26.15891 26.88414 123.7942   100

 

While my adaptation (check against multiple genotypes and reshape as matrix) takes against more time, but seemingly half that of the original code:

> microbenchmark::microbenchmark(
+   {
+     nHet2 <- rowSums(matrix(genos %in% altHet, nrow = nrow(genos), ncol = ncol(genos)))
+     nHom2 <- rowSums(matrix(genos %in% altHom, nrow = nrow(genos), ncol = ncol(genos)))
+     nRef2 <- rowSums(matrix(genos %in% refG, nrow = nrow(genos), ncol = ncol(genos)))
+     af2 <- (nHet2 + 2 * nHom2) / (2 * (nHet2 + nHom2 + nRef2))
+   },
+   times = 100
+ )
Unit: milliseconds
      min       lq     mean   median       uq     max neval
 84.16726 90.53986 131.5128 91.36508 176.7313 182.693   100

 

Originally posted code:

> microbenchmark::microbenchmark(
+   {
+     mafs <- apply(
+       X = genos,
+       MARGIN = 1,
+       FUN = minorAlleleFreq,
+       refGenos = refG,
+       altHetGenos = altHet,
+       altHomGenos = altHom)
+   },
+   times = 100
+ )
Unit: milliseconds
      lq     mean   median      uq     max neval
 219.801 255.2118 225.0104 308.326 327.508   100
 
ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by kevin.rue220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 143 users visited in the last hour