Question

parRapply in BiocParallel ?

1

Entering edit mode

kevin.rue ▴ 350

@kevinrue-6757

Last seen 21 months ago

University of Oxford

Hi all,

I have a feeling this is an easy one, but browsing the forum and the BiocParallel vignette, I can't find it.

Is there a BiocParallel equivalent of parRapply?

I have a nice MulticoreParam object to handle parallel computing, and if I could I would like to avoid making a separate cluster using the parallel package.

For context, I would like to apply a function to each row of the geno(vcf)[["GT"]] matrix to calculate the allele frequency of each variant (AF = 1*[0/1] + 2*[1/1]). I have ~80k thousand variants that I would like to reduce by filtering those > 1% MAF.

...except if I am missing an existing function that does part or all that I want? The reason why I am calculating it myself is that there are some funky genotypes in there (1/0, ./1, 0/., and more) which makes me wonder how the AF field of the VCF is calculated. At this stage, I simply prefer calculating the AF checking genotypes myself.

Cheers!

==== EDIT to add example ===

Thanks for both your comments.

Here is a made-up example:

# Define accepted genotypes
refG = c("0/0")
altHet = c("0/1")
altHom = c("1/1")
# A sample of genotypes observed in the VCF file
allG = c(
  "0/0", "0", "1/1",
  "1/0", "1", "2",
  ".", "./.", "./0", "./1", "0/.", "1/.")

# Make up a matrix like the one obtained using VariantAnnotation
genos <- matrix(
  data = sample(
    x = allG, size = 1E6, replace = TRUE,
    prob = c(rep(0.25, 3), rep(0.25/9, 9))),
  nrow = 1E3,
  ncol = 1E3)
dim(genos)
genos[1:4,1:4]

# My current solution to calculating MAF considering "accepted" genotypes
minorAlleleFreq <- function(genos, refGenos, altHetGenos, altHomGenos){
  tableCounts <- table(factor(
    x = genos,
    levels = c(refGenos, altHetGenos, altHomGenos)
  ))
  alleleCount <- tableCounts[altHetGenos] + 2 * tableCounts[altHomGenos]
  alleleNumber <- 2 * sum(tableCounts)
  return(alleleCount / alleleNumber)
}
mafs <- apply(
  X = genos,
  MARGIN = 1,
  FUN = minorAlleleFreq,
  refGenos = refG,
  altHetGenos = altHet,
  altHomGenos = altHom)

# For your eyes only :)
hist(mafs)

I am just wondering if there was a more efficient way to calculate mafs above.

Cheers!

biocparallel variantannotation • 1.8k views

ADD COMMENT • link updated 9.7 years ago by Michael Lawrence ★ 11k • written 9.7 years ago by kevin.rue ▴ 350

0

Entering edit mode

There isn't a parRapply equivalent. But can you provide more detail, maybe with an example using one of the vcf files that comes with VariantAnnotation or a simple matrix constructed 'by hand' of what you want to calculate? I'd guess that there is a vectorized version of the calculation that will be very fast, so no need for parallel evaluation.

ADD REPLY • link 9.7 years ago Martin Morgan 25k

0

Entering edit mode

Martin is right, there has to be a faster way than looping, and it doesn't make sense to use rapply over a genotype matrix anyway, since there is only a single level of nesting. To avoid the nesting entirely, call expand(vcf) and then use conventional matrix operations.

ADD REPLY • link 9.7 years ago Michael Lawrence ★ 11k

score 3 · Accepted Answer · 2016-05-25

3

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 4.2 years ago

United States

Here is one simple thing that is about 8X faster:

    nHet <- rowSums(genos == altHet)
    nHom <- rowSums(genos == altHom)
    nRef <- rowSums(genos == refG)
    af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))

ADD COMMENT • link 9.7 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks so much.

One thing though is that I am dealing with a more general situation, where refG, altHet and altHom can be vectors of genotypes, such as in the case of altHet = ["0/1", "1/0"] or partially phased genotypes where refHet = ["0|0", "0/0"] and so on.

My problem is now that genos %in% refG "flattens" the matrix to a vector of 1E6 elements, making the following fail :

    nHet <- rowSums(genos %in% altHet)
    nHom <- rowSums(genos %in% altHom)
    nRef <- rowSums(genos %in% refG)
    af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))

How efficient is the following in your opinion? (Actually, I'll run a microbenchmark myself, my question was rather if you had any more efficient in mind)

nHet2 <- rowSums(matrix(genos %in% altHet, nrow = nrow(genos), ncol = ncol(genos)))
nHom2 <- rowSums(matrix(genos %in% altHom, nrow = nrow(genos), ncol = ncol(genos)))
nRef2 <- rowSums(matrix(genos %in% refG, nrow = nrow(genos), ncol = ncol(genos)))
af2 <- (nHet2 + 2 * nHom2) / (2 * (nHet2 + nHom2 + nRef2))

Thanks in advance!

ADD REPLY • link 9.7 years ago kevin.rue ▴ 350

0

Entering edit mode

Your solution (equality to single genotype) is obviously much faster:

> microbenchmark::microbenchmark(
+   {
+     nHet <- rowSums(genos == altHet)
+     nHom <- rowSums(genos == altHom)
+     nRef <- rowSums(genos == refG)
+     af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))
+   },
+   times = 100
+ )
Unit: milliseconds
      min       lq    mean   median       uq      max neval
 25.14072 25.51624 29.9408 26.15891 26.88414 123.7942   100

While my adaptation (check against multiple genotypes and reshape as matrix) takes against more time, but seemingly half that of the original code:

> microbenchmark::microbenchmark(
+   {
+     nHet2 <- rowSums(matrix(genos %in% altHet, nrow = nrow(genos), ncol = ncol(genos)))
+     nHom2 <- rowSums(matrix(genos %in% altHom, nrow = nrow(genos), ncol = ncol(genos)))
+     nRef2 <- rowSums(matrix(genos %in% refG, nrow = nrow(genos), ncol = ncol(genos)))
+     af2 <- (nHet2 + 2 * nHom2) / (2 * (nHet2 + nHom2 + nRef2))
+   },
+   times = 100
+ )
Unit: milliseconds
      min       lq     mean   median       uq     max neval
 84.16726 90.53986 131.5128 91.36508 176.7313 182.693   100

Originally posted code:

> microbenchmark::microbenchmark(
+   {
+     mafs <- apply(
+       X = genos,
+       MARGIN = 1,
+       FUN = minorAlleleFreq,
+       refGenos = refG,
+       altHetGenos = altHet,
+       altHomGenos = altHom)
+   },
+   times = 100
+ )
Unit: milliseconds
      lq     mean   median      uq     max neval
 219.801 255.2118 225.0104 308.326 327.508   100

ADD REPLY • link 9.7 years ago kevin.rue ▴ 350