Question: parRapply in BiocParallel ?
1
3.3 years ago by
kevin.rue220
University of Oxford
kevin.rue220 wrote:

Hi all,

I have a feeling this is an easy one, but browsing the forum and the BiocParallel vignette, I can't find it.

Is there a BiocParallel equivalent of parRapply?

I have a nice MulticoreParam object to handle parallel computing, and if I could I would like to avoid making a separate cluster using the parallel package.

For context, I would like to apply a function to each row of the geno(vcf)[["GT"]] matrix to calculate the allele frequency of each variant (AF = 1*[0/1] + 2*[1/1]). I have ~80k thousand variants that I would like to reduce by filtering those > 1% MAF.

...except if I am missing an existing function that does part or all that I want? The reason why I am calculating it myself is that there are some funky genotypes in there (1/0, ./1, 0/.,  and more) which makes me wonder how the AF field of the VCF is calculated. At this stage, I simply prefer calculating the AF checking genotypes myself.

Cheers!

==== EDIT to add example ===

# Define accepted genotypes
refG = c("0/0")
altHet = c("0/1")
altHom = c("1/1")
# A sample of genotypes observed in the VCF file
allG = c(
"0/0", "0", "1/1",
"1/0", "1", "2",
".", "./.", "./0", "./1", "0/.", "1/.")

# Make up a matrix like the one obtained using VariantAnnotation
genos <- matrix(
data = sample(
x = allG, size = 1E6, replace = TRUE,
prob = c(rep(0.25, 3), rep(0.25/9, 9))),
nrow = 1E3,
ncol = 1E3)
dim(genos)
genos[1:4,1:4]

# My current solution to calculating MAF considering "accepted" genotypes
minorAlleleFreq <- function(genos, refGenos, altHetGenos, altHomGenos){
tableCounts <- table(factor(
x = genos,
levels = c(refGenos, altHetGenos, altHomGenos)
))
alleleCount <- tableCounts[altHetGenos] + 2 * tableCounts[altHomGenos]
alleleNumber <- 2 * sum(tableCounts)
return(alleleCount / alleleNumber)
}
mafs <- apply(
X = genos,
MARGIN = 1,
FUN = minorAlleleFreq,
refGenos = refG,
altHetGenos = altHet,
altHomGenos = altHom)

# For your eyes only :)
hist(mafs)

I am just wondering if there was a more efficient way to calculate mafs above.

Cheers!

modified 3.2 years ago by Michael Lawrence11k • written 3.3 years ago by kevin.rue220

There isn't a parRapply equivalent. But can you provide more detail, maybe with an example using one of the vcf files that comes with VariantAnnotation or a simple matrix constructed 'by hand' of what you want to calculate? I'd guess that there is a vectorized version of the calculation that will be very fast, so no need for parallel evaluation.

Martin is right, there has to be a faster way than looping, and it doesn't make sense to use rapply over a genotype matrix anyway, since there is only a single level of nesting. To avoid the nesting entirely, call expand(vcf) and then use conventional matrix operations.

3
3.2 years ago by
United States
Michael Lawrence11k wrote:

Here is one simple thing that is about 8X faster:

    nHet <- rowSums(genos == altHet)
nHom <- rowSums(genos == altHom)
nRef <- rowSums(genos == refG)
af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))

Thanks so much.

One thing though is that I am dealing with a more general situation, where refG, altHet and altHom can be vectors of genotypes, such as in the case of altHet = ["0/1", "1/0"] or partially phased genotypes where refHet = ["0|0", "0/0"] and so on.

My problem is now that genos %in% refG "flattens" the matrix to a vector of 1E6 elements, making the following fail :

    nHet <- rowSums(genos %in% altHet)
nHom <- rowSums(genos %in% altHom)
nRef <- rowSums(genos %in% refG)
af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))

How efficient is the following in your opinion? (Actually, I'll run a microbenchmark myself, my question was rather if you had any more efficient in mind)

nHet2 <- rowSums(matrix(genos %in% altHet, nrow = nrow(genos), ncol = ncol(genos)))
nHom2 <- rowSums(matrix(genos %in% altHom, nrow = nrow(genos), ncol = ncol(genos)))
nRef2 <- rowSums(matrix(genos %in% refG, nrow = nrow(genos), ncol = ncol(genos)))
af2 <- (nHet2 + 2 * nHom2) / (2 * (nHet2 + nHom2 + nRef2))

Your solution (equality to single genotype) is obviously much faster:

> microbenchmark::microbenchmark(
+   {
+     nHet <- rowSums(genos == altHet)
+     nHom <- rowSums(genos == altHom)
+     nRef <- rowSums(genos == refG)
+     af <- (nHet + 2 * nHom) / (2 * (nHet + nHom + nRef))
+   },
+   times = 100
+ )
Unit: milliseconds
min       lq    mean   median       uq      max neval
25.14072 25.51624 29.9408 26.15891 26.88414 123.7942   100

While my adaptation (check against multiple genotypes and reshape as matrix) takes against more time, but seemingly half that of the original code:

> microbenchmark::microbenchmark(
+   {
+     nHet2 <- rowSums(matrix(genos %in% altHet, nrow = nrow(genos), ncol = ncol(genos)))
+     nHom2 <- rowSums(matrix(genos %in% altHom, nrow = nrow(genos), ncol = ncol(genos)))
+     nRef2 <- rowSums(matrix(genos %in% refG, nrow = nrow(genos), ncol = ncol(genos)))
+     af2 <- (nHet2 + 2 * nHom2) / (2 * (nHet2 + nHom2 + nRef2))
+   },
+   times = 100
+ )
Unit: milliseconds
min       lq     mean   median       uq     max neval
84.16726 90.53986 131.5128 91.36508 176.7313 182.693   100

Originally posted code:

> microbenchmark::microbenchmark(
+   {
+     mafs <- apply(
+       X = genos,
+       MARGIN = 1,
+       FUN = minorAlleleFreq,
+       refGenos = refG,
+       altHetGenos = altHet,
+       altHomGenos = altHom)
+   },
+   times = 100
+ )
Unit: milliseconds
lq     mean   median      uq     max neval
219.801 255.2118 225.0104 308.326 327.508   100