Search
Question: How to set 'strandMode' with summarizeOverlaps()
0
7 months ago by
Robert Castelo2.2k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.2k wrote:

Hi,

I want to build a table of counts using 'summarizeOverlaps()' from a PE RNA-seq data set generated with a strand-specific library preparation protocol from NewEngland Biolabs called "NEBNext Ultra Directional RNA Library Prep Kit for Illumina". The web page from that protocol says that it "utilizes dUTP-based methodology". The help page from the 'GAlignmentPairs' class explains that when PE data is generated using a stranded protocol such as 'dUTP' one should build such an object using the non-default argument 'strandMode=2'. I want to use 'summarizeOverlaps()' over the set of BAM files, so doing something like:

bfl <- BamFileList(bamfiles, asMates=TRUE)
se <- summarizeOverlaps(features, bfl, mode="Union",
singleEnd=FALSE, ignore.strand=FALSE,
fragments=TRUE)

I've tried to set 'strandMode=2' in the call to 'BamFileList()' and 'summarizeOverlaps()' and within 'ScanBamParam()' through the 'param' argument to 'summarizeOverlaps()', but the argument 'strandMode' does not belong to any those methods. I've googled about it without luck, so my question is, how can I ensure that 'strandMode=2' is being used under the hood through this approach to ensure I'm informing the counting algorithm about the correct library preparation protocol?

thanks!

robert.

modified 7 months ago • written 7 months ago by Robert Castelo2.2k
3
7 months ago by
United States
James W. MacDonald46k wrote:

It's sort of buried, since what you are doing is relying on the method summarizeOverlaps,GRanges,BamFileList, which decides how to read in the bam files based on the singleEnd and fragments arguments. If they are FALSE and TRUE, respectively, then the data will be read in using readGAlignmentPairs, which has an argument (you guessed it!), strandMode.

See e.g., ?summarizeOverlaps:

fragments: (Default FALSE) A logical; applied to paired-end data only.
fragments  controls which function is used to read the data
which subsequently affects which records are included in
counting.

When  fragments=FALSE , data are read with
readGAlignmentPairs  and returned in a  GAlignmentPairs
class. In this case, singletons, reads with unmapped pairs,
and other fragments, are dropped.

So if you do

se <- summarizeOverlaps(features, bfl, mode="Union", singleEnd=FALSE, ignore.strand=FALSE, fragments=TRUE, strandMode = 2)

You will get what you want, because the strandMode argument will be passed down to readGAlignmentPairs.

Hi Jim, I had already tried setting that argument through summarizeOverlaps(), BamFileList() and ScanBamParam() and I get an error. Here's an example reproducing the error:

library(GenomicAlignments)
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
genefeatures <- exonsBy(txdb, by="gene")
bfl <- BamFileList(untreated3_chr4()) ## this is PE RNA-seq data
se <- summarizeOverlaps(genefeatures, bfl, mode="Union", singleEnd=FALSE,
ignore.strand=FALSE, fragments=TRUE, strandMode=2)
Error in FUN(...) : unused argument (strandMode = 2)

There should be a way I guess but it's not obvious, at least to me. Session information below.

robert.

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
[2] GenomicFeatures_1.30.0
[3] AnnotationDbi_1.40.0
[4] BiocInstaller_1.28.0
[5] pasillaBamSubset_0.16.0
[6] GenomicAlignments_1.14.0
[7] Rsamtools_1.30.0
[8] Biostrings_2.46.0
[9] XVector_0.18.0
[10] SummarizedExperiment_1.8.0
[11] DelayedArray_0.4.1
[12] matrixStats_0.52.2
[13] Biobase_2.38.0
[14] GenomicRanges_1.30.0
[15] GenomeInfoDb_1.14.0
[16] IRanges_2.12.0
[17] S4Vectors_0.16.0
[18] BiocGenerics_0.24.0
[19] nvimcom_0.9-28
[20] colorout_1.1-2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13            compiler_3.4.2          prettyunits_1.0.2
[4] bitops_1.0-6            tools_3.4.2             zlibbioc_1.24.0
[7] progress_1.1.2          biomaRt_2.34.0          digest_0.6.12
[10] bit_1.1-12              RSQLite_2.0             memoise_1.1.0
[13] tibble_1.3.4            lattice_0.20-35         pkgconfig_2.0.1
[16] rlang_0.1.4             Matrix_1.2-11           DBI_0.7
[19] GenomeInfoDbData_0.99.1 rtracklayer_1.38.0      stringr_1.2.0
[22] bit64_0.9-7             grid_3.4.2              R6_2.2.2
[25] XML_3.98-1.9            RMySQL_0.10.13          BiocParallel_1.12.0
[28] magrittr_1.5            blob_1.1.0              assertthat_0.2.0
[31] stringi_1.1.5           RCurl_1.95-4.8        

I think I set you wrong. You want both pairedEnd AND fragments to be FALSE, otherwise you get readGAlignmentsList, which doesn't have a strandMode argument. I still wouldn't expect an error, however, as that argument is being passed down via the ellipsis argument, and my understanding is that a function with an ellipsis argument will 'accept' any ellipsis arguments that make sense and ignore those that don't. Like this:

> fun <- function(x, ...) sum(x, ...)
> fun(1:4)
[1] 10
> fun(c(1:4, NA))
[1] NA
> fun(c(1:4, NA), na.rm = TRUE)
[1] 10
> fun(c(1:4, NA), notanarg = TRUE)
[1] NA

So it's weird to me that somewhere or another the strandMode argument is getting picked up and complained about. But regardless, if you still get errors you can just make a GAlignmentPairs object using readGAlignmentPairs(bfl, strandMode =2) and feed that to summarizeOverlaps directly.

I guess you refer to 'singleEnd' when you say 'pairedEnd', but anyway, now I understand that if I want to set 'strandMode' I need to set 'fragments=FALSE' to force the call to 'readGAlignmentPairs()'. Unfortunately, that does not work either:

se <- summarizeOverlaps(genefeatures, bfl, mode="Union", singleEnd=FALSE,
ignore.strand=FALSE, fragments=FALSE, strandMode=2)
Error in FUN(...) : unused argument (strandMode = 2)

I've looked at the source code and i think the problem is that the ellipsis argument passes down ultimately to the function given in the 'preprocess.reads' argument of 'summarizeOverlaps()' and not to 'readGAlignmentPairs()'.

As you suggest, I can use 'readGAlignmentPairs()' directly but because I have multiple BAM files, each of them with several Gb of size, I'd have to end up doing quite some programming to stream over them and put the results together. Before I undertake that task, is there any trick to use 'summarizeOverlaps()' with 'strandMode=1'? Shouldn't in fact this functionality be provided through the 'summarizeOverlaps()' interface?

Ah, you're right. In .countWithYieldSize, the specified function is called using

x <- FUN(bf, param = param)

with no ellipsis argument, so strandMode is ignored. But do note that all .countWithYieldSize is doing (that you will have to do yourself) is calling bplapply, using the chosen function. So the 'by hand' version is

library(BiocParallel)

zz <- bplapply(z, function(x) summarizeOverlaps(genefeatures, x, singleEnd=FALSE))

zzz <- do.call(cbind, zz)

Which doesn't require any programming at all.

Thanks for the directions and your sample code. Unfortunately, I can't use it because it assumes that the whole thing fits into main memory. I'll paste my solution for the record, just in case becomes useful for somebody else interested in using the non-default 'strandMode=2':

## settings with respect to 'summarizeOverlaps()' are
## ignore.strand=FALSE, singleEnd=FALSE, fragments=FALSE,
## use the most conservative count mode 'Union()'
bfl <- BamFileList(files, asMates=TRUE, yieldSize=1000000)
cts <- bplapply(setNames(seq_along(bfl), names(bfl)),
inter.feature, param, strandMode) {
open(bf)
on.exit(close(bf))
ct <- integer(length(features))
while (length(r <- readGAlignmentPairs(bf, param=param, strandMode=strandMode)))
ct <- ct +
mode(features, r,
ignore.strand=FALSE,
inter.feature=inter.feature)
ct
}, bfl, genefeatures, mode=Union, inter.feature=TRUE, param=ScanBamParam(), strandMode=2)
counts <- as.matrix(do.call(cbind, cts))

I would call this programming :-D

1

GenomicFiles::reduceByYield() might be a wrapper around some of this. Is there a bug in SummarizedExperiment, along the lines Jim indicated of not passing strandMode correctly? If so, perhaps an issue and / or pull request on https://github.com/Bioconductor/SummarizedExperiment would be appropriate.

1

I sent a patch to Val. After adding a couple of ellipsis args, I get

> SE.all <- summarizeOverlaps(ex, bfl, singleEnd = FALSE, fragments = FALSE, strandMode = 2)

> z <- bplapply(bfl, readGAlignmentPairs, strandMode=2)
> zz <- bplapply(z, function(x) summarizeOverlaps(ex,  x, singleEnd=FALSE))
> zzz <- do.call(cbind, zz)

> all.equal(assays(SE.all)[[1]], assays(zzz)[[1]], check.attributes=FALSE)
[1] TRUE