Search
Question: How to set 'strandMode' with summarizeOverlaps()
0
gravatar for Robert Castelo
8 days ago by
Robert Castelo2.0k
Spain/Barcelona/Universitat Pompeu Fabra
Robert Castelo2.0k wrote:

Hi,

I want to build a table of counts using 'summarizeOverlaps()' from a PE RNA-seq data set generated with a strand-specific library preparation protocol from NewEngland Biolabs called "NEBNext Ultra Directional RNA Library Prep Kit for Illumina". The web page from that protocol says that it "utilizes dUTP-based methodology". The help page from the 'GAlignmentPairs' class explains that when PE data is generated using a stranded protocol such as 'dUTP' one should build such an object using the non-default argument 'strandMode=2'. I want to use 'summarizeOverlaps()' over the set of BAM files, so doing something like:

bfl <- BamFileList(bamfiles, asMates=TRUE)
se <- summarizeOverlaps(features, bfl, mode="Union",
                        singleEnd=FALSE, ignore.strand=FALSE,
                        fragments=TRUE)

I've tried to set 'strandMode=2' in the call to 'BamFileList()' and 'summarizeOverlaps()' and within 'ScanBamParam()' through the 'param' argument to 'summarizeOverlaps()', but the argument 'strandMode' does not belong to any those methods. I've googled about it without luck, so my question is, how can I ensure that 'strandMode=2' is being used under the hood through this approach to ensure I'm informing the counting algorithm about the correct library preparation protocol?

thanks!

robert.
 

ADD COMMENTlink modified 8 days ago • written 8 days ago by Robert Castelo2.0k
3
gravatar for James W. MacDonald
8 days ago by
United States
James W. MacDonald45k wrote:

It's sort of buried, since what you are doing is relying on the method summarizeOverlaps,GRanges,BamFileList, which decides how to read in the bam files based on the singleEnd and fragments arguments. If they are FALSE and TRUE, respectively, then the data will be read in using readGAlignmentPairs, which has an argument (you guessed it!), strandMode.

See e.g., ?summarizeOverlaps:

fragments: (Default FALSE) A logical; applied to paired-end data only.
           fragments  controls which function is used to read the data
          which subsequently affects which records are included in
          counting.

          When  fragments=FALSE , data are read with
           readGAlignmentPairs  and returned in a  GAlignmentPairs
          class. In this case, singletons, reads with unmapped pairs,
          and other fragments, are dropped.

 

So if you do

se <- summarizeOverlaps(features, bfl, mode="Union", singleEnd=FALSE, ignore.strand=FALSE, fragments=TRUE, strandMode = 2)

You will get what you want, because the strandMode argument will be passed down to readGAlignmentPairs.

ADD COMMENTlink written 8 days ago by James W. MacDonald45k

Hi Jim, I had already tried setting that argument through summarizeOverlaps(), BamFileList() and ScanBamParam() and I get an error. Here's an example reproducing the error:

library(GenomicAlignments)
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
genefeatures <- exonsBy(txdb, by="gene")
bfl <- BamFileList(untreated3_chr4()) ## this is PE RNA-seq data
se <- summarizeOverlaps(genefeatures, bfl, mode="Union", singleEnd=FALSE,
                        ignore.strand=FALSE, fragments=TRUE, strandMode=2)
Error in FUN(...) : unused argument (strandMode = 2)

There should be a way I guess but it's not obvious, at least to me. Session information below.

robert.

sessionInfo()

R version 3.4.2 (2017-09-28)                                                                           
Platform: x86_64-apple-darwin15.6.0 (64-bit)                                                           
Running under: macOS Sierra 10.12.6                                                                     

Matrix products: default                                                                               
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib                      
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib                     

locale:                                                                                                
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8                                       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
[1] TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
[2] GenomicFeatures_1.30.0                  
[3] AnnotationDbi_1.40.0                    
[4] BiocInstaller_1.28.0                    
[5] pasillaBamSubset_0.16.0                 
[6] GenomicAlignments_1.14.0                
[7] Rsamtools_1.30.0                        
[8] Biostrings_2.46.0                       
[9] XVector_0.18.0                          
[10] SummarizedExperiment_1.8.0              
[11] DelayedArray_0.4.1                      
[12] matrixStats_0.52.2                      
[13] Biobase_2.38.0                          
[14] GenomicRanges_1.30.0                    
[15] GenomeInfoDb_1.14.0                     
[16] IRanges_2.12.0                          
[17] S4Vectors_0.16.0                        
[18] BiocGenerics_0.24.0                     
[19] nvimcom_0.9-28                          
[20] colorout_1.1-2                           

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13            compiler_3.4.2          prettyunits_1.0.2     
[4] bitops_1.0-6            tools_3.4.2             zlibbioc_1.24.0       
[7] progress_1.1.2          biomaRt_2.34.0          digest_0.6.12         
[10] bit_1.1-12              RSQLite_2.0             memoise_1.1.0         
[13] tibble_1.3.4            lattice_0.20-35         pkgconfig_2.0.1       
[16] rlang_0.1.4             Matrix_1.2-11           DBI_0.7               
[19] GenomeInfoDbData_0.99.1 rtracklayer_1.38.0      stringr_1.2.0         
[22] bit64_0.9-7             grid_3.4.2              R6_2.2.2              
[25] XML_3.98-1.9            RMySQL_0.10.13          BiocParallel_1.12.0   
[28] magrittr_1.5            blob_1.1.0              assertthat_0.2.0      
[31] stringi_1.1.5           RCurl_1.95-4.8        
ADD REPLYlink modified 8 days ago • written 8 days ago by Robert Castelo2.0k

I think I set you wrong. You want both pairedEnd AND fragments to be FALSE, otherwise you get readGAlignmentsList, which doesn't have a strandMode argument. I still wouldn't expect an error, however, as that argument is being passed down via the ellipsis argument, and my understanding is that a function with an ellipsis argument will 'accept' any ellipsis arguments that make sense and ignore those that don't. Like this:

> fun <- function(x, ...) sum(x, ...)
> fun(1:4)
[1] 10
> fun(c(1:4, NA))
[1] NA
> fun(c(1:4, NA), na.rm = TRUE)
[1] 10
> fun(c(1:4, NA), notanarg = TRUE)
[1] NA

So it's weird to me that somewhere or another the strandMode argument is getting picked up and complained about. But regardless, if you still get errors you can just make a GAlignmentPairs object using readGAlignmentPairs(bfl, strandMode =2) and feed that to summarizeOverlaps directly.

ADD REPLYlink written 8 days ago by James W. MacDonald45k

I guess you refer to 'singleEnd' when you say 'pairedEnd', but anyway, now I understand that if I want to set 'strandMode' I need to set 'fragments=FALSE' to force the call to 'readGAlignmentPairs()'. Unfortunately, that does not work either:

se <- summarizeOverlaps(genefeatures, bfl, mode="Union", singleEnd=FALSE,
                        ignore.strand=FALSE, fragments=FALSE, strandMode=2)
Error in FUN(...) : unused argument (strandMode = 2)

I've looked at the source code and i think the problem is that the ellipsis argument passes down ultimately to the function given in the 'preprocess.reads' argument of 'summarizeOverlaps()' and not to 'readGAlignmentPairs()'.

As you suggest, I can use 'readGAlignmentPairs()' directly but because I have multiple BAM files, each of them with several Gb of size, I'd have to end up doing quite some programming to stream over them and put the results together. Before I undertake that task, is there any trick to use 'summarizeOverlaps()' with 'strandMode=1'? Shouldn't in fact this functionality be provided through the 'summarizeOverlaps()' interface?

ADD REPLYlink written 7 days ago by Robert Castelo2.0k

Ah, you're right. In .countWithYieldSize, the specified function is called using

x <- FUN(bf, param = param)

with no ellipsis argument, so strandMode is ignored. But do note that all .countWithYieldSize is doing (that you will have to do yourself) is calling bplapply, using the chosen function. So the 'by hand' version is

library(BiocParallel)

z <- bplapply(bfl, readGAlignmentPairs, strandMode=2)

zz <- bplapply(z, function(x) summarizeOverlaps(genefeatures, x, singleEnd=FALSE))

zzz <- do.call(cbind, zz)

Which doesn't require any programming at all.

 

ADD REPLYlink written 7 days ago by James W. MacDonald45k

Thanks for the directions and your sample code. Unfortunately, I can't use it because it assumes that the whole thing fits into main memory. I'll paste my solution for the record, just in case becomes useful for somebody else interested in using the non-default 'strandMode=2':

## settings with respect to 'summarizeOverlaps()' are
## ignore.strand=FALSE, singleEnd=FALSE, fragments=FALSE,
## preprocess.reads=NULL and the code below adds strandMode=2
## use the most conservative count mode 'Union()'
bfl <- BamFileList(files, asMates=TRUE, yieldSize=1000000)
cts <- bplapply(setNames(seq_along(bfl), names(bfl)),
                function(i, reads, features, mode, 
                         inter.feature, param, strandMode) {
                  bf <- reads[[i]]
                  open(bf)
                  on.exit(close(bf))
                  ct <- integer(length(features))
                  while (length(r <- readGAlignmentPairs(bf, param=param, strandMode=strandMode)))
                    ct <- ct +
                          mode(features, r,
                               ignore.strand=FALSE,
                               inter.feature=inter.feature)
                  ct
                }, bfl, genefeatures, mode=Union, inter.feature=TRUE, param=ScanBamParam(), strandMode=2)
counts <- as.matrix(do.call(cbind, cts))

I would call this programming :-D

 

ADD REPLYlink written 6 days ago by Robert Castelo2.0k
1

GenomicFiles::reduceByYield() might be a wrapper around some of this. Is there a bug in SummarizedExperiment, along the lines Jim indicated of not passing strandMode correctly? If so, perhaps an issue and / or pull request on https://github.com/Bioconductor/SummarizedExperiment would be appropriate.

ADD REPLYlink written 6 days ago by Martin Morgan ♦♦ 20k
1

I sent a patch to Val. After adding a couple of ellipsis args, I get

> SE.all <- summarizeOverlaps(ex, bfl, singleEnd = FALSE, fragments = FALSE, strandMode = 2)

> z <- bplapply(bfl, readGAlignmentPairs, strandMode=2)
> zz <- bplapply(z, function(x) summarizeOverlaps(ex,  x, singleEnd=FALSE))
> zzz <- do.call(cbind, zz)

> all.equal(assays(SE.all)[[1]], assays(zzz)[[1]], check.attributes=FALSE)
[1] TRUE
ADD REPLYlink written 6 days ago by James W. MacDonald45k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 265 users visited in the last hour