Several issues with the package easyRNASeq
5
1
Entering edit mode
@michael-dondrup-8328
Last seen 3.6 years ago
Bergen, Norway

I am experiencing multiple problems with the latest easyRNASeq package that make it hard to use with my current setup using the new function simpleRNASeq:

  • BAM files aligned using STAR aligner
  • 3 technical replicates merged into a single BAM file using samtools merge
  • sorted and indexed using samtools 1.2 (old samtools output is not recognized as sorted)
  • annotation from custom annotation file in gff3 format

Some problems might be related to this specific setup or could be ignored, but maybe not all. 

  1. Strandedness validation of BAM file is suspicious, I know my libraries are strand specific, and this is the outcome for all files I checked 
    1: In FUN(X[[i]], ...) :
      Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam is considered unstranded.
    2: In FUN(X[[i]], ...) :
      Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam Strandedness could not be determined using 8 regions spanning 19315 bp on either strand at a 90% cutoff; 12.5 percent appear to be stranded.
  2. There is no return value due to the error
  3. BAM files sorted by older versions of samtools (0.1.18,0.1.19) are not recognized as being sorted "Your BAM file needs to be sorted by coordinate; see (R)samtools sort."
  4. Gff file validation asks specifically for type "mRNA" when choosing 'transcript' quantification, so one needs to change all transcript entries in gff, even though it is technically wrong for non-coding transcripts. This has worked differently before
  5. function easyRNASeq was deprecated but the tutorial still uses it.
  6. Earlier it was possible to get an edgeR or DEseq object directly, now it is a bit unclear to me how this would work 

 

 


library(easyRNASeq)
anns = "/export/kjempetujafs/licebase/genomedata/lsalmonis/LsalmRNA.gff3"
args <- c('59-AK-chal1_S39_L00X_R1_001.fastq.gz.bam.sorted.bam')
annotParam <- AnnotParam(datasource=anns, type="gff3")
bamParam <- BamParam(paired = FALSE, stranded = TRUE, yieldSize = 1000000L)
param <- RnaSeqParam(annotParam = annotParam, bamParam = bamParam, countBy = c("transcripts"), precision = "bp")
ret <- simpleRNASeq(bamFiles = getBamFileList(args), param = param, nnodes = 40, verbose = TRUE, override = TRUE)

Output:

==========================
simpleRNASeq version 2.4.2
==========================
Creating a SummarizedExperiment.
==========================
Processing the alignments.
==========================
Pre-processing 1 BAM files.
Validating the BAM files.
Extracted 36096 reference sequences information.
Found 1 single-end BAM files.
Found 0 paired-end BAM files.
Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam has reads of length 76bp
Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam has 23246292 reads.
==========================
Processing the annotation
==========================
Validating the annotation source
Read 6 records
Validated a datasource of type gff3
Fetching the annotation
Retrieving annotation from a gff3 datasource
Read 149878 records
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "NULL"’
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam is considered unstranded.
2: In FUN(X[[i]], ...) :
  Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam Strandedness could not be determined using 8 regions spanning 19315 bp on either strand at a 90% cutoff; 12.5 percent appear to be stranded.
3: In simpleRNASeq(bamFiles = getBamFileList(args), param = param,  :
  You have chosen to override the detected parameters. Hope you know what you are doing. Contact me if you think the parameter detection failed.

 


> traceback()
8: stop(gettextf("unable to find an inherited method for function %s for signature %s", 
       sQuote(fdef@generic), sQuote(cnames)), domain = NA)
7: (function (classes, fdef, mtable) 
   {
       methods <- .findInheritedMethods(classes, fdef, mtable)
       if (length(methods) == 1L) 
           return(methods[[1L]])
       else if (length(methods) == 0L) {
           cnames <- paste0("\"", vapply(classes, as.character, 
               ""), "\"", collapse = ", ")
           stop(gettextf("unable to find an inherited method for function %s for signature %s", 
               sQuote(fdef@generic), sQuote(cnames)), domain = NA)
       }
       else stop("Internal error in finding inherited methods; didn't return a unique method", 
           domain = NA)
   })(list("GRanges", "NULL"), function (x, f, drop = FALSE, ...) 
   standardGeneric("splitAsList"), <environment>)
6: splitAsList(x, f, drop = drop)
5: .local(x, f, drop, ...)
4: split(grngs, grngs$seqnames)
3: split(grngs, grngs$seqnames)
2: simpleRNASeq(bamFiles = getBamFileList(args), param = param, 
       nnodes = 40, verbose = TRUE, override = TRUE)
1: simpleRNASeq(bamFiles = getBamFileList(args), param = param, 
       nnodes = 40, verbose = TRUE, override = TRUE)
> 

> sessionInfo()R version 3.2.0 (2015-04-16)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.6 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] easyRNASeq_2.4.2

loaded via a namespace (and not attached):
 [1] RColorBrewer_1.1-2      futile.logger_1.4.1     GenomeInfoDb_1.4.1     
 [4] XVector_0.8.0           futile.options_1.0.0    bitops_1.0-6           
 [7] zlibbioc_1.14.0         biomaRt_2.24.0          annotate_1.46.0        
[10] RSQLite_1.0.0           lattice_0.20-31         DBI_0.3.1              
[13] parallel_3.2.0          DESeq_1.20.0            genefilter_1.50.0      
[16] hwriter_1.3.2           Biostrings_2.36.1       S4Vectors_0.6.1        
[19] IRanges_2.2.5           stats4_3.2.0            locfit_1.5-9.1         
[22] grid_3.2.0              LSD_3.0                 Biobase_2.28.0         
[25] AnnotationDbi_1.30.1    XML_3.98-1.3            survival_2.38-3        
[28] BiocParallel_1.2.6      limma_3.24.12           latticeExtra_0.6-26    
[31] geneplotter_1.46.0      lambda.r_1.1.7          edgeR_3.10.2           
[34] Rsamtools_1.20.4        intervals_0.15.0        genomeIntervals_1.24.1 
[37] GenomicAlignments_1.4.1 splines_3.2.0           BiocGenerics_0.14.0    
[40] GenomicRanges_1.20.5    ShortRead_1.26.0        xtable_1.7-4           
[43] RCurl_1.95-4.7         
> 

<font face="sans-serif, Arial, Verdana, Trebuchet MS">
</font>
easyrnaseq bug • 3.3k views
ADD COMMENT
0
Entering edit mode
@nicolas-delhomme-6252
Last seen 6.0 years ago
Sweden
Hej! The error is due to a missing import. I'll fix it right away. I'll let you know once ready and tested. Overall, in simpleRNASeq over easyRNASeq, I've introduced a lot of checks to ensure the validity of the input and output, which is why you have observed some differences. Concerning the strandness warnings, you can ignore them. simpleRNASeq tries to identify various criteria for you (strandedness, etc.), but depending on the genome and annotation it is not yet always figuring out the right value. These warnings are in your case irrelevant since you override them by setting all the params. Otherwise, your setup should have no problem. I'm using STAR also, however I've never experienced any issues with any version of samtools. When you merge your BAM file, do you keep the BAM header? Concerning the gff3 file, the previous easyRNASeq approach was not requesting - and as such had some implementation flaws you could say - gff3 file to strictly follow the GFF3 format (see sequenceontology.org/gff3.html) This means that now that it enforces a valid gff3 format, I'm indeed only looking for the gene > mRNA > exon relationship for counting. To rescue the count for non-coding mRNA, I could add other features' type, e.g. ncRNA, etc. provided that they are listed in the SOFA ontology (see again sequenceontology.org/gff3.html :-)). That should not be take long. Which feature would you also like to count for in your case? Yes, the tutorial still describes easyRNASeq, and I wish I would finally find the time to revamp it for simpleRNASeq, but... Now, you indeed get a SummarizedExperiment object. This is an effort from Bioc and the community to use a single type of object for NGS data representation - as we had for microarray. I know DESeq2 also uses SE objects, I need to check for DESeq and edgeR. I also need to ensure about the compatibility of SE objects between easyRNASeq and DESeq2. Thanks for the feedback, you've added a few to my TODO list ;-) I'll do this asap and let you know. Please keep on providing feedback - if you don't mind -; that way I should be able to get the new vignette as well ;-) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 03 Jul 2015, at 12:56, mdondrup [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User mdondrup wrote Question: Several issues with the package easyRNASeq: > > > I am experiencing multiple problems with the latest easyRNASeq package that make it hard to use with my current setup using the new function simpleRNASeq: > > • BAM files aligned using STAR aligner > • 3 technical replicates merged into a single BAM file using samtools merge > • sorted and indexed using samtools 1.2 (old samtools output is not recognized as sorted) > • annotation from custom annotation file in gff3 format > Some problems might be related to this specific setup or could be ignored, but maybe not all. > > • Strandedness validation of BAM file is suspicious, I know my libraries are strand specific, and this is the outcome for all files I checked > 1: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam is considered unstranded. > 2: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam Strandedness could not be determined using 8 regions spanning 19315 bp on either strand at a 90% cutoff; 12.5 percent appear to be stranded. > • There is no return value due to the error > • BAM files sorted by older versions of samtools (0.1.18,0.1.19) are not recognized as being sorted "Your BAM file needs to be sorted by coordinate; see (R)samtools sort." > • Gff file validation asks specifically for type "mRNA" when choosing 'transcript' quantification, so one needs to change all transcript entries in gff, even though it is technically wrong for non-coding transcripts, this has worked differently before > • function easyRNASeq was deprecated but the tutorial still uses it. > • Earlier it was possible to get an edgeR or DEseq object directly, now it is a bit unclear to me how this would work > > > library(easyRNASeq) > anns = "/export/kjempetujafs/licebase/genomedata/lsalmonis/LsalmRNA.gff3" > args <- c('59-AK-chal1_S39_L00X_R1_001.fastq.gz.bam.sorted.bam') > annotParam <- AnnotParam(datasource=anns, type="gff3") > bamParam <- BamParam(paired = FALSE, stranded = TRUE, yieldSize = 1000000L) > param <- RnaSeqParam(annotParam = annotParam, bamParam = bamParam, countBy = c("transcripts"), precision = "bp") > ret <- simpleRNASeq(bamFiles = getBamFileList(args), param = param, nnodes = 40, verbose = TRUE, override = TRUE) > > Output: > > ========================== > simpleRNASeq version 2.4.2 > ========================== > Creating a SummarizedExperiment. > ========================== > Processing the alignments. > ========================== > Pre-processing 1 BAM files. > Validating the BAM files. > Extracted 36096 reference sequences information. > Found 1 single-end BAM files. > Found 0 paired-end BAM files. > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam has reads of length 76bp > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam has 23246292 reads. > ========================== > Processing the annotation > ========================== > Validating the annotation source > Read 6 records > Validated a datasource of type gff3 > Fetching the annotation > Retrieving annotation from a gff3 datasource > Read 149878 records > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "NULL"’ > In addition: Warning messages: > 1: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam is considered unstranded. > 2: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam Strandedness could not be determined using 8 regions spanning 19315 bp on either strand at a 90% cutoff; 12.5 percent appear to be stranded. > 3: In simpleRNASeq(bamFiles = getBamFileList(args), param = param, : > You have chosen to override the detected parameters. Hope you know what you are doing. Contact me if you think the parameter detection failed. > > > > sessionInfo()R version 3.2.0 (2015-04-16) > Platform: x86_64-redhat-linux-gnu (64-bit) > Running under: CentOS release 6.6 (Final) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] easyRNASeq_2.4.2 > > loaded via a namespace (and not attached): > [1] RColorBrewer_1.1-2 futile.logger_1.4.1 GenomeInfoDb_1.4.1 > [4] XVector_0.8.0 futile.options_1.0.0 bitops_1.0-6 > [7] zlibbioc_1.14.0 biomaRt_2.24.0 annotate_1.46.0 > [10] RSQLite_1.0.0 lattice_0.20-31 DBI_0.3.1 > [13] parallel_3.2.0 DESeq_1.20.0 genefilter_1.50.0 > [16] hwriter_1.3.2 Biostrings_2.36.1 S4Vectors_0.6.1 > [19] IRanges_2.2.5 stats4_3.2.0 locfit_1.5-9.1 > [22] grid_3.2.0 LSD_3.0 Biobase_2.28.0 > [25] AnnotationDbi_1.30.1 XML_3.98-1.3 survival_2.38-3 > [28] BiocParallel_1.2.6 limma_3.24.12 latticeExtra_0.6-26 > [31] geneplotter_1.46.0 lambda.r_1.1.7 edgeR_3.10.2 > [34] Rsamtools_1.20.4 intervals_0.15.0 genomeIntervals_1.24.1 > [37] GenomicAlignments_1.4.1 splines_3.2.0 BiocGenerics_0.14.0 > [40] GenomicRanges_1.20.5 ShortRead_1.26.0 xtable_1.7-4 > [43] RCurl_1.95-4.7 > > <font face="sans-serif, Arial, Verdana, Trebuchet MS"> > > > > > </font> > > Post tags: easyrnaseq, bug > > You may reply via email or visit Several issues with the package easyRNASeq >
ADD COMMENT
0
Entering edit mode
@nicolas-delhomme-6252
Last seen 6.0 years ago
Sweden
Hej! I can't seem to reproduce your error. I guess it either has to do with your gff3 file or your BAM file; i.e. something might not be as I expect it. To make sure of that, can you try to do the following in a fresh R session: library(RnaSeqTutorial) bamFiles <- getBamFileList(filenames= dir(system.file(package="RnaSeqTutorial","extdata"), pattern="*.bam$", full.names=TRUE)) param <- RnaSeqParam( annotParam=AnnotParam(datasource=file.path(system.file(package="RnaSeqTutorial","extdata"),"Dmel-mRNA-exon-r5.52.gff3"))) sexp <- simpleRNASeq(bamFiles=bamFiles,param=param,verbose=TRUE) head(assay(sexp)) and report on the results? The only other difference is that I'm using R3.2.1, but that should play no role. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 03 Jul 2015, at 12:56, mdondrup [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > > User mdondrup<https: support.bioconductor.org="" u="" 8328=""/> wrote Question: Several issues with the package easyRNASeq<https: support.bioconductor.org="" p="" 69444=""/>: > > I am experiencing multiple problems with the latest easyRNASeq package that make it hard to use with my current setup using the new function simpleRNASeq: > > * BAM files aligned using STAR aligner > * 3 technical replicates merged into a single BAM file using samtools merge > * sorted and indexed using samtools 1.2 (old samtools output is not recognized as sorted) > * annotation from custom annotation file in gff3 format > > Some problems might be related to this specific setup or could be ignored, but maybe not all. > > 1. Strandedness validation of BAM file is suspicious, I know my libraries are strand specific, and this is the outcome for all files I checked > 1: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam is considered unstranded. > 2: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam Strandedness could not be determined using 8 regions spanning 19315 bp on either strand at a 90% cutoff; 12.5 percent appear to be stranded. > 2. There is no return value due to the error > 3. BAM files sorted by older versions of samtools (0.1.18,0.1.19) are not recognized as being sorted "Your BAM file needs to be sorted by coordinate; see (R)samtools sort." > 4. Gff file validation asks specifically for type "mRNA" when choosing 'transcript' quantification, so one needs to change all transcript entries in gff, even though it is technically wrong for non-coding transcripts, this has worked differently before > 5. function easyRNASeq was deprecated but the tutorial still uses it. > 6. Earlier it was possible to get an edgeR or DEseq object directly, now it is a bit unclear to me how this would work > > > > > > ________________________________ > > library(easyRNASeq) > anns = "/export/kjempetujafs/licebase/genomedata/lsalmonis/LsalmRNA.gff3" > args <- c('59-AK-chal1_S39_L00X_R1_001.fastq.gz.bam.sorted.bam') > annotParam <- AnnotParam(datasource=anns, type="gff3") > bamParam <- BamParam(paired = FALSE, stranded = TRUE, yieldSize = 1000000L) > param <- RnaSeqParam(annotParam = annotParam, bamParam = bamParam, countBy = c("transcripts"), precision = "bp") > ret <- simpleRNASeq(bamFiles = getBamFileList(args), param = param, nnodes = 40, verbose = TRUE, override = TRUE) > > > ________________________________ > > Output: > > ========================== > simpleRNASeq version 2.4.2 > ========================== > Creating a SummarizedExperiment. > ========================== > Processing the alignments. > ========================== > Pre-processing 1 BAM files. > Validating the BAM files. > Extracted 36096 reference sequences information. > Found 1 single-end BAM files. > Found 0 paired-end BAM files. > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam has reads of length 76bp > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam has 23246292 reads. > ========================== > Processing the annotation > ========================== > Validating the annotation source > Read 6 records > Validated a datasource of type gff3 > Fetching the annotation > Retrieving annotation from a gff3 datasource > Read 149878 records > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "NULL"’ > In addition: Warning messages: > 1: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam is considered unstranded. > 2: In FUN(X[[i]], ...) : > Bam file: 1-ch1_S60_L00X_R1_001.fastq.gz.bam.sorted.bam Strandedness could not be determined using 8 regions spanning 19315 bp on either strand at a 90% cutoff; 12.5 percent appear to be stranded. > 3: In simpleRNASeq(bamFiles = getBamFileList(args), param = param, : > You have chosen to override the detected parameters. Hope you know what you are doing. Contact me if you think the parameter detection failed. > > > > > ________________________________ > >> sessionInfo()R version 3.2.0 (2015-04-16) > Platform: x86_64-redhat-linux-gnu (64-bit) > Running under: CentOS release 6.6 (Final) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] easyRNASeq_2.4.2 > > loaded via a namespace (and not attached): > [1] RColorBrewer_1.1-2 futile.logger_1.4.1 GenomeInfoDb_1.4.1 > [4] XVector_0.8.0 futile.options_1.0.0 bitops_1.0-6 > [7] zlibbioc_1.14.0 biomaRt_2.24.0 annotate_1.46.0 > [10] RSQLite_1.0.0 lattice_0.20-31 DBI_0.3.1 > [13] parallel_3.2.0 DESeq_1.20.0 genefilter_1.50.0 > [16] hwriter_1.3.2 Biostrings_2.36.1 S4Vectors_0.6.1 > [19] IRanges_2.2.5 stats4_3.2.0 locfit_1.5-9.1 > [22] grid_3.2.0 LSD_3.0 Biobase_2.28.0 > [25] AnnotationDbi_1.30.1 XML_3.98-1.3 survival_2.38-3 > [28] BiocParallel_1.2.6 limma_3.24.12 latticeExtra_0.6-26 > [31] geneplotter_1.46.0 lambda.r_1.1.7 edgeR_3.10.2 > [34] Rsamtools_1.20.4 intervals_0.15.0 genomeIntervals_1.24.1 > [37] GenomicAlignments_1.4.1 splines_3.2.0 BiocGenerics_0.14.0 > [40] GenomicRanges_1.20.5 ShortRead_1.26.0 xtable_1.7-4 > [43] RCurl_1.95-4.7 >> <font face="sans-serif, Arial, Verdana, Trebuchet MS"> > > > </font> > > ________________________________ > > Post tags: easyrnaseq, bug > > You may reply via email or visit Several issues with the package easyRNASeq
ADD COMMENT
0
Entering edit mode

head(assay(sexp))
          ACACTG.bam ACTAGC.bam ATGGCT.bam gapped.bam subset.bam TTGCGA.bam
CG10035:1         38         26         33          0         38         39
CG10035:2         87         70         86          0         83         96
CG10090:1          0          0          0          0          0          0
CG10091:1          0          0          0          0          0          0
CG10091:2          1          2          1          0          1          0
CG10096:1          0          0          0          0          0          0

Does that look good? 

There is only one thing I could think of that is different, I have mapped against the genome and the mitochondrial genome, but I don't have the mitochondrial genes in the Gff, so there is a reference in in the BAM files that doesn't have a landmark in the GFF, wasn't a problem before, but maybe now?

 

 

ADD REPLY
0
Entering edit mode
Hej Michael! Yes that looks good and no that should not be the problem; that would just raise a warning. Could you share - offline - with me your gff file and an excerpt of your BAM file - with its header? This way would be the easiest to reproduce the issue. If that's OK for you, write to me at the email address below in my signature and I'll set up a box (or similar) to exchange the files. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 03 Jul 2015, at 19:07, Michael Dondrup [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User Michael Dondrup wrote Comment: Several issues with the package easyRNASeq: > > > head(assay(sexp)) > ACACTG.bam ACTAGC.bam ATGGCT.bam gapped.bam subset.bam TTGCGA.bam > CG10035:1 38 26 33 0 38 39 > CG10035:2 87 70 86 0 83 96 > CG10090:1 0 0 0 0 0 0 > CG10091:1 0 0 0 0 0 0 > CG10091:2 1 2 1 0 1 0 > CG10096:1 0 0 0 0 0 0 > > Does that look good? > > There is only one thing I could think of that is different, I have mapped against the genome and the mitochondrial genome, but I don't have the mitochondrial genes in the Gff, so there is a reference in in the BAM files that doesn't have a landmark in the GFF, wasn't a problem before, but maybe now? > > > > > Post tags: easyrnaseq, bug > > You may reply via email or visit C: Several issues with the package easyRNASeq >
ADD REPLY
0
Entering edit mode
@michael-dondrup-8328
Last seen 3.6 years ago
Bergen, Norway

Hei Nicolas,

I found the answer, I can give you a an easy to reproduce example. From my traceback I see that your package is using the following code to split the annotation object by seqnames:

3: split(grngs, grngs$seqnames)
# but it should be:
split(grngs, seqnames(grngs))

It seems like this is only called, if a gff3 file is loaded with several landmarks, or it is otherwise not covered by the D-mel gff test case. 

Example:

library(GenomicRanges)
example(GRanges) # get some toy data
gr
[...]
split(gr, gr$seqnames)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘splitAsList’ for signature ‘"GRanges", "NULL"’

split(gr, seqnames(gr))
GRangesList object of length 3:
$chr1
GRanges object with 3 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1   [1, 10]      - |         1                 1
  e     chr1   [5, 10]      * |         5 0.555555555555556
  f     chr1   [6, 10]      + |         6 0.444444444444444
$chr2
GRanges object with 3 ranges and 2 metadata columns:
    seqnames  ranges strand | score                GC
  b     chr2 [2, 10]      + |     2 0.888888888888889
  c     chr2 [3, 10]      + |     3 0.777777777777778
  d     chr2 [4, 10]      * |     4 0.666666666666667

$chr3
GRanges object with 4 ranges and 2 metadata columns:
    seqnames   ranges strand | score                GC
  g     chr3 [ 7, 10]      + |     7 0.333333333333333
  h     chr3 [ 8, 10]      + |     8 0.222222222222222
  i     chr3 [ 9, 10]      - |     9 0.111111111111111
  j     chr3 [10, 10]      - |    10                 0

-------
seqinfo: 3 sequences from mock1 genome

 

 

 

 

ADD COMMENT
0
Entering edit mode
@michael-dondrup-8328
Last seen 3.6 years ago
Bergen, Norway

Hei, here is a patch. There was another problem, the 'bp' resolution method was not implemented: This method has not been implemented yet. So I changed my settings to use resolution settings to 'read' and now I really get a result object with the patched package.

Cheers

Patch follows:

diff -rupN a/easyRNASeq/R/easyRNASeq-internal-methods.R b/easyRNASeq/R/easyRNASeq-internal-methods.R
--- a/easyRNASeq/R/easyRNASeq-internal-methods.R    2015-04-16 22:11:15.000000000 +0200
+++ b/easyRNASeq/R/easyRNASeq-internal-methods.R    2015-07-06 12:13:48.989012942 +0200
@@ -248,7 +248,7 @@
                                  mode=mode))$counts
       },
       "bp"={
-          stop("This method has not been implemented yet.")
+          stop("The resolution method 'bp' has not been implemented yet.")
           ## Remove Multiple hits
           ## TODO is that needed since interfeature is TRUE
           if(all(is.na(mcols(chunk)$NH))){
diff -rupN a/easyRNASeq/R/easyRNASeq-simpleRNASeq.R b/easyRNASeq/R/easyRNASeq-simpleRNASeq.R
--- a/easyRNASeq/R/easyRNASeq-simpleRNASeq.R    2015-06-29 02:20:38.000000000 +0200
+++ b/easyRNASeq/R/easyRNASeq-simpleRNASeq.R    2015-07-06 12:00:11.840112272 +0200
@@ -269,7 +269,7 @@ setMethod(f="simpleRNASeq",
             ## to accept a GRanges and not a GRangesList
             rowRanges(sexp) <- switch(precision(param),
                                     "read"={split(grngs,mcols(grngs)[,sub("s$","",countBy(param))])},
-                                    "bp"={split(grngs,grngs$seqnames)})
+                                    "bp"={split(grngs,seqnames(grngs))})

             if(verbose){
               message("==========================")

 

ADD COMMENT
0
Entering edit mode
Output of the sexp object:

head(assay(ret))
                  9-ch9_S57_L00X_R1_001.fastq.gz.bam.sorted.bam
ATP6.t01                                                     97
COX1.t01                                                   3133
COX2.t01                                                    376
COX3.t01                                                    336
CYTB.t01                                                    159
EMLSAT00000000001                                             2

 

ADD REPLY
0
Entering edit mode
Hej Michael! Thanks for the patch! It was not a gff3 related issue (since we use a gff3 in the example I sent), but really the use of the "bp" precision. Thanks for making the message more expressive too :-) I'll integrate your patch asap and push to Bioc. The new version 2.4.4 should be available in a few days. If you want it sooner, just email me personally or get it from svn in a couple of hours. I'll start implementing some of the other changes we talked about in the development version. Can you let me know which over type than mRNA you'd like to have? Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 06 Jul 2015, at 12:21, Michael Dondrup [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User Michael Dondrup wrote Answer: Several issues with the package easyRNASeq: > > > Hei, here is a patch. There was another problem, the 'bp' resolution method was not implemented: This method has not been implemented yet. So I changed my settings to use resolution settings to 'read' and now I really get a result object with the patched package. > > Cheers > > Patch follows: > > diff -rupN a/easyRNASeq/R/easyRNASeq-internal-methods.R b/easyRNASeq/R/easyRNASeq-internal-methods.R > --- a/easyRNASeq/R/easyRNASeq-internal-methods.R 2015-04-16 22:11:15.000000000 +0200 > +++ b/easyRNASeq/R/easyRNASeq-internal-methods.R 2015-07-06 12:13:48.989012942 +0200 > @@ -248,7 +248,7 @@ > mode=mode))$counts > }, > "bp"={ > - stop("This method has not been implemented yet.") > + stop("The resolution method 'bp' has not been implemented yet.") > ## Remove Multiple hits > ## TODO is that needed since interfeature is TRUE > ifallis.na(mcols(chunk)$NH))){ > diff -rupN a/easyRNASeq/R/easyRNASeq-simpleRNASeq.R b/easyRNASeq/R/easyRNASeq-simpleRNASeq.R > --- a/easyRNASeq/R/easyRNASeq-simpleRNASeq.R 2015-06-29 02:20:38.000000000 +0200 > +++ b/easyRNASeq/R/easyRNASeq-simpleRNASeq.R 2015-07-06 12:00:11.840112272 +0200 > @@ -269,7 +269,7 @@ setMethod(f="simpleRNASeq", > ## to accept a GRanges and not a GRangesList > rowRanges(sexp) <- switch(precision(param), > "read"={split(grngs,mcols(grngs)[,sub("s$","",countBy(param))])}, > - "bp"={split(grngs,grngs$seqnames)}) > + "bp"={split(grngs,seqnames(grngs))}) > > if(verbose){ > message("==========================") > > > > Post tags: easyrnaseq, bug > > You may reply via email or visit A: Several issues with the package easyRNASeq >
ADD REPLY
0
Entering edit mode

Regarding the type of RNA I simply had 'transcript' annotated in the gff3 files we got from ensembl. I imagine that this, and mRNA, would be sufficient and general enough to not having to rename the type to something incorrect at least. Or could it be configurable? I understand it could be a bit overkill to have code that parses the complete SO all the time to allow for all sub-classes of transcript.
 

ADD REPLY
0
Entering edit mode
It could easily be configurable. I certainly do not trust the full outcome of gene annotation tools myself, so we do spend a lot of time curating gene models, and as such it will be definitely valuable for us to have it configurable. I certainly don't want to parse the complete SO every time :-), so I will cherry pick the ones I need and make it obvious in the doc how additional ones can be added/used in addition to the set I'll pick. I'll look into the parent-child relationship for "transcript" and I'll implement it in the devel version. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden --------------------------------------------------------------- > On 06 Jul 2015, at 15:09, Michael Dondrup [bioc] <noreply@bioconductor.org> wrote: > > Activity on a post you are following on support.bioconductor.org > User Michael Dondrup wrote Comment: Several issues with the package easyRNASeq: > > > Regarding the type of RNA I simply had 'transcript' annotated in the gff3 files we got from ensembl. I imagine that this, and mRNA, would be sufficient and general enough to not having to rename the type to something incorrect at least. Or could it be configurable? I understand it could be a bit overkill to have code that parses the complete SO all the time to allow for all sub-classes of transcript. > > > > Post tags: easyrnaseq, bug > > You may reply via email or visit C: Several issues with the package easyRNASeq >
ADD REPLY
0
Entering edit mode
Hej Michael! The solution I implemented in this other thread: C: easyRNASeq: "Error in IRangesList" might be useful to you too. I implemented the function to create the synthetic transcript so that it's readily available for easyRNASeq v2.4.5+. That's the same function that will be part of the devel package once I've ported it. Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden ---------------------------------------------------------------
ADD REPLY
0
Entering edit mode
@michael-dondrup-8328
Last seen 3.6 years ago
Bergen, Norway

Unfortunately, there is another bug in the internal function ".streamForParam", line 396 leading to error messages for some files:

Error in DataFrame(Paired = any(bamFlagTest(params$flag, "isPaired")),  :
  missing values in 'row.names'

I have debugged this function, and this could be the culprit:

grng <- reduce(GRanges(IRanges(start=params$pos,width=params$qwidth),
                         seqnames=params$rname,strand=params$strand))

This line is resulting in a short GRanges object with only 16 ranges entries an invalid overlap count table (one that doesn't have both columns "TRUE" and "FALSE") and finally the error message when making a DataFrame selecting the missing column from the  table. I do not fully understand why this is the case, but it is a bit weird, from time to time reduce seems to deliver different results.

I think that is because the wrong reduce method is used, but I need to check this better.

See my debug session below. Note that this code could also fail if the correct reduce would be used, in the unlikely case there are no overlaps at all or all ranges overlap, which would lead to missing column in the table. To continue, I have now simply disabled the check by setting prop <- 1

 

 


debug: params <- scanBam(bamFile, param = ScanBamParam(what = c("flag",
    "rname", "strand", "pos", "qwidth"), flag = scanBamFlag(isUnmappedQuery = FALSE)))[[1]]
Browse[2]>
debug: rng <- range(params$qwidth)
Browse[2]>
debug: grng <- reduce(GRanges(IRanges(start = params$pos, width = params$qwidth),
    seqnames = params$rname, strand = params$strand))
Browse[2]>
debug: tab <- table(countOverlaps(grng[strand(grng) == "+"], grng[strand(grng) ==
    "-"], ignore.strand = TRUE) == 0)
Browse[2]> grng
GRanges object with 16 ranges and 0 metadata columns:
          seqnames        ranges strand
             <Rle>     <IRanges>  <Rle>
   [1] NC_007215.1  [ 180,  309]      +
   [2] NC_007215.1  [ 373,  761]      +
   [3] NC_007215.1  [ 832,  929]      +
   [4] NC_007215.1  [ 952, 7219]      +
   [5] NC_007215.1  [7224, 7469]      +
   ...         ...           ...    ...
  [12] NC_007215.1 [ 493,   759]      -
  [13] NC_007215.1 [ 835,   910]      -
  [14] NC_007215.1 [1038,  6492]      -
  [15] NC_007215.1 [6582,  9890]      -
  [16] NC_007215.1 [9897, 10323]      -
  -------
  seqinfo: 36096 sequences from an unspecified genome; no seqlengths
Browse[2]>
debug: prop <- tab/sum(tab)
Browse[2]> tab

FALSE
    8
Browse[2]>
debug: outDF <- DataFrame(Paired = any(bamFlagTest(params$flag, "isPaired")),
    ReadLength = ifelse(rng[1] == rng[2], paste(rng[1], "bp",
        sep = ""), paste(rng, "bp", sep = "", collapse = "-")),
    Stranded = ifelse(prop["TRUE"] >= strandedness.cutoff, TRUE,
        ifelse(prop["FALSE"] >= strandedness.cutoff, FALSE, NA)),
    Note = ifelse(any(prop <= 1 - strandedness.cutoff), paste("Determined using",
        sum(tab), "regions spanning", sum(width(grng)), "bp on either strand, at a 90% cutoff."),
        paste("Strandedness could not be determined using", sum(tab),
            "regions spanning", sum(width(grng)), "bp on either strand at a 90% cutoff;",
            round(prop["TRUE"] * 100, digits = 2), "percent appear to be stranded.")))
Browse[2]> prop

FALSE
    1
Browse[2]> ifelse(prop["TRUE"] >= strandedness.cutoff, TRUE,
+         ifelse(prop["FALSE"] >= strandedness.cutoff, FALSE, NA)
+ )
<NA>
  NA

This is how grng should look:

GRanges object with 1566 ranges and 0 metadata columns:
            seqnames             ranges strand
               <Rle>          <IRanges>  <Rle>
     [1] NC_007215.1     [10254, 10935]      +
     [2] NC_007215.1     [10954, 12270]      +
     [3] NC_007215.1     [12302, 15372]      +
     [4] NC_007215.1     [10257, 10944]      -
     [5] NC_007215.1     [11012, 12090]      -
     ...         ...                ...    ...
  [1562]  LSalAtl2s1 [3596825, 3596900]      -
  [1563]  LSalAtl2s1 [3600244, 3600319]      -
  [1564]  LSalAtl2s1 [3600638, 3600713]      -
  [1565]  LSalAtl2s1 [3601416, 3601505]      -
  [1566]  LSalAtl2s1 [3602070, 3602150]      -
  -------
  seqinfo: 36096 sequences from an unspecified genome; no seqlengths



 

ADD COMMENT
0
Entering edit mode
Hej Michael! Nice spotting that one. Just to clarify, what this bit of code does is: 1) take the first 1,000,000 reads in the BAM file (or did you change the yield size in you BamFileList?) 2) from these 1,000,000 reads, identify the loci they cover in a strand specific fashion (that's the reduce GRanges bit. There's no issue with the reduce function since I selectively only import the one I need from the IRanges package) 3) It then counts how often these regions overlap in a strand unspecific manner and tabulates that. So it's a very naive way of assessing if a BAM file is stranded. I'll revamp that as well. What is happening in your case is indeed that tab contains only one state: FALSE and nothing like what it should be: > prop FALSE TRUE 0.8 0.2 That was an easy fix implemented in v.2.4.6. That will be available from Bioc SVN in the next hour, I need to first test it. Thanks again for the nice debug. You keep me busy ;-) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme, PhD The Street Lab Department of Plant Physiology Umeå Plant Science Center Tel: +46 90 786 5478 Email: nicolas.delhomme@umu.se SLU - Umeå universitet Umeå S-901 87 Sweden ---------------------------------------------------------------
ADD REPLY

Login before adding your answer.

Traffic: 530 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6