Question

Aspli: error with gbCounts

0

Entering edit mode

jbono ▴ 10

@jbono-7682

Last seen 4 months ago

United States

I am trying to run gbCounts in Aspli. It appears to work through a few samples, but then I get an error message that I do not know how to address. I appreciate any insight. Here is the code I used:

> Dmel.6.38.TxDb=makeTxDbFromGFF(
+     file="dmel-all-r6.38.gtf",
+     format="gtf")

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.
2: In makeTxDbFromGRanges(gr, metadata = metadata) :
  The following transcripts were dropped because their exon ranks could not be inferred (either because the
  exons are not on the same chromosome/strand or because they are not separated by introns): FBtr0084079,
  FBtr0084080, FBtr0084081, FBtr0084082, FBtr0084083, FBtr0084084, FBtr0084085, FBtr0307759, FBtr0307760
3: In .reject_transcripts(bad_tx, because) :
  The following transcripts were rejected because they have stop codons that cannot be mapped to an exon:
  FBtr0100857, FBtr0100863, FBtr0433500, FBtr0433501

> saveDb(Dmel.6.38.TxDb,file="Dmel.6.38.TxDb.sqlite")

TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: dmel-all-r6.38.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 35367
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-03-18 12:37:55 -0600 (Thu, 18 Mar 2021)
# GenomicFeatures version at creation time: 1.42.2
# RSQLite version at creation time: 2.2.4
# DBSCHEMAVERSION: 1.2

> features <- binGenome( Dmel.6.38.TxDb )
* Number of extracted Genes = 17869
* Number of extracted Exon Bins = 80637
* Number of extracted intron bins = 72288
* Number of extracted trascripts = 35367
* Number of extracted junctions = 60431
* Number of AS bins (not include external) = 9547
* Number of AS bins (include external) = 9557
* Classified as: 
    ES bins = 2427  (25%)
    IR bins = 1257  (13%)
    Alt5'ss bins = 1497 (16%)
    Alt3'ss bins = 1622 (17%)
    Multiple AS bins = 2744 (29%)
    classified as:
            ES bins = 531   (19%)
            IR bins = 492   (18%)
            Alt5'ss bins = 885  (32%)
            Alt3'ss bins = 725  (26%)

> targets=read.csv("Targets.csv")

> getConditions(targets)
[1] "Mutant_F_1D"   "Mutant_M_1D"   "Mutant_F_28D"  "Mutant_M_28D"  "Control_F_1D"  "Control_M_1D"  "Control_F_28D"
[8] "Control_M_28D"

> gbcounts <- gbCounts( features = features,
+                       targets = targets,
+                       minReadLength = 100, maxISize = 50000,
+                       strandMode=0)
Summarizing Mutant_F_1D_1
ETA: 53 min
Summarizing Mutant_F_1D_2
ETA: 49 min
Summarizing Mutant_F_1D_3
ETA: 46 min
Summarizing Mutant_M_1D_1
ETA: 43 min
Summarizing Mutant_M_1D_2
ETA: 41 min
Summarizing Mutant_M_1D_3
ETA: 38 min
Summarizing Mutant_F_28D_1
[1] 7
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
In addition: Warning message:
In colnames(counts@junction.counts)[9:ncol(counts@junction.counts)] <- rownames(targets) :
  number of items to replace is not a multiple of replacement length**

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicFeatures_1.42.2 GenomicRanges_1.42.0   GenomeInfoDb_1.26.4    ASpli_2.0.0            AnnotationDbi_1.52.0  
 [6] IRanges_2.24.1         S4Vectors_0.28.1       Biobase_2.50.0         BiocGenerics_0.36.0    edgeR_3.32.1          
[11] limma_3.46.0          

loaded via a namespace (and not attached):
  [1] colorspace_2.0-0            ellipsis_0.3.1              biovizBase_1.38.0           htmlTable_2.1.0            
  [5] XVector_0.30.0              base64enc_0.1-3             dichromat_2.0-0             rstudioapi_0.13            
  [9] DT_0.17                     bit64_4.0.5                 fansi_0.4.2                 xml2_1.3.2                 
 [13] splines_4.0.4               cachem_1.0.4                knitr_1.31                  Formula_1.2-4              
 [17] Rsamtools_2.6.0             cluster_2.1.1               dbplyr_2.1.0                png_0.1-7                  
 [21] BiocManager_1.30.10         compiler_4.0.4              httr_1.4.2                  backports_1.2.1            
 [25] lazyeval_0.2.2              assertthat_0.2.1            Matrix_1.3-2                fastmap_1.1.0              
 [29] htmltools_0.5.1.1           prettyunits_1.1.1           tools_4.0.4                 igraph_1.2.6               
 [33] gtable_0.3.0                glue_1.4.2                  GenomeInfoDbData_1.2.4      dplyr_1.0.5                
 [37] rappdirs_0.3.3              tinytex_0.30                Rcpp_1.0.6                  vctrs_0.3.6                
 [41] Biostrings_2.58.0           rtracklayer_1.50.0          xfun_0.22                   stringr_1.4.0              
 [45] lifecycle_1.0.0             ensembldb_2.14.0            XML_3.99-0.6                zlibbioc_1.36.0            
 [49] MASS_7.3-53.1               scales_1.1.1                BiocStyle_2.18.1            BSgenome_1.58.0            
 [53] VariantAnnotation_1.36.0    ProtGenerics_1.22.0         hms_1.0.0                   MatrixGenerics_1.2.1       
 [57] SummarizedExperiment_1.20.0 AnnotationFilter_1.14.0     RColorBrewer_1.1-2          yaml_2.2.1                 
 [61] curl_4.3                    memoise_2.0.0               gridExtra_2.3               ggplot2_3.3.3              
 [65] UpSetR_1.4.0                biomaRt_2.46.3              rpart_4.1-15                latticeExtra_0.6-29        
 [69] stringi_1.5.3               RSQLite_2.2.4               checkmate_2.0.0             BiocParallel_1.24.1        
 [73] rlang_0.4.10                pkgconfig_2.0.3             matrixStats_0.58.0          bitops_1.0-6               
 [77] evaluate_0.14               lattice_0.20-41             purrr_0.3.4                 htmlwidgets_1.5.3          
 [81] GenomicAlignments_1.26.0    bit_4.0.4                   tidyselect_1.1.0            plyr_1.8.6                 
 [85] magrittr_2.0.1              R6_2.5.0                    generics_0.1.0              Hmisc_4.5-0                
 [89] DelayedArray_0.16.2         DBI_1.1.1                   pillar_1.5.1                foreign_0.8-81             
 [93] survival_3.2-10             RCurl_1.98-1.3              nnet_7.3-15                 tibble_3.1.0               
 [97] crayon_1.4.1                utf8_1.2.1                  BiocFileCache_1.14.0        rmarkdown_2.7              
[101] jpeg_0.1-8.1                progress_1.2.2              locfit_1.5-9.4              grid_4.0.4                 
[105] data.table_1.14.0           blob_1.2.1                  digest_0.6.27               tidyr_1.1.3                
[109] openssl_1.4.3               munsell_0.5.0               Gviz_1.34.1                 askpass_1.1

Aspli ASpli • 1.9k views

ADD COMMENT • link 3.1 years ago • updated 3.0 years ago jbono ▴ 10

score 0 · Answer 1 · 2021-03-30

0

Entering edit mode

achernomoretz • 0

@b6a1dc8b

Last seen 3.0 years ago

Buenos Aires

Hi jbono,

The second and third warnings report some oddities concerning 'FBgn0002781','FBgn0013680','FBgn0013675','FBgn0262952','FBgn0013684' gene annotations. Could you please try removing them from the gtf file and re-run the analysis?

You can easily create an auxiliary foo.gtf file without the offending genes using the grep command:

   grep -iv 'FBgn0002781\|FBgn0013680\|FBgn0013675\|FBgn0262952\|FBgn0013684' dmel-all-r6.38.gtf > foo.gtf

and then proceed with 'foo.gtf' instead of the original file.

Best Ariel

ADD COMMENT • link 3.1 years ago achernomoretz • 0

0

Entering edit mode

Hi Ariel,

Thanks for the advice! I took out those genes as suggested. Some of the warning messages from 'makeTxDbFromGFF' went away, but I still got the same error message when running gbCounts: Error in .subset(x, j) : only 0's may be mixed with negative subscripts In addition: Warning message: In colnames(counts@junction.counts)[9:ncol(counts@junction.counts)] <- rownames(targets) : number of items to replace is not a multiple of replacement length.

I've listed the code below:

Dmel.6.38.TxDb.edited=makeTxDbFromGFF(
+     file="dmel-all-r6.gt.edited.gtf",
+     format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.
2: In for (i in seq_along(snames)) { :
  closing unused connection 3 (dmel-all-r6.38.edited.gtf)
> saveDb(Dmel.6.38.TxDb.edited,file="Dmel.6.38.TxDb.edited.sqlite")
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: dmel-all-r6.gt.edited.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 35345
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-03-31 10:57:52 -0600 (Wed, 31 Mar 2021)
# GenomicFeatures version at creation time: 1.42.2
# RSQLite version at creation time: 2.2.4
# DBSCHEMAVERSION: 1.2
> features <- binGenome( Dmel.6.38.TxDb.edited )
* Number of extracted Genes = 17868
* Number of extracted Exon Bins = 80610
* Number of extracted intron bins = 72241
* Number of extracted trascripts = 35345
* Number of extracted junctions = 60405
* Number of AS bins (not include external) = 9544
* Number of AS bins (include external) = 9554
* Classified as: 
    ES bins = 2427  (25%)
    IR bins = 1267  (13%)
    Alt5'ss bins = 1497 (16%)
    Alt3'ss bins = 1612 (17%)
    Multiple AS bins = 2741 (29%)
    classified as:
            ES bins = 530   (19%)
            IR bins = 491   (18%)
            Alt5'ss bins = 885  (32%)
            Alt3'ss bins = 724  (26%)

> targets=read.csv("Targets.csv")
> getConditions(targets)
[1] "Mutant_F_1D"   "Mutant_M_1D"   "Mutant_F_28D"  "Mutant_M_28D"  "Control_F_1D"  "Control_M_1D"  "Control_F_28D"
[8] "Control_M_28D"
> gbcounts <- gbCounts( features = features,
+                       targets = targets,
+                       minReadLength = 100, maxISize = 50000,
+                       strandMode=0)
Summarizing Mutant_F_1D_1
ETA: 53 min
Summarizing Mutant_F_1D_2
ETA: 50 min
Summarizing Mutant_F_1D_3
ETA: 46 min
Summarizing Mutant_M_1D_1
ETA: 44 min
Summarizing Mutant_M_1D_2
ETA: 40 min
Summarizing Mutant_M_1D_3
ETA: 38 min
Summarizing Mutant_F_28D_1
[1] 7
Error in .subset(x, j) : only 0's may be mixed with negative subscripts
In addition: Warning message:
In colnames(counts@junction.counts)[9:ncol(counts@junction.counts)] <- rownames(targets) :
  number of items to replace is not a multiple of replacement length

ADD REPLY • link 3.1 years ago jbono ▴ 10

0

Entering edit mode

Hi Jeremy, It seems that there could be some problems with junction information in Mutant_F_28D_1 bam file. Is this file different from the rest from a QC point of view? Could you eventually share it with me in order to see what´s going on specifically with that file? Ariel

ADD REPLY • link 3.1 years ago achernomoretz • 0

0

Entering edit mode

Hi Ariel,

I also remade the bam file just in case but still got the error. I don't think there is a big difference from a QC point of view and the file has worked for a few other workflows. I can send you the file over email unless there is a way to do it here. Thanks for your help!

Jeremy

ADD REPLY • link 3.1 years ago jbono ▴ 10

0

Entering edit mode

Hi Jeremy,

I could not reproduce the error you reported with the bam you sent me....could you run the code I used and tell me how it goes?

library(GenomicFeatures)
library(ASpli)
dme      <- makeTxDbFromGFF(file='/data1/genomeData/dme/foo.gtf')
features <- binGenome( dme )
targets  <- data.frame(row.names = 'C28F1_1',
                       bam = 'C28F1_1.fq.gz.subjunc.sorted.BAM',
                       f1  = 'C28F1')

gb <- gbCounts(features, targets, minReadLength = 100, maxISize = 50000, strandMode = 0)
# Summarizing C28F1_1
# Read summarization by gene completed
# Read summarization by bin completed
# Read summarization by ei1 region completed
# Read summarization by ie2 region completed
# Junction summarization completed 

gb
# Object of class ASpliCounts 
# Gene counts: 17868 genes analysed. Access using countsg(object) 
# Gene RD: 17868 genes analysed. Access using rdsg(object) 
# Bin counts: 143297 bins analysed. Access using countsb(object) 
# Bin RD: 143297 bins analysed. Access using rdsb(object) 
# Junction counts: 59250 junctions analysed. Access using countsj(object)

ADD REPLY • link 3.1 years ago achernomoretz • 0

0

Entering edit mode

Hi Ariel,

That also worked fine for me (see below). Just to be sure I also reran the original analysis and got the same error. I also tried the test below with the next bam file in the sequence just in case that was actually causing the problem, but that worked fine as well.

> dme      <- makeTxDbFromGFF(file="dmel-all-r6.gt.edited.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
  ignored.
> features_test <- binGenome( dme )
* Number of extracted Genes = 17868
* Number of extracted Exon Bins = 80610
* Number of extracted intron bins = 72241
* Number of extracted trascripts = 35345
* Number of extracted junctions = 60405
* Number of AS bins (not include external) = 9544
* Number of AS bins (include external) = 9554
* Classified as: 
    ES bins = 2427  (25%)
    IR bins = 1267  (13%)
    Alt5'ss bins = 1497 (16%)
    Alt3'ss bins = 1612 (17%)
    Multiple AS bins = 2741 (29%)
    classified as:
            ES bins = 530   (19%)
            IR bins = 491   (18%)
            Alt5'ss bins = 885  (32%)
            Alt3'ss bins = 724  (26%)

> targets_test  <- data.frame(row.names = 'C28F1_1',
+                             bam = 'C28F1_1.fq.gz.subjunc.sorted.BAM',
+                             f1  = 'C28F1')
> gb_test <- gbCounts(features_test, targets_test, minReadLength = 100, maxISize = 50000, strandMode = 0)
Summarizing C28F1_1
Read summarization by gene completed
Read summarization by bin completed
Read summarization by ei1 region completed
Read summarization by ie2 region completed
Junction summarization completed
[1] 1
> gb_test
Object of class ASpliCounts 
Gene counts: 17868 genes analysed. Access using countsg(object) 
Gene RD: 17868 genes analysed. Access using rdsg(object) 
Bin counts: 143297 bins analysed. Access using countsb(object) 
Bin RD: 143297 bins analysed. Access using rdsb(object) 
Junction counts: 59250 junctions analysed. Access using countsj(object) 

> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tximport_1.18.0              IsoformSwitchAnalyzeR_1.12.0 ggplot2_3.3.3                DEXSeq_1.36.0               
 [5] RColorBrewer_1.1-2           DESeq2_1.30.1                BiocParallel_1.24.1          GenomicFeatures_1.42.3      
 [9] GenomicAlignments_1.26.0     Rsamtools_2.6.0              Biostrings_2.58.0            XVector_0.30.0              
[13] SummarizedExperiment_1.20.0  MatrixGenerics_1.2.1         matrixStats_0.58.0           GenomicRanges_1.42.0        
[17] GenomeInfoDb_1.26.4          GenomeInfoDbData_1.2.4       ASpli_2.0.0                  AnnotationDbi_1.52.0        
[21] IRanges_2.24.1               S4Vectors_0.28.1             Biobase_2.50.0               BiocGenerics_0.36.0         
[25] Rsubread_2.4.3               edgeR_3.32.1                 limma_3.46.0                

loaded via a namespace (and not attached):
  [1] backports_1.2.1               AnnotationHub_2.22.0          Hmisc_4.5-0                  
  [4] DRIMSeq_1.18.0                BiocFileCache_1.14.0          plyr_1.8.6                   
  [7] igraph_1.2.6                  lazyeval_0.2.2                tximeta_1.8.4                
 [10] splines_4.0.4                 digest_0.6.27                 ensembldb_2.14.0             
 [13] htmltools_0.5.1.1             fansi_0.4.2                   magrittr_2.0.1               
 [16] checkmate_2.0.0               memoise_2.0.0                 BSgenome_1.58.0              
 [19] cluster_2.1.1                 readr_1.4.0                   annotate_1.68.0              
 [22] askpass_1.1                   prettyunits_1.1.1             jpeg_0.1-8.1                 
 [25] colorspace_2.0-0              blob_1.2.1                    rappdirs_0.3.3               
 [28] xfun_0.22                     dplyr_1.0.5                   jsonlite_1.7.2               
 [31] crayon_1.4.1                  RCurl_1.98-1.3                genefilter_1.72.1            
 [34] survival_3.2-10               VariantAnnotation_1.36.0      glue_1.4.2                   
 [37] gtable_0.3.0                  zlibbioc_1.36.0               UpSetR_1.4.0                 
 [40] DelayedArray_0.16.3           scales_1.1.1                  futile.options_1.0.1         
 [43] DBI_1.1.1                     Rcpp_1.0.6                    xtable_1.8-4                 
 [46] progress_1.2.2                htmlTable_2.1.0               foreign_0.8-81               
 [49] bit_4.0.4                     Formula_1.2-4                 DT_0.17                      
 [52] htmlwidgets_1.5.3             httr_1.4.2                    ellipsis_0.3.1               
 [55] pkgconfig_2.0.3               XML_3.99-0.6                  Gviz_1.34.1                  
 [58] nnet_7.3-15                   dbplyr_2.1.0                  locfit_1.5-9.4               
 [61] utf8_1.2.1                    later_1.1.0.1                 tidyselect_1.1.0             
 [64] rlang_0.4.10                  reshape2_1.4.4                BiocVersion_3.12.0           
 [67] munsell_0.5.0                 tools_4.0.4                   cachem_1.0.4                 
 [70] generics_0.1.0                RSQLite_2.2.5                 evaluate_0.14                
 [73] stringr_1.4.0                 fastmap_1.1.0                 yaml_2.2.1                   
 [76] knitr_1.31                    bit64_4.0.5                   purrr_0.3.4                  
 [79] AnnotationFilter_1.14.0       mime_0.10                     formatR_1.8                  
 [82] xml2_1.3.2                    biomaRt_2.46.3                BiocStyle_2.18.1             
 [85] compiler_4.0.4                rstudioapi_0.13               interactiveDisplayBase_1.28.0
 [88] curl_4.3                      png_0.1-7                     tibble_3.1.0                 
 [91] statmod_1.4.35                geneplotter_1.68.0            stringi_1.5.3                
 [94] futile.logger_1.4.3           lattice_0.20-41               ProtGenerics_1.22.0          
 [97] Matrix_1.3-2                  vctrs_0.3.7                   pillar_1.5.1                 
[100] lifecycle_1.0.0               BiocManager_1.30.12           data.table_1.14.0            
[103] bitops_1.0-6                  httpuv_1.5.5                  rtracklayer_1.50.0           
[106] R6_2.5.0                      latticeExtra_0.6-29           hwriter_1.3.2                
[109] promises_1.2.0.1              gridExtra_2.3                 lambda.r_1.2.4               
[112] dichromat_2.0-0               MASS_7.3-53.1                 assertthat_0.2.1             
[115] openssl_1.4.3                 withr_2.4.1                   hms_1.0.0                    
[118] VennDiagram_1.6.20            grid_4.0.4                    rpart_4.1-15                 
[121] tidyr_1.1.3                   rmarkdown_2.7                 biovizBase_1.38.0            
[124] shiny_1.6.0                   base64enc_0.1-3               tinytex_0.31

ADD REPLY • link 3.1 years ago jbono ▴ 10

0

Entering edit mode

Could you send me your Targets.csv file?

ADD REPLY • link 3.1 years ago achernomoretz • 0

0

Entering edit mode

Sorry, I forgot to respond here that I sent the file via email. Please let me know if you didn't receive it, thanks!

ADD REPLY • link 3.1 years ago jbono ▴ 10

0

Entering edit mode

Hi Jeremy. We got it and we are working on the issue. Thanks for your patience. A

ADD REPLY • link 3.1 years ago achernomoretz • 0

score 0 · Answer 2 · 2021-04-16

Hi Jeremy,

Thanks for your patience. Finally, we could reproduce the error... It was produced by a single rDNA aligned junction (rDNA.11456.11467) detected in the problematic BAM that was not correctly parsed by ASpli.

We will be pushing today a new version of ASpli to the BioC devel branch [https://bioconductor.org/packages/devel/bioc/html/ASpli.html]. This version (it will be version 2.1.2) correctly handles your dataset and should be available for installation in a couple of days.

In the meantime, a fast and dirty workaround could be just keeping standard chromosome data in your bams. To do that you could use samtools:

samtools view -b input.bam  2L 2R 3L 3R 4  X Y > output.bam

Regards

Ariel