Question

SpliceWiz collateData error

0

Entering edit mode

chiara.caprioli • 0

@aca36346

Last seen 3 months ago

Italy

Hi,

I am running SpliceWiz following these indications here https://bioconductor.org/packages/devel/bioc/vignettes/SpliceWiz/inst/doc/SW_QuickStart.html#Quick-Start.

I obtained my reference by running

require(AnnotationHub)
ah = AnnotationHub()
query(ah, c("Homo Sapiens", "release-94"))

ref_path = file.path(tempdir(), "Reference")
buildRef(
    reference_path = ref_path,
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38" # take into account mappability exclusions
)

and provided BAMs sorted by read name (using samtools collate). Everything works fine until the following step:

nxtse_path <- file.path(tempdir(), "NxtSE_output")
collateData(
    Experiment = expr,
    reference_path = ref_path,
    output_path = nxtse_path
)
Error in stop_if_wrong_length("'seqnames'", ans_len) : 'seqnames' must have the length of the object to construct (1) or length 1
11.
stop(wmsg(what, " must have the length of the object ", "to construct (", ans_len, ") or length 1"))
10.
stop_if_wrong_length("'seqnames'", ans_len)
9.
new_GRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand, mcols = mcols, seqinfo = seqinfo)
8.
GRanges(seqnames = as.character(seqnames), ranges = IRanges(start = start, end = start + 1), strand = "+")
7.
eval(substitute(expr), data, enclos = parent.frame())
6.
eval(substitute(expr), data, enclos = parent.frame())
5.
with.default(junc.common.unanno, GRanges(seqnames = as.character(seqnames), ranges = IRanges(start = start, end = start + 1), strand = "+"))
4.
with(junc.common.unanno, GRanges(seqnames = as.character(seqnames), ranges = IRanges(start = start, end = start + 1), strand = "+"))
3.
.collateData_junc_annotate(2, reference_path, norm_output_path, lowMemoryMode)
2.
.collateData_annotate(reference_path, norm_output_path, stranded, novelSplicing, lowMemoryMode, minSamplesWithJunc = novelSplicing_minSamples, minSamplesAboveJuncThreshold = novelSplicing_minSamplesAboveThreshold, novelSplicing_requireOneAnnotatedSJ = novelSplicing_requireOneAnnotatedSJ)
1.
collateData(Experiment = expr, reference_path = ref_path, output_path = nxtse_path)


sessionInfo( )
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fstcore_0.9.14      AnnotationHub_3.6.0 BiocFileCache_2.6.0 dbplyr_2.2.1        BiocGenerics_0.44.0 pheatmap_1.0.12    
 [7] forcats_0.5.1       stringr_1.4.0       dplyr_1.0.9         purrr_0.3.4         readr_2.1.2         tidyr_1.2.0        
[13] tibble_3.1.7        ggplot2_3.3.6       tidyverse_1.3.1     SpliceWiz_1.0.2     NxtIRFdata_1.4.0   

loaded via a namespace (and not attached):
  [1] readxl_1.4.0                  backports_1.4.1               lazyeval_0.2.2                shinydashboard_0.7.2         
  [5] splines_4.2.0                 BiocParallel_1.31.8           GenomeInfoDb_1.34.3           digest_0.6.29                
  [9] foreach_1.5.2                 ca_0.71.1                     htmltools_0.5.2               viridis_0.6.2                
 [13] fansi_1.0.3                   magrittr_2.0.3                memoise_2.0.1                 BSgenome_1.66.1              
 [17] tzdb_0.3.0                    shinyFiles_0.9.3              Biostrings_2.66.0             annotate_1.76.0              
 [21] modelr_0.1.8                  matrixStats_0.62.0            R.utils_2.11.0                prettyunits_1.1.1            
 [25] colorspace_2.0-3              rvest_1.0.2                   blob_1.2.3                    rappdirs_0.3.3               
 [29] haven_2.5.0                   xfun_0.31                     crayon_1.5.1                  RCurl_1.98-1.7               
 [33] jsonlite_1.8.0                genefilter_1.79.0             survival_3.3-1                iterators_1.0.14             
 [37] glue_1.6.2                    registry_0.5-1                gtable_0.3.0                  zlibbioc_1.44.0              
 [41] XVector_0.38.0                webshot_0.5.4                 DelayedArray_0.23.1           Rhdf5lib_1.20.0              
 [45] HDF5Array_1.26.0              scales_1.2.0                  DBI_1.1.3                     Rcpp_1.0.8.3                 
 [49] viridisLite_0.4.0             xtable_1.8-4                  progress_1.2.2                bit_4.0.4                    
 [53] stats4_4.2.0                  DT_0.26                       htmlwidgets_1.5.4             httr_1.4.3                   
 [57] RColorBrewer_1.1-3            ellipsis_0.3.2                pkgconfig_2.0.3               XML_3.99-0.10                
 [61] R.methodsS3_1.8.2             sass_0.4.1                    utf8_1.2.2                    tidyselect_1.1.2             
 [65] rlang_1.0.6                   later_1.3.0                   AnnotationDbi_1.60.0          cellranger_1.1.0             
 [69] munsell_0.5.0                 BiocVersion_3.16.0            tools_4.2.0                   cachem_1.0.6                 
 [73] cli_3.4.1                     generics_0.1.2                RSQLite_2.2.14                broom_0.8.0                  
 [77] fastmap_1.1.0                 heatmaply_1.4.2               yaml_2.3.5                    ompBAM_1.2.0                 
 [81] knitr_1.39                    bit64_4.0.5                   fs_1.5.2                      KEGGREST_1.38.0              
 [85] dendextend_1.16.0             sparseMatrixStats_1.10.0      mime_0.12                     rhandsontable_0.3.8          
 [89] R.oo_1.25.0                   xml2_1.3.3                    compiler_4.2.0                rstudioapi_0.13              
 [93] plotly_4.10.0                 filelock_1.0.2                curl_4.3.2                    png_0.1-7                    
 [97] interactiveDisplayBase_1.36.0 reprex_2.0.1                  stringi_1.7.6                 bslib_0.3.1                  
[101] lattice_0.20-45               Matrix_1.5-1                  vctrs_0.4.1                   pillar_1.7.0                 
[105] lifecycle_1.0.1               rhdf5filters_1.10.0           BiocManager_1.30.18           jquerylib_0.1.4              
[109] data.table_1.14.2             bitops_1.0-7                  seriation_1.4.1               httpuv_1.6.5                 
[113] rtracklayer_1.57.0            GenomicRanges_1.50.1          R6_2.5.1                      BiocIO_1.8.0                 
[117] promises_1.2.0.1              TSP_1.2-1                     gridExtra_2.3                 IRanges_2.32.0               
[121] codetools_0.2-18              assertthat_0.2.1              rhdf5_2.42.0                  SummarizedExperiment_1.28.0  
[125] rjson_0.2.21                  withr_2.5.0                   shinyWidgets_0.7.6            GenomicAlignments_1.34.0     
[129] Rsamtools_2.14.0              S4Vectors_0.36.0              GenomeInfoDbData_1.2.8        parallel_4.2.0               
[133] hms_1.1.1                     fst_0.9.8                     grid_4.2.0                    DelayedMatrixStats_1.20.0    
[137] MatrixGenerics_1.10.0         lubridate_1.8.0               Biobase_2.58.0                shiny_1.7.1                  
[141] restfulr_0.0.15

I understand this is the output of the internal function new_GRanges() from the GenomicRanges package, however I can't figure out what's wrong with my input files.

Thank you

SpliceWiz • 706 views

ADD COMMENT • link updated 14 months ago by a.wong ▴ 10 • written 15 months ago by chiara.caprioli • 0

1

Entering edit mode

Hi Chiara,

Yes, the issue appears to be in collateData(), whereby there is a sanity check missing. After obtaining a list of novel junction reads, it removes those with seqnames that do not match that of the provided reference. But then it doesn't check whether there are any remaining after filtering before proceeding further.

I will patch this in an upcoming update.

Another user encountered this issue when using Gencode-aligned BAM files and Ensembl reference. In the meantime, please try creating a new reference using the same reference FASTA / GTF files as that used to align your BAM files. I hope that should resolve this issue.

ADD REPLY • link 15 months ago a.wong ▴ 10

0

Entering edit mode

Hi again,

while this issue was solved following your suggestions, I am facing another problem once I get my data collated. Specifically, after running the following:

collateData(
    Experiment = expr,
    reference_path = paste0(path_temp, 'Reference'),
    output_path = nxtse_path,
    IRMode = "SpliceOver"
)

se <- makeSE(nxtse_path, RemoveOverlapping = T)

I obtain a matrix with only zero values, which of course stops downstream analyses.

se@assays@data$Coverage

<417341 x 6> matrix of class DelayedMatrix and type "double":
          [,1] [,2] [,3] [,4] [,5] [,6]
     [1,]    0    0    0    0    0    0
     [2,]    0    0    0    0    0    0
     [3,]    0    0    0    0    0    0
     [4,]    0    0    0    0    0    0
     [5,]    0    0    0    0    0    0
      ...    .    .    .    .    .    .
[417337,]    0    0    0    0    0    0
[417338,]    0    0    0    0    0    0
[417339,]    0    0    0    0    0    0
[417340,]    0    0    0    0    0    0
[417341,]    0    0    0    0    0    0

Same output when setting

RemoveOverlapping = F

Is there anything specific I should check in my BAMs?

Thanks a lot

ADD REPLY • link 14 months ago chiara.caprioli • 0

0

Entering edit mode

It is difficult to be sure that matrix is all zeros, as the first and last block of values are IR (IR-ratio based introns) and RI (retained intron in binary comparison with annotated splice event). Most introns are not retained so it is expected there is many zeros. Furthermore, "Coverage" is (for IR) the percent of intron covered with reads, (and for other ASEs) is "participation ratio" in my biorxiv preprint. These values are not used in differential analysis but are only used for the Participation filter.

It is best to look at assay(se, "Included") and assay(se, "Excluded") which are the read counts belonging to included and excluded isoforms, respectively. To count which events have all non-zero values, you could try for example:

sum(DelayedArray::rowMins(assay(se, "Included")) > 0)

To filter out events with insufficient read counts, one might construct a Depth filter:

f1 <- ASEFilter(filterClass = "Data", filterType = "Depth", minimum = 10)

This filter will essentially remove ASEs that have depth of transcript coverage (splice + IR + inexact splice) < 10. To apply this filter:

se_filtered <- se[applyFilters(se, f1),]

The new se_filtered object can then be used in downstream analysis. Hope this helps.

PS: of course, to quality check the BAM files to see whether the spliced reads match that of the reference, you could view the data frame generated by sampleQC(se). These are not yet thoroughly documented, but Fraction_Splice_Reads will give the fraction of reads for which a splice junction was found, and Unanno_Jn_Fraction is the fraction of reads with unannotated splice junctions.

ADD REPLY • link 14 months ago a.wong ▴ 10

0

Entering edit mode

Another common pitfall is using hg38 reference to analyse BAM files aligned to hg19.

ADD REPLY • link 14 months ago a.wong ▴ 10

score 0 · Answer 1 · 2023-01-19

0

Entering edit mode

chiara.caprioli • 0

@aca36346

Last seen 3 months ago

Italy

For anyone interested, see this post https://github.com/alexchwong/SpliceWiz/issues/33

ADD COMMENT • link 15 months ago chiara.caprioli • 0