Entering edit mode
Hi,
I am running SpliceWiz following these indications here https://bioconductor.org/packages/devel/bioc/vignettes/SpliceWiz/inst/doc/SW_QuickStart.html#Quick-Start.
I obtained my reference by running
require(AnnotationHub)
ah = AnnotationHub()
query(ah, c("Homo Sapiens", "release-94"))
ref_path = file.path(tempdir(), "Reference")
buildRef(
reference_path = ref_path,
fasta = "AH65745",
gtf = "AH64631",
genome_type = "hg38" # take into account mappability exclusions
)
and provided BAMs sorted by read name (using samtools collate). Everything works fine until the following step:
nxtse_path <- file.path(tempdir(), "NxtSE_output")
collateData(
Experiment = expr,
reference_path = ref_path,
output_path = nxtse_path
)
Error in stop_if_wrong_length("'seqnames'", ans_len) : 'seqnames' must have the length of the object to construct (1) or length 1
11.
stop(wmsg(what, " must have the length of the object ", "to construct (", ans_len, ") or length 1"))
10.
stop_if_wrong_length("'seqnames'", ans_len)
9.
new_GRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand, mcols = mcols, seqinfo = seqinfo)
8.
GRanges(seqnames = as.character(seqnames), ranges = IRanges(start = start, end = start + 1), strand = "+")
7.
eval(substitute(expr), data, enclos = parent.frame())
6.
eval(substitute(expr), data, enclos = parent.frame())
5.
with.default(junc.common.unanno, GRanges(seqnames = as.character(seqnames), ranges = IRanges(start = start, end = start + 1), strand = "+"))
4.
with(junc.common.unanno, GRanges(seqnames = as.character(seqnames), ranges = IRanges(start = start, end = start + 1), strand = "+"))
3.
.collateData_junc_annotate(2, reference_path, norm_output_path, lowMemoryMode)
2.
.collateData_annotate(reference_path, norm_output_path, stranded, novelSplicing, lowMemoryMode, minSamplesWithJunc = novelSplicing_minSamples, minSamplesAboveJuncThreshold = novelSplicing_minSamplesAboveThreshold, novelSplicing_requireOneAnnotatedSJ = novelSplicing_requireOneAnnotatedSJ)
1.
collateData(Experiment = expr, reference_path = ref_path, output_path = nxtse_path)
sessionInfo( )
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fstcore_0.9.14 AnnotationHub_3.6.0 BiocFileCache_2.6.0 dbplyr_2.2.1 BiocGenerics_0.44.0 pheatmap_1.0.12
[7] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[13] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1 SpliceWiz_1.0.2 NxtIRFdata_1.4.0
loaded via a namespace (and not attached):
[1] readxl_1.4.0 backports_1.4.1 lazyeval_0.2.2 shinydashboard_0.7.2
[5] splines_4.2.0 BiocParallel_1.31.8 GenomeInfoDb_1.34.3 digest_0.6.29
[9] foreach_1.5.2 ca_0.71.1 htmltools_0.5.2 viridis_0.6.2
[13] fansi_1.0.3 magrittr_2.0.3 memoise_2.0.1 BSgenome_1.66.1
[17] tzdb_0.3.0 shinyFiles_0.9.3 Biostrings_2.66.0 annotate_1.76.0
[21] modelr_0.1.8 matrixStats_0.62.0 R.utils_2.11.0 prettyunits_1.1.1
[25] colorspace_2.0-3 rvest_1.0.2 blob_1.2.3 rappdirs_0.3.3
[29] haven_2.5.0 xfun_0.31 crayon_1.5.1 RCurl_1.98-1.7
[33] jsonlite_1.8.0 genefilter_1.79.0 survival_3.3-1 iterators_1.0.14
[37] glue_1.6.2 registry_0.5-1 gtable_0.3.0 zlibbioc_1.44.0
[41] XVector_0.38.0 webshot_0.5.4 DelayedArray_0.23.1 Rhdf5lib_1.20.0
[45] HDF5Array_1.26.0 scales_1.2.0 DBI_1.1.3 Rcpp_1.0.8.3
[49] viridisLite_0.4.0 xtable_1.8-4 progress_1.2.2 bit_4.0.4
[53] stats4_4.2.0 DT_0.26 htmlwidgets_1.5.4 httr_1.4.3
[57] RColorBrewer_1.1-3 ellipsis_0.3.2 pkgconfig_2.0.3 XML_3.99-0.10
[61] R.methodsS3_1.8.2 sass_0.4.1 utf8_1.2.2 tidyselect_1.1.2
[65] rlang_1.0.6 later_1.3.0 AnnotationDbi_1.60.0 cellranger_1.1.0
[69] munsell_0.5.0 BiocVersion_3.16.0 tools_4.2.0 cachem_1.0.6
[73] cli_3.4.1 generics_0.1.2 RSQLite_2.2.14 broom_0.8.0
[77] fastmap_1.1.0 heatmaply_1.4.2 yaml_2.3.5 ompBAM_1.2.0
[81] knitr_1.39 bit64_4.0.5 fs_1.5.2 KEGGREST_1.38.0
[85] dendextend_1.16.0 sparseMatrixStats_1.10.0 mime_0.12 rhandsontable_0.3.8
[89] R.oo_1.25.0 xml2_1.3.3 compiler_4.2.0 rstudioapi_0.13
[93] plotly_4.10.0 filelock_1.0.2 curl_4.3.2 png_0.1-7
[97] interactiveDisplayBase_1.36.0 reprex_2.0.1 stringi_1.7.6 bslib_0.3.1
[101] lattice_0.20-45 Matrix_1.5-1 vctrs_0.4.1 pillar_1.7.0
[105] lifecycle_1.0.1 rhdf5filters_1.10.0 BiocManager_1.30.18 jquerylib_0.1.4
[109] data.table_1.14.2 bitops_1.0-7 seriation_1.4.1 httpuv_1.6.5
[113] rtracklayer_1.57.0 GenomicRanges_1.50.1 R6_2.5.1 BiocIO_1.8.0
[117] promises_1.2.0.1 TSP_1.2-1 gridExtra_2.3 IRanges_2.32.0
[121] codetools_0.2-18 assertthat_0.2.1 rhdf5_2.42.0 SummarizedExperiment_1.28.0
[125] rjson_0.2.21 withr_2.5.0 shinyWidgets_0.7.6 GenomicAlignments_1.34.0
[129] Rsamtools_2.14.0 S4Vectors_0.36.0 GenomeInfoDbData_1.2.8 parallel_4.2.0
[133] hms_1.1.1 fst_0.9.8 grid_4.2.0 DelayedMatrixStats_1.20.0
[137] MatrixGenerics_1.10.0 lubridate_1.8.0 Biobase_2.58.0 shiny_1.7.1
[141] restfulr_0.0.15
I understand this is the output of the internal function new_GRanges() from the GenomicRanges package, however I can't figure out what's wrong with my input files.
Thank you
Hi Chiara,
Yes, the issue appears to be in collateData(), whereby there is a sanity check missing. After obtaining a list of novel junction reads, it removes those with seqnames that do not match that of the provided reference. But then it doesn't check whether there are any remaining after filtering before proceeding further.
I will patch this in an upcoming update.
Another user encountered this issue when using Gencode-aligned BAM files and Ensembl reference. In the meantime, please try creating a new reference using the same reference FASTA / GTF files as that used to align your BAM files. I hope that should resolve this issue.
Hi again,
while this issue was solved following your suggestions, I am facing another problem once I get my data collated. Specifically, after running the following:
I obtain a matrix with only zero values, which of course stops downstream analyses.
Same output when setting
Is there anything specific I should check in my BAMs?
Thanks a lot
It is difficult to be sure that matrix is all zeros, as the first and last block of values are IR (IR-ratio based introns) and RI (retained intron in binary comparison with annotated splice event). Most introns are not retained so it is expected there is many zeros. Furthermore, "Coverage" is (for IR) the percent of intron covered with reads, (and for other ASEs) is "participation ratio" in my biorxiv preprint. These values are not used in differential analysis but are only used for the
Participation
filter.It is best to look at
assay(se, "Included")
andassay(se, "Excluded")
which are the read counts belonging to included and excluded isoforms, respectively. To count which events have all non-zero values, you could try for example:To filter out events with insufficient read counts, one might construct a Depth filter:
This filter will essentially remove ASEs that have depth of transcript coverage (splice + IR + inexact splice) < 10. To apply this filter:
The new
se_filtered
object can then be used in downstream analysis. Hope this helps.PS: of course, to quality check the BAM files to see whether the spliced reads match that of the reference, you could view the data frame generated by
sampleQC(se)
. These are not yet thoroughly documented, butFraction_Splice_Reads
will give the fraction of reads for which a splice junction was found, andUnanno_Jn_Fraction
is the fraction of reads with unannotated splice junctions.Another common pitfall is using hg38 reference to analyse BAM files aligned to hg19.