Hi everyone,
I have a question regarding the coverage filter that can be applied to rMATS results (=alternative splicing analysis) using the function filterByCoverage() of the Bioconductor package maser. filterByCoverage() can be used to remove low coverage events. Per default, it removes splicing events with an average coverage of <= 5 reads. The average is calculated over all samples in the comparison. Before filtering, rMATS results are loaded into R using the function maser(). Two example skipping events (SE) as output by rMATS are as follows:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen PValue FDR IncLevel1 IncLevel2 IncLevelDifference
8025 "ENSG00000134759.14" "ELP2" chr18 + 36139418 36139613 36138269 36138426 36141136 36141201 8025 4,9,15 0,0,0 6,5,6 4,4,1 310 115 8.14875944499e-10 6.8150582549049824e-06 1.0,1.0,1.0 0.358,0.317,0.69 0.545
23616 "ENSG00000267750.6" "RUNDC3A-AS1" chr17 - 44304980 44305043 44302516 44304207 44307291 44307500 23616 0,0,0 13,6,2 3,3,4 8,2,6 178 115 1.19718082014e-06 0.00020620498447021453 0.0,0.0,0.0 0.195,0.492,0.301 -0.329
Each group in the comparison has three replicates. The numbers of reads supporting inclusion (per sample) are given in columns IJC_SAMPLE_1 and IJC_SAMPLE_2 for groups 1 and 2, respectively. The numbers of reads supporting skipping are given in columns SJC_SAMPLE_1 and SJC_SAMPLE_2. While rMATS provides inclusion and skipping read counts per sample per event, maser() only reads the inclusion counts (columns IJC_SAMPLE_1 and IJC_SAMPLE_2) and ignores the skipping counts. As a consequence of not reading in the skipping counts, they are not used when events are filtered by coverage.
> library(maser)
# load rMATS results found in directory rmats_results_dir
> rmats <- maser(rmats_result_dir, cond_labels = c("group1","group2"), ftype = "JCEC")
# list all SE events (shows both events listed above)
> summary(rmats,type="SE")
ID GeneID geneSymbol PValue FDR
1 8025 ENSG00000134759.14 ELP2 8.148759e-10 6.815058e-06
2 23616 ENSG00000267750.6 RUNDC3A-AS1 1.197181e-06 2.062050e-04
IncLevelDifference PSI_1 PSI_2 Chr Strand exon_target
1 0.545 1,1,1 0.358,0.317,0.69 chr18 + 36139419-36139613
2 -0.329 0,0,0 0.195,0.492,0.301 chr17 - 44304981-44305043
exon_upstream exon_downstream
1 36138270-36138426 36141137-36141201
2 44302517-44304207 44307292-44307500
# show counts available for SE events in rmats
# (lists only inclusion counts)
> slot(rmats, "SE_counts")
group1_1 group1_2 group1_3 group2_1 group2_2 group2_3
8025 4 9 15 6 5 6
23616 0 0 0 3 3 4
# filter events, remove events with average coverage <=5
> rmats_filt <- filterByCoverage(rmats,avg_reads=5)
# list all SE events remaining after filtering for coverage
> summary(rmats_filt,type="SE")
ID GeneID geneSymbol PValue FDR
1 8025 ENSG00000134759.14 ELP2 8.148759e-10 6.815058e-06
IncLevelDifference PSI_1 PSI_2 Chr Strand exon_target
1 0.545 1,1,1 0.358,0.317,0.69 chr18 + 36139419-36139613
exon_upstream exon_downstream
1 36138270-36138426 36141137-36141201
When applying the maser coverage filter, the first event above is kept, while the second one is removed. Had we instead filtered purely based on skipping coverage, the first events would have been removed, while the second one would have been kept.
So my question is: What is the reason to ignore the reads supporting the skipping of an exon when filtering events for a minimal average coverage? What makes reads supporting inclusion more important than reads supporting skipping? Using only inclusion coverage to filter low coverage events seems to bias the results.
I'm looking forward to hearing your thoughts. Thanks a lot and best,
Anke.
> sessionInfo( )
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] maser_1.20.0 GenomicRanges_1.54.1 GenomeInfoDb_1.38.2
[4] IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1
[7] ggplot2_3.4.4
loaded via a namespace (and not attached):
[1] DBI_1.1.3 bitops_1.0-7
[3] deldir_2.0-2 gridExtra_2.3
[5] biomaRt_2.58.0 rlang_1.1.2
[7] magrittr_2.0.3 biovizBase_1.50.0
[9] matrixStats_1.2.0 compiler_4.3.2
[11] RSQLite_2.3.4 GenomicFeatures_1.54.1
[13] png_0.1-8 vctrs_0.6.5
[15] ProtGenerics_1.34.0 stringr_1.5.1
[17] pkgconfig_2.0.3 crayon_1.5.2
[19] fastmap_1.1.1 backports_1.4.1
[21] dbplyr_2.4.0 XVector_0.42.0
[23] utf8_1.2.4 Rsamtools_2.18.0
[25] rmarkdown_2.25 bit_4.0.5
[27] xfun_0.41 zlibbioc_1.48.0
[29] cachem_1.0.8 progress_1.2.3
[31] blob_1.2.4 DelayedArray_0.28.0
[33] BiocParallel_1.36.0 jpeg_0.1-10
[35] parallel_4.3.2 prettyunits_1.2.0
[37] cluster_2.1.6 VariantAnnotation_1.48.1
[39] R6_2.5.1 stringi_1.8.3
[41] RColorBrewer_1.1-3 rtracklayer_1.62.0
[43] rpart_4.1.23 Gviz_1.46.1
[45] Rcpp_1.0.11 SummarizedExperiment_1.32.0
[47] knitr_1.45 base64enc_0.1-3
[49] Matrix_1.6-4 nnet_7.3-19
[51] tidyselect_1.2.0 dichromat_2.0-0.1
[53] rstudioapi_0.15.0 abind_1.4-5
[55] yaml_2.3.8 codetools_0.2-19
[57] curl_5.2.0 lattice_0.22-5
[59] tibble_3.2.1 Biobase_2.62.0
[61] withr_2.5.2 KEGGREST_1.42.0
[63] evaluate_0.23 foreign_0.8-86
[65] BiocFileCache_2.10.1 xml2_1.3.6
[67] Biostrings_2.70.1 pillar_1.9.0
[69] filelock_1.0.3 MatrixGenerics_1.14.0
[71] DT_0.31 checkmate_2.3.1
[73] generics_0.1.3 RCurl_1.98-1.13
[75] ensembldb_2.26.0 hms_1.1.3
[77] munsell_0.5.0 scales_1.3.0
[79] glue_1.6.2 lazyeval_0.2.2
[81] Hmisc_5.1-1 tools_4.3.2
[83] interp_1.1-5 BiocIO_1.12.0
[85] data.table_1.14.10 BSgenome_1.70.1
[87] GenomicAlignments_1.38.0 XML_3.99-0.16
[89] grid_4.3.2 latticeExtra_0.6-30
[91] AnnotationDbi_1.64.1 colorspace_2.1-0
[93] GenomeInfoDbData_1.2.11 htmlTable_2.4.2
[95] restfulr_0.0.15 Formula_1.2-5
[97] cli_3.6.2 rappdirs_0.3.3
[99] fansi_1.0.6 S4Arrays_1.2.0
[101] dplyr_1.1.4 AnnotationFilter_1.26.0
[103] gtable_0.3.4 digest_0.6.33
[105] SparseArray_1.2.2 rjson_0.2.21
[107] htmlwidgets_1.6.4 memoise_2.0.1
[109] htmltools_0.5.7 lifecycle_1.0.4
[111] httr_1.4.7 bit64_4.0.5