Hello I am beginning in bioinformatics and I am analyzing RNAseq data mapped to a genome using RSEM+STAR and also only STAR. In RSEM log the mapping percentages for one sample look like this:
Number of input reads | 60070942 Average input read length | 200 UNIQUE READS: Uniquely mapped reads number | 34093455 Uniquely mapped reads % | 56.76% Average mapped length | 190.57 Number of splices: Total | 3020472 Number of splices: Annotated (sjdb) | 1918800 Number of splices: GT/AG | 2970169 Number of splices: GC/AG | 13559 Number of splices: AT/AC | 1783 Number of splices: Non-canonical | 34961 Mismatch rate per base, % | 1.17% Deletion rate per base | 0.03% Deletion average length | 2.10 Insertion rate per base | 0.02% Insertion average length | 1.78 MULTI-MAPPING READS: Number of reads mapped to multiple loci | 2860068 % of reads mapped to multiple loci | 4.76% Number of reads mapped to too many loci | 57489 % of reads mapped to too many loci | 0.10%
Since I had forgotten to output the BAM file, I mapped again using only STAR. However when I run featureCounts, for the same sample:
featureCounts(files = BAM1, isPairedEnd = TRUE, allowMultiOverlap = TRUE, countMultiMappingReads = TRUE, annot.ext = annotation1, isGTFAnnotationFile = TRUE, GTF.featureType = c("gene"), GTF.attrType = c("ID"), useMetaFeatures = TRUE, nthreads = 8) Process BAM file sample1Aligned.out.bam... || || Paired-end reads are included. || || Total alignments : 47486220 || || Successfully assigned alignments : 13191085 (27.8%)
And I also added a few more genes in the annotation and mapped again using RSEM, and the log looks the same as mentioned above. This time I outputted the .genome.bam file, and for featureCounts:
featureCounts(files = BAM2, isPairedEnd = TRUE, allowMultiOverlap = TRUE, countMultiMappingReads = TRUE, annot.ext = annotation2, isGTFAnnotationFile = TRUE, GTF.featureType = c("gene"), GTF.attrType = c("ID"), useMetaFeatures = TRUE, nthreads = 8) Process BAM file sample1.genome.bam... || || Paired-end reads are included. || || Total alignments : 26163546 || || Successfully assigned alignments : 3046125 (11.6%)
I am having a hard time understanding why these percentages are so different between each other (if the logs are the same), and also different from the logs. I read here that in the logs we have reads and in featureCounts we have alignments, and for multiple mapping reads we can have more than one alignment. Ok, but it also says that the featureCounts is always higher which is not the case.. I am very confused. I am posting this here because here I found this other post I mentioned, but I can also post it in somewhere else (where?).
sessionInfo: R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.6 LTS Matrix products: default BLAS: /usr/local/lib/R/lib/libRblas.so LAPACK: /usr/local/lib/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8  LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  grid stats graphics grDevices utils datasets methods base other attached packages:  png_0.1-8 ggdendro_0.1.23 ggrepel_0.9.3 dplyr_1.1.2 dendextend_1.17.1 Rsubread_2.8.2 ggfortify_0.4.16 ggplot2_3.4.2 limma_3.50.3 stringr_1.5.0 loaded via a namespace (and not attached):  Rcpp_1.0.10 pillar_1.9.0 compiler_4.1.2 viridis_0.6.3 tools_4.1.2 digest_0.6.31 evaluate_0.21 viridisLite_0.4.2 lattice_0.21-8 lifecycle_1.0.3 tibble_3.2.1  gtable_0.3.3 pkgconfig_2.0.3 rlang_1.1.1 Matrix_1.5-4.1 cli_3.6.1 rstudioapi_0.14 yaml_2.3.7 xfun_0.39 fastmap_1.1.1 gridExtra_2.3 withr_2.5.0  knitr_1.43 generics_0.1.3 vctrs_0.6.2 tidyselect_1.2.0 glue_1.6.2 R6_2.5.1 fansi_1.0.4 rmarkdown_2.22 tidyr_1.3.0 purrr_1.0.1 magrittr_2.0.3  MASS_7.3-60 scales_1.2.1 htmltools_0.5.5 colorspace_2.1-0 utf8_1.2.3 stringi_1.7.12 munsell_0.5.0