Question

understanding differences in RSEM, STAR outputs and featureCounts

0

Entering edit mode

vitoriastavis • 0

@191d355d

Last seen 8 months ago

Poland

Hello I am beginning in bioinformatics and I am analyzing RNAseq data mapped to a genome using RSEM+STAR and also only STAR. In RSEM log the mapping percentages for one sample look like this:

Number of input reads | 60070942
                      Average input read length |   200
                                    UNIQUE READS:
                   Uniquely mapped reads number |   34093455
                        Uniquely mapped reads % |   56.76%
                          Average mapped length |   190.57
                       Number of splices: Total |   3020472
            Number of splices: Annotated (sjdb) |   1918800
                       Number of splices: GT/AG |   2970169
                       Number of splices: GC/AG |   13559
                       Number of splices: AT/AC |   1783
               Number of splices: Non-canonical |   34961
                      Mismatch rate per base, % |   1.17%
                         Deletion rate per base |   0.03%
                        Deletion average length |   2.10
                        Insertion rate per base |   0.02%
                       Insertion average length |   1.78
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   2860068
             % of reads mapped to multiple loci |   4.76%
        Number of reads mapped to too many loci |   57489
             % of reads mapped to too many loci |   0.10%

Since I had forgotten to output the BAM file, I mapped again using only STAR. However when I run featureCounts, for the same sample:

    featureCounts(files = BAM1, isPairedEnd = TRUE, allowMultiOverlap = TRUE,
                                countMultiMappingReads = TRUE, annot.ext = annotation1,
                                isGTFAnnotationFile = TRUE, GTF.featureType = c("gene"),
                                GTF.attrType = c("ID"), useMetaFeatures = TRUE, nthreads = 8)


Process BAM file sample1Aligned.out.bam...                                     ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 47486220                                             ||
||    Successfully assigned alignments : 13191085 (27.8%)

And I also added a few more genes in the annotation and mapped again using RSEM, and the log looks the same as mentioned above. This time I outputted the .genome.bam file, and for featureCounts:

    featureCounts(files = BAM2, isPairedEnd = TRUE, allowMultiOverlap = TRUE,
                                countMultiMappingReads = TRUE, annot.ext = annotation2,
                                isGTFAnnotationFile = TRUE, GTF.featureType = c("gene"),
                                GTF.attrType = c("ID"), useMetaFeatures = TRUE, nthreads = 8)

Process BAM file sample1.genome.bam...                                         ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 26163546                                             ||
||    Successfully assigned alignments : 3046125 (11.6%)

I am having a hard time understanding why these percentages are so different between each other (if the logs are the same), and also different from the logs. I read here that in the logs we have reads and in featureCounts we have alignments, and for multiple mapping reads we can have more than one alignment. Ok, but it also says that the featureCounts is always higher which is not the case.. I am very confused. I am posting this here because here I found this other post I mentioned, but I can also post it in somewhere else (where?).

sessionInfo:
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] png_0.1-8         ggdendro_0.1.23   ggrepel_0.9.3     dplyr_1.1.2       dendextend_1.17.1 Rsubread_2.8.2    ggfortify_0.4.16  ggplot2_3.4.2     limma_3.50.3      stringr_1.5.0    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10       pillar_1.9.0      compiler_4.1.2    viridis_0.6.3     tools_4.1.2       digest_0.6.31     evaluate_0.21     viridisLite_0.4.2 lattice_0.21-8    lifecycle_1.0.3   tibble_3.2.1     
[12] gtable_0.3.3      pkgconfig_2.0.3   rlang_1.1.1       Matrix_1.5-4.1    cli_3.6.1         rstudioapi_0.14   yaml_2.3.7        xfun_0.39         fastmap_1.1.1     gridExtra_2.3     withr_2.5.0      
[23] knitr_1.43        generics_0.1.3    vctrs_0.6.2       tidyselect_1.2.0  glue_1.6.2        R6_2.5.1          fansi_1.0.4       rmarkdown_2.22    tidyr_1.3.0       purrr_1.0.1       magrittr_2.0.3   
[34] MASS_7.3-60       scales_1.2.1      htmltools_0.5.5   colorspace_2.1-0  utf8_1.2.3        stringi_1.7.12    munsell_0.5.0

RNASeqData Rsubread mapping featurecounts • 1.0k views

ADD COMMENT • link updated 8 months ago by James W. MacDonald 65k • written 8 months ago by vitoriastavis • 0

0

Entering edit mode

You might also include the output from running samtools idxstats on each BAM file.

ADD REPLY • link 8 months ago James W. MacDonald 65k

0

Entering edit mode

Oh wait. I meant samtools flagstat.

ADD REPLY • link 8 months ago James W. MacDonald 65k