Question

RSubread output significantly different from bash Subread output

0

Entering edit mode

Aimee ▴ 10

@3d966bb9

Last seen 14 months ago

United States

Hello, I would like some assistance in figuring out why Rsubread and the bash subread are giving different feature count outputs. I came across this issue because I wrote a wrapper script for a pipeline used by a colleague, and during the validation process we have been trying to resolve why the wrapper feature count table differs from hers. The differences drastically affect the statistical analysis by DeSeq2. Through much testing, I found that our bam file outputs from STAR were exactly the same, and the issue arises when we run the files through featureCounts. The primary distinction is that her method uses Rsubread and the wrapper uses subread in bash. If I run her bam files through bash subread, our feature count tables are now identical, confirming that the divergence is occurring at the featureCounts step.

Neither of us was using the most updated versions of Rsubread and subread (Rsubread 2.18.0 and subread 2.0.1), and I recognized that this alone could cause an issue if those versions were not aligned with each other. So I downloaded Rsubread 2.20.0 and Subread 2.0.6 (unfortunately, I could not download 2.0.7 because it would require admin privileges, and 2.0.6 was the latest available on bioconda). I then ran the same 14 bam files (forward paired reads only) through Rsubread and bash Subread using the following scripts for each (same gtf file as well):

Subread 2.0.6 script:

featureCounts -t exon -g 'gene_id' -s 2 -o FeatureCounts_original.txt -a ${ref_dir}/${gtf_file} outputs/STAR/*/*.bam

Rsubread 2.20.0 script:

featureCounts(files=bamfiles, annot.ext=gtf, isGTFAnnotationFile=TRUE, isPairedEnd=FALSE, strandSpecific=2)

As far as I can tell by the manual, the parameters used between these two are the same, and both of the run logs indicate the same settings in the Settings box.

Of possible interest: the Rsubread output from version 2.18.0 was identical to that from 2.20.0, and the bash Subread output from version 2.0.1 was identical to that from 2.0.6.

In comparing the resulting feature count tables from the R version and bash version, there are major difference in the number of alignments assigned per sample and in total. Rsubread successfully assigned 205,861,649 more alignments, with an average of 14,704,404 more per sample. Here is a breakdown comparing the two outputs (information compiled from the run logs):

Comparison of successfully assigned reads Rsubread vs. Subread

My team is very confused about which method is the truth, as we would draw vastly different conclusions about the populations of genes that are significantly differentially regulated from each method. Do you see anything that I am doing incorrectly or have suggestions for why the R and bash results are not in agreement? I am happy to provide more information as well. Thank you.

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /apps/x86_64/R/4.4.0/lib64/R/lib/libRblas.so
LAPACK: /apps/x86_64/R/4.4.0/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.4.0

Subread Rsubread • 1.6k views

ADD COMMENT • link updated 14 months ago by ATpoint ★ 5.0k • written 14 months ago by Aimee ▴ 10

score 3 · Accepted Answer · 2024-12-07

The difference (as ATpoint has correctly pointed out) is that Rsubread::featureCounts counts multi-mapping reads by default whereas the command-line featureCounts does not. This would have been reported to you in the output settings box when the commands were run. Rsubread::featureCounts would have given you a settings box saying:

||      Multimapping reads : counted                                          ||

whereas the command-line version would have given you a settings box saying:

||      Multimapping reads : not counted                                      ||

so the settings should have been quite explicit.

We recommend that you use countMultiMappingReads=TRUE and allowMultiOverlap=FALSE for gene-level RNA-seq read counting, and that is why the defaults are set that way in Rsubread. With these settings, featureCounts will count a multi-mapping read only if it overlaps exactly one gene. The default settings mean that it counts multi-mapping reads only if they would be counted in exactly the same way by an EM algorithm. This not only makes logical sense, it has also proved to give good results in many thousands of RNA-seq analyses.

The different defaults in the two program versions are for historical reasons. Originally the default was countMultiMappingReads=FALSE in both versions. When experience showed that countMultiMappingReads=TRUE was better, it was easy to change the R code default, but changing the command-line default requires changes to the C code, which is more work. Remember that this is free software that is maintained long-term without any ongoing funding to do so.

I do not understand why you say you have "paired reads" but are setting isPairedEnd=FALSE. Using paired-end reads would cut down substantially on the proportion of multi-mapping reads. If you have paired-end reads, then both ends will originate from the same strand. I can't think of any reason for counting one end only from each pair.

score 2 · Accepted Answer · 2024-12-06

2

Entering edit mode

ATpoint ★ 5.0k

@atpoint-13662

Last seen 11 hours ago

Germany

The R version counts multimappers (countMultiMappingReads is TRUE by default) while the command line version does not.

ADD COMMENT • link 14 months ago ATpoint ★ 5.0k

0

Entering edit mode

Thank you! I will try rerunning with that parameter and see if it fixes the discrepancy! I'll report back.

Update: This completely resolved the issue! The feature count tables are identical now. Thank you again SO MUCH! I agree--It makes no sense that (1) the defaults are different between the two and (2) that this setting in particular would be set to TRUE by default.

ADD REPLY • link 14 months ago Aimee ▴ 10