Question: Rsubread featureCounts with strand information
gravatar for rubi
12 months ago by
rubi80 wrote:


I have RNAseq data with strand information and I'm using Rsubread's featureCounts function to sum reads onto the GTF annotation.


This is the command I'm using:



my.gtf is gencode.v25.primary_assembly.annotation.gtf BTW.


When I specify strandSpecific=1 I'm getting a weird behavior where, for an example gene, it's reporting many more reads mapping to an antisense gene encoded on the - strand than to an overlapping protein coding gene encoded on the + strand (260 vs. 12 reads). This is neither supported when I count the number of reads per strand for that genomic interval using samtools view nor when I view this region with IGV.

If I don't specify strandSpecific I am getting nearly the reverse outcome: most of the reads are assigned to the protein coding gene (2 vs. 878).

Any idea what is this?


> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsubread_1.24.0     snpEnrichment_1.7.0 doParallel_1.0.10  
[4] iterators_1.0.8     foreach_1.4.3      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7         lattice_0.20-34     codetools_0.2-15   
 [4] assertthat_0.1      grid_3.3.1          plyr_1.8.4         
 [7] gtable_0.2.0        scales_0.4.1        ggplot2_2.2.0      
[10] zlibbioc_1.20.0     lazyeval_0.2.0      Matrix_1.2-7.1     
[13] snpStats_1.24.0     splines_3.3.1       tools_3.3.1        
[16] munsell_0.4.3       survival_2.40-1     BiocGenerics_0.20.0
[19] colorspace_1.2-7    tibble_1.2         



ADD COMMENTlink modified 12 months ago by Aaron Lun17k • written 12 months ago by rubi80
gravatar for Aaron Lun
12 months ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

I'll stick my neck out and guess that your data is reversely stranded. If you're using Illumina's TruSeq stranded protocol, the first read of each pair should contain the sequence of the template strand, i.e., it is in the opposite direction of the coding sequence. Thus, you should set strandSpecific=2 in featureCounts

Incidentally, this would explain your observations for the "weird gene". Let's say that the antisense gene is not expressed much, while the protein-coding gene is highly expressed, which explains the results when you don't use strand-specific counting. (This difference is exacerbated by the fact that, even if the antisense gene were expressed, read pairs couldn't get uniquely assigned to it due to the presence of the protein-coding gene.) However, when you set (probably incorrectly) strandSpecific=1, read pairs derived from the protein coding gene get assigned to the antisense gene, and vice versa, resulting in a switch in the count sizes.

When I'm not sure about the strandedness of the data, I examine the BAM files on a genome browser to check whether the first read in each pair in the same or opposite direction as the coding strand for a handful of genes. This will tell you quickly what's going on.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Aaron Lun17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 219 users visited in the last hour