Search
Question: Rsubread featureCounts with strand information
0
gravatar for rubi
11 months ago by
rubi70
rubi70 wrote:

Hi,

I have RNAseq data with strand information and I'm using Rsubread's featureCounts function to sum reads onto the GTF annotation.

 

This is the command I'm using:

featureCounts(files=my.bam,annot.ext=my.gtf,isGTFAnnotationFile=T,useMetaFeatures=T,isPairedEnd=T)

 

my.gtf is gencode.v25.primary_assembly.annotation.gtf BTW.

 

When I specify strandSpecific=1 I'm getting a weird behavior where, for an example gene, it's reporting many more reads mapping to an antisense gene encoded on the - strand than to an overlapping protein coding gene encoded on the + strand (260 vs. 12 reads). This is neither supported when I count the number of reads per strand for that genomic interval using samtools view nor when I view this region with IGV.

If I don't specify strandSpecific I am getting nearly the reverse outcome: most of the reads are assigned to the protein coding gene (2 vs. 878).

Any idea what is this?

 

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
[1] C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsubread_1.24.0     snpEnrichment_1.7.0 doParallel_1.0.10  
[4] iterators_1.0.8     foreach_1.4.3      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7         lattice_0.20-34     codetools_0.2-15   
 [4] assertthat_0.1      grid_3.3.1          plyr_1.8.4         
 [7] gtable_0.2.0        scales_0.4.1        ggplot2_2.2.0      
[10] zlibbioc_1.20.0     lazyeval_0.2.0      Matrix_1.2-7.1     
[13] snpStats_1.24.0     splines_3.3.1       tools_3.3.1        
[16] munsell_0.4.3       survival_2.40-1     BiocGenerics_0.20.0
[19] colorspace_1.2-7    tibble_1.2         

 

 

ADD COMMENTlink modified 10 months ago by Aaron Lun17k • written 11 months ago by rubi70
3
gravatar for Aaron Lun
10 months ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

I'll stick my neck out and guess that your data is reversely stranded. If you're using Illumina's TruSeq stranded protocol, the first read of each pair should contain the sequence of the template strand, i.e., it is in the opposite direction of the coding sequence. Thus, you should set strandSpecific=2 in featureCounts

Incidentally, this would explain your observations for the "weird gene". Let's say that the antisense gene is not expressed much, while the protein-coding gene is highly expressed, which explains the results when you don't use strand-specific counting. (This difference is exacerbated by the fact that, even if the antisense gene were expressed, read pairs couldn't get uniquely assigned to it due to the presence of the protein-coding gene.) However, when you set (probably incorrectly) strandSpecific=1, read pairs derived from the protein coding gene get assigned to the antisense gene, and vice versa, resulting in a switch in the count sizes.

When I'm not sure about the strandedness of the data, I examine the BAM files on a genome browser to check whether the first read in each pair in the same or opposite direction as the coding strand for a handful of genes. This will tell you quickly what's going on.

ADD COMMENTlink modified 10 months ago • written 10 months ago by Aaron Lun17k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 134 users visited in the last hour