Rsamtools summarizeOverlaps produces low counts
1
1
Entering edit mode
liorglic ▴ 10
@liorglic-22928
Last seen 4.0 years ago

Hello,
I am preparing for RNA-seq data analysis using DESeq2.
Following the instructions in the vignete, I started by aligning RNA-seq reads to the genome using STAR, and then used the following code to count reads per gene, for only one bam file:

filenames <- c("/path/to/sorted.bam")
bamfiles <- BamFileList(filenames)
txdb <- makeTxDbFromGFF("/path/to/ann.gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
se <- summarizeOverlaps(features = ebg, reads = bamfiles, mode = 'Union', singleEnd = TRUE)

Read counts per gene seem very low. For example, when I run sum(assay(se)), I only get ~2 million reads. This is surprising since the bam file contains ~59 million alignments. I ran Qualimap with the same bam and GTF, which resulted in ~44 million reads mapping to genes, mostly in exons.
Does anyone have an idea what could cause this problem? Or maybe how to debug it, e.g, how can I find out how many reads were discarded and why? How come Rsamtools ends up with only a fraction of the reads mapped to genes found by Qualimap?
Thanks!

Rsamtools summarizeOverlaps RNA-seq • 1.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

You might try

se <- summarizeOverlaps(ebg, bamFiles, singleEnd = TRUE, preprocess.reads = invertStrand)

Which in my experience is what you have to do with STAR alignments, particularly if you actually have paired-end reads.

ADD COMMENT
2
Entering edit mode

I would add that James's answer assumes your RNA-seq libraries were prepared with a stranded protocol, such as Illumina's TruSeq Stranded mRNA, which is the default setting for the argument ignore.strand=FALSE in the call to summarizeOverlaps(). However, if your RNA-seq libraries were prepared with a non-stranded protocol, such as Illumina's TruSeq RNA Library Prep Kit v2, then you should set ignore.strand=TRUE and you don't need the argument preprocess.reads. You can find more information on using summarizeOverlaps() with single-end RNA-seq data prepared with stranded and non-stranded protocols in this other thread.

ADD REPLY
0
Entering edit mode

Thanks a lot! This solved the problem.
I think this argument is necessary not because the alignment was performed with STAR, but because the library is reverse-strand-specific. I've actually tried setting ignore.strand=FALSE, but this didn't solve the problem.

ADD REPLY

Login before adding your answer.

Traffic: 722 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6