Question

Rsamtools summarizeOverlaps produces low counts

1

liorglic ▴ 10

@liorglic-22928

Last seen 4.0 years ago

Hello,
I am preparing for RNA-seq data analysis using DESeq2.
Following the instructions in the vignette, I aligned the RNA-seq reads to the genome with STAR and then used the following code to count reads per gene for a single BAM file:

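library(Rsamtools)          # BamFileList()
library(GenomicFeatures)    # makeTxDbFromGFF(), exonsBy()
library(GenomicAlignments)  # summarizeOverlaps()

filenames <- c("/path/to/sorted.bam")
bamfiles <- BamFileList(filenames)
txdb <- makeTxDbFromGFF("/path/to/ann.gtf", circ_seqs = character())
ebg <- exonsBy(txdb, by="gene")
se <- summarizeOverlaps(features = ebg, reads = bamfiles, mode = 'Union', singleEnd = TRUE)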

Read counts per gene seem very low. For example, when I run sum(assay(se)), I only get ~2 million reads. This is surprising since the bam file contains ~59 million alignments. I ran Qualimap with the same bam and GTF, which resulted in ~44 million reads mapping to genes, mostly in exons.
Does anyone have an idea what could cause this problem? Or how to debug it, e.g., how can I find out how many reads were discarded and why? Why does Rsamtools end up with only a fraction of the gene-mapped reads that Qualimap finds?
Thanks!

Rsamtools summarizeOverlaps RNA-seq • 1.0k views

Updated 4.0 years ago by James W. MacDonald 65k • written 4.0 years ago by liorglic ▴ 10

score 1 · Answer 1 · 2020-04-10

1

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

You might try

se <- summarizeOverlaps(ebg, bamfiles, singleEnd = TRUE, preprocess.reads = invertStrand)

In my experience, this is what you have to do with STAR alignments, particularly if you actually have paired-end reads.
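To make the suggestion concrete, here is a minimal sketch of what invertStrand does to a set of alignments. The two reads and the "chr1" name are made-up toy data, not taken from the BAM file in the question.

```r
library(GenomicAlignments)  # also attaches GenomicRanges, IRanges and S4Vectors

# Hypothetical toy data: two 50-bp alignments on a made-up "chr1",
# one on each strand.
reads <- GAlignments(
  seqnames = Rle(factor(c("chr1", "chr1"))),
  pos      = c(100L, 200L),
  cigar    = c("50M", "50M"),
  strand   = Rle(strand(c("+", "-")))
)

# invertStrand() swaps "+" and "-" and leaves positions and CIGARs untouched.
flipped <- invertStrand(reads)

as.character(strand(reads))
as.character(strand(flipped))
```

Passing the function as preprocess.reads applies this flip to the reads before they are compared with the features, so nothing in the BAM file itself is modified.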

4.0 years ago • James W. MacDonald 65k

2

I would add that James's answer assumes your RNA-seq libraries were prepared with a stranded protocol, such as Illumina's TruSeq Stranded mRNA. Strand-aware counting is what summarizeOverlaps() does by default, because its ignore.strand argument defaults to FALSE. However, if your RNA-seq libraries were prepared with a non-stranded protocol, such as Illumina's TruSeq RNA Library Prep Kit v2, then you should set ignore.strand=TRUE, and you don't need the preprocess.reads argument. You can find more information on using summarizeOverlaps() with single-end RNA-seq data prepared with stranded and non-stranded protocols in this other thread.
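If you are not sure which protocol was used, one rough check is to count the same reads three ways and compare the totals. The sketch below does this on made-up toy data (a single "+" strand gene and three reads on the "-" strand, mimicking a reverse-stranded library); count_total is a hypothetical helper, not part of any package.

```r
library(GenomicAlignments)

# Hypothetical toy annotation: one gene with a single exon on the "+" strand.
genes <- GRangesList(geneA = GRanges("chr1", IRanges(1, 1000), strand = "+"))

# Hypothetical reads from a reverse-stranded library: they align to the
# strand opposite to the gene they come from.
reads <- GAlignments(
  seqnames = Rle(factor(rep("chr1", 3))),
  pos      = c(100L, 300L, 500L),
  cigar    = rep("50M", 3),
  strand   = Rle(strand(rep("-", 3)))
)

# Hypothetical helper: total count over all genes for a given set of options.
count_total <- function(...) {
  sum(assay(summarizeOverlaps(genes, reads, ...)))
}

totals <- c(
  as_is      = count_total(),                                # strand-aware, reads as aligned
  inverted   = count_total(preprocess.reads = invertStrand), # strand-aware, reads flipped
  unstranded = count_total(ignore.strand = TRUE)             # strand ignored
)
totals
```

On real data, a total that is tiny as aligned but large after inversion points to a reverse-stranded library, while roughly equal totals in both strand-aware modes, each about half of the unstranded total, would suggest a non-stranded protocol. To try this on a BAM file, replace the toy objects with your own features and a BamFileList.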

4.0 years ago • Robert Castelo ★ 3.2k

0

Thanks a lot! This solved the problem.
I think this argument is necessary not because the alignment was performed with STAR, but because the library is reverse-strand-specific. I've actually tried setting ignore.strand=FALSE, but this didn't solve the problem.
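On the debugging part of the original question, one possible starting point is to tally the BAM records by flag with Rsamtools, so that the counting result can be compared against sensible denominators (all records, mapped, primary only, and mapped records per strand). This is only a sketch: it uses the small example BAM shipped with Rsamtools as a stand-in input, and count_records is a hypothetical helper.

```r
library(Rsamtools)

# Stand-in input: the example BAM shipped with Rsamtools.
# Replace with the path to your own sorted, indexed BAM file.
bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Hypothetical helper: number of BAM records matching a flag filter.
count_records <- function(flag) {
  countBam(bam, param = ScanBamParam(flag = flag))$records
}

tally <- c(
  all     = count_records(scanBamFlag()),
  mapped  = count_records(scanBamFlag(isUnmappedQuery = FALSE)),
  primary = count_records(scanBamFlag(isUnmappedQuery = FALSE,
                                      isSecondaryAlignment = FALSE)),
  plus    = count_records(scanBamFlag(isUnmappedQuery = FALSE,
                                      isMinusStrand = FALSE)),
  minus   = count_records(scanBamFlag(isUnmappedQuery = FALSE,
                                      isMinusStrand = TRUE))
)
tally
```

This does not report why summarizeOverlaps skipped a particular read, but a large gap between the number of alignments and the number of primary alignments, or a gene-level total far below the mapped count, helps narrow down whether multi-mapping, strand handling, or the annotation is the likelier cause.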

4.0 years ago • liorglic ▴ 10