Question: SummarizeOverlaps runs much faster than estimated
2.3 years ago by
karolina160 wrote:

Hi,

I am trying to count reads with the summarizeOverlaps function. I am working with the murine genome and a GTF file downloaded from Ensembl (ftp://ftp.ensembl.org/pub/release-83/gtf/mus_musculus/). My code is:

gtffile <- file.path(dir, "Mus_musculus.GRCm38.83.gtf")
(txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character()))
(ebg <- exonsBy(txdb, by="gene"))

se <- summarizeOverlaps(features=ebg,
                        reads=bamfiles,
                        mode="Union",
                        singleEnd=TRUE,
                        ignore.strand=FALSE)

Counting takes only 10 minutes for all BAM files, which makes me suspicious, since it takes much longer for the human genome: approx. 30 minutes per BAM file with 30 million reads. I have 12 BAM files with approx. 10 million reads per file. I do get a count table, but when I continue with differential expression analysis I get low numbers of differentially expressed genes (between 20 and 200, depending on the condition). I suspect something is wrong in the counting step. Could anyone help?

2.3 years ago by
Mike Smith2.7k
EMBL Heidelberg / de.NBI
Mike Smith2.7k wrote:

I'm not sure the run time is outside what you would expect. The GTF file you're using has 1.5 million entries, while the equivalent human file has 2.5 million, so I'd expect working with mouse data to be faster. Combine that with using only 1/3 of the reads, and a reduction in time doesn't seem strange. (Unless I'm reading this wrong and it's 10 minutes for 12 mouse BAMs vs. 30 minutes for 1 human BAM; that would be outside what I'd expect.)

If you're seeing very little difference between your groups, one thing worth checking is the proportion of reads found to overlap features. You don't necessarily expect every read to map to a feature, but the proportion should be fairly high. If this number is low, or similarly you have lots of features with no reads mapping to them, you might want to try setting ignore.strand=TRUE. If the numbers suddenly go up, then your reads are aligned to the opposite strand from your annotation, and you'll want to flip the strand in your TxDb object.
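A quick way to compute that proportion is to divide the assigned counts by the number of mapped reads per BAM. A sketch, assuming `se` is the RangedSummarizedExperiment returned by summarizeOverlaps() and `bamfiles` is the (indexed) BamFileList that was counted:

```r
library(Rsamtools)
library(SummarizedExperiment)

# Total mapped reads per BAM, taken from the index statistics
totals <- sapply(path(bamfiles), function(f) sum(idxstatsBam(f)$mapped))

# Fraction of reads assigned to a feature; should be fairly high
round(colSums(assay(se)) / totals, 2)
```

If this fraction sits down around 5-10%, the strand mismatch described above is a likely culprit.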

2.3 years ago by
karolina160 wrote:

Dear Mike,

Thank you for your answer. Unfortunately, counting really does take only 10 minutes in total for all BAM files, which seems very strange to me. I set ignore.strand=FALSE because the lab that sequenced my samples told me it was a stranded experiment. I checked the number of reads with overlaps, and it seems quite low for all 12 BAM files.

colSums(assay(se))
GC027189 GC027194 GC027197 GC027190 GC027193 GC027198 GC027191 GC027195 GC027199
1218552   547055   497315   646214   658831   595220  1574572   545063   443403
GC027192 GC027196 GC027200
634400   628173   512950

It probably is a stranded library, but there are different protocols, which can produce data where the read comes from the transcript strand or from the opposite strand. You can read more about this here: http://onetipperday.sterding.com/2012/07/how-to-tell-which-library-type-to-use.html

When I had stranded data where the reads came from the opposite strand to the annotation, I saw transcript overlap rates very similar to yours (~5-10%, assuming 10 million reads per library). Setting ignore.strand=TRUE is an easy way to test the hypothesis, as you should see those numbers shoot up.
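The check itself is just a recount with one argument changed. A sketch, assuming `ebg` and `bamfiles` exist as created earlier in the thread:

```r
library(GenomicAlignments)

# Recount ignoring strand; compare totals against the stranded run
seUnstranded <- summarizeOverlaps(features = ebg,
                                  reads = bamfiles,
                                  mode = "Union",
                                  singleEnd = TRUE,
                                  ignore.strand = TRUE)

# If the library is stranded but flipped relative to the annotation,
# these totals should be far higher than with ignore.strand = FALSE
colSums(assay(seUnstranded))
```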

Of course this loses the useful strand information, so if this really is the case you can flip the annotation strand and then run summarizeOverlaps again.  Something like this worked for me:

ebg <- exonsBy(txdb, by="gene")

tmp <- unlist(ebg, use.names = FALSE)
strand(tmp) <- ifelse(strand(tmp) == '+', '-', '+')
ebgInverted <- relist(tmp, ebg)

se <- summarizeOverlaps(features=ebgInverted, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=FALSE)


If this really is what has happened, you may also want to revisit your alignment step, as some aligners, such as TopHat, can be configured to use the strandedness of the library to improve mapping. But again, you need to know which protocol was used.
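For reference, TopHat takes the protocol via its --library-type option (fr-unstranded, fr-firststrand, fr-secondstrand). A sketch of a rerun for a dUTP-style protocol, with the index and FASTQ names purely hypothetical:

```shell
# fr-firststrand corresponds to dUTP-style protocols, where reads come
# from the strand opposite the transcript (file names are placeholders)
tophat --library-type fr-firststrand \
       -o tophat_out_GC027189 \
       GRCm38_index \
       GC027189.fastq.gz
```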

To me the processing time doesn't seem surprising -- the BAM files are not large and the reads are single-end.

Some suggestions for debugging. Read in an entire BAM file with GenomicAlignments::readGAlignments() to make sure that you're getting the reads you think you should. Add ScanBamParam() with what="flag" and use bamFlagAsBitMatrix() or bamFlagTest() to make sure that the alignments look sensible. Choose a specific region, maybe range(ebg)[1] or something more informative, and read records overlapping that location via readGAlignments() and ScanBamParam(which=...). Invoke the Union() or countOverlaps() functions directly to explore how the selected reads are being counted.
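Those checks might look something like this. A sketch, assuming `bamfiles` and `ebg` as before; the region chosen is arbitrary:

```r
library(GenomicAlignments)
library(Rsamtools)

# 1. Read one whole BAM and eyeball the alignments
galn <- readGAlignments(bamfiles[[1]])

# 2. Inspect the flags; check the strand distribution looks sensible
param <- ScanBamParam(what = "flag")
galn  <- readGAlignments(bamfiles[[1]], param = param)
colSums(bamFlagAsBitMatrix(mcols(galn)$flag))

# 3. Read only the records overlapping one gene's range
roi   <- unlist(range(ebg)[1])
param <- ScanBamParam(which = roi)
local <- readGAlignments(bamfiles[[1]], param = param)

# 4. Count those reads directly against the feature to see how
#    the counting mode resolves them
countOverlaps(ebg[1], local)
```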

Flipping strand annotation indeed increased the number of overlaps.

colSums(assay(se))
GC027189.gcap_dev.bam GC027194.gcap_dev.bam GC027197.gcap_dev.bam
20079678              16739574              15305205
GC027190.gcap_dev.bam GC027193.gcap_dev.bam GC027198.gcap_dev.bam
15229181              16365496              15979181
GC027191.gcap_dev.bam GC027195.gcap_dev.bam GC027199.gcap_dev.bam
45133256              15802480              12456766
GC027192.gcap_dev.bam GC027196.gcap_dev.bam GC027200.gcap_dev.bam
17114915              16770177              13470745 

For alignment, TopHat was indeed used.