#counting reads
se <- summarizeOverlaps(features=ebg, reads=bamfiles,
mode="Union",
singleEnd=TRUE,
ignore.strand=FALSE)
Counting takes only 10 minutes for all bam files which makes me suspicious since it's much longer for human genome, approx. 30 minutes per bam file with 30 milion reads. I have 12 bam files with approx. 10 milion reads per file. I get a count table but when I continue with differential expression analysis I get low numbers of differentially expressed genes (between 20-200, depends on the condition). I guess there is something not ok about counting step. Could anyone help?
I'm not sure the run time is outside what you would expect. The gtf file you're using has 1.5 million entries, while the equivalent human file has 2.5 million, so I'd expect it to be faster working with mouse data. Combine that with using only 1/3 of the reads and a reduction in time doesn't seem weird. (Unless I'm reading this wrong and it's 10 minutes for 12 mouse bams vs 30 minutes for 1 human bam - that is outside what I'd expect).
If you're seeing very little difference in between your groups, one thing worth checking is the proportion of reads that you're finding overlap with features. You don't necessary expect every read to map to a feature, but the proportion should be pretty high. If this number is low, or similarly you have lots of features that have no reads mapping to them, you might want to try setting ignore.strand=TRUE. If the numbers suddenly go up then your reads are aligned to the opposite strand to your annotation and you'll want to flip the strand in your txdb object.
Thank you for your answer. Unfortunately counting takes only 10 minutes in total for all bam files. To me it seems really weird. I set ignore.strand=FALSE because I was informed it was a stranded experiment by the lab that sequenced my samples. I checked the number of reads that had overlap and it seems pretty low for all 12 bam files.
When I had stranded data where the reads came from the opposite strand to the annotation I saw very similar transcript overlap rates to you (~5-10%, assuming 10m reads per library). Setting ignore.strand=TRUE is an easy way to check the hypothesis as you should see those numbers shoot up.
Of course this loses the useful strand information, so if this really is the case you can flip the annotation strand and then run summarizeOverlaps again. Something like this worked for me:
If this really is what has happened, you may also want to look at your alignment step, as some aligners such as tophat can be configured to use the strandedness of the library to improve mapping - but you again need to know which protocol has been used.
To me the processing time doesn't seem surprising -- the BAM files are not large and the reads are single-end.
Some suggestions for debugging. Read in an entire bam file GenomicAlignments::readGAlignments() to make sure that you're getting the reads that you think you should be. Add ScanBamParam() with what="flag" and use bamFlagAsBitMatrix() or bamFlagTest() to make sure that the alignments look sensible. Choose a specific region, maybe range(egb)[1] or something more informative, and read records overlapping that location via readGAlignments() and ScanBamParam(which=...). Invoke the Union() or countOverlaps() functions directly to explore how the selected reads are being counted.
Consider whether your replies should be an 'answer' or a 'comment'.
It's probably is a stranded library, but there are different protocols which can produce data where the read comes from the transcript strand or the opposite strand. You can read more about this here (http://onetipperday.sterding.com/2012/07/how-to-tell-which-library-type-to-use.html)
When I had stranded data where the reads came from the opposite strand to the annotation I saw very similar transcript overlap rates to you (~5-10%, assuming 10m reads per library). Setting
ignore.strand=TRUE
is an easy way to check the hypothesis as you should see those numbers shoot up.Of course this loses the useful strand information, so if this really is the case you can flip the annotation strand and then run
summarizeOverlaps
again. Something like this worked for me:If this really is what has happened, you may also want to look at your alignment step, as some aligners such as tophat can be configured to use the strandedness of the library to improve mapping - but you again need to know which protocol has been used.
To me the processing time doesn't seem surprising -- the BAM files are not large and the reads are single-end.
Some suggestions for debugging. Read in an entire bam file
GenomicAlignments::readGAlignments()
to make sure that you're getting the reads that you think you should be. AddScanBamParam()
withwhat="flag"
and usebamFlagAsBitMatrix()
orbamFlagTest()
to make sure that the alignments look sensible. Choose a specific region, mayberange(egb)[1]
or something more informative, and read records overlapping that location viareadGAlignments()
andScanBamParam(which=...)
. Invoke theUnion()
orcountOverlaps()
functions directly to explore how the selected reads are being counted.Consider whether your replies should be an 'answer' or a 'comment'.
Flipping strand annotation indeed increased the number of overlaps.
For alignment indeed tophat was used.