Hi all,
I am having a weird issue (weird to me being a new R user) using featurecounts to generate counts for genes but also for individual exons. I have successfully mapped gene counts to their respective genes and the total reads mapped and annotated to genes match those generated by a bioinformatics department so its a good quality control.
I also want to generate counts for individual exons so that I can test the knockdown of one of our genes at specific exons in a mouse model. The issue is that the exon counts do not add up or match I am seeing with total read mapped to genes. The code I am using is as follows:
For annotating genes:
fc <- featureCounts(bamDir, isGTFAnnotationFile = TRUE, annot.ext = "Mus_musculus.GRCm38.94.chr.gtf", isPairedEnd = FALSE)
Count result for gene of interest =
ENSMUSG00000053080 = | 1158 |
This matches results obtained from bioinformatcis department for the gene above as well as all other genes (with +/- of 5 counts, I am guessing for parameters they have used in their script which I am not )...
However, when I attempt to annotate exons specifically, I am running in to trouble for two reasons. The total counts generated and also the amount of exons being incorrect (possible duplicates?)
The code I am using is as follows:
fc <- featureCounts(bamDir, isGTFAnnotationFile = TRUE, annot.ext = "Mus_musculus.GRCm38.94.chr.gtf", isPairedEnd = FALSE, useMetaFeatures = FALSE) OR
fc <- featureCounts(bamDir, isGTFAnnotationFile = TRUE, annot.ext = "Mus_musculus.GRCm38.94.chr.gtf", isPairedEnd = FALSE, useMetaFeatures = "exons")
These both produce the following results:
ENSMUSG00000053080 | 3 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 71 |
ENSMUSG00000053080 | 55 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 | 8 |
ENSMUSG00000053080 | 0 |
ENSMUSG00000053080 |
0 |
This is icnredible confusing as there are only 5 exons for this gene, and clearly the counts are far too low to reflect the total counts annotated to the gene as a whole.. Can anybody advice what I might be doing wrong here and why this result is being produced?
And finally, is there an intuitive way of plotting counts per exons in an easy to follow way, maybe thorugh ggplo2 just plotting counts on y axis by exon number on x axis? or a way of passing this festurecount result to ggbio?
Thank you for your help!
Hi Wei,
Thank you so much for your quick response! I really appreciate the clear explanation. May I ask one quick follow up question please.. Given that reads span exon junctions, looking at the annotation slot, similar start sites have different site ends... So how does one tackle this to produce a total number of counts for the correct amount of exons in the gene.. Is it to merge the duplicates?
Or is this specifically what flattenGTF will do by merging non-overlapping exon bins? If so, do I need to pass the flattened GTF file which is a GFF? because passing the GTF path after flattening GTF produces the same result with duplicated exons...
Thank you!!
I would recommend you to merge the duplicated or overlapping exons. This is what
flattenGTF
function does. Output of this function is adata.frame
object that contains a SAF format annotation. Take a look at the help page forflattenGTF
for more information. You can then feed thedata.frame
object tofeatureCounts
for counting.Or you may directly use the inbuilt annotation for GRCm38/mm10 in
featureCounts
. The inbuilt annotation has all overlapping/duplicated exons merged. By default,featureCounts
uses the inbuilt mm10 annotation to count reads (annot.inbuilt = "mm10"
).thank you so much for your assistance! Really appreciate this, it has been so helpful!