Hi,
I have been using featureCounts to obtain both exon- and gene-level read counts (reads were aligned with STAR). For one particular gene (ARID5B, which has 12 exons: 5 unique to one isoform, 2 unique to another isoform and 5 shared), I find that the read count summed over the 12 exons is greater than the gene-level read count. This should not be possible, as featureCounts uses the exon-union method for gene-level counting. Below are the relevant parameter settings for featureCounts:
Gene-level counts:
annot.ext="/home/inah/RefGTF/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf",
isGTFAnnotationFile=TRUE,
GTF.featureType="exon", GTF.attrType="gene_id", useMetaFeatures=TRUE,
allowMultiOverlap=TRUE,
minOverlap=1,
largestOverlap=TRUE,
strandSpecific=2,
isPairedEnd=TRUE
Exon-level counts:
annot.ext="/home/inah/RefGTF/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf",
isGTFAnnotationFile=TRUE,
GTF.featureType="exon", GTF.attrType="exon_id", useMetaFeatures=TRUE,
allowMultiOverlap=TRUE,
minOverlap=1,
largestOverlap=TRUE,
strandSpecific=2,
isPairedEnd=TRUE
For the exon-level counts, I set useMetaFeatures=TRUE because with FALSE the count matrix appears to contain multiple identical rows for exons that are shared by several isoforms (with TRUE, only one of these rows is present).
Can someone give me a hint as to why my exon-sum count is higher than the gene-level count? Is there something wrong with my parameter settings for the exon-level counting?
Thanks, Ina
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_1.24.1
Yes, this should make the total exonic counts less than the total counts for genes. However, it will result in the loss of exon-spanning reads, and your exon-level counting result would not be as accurate.
What is the problem with getting more counts for exons? You have to count all the reads originating from an exon, regardless of whether they are exon-spanning reads or reads falling entirely within the exon.
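To make the point about exon-spanning reads concrete, here is a small toy sketch in plain Python. It is not featureCounts and does not reproduce its algorithm: the exon coordinates and reads are made up, and strand, paired ends, minOverlap and largestOverlap are all ignored. It only assumes that a read is counted once for every exon it overlaps at the exon level, and once per gene at the gene level.

```python
# Toy model only: hypothetical exons and reads, not real ARID5B data,
# and not how featureCounts is implemented internally.
exons = {"exon1": (100, 200), "exon2": (300, 400), "exon3": (500, 600)}

# Each read is a list of aligned blocks (start, end), inclusive.
# A spliced (exon-spanning) read has two blocks.
reads = {
    "read_a": [(120, 170)],              # entirely within exon1
    "read_b": [(180, 200), (300, 329)],  # spans exon1 and exon2
    "read_c": [(350, 399)],              # entirely within exon2
    "read_d": [(390, 400), (500, 539)],  # spans exon2 and exon3
}


def overlaps(block, exon):
    """True if two inclusive intervals share at least one base."""
    return block[0] <= exon[1] and exon[0] <= block[1]


def exons_hit(blocks):
    """Names of all exons overlapped by any block of a read."""
    return {name for name, exon in exons.items()
            if any(overlaps(b, exon) for b in blocks)}


exon_counts = {name: 0 for name in exons}
gene_count = 0
for blocks in reads.values():
    hit = exons_hit(blocks)
    for name in hit:
        exon_counts[name] += 1  # exon level: one count per overlapped exon
    if hit:
        gene_count += 1         # gene level: one count per read

exon_sum = sum(exon_counts.values())
print("exon counts:", exon_counts)
print("sum over exons:", exon_sum)
print("gene count:", gene_count)
```

In this toy case the four reads give a gene count of 4, but the two spliced reads are each counted in two exons, so the exon counts sum to 6. Under the same assumption, the exon sum can never be smaller than the gene count, and it is larger whenever at least one read spans more than one exon.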