Question

read count summed over exons is greater than the gene-level read count using featureCounts

inah ▴ 10

@inah-13176

Last seen 6.7 years ago

Hi,
I have been using featureCounts to obtain both exon- and gene-level read counts (reads were aligned with STAR). For one particular gene (ARID5B, which has 12 exons: 5 unique to one isoform, 2 unique to another isoform and 5 shared), I find that the read count summed over the 12 exons is greater than the gene-level read count. I thought this should not be possible, as featureCounts uses the exon-union method for gene-level counting. Below are the relevant parameter settings for featureCounts:

gene-based count:

annot.ext="/home/inah/RefGTF/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf",
isGTFAnnotationFile=TRUE,
GTF.featureType="exon", GTF.attrType="gene_id", useMetaFeatures=TRUE,
allowMultiOverlap=TRUE,
minOverlap=1,
largestOverlap=TRUE,
strandSpecific=2,
isPairedEnd=TRUE

exon-level counts:

annot.ext="/home/inah/RefGTF/GRCh38/annotation/Homo_sapiens.GRCh38.85.gtf",
isGTFAnnotationFile=TRUE,
GTF.featureType="exon", GTF.attrType="exon_id", useMetaFeatures=TRUE,
allowMultiOverlap=TRUE,
minOverlap=1,
largestOverlap=TRUE,
strandSpecific=2,
isPairedEnd=TRUE
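
For readers who want to see how the two parameter lists above fit into actual calls, here is a minimal sketch. The wrapper name `run_counts` and its arguments `bam_files`, `gtf_file` and `attr_type` are hypothetical names introduced for illustration; the BAM file paths were not given in the post, so they are left as an argument. Only the `featureCounts` arguments listed above are used.

```r
# Sketch only: wraps the settings from the post in one function so that the
# gene-level and exon-level runs differ in a single argument (GTF.attrType).
# `run_counts`, `bam_files`, `gtf_file` and `attr_type` are hypothetical names.
run_counts <- function(bam_files, gtf_file, attr_type = c("gene_id", "exon_id")) {
  attr_type <- match.arg(attr_type)
  if (!requireNamespace("Rsubread", quietly = TRUE)) {
    stop("The Rsubread package is required for this sketch.")
  }
  Rsubread::featureCounts(
    files               = bam_files,
    annot.ext           = gtf_file,
    isGTFAnnotationFile = TRUE,
    GTF.featureType     = "exon",
    GTF.attrType        = attr_type,   # "gene_id" or "exon_id"
    useMetaFeatures     = TRUE,
    allowMultiOverlap   = TRUE,
    minOverlap          = 1,
    largestOverlap      = TRUE,
    strandSpecific      = 2,
    isPairedEnd         = TRUE
  )
}

# Hypothetical usage (the file names are placeholders, not from the post):
# gene_fc <- run_counts("sample.bam", "annotation.gtf", attr_type = "gene_id")
# exon_fc <- run_counts("sample.bam", "annotation.gtf", attr_type = "exon_id")
```

Keeping everything except `GTF.attrType` fixed makes it easier to see that any difference between the two count tables comes from the counting level and not from another setting.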

For the exon-level counts, I set useMetaFeatures=TRUE because if it is FALSE, then it looks like the count matrix contains multiple (identical) rows for exons which are shared by several isoforms (when TRUE only one of these rows is present).

Can someone give me a hint as to why my exon-sum count is higher than the gene-level count? Is there something wrong with my parameter settings for the exon-level counting?

Thanks, Ina

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Rsubread_1.24.1

rsubread featurecounts exon mRNAseq • 3.2k views

ADD COMMENT • link updated 7.7 years ago by Wei Shi ★ 3.6k • written 7.7 years ago by inah ▴ 10

score 0 · Answer 1 · 2017-06-08

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 9 weeks ago

Australia/Melbourne

Hi Ina, this is not unexpected since exon-spanning reads (reads overlapping more than one exon) were counted more than once in your exon-level counting but they were counted only once in your gene-level counting. These reads should be counted more than once in your exon-level counting since they originate from multiple exons and each overlapping exon should receive a count. Your commands seem fine.
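
The effect described above can be illustrated with a small toy model in base R. This is only a sketch under simplifying assumptions: it uses plain interval overlap, all coordinates are made up, and it ignores strandedness, read pairing and the `largestOverlap` setting, so it is not the featureCounts algorithm itself.

```r
# Toy model only: plain interval overlap, not the featureCounts algorithm.
# All coordinates below are made up for illustration.
exons <- data.frame(
  exon_id = c("E1", "E2", "E3"),
  start   = c(100, 300, 500),
  end     = c(200, 400, 600)
)

# Each read is a set of aligned blocks; a spliced read has two blocks.
reads <- list(
  r1 = data.frame(start = 120, end = 170),                  # inside E1
  r2 = data.frame(start = c(180, 300), end = c(200, 329)),  # spans E1 and E2
  r3 = data.frame(start = c(380, 500), end = c(400, 529)),  # spans E2 and E3
  r4 = data.frame(start = 520, end = 570)                   # inside E3
)

# TRUE for every exon that at least one block of the read overlaps.
hits_exons <- function(read) {
  sapply(seq_len(nrow(exons)), function(i) {
    any(read$start <= exons$end[i] & read$end >= exons$start[i])
  })
}

hit_matrix <- t(sapply(reads, hits_exons))   # rows = reads, columns = exons
colnames(hit_matrix) <- exons$exon_id

exon_counts <- colSums(hit_matrix)            # every overlapped exon gets a count
gene_count  <- sum(rowSums(hit_matrix) > 0)   # each read counted once for the gene

print(exon_counts)
print(c(sum_over_exons = sum(exon_counts), gene_level = gene_count))
```

In this toy data the two spliced reads each contribute to two exons, so the exon counts sum to 6 while the gene-level count is 4.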

ADD COMMENT • link 7.7 years ago Wei Shi ★ 3.6k

To be more specific, is this because both "useMetaFeatures" and "allowMultiOverlap" are set to TRUE in the command? If one sets only "useMetaFeatures" to TRUE and "allowMultiOverlap" to FALSE, should the counts summed over exons then be smaller than the gene-level count? (The manual says that at the meta-feature level, exon-spanning reads are counted only once even if they overlap multiple exons.)

ADD REPLY • link 6.7 years ago minervajunjun • 0

Yes, this should make the total exonic counts smaller than the total counts for genes. However, it will result in the loss of exon-spanning reads, and your exon-level counting result won't be as accurate.

What is the problem with getting more counts for exons? You have to count all the reads originating from an exon, whether they are exon-spanning reads or reads falling entirely within the exon.
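
The trade-off discussed in this reply can be sketched with a toy model in base R. The reads, exon names and the helper `count_exons` are hypothetical and only mimic the idea of allowing or disallowing multi-overlap at the exon level; this is not how featureCounts is implemented.

```r
# Toy model only; hypothetical reads listed by the exons they overlap.
read_hits <- list(
  r1 = "E1",
  r2 = c("E1", "E2"),  # exon-spanning read
  r3 = c("E2", "E3"),  # exon-spanning read
  r4 = "E3"
)
exon_ids <- c("E1", "E2", "E3")

# Count reads per exon; when multi-overlap is not allowed, a read that
# overlaps more than one exon is left unassigned.
count_exons <- function(read_hits, allow_multi_overlap) {
  counts <- setNames(integer(length(exon_ids)), exon_ids)
  for (hits in read_hits) {
    if (length(hits) > 1 && !allow_multi_overlap) next
    counts[hits] <- counts[hits] + 1L
  }
  counts
}

with_multi    <- count_exons(read_hits, allow_multi_overlap = TRUE)
without_multi <- count_exons(read_hits, allow_multi_overlap = FALSE)
gene_count    <- length(read_hits)  # every read overlaps the gene exactly once

print(rbind(with_multi, without_multi))
print(c(sum_with = sum(with_multi), sum_without = sum(without_multi),
        gene_level = gene_count))
```

With multi-overlap allowed the exon counts sum to 6, above the gene-level count of 4. With it disallowed they sum to 2, below the gene-level count, and exon E2 drops to zero because every read touching it is exon-spanning.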

ADD REPLY • link 6.7 years ago Wei Shi ★ 3.6k