Question

Subread FeatureCounts for Exon level quantification: Length estimate for Exons as low as 1 base-pair

0

Entering edit mode

jennyl.smith12 • 0

@jennylsmith12-16139

Last seen 3.3 years ago

Hello,

I am working on quantifying RNASeq data at the exon level, which is intended for use in the DEXSeq package. I followed the documentation provided in the DEXSeq tutorial (http://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf, page 23), as well as the commands on Subread to DEXSeq helper code (https://github.com/vivekbhr/Subread_to_DEXSeq).

First to produce the GTF File, I used the following code:

GTF="/Reference_Data/GRCh37/gtf/Homo_sapiens.GRCh37.87.gtf"

python2 dexseq_prepare_annotation2.py -f Homo_sapiens.GRCh37.87_collapsed.gtf $GTF \ Homo_sapiens.GRCh37.87_collapsed.gff

Then I ran subread on the command line as submitted jobs on our shared HPC. We use the SLURM job scheduler.

#Program:featureCounts v1.5.3; Command:

featureCounts -p -B -s 2 -T 16 -Q 10 -f -O -F GTF -a Homo_sapiens.GRCh37.87_collapsed.gtf -t exon -g gene_id  \ -o sample_withJunctionsOnGenome_dupsFlagged_stranded.exon.cts  \ sample_withJunctionsOnGenome_dupsFlagged.bam

I have two questions:

1) Why would I have exons for a gene that are only a few base-pairs long? I have some entries, in the example output below, where the length is calculated as only 1 base-pair. This does not seem representative of the true expression of a real gene, which would not have exons of such short length. Could it be there was some error in the commands run above or in the way the GTF was produced?

2) The number of assigned features for this quantification was particularly low. I have two sets of RNASeq data; 1) created with Poly-A enrichment in the library prep and 2) produced with the ribodepletion (RBS) library protocol. Both sets had paired-end, strand-specific, 75bp reads, and were processed through identical bioinformatic pipeline from raw FASTQ to BAM. I did check that the orientation of the paired reads were the same for both library preparation protocols. It was these BAMs that I input to Subread featurecounts() function.

For my poly-A library, I was getting on average 60% of my total reads assigned to features using the same commands as above, and same GTF reference.

For my RBS library, I am getting on average 30-40% of my total reads assigned to features, again using the same commands as above and same GTF reference. The vast majority of reads, from the summary files produced by Subread and then summarized into one figure by MultiQC, for all samples were classified as "Unassigned_NoFeatures".

I was thinking that this might make sense due to many additional non-protein coding RNAs that the Ribodepletion protocol will sequence. Has anyone seen this type of difference before? Again, should I be optimizing my commands for ribodeplete RNASeq that I am not aware of?

Thank you so much,

Jenny

Example of my output for one sample is below:

Geneid	Chr	Start	End	Strand	Length	Sample_withJunctionsOnGenome_dupsFlagged.bam
ENSG00000223972	1	11869	11871	+	3	0
ENSG00000223972	1	11872	11873	+	2	0
ENSG00000223972	1	11874	12009	+	136	1
ENSG00000223972	1	12010	12057	+	48	2
ENSG00000223972	1	12058	12178	+	121	3
ENSG00000223972	1	12179	12227	+	49	9
ENSG00000223972	1	12595	12612	+	18	1
ENSG00000223972	1	12613	12697	+	85	20
ENSG00000223972	1	12698	12721	+	24	10
ENSG00000223972	1	12975	13052	+	78	1
ENSG00000223972	1	13221	13224	+	4	3
ENSG00000223972	1	13225	13374	+	150	3
ENSG00000223972	1	13375	13402	+	28	0
ENSG00000223972	1	13403	13452	+	50	10
ENSG00000223972	1	13453	13655	+	203	7
ENSG00000223972	1	13656	13660	+	5	1
ENSG00000223972	1	13661	13670	+	10	1
ENSG00000223972	1	13671	14409	+	739	1
ENSG00000223972	1	14410	14412	+	3	0
ENSG00000227232	1	14363	14403	-	41	0
ENSG00000227232	1	14404	14410	-	7	0
ENSG00000227232	1	14411	14501	-	91	9
ENSG00000227232	1	14502	14502	-	1	8
ENSG00000227232	1	14503	14829	-	327	51

subread featurecounts dexseq rnaseq • 3.2k views

ADD COMMENT • link updated 6.0 years ago by Gordon Smyth 51k • written 6.0 years ago by jennyl.smith12 • 0

score 2 · Answer 1 · 2018-07-15

In answer to your first question, that's what dexseq_prepare_annotation2.py does. It collapses the original exons into smaller segments according to the intersections between different exons. If two original exons overlap except for one base, then it will output a collapsed exon consisting of that one base. This is presumably explained in the tutorial or documentation.

score 1 · Answer 2 · 2018-07-15

Have you looked into your GTF file to see if these exons are indeed very short? featureCounts just uses your provided annotation for counting and it does not modify your annotation.

The counting percentage for your polyA sample from featureCounts sounds reasonable. I dont have much experience with RBS sequencing, but I cant think of any ways to improve your counting since apparently most of your reads do not overlap exons.