Search
Question: Subread FeatureCounts for Exon level quantification: Length estimate for Exons as low as 1 base-pair
0
gravatar for jennyl.smith12
4 weeks ago by
jennyl.smith120 wrote:

Hello, 

I am working on quantifying RNASeq data at the exon level, which is intended for use in the DEXSeq package. I followed the documentation provided in the DEXSeq tutorial (http://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.pdf, page 23), as well as the commands on Subread to DEXSeq helper code (https://github.com/vivekbhr/Subread_to_DEXSeq)

First to produce the GTF File, I used the following code: 

GTF="/Reference_Data/GRCh37/gtf/Homo_sapiens.GRCh37.87.gtf"

python2 dexseq_prepare_annotation2.py -f Homo_sapiens.GRCh37.87_collapsed.gtf $GTF \ Homo_sapiens.GRCh37.87_collapsed.gff

 

Then I ran subread on the command line as submitted jobs on our shared HPC. We use the SLURM job scheduler. 

#Program:featureCounts v1.5.3; Command:

featureCounts -p -B -s 2 -T 16 -Q 10 -f -O -F GTF -a Homo_sapiens.GRCh37.87_collapsed.gtf -t exon -g gene_id  \ -o sample_withJunctionsOnGenome_dupsFlagged_stranded.exon.cts  \ sample_withJunctionsOnGenome_dupsFlagged.bam

 

I have two questions:

1) Why would I have exons for a gene that are only a few base-pairs long? I have some entries, in the example output below, where the length is calculated as only 1 base-pair.  This does not seem representative of the true expression of a real gene, which would not have exons of such short length. Could it be there was some error in the commands run above or in the way the GTF was produced? 

2) The number of assigned features for this quantification was particularly low.  I have two sets of RNASeq data; 1) created with Poly-A enrichment in the library prep and 2) produced with the ribodepletion (RBS) library protocol. Both sets had paired-end, strand-specific, 75bp reads, and were processed through identical bioinformatic pipeline from raw FASTQ to BAM.  I did check that the orientation of the paired reads were the same for both library preparation protocols.  It was these BAMs that I input to Subread featurecounts() function.  

For my poly-A library, I was getting on average 60% of my total reads assigned to features using the same commands as above, and same GTF reference. 

For my RBS library, I am getting on average 30-40% of my total reads assigned to features, again using the same commands as above and same GTF reference. The vast majority of reads, from the summary files produced by Subread and then summarized into one figure by MultiQC, for all samples were classified as "Unassigned_NoFeatures".  

I was thinking that this might make sense due to many additional non-protein coding RNAs that the Ribodepletion protocol will sequence.  Has anyone seen this type of difference before? Again, should I be optimizing my commands for ribodeplete RNASeq that I am not aware of?  

 

Thank you so much, 

Jenny

 

 

Example of my output for one sample is below:

Geneid Chr Start End Strand Length Sample_withJunctionsOnGenome_dupsFlagged.bam
ENSG00000223972 1 11869 11871 + 3 0
ENSG00000223972 1 11872 11873 + 2 0
ENSG00000223972 1 11874 12009 + 136 1
ENSG00000223972 1 12010 12057 + 48 2
ENSG00000223972 1 12058 12178 + 121 3
ENSG00000223972 1 12179 12227 + 49 9
ENSG00000223972 1 12595 12612 + 18 1
ENSG00000223972 1 12613 12697 + 85 20
ENSG00000223972 1 12698 12721 + 24 10
ENSG00000223972 1 12975 13052 + 78 1
ENSG00000223972 1 13221 13224 + 4 3
ENSG00000223972 1 13225 13374 + 150 3
ENSG00000223972 1 13375 13402 + 28 0
ENSG00000223972 1 13403 13452 + 50 10
ENSG00000223972 1 13453 13655 + 203 7
ENSG00000223972 1 13656 13660 + 5 1
ENSG00000223972 1 13661 13670 + 10 1
ENSG00000223972 1 13671 14409 + 739 1
ENSG00000223972 1 14410 14412 + 3 0
ENSG00000227232 1 14363 14403 - 41 0
ENSG00000227232 1 14404 14410 - 7 0
ENSG00000227232 1 14411 14501 - 91 9
ENSG00000227232 1 14502 14502 - 1 8
ENSG00000227232 1 14503 14829 - 327 51

 

 

 

ADD COMMENTlink modified 4 weeks ago by Gordon Smyth34k • written 4 weeks ago by jennyl.smith120
2
gravatar for Gordon Smyth
4 weeks ago by
Gordon Smyth34k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth34k wrote:

In answer to your first question, that's what dexseq_prepare_annotation2.py does. It collapses the original exons into smaller segments according to the intersections between different exons. If two original exons overlap except for one base, then it will output a collapsed exon consisting of that one base. This is presumably explained in the tutorial or documentation.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Gordon Smyth34k
1
gravatar for Wei Shi
4 weeks ago by
Wei Shi2.9k
Australia
Wei Shi2.9k wrote:

Have you looked into your GTF file to see if these exons are indeed very short? featureCounts just uses your provided annotation for counting and it does not modify your annotation.

The counting percentage for your polyA sample from featureCounts sounds reasonable. I dont have much experience with RBS sequencing, but I cant think of any ways to improve your counting since apparently most of your reads do not overlap exons.

ADD COMMENTlink written 4 weeks ago by Wei Shi2.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 323 users visited in the last hour