featureCount - very less Successfully assigned reads
Entering edit mode
amoltej ▴ 10
Last seen 5.7 years ago

hello all, 

I am using featurecount for differential expression analysis. After running feature count I found out there are very less number of reads assigned successfully (33%). where as my SAM file (aligned by STAR) showing 82% mapped reads. I tried both counting by exon and gene feature. Featurecount code that I have used - 

featureCounts -M -p -O -t exon -a /home/amol/work/Cuffcompare/cuffcmp.combined.gtf -o Bcyt.count3 /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam

Summery of feature count - 

 Input files : 1 BAM file                                       ||
||                           P /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam  ||
||                                                                            ||
||             Output file : Bcyt.count3                                      ||
||             Annotations : /home/amol/work/Cuffcompare/cuffcmp.combined ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : counted                                          ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/amol/work/Cuffcompare/cuffcmp.combined.gtf ...  ||
||    Features : 84237                                                        ||
||    Meta-features : 16590                                                   ||
||    Chromosomes : 4940                                                      ||
||                                                                            ||
|| Process BAM file /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam...          ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||
||    2580055 reads have missing mates.                                       ||
||    Input was converted to a format accepted by featureCounts.              ||
||    Total fragments : 5314957                                               ||
||    Successfully assigned fragments : 1771088 (33.3%)                       ||
||    Running time : 2.54 minutes  


can somebody please tell me whats going wrong? 

thank you 



featurecounts rsubread • 15k views
Entering edit mode

Perhaps I'm being pedantic, but this isn't really a Bioconductor question. It seems like you're using the standalone C version of featureCounts, rather than that in Rsubread. A more appropriate forum might be the Subread mailing list at https://groups.google.com/forum/#!forum/subread.

Entering edit mode

Thanks Aaron for directing me! 

But does this really matters because background is same, process is same. and you are understanding why it is happening. whether it is C version or R version how would it be different? 


Entering edit mode

From what I understand, there could be some slight differences with versioning and/or default parameter settings between the Rsubread and standalone C implementations. So, it's worthwhile posting to the appropriate forum.

Entering edit mode
Last seen 1 day ago
United States

I think you might have a fundamental misunderstanding of what STAR and featureCounts are doing.

The former is aligning reads to the genome, and the latter is counting which of those aligned reads appear to overlap known genes. It's expected that the percentage of reads that align to the genome will be larger than the percentage that align to genes, as genes make up only about 3% of the genome.

In addition, unless you did polyA selection as part of the sample processing, you likely had lots of transcripts that still contained intronic sequences (in our experience, non-polyA selected mRNA tends to be 'polluted' with about as many intronic reads as exonic reads for Ion Proton data, and about 50% more reads for HiSeq data), which featureCounts isn't going to count. As an example here are results from two bamfiles I have in hand (HiSeq data):

Rsubread featureCounts()

|| Process BAM file ../data/batch1/Sample_7749.bam...                         ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 26203397                                              ||
||    Successfully assigned fragments : 15261620 (58.2%)                      ||
||    Running time : 0.88 minutes                                             ||
||                                                                            ||
|| Process BAM file ../data/batch1/Sample_7752.bam...                         ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 18057482                                              ||
||    Successfully assigned fragments : 11065817 (61.3%)                      ||
||    Running time : 0.56 minutes                                             ||


Now checking overlaps for those two using summarizeOverlaps():

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: GenomicFeatures
Loading required package: GenomicRanges
> gns <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
> library(Rsamtools)
> library(GenomicAlignments)
> bfl <- BamFileList(fls[1:2])
> olaps <- summarizeOverlaps(gns, bfl)
> colSums(assays(olaps)$counts)
Sample_7749.bam Sample_7752.bam
       22143603        15200876       

> 22143603/26203397*100
[1] 84.50661
> 15200876/18057482*100
[1] 84.18049

So I go from about 60% to 84% if I include introns (modulo differences between the genes mapped by TxDb.Mmusculus.UCSC.mm10.knownGene and the mm10 file that comes with Rsubread). But still there are other aligned reads that are not within the known genes I align against, which can be real genes that we don't know about, or reads with mismatches that match perfectly to some other genomic location, or DNA that magically slipped through the mRNA purification step, etc.

Entering edit mode

Dear James, 

Thank you for detailed description. I did do poly A selection while library preparation. and if you look carefully at the code then you would notice that I have used cuffcompare generated GTF file which has (to my knowledge) all the transcripts newly mapped as well as previously reported. That's why  I was expecting high percentage of successfully assigned reads. 

do you have any comment on that?



Entering edit mode

You only have just over 5M reads, which is really not enough reads to accurately detect new transcripts, so I wouldn't be so sure that what you got from cuffcompare is that useful. In my experience (and in the experience of the SEQC/MAQC III consortium), you need really deep sequencing for transcript detection.

Is this a species without an existing genome that you could align to?


Login before adding your answer.

Traffic: 251 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6