Question: featureCount - very less Successfully assigned reads
0
gravatar for amoltej
4.8 years ago by
amoltej10
Australia
amoltej10 wrote:

hello all, 

I am using featurecount for differential expression analysis. After running feature count I found out there are very less number of reads assigned successfully (33%). where as my SAM file (aligned by STAR) showing 82% mapped reads. I tried both counting by exon and gene feature. Featurecount code that I have used - 

featureCounts -M -p -O -t exon -a /home/amol/work/Cuffcompare/cuffcmp.combined.gtf -o Bcyt.count3 /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam

Summery of feature count - 

 Input files : 1 BAM file                                       ||
||                           P /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam  ||
||                                                                            ||
||             Output file : Bcyt.count3                                      ||
||             Annotations : /home/amol/work/Cuffcompare/cuffcmp.combined ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : no                                               ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : counted                                          ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/amol/work/Cuffcompare/cuffcmp.combined.gtf ...  ||
||    Features : 84237                                                        ||
||    Meta-features : 16590                                                   ||
||    Chromosomes : 4940                                                      ||
||                                                                            ||
|| Process BAM file /home/amol/work/STAR/Bcyt/Aligned.nsorted.bam...          ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||
||    2580055 reads have missing mates.                                       ||
||    Input was converted to a format accepted by featureCounts.              ||
||    Total fragments : 5314957                                               ||
||    Successfully assigned fragments : 1771088 (33.3%)                       ||
||    Running time : 2.54 minutes  

 

can somebody please tell me whats going wrong? 

thank you 

amol

 

rsubread featurecounts • 8.3k views
ADD COMMENTlink modified 4.8 years ago by James W. MacDonald51k • written 4.8 years ago by amoltej10

Perhaps I'm being pedantic, but this isn't really a Bioconductor question. It seems like you're using the standalone C version of featureCounts, rather than that in Rsubread. A more appropriate forum might be the Subread mailing list at https://groups.google.com/forum/#!forum/subread.

ADD REPLYlink written 4.8 years ago by Aaron Lun25k

Thanks Aaron for directing me! 

But does this really matters because background is same, process is same. and you are understanding why it is happening. whether it is C version or R version how would it be different? 

 

ADD REPLYlink written 4.8 years ago by amoltej10

From what I understand, there could be some slight differences with versioning and/or default parameter settings between the Rsubread and standalone C implementations. So, it's worthwhile posting to the appropriate forum.

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Aaron Lun25k
Answer: featureCount - very less Successfully assigned reads
6
gravatar for James W. MacDonald
4.8 years ago by
United States
James W. MacDonald51k wrote:

I think you might have a fundamental misunderstanding of what STAR and featureCounts are doing.

The former is aligning reads to the genome, and the latter is counting which of those aligned reads appear to overlap known genes. It's expected that the percentage of reads that align to the genome will be larger than the percentage that align to genes, as genes make up only about 3% of the genome.

In addition, unless you did polyA selection as part of the sample processing, you likely had lots of transcripts that still contained intronic sequences (in our experience, non-polyA selected mRNA tends to be 'polluted' with about as many intronic reads as exonic reads for Ion Proton data, and about 50% more reads for HiSeq data), which featureCounts isn't going to count. As an example here are results from two bamfiles I have in hand (HiSeq data):

Rsubread featureCounts()

|| Process BAM file ../data/batch1/Sample_7749.bam...                         ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 26203397                                              ||
||    Successfully assigned fragments : 15261620 (58.2%)                      ||
||    Running time : 0.88 minutes                                             ||
||                                                                            ||
|| Process BAM file ../data/batch1/Sample_7752.bam...                         ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 18057482                                              ||
||    Successfully assigned fragments : 11065817 (61.3%)                      ||
||    Running time : 0.56 minutes                                             ||

 

Now checking overlaps for those two using summarizeOverlaps():

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
Loading required package: GenomicFeatures
Loading required package: GenomicRanges
> gns <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
> library(Rsamtools)
> library(GenomicAlignments)
> bfl <- BamFileList(fls[1:2])
> olaps <- summarizeOverlaps(gns, bfl)
> colSums(assays(olaps)$counts)
Sample_7749.bam Sample_7752.bam
       22143603        15200876       

> 22143603/26203397*100
[1] 84.50661
> 15200876/18057482*100
[1] 84.18049

So I go from about 60% to 84% if I include introns (modulo differences between the genes mapped by TxDb.Mmusculus.UCSC.mm10.knownGene and the mm10 file that comes with Rsubread). But still there are other aligned reads that are not within the known genes I align against, which can be real genes that we don't know about, or reads with mismatches that match perfectly to some other genomic location, or DNA that magically slipped through the mRNA purification step, etc.

ADD COMMENTlink written 4.8 years ago by James W. MacDonald51k

Dear James, 

Thank you for detailed description. I did do poly A selection while library preparation. and if you look carefully at the code then you would notice that I have used cuffcompare generated GTF file which has (to my knowledge) all the transcripts newly mapped as well as previously reported. That's why  I was expecting high percentage of successfully assigned reads. 

do you have any comment on that?

Thanks 

Amol

ADD REPLYlink written 4.8 years ago by amoltej10

You only have just over 5M reads, which is really not enough reads to accurately detect new transcripts, so I wouldn't be so sure that what you got from cuffcompare is that useful. In my experience (and in the experience of the SEQC/MAQC III consortium), you need really deep sequencing for transcript detection.

Is this a species without an existing genome that you could align to?

ADD REPLYlink written 4.8 years ago by James W. MacDonald51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 308 users visited in the last hour