Unassigned reads in Rsubread featureCounts
am39 • 0
@am39-10874

I'm using the featureCounts function of Rsubread to assign aligned reads to features. Roughly 20% of my reads are unassigned (in the "NoFeature" category). Is there any way to see which reads these are? (Maybe it's possible in command-line Subread, outside of R?) I'd like to look at my unassigned reads to understand whether they are contamination, come from an un-annotated part of the genome, or something else.

Thank you!

Rsubread • 2.8k views
Wei Shi ★ 3.6k
@wei-shi-2183
Australia/Melbourne/Olivia Newton-John …

The reportReads parameter of featureCounts lets you output the assignment result for each read, so you can then identify the unassigned reads.
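As a rough sketch of the second step (picking the unassigned reads out of the per-read output), the snippet below filters a tab-delimited table by its status column. It assumes the read name is in the first column and the assignment status in the second, and that the NoFeature status is spelled `Unassigned_NoFeature`; both are assumptions to check against your own output file. The example rows are made up, and Python is used only for illustration; the same filter is a few lines in R.

```python
import io
from collections import Counter


def read_assignments(handle):
    """Yield (read_name, status) pairs from a per-read assignment table.

    Assumption: tab-separated lines whose first two columns are the read
    name and the assignment status. Any further columns are ignored.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[0], fields[1]


# Made-up rows standing in for a real per-read output file.
example = io.StringIO(
    "read1\tAssigned\t1\tgeneA\n"
    "read2\tUnassigned_NoFeature\t-1\tNA\n"
    "read3\tUnassigned_NoFeature\t-1\tNA\n"
    "read4\tUnassigned_Ambiguity\t-1\tNA\n"
)

records = list(read_assignments(example))

# Tally every status, then keep the names of the NoFeature reads.
status_counts = Counter(status for _, status in records)
no_feature_reads = [
    name for name, status in records if status == "Unassigned_NoFeature"
]

print(status_counts)
print(no_feature_reads)
```

For a real file, replace the `io.StringIO` object with `open(path)`. The resulting read names can then be used to pull the matching alignments out of the BAM file and see where they map.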


Thank you - that's what I needed!

@gordon-smyth
WEHI, Melbourne, Australia

The NoFeature reads are from un-annotated parts of the genome. They can't be "contamination" because then they wouldn't be aligned to the genome and hence wouldn't be counted by featureCounts in the first place.

The R version of Rsubread has the same functionality as the command-line version.


Right, sorry I wasn't thinking clearly.

Nevertheless, I'd like to take a look at what those unassigned reads are (i.e., which un-annotated regions of the genome they come from). Is there anywhere in the output where I can find those unassigned reads? I don't see it in the R documentation of the featureCounts function.


I'd say that you can have human DNA contamination and those reads would align to the human genome, typically to intergenic regions (outside the boundaries of annotated genes) and, to a lesser extent, to intronic regions.


How much would DNA or rRNA contamination affect read quantification by featureCounts?

In my case, there are 31032304 total alignments, of which 13809791 (44.5%) were successfully assigned.

As I understand it, removing the contaminants (from the BAM file) can increase the percentage of successfully assigned alignments, because the total number of alignments may go down while the number of successfully assigned alignments remains the same. Please correct me if I'm wrong, as I am new to this.
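The arithmetic behind that reasoning can be sketched as follows. The two counts are the ones quoted above. The number of removed alignments is a made-up value for illustration, and the sketch assumes that every removed alignment was unassigned, which need not hold if some contaminant reads overlap exons.

```python
# Counts quoted in the post.
total_alignments = 31_032_304
assigned_alignments = 13_809_791


def assigned_percent(assigned, total):
    """Percentage of alignments that were successfully assigned."""
    return 100.0 * assigned / total


current = assigned_percent(assigned_alignments, total_alignments)
print(f"current: {current:.1f}%")  # 44.5%

# Hypothetical: 5 million alignments are removed as contamination and
# all of them were previously unassigned. The numerator is unchanged
# and only the denominator shrinks, so the percentage goes up.
removed_unassigned = 5_000_000
after_removal = assigned_percent(
    assigned_alignments, total_alignments - removed_unassigned
)
print(f"after removal: {after_removal:.1f}%")
```

If some of the removed alignments had been assigned to a gene, the numerator would shrink as well, so the assigned count would not stay exactly the same.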

HISAT2 gave the following alignment stats:

27183499 reads; of these:
  27183499 (100.00%) were paired; of these:
    4206782 (15.48%) aligned concordantly 0 times
    22447624 (82.58%) aligned concordantly exactly 1 time
    529093 (1.95%) aligned concordantly >1 times
    ----
    4206782 pairs aligned concordantly 0 times; of these:
      847251 (20.14%) aligned discordantly 1 time
    ----
    3359531 pairs aligned 0 times concordantly or discordantly; of these:
      6719062 mates make up the pairs; of these:
        3826578 (56.95%) aligned 0 times
        2621370 (39.01%) aligned exactly 1 time
        271114 (4.03%) aligned >1 times
92.96% overall alignment rate

Thanks in advance,
Ekta
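For reference, the numbers in the alignment summary above are internally consistent. The sketch below recomputes the overall alignment rate as the fraction of mates that aligned at least once, which is one reading of the summary that reproduces the reported 92.96%. The variable names are invented for this example.

```python
# Counts copied from the alignment summary in the post.
total_pairs = 27_183_499
pairs_concordant_0 = 4_206_782
pairs_discordant_1 = 847_251
mates_aligned_0 = 3_826_578
mates_aligned_1 = 2_621_370
mates_aligned_multi = 271_114

# Pairs that aligned neither concordantly nor discordantly.
pairs_unaligned = pairs_concordant_0 - pairs_discordant_1
print(pairs_unaligned)  # 3359531

# Each such pair contributes two mates, which are then aligned singly.
mates_in_unaligned_pairs = 2 * pairs_unaligned
assert mates_in_unaligned_pairs == (
    mates_aligned_0 + mates_aligned_1 + mates_aligned_multi
)

# Overall rate: share of all mates that aligned at least once.
total_mates = 2 * total_pairs
overall_rate = 100.0 * (1 - mates_aligned_0 / total_mates)
print(f"{overall_rate:.2f}% overall alignment rate")  # 92.96%
```

Note that this rate counts mates, while the totals in the featureCounts summary count alignments, so the two sets of numbers are not directly comparable.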


featureCounts does not know if a read is from contamination or not - it simply compares the mapping coordinates of a read against the chromosomal coordinates of all gene exons and assigns the read to the overlapping gene if it finds one. If a contamination read does not overlap any exon, then it will not be assigned to any gene. But if it does overlap an exon it may be assigned to the corresponding gene.
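The coordinate comparison described above can be illustrated with a toy sketch. This is not the featureCounts implementation: it ignores strand, paired-end fragments, and spliced reads. The `Exon` type, the `assign_read` function, and the example coordinates are all invented for illustration, and treating a read that overlaps several genes as unassigned is an assumption about the default behaviour.

```python
from typing import List, NamedTuple


class Exon(NamedTuple):
    gene: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(exon: Exon, chrom: str, start: int, end: int) -> bool:
    """True if the read interval shares at least one base with the exon."""
    return exon.chrom == chrom and start <= exon.end and end >= exon.start


def assign_read(exons: List[Exon], chrom: str, start: int, end: int) -> str:
    """Return the overlapping gene, or a status label if none or several."""
    genes = {e.gene for e in exons if overlaps(e, chrom, start, end)}
    if not genes:
        return "Unassigned_NoFeature"
    if len(genes) > 1:
        # Assumption: reads overlapping several genes are left unassigned.
        return "Unassigned_Ambiguity"
    return next(iter(genes))


# Toy annotation: geneA has two exons; geneB overlaps geneA's second exon.
exons = [
    Exon("geneA", "chr1", 100, 200),
    Exon("geneA", "chr1", 300, 400),
    Exon("geneB", "chr1", 380, 500),
]

print(assign_read(exons, "chr1", 150, 180))  # inside an exon of geneA
print(assign_read(exons, "chr1", 210, 290))  # intronic: no exon overlap
print(assign_read(exons, "chr1", 390, 395))  # overlaps geneA and geneB
print(assign_read(exons, "chr2", 150, 180))  # different chromosome
```

The second call shows why an intronic or intergenic read, whatever its origin, ends up in the NoFeature category, while the first shows that any read landing on an exon is counted for that gene.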


I'd say the question is how much the DNA or rRNA contamination would affect the sequencing of your target molecules and, consequently, the quantification by featureCounts or any similar software. To answer that question, you need to diagnose the origin of the reads with software such as FastQ Screen, which is outside the realm of Bioconductor. Maybe others can point out whether there is Bioconductor software for that purpose.
