Question

Unassigned reads in Rsubread FeatureCounts

am39 • 0

@am39-10874

I'm using the featureCounts function of Rsubread to assign aligned reads to features. About 20% of my reads are unassigned (in the "NoFeature" category). Is there any way to see which reads these are? (Maybe it's possible in command-line subread, outside of R?) I'd like to look at my unassigned reads to understand whether they're contamination, from an un-annotated part of the genome, or something else.

Thank you!

Rsubread • 4.5k views

ADD COMMENT • link updated 3.7 years ago by Robert Castelo ★ 3.4k • written 5.7 years ago by am39 • 0

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

The NoFeature reads are from un-annotated parts of the genome. They can't be "contamination" because then they wouldn't be aligned to the genome and hence wouldn't be counted by featureCounts in the first place.

The R version of Rsubread has the same functionality as the command-line.

ADD COMMENT • link 5.7 years ago Gordon Smyth 53k

Right, sorry I wasn't thinking clearly.

Nevertheless, I'd like to take a look at what those unassigned reads are (i.e., which un-annotated regions of the genome they come from). Is there anywhere in the output where I can find those unassigned reads? I don't see it in the R documentation of the featureCounts function.

ADD REPLY • link 5.7 years ago am39 • 0

I'd say that you can have human DNA contamination and those reads would align to the human genome, typically to intergenic regions (outside the boundaries of annotated genes) and, to a lesser extent, to intronic regions.

ADD REPLY • link 5.7 years ago Robert Castelo ★ 3.4k

How much would DNA or rRNA contamination affect the read quantification by featureCounts?

In my case, total alignments are 31032304 and successfully assigned alignments are 13809791 (44.5%).

As I understand it, removing the contaminants from the BAM file can increase the percentage of successfully assigned alignments, because the number of total alignments may go down while the number of successfully assigned alignments remains the same. Please correct me if I'm wrong, as I am new to this.
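The arithmetic behind that reasoning can be checked with a few lines of R. This is only a sketch: the two counts are the ones quoted above, while `removed` is a made-up number of filtered alignments, and the calculation assumes that none of the removed alignments had been assigned to a gene, which only holds if none of them overlapped an exon.

```r
# Counts quoted in the post above.
total    <- 31032304
assigned <- 13809791

# Percentage of alignments that were successfully assigned.
pct_assigned <- function(assigned, total) 100 * assigned / total

pct_before <- pct_assigned(assigned, total)
round(pct_before, 1)   # 44.5

# Hypothetical: 5 million unassigned alignments are filtered out of the BAM.
# The numerator stays fixed, so only the denominator shrinks.
removed   <- 5e6
pct_after <- pct_assigned(assigned, total - removed)
round(pct_after, 1)
```

The percentage rises only because the denominator shrinks; the number of reads available for counting is unchanged.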

Hisat2 gave the alignment stats as

27183499 reads; of these:

27183499 (100.00%) were paired; of these:
4206782 (15.48%) aligned concordantly 0 times
22447624 (82.58%) aligned concordantly exactly 1 time
529093 (1.95%) aligned concordantly >1 times

----
4206782 pairs aligned concordantly 0 times; of these:
  847251 (20.14%) aligned discordantly 1 time
----
3359531 pairs aligned 0 times concordantly or discordantly; of these:
  6719062 mates make up the pairs; of these:
    3826578 (56.95%) aligned 0 times
    2621370 (39.01%) aligned exactly 1 time
    271114 (4.03%) aligned >1 times

92.96% overall alignment rate

Thanks in advance, Ekta

ADD REPLY • link 3.7 years ago Ekta • 0

featureCounts does not know if a read is from contamination or not - it simply compares the mapping coordinates of a read against the chromosomal coordinates of all gene exons and assigns the read to the overlapping gene if it finds one. If a contamination read does not overlap any exon, then it will not be assigned to any gene. But if it does overlap an exon it may be assigned to the corresponding gene.
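A toy illustration of that coordinate comparison, written in base R. It is a deliberately simplified sketch and not the featureCounts implementation: the exon table and the `assign_read` helper are invented for illustration, and strand, minimum overlap, multi-mapping and paired-end handling are all ignored.

```r
# Made-up exon annotation: two exons of geneA and one of geneB.
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneB"),
  chr   = c("chr1", "chr1", "chr1"),
  start = c(100, 300, 1000),
  end   = c(200, 400, 1200),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assign one aligned read to a gene by exon overlap.
assign_read <- function(chr, start, end, exons) {
  # Two intervals overlap if each one starts before the other ends.
  hit   <- exons$chr == chr & exons$start <= end & exons$end >= start
  genes <- unique(exons$gene[hit])
  if (length(genes) == 0) {
    "Unassigned_NoFeatures"   # no exon overlaps the read
  } else if (length(genes) > 1) {
    "Unassigned_Ambiguity"    # exons of more than one gene overlap the read
  } else {
    genes
  }
}

assign_read("chr1", 150, 250, exons)   # "geneA"
assign_read("chr1", 500, 600, exons)   # "Unassigned_NoFeatures"
```

Nothing in this comparison depends on where the read originally came from, which is why a contaminating read that happens to land on an exon is counted like any other.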

ADD REPLY • link 3.7 years ago Wei Shi ★ 3.6k

I'd say the question is how much the DNA or rRNA contamination would affect the sequencing of your target molecules and, consequently, the quantification by featureCounts or any similar software. To answer that question you need to diagnose the origin of the reads with software such as FastQ Screen, which is outside the realm of Bioconductor. Maybe others can point out whether there is Bioconductor software for that purpose.

ADD REPLY • link 3.7 years ago Robert Castelo ★ 3.4k

score 2 · Accepted Answer · 2020-05-06

Wei Shi ★ 3.6k

@wei-shi-2183

Australia/Melbourne

The reportReads parameter of featureCounts lets you output the assignment result for each read, from which you can identify the unassigned reads.
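A minimal sketch of how this might be used. The wrapper names (`count_with_read_report`, `read_unassigned`) and file paths are placeholders introduced here. It assumes the "CORE" report format described in the featureCounts help page: a tab-delimited file named after the input file with `.featureCounts` appended, holding read name, assignment status, number of targets and target list. The status label `Unassigned_NoFeatures` is taken to match the label in the featureCounts summary; check both against the output of your Rsubread version.

```r
# Run featureCounts and ask for a per-read assignment report.
# `bam_file` and `gtf_file` are placeholder paths supplied by the caller.
count_with_read_report <- function(bam_file, gtf_file) {
  Rsubread::featureCounts(
    files               = bam_file,
    annot.ext           = gtf_file,
    isGTFAnnotationFile = TRUE,
    reportReads         = "CORE"   # other accepted values: "SAM", "BAM"
  )
}

# Read a CORE report and return the names of reads with a given status.
read_unassigned <- function(core_file, status = "Unassigned_NoFeatures") {
  core <- read.delim(
    core_file,
    header     = FALSE,
    col.names  = c("read", "status", "n_targets", "targets"),
    colClasses = "character",
    quote      = ""
  )
  core$read[core$status == status]
}

# Example usage (paths are hypothetical):
# fc  <- count_with_read_report("sample.bam", "annotation.gtf")
# ids <- read_unassigned("sample.bam.featureCounts")
# head(ids)
```

The CORE report lists read names but no coordinates. To see where the unassigned reads fall on the genome, `reportReads = "BAM"` may be more convenient, since the help page describes the assignment status as being stored in the `XS` tag of each record alongside the original alignment.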