Reads in Unassigned_NoFeatures category produced by featureCounts
4
3
Entering edit mode
Likai Mao ▴ 40
@likai-mao-9454
Last seen 8.1 years ago
QIMR, Brisbane, Australia

Hi, I have some paired-end RNA samples with similar number of sequenced reads. However, one of them had low number of assigned reads and high number of Unassigned_NoFeatures reads.

Assigned    8,752,014
Unassigned_Ambiguity    4,523,883
Unassigned_MultiMapping    0
Unassigned_NoFeatures    49,769,215
Unassigned_Unmapped    6,918,731
Unassigned_MappingQuality    0
Unassigned_FragmentLength    0
Unassigned_Chimera    0
Unassigned_Secondary    19,635,817
Unassigned_Nonjunction    0
Unassigned_Duplicate    0

This category means “The fragment mapped to a region that is not annotated in the annotation file”. To confirm it, I selected 100K reads marked “Unassigned_NoFeatures” and get their mapped coordinates from BAM file using "bedtools bamtobed" 2.24.0. They were then overlapped with all exons annotated in the GTF file used to run featureCounts using "bedtools intersect" 2.25.0 (I somehow used different versions, which doesn't look affecting the analysis though). These overlapped but not assigned reads should be due to their strands (could there be any other reasons?).

The strands of overlapping reads and exons were then checked. I found when mapping quality >3, all strands are forward strands, which doesn't meet with the setting (reverse strand) used to run featureCounts. So these reads should be unsigned. However, there were some overlaps meeting the setting. Interestingly, all of them had mapping quality 3 (mapping was done by STAR). I wonder how these reads were also not assigned. Two of them are listed below (output of bedtools intersect).

chr1    630896    630987    MG00HS09:702:C78TLACXX:7:1101:10090:91329/1    3    -    chr1    630895    630958    ENSG00000276171.1    .    +    ENSEMBL    exon    .    gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";
chr1    630865    630966    MG00HS09:702:C78TLACXX:7:1101:10090:91329/2    3    +    chr1    630895    630958    ENSG00000276171.1    .    +    ENSEMBL    exon    .    gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";

Note that the above is a read pair. First 6 columns are from BAM file (converted to BED format by "bedtools bamtobed"), latter columns are from GTF file. Read 1 was reversely ("-" in column 6) mapped to a forward gene and read 2 forwardly ("+") mapped to the same gene. This looks to be a perfect reverse strand. So why they were not assigned? Or is there anything wrong?

Such "strange" unassigned reads are the minority (110) anyway. Most unassigned reads (6170) did have forward strands. Except for the above "1-+" and "2++", other strange unassigned reads had "1+-" and "2--". Correctly unassigned reads were 1++, 1--, 2+- and 2-+. The command I used for featureCounts was:

isPairedEnd=TRUE, strandSpecific=2, countPrimaryAlignmentsOnly=TRUE, annot.ext='xxx.gtf', isGTFAnnotationFile=TRUE

BTW, this sample has highest ambiguity of strand: reverse strand ~56% (forward strand ~33%), while for all other samples reverse strands are higher than 70%, as inferred by program “infer_experiment.py” in package RSeQC, which may contribute to its high unassigning rate.

 

rsubread • 7.1k views
ADD COMMENT
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 10 days ago
Australia/Melbourne/Olivia Newton-John …

Can you provide CIGAR info for the two reads included in your post? Your second read mapped to a 102bp region but your first read mapped to a shorter region of 92bp. This makes me wonder if there is a gap in your second read that may result in this read being unassigned to chr1:630895. Your first read does not overlap this coordinate.

ADD COMMENT
0
Entering edit mode

I extracted the two lines in the BAM file and list below. Note they are shown in "inline" format (you can scroll horizontally to see the rest of the whole lines). Same format was also used for the "bedtools intersect" output above (so there are 10 GTF fields after the beginning 6 converted BAM fields).

MG00HS09:702:C78TLACXX:7:1101:10090:91329    419    chr1    630866    3    101M    =    630897    122    CAGCTAAGCACCCTAATCAACTGGCTTCAATCTACTTCTCCCGCCGCCGGGAAAAAAGGCGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTT    @CCFFFFFHHGHHJIJJIJIIIGHIJIIIIJIJIJJJIIFIFIJJDHIJHE?BAEEB@DABDD@BBDDCDDDDDD5>B?CDBDDDDDDDDDCACC@<8>AD    NH:i:2    HI:i:2    AS:i:176    NM:i:0    MD:Z:101

MG00HS09:702:C78TLACXX:7:1101:10090:91329    339    chr1    630897    3    10S91M    =    630866    -122    GGTCTTCAACCTCCTCCCCCCCCCGCGGGAAAAAAAGGGGGGAGAAGCCCCGGCAGGTTTGAAGCTGCTTCTTCGAATTTGCAATTCAATATGAAAATCAC    ##############################?AA;5'9,'E?)77JJIJJIIJIDJJJIIJJJJIJHEJJIJJJIIJIIIJJJJJJJIJHHHHFFDEDFBBB    NH:i:2    HI:i:2    AS:i:176    NM:i:7    MD:Z:2A2T1T3G4C2G8C62

ADD REPLY
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 10 days ago
Australia/Melbourne/Olivia Newton-John …

Thanks for providing further information. The first read in the pair indeed overlaps with the exon. But this read (and the other read in the pair) was marked as a secondary alignment. Since you counted primary alignments only ('countPrimaryAlignmentsOnly' was set to 'TRUE' in your command), featureCounts would not count it. Could you let us know how you extracted reads from the 'Unassigned_NoFeatures' category? What is the version of Rsubread you use?

 

ADD COMMENT
0
Entering edit mode
Likai Mao ▴ 40
@likai-mao-9454
Last seen 8.1 years ago
QIMR, Brisbane, Australia

I actually used Subread (1.5.0 p1) to create the counts this time. I posted here just because I am used to it. The command line was same as the R version above:

featureCounts -R -p --primary -a xxx.gtf -s 2 -o count.txt

I extracted 'Unassigned_NoFeatures' lines from the output file (xxx.bam.featureCounts) produced by option '-R'. To find the overlaps between exons and rnofeature eads, both GTF and BAM files were converted to BED format using 'gtf2bed' (a program of BEDOPS) and 'bedtools bamtobed', respectively. BAM converted BED file was filtered using a Perl script to keep only lines with 'Unassigned_NoFeatures' reads. The exon and nofeature read overlaps were then obtained by command "bedtools intersect -a bam.bed -b gtf.bed -wa -wb". Please feel free to let me know if you need further details.

ADD COMMENT
0
Entering edit mode

Thanks Likai. We do need to have more info to work out what went wrong. Could you please provide the records in your gtf annotation file that overlaps the read pair of concern? Could you also let us know which genome you mapped your data to?

ADD REPLY
0
Entering edit mode

No worries. The related records in GTF file is as below.

chr1    ENSEMBL transcript      630896  630958  .       +       .       gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; level 3; tag "basic"; transcript_support_level "NA";
chr1    ENSEMBL exon    630896  630958  .       +       .       gene_id "ENSG00000276171.1"; transcript_id "ENST00000617238.1"; gene_type "miRNA"; gene_status "NOVEL"; gene_name "AC114498.1"; transcript_type "miRNA"; transcript_status "NOVEL"; transcript_name "AC114498.1-201"; exon_number 1; exon_id "ENSE00003717538.1"; level 3; tag "basic"; transcript_support_level "NA";

The GTF is from GENCODE (v24). The reference genome is also from GENCODE (GRCh38.p5) for best compatibility.

ADD REPLY
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 10 days ago
Australia/Melbourne/Olivia Newton-John …

The read pair of concern ("MG00HS09:702:C78TLACXX:7:1101:10090:91329") has two reported alignments - a primary one and a secondary one (which is shown in your post). The "xxx.bam.featureCounts" file includes both the primary one and the secondary one and these two alignments have identical read name. When you looked for the "Unassigned_NoFeature" alignments, the one you got was the primary alignment (the secondary alignment was in the category of  "Unassigned_Secondary"). However when you searched this read in your bam file, you got both primary and secondary alignments and then you found that the secondary alignment overlap with that exon but this secondary alignment wasn't the one you found originally from the "xxx.bam.featureCounts" file.

If you search the name of this read pair ("MG00HS09:702:C78TLACXX:7:1101:10090:91329") in your "xxx.bam.featureCounts" file, you should find two records - one is in the "Unassigned_NoFeature" category and the other in the "Unassigned_Secondary" category. The alignment you were investigating was the one assigned to the "Unassigned_Secondary" category, which overlaps with your exon but was not counted because you only counted primary alignments.

ADD COMMENT
0
Entering edit mode

Thanks Wei for your detailed explanation. It's very clear now. I remember I did find another 2 lines with same read name in BAM file mapped to different regions of reference genome, which should be primary alignments.

ADD REPLY

Login before adding your answer.

Traffic: 864 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6