Hisat2 vs featureCounts summary statistics
2
1
Entering edit mode
@koen-van-den-berge-6369
Last seen 5 months ago
Ghent University, Belgium

I am processing paired-end RNA-Seq data. I map them using Hisat2.0.3.Next I convert SAM files to BAM, sort and index the BAM file using samtools 1.3. I subsequently use the sorted BAM files to aggregate mapped reads based on annotation using featureCounts (Rsubread package v1.20.6).

After mapping the reads using Hisat2, we get some summary statistics like this:

20358841 reads; of these:
  20358841 (100.00%) were paired; of these:
    5046482 (24.79%) aligned concordantly 0 times
    12600084 (61.89%) aligned concordantly exactly 1 time
    2712275 (13.32%) aligned concordantly >1 times
    ----
    5046482 pairs aligned concordantly 0 times; of these:
      57755 (1.14%) aligned discordantly 1 time
    ----
    4988727 pairs aligned 0 times concordantly or discordantly; of these:
      9977454 mates make up the pairs; of these:
        8723119 (87.43%) aligned 0 times
        914540 (9.17%) aligned exactly 1 time
        339795 (3.41%) aligned >1 times
78.58% overall alignment rate

 

 

After sorting and indexing, featureCounts on this BAM file tells me

Process BAM file xxxx.bam...                                        ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 21770038                                              ||
||    Successfully assigned fragments : 10375129 (47.7%)                  

 

Which shows that the total fragments (i.e. paired-end reads) do not correspond with the number of paired-end reads that is given in the summary statistics from Hisat2. featureCounts tells me I have 21M fragments and Hisat2 tells me I have 20M fragments. How can this be possible?

 

featurecounts hisat2 rna-seq mapping • 6.4k views
ADD COMMENT
3
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 3 months ago
Australia/Melbourne/Olivia Newton-John …

The total number of fragments reported by featureCounts is the total number of alignments included in your bam files. If there are fragments that were reported more than once in your bam file, the total number of alignments will be greater than the total number of fragments included in your data.

ADD COMMENT
0
Entering edit mode

Thank you for your reply, Wei.

So the larger number of fragments reported by featureCounts may be the result of multi-mapping reads?

However, similar, the number of counted fragments is higher than the number of concordantly mapped reads (exactly one time as in the example above) for some of the samples. This happens when we are not counting multi-mapping fragments, not counting multi-overlapping fragments and only counting fragments where both reads are mapped. How can this be explained?

Best,

Koen

ADD REPLY
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 3 months ago
Australia/Melbourne/Olivia Newton-John …

Can you provide the parameters you used when running featureCounts?

The discrepancy might be due to how you define "concordantly mapped reads". Although you count fragments where both ends are mapped, the fragments with paired end distance greater than the nominal template length are counted by featureCounts as well and this will give you larger counts. If you use -d and -D options to specify the minimum and maximum fragment length, you will get reduced number of counts. So this is all dependent on how you count the reads.

ADD COMMENT
0
Entering edit mode

Dear Wei, thanks again.

I used featureCounts (Rsubread_1.20.6) with the following arguments:

featureCounts ( f i l e s = bamFiles , annot . ext=”xxx.gff3”, isGTFAnnotationFile = TRUE, useMetaFeatures = TRUE, isPairedEnd=TRUE, requireBothEndsMapped = TRUE, GTF . featureType = ”CDS” , GTF. attrType = ”ID” , allowMultiOverlap=FALSE) 

The genome is an in-house draft for a non-model organism.

For the mapping, Hisat2 has been used with default parameters, and I basically compared the number of concordantly mapped reads reported there with the number of assigned fragments in featureCounts.

ADD REPLY
0
Entering edit mode

Thanks for the information. I think you will need to understand how Hisat2 defines "concordantly mapped reads". What is the limit on the distance between the two reads from the same pair?

If you set 'checkFragLength=TRUE'  in your featureCounts command, you will get less assigned reads.

ADD REPLY

Login before adding your answer.

Traffic: 792 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6