Question

FeatureCounts counts wrongly when quantifying mixed paired and unpaired reads

0

Entering edit mode

k.vitting.seerup ▴ 120

@kvittingseerup-7956

Last seen 2.4 years ago

European Union

Summary

It seems featureCounts under certain circumstances ignores the isPairedEnd = TRUE argument.

Background

I have been using featureCounts successfully for many years but recently ran into a problem in files where I have a mixture of paired and unpaired strand specific data (Produced by Trimmomatic mapped by Hisat2 (v 2.0.1-beta)). To illustrate the problem consider the two bam files (and their indexes) here. Both sam files contain a subset of chr1 consisting of the first 5 million basepairs (chr1-1-5.000.000) in mm10. The only difference between the files is that one were mapped using only the paired data (onlypaired.bam) and the other also includes the unpaired data (pairedand_unpaired.bam).

Analysis goal Quantify the sense/anti-sense ratio of reads mapping to genes. For the example files we are specifically interested in the ENSMUSG00000025902.13 gene located at chr1:4490931-4497354.

Samtools approach: As far as I can tell the files are identical with regards to mapping and paired flags and the flags works fine.

samtools view -c -f 1 only_paired.bam chr1:4490931-4497354 2573 paired reads (and a total of 2657 reads in pairedandunpaired.bam)

When considering ENSMUSG00000025902.13 the the strand of the first-mate seems specific in both files:

samtools view -f 64 only_paired.bam chr1:4490931-4497354 | sam2bed | cut -f 6 | sort | uniq -c
     11 -
   1272 +

samtools view -f 64 paired_and_unpaired.bam chr1:4490931-4497354 | sam2bed | cut -f 6 | sort | uniq -c
     11 -
   1272 +

So far so good. Now lets try with featureCount from the R package Rsubread

Rsubread::featureCount approach:

library(Rsubread)

### Create regions to quantify
# sense
x <- capture.output( # prevent the subread printout
    senseQuantPaired <- featureCounts(
        files = bamFiles,
        annot.ext=senseAnnot,
        isPairedEnd = TRUE,
        strandSpecific = 1,
    )
)
# anit-sense paired
x <- capture.output( # prevent the subread printout
    antisenseQuantPaired <- featureCounts(
        files = bamFiles,
        annot.ext=antisenseAnnot,
        isPairedEnd = TRUE,
        strandSpecific = 1,
    )
)
# anit-sense unpaired
x <- capture.output( # prevent the subread printout
    antisenseQuantUnpaired <- featureCounts(
        files = bamFiles,
        annot.ext=antisenseAnnot,
        isPairedEnd = FALSE,
        strandSpecific = 1,
    )
)

x <- capture.output( # prevent the subread printout
    antisenseQuantUnpaired <- featureCounts(
        files = bamFiles,
        annot.ext=antisenseAnnot,
        isPairedEnd = FALSE,
        nthreads=1,
        strandSpecific = 1,
    )
)

### Massage output into df
data.frame(
    libType = c('only_paired','paired_and_unpaired'),
    sense = senseQuant$counts[1,],
    antisensePaired = antisenseQuantPaired$counts[1,],
    antisenseUnpaired = antisenseQuantUnpaired$counts[1,],
    row.names = NULL
)
              libType sense antisensePaired antisenseUnpaired
1         only_paired  1283              11              1283
2 paired_and_unpaired  1341            1309              1309

From the "sense" column we see that both methods agree on the number of reads mapping to the '+' strand - results supported by the samtools approach above. The problem is the number of reads mapping to the '-' strand. When considering the file only containing paired reads very few reads map to the '-' strand (again just like the samtools approach) but for the file also containing unpaired reads featureCounts finds a lot of reads. Actually the same number of reads as if we quantified the "pairedandunpaired.bam" file as unpaired data. The read-count can also be rescued by filtering the pairedandunpaired file with the "-f 1" flag via samtools (not shown).

Summary

So it seems featureCounts under certain circumstances ignores the isPairedEnd = TRUEargument...? And it seems to have something to do with the paired/unpaired reads (composition) in the first 5e6 since if I just extract the ENSMUSG00000025902.13 gene (aka ignore the first ~4490000 nt) the minus-strand quantification is correct.

Hope this get fixed soon.

Cheers Kristoffer

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.32.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0        magrittr_1.5      usethis_1.4.0     devtools_2.0.1    pkgload_1.0.2     R6_2.4.0         
 [7] rlang_0.3.1       tools_3.5.1       pkgbuild_1.0.2    sessioninfo_1.1.1 cli_1.0.1         withr_2.1.2      
[13] remotes_2.0.2     yaml_2.2.0        assertthat_0.2.0  digest_0.6.18     rprojroot_1.3-2   crayon_1.3.4     
[19] processx_3.2.0    callr_3.0.0       base64enc_0.1-3   fs_1.2.6          ps_1.2.1          testthat_2.0.1   
[25] memoise_1.1.0     glue_1.3.0        compiler_3.5.1    desc_1.2.0        backports_1.1.2   prettyunits_1.0.2

rsubread featureCounts • 5.6k views

ADD COMMENT • link updated 3.8 years ago by Sarah_piggy ▴ 10 • written 6.8 years ago by k.vitting.seerup ▴ 120

1

Entering edit mode

Hi Kristoffer,

I'm not sure what is in the paired_and_unpaired.bam file. Does it contain both single-end and paired-end reads? You can figure it out by running

$ samtools view paired_and_unpaired.bam | awk 'and($2, 1) == 1' | wc -l

and

$ samtools view paired_and_unpaired.bam | awk 'and($2, 1) == 0' | wc -l

If both commands give non-zero numbers then your BAM file is a mixture of single- and paired-end reads. FeatureCounts assumes that a BAM file can contain only one type of reads, either single-end or paired-end, but never the mixture of them. It then checks the type of the reads in an input file by looking at the first read in it. If it sees a single-end read (ie without the 0x01 FLAG) at the beginning of the input file, it treats all the reads in it as single-end.

You can also allow screen output from featureCounts and check the type of reads in the screen output.

If you see S paired_and_unpaired.bam in the screen output, it means that featureCounts treats all reads as single-end.

If you see P paired_and_unpaired.bam in the screen output, it means that featureCounts treats all reads as paired-end.

Cheers, Yang

ADD REPLY • link 6.7 years ago Yang Liao ▴ 450

0

Entering edit mode

Hi Yang

Thanks for clarifying this - and yes you are right it contain a mixture of paired and unpaired reads. Why would featureCounts not support mixed paired or unpaired reads? Means that if I have paired end data that needs trimming I will throw away a lot of data - often 5-15%...

I think I was mislead by the "isPairedEnd" argument. Maybe adding a note explaining this to the documentation this would be helpfull for future users...

ADD REPLY • link 6.7 years ago k.vitting.seerup ▴ 120

0

Entering edit mode

One reason for featureCounts not allowing the mixture of single-end and paired-end reads in one SAM/BAM file is that the counts of single reads shouldn't be added to the counts of fragments because they are different things.

A normal read aligner reports unpaired read mapping results as paired-end reads. Namely, although only one end in the read-pair is mapped, it still has a 0x01 FLAG in the SAM/BAM file to indicate that the input is paired-end.

ADD REPLY • link 6.7 years ago Yang Liao ▴ 450

0

Entering edit mode

I think you might have mis-understood me due to me not being clear - sorry about that :-). What I am talking about is not mixing single-end and paire-end libraries. I am talking about the case where a paired end library where trimming of low quality reads etc (via e.g. Trimmomatic) causes the removal of one of the read mates but not the other so you end up with data that is a mix of paired and unpaired reads from the same paired experiment (that is what the example files used above is). In such a case the unpaired reads are equivalent of cases where only one mate mapped (as you mention in the last comment). Why would that not be supported by featureCounts?

ADD REPLY • link 6.7 years ago k.vitting.seerup ▴ 120

0

Entering edit mode

featureCounts can correctly process any bam files generated from the mapping of reads generated from a paired end library including those bam files that contain read pairs that have only one end mapped (the other end is not required to be included in the same file), as long as all the reads are marked as being generated from a read pair in their FLAG field. However your bam files including unpaired reads do not seem to have this FLAG set and featureCounts will treat them as single end reads. featureCounts identifies single end and paired end reads according to the SAM/BAM specification.

Note that you will get unintended counting results if you count paired end reads as single end - the second read in a pair will be incorrectly counted to the strand opposite to the strand the first read is counted to. When featureCounts identifies the reads as paired end reads it will flip the strand of the second read to make sure the two reads from the same pair are counted to the same strand.

Could you please also provide your full commands for trimming and aligning your raw reads (reads from FASTQ files) so we can see why paired end reads were marked as single end?

ADD REPLY • link 6.7 years ago Wei Shi ★ 3.6k

0

Entering edit mode

For trimming I am using Trimmomatic and it does not specifically matter which filtering you are doing but when you trim paired data the result is 4 files (see "Paired End Mode" section of the manual) two which contains the ordered paired reads and two files containing respectively the left and right reads which lost their mate due to the trimming.

For mapping I am using HISAT2 where the trimmed paired files are mapped via the "-1" and "-2" argument and the one without a mate after trimming are mapped with the "-U" ( see HISTAT options here)

So in summary if a mates is lost due to trimming they will not be marked as paired when you map them. Sorry if it is a naive question but could such read not be supported by counting them just like reads where only one mate was mapped?

ADD REPLY • link 6.7 years ago k.vitting.seerup ▴ 120

0

Entering edit mode

The problem isn't in the mapping, the problem is that featureCounts knows to expect read 2 to be in the opposite orientation as read 1. But if it doesn't know it's looking at read 2, and thinks its looking at read 1, the orientation will not what the software is expecting

I suppose what you could do is align the read2 singletons separately, and manually change the flags in the bam to indicate that those reads are read2 where read1 failed to map. (Flag 137 I believe). Then merged that bam with the paired one, and FeatureCounts will probably understand that.

But that seems like a lot of work for what have to be a very small % of your reads.

ADD REPLY • link 6.7 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

An easier and more complete solution would be to simply not use Trimmomatic, as suggested by Wei Shi in his answer.

ADD REPLY • link 6.5 years ago Gordon Smyth 53k

score 3 · Answer 1 · 2019-04-29

The problem you encountered with your read counting is due to the strand-specific counting of your singleton reads. You cannot perform a correct strand-specific counting for these reads because featureCounts does not know if they are the first read or the second read in a pair. If they are the first read featureCounts will use their strand as is to match feature strands. But if they are the second read featureCounts will flip the strand of the read before matching it to feature strands. However because you treated these reads as single end reads in your mapping, the paired end information was lost and therefore you cannot perform strand specific counting for these reads correctly.

You don't have to do read trimming before mapping. The Subread aligner (align function in Rsubread) automatically performs trimming while it maps reads to a reference. Paired end information is preserved for every read after mapping is completed, so you wont have any problems with the subsequent strand-specific read counting. The Rsubread workflow (align + featureCounts) was demonstrated to have an excellent performance in quantifying expression levels of genes. See the following paper that was published this year for more details:

https://www.ncbi.nlm.nih.gov/pubmed/30783653