Question

Rsubread, Error in featureCounts paired end

0

Entering edit mode

hauken_heyken ▴ 80

@hauken_heyken-13992

Last seen 18 months ago

Bergen

Hey, nice package.

I have an error using featureCounts with paired end reads: I mapped paired end reads with STAR (trimmed reads with fastp), and wanted to try out Rsubread to do the counting in R.

library(Rsubread)
# Define paths to RNA-seq bam file and gtf
a <- featureCounts(files = df$RNA[1], annot.ext = gtfAnno, isGTFAnnotationFile = T, isPairedEnd = T, requireBothEndsMapped=T, nthreads=30)

ERROR: both single-end and paired-end reads were found in the same input file!
FATAL Error: The program has to terminate and no counting file is generated.

What I don't get is why it cares if the file includes single end reads, when I say: requireBothEndsMapped=T ?

I'm in Bioc devel 3.10, Rsubread_1.35.13

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Rsubread_1.35.13            ggplot2_3.2.0               ORFik_1.5.7                 reshape2_1.4.3             
 [5] rtracklayer_1.45.1          GenomicAlignments_1.21.2    Rsamtools_2.1.2             Biostrings_2.53.0          
 [9] XVector_0.25.0              SummarizedExperiment_1.15.1 DelayedArray_0.11.0         BiocParallel_1.19.0        
[13] matrixStats_0.54.0          GenomicFeatures_1.37.1      AnnotationDbi_1.47.0        Biobase_2.45.0             
[17] GenomicRanges_1.37.8        GenomeInfoDb_1.21.1         IRanges_2.19.6              S4Vectors_0.23.6           
[21] BiocGenerics_0.31.2         data.table_1.12.2

Rsubread paired end featureCounts • 4.3k views

ADD COMMENT • link updated 4.8 years ago by Wei Shi ★ 3.6k • written 4.8 years ago by hauken_heyken ▴ 80

0

Entering edit mode

Dear Haakon,

I've edited the formatting of your question to make it more readable. You might find it helpful to read the tutorial on using markdown to format questions: https://support.bioconductor.org/p/117436 . Unfortunately, when you copy and paste R code and output, it is interpreted as markdown whether you want it to be or not!

Best
Gordon

ADD REPLY • link 4.8 years ago Gordon Smyth 50k

1

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 18 days ago

Australia/Melbourne/Olivia Newton-John …

As Gordon has replied it is better not to mix paired-end reads with single-end reads in the same file and it is really unnecessary to trim reads before mapping as read aligners will take care of this for you.

However we have modified featureCounts to make it be able to process bam files with mixed read types. The new version should be available in a couple of days.

ADD COMMENT • link 4.8 years ago Wei Shi ★ 3.6k

score 3 · Accepted Answer · 2019-07-04

IMO the problem is that trimming the reads with fastp has vandalized the sequence reads beyond recovery. From what you say, it appears that fastp has not just trimmed a few poor quality bases but has removed entire reads, so that some of the pair-end reads have become unpaired in an irregular fashion. This is not the first time we have seen the same problem with trimmers, see for example https://support.bioconductor.org/p/120193. I recently wrote (Chen et al, F1000Research 2016) that

we think that trimming poor quality segments is an old-fashioned step that is generally unnecessary given improved sequencing protocols and high quality robust aligners like Subread. Trimming is more likely to be harmful than helpful for a gene-level RNA-seq study. Indeed we think that encouraging entry-level users to make ad hoc edits of their sequence data is quite dangerous. It is far better to allow a quality-score-aware aligner like Subread (or STAR) to make decisions on a per-read basis.

Regarding the behaviour of featureCounts, it makes no sense really to have both single-end and pair-end reads in the same file (because a sequencing run must always produce one or the other, not both). If you tell featureCounts that the data is paired but actually it isn't, you shouldn't be surprised if it complains.

You might think featureCounts should just not count the single-end reads if you specify requireBothEndsMapped=TRUE but another way to think of it is that you've asked the program to check for something that is undefined. featureCounts currently gives the error message you see whenever single-end reads are found in a file that supposedly is paired-end. The Subread developers have discussed this issue but have felt that trying to rescue such datasets is too complicated and risk prone, given all the options and cases that featureCounts has to handle. A single end read is not entirely equivalent in information content to a paired-end read for which only one end can be mapped, and can't be treated as equivalent in all situations. It seems safer to ask the user to resolve the conflicts, e.g., by separating the single-end and pair-end reads into separate files or by avoiding the problem in the first place.

Trimming can be useful for some types of data or if there is a substantial problem with your sequencing. If you have regular RNA-seq though, then in our experience the best solution is to go back and align the original reads (with STAR or Rsubread) rather than the trimmed reads.