Question

featureCounts on read 1 of a paired-end experiment

0

Entering edit mode

RRnaSeq • 0

@rrnaseq-12083

Last seen 7.3 years ago

Hi all,

I'm new to RNAseq analysis and am facing what seems to be an unusual situation, so I would love any and all advice you folks may have. We ran a stranded illumina truseq experiment which was meant to be paired-end. However, after R1 was successfully generated we experienced technical difficulties w/the sequencer and R2 is mainly of no use.

Obviously the ideal move here would be to rerun the experiment, however this is not an option. So, the plan is to move forward treating R1 as a stranded single-end experiment - which brings me to my pair of questions: 1.) Is this approach methodologically sound or is it invalid to try to interpret R1 without R2 (barring the obvious caveat that we will suffer all the disadvantages single-end data have relative to paired-end data)

and 2.) What are the appropriate options to feed featureCounts with respect to strandedness and paired-endedness?

So far I've tried strandSpecific=0,strandSpecific=1 and strandSpecific=2 - my expectation given the protocol was that strandSpecific=2 (reverse stranded) would be appropriate, however the unstranded option (strandSpecific=0) gives double the assigned reads as either strandSpecific=1 or strandSpecific=2. Is this expected behavior?

I've also been leaving "isPairedEnd" as its default (isPairedEnd=FALSE), but the function seems to recognize that the data were meant to be paired-end and prints "Paired-end reads are included" as part of its output. Will this affect my results?

Thanks so much!

rnaseq • 1.0k views

ADD COMMENT • link updated 7.3 years ago by Aaron Lun ★ 28k • written 7.3 years ago by RRnaSeq • 0

score 0 · Answer 1 · 2016-12-28

I recall encountering a similar issue several years ago, where we got a lot of inter-chromosomal pairs in a ChIP-seq experiment due to some ligation occurring in the library preparation. In that situation, we just threw away the second read of the pair and treated the first read as if it were sequenced as single-end data. I don't think there should be any problems with doing this - a read sequence is a read sequence, so alignment and counting should be applicable even if you don't use the second read.

Note that we re-ran the alignment (with subread) using only the first read's FASTQ file, so that the various pair-related flags in the output BAM file were not set. If you do it like that, featureCounts shouldn't be able to determine whether the sequencing data was originally paired or not. This should ensure that featureCounts (and any other downstream programs) treats the data as single-end.

As for the strandedness, strandSpecific=2 should be the right setting for the TruSeq protocol, as the first read should correspond to the template sequence. I'm not sure why you're getting twice as many reads assigned with strandSpecific=0; perhaps the reads are not actually stranded? I would open up the BAM file on a genome viewer to check. Have a look at some genes and see if all the overlapping reads are on the same strand. If not, then the data are not stranded.

P.S. Add Rsubread to the tag of your post, otherwise the maintainers won't be notified.