I am currently trying to optimize RNAseq analysis for stress conditions, in which splicing dysregulation (among other issues) is a major obstacle to accurate mapping. I therefore thought to try genome/split-read alignment over direct transcriptome mapping. Using the subread package with existing heat-shock data, I obtained very good mapping (~95%) for all of my control samples, but significantly lower mapping (~83%) under stress conditions. I understand that 83% mapping is not bad, but I'm interested to know why these mapping results are so consistently different from control conditions.
The interesting bit: when I manually search the unmapped sequences using BLAT, most seem to be stress-related genes (small sample size, so far). More perplexingly, they have 100% sequence identity with a single gene. My question, then, is why featureCounts is failing to map these sequences? I've thought to change the stringency settings, but I doubt that will make any difference for perfect alignments.
- C. elegans RNAseq data, aligned to the WBcel215 (ce10) genome assembly
- Trimmed using Trimmomatic
- ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:2
subread-buildindex -o WBcel215_index *.fa for i in *.qc.fq; do subread-align -t 0 -T 4 -d 25 -i WBcel215_index -r $i -o $i.sr.bam; done
Example sequences (picked randomly by hand, believe it or not):