I am currently trying to optimize RNAseq analysis for stress conditions, in which splicing dysregulation (among other issues) is a major obstacle to accurate mapping. I therefore thought to try genome/split-read alignment over direct transcriptome mapping. Using the subread package with existing heat-shock data, I obtained very good mapping (~95%) for all of my control samples, but significantly lower mapping (~83%) under stress conditions. I understand that 83% mapping is not bad, but I'm interested to know why these mapping results are so consistently different from control conditions.
The interesting bit: when I manually search the unmapped sequences using BLAT, most seem to be stress-related genes (small sample size, so far). More perplexingly, they have 100% sequence identity with a single gene. My question, then, is why featureCounts is failing to map these sequences? I've thought to change the stringency settings, but I doubt that will make any difference for perfect alignments.
Background:
- C. elegans RNAseq data, aligned to the WBcel215 (ce10) genome assembly
- Trimmed using Trimmomatic
- ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:2
Subread code:
subread-buildindex -o WBcel215_index *.fa for i in *.qc.fq; do subread-align -t 0 -T 4 -d 25 -i WBcel215_index -r $i -o $i.sr.bam; done
Example sequences (picked randomly by hand, believe it or not):
- GTCAGCATATGTAGTGAATGTTTTGCAGGCTTTTGTTGGAATTCGAGTG
- GGAATTCGAGTGTTTCGATCAATCAGATTGGTCATCACACCTCCAGCTG
- ATTGTTAACAATGATCAAAAGTTTGCCATAAATCTCAATGTCTCGCAGT
- GTTGAAGTTGATTCACTGTTTGAAGGAATTGATCTATGCACAAAGATCA
- CTCAACTGGTTCCAGAGTTTTCCGGAATAGATCAGCACACAGTTCTTCG
I re-ran the alignment with STAR yesterday, and the output confirms that multi-mapping is the issue, thank you! Still left to wonder why multi-mapping is higher in data from stress conditions, but that's probably more of a biological question. Now I just need to find a way to rescue some of these multi-mapped reads, which are all from genes of interest.
I already showed you how to rescue the multi-mapped reads. It's all explained in the Subread documentation.