I have performed an RNA-seq experiment on leaves (five time points) and maturing seeds (three time points - from early maturation to dry seed) on a non-model plant species, and used the Salmon -> tximport -> DESeq2 analysis pipeline.
For the most part the leaf data looks pretty good. However, I have come across an interesting issue when it comes to the dry seed data (last seed time point). Although most genes are not DE (based on DESeq2), the vast, vast majority of reads are mapping to just a few DE genes. Over 50% of counts come from just three genes (associated with seed storage), and 75% of counts are derived from the top 15 most highly induced genes (also seed storage and LEAs).
The DE results themselves didn't look obviously distorted at first glance (roughly a 50/50 split of up- vs. down-regulated DE genes in the seed, and the expected genes and gene families were turned on or off), but it definitely seems like this could violate some underlying statistical assumption of RNA-seq analysis. Is there anything to consider when trying to analyse data such as this?
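The compositional effect being hinted at can be sketched numerically. This is a toy illustration (hypothetical counts, equal gene lengths assumed for simplicity), not your data: because TPM values per sample sum to a fixed total, a few exploding storage genes force every other gene's TPM down even when its absolute expression is unchanged.

```python
import numpy as np

# Hypothetical three-gene toy: genes 0-1 are constant across tissues,
# gene 2 stands in for an induced seed-storage gene.
lengths = np.array([1000.0, 1000.0, 1000.0])     # bp, all equal for simplicity
counts_leaf = np.array([500.0, 500.0, 10.0])
counts_seed = np.array([500.0, 500.0, 10000.0])  # storage gene explodes

def tpm(counts, lengths):
    # TPM: length-normalized rate, rescaled so each sample sums to 1e6
    rate = counts / lengths
    return rate / rate.sum() * 1e6

print(tpm(counts_leaf, lengths)[0])  # ~495050: gene 0's TPM in leaf
print(tpm(counts_seed, lengths)[0])  # ~45455: same counts, >10x lower TPM
```

Gene 0 has identical counts in both samples, yet its TPM drops more than tenfold in the seed purely because the storage gene absorbed the "budget" — which is the kind of assumption violation worth worrying about here.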
Thanks for the quick response!
So all else being equal I should be able to trust the DESeq2 results and expression values (normalised counts, rlog), but Salmon TPM values (for example) may not be an ideal measure of expression for this particular dataset?
I would say you don't want to compute LFCs directly from TPM if a few genes absorb most of the counts and those genes are themselves DE. You want a normalization method that is robust to such genes, like DESeq2's median-of-ratios.
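The robustness point can be sketched with toy numbers (hypothetical counts, not from the dataset above): naive total-count scaling is dragged around by a few dominant genes, while a median-of-ratios scheme in the spirit of DESeq2's size factors effectively ignores them.

```python
import numpy as np

# Ten "stable" genes with identical counts in both samples, plus three
# hypothetical storage genes that explode in the dry seed sample.
counts_leaf = np.array([100, 200, 150, 300, 250, 120, 180, 220, 160, 140,
                        50, 40, 30], dtype=float)
counts_seed = np.array([100, 200, 150, 300, 250, 120, 180, 220, 160, 140,
                        50000, 40000, 30000], dtype=float)

# Naive total-count scaling: the storage genes inflate the seed library,
# so every stable gene looks strongly down-regulated.
naive_factor = counts_seed.sum() / counts_leaf.sum()
naive_lfc_stable = np.log2((counts_seed[:10] / naive_factor) / counts_leaf[:10])

# Median-of-ratios (the DESeq2 idea): take each gene's ratio to its
# geometric mean across samples; the per-sample median of those ratios
# is insensitive to a handful of extreme genes.
geo_mean = np.sqrt(counts_leaf * counts_seed)
size_leaf = np.median(counts_leaf / geo_mean)
size_seed = np.median(counts_seed / geo_mean)
mor_lfc_stable = np.log2((counts_seed[:10] / size_seed)
                         / (counts_leaf[:10] / size_leaf))

print(naive_lfc_stable)  # strongly negative: artefactual down-regulation
print(mor_lfc_stable)    # ~0: stable genes correctly look unchanged
```

In this sketch the naive LFCs for the unchanged genes come out around -6, while the median-of-ratios LFCs are essentially zero — which is why, all else being equal, the DESeq2 normalised counts are the safer quantity here.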