Dear all,
In one of the RNA-Seq datasets I'm analysing, the knockdown/overexpression of a single gene is compared to the empty vector. This comparison has been done for seven different genes. In every case, I'm getting over 5000 significantly differentially expressed genes. That seems excessive to me, given that only one gene has been overexpressed or knocked down, and since it happens for all seven genes I'm probably doing something wrong. Does anyone have any advice on how I can check whether the results are correct, or how to pinpoint where I went wrong?
The pipeline trims the reads with fastp, aligns them with STAR (I've also tried Salmon with the same result) and tests for differential expression with DESeq2.
Cheers, Liam
I didn't think that the details would be so useful, but here goes: Paired reads are being trimmed with fastp (-q 30, -c), aligned with STAR (no special parameters), then quantified with featureCounts (-Q 1 --ignoreDup, -s 0). R code as follows:
sessionInfo
Two suggestions: first, take a look at the PCA plot to get a sense of your sample distances within and across groups.
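For anyone following along, the PCA check could look roughly like the sketch below. The original R code is not shown in the thread, so this uses a simulated dataset from makeExampleDESeqDataSet() as a stand-in; the object names (dds, vsd, pca_data) and the "condition" column are assumptions, not the poster's actual setup.

```r
library(DESeq2)

# Simulated stand-in for the real experiment (assumption: two groups,
# four samples each, design ~ condition as created by the helper).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 5000, m = 8)

# Variance-stabilising transformation; blind = TRUE ignores the design,
# which is the usual choice for quality-control plots.
vsd <- vst(dds, blind = TRUE)

# PCA on the most variable genes, coloured by group.
plotPCA(vsd, intgroup = "condition")

# The same coordinates as a data frame, for a closer look at the samples.
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
head(pca_data)
```

If replicates of the same group sit far apart, or samples cluster by something other than the knockdown/overexpression (batch, for example), that is worth sorting out before interpreting the gene counts.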
Also, in the paper we discuss that with sufficient samples you may find many genes where the LFC is not zero, but these may not be of interest. We therefore recommend specifying an LFC threshold via the lfcThreshold argument to results().
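As a rough illustration of the thresholded test, the sketch below compares the default results() call with one that uses lfcThreshold. It again runs on simulated data from makeExampleDESeqDataSet(), since the poster's own code is not shown; the threshold of 1 and alpha of 0.05 are illustrative choices, not recommendations for this particular dataset.

```r
library(DESeq2)

# Simulated stand-in for the real experiment (assumption), with
# true log2 fold changes drawn with standard deviation 1.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 8, betaSD = 1)
dds <- DESeq(dds)

# Default test: the null hypothesis is LFC == 0.
res0 <- results(dds, alpha = 0.05)

# Thresholded test: the null hypothesis is |LFC| <= 1.
res1 <- results(dds, lfcThreshold = 1, altHypothesis = "greaterAbs",
                alpha = 0.05)

# Number of genes called significant under each null hypothesis.
n0 <- sum(res0$padj < 0.05, na.rm = TRUE)
n1 <- sum(res1$padj < 0.05, na.rm = TRUE)
c(lfc_zero = n0, lfc_above_1 = n1)
```

Note that this is not the same as filtering the default results table by fold change afterwards: with lfcThreshold the p-values themselves test whether the fold change exceeds the threshold.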
Thanks for the help. Until now I'd rarely used LFC as a threshold as it hadn't changed the interpretation much, so it didn't cross my mind. Setting an LFC threshold of 1 reduces the number of genes to ~600, a much more manageable number.