Hey everyone,
this is my first question in the forum.
i'm analyzing RNA-Seq data of Tomato.
some of the samples were treated with HeatShock, and comparing to the control groups it seems that
a small group of genes presenting enormous values of expression.
my question how does it effects other genes at the phase of normalization and comparisons (using DESeq2) ?
and what are the best way to handle it?
thanks
just to clarify,
we have 2 time point: T0 and T5 (right after heat shock, and 5 hours after heat shock respectively) , when the heat shock effect is much more significant in T0 when small subset of genes have great expression level.
is it possible that genes that showed positive fold change in T5, will show a negative at T0 because of the over expressed genes of T0 ?
Hi Michael,
I am working with Maor on this project, first of all thank you very much for the reply.
We have already spoken with the researcher that performed the experiment, What we are afraid of is that the high expression of the heat-shock response genes will cause a global decrease in the expression levels of the rest of the genes due to pure technical reasons during the sequencing.
Even when looking at the size factors, we see that the samples after heat-shock received size factor of around 0.8-0.9 (with the total library size being the same or greater than other samples), meaning that we do observe a phenomena where most of the genes were under-represented compared to the other samples.
We looked at the histograms of expression levels in all samples and we don't see a global shift or different counts distribution between our different samples.
In you experience, do you have any suggestions on how to test the effect of this on the data. We thought to try and remove the most up regulated genes (convert their counts to 0) and perform the analysis, just to test how significantly will this affect the results. What is you opinion on that? do you have any other suggestions?
This has been discussed on the support site before, so you can try to search for similar posts, but I can summarize:
If you suspect global increase or decrease in expression, there is no way to estimate size factors computationally. you need spike-ins for example. And if you had many spike-ins to estimate the global changes, you would pass the spike-ins to the controlGenes argument of estimateSizeFactors. Spike-ins have their own problems -- they are hard to control, giving imprecise size factors which impairs proper inference -- as discussed in the SEQC papers.
Also you would want to follow the advice above, turning off betaPrior, because the assumption that the bulk of the LFCs are roughly centered around zero is not the case.
Performing differential expression null hypothesis testing when all the genes change expression will not give very interesting/informative results, because you know already that the null is not nearly the case for most or all genes. I would think about potentially finding other ways to present the results.