Hello,
I have a few questions regarding the normalization performed by the DESeq2 package:
- I understand that the normalization works under the assumption that most genes are not DE and only a small subset are. I understand that this assumption is realistic in most experiments, but I would appreciate your input regarding its correctness when performing analysis across different human tissues. That is, does it make sense to assume that most genes will have the same expression levels in different human tissues? Or should I use a different normalization when dealing with this sort of data?
- In some RNA-Seq experiments we encounter situations in which some libraries are sequenced at a greater depth than others, with a larger range of library sizes than we usually expect. In those cases, the size factor given to the deeply sequenced samples will be much higher than 1 and the factor assigned to the smaller samples will be much lower than 1. In those cases, is it better to subsample the larger libraries in order to prevent this wide range of size factors?
I raise this issue because when a sample receives a very small size factor, it seems to me that we are artificially increasing the counts of all genes in this sample without any biological evidence to back that up. Lowering the counts of the bigger libraries seems less problematic to me than artificially inflating the small ones. What do you think?
Thank you very much,
Olga.
I'm tagging onto this as I have a similar question. Hope this is the right place to ask it. I'd like to follow up on the question of when an MA plot skew is "too much" for the DESeq2 size factor calculation to handle and what, if anything, can be done about that. I'll preface by saying that we tried to avoid such issues up front by only analyzing organisms whose possible coding sequences were >= 80% covered under both conditions analyzed. Attached are three examples:
Our more typical case, with most ratios along the x axis:
https://htcf.wustl.edu/files/zdw23YMw
A more skewed but acceptable (for us) example:
https://htcf.wustl.edu/files/zdw23YMw
An extreme example:
https://htcf.wustl.edu/files/NeAvbNX2
So, if anyone has any input on 1) whether DESeq2 is appropriate for these various scenarios or 2) whether anything can be done up front to make it so, I'd really appreciate the advice!!
Thanks,
Matt
DESeq2 (or any method that estimates scaling factors from the observed data alone) can only work with what you give it. If all genes are up-regulated, no computational method can handle this, and I would also question why one would run such an experiment without some kind of spike-in control. I sometimes see investigators spike in RNA from another organism as an improvement over the ERCC spike-ins.
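To make the point about global up-regulation concrete, here is a minimal Python sketch (not DESeq2's actual code) of median-ratio size factors. The function name `median_ratio_size_factors`, the `control_rows` argument and the toy counts are all made up for illustration; `control_rows` only mimics the idea of restricting estimation to control features, as the `controlGenes` argument of `estimateSizeFactors` does in DESeq2.

```python
import math
from statistics import median

def median_ratio_size_factors(counts, control_rows=None):
    """Median-of-ratios size factors (illustrative sketch only).

    counts: list of rows (one per gene), each a list of per-sample counts.
    control_rows: optional row indices (e.g. spike-ins) to estimate from.
    """
    n_samples = len(counts[0])
    rows = counts if control_rows is None else [counts[i] for i in control_rows]
    # Drop genes with a zero in any sample: their log geometric mean is undefined.
    rows = [r for r in rows if all(c > 0 for c in r)]
    # Per-gene log geometric mean across samples (the pseudo-reference).
    log_geo = [sum(math.log(c) for c in r) / n_samples for r in rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(r[j]) - g for r, g in zip(rows, log_geo)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical data: rows 0-3 are endogenous genes, rows 4-5 are spike-ins.
# Every endogenous gene is doubled in sample B; the spike-ins are unchanged.
counts = [
    [100, 200],
    [200, 400],
    [50, 100],
    [400, 800],
    [30, 30],   # spike-in
    [60, 60],   # spike-in
]

sf_all = median_ratio_size_factors(counts)
sf_ctrl = median_ratio_size_factors(counts, control_rows=[4, 5])

def lfc(row, sf):
    # log2 fold change B vs A after dividing each count by its size factor.
    return math.log2((row[1] / sf[1]) / (row[0] / sf[0]))

lfc_all = lfc(counts[0], sf_all)    # global shift absorbed by the size factors
lfc_ctrl = lfc(counts[0], sf_ctrl)  # global shift preserved via the spike-ins
print(sf_all, sf_ctrl, lfc_all, lfc_ctrl)
```

With size factors estimated from all rows, the two-fold shift is treated like a two-fold difference in sequencing depth and the endogenous fold change disappears; only the spike-in-based factors keep it visible.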
I'm not sure how the last plot is even possible. Did you specify certain control genes? DESeq2 (using median ratio normalization) will tend to center the LFCs on the y=0 line.
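As a rough illustration of why median ratio normalization centers the LFCs, the Python sketch below computes size factors for two samples and compares the median log2 ratio before and after normalization. The helper `size_factors` and the counts are invented for this example and are not taken from the plots above or from DESeq2 itself.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of per-gene rows."""
    n = len(counts[0])
    # Genes with any zero count are skipped (log of zero is undefined).
    rows = [r for r in counts if all(c > 0 for c in r)]
    log_geo = [sum(math.log(c) for c in r) / n for r in rows]
    return [
        math.exp(median(math.log(r[j]) - g for r, g in zip(rows, log_geo)))
        for j in range(n)
    ]

# Hypothetical counts for samples A and B: B is sequenced about 1.5x deeper,
# one gene is strongly up in B and one gene is down in B.
counts = [
    [100, 150],
    [80, 120],
    [200, 300],
    [50, 75],
    [400, 600],
    [60, 480],  # up in B
    [30, 15],   # down in B
]

sf = size_factors(counts)
raw_lfc = [math.log2(b / a) for a, b in counts]
norm_lfc = [math.log2((b / sf[1]) / (a / sf[0])) for a, b in counts]

raw_median = median(raw_lfc)    # equals log2(1.5): the depth difference
norm_median = median(norm_lfc)  # zero up to floating-point error
print(raw_median, norm_median)
```

Because the median is what gets pinned to zero, the bulk of the cloud follows it: if more than half of the genes shift in the same direction, they are the ones centered on y=0, which is why an MA plot whose bulk sits far from zero is surprising.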
Yes, it seems fine.