I'm working on an RNA-seq differential expression analysis of human samples, and have been trying to understand the effect of the various packages that allow weighting of samples or features - e.g. voom's sample quality weights, or edgeR-robust's implementation of observation weights.
I'm doing this because I'm particularly concerned about the effect of extreme outliers in my analysis - i.e. samples whose counts for a gene differ by 2 orders of magnitude from the average across the entire sample set. Based on previous differential analyses of this dataset, I'm worried that a large proportion of the significant genes are being called significant because of just 2-3 of these extreme outlier samples. (The dataset consists of 48 samples, divided into 4 groups.)
I'm also wondering about batch effects in my data, so I've tried out the sva package as well, which - from my understanding - uses a similar sample weighting approach to account for confounding variables.
My question is: given that the weights (of all types) are generally used in the fitting of the GLM to call differential genes, is it reasonable to include the weights when plotting a PCA or MDS, to see the effect of the weights on the sample group separation?
To my mind, if the data are going to be weighted before the actual differential analysis anyway, then it should be just as valid to include the weights at the MDS/PCA stage, rather than running the MDS/PCA on the unweighted count data.
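To make concrete what I mean by "including the weights", here is a minimal sketch of a PCA with one weight per sample. It is written in Python/NumPy only to keep it package-free. It is not what voom, edgeR-robust or sva do internally, and the function name `weighted_pca`, the toy data and the weight values are all made up for illustration.

```python
import numpy as np

def weighted_pca(logexpr, sample_weights, n_components=2):
    """PCA of a samples x genes matrix using one non-negative weight per sample.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(logexpr, dtype=float)
    w = np.asarray(sample_weights, dtype=float)
    w = w / w.sum()                       # normalise weights to sum to 1
    mu = w @ X                            # weighted mean of each gene
    Xc = X - mu                           # centre each gene on its weighted mean
    # Scaling rows by sqrt(w) makes the SVD diagonalise the weighted covariance.
    _, s, Vt = np.linalg.svd(Xc * np.sqrt(w)[:, None], full_matrices=False)
    scores = Xc @ Vt[:n_components].T     # sample coordinates to plot
    explained = s**2 / np.sum(s**2)       # fraction of weighted variance per PC
    return scores, explained[:n_components]

# Toy data (assumed): 48 samples x 200 genes of log-expression,
# with sample 0 shifted to act as an extreme outlier.
rng = np.random.default_rng(0)
logexpr = rng.normal(size=(48, 200))
logexpr[0] += 10.0

weights = np.ones(48)
weights[0] = 0.05                         # assumed low quality weight for the outlier

scores_u, var_u = weighted_pca(logexpr, np.ones(48))   # equal weights = ordinary PCA
scores_w, var_w = weighted_pca(logexpr, weights)
print("PC1 variance fraction, unweighted:", var_u[0])
print("PC1 variance fraction, weighted:  ", var_w[0])
```

In this toy example the down-weighted outlier contributes much less to the first component, which is the kind of effect I would like to visualise on the real data.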
Thanks in advance for any suggestions and insights into the statistics!