I have an RNA-seq dataset where the largest source of variation appears to be due to library size (principal component 1, which captures ~55% of the variance correlates with library size,B), even after TMM normalisation. Library sizes ranged from 13,481,192 to 22,737,049 and was spread relatively evenly between experimental groups (Top graph).
We used RUVSeq to try remove this unwanted variation. We used RUVg setting k = 1 and using the 5000 least DE genes in an initial DE analysis using
edgeR as the negative control genes (as described in the RUVseq vignette). This generated a W1 term which tracked with PC1 and library size (C & D). PCA on the RUVseq normalised counts (F) revealed that this artefact was gone (G). We then continued with differential gene expression analysis including the W1 covariate in the design matrix.
Is there something we are missing here about why this isn't a valid way to do this? A reviewer of the manuscript has voiced concerns.