When examining the most variable genes for a clinical cancer dataset generated in batches on different days, I notice that the genes are mainly ones located on the X and Y chromosomes, such as XIST, UTY, and ZFY. Is it common practice to subset the matrix to remove these before making MDS plots, as they may mask more subtle variations between RNA-seq experiments done on different days? The edgeR workflow uses a dataset of all female mouse samples, so it doesn't have such an issue, and the vignette's oral squamous cell carcinoma dataset doesn't make any clinical details public. Is using only a set of widely accepted housekeeping genes a better approach than a set of the most variable ones?
Is there a reason why you can't run removeBatchEffect on the log-CPMs prior to visualization, to eliminate the batch effect? This would solve the problem more directly than trying to pick genes to discard. In your specific case, it sounds like you have a sex effect, which may or may not be related to the batch structure.
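To make the idea concrete, here is a minimal NumPy sketch of what removing a batch term from a log-expression matrix amounts to: regress each gene on a sum-to-zero coded batch factor and subtract the fitted batch component. This is not limma's code; the function name `remove_batch_sketch` and the toy matrix are made up for illustration, and the sketch assumes there are no other conditions that need to be protected.

```python
import numpy as np

def remove_batch_sketch(log_expr, batch):
    """Subtract a fitted batch term from a genes x samples log-expression matrix.

    Hypothetical helper for illustration only. `batch` holds one label per sample.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    levels, codes = np.unique(np.asarray(batch), return_inverse=True)
    n_samples = log_expr.shape[1]
    n_levels = len(levels)

    # Sum-to-zero coding: one column per batch except the last,
    # and the last batch is coded as -1 in every column.
    B = np.zeros((n_samples, n_levels - 1))
    for j in range(n_levels - 1):
        B[codes == j, j] = 1.0
    B[codes == n_levels - 1, :] = -1.0

    # Per-gene least squares fit of intercept + batch columns.
    X = np.column_stack([np.ones(n_samples), B])
    coef, *_ = np.linalg.lstsq(X, log_expr.T, rcond=None)

    # Remove only the batch part; the intercept (overall level) is kept.
    return log_expr - (B @ coef[1:]).T

# Toy data (assumed): gene 0 differs between batches, gene 1 does not.
toy = np.array([[1.0, 1.0, 3.0, 3.0],
                [5.0, 6.0, 5.0, 6.0]])
toy_batch = ["day1", "day1", "day2", "day2"]
corrected = remove_batch_sketch(toy, toy_batch)
```

In limma itself, removeBatchEffect also accepts a `design` argument describing the conditions you want to preserve, which this sketch leaves out. The corrected values are intended for visualization such as MDS plots.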
If blocking is not possible, I often remove highly variable genes that are not biologically interesting. This includes X/Y-chromosome genes in the presence of sex effects, variable immunoglobulin segments when studying B cells, and ribosomal proteins strongly affected by technical differences in library preparation. As long as the removal can be justified (by some other reason than "it was highly variable"), I don't see a problem.
I don't see the benefit of using a set of housekeeping genes for visualizing differences between samples. You'd just end up with a big homogeneous clump of samples in the middle of the MDS plot, without capturing any of the biological structure present in the expression profiles of DE genes.
Yes, it would make sense to filter out the sex-linked genes to see more clearly what the batch effect looks like with the sex imbalance removed.
In our practice, we frequently filter out sex-linked genes when (i) the experimental conditions are not themselves sex-linked in their effects but (ii) both male and female samples are included in the study. We do this for the whole analysis, not just temporarily for the MDS plot. Our working definition of sex-linked genes is XIST plus the Y chromosome.
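As a rough illustration of that working definition (XIST plus the Y chromosome), the filter could look something like the NumPy sketch below. The function name `drop_sex_linked`, the annotation vectors, and the chromosome labels ("Y" or "chrY") are assumptions; real annotation would come from whatever gene table accompanies your counts.

```python
import numpy as np

def drop_sex_linked(log_expr, symbols, chromosomes):
    """Drop XIST and all Y-chromosome genes from a genes x samples matrix.

    Hypothetical helper for illustration only. `symbols` and `chromosomes`
    hold one entry per row of `log_expr`.
    """
    log_expr = np.asarray(log_expr)
    symbols = np.asarray(symbols)
    chromosomes = np.asarray(chromosomes)

    # Working definition from the answer above: XIST plus the Y chromosome.
    sex_linked = (symbols == "XIST") | np.isin(chromosomes, ["Y", "chrY"])
    keep = ~sex_linked
    return log_expr[keep], symbols[keep]

# Toy annotation (assumed) with five genes and three samples.
toy_symbols = ["XIST", "UTY", "ZFY", "ACTB", "TP53"]
toy_chroms = ["X", "Y", "Y", "7", "17"]
toy_expr = np.arange(15, dtype=float).reshape(5, 3)

filtered_expr, filtered_symbols = drop_sex_linked(toy_expr, toy_symbols, toy_chroms)
```

Applying the filter once, before any downstream steps, matches the point that the genes are removed for the whole analysis rather than only for the MDS plot.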
Including sex as a predictor in the design matrix is the alternative, but removing sex-linked genes often works better for small-scale studies because the sex-linked genes are small in number and well defined.
In your case you have enough samples to take either approach, but I would be tempted to remove the sex-linked genes.
Like Aaron, I don't see any purpose in restricting to housekeeping genes. That would seem to defeat the purpose of the MDS plot.