Question

voomWithQualityWeights together with covariate in design matrix

0

Entering edit mode

Julien Roux ▴ 90

@julien-roux-2710

Last seen 6.6 years ago

Switzerland/Basel/University of Basel

Dear list,

I am analyzing an RNA-seq experiment with a relatively simple design: 4 treated samples vs. 4 controls. Each sample is a population of T-cells from a single mouse, isolated using FACS. There is no obvious batch effect (mice were raised together and cells were sorted on the same day). However the number of cells isolated for each sample was low (from 578 to 4068) so a low input library preparation kit (Clontech SMART-Seq v4) had to be used.

Unfortunately the number of cells used for library prep for samples in the treated condition was consistently lower than for samples in the control group (respectively ranging from 578 to 1468, vs. from 1775 to 4068 cells for controls). As far as I can tell, this probably had some influence on the dataset:

On a PCA I observe that the treated samples have a larger heterogeneity than control samples which group nicely on the first PCs.
The lower the number of cells used as input, the outer the treated samples lie on a hierarchical clustering tree (the control samples form a nice group, and the treated samples are branching out one after the other)
The number of detected genes is almost perfectly correlated to the number of cells used as input. In summary, it seems to me that the complexity of the samples is lower for samples with lower number of cells used as input (maybe the capture rate was lower for samples with lower input?). Of note, no other proxy (RIN score, RNA concentration, etc) is available, because the concentrations of the RNA extractions were too low to be measured on a Bioanalyzer pico chip.

I am tempted to include the number of cells used (or rather the log10 of the number of cells) as a numerical covariate in the design matrix. Of course because this is confounded with the treatment effect, I am expecting some true effects to be cancelled out, especially when (by chance?) the expression levels of a gene across samples follow the ranking of the covariate. Hopefully some true signal should remain, and there should be genes for which, even if there is a difference across groups, the expression levels across samples of each group should not follow the ranking of the covariate. What do you think?

Last question: when using the voomWithQualityWeights approach, I observe that the sample weights also perfectly correlate with the covariate. In this context does it make sense to still include the number of cells as a covariate in the model? This previous post on the list seems to support it (https://support.bioconductor.org/p/105601/ ), but any advice is appreciated!

Best, Julien

limma voom RNA-seq voomWithQualityWeights • 1.5k views

ADD COMMENT • link updated 6.6 years ago by Matthew Ritchie ▴ 1000 • written 6.6 years ago by Julien Roux ▴ 90

score 2 · Answer 1 · 2019-06-07

If the MDS / PCA plot from your experiment shows systematic separation of samples based on cell number (e.g. PC1 or PC2 shows a gradient of samples with high cell number on one side of the plot through to low cell number on the other side - try colour coding samples by cell number to make this easier to visualise) then you could certainly include this as an effect in the design matrix (i.e. cell number has a mean effect on gene expression).

But if it's not systematic, then perhaps cell number effects the variance, in which case the approach of voomWithQualityWeights may be more appropriate. For some simple examples of what you might look for in an MDS plot to motivate use of voomWithQualityWeights, see Figures 2A and 2B of Liu et al. Genom Data. 2016, 7: 144–147.