Dear list,
I am analyzing an RNA-seq experiment with a relatively simple design: 4 treated samples vs. 4 controls. Each sample is a population of T-cells from a single mouse, isolated using FACS. There is no obvious batch effect (mice were raised together and cells were sorted on the same day). However the number of cells isolated for each sample was low (from 578 to 4068) so a low input library preparation kit (Clontech SMART-Seq v4) had to be used.
Unfortunately the number of cells used for library prep for samples in the treated condition was consistently lower than for samples in the control group (respectively ranging from 578 to 1468, vs. from 1775 to 4068 cells for controls). As far as I can tell, this probably had some influence on the dataset:
On a PCA I observe that the treated samples have a larger heterogeneity than control samples which group nicely on the first PCs.
The lower the number of cells used as input, the outer the treated samples lie on a hierarchical clustering tree (the control samples form a nice group, and the treated samples are branching out one after the other)
The number of detected genes is almost perfectly correlated to the number of cells used as input. In summary, it seems to me that the complexity of the samples is lower for samples with lower number of cells used as input (maybe the capture rate was lower for samples with lower input?). Of note, no other proxy (RIN score, RNA concentration, etc) is available, because the concentrations of the RNA extractions were too low to be measured on a Bioanalyzer pico chip.
I am tempted to include the number of cells used (or rather the log10 of the number of cells) as a numerical covariate in the design matrix. Of course because this is confounded with the treatment effect, I am expecting some true effects to be cancelled out, especially when (by chance?) the expression levels of a gene across samples follow the ranking of the covariate. Hopefully some true signal should remain, and there should be genes for which, even if there is a difference across groups, the expression levels across samples of each group should not follow the ranking of the covariate. What do you think?
Last question: when using the voomWithQualityWeights
approach, I observe that the sample weights also perfectly correlate with the covariate. In this context does it make sense to still include the number of cells as a covariate in the model? This previous post on the list seems to support it (https://support.bioconductor.org/p/105601/ ), but any advice is appreciated!
Best, Julien
Thank you Matthew for the answer. Indeed when looking at the PCA the second approach might be more appropriate.
I am still a bit worried that the weights of samples of one condition are all lower than the weights of samples from the other condition. Can this introduce some bias in the analysis? Or it just results in a loss of power?