Dear Bioconductor,
I would like to ask about the normalization method I should follow regarding the issue of technical variation that a treatment could introduce on top of the extraction of RNA from different components of the cell.
Experimental design:
> samples group batch treatment Sample1 component1 1 treated Sample2 component1 1 untreated Sample3 component1 2 treated Sample4 component1 2 untreated Sample5 component1 3 treated Sample6 component1 3 untreated Sample7 component2 1 treated Sample8 component2 1 untreated Sample9 component2 2 treated Sample10 component2 2 untreated Sample11 component2 3 treated Sample12 component2 3 untreated Sample13 total 4 treated Sample14 total 4 untreated Sample15 total 5 treated Sample16 total 5 untreated Sample17 total 6 treated Sample18 total 6 untreated
Additionally, various kinds of spikeins (specific modified and nonmodified to the treatment) were used to evaluate the treatment. The specific modified / non modified were inserted before treatment and miRNA spikeins after.
Following "RNAseq analysis is easy as 123 with limma, Glimma and edgeR" and A: DESeq and Limma+Voom Normalization for RnaSeq Data Using Ercc SpikeInin order to assess if using spikeins for normalization would be informative I performed normalization with edgeR and limmavoom (quantile or spikeins), boxplots:
code used:
cpm < cpm(x)
keep.exprs < rowSums(cpm>1)>=3
x1 < x[keep.exprs,,keep.lib.sizes=FALSE]
x1_TMM < calcNormFactors(x1, method = "TMM")
x1_RLE < calcNormFactors(x1, method = "RLE")
x1_uQ < calcNormFactors(x1, method = "upperquartile")
x1_TMMwzp < calcNormFactors(x1, method = "TMMwzp")
# unormalized
lcpm < cpm(x1,log=TRUE,prior.count=5)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Unnormalized data",ylab="Logcpm")
#normalized
#TMM
lcpm_TMM < cpm(x1_TMM,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_TMM, las=2, col=col, main="")
title(main="B. TMM Normalized data",ylab="Logcpm")
#TMMwzp
lcpm_TMMwzp < cpm(x1_TMMwzp,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_TMMwzp, las=2, col=col, main="")
title(main="C. TMMwzp Normalized data",ylab="Logcpm")
#RLE
lcpm_RLE < cpm(x1_RLE,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_RLE, las=2, col=col, main="")
title(main="D. RLE Normalized data",ylab="Logcpm")
#upper Quartile
lcpm_uQ < cpm(x1_uQ,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_uQ, las=2, col=col, main="")
title(main="E. Upper Quartile Normalized data",ylab="Logcpm")
#limma voom quantile
design < model.matrix(~x1$samples$group+x1$samples$treatment)`
x1_voom < voom(x1,design = design,normalize.method = "quantile")
N < colSums(x1$counts)
#all spikeins
nf < calcNormFactors(spikeinsdf,lib.size = N)
spike_in_voom_all_spike_ins < voom(x1,design,lib.size = N*nf)
boxplot(spike_in_voom_all_spike_ins$E, las=2, col=col, main="")
title(main="G1. voomallspikeins Normalised data",ylab="Logcpm")
.
.
.
`
We want to do differential expression between treated, nontreated samples for each cell component but also between components:
total treated vs total nontreated, component1 treated vs component1 nontreated, component1 treated vs component2 treated...
Concerning the differential expression between component1component2, there is a technical issue in which the proportion of RNA amount isolated from 10 million cells differs. Even if we start from the same amount of RNA, final amount of the isolated RNA is higher in the component1 than in the component2. For library preparation 1μg of starting RNA was used but in terms of concentration, because of the small dimension of the component2 it would have been better to use only 0.1μg RNA, but this was not possible because of the library preperation protocol. In terms of "percentages", it is like having RNA from 1 cellular component1 and ~5 from cellular component2.
 Should I take in consideration this issue and provide an adjustment variable in the normalisation step or after?
 Regarding the boxplots, it seems that limma voom transformation with quantile normalization relatively performs better than the others. So, in your opinion should I proceed with this normalisation for downstream DE analysis?

In the afformentioned code, is this the correct way to create the model matrix? Between treatment and cell component do I have to use duplicate Correlation or blocking? Another way of to create the model matrix:
design < model.matrix(~0+batch+group:treatment)
limma voom transformation with TMM, RLE, upperQuartile normalization