I would like to ask about the normalization method I should follow regarding the issue of technical variation that a treatment could introduce on top of the extraction of RNA from different components of the cell.
> samples group batch treatment Sample1 component1 1 treated Sample2 component1 1 untreated Sample3 component1 2 treated Sample4 component1 2 untreated Sample5 component1 3 treated Sample6 component1 3 untreated Sample7 component2 1 treated Sample8 component2 1 untreated Sample9 component2 2 treated Sample10 component2 2 untreated Sample11 component2 3 treated Sample12 component2 3 untreated Sample13 total 4 treated Sample14 total 4 untreated Sample15 total 5 treated Sample16 total 5 untreated Sample17 total 6 treated Sample18 total 6 untreated
Additionally, various kinds of spike-ins (specific modified and non-modified to the treatment) were used to evaluate the treatment. The specific modified / non modified were inserted before treatment and miRNA spike-ins after.
Following "RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR" and A: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-Inin order to assess if using spike-ins for normalization would be informative I performed normalization with edgeR and limma-voom (quantile or spike-ins), boxplots:
cpm <- cpm(x) keep.exprs <- rowSums(cpm>1)>=3 x1 <- x[keep.exprs,,keep.lib.sizes=FALSE] x1_TMM <- calcNormFactors(x1, method = "TMM") x1_RLE <- calcNormFactors(x1, method = "RLE") x1_uQ <- calcNormFactors(x1, method = "upperquartile") x1_TMMwzp <- calcNormFactors(x1, method = "TMMwzp") # unormalized lcpm <- cpm(x1,log=TRUE,prior.count=5) boxplot(lcpm, las=2, col=col, main="") title(main="A. Unnormalized data",ylab="Log-cpm") #normalized #TMM lcpm_TMM <- cpm(x1_TMM,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5) boxplot(lcpm_TMM, las=2, col=col, main="") title(main="B. TMM Normalized data",ylab="Log-cpm") #TMMwzp lcpm_TMMwzp <- cpm(x1_TMMwzp,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5) boxplot(lcpm_TMMwzp, las=2, col=col, main="") title(main="C. TMMwzp Normalized data",ylab="Log-cpm") #RLE lcpm_RLE <- cpm(x1_RLE,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5) boxplot(lcpm_RLE, las=2, col=col, main="") title(main="D. RLE Normalized data",ylab="Log-cpm") #upper Quartile lcpm_uQ <- cpm(x1_uQ,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5) boxplot(lcpm_uQ, las=2, col=col, main="") title(main="E. Upper Quartile Normalized data",ylab="Log-cpm") #limma voom quantile design <- model.matrix(~x1$samples$group+x1$samples$treatment)` x1_voom <- voom(x1,design = design,normalize.method = "quantile") N <- colSums(x1$counts) #all spike-ins nf <- calcNormFactors(spikeinsdf,lib.size = N) spike_in_voom_all_spike_ins <- voom(x1,design,lib.size = N*nf) boxplot(spike_in_voom_all_spike_ins$E, las=2, col=col, main="") title(main="G1. voom-all-spike-ins Normalised data",ylab="Log-cpm") . . . `
We want to do differential expression between treated, non-treated samples for each cell component but also between components:
total treated vs total non-treated, component1 treated vs component1 non-treated, component1 treated vs component2 treated...
Concerning the differential expression between component1-component2, there is a technical issue in which the proportion of RNA amount isolated from 10 million cells differs. Even if we start from the same amount of RNA, final amount of the isolated RNA is higher in the component1 than in the component2. For library preparation 1μg of starting RNA was used but in terms of concentration, because of the small dimension of the component2 it would have been better to use only 0.1μg RNA, but this was not possible because of the library preperation protocol. In terms of "percentages", it is like having RNA from 1 cellular component1 and ~5 from cellular component2.
- Should I take in consideration this issue and provide an adjustment variable in the normalisation step or after?
- Regarding the boxplots, it seems that limma voom transformation with quantile normalization relatively performs better than the others. So, in your opinion should I proceed with this normalisation for downstream DE analysis?
In the afformentioned code, is this the correct way to create the model matrix? Between treatment and cell component do I have to use duplicate Correlation or blocking? Another way of to create the model matrix:
design <- model.matrix(~0+batch+group:treatment)