Question

Normalization for small non-coding RNA and multi-factor design

0

Entering edit mode

Konstantinos Yeles ▴ 90

@konstantinos-yeles-8961

Last seen 7 months ago

Italy

Dear Bioconductor,

I would like to ask about the normalization method I should follow regarding the issue of technical variation that a treatment could introduce on top of the extraction of RNA from different components of the cell.

Experimental design:

> samples
                   group batch treatment
Sample1       component1     1   treated
Sample2       component1     1 untreated
Sample3       component1     2   treated
Sample4       component1     2 untreated
Sample5       component1     3   treated
Sample6       component1     3 untreated
Sample7       component2     1   treated
Sample8       component2     1 untreated
Sample9       component2     2   treated
Sample10      component2     2 untreated
Sample11      component2     3   treated
Sample12      component2     3 untreated
Sample13           total     4   treated
Sample14           total     4 untreated
Sample15           total     5   treated
Sample16           total     5 untreated
Sample17           total     6   treated
Sample18           total     6 untreated

Additionally, various kinds of spike-ins (specific modified and non-modified to the treatment) were used to evaluate the treatment. The specific modified / non modified were inserted before treatment and miRNA spike-ins after.

Following "RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR" and A: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-Inin order to assess if using spike-ins for normalization would be informative I performed normalization with edgeR and limma-voom (quantile or spike-ins), boxplots:

code used:

cpm <- cpm(x)
keep.exprs <- rowSums(cpm>1)>=3
x1 <- x[keep.exprs,,keep.lib.sizes=FALSE]
x1_TMM <- calcNormFactors(x1, method = "TMM")
x1_RLE <- calcNormFactors(x1, method = "RLE")
x1_uQ <- calcNormFactors(x1, method = "upperquartile")
x1_TMMwzp <- calcNormFactors(x1, method = "TMMwzp")

# unormalized
lcpm <- cpm(x1,log=TRUE,prior.count=5)
boxplot(lcpm, las=2, col=col, main="")
title(main="A. Unnormalized data",ylab="Log-cpm")

#normalized
#TMM
lcpm_TMM <- cpm(x1_TMM,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_TMM, las=2, col=col, main="")
title(main="B. TMM Normalized data",ylab="Log-cpm")

#TMMwzp
lcpm_TMMwzp <- cpm(x1_TMMwzp,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_TMMwzp, las=2, col=col, main="")
title(main="C. TMMwzp Normalized data",ylab="Log-cpm")

#RLE
lcpm_RLE <- cpm(x1_RLE,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_RLE, las=2, col=col, main="")
title(main="D. RLE Normalized data",ylab="Log-cpm")

#upper Quartile
lcpm_uQ <- cpm(x1_uQ,normalized.lib.sizes = TRUE, log=TRUE, prior.count=5)
boxplot(lcpm_uQ, las=2, col=col, main="")
title(main="E. Upper Quartile Normalized data",ylab="Log-cpm")

#limma voom quantile
design <- model.matrix(~x1$samples$group+x1$samples$treatment)`
x1_voom <- voom(x1,design = design,normalize.method = "quantile")

N <- colSums(x1$counts)
#all spike-ins
nf <- calcNormFactors(spikeinsdf,lib.size = N)
spike_in_voom_all_spike_ins <- voom(x1,design,lib.size = N*nf)
boxplot(spike_in_voom_all_spike_ins$E, las=2, col=col, main="")
title(main="G1. voom-all-spike-ins Normalised data",ylab="Log-cpm")
.
.
.
`

We want to do differential expression between treated, non-treated samples for each cell component but also between components:

total treated vs total non-treated, component1 treated vs component1 non-treated, component1 treated vs component2 treated...

Concerning the differential expression between component1-component2, there is a technical issue in which the proportion of RNA amount isolated from 10 million cells differs. Even if we start from the same amount of RNA, final amount of the isolated RNA is higher in the component1 than in the component2. For library preparation 1μg of starting RNA was used but in terms of concentration, because of the small dimension of the component2 it would have been better to use only 0.1μg RNA, but this was not possible because of the library preperation protocol. In terms of "percentages", it is like having RNA from 1 cellular component1 and ~5 from cellular component2.

Should I take in consideration this issue and provide an adjustment variable in the normalisation step or after?
Regarding the boxplots, it seems that limma voom transformation with quantile normalization relatively performs better than the others. So, in your opinion should I proceed with this normalisation for downstream DE analysis?
In the afformentioned code, is this the correct way to create the model matrix? Between treatment and cell component do I have to use duplicate Correlation or blocking? Another way of to create the model matrix:
```
design <- model.matrix(~0+batch+group:treatment)
```

limma voom transformation with TMM, RLE, upperQuartile normalization

edger limma voom normalization limma • 2.2k views

ADD COMMENT • link updated 7.2 years ago by James W. MacDonald 68k • written 7.2 years ago by Konstantinos Yeles ▴ 90

score 2 · Answer 1 · 2018-12-11

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

Your question could be re-stated (uncharitably, perhaps) as 'How should I analyze my data?', which is in general beyond the scope of this support page, which is intended primarily to help people with technical aspects of using Bioconductor software.

Do note that anybody can post an answer on this site, so long as they took the time to register, and you are thus outsourcing the analysis of your data to random internet people that you don't know, and who may or may not know what they are talking about. This is not a winning strategy. Analyzing data for which the conventional normalization methods may not apply is a non-trivial task, and anybody who posts on this list who has the ability to help will almost surely not respond, because it's not really possible to give reasonable advice without having seen the data firsthand. So any advice you might get is likely to be worth exactly what you paid for it.

Your best bet is to find somebody local (or not local, for that matter) who has the required expertise to do the analysis for you. Failing that, you will need to gain the necessary expertise yourself.

ADD COMMENT • link 7.2 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James,

thank you for the answer and your suggestions on this matter. Sorry for the delayed reply.

My question is not directly linked to the Scopus of this group, but I have to "gain the necessary expertise by myself". As I couldn't find any publication regarding the issue of normalization between components of the cell (there are, but they don't describe anything about the normalization part (between components) I don't know if they didn't take it into account or performed another protocol for sequencing) I made this question to the experts of Bioinformatics/Biostatistics community I trust more.

If I have to restate my question then :

1) I created the design matrix following limma user guide: 9.7 Multi-level Experiments and the answer of this Q Limma Design for Paired Samples Question

I consider of using also

keep <- filterByExpr(dge, Blocked_Design_model)
dge <- dge[keep,,keep.lib.sizes=FALSE]

as Gordon Smyth kindly suggests in this Q LIMMA : group specific effect of a treatment while controlling for individual effect

my updated design matrix using TMM normalization and voom (this part remains an issue with the normalization between components and treated-nontreated) :

Treat <- factor(paste(x1_TMM$samples$group,substr(x1$samples$treat,1,3),sep="."))
design <- model.matrix(~0 + Treat)
colnames(design) <- levels(Treat)
rownames(design) <- colnames(x1$counts)



voom_TMM <- voom(x1_TMM,design = design,plot = TRUE)
corfit <- duplicateCorrelation(voom_TMM$E,design,block = voom_TMM$targets$batch)
fit <- lmFit(voom_TMM$E,design,block = voom_TMM$targets$batch,correlation = corfit$consensus.correlation)
cm <- makeContrasts(
  TrevsUntComp1 = 

component1.tre-component1.unt,

  TrevsUntComp2 = 

component2.tre-component2.unt,

  TrevsUntTot = tot.tre-tot.unt,
  TreComp1vsComp2 = 

component1.tre-component2.tre,

  UntComp1vsComp2 = 

component1.unt-component2.unt,

  levels = design
)
fit2 <- contrasts.fit(fit,cm)
fit2 <- eBayes(fit2,trend = FALSE,robust = TRUE)
Comp1_TvsU <- topTable(fit2,coef = " TrevsUntComp1",
                         number = nrow(fit2),adjust.method = "fdr",
                         sort.by = "p")

Is this correct? Do I have to consider another factor regarding samples from the total that are not paired?

2) I included the values of spike-ins in the final expression matrix as "controls" for DE. (I have read the posts/pubs about the problems regarding spike-ins, A: DESeq and Limma+Voom Normalization for Rna-Seq Data Using Ercc Spike-In, https://support.bioconductor.org/p/63339/#63340)
I found all of them DE with 3<logFC<5 and adj.p<0.0005 in the component1 treated vs non-treated,

same for component2 treated vs non-treated.

In the total treated vs non-treated only 4 were found DE with adj.p<0.05 and 1.9<LogFC<2.

In treated component1 vs 2 and non-treated comp1 vs comp2, I didn't find any. (In this one is the issue with normalisation between components)

From the results mentioned above, I consider of not doing any DE analysis between treated/non-treated samples (for several reasons).
But If I want to use a normalization factor on top of TMM normalisation before using voom I have to apply it in the offsets? (and more correctly but it's beyond the scope of this support page, Should I? is it correct to apply a correction on top of already normalized data? I have read in different posts that it is not recommended to use a second normalization, for example TMM normalized and then with voom, quantile normalization)

3) If I set a cutoff for each comparison regarding the ranges of logFC of DE spikes-ins, every feature found DE with |logFC(feature)|>|logFC(spike-ins)| could be more "trusted" result of true positive hit or my naiveness leads me to false assumptions?

Best regards
Konstantinos

ADD REPLY • link 7.1 years ago Konstantinos Yeles ▴ 90

1

Entering edit mode

1) No (don't subset to $E, repeat voom and duplicateCorrelation), and no.

2) No, scaling normalization is not additive. It is not valid to apply one set of normalization factors on top of another.

3) No, it doesn't make much sense to compare log-fold changes of genes to spike-ins.

You should think about what the spike-ins are, and what you want them to do.

If you are lucky, your experimentalist has added spike-in RNA to each sample proportional to the number of cells, i.e., at a constant concentration of spike-in/cell to each sample. This allows the spike-ins to capture differences in total RNA content between groups, assuming the addition was done accurately. Scaling normalization based on the spike-ins will preserve differences in RNA content. In contrast, normalization methods based on a non-DE majority of genes (including TMM) will remove such differences, which may not be desirable when studying cellular processes in which RNA content changes, e.g., cell cycle, certain differentiation/activation/maturation events.
The more common experimental scenario is to add spike-in RNA to each sample proportional to the quantity of endogenous mRNA. I'm not sure what this is meant to do. In such cases, one would only be able to use the spike-ins for normalization if there were no composition biases between samples, at which point one might as well use library size normalization. And of course, one hopes that the initial (m)RNA quantification was accurate, and I am told that this is often difficult due to contamination of the UV spectrum by various cellular bits and pieces.

If you went for the first option; if you have major differences in total RNA content between compartments; and if you used TMM normalization prior to a DE analysis; your spike-ins will always show up as "DE", simply due to composition effects. If you don't care about total RNA content, the spike-ins are more or less useless. Well, for DE analyses at least - you can still use them for modelling technical variability, as we often do for scRNA-seq data analyses.

ADD REPLY • link 7.1 years ago Aaron Lun ★ 29k

0

Entering edit mode

Dear Aaron,

thank you for your answer!! Unfortunately, I just discussed with the experimentalist; the spike-ins haven't been added to each sample proportional to the number of cells. At least now I could suggest that for future experiments.
But with this data, is it correct to proceed with the workflow mentioned above?

ADD REPLY • link 7.1 years ago Konstantinos Yeles ▴ 90

0

Entering edit mode

design

component1.tre component1.unt component2.tre component2.unt total.tre total.unt
sample1               1              0              0              0         0         0
sample2               0              1              0              0         0         0
sample3               1              0              0              0         0         0
sample4               0              1              0              0         0         0
sample5               1              0              0              0         0         0
sample6               0              1              0              0         0         0
sample7               0              0              1              0         0         0
sample8               0              0              0              1         0         0
sample9               0              0              1              0         0         0
sample10              0              0              0              1         0         0
sample11              0              0              1              0         0         0
sample12              0              0              0              1         0         0
sample13              0              0              0              0         1         0
sample14              0              0              0              0         0         1
sample15              0              0              0              0         1         0
sample16              0              0              0              0         0         1
sample17              0              0              0              0         1         0
sample18              0              0              0              0         0         1
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`Treat`
[1] "contr.treatment"

ADD REPLY • link 7.1 years ago Konstantinos Yeles ▴ 90