Question: limma: same data in different formats produces different DEG-analysis results
0
22 months ago by
littlefishes2010 wrote:

Background:

Sample Information: 8 groups of cells (cell1,cell2,cell3,cell4,cell5,cell6,cell7,cell8), 4 replicates each group, 32 samples in total.

Comparison we are interested in: cell1-vs-cell2, cell3-vs-cell4,cell5-vs-cell6,cell7-vs-cell8

*Pay attention: sample 1,2 and sample 3,4 in each sample are sequenced in different platform, in an order word, there are batch effect in the samples

I analysed the data twice, with the same operation but with different input. First, I inputed all the data of 32 samples. In the second analysis, I inputed the data only with cell7 and cell8 (8 samples in total). I want to get DEGs of cell7-vs-cell8. In the first time, I got 266 DEGs, while I got none in the second time. It confused me a lot, it shouldn't be the result like that.

I hope someone would offer me some ideas. From all of my heart, sincerely.

limma • 467 views
modified 22 months ago • written 22 months ago by littlefishes2010

How did you evaluate the comparison between cell7 and cell8 in the first analysis? Did you use constrasts?
How have you corrected the batch effect in the second analysis and in the first?

Thanks for quick response!

The answer to the first question is YES, I made the comparison by using the function makeContrasts.

I didn't remove the batch effect in both two analysis, instead, I specified the batch effect in the linear model, like design <- model.matrix(~0+group+batch)

The code of my twice analysis

The first analysis:

require(limma)
library(edgeR)
##read in count data and remove genes that are not expressed##
x <-DGEList(counts=rawx[,1:24],lib.size=colSums(rawx[,1:24]),genes=rownames(rawx),remove.zeros=TRUE)

##organising sample information##
samplenames <- substring(colnames(x),1,nchar(colnames(x)))
colnames(x) <- samplenames
group <- as.factor(c("Cell1","Cell1","Cell1","Cell1","Cell2","Cell2","Cell2","Cell2","Cell3","Cell3","Cell3","Cell3","Cell4","Cell4","Cell4","Cell4","Cell5","Cell5","Cell5","Cell5","Cell6","Cell6","Cell6","Cell6","Cell7","Cell7","Cell7","Cell7","Cell8","Cell8","Cell8","Cell8"))
x$samples$group <- group
batch <- as.factor(c("Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2"))
x$samples$batch <- batch

##data pre-processing#
cpm <- cpm(x)
lcpm <- cpm(x,log=TRUE)
keep.expres <- rowSums(cpm>1)>=4
x <- x[keep.expres,,keep.lib.sizes=FALSE]

##differential expression analysis##
design <- model.matrix(~0+group+batch)
colnames(design) <- gsub("group","",colnames(design))
contr.matrix <- makeContrasts(
Cell1vsCell2 = Cell1 – Cell2,
Cell3vsCell4 = Cell3 – Cell4,
Cell5vsCell6 = Cell5 – Cell6,
Cell7vsCell8 = Cell7 – Cell8,
levels = colnames(design))

v <- voom(x,design,plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
summary(decideTests(efit))
q()

The second analysis:

require(limma)
library(edgeR)
##read in count data and remove genes that are not expressed##
x <-DGEList(counts=rawx[,1:8],lib.size=colSums(rawx[,1:8]),genes=rownames(rawx),remove.zeros=TRUE)

##organising sample information##
samplenames <- substring(colnames(x),1,nchar(colnames(x)))
colnames(x) <- samplenames
group <- as.factor(c("Cell1","Cell1","Cell1","Cell1","Cell2","Cell2","Cell2","Cell2"))
x$samples$group <- group
batch <- as.factor(c("Batch1","Batch1","Batch2","Batch2","Batch1","Batch1","Batch2","Batch2"))
x$samples$batch <- batch

##data pre-processing#
cpm <- cpm(x)
lcpm <- cpm(x,log=TRUE)
keep.expres <- rowSums(cpm>1)>=4
x <- x[keep.expres,,keep.lib.sizes=FALSE]

##differential expression analysis##
design <- model.matrix(~0+group+batch)
colnames(design) <- gsub("group","",colnames(design))
contr.matrix <- makeContrasts(
Cell1vsCell2 = Cell1 – Cell2,
levels = colnames(design))

v <- voom(x,design,plot=TRUE,save.plot=TRUE)
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
summary(decideTests(efit))
q()
Answer: limma: same data in different formats produces different DEG-analysis results
3
22 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.3k wrote:

You didn't provide the code that you used for the analyses, so it's impossible to give you an exact answer, but the most likely reason for the discrepancy is that limma uses all samples in the model to estimate the variance, so with 32 samples, the variance estimate is much more precise than with only 8 samples. Since the variance is estimated more precisely, limma can more confidently call DEGs. Chances are that the logFC values in both analyses are very similar, as is the rank order of the genes by significance. With methods based on (generalized) linear models such as limma, edgeR, or DESeq2, it's recommended to always fit the full model with all samples included for this reason.

I can understand what you means. However, to be honest, I found this problem just because I have got the total 32 samples' data. The most important problem puzzles me is what happens if I have only the data of cell7 and cell8? I would find no difference in this comparison? Doesn't it sound so strange?

The codes are listed as follows.

3

With the full data, you have 24 - 9 = 15 degrees of freedom. With the subset of data, you have 8 - 3 = 5 degrees of freedom. This means that when you use all of the data, you're estimating the variances with much more precision. This is especially true when the prior degrees of freedom is low and there's not much information to share between genes.

As Ryan said, the precision of the variance estimate will have an effect on your downstream tests. If you don't have a precise estimate of the variance, you cannot be sure that a log-fold change is genuine or just due to random variability. This means that limma cannot give you a low p-value, as there is no strong evidence against the null hypothesis.

Obviously if you give limma more data, it will give you different (usually better) results; this is not particularly strange.

Answer: limma: same data in different formats produces different DEG-analysis results
1
22 months ago by
littlefishes2010 wrote:

"The hallmark of the limma approach is the use of linear models to analyse entire experiments as an integrated whole rather than making piece-meal comparisons between pairs of treatments. This has the effect of sharing information between samples.  Analysing the data as a whole also allows us to model correlations that may exist between samples due to repeated measures or other causes. This kind of analysis would not be feasible were the data partitioned into subsets and analysed as a series of pairwise comparisons."

-Cited from paper "limma powers differential expression analyses for RNA-sequencing and microarray studies"

It really helps if I used the dataset of all samples.

I thinks it means the sample from the same tissue, what happens if I want to figure out the difference in one same tissue with different phenotype while I have the data of many samples (such as 3 different samples)? Should I use all of them to fit the linear model?