Question

Limma-voom contrast confusion

0

Entering edit mode

gregory.l.stone ▴ 10

@gregorylstone-12225

Last seen 5.5 years ago

I am trying to use limma-voom to determine the effect of a condition between males and females. I have paired samples, and about 70 male and 45 females. At first I organized my data according to section 3.5 of the edgeR users manual (https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf). My data therefore looked as such:

sample	condition	sex	nested
sample1	pre	M	1
sample1	post	M	1
sample2	pre	M	2
sample2	post	M	2
sample70	pre	F	1
sample 70	post	F	1
...	...	...	...

My code looks as follows:

design = model.matrix(~ sex + sex:nested + sex:condition, data)

colnames(design) <- make.names(colnames(design), unique=TRUE)

v = voomWithQualityWeights(dge, design, plot=FALSE)

fit = lmFit(v, design)

#construct contrast matrix

p <- ncol(fit)
cont <- rep(0, p)
cont[p] <- 1
cont[p-1] <- -1

fit2 = contrasts.fit(fit, cont)

fit2 = eBayes(fit2)

results = decideTests(fit2)

> summary(results)
   [,1]
-1     0
0 21457
1      0

Getting 0 sig genes from the contrast seems unlikely because, when I go to see the genes from just the interaction between sex and condition I get the following:

fit_simple = eBayes(fit)
results_simple = decideTests(fit_simple)

summary(results_simple)

#just showing columns of interest

sexMale.conditionpost sexFemale.conditionpost
-1                  4500                    3859
0                   9812                   11541
1                   6819                    5731

Because getting no sig genes seemed suspicious I then tried to incorporate pairing using the duplicateCorrelation function instead of in the design matrix. I also simplified the design, so I did a separate run for male samples, for female samples, and for the interaction between sex and condition and the contrast between the two sexes. My data then looked as such (for just males and females I just had those samples in a table with the patient number starting at 1. Patient is a factor, not numeric):

sample	condition	sex	patient
sample1	pre	M	1
sample1	post	M	1
sample2	pre	M	2
sample2	post	M	2
sample70	pre	F	70
sample 70	post	F	70
...	...	...	...

The code for male samples (and female samples, just different count matrix and data table) looks like this:

dge_Male <- DGEList(countdata_filtered_Male)
dge_Male <- calcNormFactors(dge_Male)
design_Male = model.matrix(~ condition, colData_Male)
v_preCor_Male = voomWithQualityWeights(dge_Male, design_Male, plot=FALSE)
first_cor_Male = duplicateCorrelation(v_preCor_Male, design = design_Male, block = colData_Male$nested)
v_postCor_Male = voomWithQualityWeights(dge_Male, design_Male, block = colData_Male$nested, correlation = first_cor_Male$consensus, plot=FALSE)
second_cor_Male = duplicateCorrelation(v_postCor_Male, design = design_Male, block = colData_Male$nested)
fit_Male <- lmFit(v_postCor_Male, design_Male, block = colData_Male$nested, correlation = second_cor_Male$consensus)
fit_Male <- eBayes(fit_Male)
results_Male <- decideTests(fit_Male)

summary(results_Male)
   (Intercept) conditionpost
-1        6709          3572
0          575          8409
1        11315          6618

summary(results_Female)
   (Intercept) conditionpost
-1        5412          2429
0          929          8720
1         9624          4816

The code for the interaction and contrast is as follows:

dge_pairingCor <- DGEList(countdata_filtered_pairingCor)
dge_pairingCor <- calcNormFactors(dge_pairingCor)

design_pairingCor = model.matrix(~ sex + sex:condition, colData_pairingCor)
colnames(design_pairingCor) <- make.names(colnames(design_pairingCor), unique=TRUE)
v_preCor_pairingCor = voomWithQualityWeights(dge_pairingCor, design_pairingCor, plot=FALSE)
first_cor_pairingCor = duplicateCorrelation(v_preCor_pairingCor, design = design_pairingCor, block = colData_pairingCor$patient)
v_postCor_pairingCor = voomWithQualityWeights(dge_pairingCor, design_pairingCor, block = colData_pairingCor$patient, correlation = first_cor_pairingCor$consensus, plot=FALSE)
second_cor_pairingCor = duplicateCorrelation(v_postCor_pairingCor, design = design_pairingCor, block = colData_pairingCor$patient)
fit_pairingCor <- lmFit(v_postCor_pairingCor, design_pairingCor, block = colData_pairingCor$patient, correlation = second_cor_pairingCor$consensus)

p_pairingCor <- ncol(fit_pairingCor)
cont_pairingCor <- rep(0, p_pairingCor)
cont_pairingCor[p_pairingCor] <- 1
cont_pairingCor[p_pairingCor-1] <- -1

fit_pairingCor2 = contrasts.fit(fit_pairingCor, cont_pairingCor)
fit_pairingCor2 = eBayes(fit_pairingCor2)
results_pairingCor = decideTests(fit_pairingCor2)

#results for contrast

summary(results_pairingCor)
   [,1]
-1    45
0 21041
1     45

#results for interaction between sex and condition

fit_pairingCor_simple = eBayes(fit_pairingCor)
results_pairingCor_simple = decideTests(fit_pairingCor_simple)

summary(results_pairingCor_simple)
   X.Intercept. sexFemale sexMale.conditionpost sexFemale.conditionpost
-1         7628       260                  4694                    4022
0           476     20610                  8583                   10274
1         13027       261                  7854                    6835

I am confused about several things: 1) Why do I get different results when taking the pairing out of the design matrix and instead use duplicateCorrelation? 2) Which result should I trust? 3) Why are the results for the interaction term different from when I run just male and female samples? Shouldn't the results be the same?

I am clearly confused as to how best run this type of analysis. I apologize if I did not explain my problem clearly enough, and will do my best to clarify promptly. Any help and guidance would be greatly appreciated! Thank you!

limma limma voom limma design matrix limma contrast matrix limma paired analysis • 2.1k views

ADD COMMENT • link updated 7.2 years ago by Aaron Lun ★ 28k • written 7.2 years ago by gregory.l.stone ▴ 10

score 3 · Answer 1 · 2017-02-13

Why do I get different results when taking the pairing out of the design matrix and instead use duplicateCorrelation?

Because duplicateCorrelation is somewhat liberal (i.e., detects more false positives than it should). We use it because it enables us to do things that we can't do with standard blocking in a design matrix, e.g., when conditions are confounded with multiple batches. In your case, this is not a problem as you are not comparing expression values directly between patients of different sex. Thus, it is better to block on patient in the design matrix, which avoids making assumptions about the patient effect (e.g., homogeneous, no outliers or systematic differences) and constraining the correlation to a single consensus value across genes.

Which result should I trust?

The first result where you block on patient in the design matrix. If you want some genes for experimental validation, you should increase the FDR threshold (e.g., to 10%) rather than blindly shopping around for an analysis strategy that gives you a "better" result. Then, at least, you and your collaborators know where you stand with regard to the reliability of the gene list. The only extra thing I can suggest is to try turning on robust=TRUE in eBayes to see if outlier variances are distorting the shrinkage statistics.

Incidentally, you say that getting zero genes is unlikely, based on the numbers of genes for each interaction term by itself. I see nothing that supports your claim. You will obviously get fewer DE genes in the female condition because you have fewer samples and less power. Being unable to detect these genes doesn't mean that they're not DE to the same extent as in the male condition.

Why are the results for the interaction term different from when I run just male and female samples? Shouldn't the results be the same?

No, because different numbers of samples are used to estimate the gene-wise variance. When you subset the data, you're only using the male or female samples for estimation in each of the separate analyses. When you analyze all samples together, you get more residual d.f., more reliable variance estimates and thus greater power to detect differential expression.