Question

limma: treating multiple variables as random effects in duplicateCorrelation()

0

Entering edit mode

Kei Enomoto • 0

@kei-enomoto-20585

Last seen 5.8 years ago

I’m working on the identification of the genes affecting the efficacy of an anti-cancer drug, which targets specific protein expressed on the surface of the tumor cells. I’m trying to identify such genes because the efficacy cannot be fully correlated or explained by the expression level of the target gene. I have RNA-seq data and efficacy data taken from several samples. I show you a part of sample annotations bellow.

> head(sample.annotation, 20)
     cellLine    tumorType efficacy targetExpr bioRep
1  cellLine_A       Breast     40.9      211.0      1
2  cellLine_A       Breast     40.9      211.0      2
3  cellLine_B       Breast     33.9      107.0      1
4  cellLine_B       Breast     33.9      107.0      2
5  cellLine_C  Endometrium     91.1      280.6      1
6  cellLine_C  Endometrium     91.1      280.6      2
7  cellLine_D  Endometrium     55.7       97.6      1
8  cellLine_D  Endometrium     55.7       97.6      2
9  cellLine_E  Endometrium     98.1      151.4      1
10 cellLine_E  Endometrium     98.1      151.4      2
11 cellLine_E  Endometrium     98.1      151.4      3
12 cellLine_F   Esophageal     24.3       35.6      1
13 cellLine_F   Esophageal     24.3       35.6      2
14 cellLine_F   Esophageal     24.3       35.6      3
15 cellLine_G   Esophageal     58.8      128.2      1
16 cellLine_G   Esophageal     58.8      128.2      2
17 cellLine_G   Esophageal     58.8      128.2      3
18 cellLine_H Glioblastoma     70.5      200.4      1
19 cellLine_H Glioblastoma     70.5      200.4      2
20 cellLine_I     Melanoma     58.7      255.3      1

I have 2-3 biological replicates (NOT technical replicates) of RNA-seq data for each cell line. What I tried to do with limma is

design <- model.matrix(~ 0 + efficacy + targetExpr + tumorType, data = sample.annotation)
v <- voom(dgeData, design)
corfit <- duplicateCorrelation(v, design, block = sample.annotation$cellLine)
fit <- lmFit(v, design, block = sample.annotation$cellLine, correlation = corfit$consensus.correlation)
fit2 <- eBayes(fit)

But I realized that some of tumor types have a few cell lines so that it could be not appropriate to set tumorType as a covariate. Then I thought it would be better to treat tumor types as random effect as well as cell lines.

> table(sample.annotation$tumorType, sample.annotation$bioRep)
                1  2  3
  Breast        2  2  0
  Endometrium   3  3  1
  Esophageal    2  2  2
  Glioblastoma  1  1  0
  Melanoma      1  1  0
  Mesothelioma  1  1  0
  NSCLC        11 11  9
  Ovarian       1  1  0
  Prostate      1  1  0
  SCLC          3  3  3

My question is

How can I take two items (cell lines and tumor types in this case) into consideration in duplicateCorrelation()?
Or is it acceptable to set tumorType as a covariate like script above even when some of tumor types have just a few cell lines?

Thank you.

limma duplicateCorrelation • 1.4k views

ADD COMMENT • link updated 6.2 years ago by Gordon Smyth 52k • written 6.2 years ago by Kei Enomoto • 0

score 2 · Answer 1 · 2019-04-23

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

duplicateCorrelation only accepts one variable. You can't specify two random effects.
It seems perfectly natural to set tumorType as an ordinary factor. It wouldn't make sense as a random effect.

ADD COMMENT • link 6.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you very much for your answer. Let me ask a little bit more just for making sure if my idea is correct.

I agree to put tumorType as one of covariates. However, don't some tumor types with only one cell line (Glioblastoma, Melanoma, Mesothelioma, Ovarian, and Prostate) have "harmful" effects? The reason why I think so is as follows. When we consider the effect of being Glioblastoma, for example, the data will be devided into Glioblastoma group and non-Glioblastoma group. Glioblastoma group has a few samples while non-Glioblastoma group has a lot of samples. I'm afraid that in such situation the effect of being Glioblastoma could not be properly estimated ending up with inaccurate results. If it is the case, I think it could be better to exclude these tumor types before we move on to the analysis. (Not only tumor types with only one cell line but also the ones with 2-3 cell lines could also have to be excluded in this case, though.)

Which do you think is the best way?

-1. Analysis using all of tumor types with model.matrix(~ 0 + efficacy + targetExpr + tumorType, data = sample.annotation)

-2. Analysis using tumor types with at least 2 cell lines with model.matrix(~ 0 + efficacy + targetExpr + tumorType, data = sample.annotation)

-3. Analysis focusing on NSCLC which is the only one tumor type that has more than 10 cell lines with model.matrix(~ 0 + efficacy + targetExpr, data = sample.annotation)

I would appreciate it if you give me some comments on this.

ADD REPLY • link 6.2 years ago Kei Enomoto • 0

0

Entering edit mode

You're asking how to design a particular scientific analysis rather than how to use the software. Sorry, but I have to limit myself to software advice rather than scientific advice.

ADD REPLY • link 6.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you very much for your reply again. I will keep considering how to deal with my data. I found limma very powerful tool anyway. I appreciate your great help!

ADD REPLY • link 6.2 years ago Kei Enomoto • 0