I’m working on the identification of the genes affecting the efficacy of an anti-cancer drug, which targets specific protein expressed on the surface of the tumor cells. I’m trying to identify such genes because the efficacy cannot be fully correlated or explained by the expression level of the target gene. I have RNA-seq data and efficacy data taken from several samples. I show you a part of sample annotations bellow.
> head(sample.annotation, 20)
cellLine tumorType efficacy targetExpr bioRep
1 cellLine_A Breast 40.9 211.0 1
2 cellLine_A Breast 40.9 211.0 2
3 cellLine_B Breast 33.9 107.0 1
4 cellLine_B Breast 33.9 107.0 2
5 cellLine_C Endometrium 91.1 280.6 1
6 cellLine_C Endometrium 91.1 280.6 2
7 cellLine_D Endometrium 55.7 97.6 1
8 cellLine_D Endometrium 55.7 97.6 2
9 cellLine_E Endometrium 98.1 151.4 1
10 cellLine_E Endometrium 98.1 151.4 2
11 cellLine_E Endometrium 98.1 151.4 3
12 cellLine_F Esophageal 24.3 35.6 1
13 cellLine_F Esophageal 24.3 35.6 2
14 cellLine_F Esophageal 24.3 35.6 3
15 cellLine_G Esophageal 58.8 128.2 1
16 cellLine_G Esophageal 58.8 128.2 2
17 cellLine_G Esophageal 58.8 128.2 3
18 cellLine_H Glioblastoma 70.5 200.4 1
19 cellLine_H Glioblastoma 70.5 200.4 2
20 cellLine_I Melanoma 58.7 255.3 1
I have 2-3 biological replicates (NOT technical replicates) of RNA-seq data for each cell line. What I tried to do with limma is
design <- model.matrix(~ 0 + efficacy + targetExpr + tumorType, data = sample.annotation)
v <- voom(dgeData, design)
corfit <- duplicateCorrelation(v, design, block = sample.annotation$cellLine)
fit <- lmFit(v, design, block = sample.annotation$cellLine, correlation = corfit$consensus.correlation)
fit2 <- eBayes(fit)
But I realized that some of tumor types have a few cell lines so that it could be not appropriate to set tumorType as a covariate. Then I thought it would be better to treat tumor types as random effect as well as cell lines.
> table(sample.annotation$tumorType, sample.annotation$bioRep)
1 2 3
Breast 2 2 0
Endometrium 3 3 1
Esophageal 2 2 2
Glioblastoma 1 1 0
Melanoma 1 1 0
Mesothelioma 1 1 0
NSCLC 11 11 9
Ovarian 1 1 0
Prostate 1 1 0
SCLC 3 3 3
My question is
How can I take two items (cell lines and tumor types in this case) into consideration in duplicateCorrelation()?
Or is it acceptable to set tumorType as a covariate like script above even when some of tumor types have just a few cell lines?
Thank you.
Thank you very much for your answer. Let me ask a little bit more just for making sure if my idea is correct.
I agree to put tumorType as one of covariates. However, don't some tumor types with only one cell line (Glioblastoma, Melanoma, Mesothelioma, Ovarian, and Prostate) have "harmful" effects? The reason why I think so is as follows. When we consider the effect of being Glioblastoma, for example, the data will be devided into Glioblastoma group and non-Glioblastoma group. Glioblastoma group has a few samples while non-Glioblastoma group has a lot of samples. I'm afraid that in such situation the effect of being Glioblastoma could not be properly estimated ending up with inaccurate results. If it is the case, I think it could be better to exclude these tumor types before we move on to the analysis. (Not only tumor types with only one cell line but also the ones with 2-3 cell lines could also have to be excluded in this case, though.)
Which do you think is the best way?
-1. Analysis using all of tumor types with
model.matrix(~ 0 + efficacy + targetExpr + tumorType, data = sample.annotation)
-2. Analysis using tumor types with at least 2 cell lines with
model.matrix(~ 0 + efficacy + targetExpr + tumorType, data = sample.annotation)
-3. Analysis focusing on NSCLC which is the only one tumor type that has more than 10 cell lines with
model.matrix(~ 0 + efficacy + targetExpr, data = sample.annotation)
I would appreciate it if you give me some comments on this.
You're asking how to design a particular scientific analysis rather than how to use the software. Sorry, but I have to limit myself to software advice rather than scientific advice.
Thank you very much for your reply again. I will keep considering how to deal with my data. I found limma very powerful tool anyway. I appreciate your great help!