Question

Confusing: identify differentially expression genes (DEG) perturbed by a drug in Limma

1

Entering edit mode

kevin.m.hao ▴ 10

@kevinmhao-12183

Last seen 7.4 years ago

Hello,

I would like to identify the Differentially Expressed Genes (DEG) perturbed by a drug
'metformin'.

The experiment is one-color microarray as shown below. The 'controlFileName' refers to the array without
the drug (metformin here) treatment, and 'perturbationFileName' refers to the corresponding array with
the drug (metformin here) treatment. The original file like this

> orgFile
drugName concentration cellLine perturbationFileName controlFileName
1 metformin 0.00001 MCF7 treat_1.CEL control_1.CEL
2 metformin 0.00001 MCF7 treat_2.CEL control_1.CEL
3 metformin 0.0000001 MCF7 treat_3.CEL control_1.CEL
4 metformin 0.001 MCF7 treat_4.CEL control_1.CEL
5 metformin 0.00001 MCF7 treat_5.CEL control_2.CEL
6 metformin 0.0000242 MCF7 treat_6.CEL control_3.CEL
7 metformin 0.0000242 MCF7 treat_6.CEL control_4.CEL
8 metformin 0.0000242 PC3 treat_7.CEL control_5.CEL
9 metformin 0.0000242 PC3 treat_7.CEL control_6.CEL
10 metformin 0.0000242 HL60 treat_8.CEL control_7.CEL
11 metformin 0.0000242 HL60 treat_8.CEL control_8.CEL

####################################
# produce the original file
orgFile <- rbind(
c("metformin", "0.00001", "MCF7", "treat_1.CEL", "control_1.CEL"),
c("metformin", "0.00001", "MCF7", "treat_2.CEL", "control_1.CEL"),
c("metformin", "0.0000001", "MCF7", "treat_3.CEL", "control_1.CEL"),
c("metformin", "0.001", "MCF7", "treat_4.CEL", "control_1.CEL"),
c("metformin", "0.00001", "MCF7", "treat_5.CEL", "control_2.CEL"),
c("metformin", "0.0000242", "MCF7", "treat_6.CEL", "control_3.CEL"),
c("metformin", "0.0000242", "MCF7", "treat_6.CEL", "control_4.CEL"),
c("metformin", "0.0000242", "PC3", "treat_7.CEL", "control_5.CEL"),
c("metformin", "0.0000242", "PC3", "treat_7.CEL", "control_6.CEL"),
c("metformin", "0.0000242", "HL60", "treat_8.CEL", "control_7.CEL"),
c("metformin", "0.0000242", "HL60", "treat_8.CEL", "control_8.CEL")
)
colnames(orgFile) <- c("drugName", "concentration", "cellLine", "perturbationFileName", "controlFileName")
orgFile <- as.data.frame(orgFile)
####################################

I have serveral questions confusing me.

1.
The 'concentration' is an important factor for the microarray analysis? If so, why one same 'control file'
can be used for different concentrations (such as: 'control_1.CEL' for concentration of '0.00001', '0.0000001' and
'0.001' respectively)? If I understand right, each concentration should have their own 'control file', right?

2.
Should I perform limma analysis by considering ('concentraion'+'cellLine') or just by ('cellLine') ignoring concentration?

3.
If I analyze the microarray by just considering ('cellLine'), then for MCF7, the DEG I identified are
for example, 'geneA', 'geneB', 'genC' and for PC3, the DEG for example are 'geneB', 'geneC' and
for HL60 the DEG are 'geneD'. Can I conclude that for 'metformin' perturbation,
the total DEG are: 'geneA', 'geneB', 'geneC', 'geneD' which is union for the three analysis? if not, what is better ways to get DEG
for the 'metformin' drug perturbation?

> totDEG
[1] "geneA" "geneB" "geneC" "geneD"

#####################################
# the code like this
mcf7 <- c("geneA", "geneB", "geneC")
pc3 <- c("geneB", "geneC")
hl60 <- c("geneD")
totDEG <- Reduce(union, list(mcf7, pc3, hl60))
totDEG
#####################################

4.
Can I ignore both 'concentration' and 'cellLine' factors and do Limma analysis using all 'perturbationFileNames' and 'controlFileName',
which can get the DEG 'just one time' instead of that did in my Third question (analyze by different cellLine and get union them)?

is this analysis reasonable, since I have already ignored the 'concentration and cellLine' factors?

> targetFrame
FileName Group
1 treat_1.CEL treat
2 treat_2.CEL treat
3 treat_3.CEL treat
4 treat_4.CEL treat
5 control_1.CEL control
6 treat_5.CEL treat
7 control_2.CEL control
8 treat_6.CEL treat
9 control_3.CEL control
10 control_4.CEL control
11 treat_7.CEL treat
12 control_5.CEL control
13 control_6.CEL control
14 treat_8.CEL treat
15 control_7.CEL control
16 control_8.CEL control

######################################
# the code like this
targetFrame <- data.frame(
FileName = c("treat_1.CEL", "treat_2.CEL", "treat_3.CEL", "treat_4.CEL",
"control_1.CEL", "treat_5.CEL", "control_2.CEL", "treat_6.CEL", "control_3.CEL",
"control_4.CEL", "treat_7.CEL", "control_5.CEL", "control_6.CEL",
"treat_8.CEL", "control_7.CEL", "control_8.CEL"),
Group = c("treat", "treat", "treat", "treat", "control", "treat", "control",
"treat", "control", "control", "treat", "control", "control", "treat",
"control", "control")
)
targetFrame
library(limma)
group <- factor(targetFrame$Group)
design <- model.matrix(~ 0 + group)
colnames(design) <- c("control", "treat")
# eset is log2 data for example
fit <- lmFit(eset, design)
cm <- makeContrasts(treat - control)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
#######################################

5.
Given the 'orgFile', what is the proper way to to identify the Differentially Expressed Genes (DEG) perturbed by 'metformin'?

Any help will be appreciated!

limma cmap microarrays differential gene expression • 1.6k views

ADD COMMENT • link 7.9 years ago kevin.m.hao ▴ 10

score 2 · Accepted Answer · 2017-01-25

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 5 hours ago

WEHI, Melbourne, Australia

The starting point for any analysis is to make a proper targets frame. You obviously have data from only 16 microarrays, so the targets frame should have 16 rows (not 17 as you have above). The targets frame should also include columns for "cell line" and "concentration".

There is no reason why you need to ignore cell line or concentration in your analysis. Making decisions like that though is up to you based on what your aims are.

The "original file" that you show is not useful and you cannot do any sensible analysis using it. There is no point in asking us any questions about the "original file" because we don't know where you got it from or what it was supposed to be used for.

I cannot see any reason why each concentration should need a separate control. The control simply means no treatment, in other words the control files correspond to a concentration of zero. Trying to pair up particular treatment files with particular control files, as in your "original file", is unnecessary and makes little sense.

ADD COMMENT • link 7.9 years ago Gordon Smyth 52k

0

Entering edit mode

Thanks Gordon. I modified the typos I made before. Now it is 16 microarrays.

In fact, my goad is that: I would like to compare DEGs induced by two kinds of perturbations (e.g., one is drugA, and another is rnaA). However, I can not do wet experiment for that, so I just want to analyze the exisiting microarray data (for example, data deposited in CMAP and NCBI GEO).

In principal, these microarrays should use the same cell line and same concentration (if any) and other same factors except that the perturbations are different, right?

However, all these microarrays may be from different researchers, so the factors can not be ensured to be same. At this time, I need determine how to use these microarrays to ensure the comparison of DEGs for different perturbations tends to be reasonable.

If drugA was tested in MCF7 cell line and rnaA was tested in PC3 cell line, is there a reasonable way to compare DEGs for these two perturbations? since they were tesed in different cell lines.

What is the bottom line for comparing DEGs for two different perturbations? (MUST be same cell line? same concentration?) But if analyzing public microarray data, the limitations can be ensured, how to do?

Sorry for asking so many (naive) questions, but I am new in microarray analysis field.

Thanks so much.

ADD REPLY • link 7.9 years ago kevin.m.hao ▴ 10

1

Entering edit mode

The sensibility of comparing DEG lists depends on the biological context of the contrasts. Do your perturbations affect the same pathway? Are the cell lines related in terms of their originating cell type? If you can't answer these questions, you won't be able to interpret any similarities or differences in the DEG lists. For example, MCF7 is derived from the mammary epithelium while PC3 is derived from prostate tissue. Their regulatory networks are probably completely different - how do you plan to interpret the effects of two different perturbations on each of the two tissues?

Good experimental design obviously plays a big role here. If both contrasts involved the same cell line, then you could interpret similarities or differences in the lists as those caused by the perturbations. Alternatively, if both contrasts involved the same perturbation, then similarities/differences could be attributed to the cell line. However, if you aren't involved in planning the experiment - and you wouldn't be, for public data - then there's little you can do - see Fisher's quote:

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

ADD REPLY • link 7.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron.

If I just know drugA and rnaA were both tested in the same cell lines such as MCF7 (homo), but I do not know if other experimental factors (such as concentrations, time etc) are same (due to that they are from different researchers' experiment), can I compare the DEG lists perturbed by drugA and rnaA?

If nope, the large scale of comparison of DEG lists (perturbed saying by 1000 drugs and 1000 RNAs, which is a 1000-by-1000 similarity matrix if interpreting the similairities in the lists) in public data seems not easy, since it is difficult for computational researchers to pick up all same experimental factors of microarrays deposited by different researchers, right?

If I just want to interpret the similarities in the DEG lists (in a large scale) as those caused by the perturbations, what is a reasonable way to do it?

Thanks!

ADD REPLY • link 7.9 years ago kevin.m.hao ▴ 10

1

Entering edit mode

"Can I compare the DEG lists perturbed by drugA and rnaA?" That is not a question that is easily answered without further biological context. If one perturbation affects DNA replication, and another perturbation affects protein synthesis, what exactly are you hoping to find when you compare the DEG lists?

If the perturbations affect the same pathway, perhaps the comparison is more understandable. However, there will always be the possibility of additional confounding variables when you try to combine results from different studies. Never mind concentration and time of perturbation, what about culture conditions? Stage of growth? That's not even considering mycoplasma contamination or misidentification of cell lines.

If you're concerned about additional variables, the only way to control them is to redo the experiment yourself. If you won't or can't do that, you'll just have to accept that the comparison of DEG lists is confounded by these experimental factors, and proceed cautiously with the interpretation of the results.

ADD REPLY • link 7.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks Aaron.

By comparison of the DEG lists by computing the gene list similarity using for example 'biological process, BP', one may identify if the two perturbations have the same action to certain diseases.

If so, drug repositioning may be generated, right?

I am still confusing that how to extract the DEGs more reasonably,if one perturbation was tested in multiple microarrays of different cell lines.

ADD REPLY • link 7.9 years ago kevin.m.hao ▴ 10