Question

DESeq2- what to do when two conditions share controls?

0

Entering edit mode

grashow ▴ 10

@grashow-13014

Last seen 6.7 years ago

Hi,

I have count data from an experiment where we used 8 chemicals tested at 4 concentrations each, both in the presence of estrogen and without estrogen. There were 3 technical replicates and 3 biological replicates. The DESeqDataSetFromMatrix command looks like this (without interactions):

dds <- DESeqDataSetFromMatrix(countData = countData_matrix,
colData = colData, design= ~ Conc_uM + chemical + estrogen)

Problematically, I just learned that some of the chemicals share controls. In other words, chemical #1 and chemical #2 share positive and negative estrogen controls; chemical #3 and chemical #4 share positive and negative estrogen controls, and so on.

So I need to figure out how to represent this in my colData and countData. Does it make sense to copy and rename the countData that I need to be used twice, and create new colData conditions that match the new colnames in the countData? Will this mess up the statistics at all, since there will be times that I will want to pool all the data and just compare all chemicals without estrogen to all chemicals with estrogen?

Thanks in advance,

Rachel

deseq2 • 2.9k views

ADD COMMENT • link 6.9 years ago grashow ▴ 10

0

Entering edit mode

Can you post the colData (the table where columns are variables and rows are samples). This makes it much easier to understand the design. You can do as.data.frame(colData(dds))

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Here is the colData, although it is incorrectly formatted for DESeq2. As currently written, the controls (both treated and untreated with estrogen) are used in the "chemical" column. I believe what I need is for the controls to be represented instead as associated with a chemical at a concentration of 0.0. However, these would have to be duplicated due to the sharing of controls (if that makes sense). Here is a sample of the first ~15 rows (this isn't the best formatting):

         E2 Conc_uM rep chemical
PL1A_B02  0   1e+01   1      Tam
PL1A_B03  0   1e+00   1      Tam
PL1A_B04  0   1e-01   1      Tam
PL1A_B05  0   1e-03   1      Tam
PL1A_B06  0   0e+00   1  control
PL1A_B07  0   1e+01   1     PFOA
PL1A_B08  0   1e+00   1     PFOA
PL1A_B09  0   1e-01   1     PFOA
PL1A_B10  0   1e-03   1     PFOA
PL1A_C02  0   1e+01   1      Tam
PL1A_C03  0   1e+00   1      Tam
PL1A_C04  0   1e-01   1      Tam
PL1A_C05  0   1e-03   1      Tam
PL1A_C06  0   0e+00   1  control
PL1A_C07  0   1e+01   1     PFOA
PL1A_C08  0   1e+00   1     PFOA
PL1A_C09  0   1e-01   1     PFOA

As you can see, the controls are listed as a chemical. Let me know if I need to clarify further. Thank you for the help,

Rachel

ADD REPLY • link 6.9 years ago grashow ▴ 10

0

Entering edit mode

Is there more experimental information missing? Are the 'B' samples related? Where is the presence/absence of estrogen?

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Yes, sorry- the "E2" column represents with estrogen as a "1" and those without with a "0". The first letters in the row names ("PL1A", etc) represent plate number, and the later letters ("C08") represent well locations. "Rep" refers to replicates- either 1, 2 or 3.

Here is a longer version of that table- it has 640 rows...

         E2 Conc_uM rep chemical
PL1A_B02  0   1e+01   1      Tam
PL1A_B03  0   1e+00   1      Tam
PL1A_B04  0   1e-01   1      Tam
PL1A_B05  0   1e-03   1      Tam
PL1A_B06  0   0e+00   1  control
PL1A_B07  0   1e+01   1     PFOA
PL1A_B08  0   1e+00   1     PFOA
PL1A_B09  0   1e-01   1     PFOA
PL1A_B10  0   1e-03   1     PFOA
PL1A_C02  0   1e+01   1      Tam
PL1A_C03  0   1e+00   1      Tam
PL1A_C04  0   1e-01   1      Tam
PL1A_C05  0   1e-03   1      Tam
PL1A_C06  0   0e+00   1  control
PL1A_C07  0   1e+01   1     PFOA
PL1A_C08  0   1e+00   1     PFOA
PL1A_C09  0   1e-01   1     PFOA
PL1A_C10  0   1e-03   1     PFOA
PL1A_D02  0   1e+01   1      Tam
PL1A_D03  0   1e+00   1      Tam
PL1A_D04  0   1e-01   1      Tam
PL1A_D05  0   1e-03   1      Tam
PL1A_D06  0   0e+00   1  control
PL1A_D07  0   1e+01   1     PFOA
PL1A_D08  0   1e+00   1     PFOA
PL1A_D09  0   1e-01   1     PFOA
PL1A_D10  0   1e-03   1     PFOA
PL1A_E02  1   1e+01   1      Tam
PL1A_E03  1   1e+00   1      Tam
PL1A_E04  1   1e-01   1      Tam
PL1A_E05  1   1e-03   1      Tam
PL1A_E06  1   0e+00   1  control
PL1A_E07  1   1e+01   1     PFOA
PL1A_E08  1   1e+00   1     PFOA
PL1A_E09  1   1e-01   1     PFOA
PL1A_E10  1   1e-03   1     PFOA
PL1A_F02  1   1e+01   1      Tam
PL1A_F03  1   1e+00   1      Tam
PL1A_F04  1   1e-01   1      Tam
PL1A_F05  1   1e-03   1      Tam
PL1A_F06  1   0e+00   1  control

I hope this helps get an idea of this issue. I appreciate your thoughts on this.

ADD REPLY • link 6.9 years ago grashow ▴ 10

0

Entering edit mode

Is a technical replicate here just more sequencing of the same library? I would collapse these into a single count. So that would reduce your rows down by a factor of 3. We have a function in DESeq2 to do this, but I want to make sure that I get it right. And so PL1A_B02, PL1A_C02, and PL1A_D02 are technical replicates?

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

I have been using the collapseReplicates function in DESeq2 to account for the technical replicates. You are correct about identifying the technical replicates.

The problem comes with the way the plates were laid out. If we want to do dose response curves and more, I believe that DESeq2 expects that each chemical will have its own unique set of controls, where the chemical is labeled by its name (instead of "control" as currently shown) but where the concentration is set to 0.0. But since each plate contained 2 chemicals and there is only one set of shared controls between them I don't think this will work.

Thanks so much for your help,

Rachel

ADD REPLY • link 6.9 years ago grashow ▴ 10

0

Entering edit mode

Hi Michael,

Thanks for your response. To operationalize what you're recommending, would I create a new version of my colData table, where the rows labeled as "control" under the "chemical" column with a Conc_uM = 0.0 be changed to Tam? Will this limit my ability to analyze PFOA data because it won't know what set of results to use as a control? I'm thinking about when I do contrasts to compare PFOA at different concentrations to control or other scenarios as well.

I am going to want to do some dose-response analysis using these unevenly-spaced concentrations. Would you recommend that I create some sort of an ordinal index variable so that I can analyze how gene expression changes within a certain chemical across these concentrations. However, if I were to add that column to my colData table I'd run into the collinear problem again and get an error because two of the columns will essentially be identical. I currently have been using concentration as a factor

You mention "manually remove the interaction term of concentration=0 and chemical=PFOA (essentially saying that the concentration=0 level for Tam serves as the baseline for PFOA as well) from the model matrix before supplying this to DESeq()." What would this look like?

Thanks for your patience- I really want to gain an intuition for how DESeq is working so that I can use it to investigate all the contrasts we're interested in.

Rachel

ADD REPLY • link 6.9 years ago grashow ▴ 10

0

Entering edit mode

I just updated my answer below, because I think you also need estrogen (you are assuming different slopes according to the introduction of estrogen?)

I'll have to answer the rest of your questions later.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Can you plot what gene expression looks like over concentration for some genes? This is critical for getting the modeling right. You can use plotCounts() to do this, and potentially first subsetting the dds to particular sets of samples.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks again for all these answers. Here's what I'm confused about- doesn't each of the treated conditions need to refer back to a control condition to establish which genes are differentially expressed? In other words, if I turn each of the "control"-labeled conditions into "Tam" at concentration zero, how will DESeq2 know what to compare PFOA data to to determine differential expression? Furthermore, if I wanted to compare all the treated chemicals to control (to see whether any genes universally change in response to treatment when compared to control), how would this be accomplished?

I've looked through the vignette a number of times, but if there is a section that I should re-read to get a better sense of this, please do direct me.

Thank you,

Rachel

ADD REPLY • link 6.9 years ago grashow ▴ 10

0

Entering edit mode

Yes, each treated condition is compared to its control, and it's possible to use a model such that the PFOA samples are compared to the control that you label as Tam, by not including a main effect for chemical. If you imagine using concentration as a linear effect, it's fixing the lines for Tam and PFOA at the same baseline, but they have different slopes (that's what the code I gave in my answer below does). But in order to do the modeling properly and optimally, you need to figure out what is reasonable in terms of log gene expression changes over concentration. I wouldn't assume it's necessarily linear in the way you have concentration coded now.

I'd recommend you to collaborate with a local statistician who can help you do some initial data exploration and then you can plug whatever linear model you decide on into DESeq2 for analysis across all genes. You don't have a trivial experimental design: you have replicate structure, estrogen treatment, concentration and different chemicals. There are a lot of statistical design choices to be made here, and it's outside the scope of software support.

"Furthermore, if I wanted to compare all the treated chemicals to control (to see whether any genes universally change in response to treatment when compared to control), how would this be accomplished?"

I wouldn't recommend this analysis. You're more well powered if you use the concentration information.

ADD REPLY • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Thank you for all of these answers- this makes sense.

ADD REPLY • link 6.9 years ago grashow ▴ 10

score 1 · Answer 1 · 2017-05-25

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

hi Rachel,

You can code the control samples as "Tam", and then use a design that doesn't have a difference between PFOA and Tam at concentration=0. This is saying that at concentration=0, the expected expression for two chemicals are equal (i.e. control). It would look like a design of ~rep + estrogen + estrogen:concentration + estrogen:chemical:concentration ... supposing that you were going to use concentration as a continuous covariate. Have you thought how to code concentration here? I wouldn't necessary use it as is (because the values do not evenly span the range of concentration in the current scale), but you might have to think about how the gene expression will change across concentration if you want to model it as a continuous covariate. You could also code it as a factor but then you would have to manually remove the interaction term of concentration=0 and chemical=PFOA (essentially saying that the concentration=0 level for Tam serves as the baseline for PFOA as well) from the model matrix before supplying this to DESeq().

ADD COMMENT • link 6.9 years ago Michael Love 41k

0

Entering edit mode

Hi Micheal,

Looking over this model matrix, why is replicates (rep) included as a covariate?

Thanks,

-R

ADD REPLY • link 6.7 years ago rrcutler ▴ 70

0

Entering edit mode

Those are biological replicates (technical replicates were collapsed already). So all the 1's could have some baseline difference in expression for a gene, and we want to control for that. But this is a pretty complicated design, and note that there is a new thread continuing the discussion (I don't give a final answer on the *correct* design here, because this is not always possible with complex experiments: there are many possible designs when you have continuous covariates and so many interaction choices to make, and this involves long conversations between statistician and biologist).

ADD REPLY • link 6.7 years ago Michael Love 41k

0

Entering edit mode

Okay, I have not seen biological replicates as a factor before. Usually I thought that biological replicates were rows of colData which had the same levels of factors?

ADD REPLY • link 6.7 years ago rrcutler ▴ 70

1

Entering edit mode

Here the original post implies there are multiple samples from the same organism or donor, so some structure to the data beyond one row of colData per individual. I don't know exactly the definition of "rep" here because it's not my data, but the original post said it was not technical replication (not just additional sequencing runs) and that the variation should be controlled for.

ADD REPLY • link 6.7 years ago Michael Love 41k