Hi,
I have count data from an experiment where we used 8 chemicals tested at 4 concentrations each, both in the presence of estrogen and without estrogen. There were 3 technical replicates and 3 biological replicates. The DESeqDataSetFromMatrix command looks like this (without interactions):
dds <- DESeqDataSetFromMatrix(countData = countData_matrix,
colData = colData, design= ~ Conc_uM + chemical + estrogen)
Problematically, I just learned that some of the chemicals share controls. In other words, chemical #1 and chemical #2 share positive and negative estrogen controls; chemical #3 and chemical #4 share positive and negative estrogen controls, and so on.
So I need to figure out how to represent this in my colData and countData. Does it make sense to copy and rename the countData that I need to be used twice, and create new colData conditions that match the new colnames in the countData? Will this mess up the statistics at all, since there will be times that I will want to pool all the data and just compare all chemicals without estrogen to all chemicals with estrogen?
Thanks in advance,
Rachel
Can you post the colData (the table where columns are variables and rows are samples). This makes it much easier to understand the design. You can do
as.data.frame(colData(dds))
Here is the colData, although it is incorrectly formatted for DESeq2. As currently written, the controls (both treated and untreated with estrogen) are used in the "chemical" column. I believe what I need is for the controls to be represented instead as associated with a chemical at a concentration of 0.0. However, these would have to be duplicated due to the sharing of controls (if that makes sense). Here is a sample of the first ~15 rows (this isn't the best formatting):
As you can see, the controls are listed as a chemical. Let me know if I need to clarify further. Thank you for the help,
Rachel
Is there more experimental information missing? Are the 'B' samples related? Where is the presence/absence of estrogen?
Yes, sorry- the "E2" column represents with estrogen as a "1" and those without with a "0". The first letters in the row names ("PL1A", etc) represent plate number, and the later letters ("C08") represent well locations. "Rep" refers to replicates- either 1, 2 or 3.
Here is a longer version of that table- it has 640 rows...
I hope this helps get an idea of this issue. I appreciate your thoughts on this.
Is a technical replicate here just more sequencing of the same library? I would collapse these into a single count. So that would reduce your rows down by a factor of 3. We have a function in DESeq2 to do this, but I want to make sure that I get it right. And so PL1A_B02, PL1A_C02, and PL1A_D02 are technical replicates?
I have been using the collapseReplicates function in DESeq2 to account for the technical replicates. You are correct about identifying the technical replicates.
The problem comes with the way the plates were laid out. If we want to do dose response curves and more, I believe that DESeq2 expects that each chemical will have its own unique set of controls, where the chemical is labeled by its name (instead of "control" as currently shown) but where the concentration is set to 0.0. But since each plate contained 2 chemicals and there is only one set of shared controls between them I don't think this will work.
Thanks so much for your help,
Rachel
Hi Michael,
Thanks for your response. To operationalize what you're recommending, would I create a new version of my colData table, where the rows labeled as "control" under the "chemical" column with a Conc_uM = 0.0 be changed to Tam? Will this limit my ability to analyze PFOA data because it won't know what set of results to use as a control? I'm thinking about when I do contrasts to compare PFOA at different concentrations to control or other scenarios as well.
I am going to want to do some dose-response analysis using these unevenly-spaced concentrations. Would you recommend that I create some sort of an ordinal index variable so that I can analyze how gene expression changes within a certain chemical across these concentrations. However, if I were to add that column to my colData table I'd run into the collinear problem again and get an error because two of the columns will essentially be identical. I currently have been using concentration as a factor
You mention "manually remove the interaction term of concentration=0 and chemical=PFOA (essentially saying that the concentration=0 level for Tam serves as the baseline for PFOA as well) from the model matrix before supplying this to DESeq()." What would this look like?
Thanks for your patience- I really want to gain an intuition for how DESeq is working so that I can use it to investigate all the contrasts we're interested in.
Rachel
I just updated my answer below, because I think you also need estrogen (you are assuming different slopes according to the introduction of estrogen?)
I'll have to answer the rest of your questions later.
Can you plot what gene expression looks like over concentration for some genes? This is critical for getting the modeling right. You can use plotCounts() to do this, and potentially first subsetting the dds to particular sets of samples.
Hi Michael,
Thanks again for all these answers. Here's what I'm confused about- doesn't each of the treated conditions need to refer back to a control condition to establish which genes are differentially expressed? In other words, if I turn each of the "control"-labeled conditions into "Tam" at concentration zero, how will DESeq2 know what to compare PFOA data to to determine differential expression? Furthermore, if I wanted to compare all the treated chemicals to control (to see whether any genes universally change in response to treatment when compared to control), how would this be accomplished?
I've looked through the vignette a number of times, but if there is a section that I should re-read to get a better sense of this, please do direct me.
Thank you,
Rachel
Yes, each treated condition is compared to its control, and it's possible to use a model such that the PFOA samples are compared to the control that you label as Tam, by not including a main effect for chemical. If you imagine using concentration as a linear effect, it's fixing the lines for Tam and PFOA at the same baseline, but they have different slopes (that's what the code I gave in my answer below does). But in order to do the modeling properly and optimally, you need to figure out what is reasonable in terms of log gene expression changes over concentration. I wouldn't assume it's necessarily linear in the way you have concentration coded now.
I'd recommend you to collaborate with a local statistician who can help you do some initial data exploration and then you can plug whatever linear model you decide on into DESeq2 for analysis across all genes. You don't have a trivial experimental design: you have replicate structure, estrogen treatment, concentration and different chemicals. There are a lot of statistical design choices to be made here, and it's outside the scope of software support.
"Furthermore, if I wanted to compare all the treated chemicals to control (to see whether any genes universally change in response to treatment when compared to control), how would this be accomplished?"
I wouldn't recommend this analysis. You're more well powered if you use the concentration information.
Thank you for all of these answers- this makes sense.