Dear community,
i would like to address my important questions regarding a complex experimental design in R, concerning an agilent microarray dataset with multiple time points. Briefly, after normalizing, a view of my phenotype information is the following:
> y2$targets Sample.and.Data.Relationship.Format time Batch GSM526606 irradiated 0.5 h 1 GSM526607 irradiated 0.5 h 2 GSM526608 irradiated 0.5 h 3 GSM526609 irradiated 0.5 h 4 GSM526610 bystander 0.5 h 1 GSM526611 bystander 0.5 h 2 GSM526612 bystander 0.5 h 3 GSM526613 bystander 0.5 h 4 GSM526614 control 0.5 h 1 GSM526615 control 0.5 h 2 GSM526616 control 0.5 h 3 GSM526617 control 0.5 h 4 GSM526618 irradiated 1.0 h 1 GSM526619 irradiated 1.0 h 2 GSM526620 irradiated 1.0 h 3 GSM526621 irradiated 1.0 h 4 GSM526622 bystander 1.0 h 1 GSM526623 bystander 1.0 h 2 GSM526624 bystander 1.0 h 3 GSM526625 bystander 1.0 h 4 GSM526626 control 1.0 h 1 GSM526627 control 1.0 h 2 GSM526628 control 1.0 h 3 GSM526629 control 1.0 h 4 GSM526630 irradiated 2.0 h 1 GSM526631 irradiated 2.0 h 2 GSM526632 irradiated 2.0 h 3 GSM526633 irradiated 2.0 h 4 GSM526634 bystander 2.0 h 1 GSM526635 bystander 2.0 h 2 GSM526636 bystander 2.0 h 3 GSM526637 bystander 2.0 h 4 GSM526638 control 2.0 h 1 GSM526639 control 2.0 h 2 GSM526640 control 2.0 h 3 GSM526641 control 2.0 h 4 GSM526642 irradiated 4.0 h 1 GSM526643 irradiated 4.0 h 2 GSM526644 irradiated 4.0 h 3 GSM526645 irradiated 4.0 h 4 GSM526646 bystander 4.0 h 1 GSM526647 bystander 4.0 h 2 GSM526648 bystander 4.0 h 3 GSM526649 bystander 4.0 h 4 GSM526650 control 4.0 h 1 GSM526651 control 4.0 h 2 GSM526652 control 4.0 h 3 GSM526653 control 4.0 h 4 GSM526654 irradiated 6.0 h 1 GSM526655 irradiated 6.0 h 2 GSM526656 irradiated 6.0 h 3 GSM526657 irradiated 6.0 h 4 GSM526658 bystander 6.0 h 1 GSM526659 bystander 6.0 h 2 GSM526660 bystander 6.0 h 3 GSM526661 bystander 6.0 h 4 GSM526662 control 6.0 h 1 GSM526663 control 6.0 h 2 GSM526664 control 6.0 h 3 GSM526665 control 6.0 h 4 GSM526666 irradiated 24.0 h 1 GSM526667 irradiated 24.0 h 2 GSM526668 irradiated 24.0 h 3 GSM526669 irradiated 24.0 h 4 GSM526670 bystander 24.0 h 1 GSM526671 bystander 24.0 h 2 GSM526672 bystander 24.0 h 3 GSM526673 bystander 24.0 h 4 GSM526674 control 24.0 h 1 GSM526675 control 24.0 h 2 GSM526676 control 24.0 h 3 GSM526677 control 24.0 h 4 |
As you can see from the above phenotype data frame, the experimental design is rather complex: i have 6 different time points of 3 different conditions(control, bystander & irradiated cells) and also 4 different biological replicates of each condition-time point, which are called "batches" for simplicity. Thus, at a first glance I'm mainly interest to test for DE genes between bystander vs control and irradiated vs control at a specific timepoint, which is the 4h time point. So, because I'm a newbie to R and complex methodologies, how should i formulate my design matrix with limma, for instance for this specific example ? In order also to adjust for the "possible batch effect", that is the 4 different biological replicates ?
Here is a link to an MDS plot of my dataset after normalization (where the color represents the 3 different groups of comparison) :
https://www.dropbox.com/s/qv3n5lrnlh7focg/plotMDS_after_normalization.png?dl=0
I found in the limma users guide the tutorial 9.6.2 in page 49, but the additional complexity in my design is also the different biological replicates !!
Thank you in advance !!!
Konstantinos
Thank you very much Aaron for your detailed answer and comments !! So, my further crucial questions are :
1. Firstly, concerning your crucial comment about the possible present of batch effect: from the related paper above(http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-2#CR19_3141), it has two specific parts that in my opinion(and please) indicate that there is actually a batch effect regarding the different day of experiments:
"Four independent experiments were conducted, and each was performed in parallel with irradiated, bystander and sham-irradiated samples derived from a sub-cultivated pool of IMR-90 cells that were seeded from a single cryo-vial. Directly irradiated (outer dish) and bystander (inner dish) cells were separated at 30 minutes, 1, 2, 4, 6 and 24 hours after exposure, and RNA was isolated from the exposed cultures and from time-matched sham-irradiated controls using Ribopure (Ambion, Life Technologies)."
& also... "Each sample was hybridized to an Agilent Whole Human Genome Oligo Microarray (G4112F) using the Agilent one-color workflow as previously described".
Thus, these don't indicate that in each timepoint, both 3 conditions were processed together ? Also what is your opinion about the above MDS plot ?
2. Moreover, i don't have to use the complex approach with splines like the tutorial, correctly ? Although i have 6 different time points ??
3. Finally, my third question is related to the design matrix: in case my above notion about the possible inclusion of the batch is correct, then:
batch <- factor(y2$targets$Batch)
design <- model.matrix(~0 + grouping + batch)
And then continue with makeContrasts to specify my selected comparisons. However, in case--which is a second goal of my current analysis--to identify possible DE genes in all time points in a specific comparison: for instance, irradiated vs bystander, which strategy should i follow ? Again, use
makeContrasts
to define all needed comparisons ? And then with thetopTable
function, if i dont define a coefficient, this will return the DE genes that are common between all the comparisons from above ?makeContrasts
to define pairwise comparisons between the two groups at all time points (below). No, if you don't specify a coefficient,topTable
will do an ANOVA looking for DE at any comparison, i.e., at any time point.If you want to identify DE that is present across all comparisons (i.e., at all time points), you'll need to run
topTable
separately forcoef=1
toncol(con)
, and then find the intersection of the DE lists.Dear Aaron,
thank you again for your comments !! Unfortunately, regarding the issue for the dishes you have mentioned, i could not find any related information on the paper, but only this part:
"Early passage (population doubling < 35) IMR-90 human lung fibroblasts (Coriell repository, NJ) were sub-cultured in Dulbecco's modified Eagle's medium (Gibco) and Ham's F10 medium in a 1:1 mixture plus 15% fetal bovine serum. Mylar-bottomed culture dishes were prepared as described previously [50]. An inner dish with a base of 38-μm-thick Mylar strips was inserted into a larger dish with a 6-μm Mylar base. The 38-μm Mylar completely shields the α-particles so that only cells on the thinner Mylar areas of the dish were directly irradiated [50]. Cells seeded in these dishes formed a contiguous layer. Cells were exposed to 0 (sham irradiated) or 50 cGy4He ions (125 keV per micron) as simulated α-particles using the track segment mode of the 5.5-MV Singletron accelerator at the Radiological Research Accelerator Facility of Columbia University"...
& the above specific part again:
"Directly irradiated (outer dish) and bystander (inner dish) cells were separated at 30 minutes, 1, 2, 4, 6 and 24 hours after exposure, and RNA was isolated from the exposed cultures and from time-matched sham-irradiated controls using Ribopure (Ambion, Life Technologies)".
2. Regarding your answer number 3: i have two further important questions in order to understand appropriately the comparisons: although i understand that this is not the forum for statistical learning, which is the main difference for not using a number in coef--thus an ANOVA model (after for instance the con object you post above), and calling individually all needed contrasts and then intersect the common DE genes ?
3. To continue my previous question, you pinpointed above:
"If you want to identify DE that is present across all comparisons (i.e., at all time points), you'll need to run
topTable
separately forcoef=1
toncol(con)"
Thus, something like this ?
topTable(ebayes, coef=1:4,..) ?
or top.0.5h <- topTable(ebayes, coef=1,...), top1h <- topTable(ebayes, coef=2, ....)
For the first point, I can't tell for sure whether they're independent cultures. I guess they are, though, because it doesn't sound easy (or sensible) to extract RNA from part of the dish and then try to re-use the remaining culture in the dish for the the rest of the time-course.
For the second point, doing an ANOVA tests for DE in any of the comparisons, while dropping each coefficient separately and intersecting the resulting DE lists will identify genes that are DE in all comparisons.
For the third point, my suggestion refers to your second approach, where
topTable
is run separately on each comparison. Your first approach is basically an ANOVA with all contrasts.Dear Aaron,
i return with one crusial question---as with some specific independent comparisons, i got a small number of DE genes, i tried the ANOVA approach as you suggested, with the following contrasts:
con <- makeContrasts(BYSTANDER.4.0H=groupingbystander.4.0h - groupingcontrol.4.0h, IRRADIATED.4.0H=groupingirradiated.4.0h - groupingcontrol.4.0h, BYSTANDER.0.5H=groupingbystander.0.5h - groupingcontrol.0.5h, IRRADIATED.0.5H=groupingirradiated.0.5h - groupingcontrol.0.5h,
BYSTANDER.1.0H=groupingbystander.1.0h - groupingcontrol.1.0h, IRRADIATED.1.0H=groupingirradiated.1.0h - groupingcontrol.1.0h,
BYSTANDER.2.0H=groupingbystander.2.0h - groupingcontrol.2.0h, IRRADIATED.2.0H=groupingirradiated.2.0h - groupingcontrol.2.0h,
BYSTANDER.6.0H=groupingbystander.6.0h - groupingcontrol.6.0h, IRRADIATED.6.0H=groupingirradiated.6.0h - groupingcontrol.6.0h,
BYSTANDER.24.0H=groupingbystander.24.0h - groupingcontrol.24.0h, IRRADIATED.24.0H=groupingirradiated.24.0h - groupingcontrol.24.0h,
levels=design)
So firstly, when i type:
Thus, the above probesets(possibly might include some probesets annotated to the same gene symbol) are considered as "DE" in at least one of my specific above comparisons-based on the adjusted p-value cutoff ? Like essentially performing an F-test ?
Moreover, if my notion is correct: without specifying above a coef argument and also with the FDR correction, essentially each contrast individually is corrected for significance, right ?
2) If my above approach is valid, then about the output of topTable:
In detail, regarding the output: each column-coefficient represents the average log2-fold change of each defined contrast ? And what is the difference with the column with title F In a sense similar to the B column returned in a specific coefficient?
And if i want to restrict my "DE candidates", i should use something like the following :
top2 <- top[top$adj.P.Val < 0.05,] ? or i should also utilize the F information ??
Thank you,
Konstantinos
Break up your posts, because it's hard to answer your individual questions.
F
column represents the F-statistic of the ANOVA-like comparison, and is not related to the B-statistic (which represents the log-odds). Yes, you should filter on the adjusted p-value to get DE genes.Thank you again for your comments. Just one final question about the annotation procedure and my DE probesets: as now i have multiple contrasts, and i selected my DE genes with the following code:
top2 <- top[top$adj.P.Val<0.05,]
Thus, because i have as expected some duplicated probesets matching to the same gene symbol :
1) As i would like to use these probesets to perform something like time-series clustering with all the time points i have in this specific dataset, should i remove these duplicated probesets in order to have unique gene symbols ? And what is your opinion on this matter ?
2) Because you mentioned above that the "FDR correction is not applied on each individual contrast separately, it is applied on the p-value from the ANOVA using all contrasts simultaneously"--so, i can't tell with certainty in which contrast a specific probeset has been "characterized" as DE, right ?
3) Because i have two other DE lists from 2 separate datasets--that had two common timepoints comparison--and i would like to compare my resulted DE probesets with these 2 lists, i should again first remove any duplicated probesets ??
Best,
Konstantinos
Dear Aaron,
with delay, i would like to address based on your answers above two small questions:
1) Should not in your above implementation of design matrix, also make the variable grouping as factor, before including it in the model matrix ? As in the case above, it stays as a character variable ?
2) Secondly, if my above approach is correct, and i want finally to impement an ANOVA model, but in specific comparisons that i will define by makeContrasts, as above, i should follow:
grouping <- as.factor(paste0(final$targets$Sample.and.Data.Relationship.Format, ".", final$targets$time))
batch <- factor(final$targets$Batch)
design <- model.matrix(~0 + grouping + batch)
fit <- lmFit(final, design)
con <- makeContrasts(irradiated.0.5h - control.0.5h, irradiated.1h - control.1h,irradiated.2h - control.2h,
# and also some other contrastsirradiated.4h - control.4h, bystander.0.5h.-control.0.5h,... levels=design)
fit2 <- contrasts.fit(fit, con)
fit3 <- eBayes(fit2, trend=TRUE, robust=TRUE)
Then, if i use top <- topTable(fit3, sort.by="none", number=nrow(fit3)) without any coefficients:
it will return probesets that are DE in at least one of the above specified contrasts ?
Or i also have to include a cutoff in topTable, like p.value=0.05 in order to be more appropriate??
Best,
Konstantinos
model.matrix
will convert character vectors into factors for you.topTable
call will return all probe sets. If you want DE probe sets, you should filtertop
to keep only those rows whereadj.P.val
is less than some FDR threshold, e.g., 5%.Thank you again Aaron for your comments !! I thought that coercing to factor is almost a pre-requisite step before model.matrix, but i didn't have in mind that model.matrix convert character vectors into factors. So, coercing to factor is most advisable for variables with discrete numeric values, before implement into model.matrix ?
For your second answer about the ANOVA, so something like this:
top <- topTable(fit3, sort.by="none", number=nrow(fit3), p.value=0.05)
?And this will return any probesets that have an adjusted p.value less than 0.05 in any contrasts ??