Questions about complex design in limma regarding an agilent microarray dataset with multiple time points and different biological replicates
1
0
Entering edit mode
@konstantinos-yeles-8961
Last seen 3 months ago
Italy

Dear community,

i would like to address my important questions regarding a complex experimental design in R, concerning an agilent microarray dataset with multiple time points. Briefly, after normalizing, a view of my phenotype information is the following:

> y2$targets
          Sample.and.Data.Relationship.Format   time Batch
GSM526606                          irradiated  0.5 h     1
GSM526607                          irradiated  0.5 h     2
GSM526608                          irradiated  0.5 h     3
GSM526609                          irradiated  0.5 h     4
GSM526610                           bystander  0.5 h     1
GSM526611                           bystander  0.5 h     2
GSM526612                           bystander  0.5 h     3
GSM526613                           bystander  0.5 h     4
GSM526614                             control  0.5 h     1
GSM526615                             control  0.5 h     2
GSM526616                             control  0.5 h     3
GSM526617                             control  0.5 h     4
GSM526618                          irradiated  1.0 h     1
GSM526619                          irradiated  1.0 h     2
GSM526620                          irradiated  1.0 h     3
GSM526621                          irradiated  1.0 h     4
GSM526622                           bystander  1.0 h     1
GSM526623                           bystander  1.0 h     2
GSM526624                           bystander  1.0 h     3
GSM526625                           bystander  1.0 h     4
GSM526626                             control  1.0 h     1
GSM526627                             control  1.0 h     2
GSM526628                             control  1.0 h     3
GSM526629                             control  1.0 h     4
GSM526630                          irradiated  2.0 h     1
GSM526631                          irradiated  2.0 h     2
GSM526632                          irradiated  2.0 h     3
GSM526633                          irradiated  2.0 h     4
GSM526634                           bystander  2.0 h     1
GSM526635                           bystander  2.0 h     2
GSM526636                           bystander  2.0 h     3
GSM526637                           bystander  2.0 h     4
GSM526638                             control  2.0 h     1
GSM526639                             control  2.0 h     2
GSM526640                             control  2.0 h     3
GSM526641                             control  2.0 h     4
GSM526642                          irradiated  4.0 h     1
GSM526643                          irradiated  4.0 h     2
GSM526644                          irradiated  4.0 h     3
GSM526645                          irradiated  4.0 h     4
GSM526646                           bystander  4.0 h     1
GSM526647                           bystander  4.0 h     2
GSM526648                           bystander  4.0 h     3
GSM526649                           bystander  4.0 h     4
GSM526650                             control  4.0 h     1
GSM526651                             control  4.0 h     2
GSM526652                             control  4.0 h     3
GSM526653                             control  4.0 h     4
GSM526654                          irradiated  6.0 h     1
GSM526655                          irradiated  6.0 h     2
GSM526656                          irradiated  6.0 h     3
GSM526657                          irradiated  6.0 h     4
GSM526658                           bystander  6.0 h     1
GSM526659                           bystander  6.0 h     2
GSM526660                           bystander  6.0 h     3
GSM526661                           bystander  6.0 h     4
GSM526662                             control  6.0 h     1
GSM526663                             control  6.0 h     2
GSM526664                             control  6.0 h     3
GSM526665                             control  6.0 h     4
GSM526666                          irradiated 24.0 h     1
GSM526667                          irradiated 24.0 h     2
GSM526668                          irradiated 24.0 h     3
GSM526669                          irradiated 24.0 h     4
GSM526670                           bystander 24.0 h     1
GSM526671                           bystander 24.0 h     2
GSM526672                           bystander 24.0 h     3
GSM526673                           bystander 24.0 h     4
GSM526674                             control 24.0 h     1
GSM526675                             control 24.0 h     2
GSM526676                             control 24.0 h     3
GSM526677                             control 24.0 h     4
 
 

As you can see from the above phenotype data frame, the experimental design is rather complex: i have 6 different time points of 3 different conditions(control, bystander & irradiated cells) and also 4 different biological replicates of each condition-time point, which are called "batches" for simplicity. Thus, at a first glance I'm mainly interest to test for DE genes between bystander vs control and irradiated vs control at a specific timepoint, which is the 4h time point. So, because I'm a newbie to R and complex methodologies, how should i formulate my design matrix with limma, for instance for this specific example ? In order also to adjust for the "possible batch effect", that is the 4 different biological replicates ?

Here is a link to an MDS plot of my dataset after normalization (where the color represents the 3 different groups of comparison) :

https://www.dropbox.com/s/qv3n5lrnlh7focg/plotMDS_after_normalization.png?dl=0

I found in the limma users guide the tutorial 9.6.2 in page 49, but the additional complexity in my design is also the different biological replicates !!

Thank you in advance !!!

Konstantinos

limma design matrix multiple time points agilent microarrays • 2.7k views
ADD COMMENT
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 5 hours ago
The city by the bay

The first question is whether the "biological replicates" really belong in different batches. Is there any relationship between GSM526606, GSM526610, GSM526614, etc. other than being the first sample in each treatment/time combination? If there isn't, then they're independent samples and you shouldn't be treating them as being in a single batch. If there is (e.g., they were processed together), then consideration of the batch effect is more understandable. In any case, I would combine treatment and time into a single factor:

grouping <- paste0(y2$targets$Sample.and.Data.Relationship.Format. ".", y2$targets$time)

You'll need to do this if you want to test for DE between treatments at a specific time point. Then, depending on whether the "biological replicate" factor represents a genuine batch or not, you would then do:

design <- model.matrix(~0 + grouping) # if not a genuine batch

... or:

batch <- factor(y2$targets$Batch)
design <- model.matrix(~0 + grouping + batch) # if a genuine batch

... and set up the DE contrasts appropriately using makeContrasts. This should be easy enough; each grouping* coefficient represents the average log-expression of the corresponding group, so you can just set up contrasts between the 4 hour groups of interest. You'll probably want to rename the levels of time to get rid of spaces, otherwise makeContrasts won't be happy.

ADD COMMENT
0
Entering edit mode

Thank you very much Aaron for your detailed answer and comments !! So, my further crucial questions are :

1. Firstly, concerning your crucial comment about the possible present of batch effect: from the related paper above(http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-2#CR19_3141), it has two specific parts that in my opinion(and please) indicate that there is actually a batch effect regarding the different day of experiments:

"Four independent experiments were conducted, and each was performed in parallel with irradiated, bystander and sham-irradiated samples derived from a sub-cultivated pool of IMR-90 cells that were seeded from a single cryo-vial. Directly irradiated (outer dish) and bystander (inner dish) cells were separated at 30 minutes, 1, 2, 4, 6 and 24 hours after exposure, and RNA was isolated from the exposed cultures and from time-matched sham-irradiated controls using Ribopure (Ambion, Life Technologies)."

& also... "Each sample was hybridized to an Agilent Whole Human Genome Oligo Microarray (G4112F) using the Agilent one-color workflow as previously described".

Thus, these don't indicate that in each timepoint, both 3 conditions were processed together ? Also what is your opinion about the above MDS plot ?

2.  Moreover, i don't have to use the complex approach with splines like the tutorial, correctly ? Although i have 6 different time points ??

3. Finally, my third question is related to the design matrix: in case my above notion about the possible inclusion of the batch is correct, then:

batch <- factor(y2$targets$Batch)
design <- model.matrix(~0 + grouping + batch)

And then continue with makeContrasts to specify my selected comparisons. However, in case--which is a second goal of my current analysis--to identify possible DE genes in all time points in a specific comparison: for instance, irradiated vs bystander, which strategy should i follow ? Again, use makeContrasts to define all needed comparisons ? And then with the topTable function, if i dont define a coefficient, this will return the DE genes that are common between all the comparisons from above ?

 

 

ADD REPLY
1
Entering edit mode
  1. Yes, that is a batch effect, with each experiment corresponding to a batch. Hopefully the dishes at each time point (within each batch) are independent, otherwise you could be in a world of pain.
  2. Not if you want to make inferences about specific time points.
  3. Yes, use makeContrasts to define pairwise comparisons between the two groups at all time points (below). No, if you don't specify a coefficient, topTable will do an ANOVA looking for DE at any comparison, i.e., at any time point.
con <- makeContrasts(irradiated.0.5h - bystander.0.5h, 
    irradiated.1h - bystander.1h,irradiated.2h - bystander.2h,
    irradiated.4h - bystander.4h, levels=design)

If you want to identify DE that is present across all comparisons (i.e., at all time points), you'll need to run topTable separately for coef=1 to ncol(con), and then find the intersection of the DE lists.

ADD REPLY
0
Entering edit mode

Dear Aaron,

thank you again for your comments !! Unfortunately, regarding the issue for the dishes you have mentioned, i could not find any related information on the paper, but only this part:

"Early passage (population doubling < 35) IMR-90 human lung fibroblasts (Coriell repository, NJ) were sub-cultured in Dulbecco's modified Eagle's medium (Gibco) and Ham's F10 medium in a 1:1 mixture plus 15% fetal bovine serum. Mylar-bottomed culture dishes were prepared as described previously [50]. An inner dish with a base of 38-μm-thick Mylar strips was inserted into a larger dish with a 6-μm Mylar base. The 38-μm Mylar completely shields the α-particles so that only cells on the thinner Mylar areas of the dish were directly irradiated [50]. Cells seeded in these dishes formed a contiguous layer. Cells were exposed to 0 (sham irradiated) or 50 cGy4He ions (125 keV per micron) as simulated α-particles using the track segment mode of the 5.5-MV Singletron accelerator at the Radiological Research Accelerator Facility of Columbia University"...

& the above specific part again:

"Directly irradiated (outer dish) and bystander (inner dish) cells were separated at 30 minutes, 1, 2, 4, 6 and 24 hours after exposure, and RNA was isolated from the exposed cultures and from time-matched sham-irradiated controls using Ribopure (Ambion, Life Technologies)".

2. Regarding your answer number 3: i have two further important questions in order to understand appropriately the comparisons: although i understand that this is not the forum for statistical learning, which is the main difference for not using a number in coef--thus an ANOVA model (after for instance the con object you post above), and calling individually all needed contrasts and then intersect the common DE genes ?

3. To continue my previous question, you pinpointed above:

"If you want to identify DE that is present across all comparisons (i.e., at all time points), you'll need to run topTable separately for coef=1 to ncol(con)"

Thus, something like this ?

topTable(ebayes, coef=1:4,..) ?

or top.0.5h <- topTable(ebayes, coef=1,...), top1h <- topTable(ebayes, coef=2, ....)

ADD REPLY
0
Entering edit mode

For the first point, I can't tell for sure whether they're independent cultures. I guess they are, though, because it doesn't sound easy (or sensible) to extract RNA from part of the dish and then try to re-use the remaining culture in the dish for the the rest of the time-course.

For the second point, doing an ANOVA tests for DE in any of the comparisons, while dropping each coefficient separately and intersecting the resulting DE lists will identify genes that are DE in all comparisons.

For the third point, my suggestion refers to your second approach, where topTable is run separately on each comparison. Your first approach is basically an ANOVA with all contrasts.

ADD REPLY
0
Entering edit mode

Dear Aaron,

i return with one crusial question---as with some specific independent comparisons, i got a small number of DE genes, i tried the ANOVA approach as you suggested, with the following contrasts:

con <- makeContrasts(BYSTANDER.4.0H=groupingbystander.4.0h - groupingcontrol.4.0h, IRRADIATED.4.0H=groupingirradiated.4.0h - groupingcontrol.4.0h, BYSTANDER.0.5H=groupingbystander.0.5h - groupingcontrol.0.5h, IRRADIATED.0.5H=groupingirradiated.0.5h - groupingcontrol.0.5h,
BYSTANDER.1.0H=groupingbystander.1.0h - groupingcontrol.1.0h, IRRADIATED.1.0H=groupingirradiated.1.0h - groupingcontrol.1.0h,
BYSTANDER.2.0H=groupingbystander.2.0h - groupingcontrol.2.0h, IRRADIATED.2.0H=groupingirradiated.2.0h - groupingcontrol.2.0h,
BYSTANDER.6.0H=groupingbystander.6.0h - groupingcontrol.6.0h, IRRADIATED.6.0H=groupingirradiated.6.0h - groupingcontrol.6.0h,
BYSTANDER.24.0H=groupingbystander.24.0h - groupingcontrol.24.0h, IRRADIATED.24.0H=groupingirradiated.24.0h - groupingcontrol.24.0h,
levels=design)

 fit2 <- contrasts.fit(fit, con)
 fit3 <- eBayes(fit2, trend=TRUE, robust=TRUE)
 top <- topTable(fit3, number=nrow(fit3),adjust.method = "fdr",sort="none")
 dim(top)
[1] 28623    26

So firstly, when i type:

> sum(top$adj.P.Val<0.05)
[1] 2172

Thus, the above probesets(possibly might include some probesets annotated to the same gene symbol) are considered as "DE" in at least one of my specific above comparisons-based on the adjusted p-value cutoff ? Like essentially performing an F-test ?

Moreover, if my notion is correct: without specifying above a coef argument and also with the FDR correction, essentially each contrast individually is corrected for significance, right ?

2) If my above approach is valid, then about the output of topTable:

> head(top)
             Row Col Start                                                     Sequence
A_24_P66027    1  12     0 GCTGCCCGCATCTATGATTACGACCCCCTATATAAGGAGGCGCTGCAAATGCTGCGGGAT
A_32_P77178    1  13     0 GGTCAATTTTTGGTCAAAAGTACAGAGAGCATAGAATAAAAGCAAAGATGTGAATGTCTC
A_23_P212522   1  14     0 ATTTTCTAACTGTCCTCTTTCTTGGGTCTAAAGCTCATAATACACAAAGGCTTCCAGACC
A_24_P9671     1  16     0 ATCCAGGTCAGATTGTCAAGCATGGAGATATCAAGTGTGTACTAAATGAAGGCATGCCAA
A_24_P801451   1  18     0 AAATCGGGCCATCCGCACCAGAGGAAGATCATTCTGCCGGGACGTGGCTCGGGGCTATGA
A_32_P30710    1  19     0 ACGAAAGTCATACCGTAGAAAAGATGGCGTGTTTCTTTATTTTGAAGATAATGCAGGAGT
             ProbeUID ControlType    ProbeName GeneName SystematicName
A_24_P66027         3           0  A_24_P66027 APOBEC3B      NM_004900
A_32_P77178         5           0  A_32_P77178 AA085955       AA085955
A_23_P212522        7           0 A_23_P212522   ATP11B      NM_014616
A_24_P9671         11           0   A_24_P9671   DNAJA1      NM_001539
A_24_P801451       15           0 A_24_P801451    EHMT2      NM_006709
A_32_P30710        17           0  A_32_P30710    RPL23      NM_000978
                                                                                               
(also one other long column with title Description...)
             BYSTANDER.4.0H IRRADIATED.4.0H BYSTANDER.0.5H IRRADIATED.0.5H
A_24_P66027      0.17828327       0.3323737    -0.01818480     -0.06243908
A_32_P77178     -0.14227491      -0.2489862    -0.09736988     -0.06977693
A_23_P212522     0.39661495       0.2496522     0.32818109      0.22839273
A_24_P9671       0.32405611       0.1539536     0.70222291      0.68779624
A_24_P801451    -0.51962523      -0.4583589    -0.27292625     -0.38534732
A_32_P30710     -0.02024048      -0.1667158     0.06890394      0.14195404
             BYSTANDER.1.0H IRRADIATED.1.0H BYSTANDER.2.0H IRRADIATED.2.0H
A_24_P66027      0.19163420       0.2276427    0.528264657     0.611164814
A_32_P77178      0.01291631       0.1028711   -0.002811439    -0.113310500
A_23_P212522     0.17911915       0.2869050    0.071998494     0.004549741
A_24_P9671       0.28910854       0.2861973    0.015788612    -0.086588410
A_24_P801451    -0.23063658      -0.3439779   -0.173207568    -0.280488568
A_32_P30710      0.10791504       0.1999820    0.023367801    -0.009930507
             BYSTANDER.6.0H IRRADIATED.6.0H BYSTANDER.24.0H IRRADIATED.24.0H   AveExpr
A_24_P66027      0.86607625     0.854955100      0.24459418     -0.002272561  8.279915
A_32_P77178     -0.27692035    -0.323513696      0.06368761     -0.317450866  6.508215
A_23_P212522     0.17643157     0.002037142      0.18844095      0.122490970  9.558436
A_24_P9671      -0.01424156    -0.062085584      0.07504571      0.013623162 14.417819
A_24_P801451    -0.64620261    -0.837128081     -0.32691124     -0.344798053  8.299300
A_32_P30710     -0.05942074    -0.049292710     -0.03782244     -0.002235197 15.641793
                     F     P.Value adj.P.Val
A_24_P66027  0.9593174 0.497459386 0.9636784
A_32_P77178  0.9364381 0.518321544 0.9748521
A_23_P212522 1.0397094 0.427117742 0.9133405
A_24_P9671   1.8102526 0.068609651 0.3414735
A_24_P801451 2.9152761 0.003362603 0.0460551
A_32_P30710  0.3317904 0.979941847 0.9999996

In detail, regarding the output: each column-coefficient represents the average log2-fold change of each defined contrast ? And what is the difference with the column with title F  In a sense similar to the B column returned in a specific coefficient?

And if i want to restrict my "DE candidates", i should use something like the following :

top2 <- top[top$adj.P.Val < 0.05,] ? or i should also utilize the F information ??

Thank you,

Konstantinos

ADD REPLY
0
Entering edit mode

Break up your posts, because it's hard to answer your individual questions.

  1. Yes, the DE genes from your ANOVA-like contrast will have differences in at least one of the contrasts. No, the FDR correction is not applied on each individual contrast separately, it is applied on the p-value from the ANOVA using all contrasts simultaneously.
  2. Yes, they are the log-fold changes of each contrast. No, the F column represents the F-statistic of the ANOVA-like comparison, and is not related to the B-statistic (which represents the log-odds). Yes, you should filter on the adjusted p-value to get DE genes.
ADD REPLY
0
Entering edit mode

Thank you again for your comments. Just one final question about the annotation procedure and my DE probesets: as now i have multiple contrasts, and i selected my DE genes with the following code:

top2 <- top[top$adj.P.Val<0.05,]

probes <- as.character(rownames(top2))
> head(probes)
[1] "A_24_P801451" "A_23_P77000"  "A_24_P148907" "A_23_P152107" "A_23_P97584" 
[6] "A_23_P75973" 
gns <- select(hgug4112a.db, keys=probes, columns = "SYMBOL", keytype="PROBEID")
'select()' returned 1:1 mapping between keys and columns
> sum(duplicated(gns$SYMBOL))
[1] 565

Thus, because i have as expected some duplicated probesets matching to the same gene symbol :

1)  As i would like to use these probesets to perform something like time-series clustering with all the time points i have in this specific dataset, should i remove these duplicated probesets in order to have unique gene symbols ? And what is your opinion on this matter ?

2) Because you mentioned above that the "FDR correction is not applied on each individual contrast separately, it is applied on the p-value from the ANOVA using all contrasts simultaneously"--so, i can't tell with certainty in which contrast a specific probeset has been "characterized" as DE, right ?

3) Because i have two other DE lists from 2 separate datasets--that had two common timepoints comparison--and i would like to compare my resulted DE probesets with these 2 lists, i should again first remove any duplicated probesets ??

Best,

Konstantinos

ADD REPLY
0
Entering edit mode
  1. I'd use all the information in the data set during processing with limma. If redundant probe sets are a problem during interpretation, then I might remove them at the end of the analysis, but not before.
  2. That's right.
  3. No. If you work from probe set IDs throughout your analysis, this shouldn't even be a problem. I only convert to gene symbols at the very end of my analysis when I want to interpret the final results.
ADD REPLY
0
Entering edit mode

Dear Aaron,

with delay, i would like to address based on your answers above two small questions:

1) Should not in your above implementation of design matrix, also make the variable grouping as factor, before including it in the model matrix ? As in the case above, it stays as a character variable ?

2) Secondly, if my above approach is correct, and i want finally to impement an ANOVA model, but in specific comparisons that i will define by makeContrasts, as above, i should follow:

grouping <- as.factor(paste0(final$targets$Sample.and.Data.Relationship.Format, ".", final$targets$time))

batch <- factor(final$targets$Batch)
design <- model.matrix(~0 + grouping + batch)

fit <- lmFit(final, design)

con <- makeContrasts(irradiated.0.5h - control.0.5h, irradiated.1h - control.1h,irradiated.2h - control.2h,
irradiated.4h - control.4h, bystander.0.5h.-control.0.5h,... levels=design)
# and also some other contrasts

fit2 <- contrasts.fit(fit, con)

fit3 <- eBayes(fit2, trend=TRUE, robust=TRUE)

Then, if i use top <- topTable(fit3, sort.by="none", number=nrow(fit3)) without any coefficients:

it will return probesets that are DE in at least one of the above specified contrasts ?

Or i also have to include a cutoff in topTable, like p.value=0.05 in order to be more appropriate??

Best,

Konstantinos

ADD REPLY
1
Entering edit mode
  1. No, model.matrix will convert character vectors into factors for you.
  2. Your topTable call will return all probe sets. If you want DE probe sets, you should filter top to keep only those rows where adj.P.val is less than some FDR threshold, e.g., 5%.
ADD REPLY
0
Entering edit mode

Thank you again Aaron for your comments !! I thought that coercing to factor is almost a pre-requisite step before model.matrix, but i didn't have in mind that model.matrix convert character vectors into factors. So, coercing to factor is most advisable for variables with discrete numeric values, before implement into model.matrix ?

For your second answer about the ANOVA, so something like this:

top <- topTable(fit3, sort.by="none", number=nrow(fit3), p.value=0.05) ?

And this will return any probesets that have an adjusted p.value less than 0.05 in any contrasts ??

ADD REPLY

Login before adding your answer.

Traffic: 815 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6