Question: Limma PlotMDS: what values should be used?
1
16 months ago by
dacaba.9210
dacaba.9210 wrote:

Hi,

I'm analyzing 36 RNAseq samples in order to obtain differentially expressed genes. The study design is a single time course with 12 timepoints ("t1" to "t12") and 3 replicates per timepoint (A, B and C)

First of all I want to confirm the strategy I'm following it's the correct one. I'm using edgeR-glm for doing the comparisons 't2-t1', 't3-t2', 't4-t3', 't5-t4' and so on. What I pretend here is to obtain the DE genes between each time. I know about the existence of packages prepared to do time course analyses, such as 'maSigPro'. However I had a problem running this analyses and I think it's due to the large amount of timepoints. Specifically, the command that fails is:

design <- make.design.matrix(eDesign, degree = 2)
min.obs <- (2+1)*(ncol(eDesign)-2)+1
#command that fails:
fit <- p.vector(data = counts_norm, design = design, Q = 0.05, counts = TRUE, min.obs = min.obs) 

Here a little snippet of the 'eDesign' matrix:

<caption>eDesign matrix</caption>
Time Replicates Group
A1 1 1 1
B1 1 1 1
C1 1 1 1
A2 2 2 1
B2 2 2 1
C2 2 2 1
... ... ... ...
A12 12 12 1
B12 12 12 1
C12 12 12 1

I have tried up to degree = 5, and still failing. I think the problem is the timepoints number (12) because if I group them in only 5 the command runs smoothly.

That's the major cause I'm doing this pairwise comparisons between times ('t2-t1', 't3-t2', 't4-t3'...), cause I can't directly use maSigPro with the 12 timepoints. Moreover, with the edger-glm pairwise approach I can know exactly what genes are DE between each paire of times. Haven't seen this is possible with maSigPro.

For this 5 timepoint grouping I have been based on the plotMDS results and the DE analyses between times. And here comes my next question:

- From what data should MDS plots be done: raw counts or normalized counts? I would say normalized data cause they take into account lib. sizes. The point is I don't obtain the same results when running next commands:

# First one
DGE <- calcNormFactors(DGE, method = "TMM")
plotMDS(DGE)
# Second one
DGE_norm <- tmm(DGE$counts, long = 1000, lc = 0, k = 0) plotMDS(DGE_norm) I know calcNormFactors() doesn't actually normalize the counts, just calculates the normalization factor. However in the edgeR manual (https://www.bioconductor.org/packages/devel/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf), sections 4.1.4 and 4.1.5, the pass to limma::plotMDS() this normalized DGElist, such as the first comand above. Doing that I obtain the same MDS plot that with the raw counts (the raw DGElist without doing calcNormFactors()). The second command does generate a different plot. I could attach the plots if someone ask for them. I'm interested in knowing how my samples group because I need to reduce the timepoints from 12 to 4-6 in order to use maSigPro. Many thanks, Dani limma masigpro edger tmm plotmds • 874 views ADD COMMENTlink modified 16 months ago • written 16 months ago by dacaba.9210 What exactly is the tmm() function that you are using? There is no such function in any of the packages you say that you are using (limma, edgeR or maSigPro). I am guessing that tmm() is actually a function from the NOISeq package. Why are you using it? It doesn't produce a DGEList object and it therefore will give quite different results in downstream analysis as compared to calcNormFactors(). You would do better to stick to the regular TMM method that is implemented in the edgeR package. I think that NOISeq::tmm is probably a sort of approximation to cpm() in the edgeR package, but the help page for tmm() doesn't document what the output is so I don't know for sure. ADD REPLYlink modified 16 months ago • written 16 months ago by Gordon Smyth37k Answer: Limma PlotMDS: what values should be used? 2 16 months ago by Aaron Lun23k Cambridge, United Kingdom Aaron Lun23k wrote: I can't speak for whatever MaSigPro is doing, but edgeR itself can handle 12 time points without any problems. There is no need to reduce the number of time points prior to a DE analysis. In your case, it seems you want to do comparisons between specific time points, so parametrizing each time point as a group would be sufficient: design <- model.matrix(~0 + Time, data=eDesign)  ... or, if the samples with the same Replicate value are related to each other: design <- model.matrix(~0 + Time + Replicate, data=eDesign)  You should then be able to do comparisons between specific groups (time points) as you please, using edgeR's GLM machinery to set up and test contrasts of interest. In fact, you seem to acknowledge that this is possible in your post, so I don't see why you need to consolidate the time points to get what you want. As for the MDS plots; I don't know what the tmm() function does, so I can't speak for that. All I can say is that calling plotMDS() on a DGEList will first call cpm() and then create an MDS plot on the log-CPMs generated from the count matrix. This will use any normalization factors that are present, so the result will be different from that generated with the "raw" DGElist. If you're not observing any differences, I would suggest that it is because your normalization factors are so close to 1 that they're not having much effect. ADD COMMENTlink modified 16 months ago • written 16 months ago by Aaron Lun23k Answer: Limma PlotMDS: what values should be used? 0 16 months ago by dacaba.9210 dacaba.9210 wrote: Hi Aaron, many thanks for your quick response. I must apologize for one thing: MDS plots do actually vary due to the TMM normalization. The samples distribution is quite similar in both plots, and this is why I got mistaken, just didn't look enough to them. Here both plots. First number means the timepoint and letter is the replicates (A, B or C). I don't know if the replicates pertain to different batches. Last number (i.e. 1A1, 4B2) has not importance. Here a summary of lib. sizes and norm. factors: > summary(DGE$samples$lib.size) Min. 1st Qu. Median Mean 3rd Qu. Max. 8232274 9206213 11469978 11822144 13279158 18278786 > summary(DGE$samples$norm.factors) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.5085 0.8216 1.0399 1.0521 1.3012 1.5670 And lastly, plots showing counts distribution before (first one) and after normalization (second one). Based on these initial quality control plots I would say data quality seems to be quite good. However, I'm thinking in removing the '12C1' as it clearly doesn't group with the other 't12' samples. There is something you said that I have not clear. When you say '... or, if the samples with the same Replicate value are related to each other...', what do you mean exactly? As far as I understand these replicates are supposed to be identical (in theory at least), so I'm not taking into account the Replicates information. Should I include this information in my analysis? Once normalization and dispersion stimation have been done, that's how I'm doing the contrasts (notice I build design in a different manner): # times are 't1', 't2', 't3'... group <- as.factor( paste("t", (rep(c(1:12), each=3)), sep="")) # design creation design <- model.matrix(~0+group) colnames(design) <- gsub("group", "", colnames(design)) # Contrasts contr.matrix <- makeContrasts( t2vst1 = t2-t1, t3vst2 = t3-t2, t4vst3 = t4-t3, t5vst4 = t5-t4, t6vst5 = t6-t5, t7vst6 = t7-t6, t8vst7 = t8-t7, t9vst8 = t9-t8, t10vst9 = t10-t9, t11vst10 = t11-t10, t12vst11 = t12-t11, levels = colnames(design) ) # DE analysis fit <- glmQLFit(DGE, design) qlf <-glmQLFTest(fit, contrast = contr.matrix[,"t2vst1"]) I run topTags(qlf, n = nrow(DGE$counts))\$table after every constrast, and that's how I obtain DE genes for every contrast. This is the way I have done my time course analysis, but I'm not sure if is the best, or even if it's correct.

Is there any better way than doing these pairwise comparisons?

1. With respect to Replicates: what is the reasoning behind labelling the samples A, B, C? Do all samples labelled as "A" come from the same individual, experiment, run of the time course, etc.? If so, there will be correlations between those samples, and you should block on the A/B/C identity of each sample in the design matrix.
2. Your pairwise contrasts look fine to me.

Sorry, I put you in context. Three cultures (A, B and C) were prepared from the same pure extract. Samples were collected from these cultures at different times, so for the first time we had samples 1A, 1B and 1C, for the second time 2A, 2B, 2C etc. All 36 samples were sequenced in the same run.

Correct me if I'm wrong, but now I think this information should be included in my analysis in order to correct a possible batch effect from cultures A, B anc C.

Here is how I would proceed:

- eDesign file -

Time Batch
1A 1 A
1B 1 B
1C 1 C
2A 2 A
2B 2 B
2C 2 C
... ... ...
12A 12 A
12B 12 B
12C 12 C

- Code -

eDesign <- read.delim("blocking_design.txt", check.names = FALSE, stringsAsFactors = FALSE, row.names = 1)
design <- model.matrix(~0+Batch+Time, data = eDesign)

The problem I'm facing now it's I don't know how to make the contrast with this matrix design, attached here:

There aren't different columns for the different times, so I don't know how to make these constrasts. Probably it's cause the matrix design is not well builded for my purposes. I have been looking the worked case at the edgeR manual, but it's still not clear to me.

Here my questions:

• Is this a valid matrix design (design) for my purpose (compare t1-t2, t2-t3, t3-4..)? If so, how can I proceed with the contrasts?
• When this blocking factor is used,will the contrasts be made only within the same batch? If so, will the results across the batches be merged and corrected by this batch effect?

If you use ~0 + Batch + Time, your Time coefficients will represent the log-fold change relative to the first time point. You can still perform your contrasts, but perhaps it would be easier to do ~0 + Time + Batch, where the Time coefficients represent the average log-expression at each time point(*), just like you had before.

Your comparisons between time points will be performed using information from all batches. This is because your design is an additive model, which assumes that the effect of time is the same for every batch. Thus, testing the effect of time within one batch is the same as testing the average effect across all batches. This is why I put a (*) above; technically, your Time coefficients actually represent the log-expression in the first batch only, but the differences between coefficients are assumed to be the same across batches, so information from the other batches gets used anyway.