Question

Advice for the analysis of microarray data with limma

0

Entering edit mode

Sam • 0

@3d8b1bad

Last seen 14 months ago

Germany

I would be very grateful for advice regarding the analysis of my microarray data. I am pretty new to gene expression analysis, and limma seems to be a very powerful package, but I am still a bit overwhelmed by it because it offers so many different functions that I can't make sense of yet.

In my experiment, I thawed a vial of cells and cultured them. The cells were then split into four plates and further cultured. Three of the four plates were treated with different drugs for several days; the fourth plate was not treated and served as a control.

At different time points, I collected a portion of the cells for gene expression analysis.

In analyzing the microarray data, I am now interested in looking at the differentially expressed genes (DEGs) at the different time points to determine whether the prolonged treatment has an enhancing effect, i.e., whether the number of significant DEGs increases. Since the cell culture's duration will also affect the genome, I would like to normalize the treated cells to the control sample of each time point so that I can only see the effects of the treatment.

Unfortunately, we could not make replicates of each sample due to cost, so I only measured one sample per treatment and time point.

How would one proceed here? So far, I have been inspired by chapter 17.4 (Time course effects of corn oil on rat thymus) of the limma Handbook. Here, however, the data are not normalized to a control, so I cannot adopt this pipeline one-to-one. Furthermore, replicates were measured here, and I wonder if I have to use any special functions to interpret my single measurements correctly.

My idea was to analyze the three treatments separately, i.e., first load only the data from the first treatment and the control, normalize the data to the control and then perform linear regression modeling using a y ~ 0 + time function. When the data are normalized to the control, no genes should be differentially expressed at time t0, which is why a forced intercept at (0,0) makes biological sense. From the model, the effect of treatment over time would then be seen. I would subsequently run this pipeline for the remaining two treatments as well.

I am grateful to the community for any hints, code snippets, or advice.

Targets.tsv

FileName    Name        Treatment       Time    Cy3

file1       ctrl_t0     none            0       control
file2       treat1_t0   treat1          0       sample
file3       treat2_t0   treat2          0       sample
file4       treat3_t0   treat3          0       sample
file5       ctrl_t1     none            1       control
file6       treat1_t1   treat1          1       sample
file7       treat2_t1   treat2          1       sample
file8       treat3_t1   treat3          1       sample
file9       ctrl_t2     none            2       control
file10      treat1_t2   treat1          2       sample
file11      treat2_t2   treat2          2       sample
file12      treat3_t2   treat3          2       sample
file13      ctrl_t3     none            3       control
file14      treat1_t3   treat1          3       sample
file15      treat2_t3   treat2          3       sample
file16      treat3_t3   treat3          3       sample
file17      ctrl_t4     none            4       control
file18      treat1_t4   treat1          4       sample
file19      treat2_t4   treat2          4       sample
file20      treat3_t4   treat3          4       sample

Selection of the files of one treatment*

targetinfo <- readTargets(targetsFile, row.names = 'Name')
targetinfo <- subset(targetinfo, Treatment == "treat1" | Treatment == "none")

Resulting targets

FileName    Name        Treatment       Time    Cy3

file1       ctrl_t0     none            0       control
file2       treat1_t0   treat1          0       sample
file5       ctrl_t1     none            1       control
file6       treat1_t1   treat1          1       sample
file9       ctrl_t2     none            2       control
file10      treat1_t2   treat1          2       sample
file13      ctrl_t3     none            3       control
file14      treat1_t3   treat1          3       sample
file17      ctrl_t4     none            4       control
file18      treat1_t4   treat1          4       sample

Linear regression*

time_factor <- factor(targetinfo$Time)
design <- model.matrix(~ 0 + time_factor)
fit <- lmFit(eset, design = design, weights = array.weights)

With this approach, the data is NOT normalized to the control as I do not know how to set the control samples as a reference. I saw a modelMatrix function in limma, which seems to do what I'm seeking, but I don't understand how I should combine it with my linear regression model ~ 0 + time_factor.

Microarray MicroarrayData limma • 2.2k views

ADD COMMENT • link updated 14 months ago by Gordon Smyth 50k • written 18 months ago by Sam • 0

0

Entering edit mode

A couple of questions:

Should treat1_t0, treat2_t0 and treat3_t0 all be the same as ctrl_t0? I have assumed in my answer below that they are.
Are the time-points equally spaced? When you asked this question on the Cross-Validated forum ( https://stats.stackexchange.com/questions/592854/ ) you said the time-points were t0, t2, t5, t7 and t10.
Can you show an MDS plot of the data so we can get an idea of the expression pattern over time? The first steps in a microarray analysis are always to normalize the data and then to explore the sample relationships with an MDS plot.

ADD REPLY • link 18 months ago Gordon Smyth 50k

0

Entering edit mode

Yes, the t0 samples were harvested just before adding the pharmaceuticals, thus all the t0 should be more or less the same. They are just four dirrerent cell culture plates cultured from one and the same cell vial.
The time-points are, indeed, 0, 2, 5, 7 , and 10, thus not strictly evenly spaced, unless we can omit t0.
Of course, I can! Which data should I include in the MDS plot? One plot per treatment and time-point?

ADD REPLY • link 18 months ago Sam • 0

0

Entering edit mode

The most important plot is the MDS plot of all 20 samples at once. Simply plotMDS(y).

ADD REPLY • link 18 months ago Gordon Smyth 50k

0

Entering edit mode

Here you go :)

MDS Plot

ADD REPLY • link 18 months ago Sam • 0

0

Entering edit mode

The MDS plot shows suggests that your treatments are hardly different to the controls. So the results from your experiment are likely to be rather limited. The most noticeable aspect of the plot one sample that is an outlier.

ADD REPLY • link 18 months ago Gordon Smyth 50k

score 0 · Answer 1 · 2022-10-22

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

It is not necessary to subset the data. It is easier as well as more powerful to analyse all the data together.

On the other hand, this experiment has some special features and needs a bespoke analysis approach.

I think that assuming simple linear regression over 5 time-points is too restrictive. I would allow for a quadratic trend at the very least. I suggest allowing for linear and quadratic baseline trends like this:

X <- poly(Targets$Time, degree=2)
Baseline.Lin  <- X[,1]
Baseline.Quad <- X[,2]

Then setup relative trends for the treatments:

Treat1.Lin  <- (Targets$Treatment=="treat1" & Targets$Time>0) * X[,1]
Treat1.Quad <- (Targets$Treatment=="treat1" & Targets$Time>0) * X[,2]
Treat2.Lin  <- (Targets$Treatment=="treat2" & Targets$Time>0) * X[,1]
Treat2.Quad <- (Targets$Treatment=="treat2" & Targets$Time>0) * X[,2]
Treat3.Lin  <- (Targets$Treatment=="treat3" & Targets$Time>0) * X[,1]
Treat3.Quad <- (Targets$Treatment=="treat3" & Targets$Time>0) * X[,2]

The design matrix will be

design <- model.matrix( ~ Baseline.Lin + Baseline.Quad
                        + Treat1.Lin + Treat1.Quad
                        + Treat2.Lin + Treat2.Quad
                        + Treat3.Lin + Treat3.Quad )

After fitting a linear model

fit <- lmFit(y, design)
fit <- eBayes(fit, trend=TRUE, robust=TRUE)

you can test whether Treatment 1 affects expression relative to the baseline (control) by

topTable(fit, coef=c("Treat1.Lin","Treat1.Quad"))

You can test whether the trend is increasing or decreasing by

topTable(fit, coef="Treat1.Lin")

Similarly for the other two treatments.

Note that you do not need to explicitly normalize the treated expression to the controls. That adjustment is done automatically by the linear model.

Note also that if the time-points are not equally spaced, you should define Time to be the real time-points in days.

ADD COMMENT • link 18 months ago Gordon Smyth 50k

0

Entering edit mode

Thank you for this detailed answer! I just tried the approach and came up with a few questions.

Why are there no FC or log2FC values in the topTable-output?
Where do I see, that the treated data of, say, t5 is adjusted to the control of t5? Is this done via the Baseline.Lin and Baseline.Quad?
Is it also possible to extract the topTable for each time-point and treatment, in order to deep-dive into the trends and changes?

ADD REPLY • link 18 months ago Sam • 0

0

Entering edit mode

I have shown you how to test for trends over time, which I had, possibly mistakingly, interpreted your question as asking for. You cannot expect to do individual tests for each time-point and treatment with any confidence when you have no replication. IMO you are trying to over-analyse and over-interpret what is fairly limited data.

ADD REPLY • link 18 months ago Gordon Smyth 50k

0

Entering edit mode

Okay, I understand. Yes, unfortunately, the data is pretty limited due to the high cost of a microarray experiment...

Could you please explain why you fit the data to a linear and a quadratic function while only using the linear one to determine whether there is a positive trend? What's the point in those other fits, then? Moreover, how do you define the positive trend? Am I supposed to interpret the numbers in the Treat1.Lin column? So a positive number means that there is a positive trend, while a negative number refers to a negative trend and bigger absolute values indicate that the trend is more pronounced?

ADD REPLY • link 18 months ago Sam • 0

0

Entering edit mode

The poly function computes orthogonal polynomials so the quadratic column is uncorrelated with the linear column. So we can examine the linear column alone to see if the overall trend is increasing or not. Yes, a positive coefficient for Treat1.Lin means that the trend is increasing and a negative means decreasing (relative to the control trend). A increasing trend means the fitted value will be higher at the last day than at time 0.

We need to allow for a quadratic or nonlinear effect in order to separate the systematic trend from the noise. If you only allow for a linear trend when the trend is actually nonlinear, then the nonlinear component of the trend will be pooled with the residual variance, increasing the standard errors and making it very hard to get any significant results.

ADD REPLY • link 18 months ago Gordon Smyth 50k

0

Entering edit mode

Thank you so much for your patience; I really appreciate it! Since I am not an expert in biostatistics, I have - as you can see - quite a hard time with this complex evaluation. Your answers are a great help to me!

If I understood you correctly, my data is only sufficient to draw a conclusion about the general trend of the treatments. This general trend can be seen in the linear regression coefficients, as you have pointed out once again. While I understand that additional degrees of freedom, such as in a quadratic or cubic regression, allow for more accurate data modeling, I still don't understand what information the quadratic coefficients contain since the individual polynomials are orthogonal to each other.

Are the coefficients in any relation to the foldchange? I want to visualize the results, and in microarray analysis, a volcano plot is generally a good way to do that. Still, I am unsure how legitimate it is to use the coefficients as a foldchange.

ADD REPLY • link 18 months ago Sam • 0

0

Entering edit mode

The linear coefficients estimate the log2-fold-change per unit of time.

A volcano plot does not visualize the data, it is just a pictorial way of presenting a topTable of genes. It doesn't tell you anything about your data that you can't already see from the DE gene list.

Personally, I would be visualizing the actual data and the actual expression trends. For example, I would plot fit$fitted.values versus time for selected genes of interest.

ADD REPLY • link 17 months ago Gordon Smyth 50k

0

Entering edit mode

Gordon Smyth Got it! But I still don't understand what information the higher-order polynomials provide. Could you please briefly elaborate on this topic?

ADD REPLY • link 17 months ago Sam • 0

0

Entering edit mode

Gordon Smyth Do you have any references for this method? I'd be interested in learning more about this topic. I appreciate any help you can provide.

ADD REPLY • link 14 months ago Sam • 0

0

Entering edit mode

You mean references on orthogonal polynomials? Polynomial approximation is a classic topic in mathematics and statistics and there are many references. Here is one of mine:

ADD REPLY • link 14 months ago Gordon Smyth 50k

0

Entering edit mode

I rather meant the basis behind your proposed approach. How do I know that I can model temporal effects without replicates by fitting orthogonal polynomials (at least 2nd degree) to them? Surely there is a statistical basis for this, but all the publications I can find on this subject have had a far more complex procedure.

ADD REPLY • link 14 months ago Sam • 0

0

Entering edit mode

Polynomial regression is the basis behind my proposed approach. It is a standard statistical approach covered in introductory statistics courses and I have pointed you to a relevant reference in the statistics literature.

For expositions of polynomial regression in the context of RNA-seq expression experiments, see

design guide https://f1000research.com/articles/9-1444
edgeR User's Guide https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf, see Section 4.8
voom paper https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29, see Drosophila example.

You have to tune the complexity of the models you fit to the data available. I would have thought it is pretty obvious that you can't usefully fit complex spline curves with large number of parameters (as in the papers you cite) to a time course of only 4 points. The limma User's Guide advises splines only when there is a large number of time points. And it is equally obvious that the time course pattern might be more complex than a simple linear trend. So fitting quadratic or cubic polynomials are your only choices.

ADD REPLY • link 14 months ago Gordon Smyth 50k