The Experiment

Question

Correct setup of a design matrix for a two-color, interwoven microarray loop design with dye swaps

0

Entering edit mode

mwesthues • 0

@mwesthues-8547

Last seen 9.6 years ago

Germany

This question has migrated from https://www.biostars.org/p/153362/ to this website.

I am looking for the correct way to set up a design matrix for a two-color. interwoven microarray loop design with dye swaps. So far, I have only found thoroughly described/discussed examples that use a common reference sample in the call to the "modelMatrix()" function in limma.

The Experiment
- Data
- Goal of the experiment
- Read data
- Data Normalization
- Fitting the linear model
- Creation of the design matrix
- Questions
- Further Data Processing
- Calculate gene expressions for inbred lines:
Appendix

The Experiment

The goal of this experiment is to compute gene expressions for inbred lines, which were assembled in hybridization pairs on a set of arrays. A two-color interwoven loop-design was used for this purpose. Hence, there is no common reference as typically used throughout the examples in the limma user manual.

Please note that you can recreate the design matrix yourself by using the necessary objects from the appendix at the bottom of this page!

Data

The data used herein are dummy data, nevertheless they are suited to describe the problem of setting up the design matrix that I would like to use.

We have no biological replicates in the real data.
No array was replicated.
All RNA sources that occur on multiple arrays are technical replicates.
The hybridization partners were selected in a way that ensures that all RNA sources are connected.

Goal of the experiment

Our intention is the computation of one gene expression value for every RNA source and for every gene, preferably with corresponding standard errors.

Read data

library("limma")
# Read target files.
targets <- readTargets{"targets.txt"}
# Extract unique RNA sources (i.e. inbred lines)
lines <- uniqueTargets(targets)
# Import hybridization data.
RG <- read.maimages(targets$FileName, source = "genepix",
                    wt.fun <- wtflags(weight = 0, cutoff = -50))
# Read GAL files and SpotType files
RG$genes <- readGAL("Microarray_layout.gal")
spottypes <- readSpotTypes()
RG$genes$Status <- controlStatus(spottypes, RG)

Data Normalization

MA <- normalizeWithinArrays(RG)
MA <- normalizeBetweenArrays(MA, method = "scale")

Fitting the linear model

Limma generates a linear model (generalized least squares) for each gene with the following model:

y_i = Xβ_i + ϵ_i

isGene <- MA$genes$Status == "Gene"
fit <- lmFit(MA[isGene, ], design)

n: number of arrays
m: number of unique inbred lines
y_i: vector with log-ratios of each array
X: design matrix
β_i: parameters of all coefficients 1, ..., n − 1
ϵ_i: vector of residuals

Creation of parameters for `limma`.

Below you can see the design of the entire microarray experiment, where the coefficient denotes the dye which was used for an inbred line in a particular array. In particular, the cell elements indicate whether

the inbred line (column) was not part of the array (row): 0
the inbred line was part of the array and labeled with Cy3: -1
the inbred line was part of the array and labeled with Cy5: 1

full_mat

##                 F001 F002 F003 F004 F005 P001 P002 P003
## P002Cy3xF001Cy5    1    0    0    0    0    0   -1    0
## F004Cy3xF001Cy5    1    0    0   -1    0    0    0    0
## P001Cy3xF005Cy5    0    0    0    0    1   -1    0    0
## F005Cy3xF004Cy5    0    0    0    1   -1    0    0    0
## P003Cy3xF004Cy5    0    0    0    1    0    0    0   -1
## F001Cy3xP003Cy5   -1    0    0    0    0    0    0    1
## F003Cy3xP003Cy5    0    0   -1    0    0    0    0    1
## F004Cy3xP003Cy5    0    0    0   -1    0    0    0    1
## F001Cy3xP002Cy5   -1    0    0    0    0    0    1    0
## P003Cy3xF002Cy5    0    1    0    0    0    0    0   -1
## F003Cy3xF005Cy5    0    0   -1    0    1    0    0    0
## P002Cy3xP001Cy5    0    0    0    0    0    1   -1    0
## F004Cy3xF003Cy5    0    0    1   -1    0    0    0    0
## F002Cy3xP003Cy5    0   -1    0    0    0    0    0    1
## F005Cy3xF001Cy5    1    0    0    0   -1    0    0    0
## P003Cy3xP002Cy5    0    0    0    0    0    0    1   -1
## P001Cy3xF004Cy5    0    0    0    1    0   -1    0    0

The reduced design, which still describes the entire experiment, but where the number of selected arrays is one less than the number of unique inbred lines:

parameters

##      F003Cy3xF005Cy5 P002Cy3xP001Cy5 F004Cy3xF003Cy5 F002Cy3xP003Cy5
## F001               0               0               0               0
## F002               0               0               0              -1
## F003              -1               0               1               0
## F004               0               0              -1               0
## F005               1               0               0               0
## P001               0               1               0               0
## P002               0              -1               0               0
## P003               0               0               0               1
##      F005Cy3xF001Cy5 P003Cy3xP002Cy5 P001Cy3xF004Cy5
## F001               1               0               0
## F002               0               0               0
## F003               0               0               0
## F004               0               0               1
## F005              -1               0               0
## P001               0               0              -1
## P002               0               1               0
## P003               0              -1               0

Creation of the design matrix

I cannot find any examples in the limma user guide that describe the setup of design matrices for a case where no common reference is used on every array and dye swaps are applied at the same time.

design <- modelMatrix(targets, parameters = parameters)

## Found unique target names:
##  F001 F002 F003 F004 F005 P001 P002 P003

design

##                 F003Cy3xF005Cy5 P002Cy3xP001Cy5 F004Cy3xF003Cy5
## P002Cy3xF001Cy5               1               1               1
## F004Cy3xF001Cy5               1               0               1
## P001Cy3xF005Cy5               1               0               1
## F005Cy3xF004Cy5              -1               0              -1
## P003Cy3xF004Cy5               0               1               0
## F001Cy3xP003Cy5              -1              -1              -1
## F003Cy3xP003Cy5               0              -1              -1
## F004Cy3xP003Cy5               0              -1               0
## F001Cy3xP002Cy5              -1              -1              -1
## P003Cy3xF002Cy5               0               0               0
## F003Cy3xF005Cy5               1               0               0
## P002Cy3xP001Cy5               0               1               0
## F004Cy3xF003Cy5               0               0               1
## F002Cy3xP003Cy5               0               0               0
## F005Cy3xF001Cy5               0               0               0
## P003Cy3xP002Cy5               0               0               0
## P001Cy3xF004Cy5               0               0               0
##                 F002Cy3xP003Cy5 F005Cy3xF001Cy5 P003Cy3xP002Cy5
## P002Cy3xF001Cy5               0               1               0
## F004Cy3xF001Cy5               0               1               0
## P001Cy3xF005Cy5               0               0               0
## F005Cy3xF004Cy5               0               0               0
## P003Cy3xF004Cy5               0               0               1
## F001Cy3xP003Cy5               0              -1              -1
## F003Cy3xP003Cy5               0               0              -1
## F004Cy3xP003Cy5               0               0              -1
## F001Cy3xP002Cy5               0              -1               0
## P003Cy3xF002Cy5              -1               0               0
## F003Cy3xF005Cy5               0               0               0
## P002Cy3xP001Cy5               0               0               0
## F004Cy3xF003Cy5               0               0               0
## F002Cy3xP003Cy5               1               0               0
## F005Cy3xF001Cy5               0               1               0
## P003Cy3xP002Cy5               0               0               1
## P001Cy3xF004Cy5               0               0               0
##                 P001Cy3xF004Cy5
## P002Cy3xF001Cy5               1
## F004Cy3xF001Cy5               0
## P001Cy3xF005Cy5               1
## F005Cy3xF004Cy5               0
## P003Cy3xF004Cy5               1
## F001Cy3xP003Cy5              -1
## F003Cy3xP003Cy5              -1
## F004Cy3xP003Cy5              -1
## F001Cy3xP002Cy5              -1
## P003Cy3xF002Cy5               0
## F003Cy3xF005Cy5               0
## P002Cy3xP001Cy5               0
## F004Cy3xF003Cy5               0
## F002Cy3xP003Cy5               0
## F005Cy3xF001Cy5               0
## P003Cy3xP002Cy5               0
## P001Cy3xF004Cy5               1

Questions

To me this design matrix looks spurious. Generally I think of design matrices as planes of zeroes and ones that either link an experimental factor to the response (1) or not (0). However, in the design matrix above I do not see an obvious relationship between the columns (coefficients) and the rows (arrays). For example a value of 1 was assigned to row 1 of column 1 (["P002Cy3xF001Cy5", "F003Cy3xF005Cy5"]) although there is no obvious connection between the two. Digging into the modelMatrix() function from the limma package, I got stuck with the following line of code:

zapsmall(t(solve(crossprod(parameters), crossprod(parameters, 
                                                  J))), 14)

My understanding is that parameters represents the connection between all unique RNA sources (inbred lines) and the selected arrays that represent the whole experiment. Moreover, it looks as if J is an extension of parameters in a way that it represents the connection between all RNA sources and all arrays of the experiment. When comparing parameters, J and the output of modelMatrix() to the examples for design matrices in the limma user manual, it seems to me as if the design matrix should be rather J and not the output of modelMatrix().

It would be appreciated if you could fill this gap in my understanding!

Further Data Processing

Correct the standard deviations and apply a moderated F-Test to determine whether a gene was differentially expressed in any comparison (as determined by the used coefficients) using eBayes().
Correct p-values for multiple testing using the approach.

fit <- eBayes(fit)
fit[["F.p.value"]] <- p.adjust(fit[["F.p.value"]], method = "fdr",
                               n = length(fit[["F.p.value"]]))

Write the results to a file:

write.fit(fit, results = NULL, file = "exp-001.dta", digits = 5,
          adjust = "BH", sep = "\t")

Read the results from a file:

exp.data <- read.table("exp-001.dta", header = TRUE, sep = "\t", 
                       stringsAsFactors = FALSE)
exp.data <- exp.data[!is.na(exp.data$F.p.value), ]
attr(exp.data, "row.names") <- exp.data[, "Genes.ID"]

Select genes, which were differentially expressed on at least one array:

exp.dta <- exp.data[exp.data$F.p.value < 0.01, ]

Extract coefficients and Amean:

exp.ratio <- exp.dta[, seq_along(lines)]

Calculate gene expressions for inbred lines:

B <- t(exp.ratio)
# Obtain gene expression values for each inbred line and each mRNA by solving 
# the equation a %*% b = b for x
x <- solve(A, B)

A**x = B

A * x_{lines × mRNAs} = B_{coe**f × mRNAs}

with A_m, m where ...

the columns represent m unique RNA sources (i.e. inbred lines).
the first row contains the frequency of inbred lines 1,...,m within the entire experiment.
all following rows represent the unique set of coefficients of length m - 1 that connect the entire experiment.
the cell elements indicate whether
- the inbred line (column) was not part of the array (row): 0
- the inbred line was part of the array and labeled with Cy3: -1
- the inbred line was part of the array and labeled with Cy5: 1

Appendix

R-objects for the generation of the design matrix.

# parameters
dput(parameters)

structure(c(0, 0, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 
0, 0, 1, -1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 
-1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 1, 0, -1, 0, 0
), .Dim = c(8L, 7L), .Dimnames = list(c("F001", "F002", "F003", 
"F004", "F005", "P001", "P002", "P003"), c("F003Cy3xF005Cy5", 
"P002Cy3xP001Cy5", "F004Cy3xF003Cy5", "F002Cy3xP003Cy5", "F005Cy3xF001Cy5", 
"P003Cy3xP002Cy5", "P001Cy3xF004Cy5")))

# targets
dput(targets)

structure(list(FileName = structure(c(13L, 6L, 12L, 10L, 16L, 
2L, 5L, 8L, 1L, 15L, 4L, 14L, 7L, 3L, 9L, 17L, 11L), .Label = c("F001Cy3xP002Cy5", 
"F001Cy3xP003Cy5", "F002Cy3xP003Cy5", "F003Cy3xF005Cy5", "F003Cy3xP003Cy5", 
"F004Cy3xF001Cy5", "F004Cy3xF003Cy5", "F004Cy3xP003Cy5", "F005Cy3xF001Cy5", 
"F005Cy3xF004Cy5", "P001Cy3xF004Cy5", "P001Cy3xF005Cy5", "P002Cy3xF001Cy5", 
"P002Cy3xP001Cy5", "P003Cy3xF002Cy5", "P003Cy3xF004Cy5", "P003Cy3xP002Cy5"
), class = "factor"), Cy3 = structure(c(7L, 4L, 6L, 5L, 8L, 1L, 
3L, 4L, 1L, 8L, 3L, 7L, 4L, 2L, 5L, 8L, 6L), .Label = c("F001", 
"F002", "F003", "F004", "F005", "P001", "P002", "P003"), class = "factor"), 
    Cy5 = structure(c(1L, 1L, 5L, 4L, 4L, 8L, 8L, 8L, 7L, 2L, 
    5L, 6L, 3L, 8L, 1L, 7L, 4L), .Label = c("F001", "F002", "F003", 
    "F004", "F005", "P001", "P002", "P003"), class = "factor")), .Names = c("FileName", 
"Cy3", "Cy5"), row.names = c("P002Cy3xF001Cy5", "F004Cy3xF001Cy5", 
"P001Cy3xF005Cy5", "F005Cy3xF004Cy5", "P003Cy3xF004Cy5", "F001Cy3xP003Cy5", 
"F003Cy3xP003Cy5", "F004Cy3xP003Cy5", "F001Cy3xP002Cy5", "P003Cy3xF002Cy5", 
"F003Cy3xF005Cy5", "P002Cy3xP001Cy5", "F004Cy3xF003Cy5", "F002Cy3xP003Cy5", 
"F005Cy3xF001Cy5", "P003Cy3xP002Cy5", "P001Cy3xF004Cy5"), class = "data.frame")

# full_mat
dput(full_mat)

structure(c(1, 1, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 
0, 0, 0, 0, -1, 0, 0, 0, -1, 0, 1, 0, 0, 0, 0, 0, -1, 0, 1, 1, 
0, 0, -1, 0, 0, 0, 0, -1, 0, 0, 0, 1, 0, 0, 1, -1, 0, 0, 0, 0, 
0, 0, 1, 0, 0, 0, -1, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 
1, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 0, 0, 
0, 1, 0, 0, 0, 0, 0, -1, 1, 1, 1, 0, -1, 0, 0, 0, 1, 0, -1, 0
), .Dim = c(17L, 8L), .Dimnames = list(c("P002Cy3xF001Cy5", "F004Cy3xF001Cy5", 
"P001Cy3xF005Cy5", "F005Cy3xF004Cy5", "P003Cy3xF004Cy5", "F001Cy3xP003Cy5", 
"F003Cy3xP003Cy5", "F004Cy3xP003Cy5", "F001Cy3xP002Cy5", "P003Cy3xF002Cy5", 
"F003Cy3xF005Cy5", "P002Cy3xP001Cy5", "F004Cy3xF003Cy5", "F002Cy3xP003Cy5", 
"F005Cy3xF001Cy5", "P003Cy3xP002Cy5", "P001Cy3xF004Cy5"), c("F001", 
"F002", "F003", "F004", "F005", "P001", "P002", "P003")))

view raw mRNA_limma_design_matrix.md hosted with ❤ by GitHub

limma design matrix dye-swap two-colour microarray limma • 2.1k views

ADD COMMENT • link updated 9.7 years ago by James W. MacDonald 68k • written 9.7 years ago by mwesthues • 0

0

Entering edit mode

I applaud your attempt at comprehensiveness in posing a question, but in trying to be comprehensive, you have failed to give us much useful information!

Do you have biological replicates? If so, which samples are replicates? F001, etc are needlessly obscure.
What are your goals? If you are using limma, it's already obvious that you are fitting a linear model, so you don't need to tell anybody that. Your goals aren't to fit a linear model, but instead are to determine differences between one or more pairs of sample types (or maybe combinations thereof). What comparisons are you interested in making?

ADD REPLY • link 9.7 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks for your helpful remarks! I have added information on the experimental design and the goal of our analysis.

1. We do not have any biological replicates, just technical replicates of RNA sources.

2. Our goal is to compute "adjusted" gene expression values for each RNA source.

ADD REPLY • link 9.7 years ago mwesthues • 0

score 1 · Answer 1 · 2015-08-05

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

Your goal is an unconventional one for the microarray context, especially a two-color microarray. In other words, an expression value really doesn't mean anything by itself, but is only useful when compared to the expression of the same gene in a different sample. In addition, standard errors of technical replicates only tell you how reproducible the technology is, not how variable the gene expression is, across your sample (and presumably across the population of interest).

Ideally you would analyze this as a two-color array, in which case you would compute (and use) the log ratios that are inherent to this sort of design. Do note that starting on p. 38 (and again on p 54) of the limma User's Guide, there are examples that directly relate to what you are doing. But this will not result in individual gene expression values for each gene, but instead will give you log ratios.

If you are truly interested in expression estimates for each gene, then you likely have to do a separate analysis of two-color data, which is covered in the limma User's Guide starting on p. 58.

ADD COMMENT • link 9.7 years ago James W. MacDonald 68k

0

Entering edit mode

I will try a separate channel analysis of the two-color data and check whether that works for me.
Regarding the direct two-color design, I am still lacking the understanding of the set-up of the design matrix. In the limma User's guide (p.38 & p.54) a common reference, which I do not have, is used. Instead we used an interwoven loop design. Hence my original question on the design matrix for such a direct two-color design with dye swaps. Do you have any suggestions?

ADD REPLY • link 9.7 years ago mwesthues • 0

1

Entering edit mode

On page 38, the design is NOT a common reference design. Did you read that page? I am not sure why you think it is a common reference design. Note that there is a difference between using a common reference on your arrays and specifying a sample type to use as a reference when you are fitting a model.

The main idea behind using a two-color array is that a split-plot design is inherently more powerful because you can more accurately account for technical variability. And in a microarray context, where the goal is usually to compare gene expression between samples (rather than trying to estimate gene expression), the fact that the data are all log ratios is not a hardship.

Even with an interwoven design you can simply say that one sample is the reference, so all your coefficients then estimate the log ratio between a given sample and the reference. And making any other comparison is as simple as computing the log ratio between the coefficients. In other words

Coef1 = log(F002/F001)

Coef2 = log(F003/F001)

log(F003/F002) = Coef2 - Coef1

So specifying a reference in this context has nothing to do with how the arrays were hybridized, and everything to do with how you plan to make comparisons, and which comparisons are more convenient.

The Experiment

Data

Goal of the experiment

Read data

Data Normalization

Fitting the linear model

Creation of parameters for limma.

Creation of the design matrix

Questions

Further Data Processing

Calculate gene expressions for inbred lines:

Appendix

Creation of parameters for `limma`.