- The Experiment
- Data
- Goal of the experiment
- Read data
- Data Normalization
- Fitting the linear model
- Creation of the design matrix
- Questions
- Further Data Processing
- Calculate gene expressions for inbred lines:
- Appendix
The goal of this experiment is to compute gene expressions for inbred lines, which were assembled in hybridization pairs on a set of arrays. A two-color interwoven loop-design was used for this purpose. Hence, there is no common reference as typically used throughout the examples in the limma user manual.
Please note that you can recreate the design matrix yourself by using the necessary objects from the appendix at the bottom of this page!
The data used herein are dummy data, nevertheless they are suited to describe the problem of setting up the design matrix that I would like to use.
- We have no biological replicates in the real data.
- No array was replicated.
- All RNA sources that occur on multiple arrays are technical replicates.
- The hybridization partners were selected in a way that ensures that all RNA sources are connected.
Our intention is the computation of one gene expression value for every RNA source and for every gene, preferably with corresponding standard errors.
library("limma")
# Read target files.
targets <- readTargets{"targets.txt"}
# Extract unique RNA sources (i.e. inbred lines)
lines <- uniqueTargets(targets)
# Import hybridization data.
RG <- read.maimages(targets$FileName, source = "genepix",
wt.fun <- wtflags(weight = 0, cutoff = -50))
# Read GAL files and SpotType files
RG$genes <- readGAL("Microarray_layout.gal")
spottypes <- readSpotTypes()
RG$genes$Status <- controlStatus(spottypes, RG)
MA <- normalizeWithinArrays(RG)
MA <- normalizeBetweenArrays(MA, method = "scale")
Limma generates a linear model (generalized least squares) for each gene with the following model:
yi = Xβi + ϵi
isGene <- MA$genes$Status == "Gene"
fit <- lmFit(MA[isGene, ], design)
- n: number of arrays
- m: number of unique inbred lines
- yi: vector with log-ratios of each array
- X: design matrix
- βi: parameters of all coefficients 1, ..., n − 1
- ϵi: vector of residuals
Below you can see the design of the entire microarray experiment, where the coefficient denotes the dye which was used for an inbred line in a particular array. In particular, the cell elements indicate whether
- the inbred line (column) was not part of the array (row): 0
- the inbred line was part of the array and labeled with Cy3: -1
- the inbred line was part of the array and labeled with Cy5: 1
full_mat
## F001 F002 F003 F004 F005 P001 P002 P003
## P002Cy3xF001Cy5 1 0 0 0 0 0 -1 0
## F004Cy3xF001Cy5 1 0 0 -1 0 0 0 0
## P001Cy3xF005Cy5 0 0 0 0 1 -1 0 0
## F005Cy3xF004Cy5 0 0 0 1 -1 0 0 0
## P003Cy3xF004Cy5 0 0 0 1 0 0 0 -1
## F001Cy3xP003Cy5 -1 0 0 0 0 0 0 1
## F003Cy3xP003Cy5 0 0 -1 0 0 0 0 1
## F004Cy3xP003Cy5 0 0 0 -1 0 0 0 1
## F001Cy3xP002Cy5 -1 0 0 0 0 0 1 0
## P003Cy3xF002Cy5 0 1 0 0 0 0 0 -1
## F003Cy3xF005Cy5 0 0 -1 0 1 0 0 0
## P002Cy3xP001Cy5 0 0 0 0 0 1 -1 0
## F004Cy3xF003Cy5 0 0 1 -1 0 0 0 0
## F002Cy3xP003Cy5 0 -1 0 0 0 0 0 1
## F005Cy3xF001Cy5 1 0 0 0 -1 0 0 0
## P003Cy3xP002Cy5 0 0 0 0 0 0 1 -1
## P001Cy3xF004Cy5 0 0 0 1 0 -1 0 0
The reduced design, which still describes the entire experiment, but where the number of selected arrays is one less than the number of unique inbred lines:
parameters
## F003Cy3xF005Cy5 P002Cy3xP001Cy5 F004Cy3xF003Cy5 F002Cy3xP003Cy5
## F001 0 0 0 0
## F002 0 0 0 -1
## F003 -1 0 1 0
## F004 0 0 -1 0
## F005 1 0 0 0
## P001 0 1 0 0
## P002 0 -1 0 0
## P003 0 0 0 1
## F005Cy3xF001Cy5 P003Cy3xP002Cy5 P001Cy3xF004Cy5
## F001 1 0 0
## F002 0 0 0
## F003 0 0 0
## F004 0 0 1
## F005 -1 0 0
## P001 0 0 -1
## P002 0 1 0
## P003 0 -1 0
I cannot find any examples in the limma user guide that describe the setup of design matrices for a case where no common reference is used on every array and dye swaps are applied at the same time.
design <- modelMatrix(targets, parameters = parameters)
## Found unique target names:
## F001 F002 F003 F004 F005 P001 P002 P003
design
## F003Cy3xF005Cy5 P002Cy3xP001Cy5 F004Cy3xF003Cy5
## P002Cy3xF001Cy5 1 1 1
## F004Cy3xF001Cy5 1 0 1
## P001Cy3xF005Cy5 1 0 1
## F005Cy3xF004Cy5 -1 0 -1
## P003Cy3xF004Cy5 0 1 0
## F001Cy3xP003Cy5 -1 -1 -1
## F003Cy3xP003Cy5 0 -1 -1
## F004Cy3xP003Cy5 0 -1 0
## F001Cy3xP002Cy5 -1 -1 -1
## P003Cy3xF002Cy5 0 0 0
## F003Cy3xF005Cy5 1 0 0
## P002Cy3xP001Cy5 0 1 0
## F004Cy3xF003Cy5 0 0 1
## F002Cy3xP003Cy5 0 0 0
## F005Cy3xF001Cy5 0 0 0
## P003Cy3xP002Cy5 0 0 0
## P001Cy3xF004Cy5 0 0 0
## F002Cy3xP003Cy5 F005Cy3xF001Cy5 P003Cy3xP002Cy5
## P002Cy3xF001Cy5 0 1 0
## F004Cy3xF001Cy5 0 1 0
## P001Cy3xF005Cy5 0 0 0
## F005Cy3xF004Cy5 0 0 0
## P003Cy3xF004Cy5 0 0 1
## F001Cy3xP003Cy5 0 -1 -1
## F003Cy3xP003Cy5 0 0 -1
## F004Cy3xP003Cy5 0 0 -1
## F001Cy3xP002Cy5 0 -1 0
## P003Cy3xF002Cy5 -1 0 0
## F003Cy3xF005Cy5 0 0 0
## P002Cy3xP001Cy5 0 0 0
## F004Cy3xF003Cy5 0 0 0
## F002Cy3xP003Cy5 1 0 0
## F005Cy3xF001Cy5 0 1 0
## P003Cy3xP002Cy5 0 0 1
## P001Cy3xF004Cy5 0 0 0
## P001Cy3xF004Cy5
## P002Cy3xF001Cy5 1
## F004Cy3xF001Cy5 0
## P001Cy3xF005Cy5 1
## F005Cy3xF004Cy5 0
## P003Cy3xF004Cy5 1
## F001Cy3xP003Cy5 -1
## F003Cy3xP003Cy5 -1
## F004Cy3xP003Cy5 -1
## F001Cy3xP002Cy5 -1
## P003Cy3xF002Cy5 0
## F003Cy3xF005Cy5 0
## P002Cy3xP001Cy5 0
## F004Cy3xF003Cy5 0
## F002Cy3xP003Cy5 0
## F005Cy3xF001Cy5 0
## P003Cy3xP002Cy5 0
## P001Cy3xF004Cy5 1
To me this design matrix looks spurious. Generally I think of design
matrices as planes of zeroes and ones that either link an experimental
factor to the response (1
) or not (0
). However, in the design matrix
above I do not see an obvious relationship between the columns
(coefficients) and the rows (arrays). For example a value of 1 was
assigned to row 1 of column 1 (["P002Cy3xF001Cy5", "F003Cy3xF005Cy5"])
although there is no obvious connection between the two. Digging into
the modelMatrix()
function from the limma
package, I got stuck with
the following line of code:
zapsmall(t(solve(crossprod(parameters), crossprod(parameters,
J))), 14)
My understanding is that parameters
represents the connection between
all unique RNA sources (inbred lines) and the selected arrays that
represent the whole experiment. Moreover, it looks as if J
is an
extension of parameters
in a way that it represents the connection
between all RNA sources and all arrays of the experiment. When comparing
parameters
, J
and the output of modelMatrix()
to the examples for
design matrices in the limma
user manual, it seems to me as if the
design matrix should be rather J
and not the output of
modelMatrix()
.
It would be appreciated if you could fill this gap in my understanding!
- Correct the standard deviations and apply a moderated F-Test to
determine whether a gene was differentially expressed in any
comparison (as determined by the used coefficients) using
eBayes()
. - Correct p-values for multiple testing using the approach.
fit <- eBayes(fit)
fit[["F.p.value"]] <- p.adjust(fit[["F.p.value"]], method = "fdr",
n = length(fit[["F.p.value"]]))
Write the results to a file:
write.fit(fit, results = NULL, file = "exp-001.dta", digits = 5,
adjust = "BH", sep = "\t")
Read the results from a file:
exp.data <- read.table("exp-001.dta", header = TRUE, sep = "\t",
stringsAsFactors = FALSE)
exp.data <- exp.data[!is.na(exp.data$F.p.value), ]
attr(exp.data, "row.names") <- exp.data[, "Genes.ID"]
Select genes, which were differentially expressed on at least one array:
exp.dta <- exp.data[exp.data$F.p.value < 0.01, ]
Extract coefficients and Amean
:
exp.ratio <- exp.dta[, seq_along(lines)]
B <- t(exp.ratio)
# Obtain gene expression values for each inbred line and each mRNA by solving
# the equation a %*% b = b for x
x <- solve(A, B)
A**x = B
A * xlines × mRNAs = Bcoe**f × mRNAs
with Am, m where ...
- the columns represent
m
unique RNA sources (i.e. inbred lines). - the first row contains the frequency of inbred lines
1,...,m
within the entire experiment. - all following rows represent the unique set of coefficients of
length
m - 1
that connect the entire experiment. - the cell elements indicate whether
- the inbred line (column) was not part of the array (row): 0
- the inbred line was part of the array and labeled with Cy3: -1
- the inbred line was part of the array and labeled with Cy5: 1
R-objects for the generation of the design matrix.
# parameters
dput(parameters)
structure(c(0, 0, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0,
0, 0, 1, -1, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
-1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, -1, 0, 0, 0, 1, 0, -1, 0, 0
), .Dim = c(8L, 7L), .Dimnames = list(c("F001", "F002", "F003",
"F004", "F005", "P001", "P002", "P003"), c("F003Cy3xF005Cy5",
"P002Cy3xP001Cy5", "F004Cy3xF003Cy5", "F002Cy3xP003Cy5", "F005Cy3xF001Cy5",
"P003Cy3xP002Cy5", "P001Cy3xF004Cy5")))
# targets
dput(targets)
structure(list(FileName = structure(c(13L, 6L, 12L, 10L, 16L,
2L, 5L, 8L, 1L, 15L, 4L, 14L, 7L, 3L, 9L, 17L, 11L), .Label = c("F001Cy3xP002Cy5",
"F001Cy3xP003Cy5", "F002Cy3xP003Cy5", "F003Cy3xF005Cy5", "F003Cy3xP003Cy5",
"F004Cy3xF001Cy5", "F004Cy3xF003Cy5", "F004Cy3xP003Cy5", "F005Cy3xF001Cy5",
"F005Cy3xF004Cy5", "P001Cy3xF004Cy5", "P001Cy3xF005Cy5", "P002Cy3xF001Cy5",
"P002Cy3xP001Cy5", "P003Cy3xF002Cy5", "P003Cy3xF004Cy5", "P003Cy3xP002Cy5"
), class = "factor"), Cy3 = structure(c(7L, 4L, 6L, 5L, 8L, 1L,
3L, 4L, 1L, 8L, 3L, 7L, 4L, 2L, 5L, 8L, 6L), .Label = c("F001",
"F002", "F003", "F004", "F005", "P001", "P002", "P003"), class = "factor"),
Cy5 = structure(c(1L, 1L, 5L, 4L, 4L, 8L, 8L, 8L, 7L, 2L,
5L, 6L, 3L, 8L, 1L, 7L, 4L), .Label = c("F001", "F002", "F003",
"F004", "F005", "P001", "P002", "P003"), class = "factor")), .Names = c("FileName",
"Cy3", "Cy5"), row.names = c("P002Cy3xF001Cy5", "F004Cy3xF001Cy5",
"P001Cy3xF005Cy5", "F005Cy3xF004Cy5", "P003Cy3xF004Cy5", "F001Cy3xP003Cy5",
"F003Cy3xP003Cy5", "F004Cy3xP003Cy5", "F001Cy3xP002Cy5", "P003Cy3xF002Cy5",
"F003Cy3xF005Cy5", "P002Cy3xP001Cy5", "F004Cy3xF003Cy5", "F002Cy3xP003Cy5",
"F005Cy3xF001Cy5", "P003Cy3xP002Cy5", "P001Cy3xF004Cy5"), class = "data.frame")
# full_mat
dput(full_mat)
structure(c(1, 1, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0,
0, 0, 0, 0, -1, 0, 0, 0, -1, 0, 1, 0, 0, 0, 0, 0, -1, 0, 1, 1,
0, 0, -1, 0, 0, 0, 0, -1, 0, 0, 0, 1, 0, 0, 1, -1, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, -1, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 0, 0,
0, 1, 0, 0, 0, 0, 0, -1, 1, 1, 1, 0, -1, 0, 0, 0, 1, 0, -1, 0
), .Dim = c(17L, 8L), .Dimnames = list(c("P002Cy3xF001Cy5", "F004Cy3xF001Cy5",
"P001Cy3xF005Cy5", "F005Cy3xF004Cy5", "P003Cy3xF004Cy5", "F001Cy3xP003Cy5",
"F003Cy3xP003Cy5", "F004Cy3xP003Cy5", "F001Cy3xP002Cy5", "P003Cy3xF002Cy5",
"F003Cy3xF005Cy5", "P002Cy3xP001Cy5", "F004Cy3xF003Cy5", "F002Cy3xP003Cy5",
"F005Cy3xF001Cy5", "P003Cy3xP002Cy5", "P001Cy3xF004Cy5"), c("F001",
"F002", "F003", "F004", "F005", "P001", "P002", "P003")))
I applaud your attempt at comprehensiveness in posing a question, but in trying to be comprehensive, you have failed to give us much useful information!
Thanks for your helpful remarks! I have added information on the experimental design and the goal of our analysis.
1. We do not have any biological replicates, just technical replicates of RNA sources.
2. Our goal is to compute "adjusted" gene expression values for each RNA source.