Search
Question: limma+voom question for effect across multiple tissues
0
4.1 years ago by
United States
artvandelay0 wrote:

I have scoured the forums and the manual, but the specific answer to my question still eludes me.  Apologies if this is a repeated question.

What I have: RNA-Seq count data from various tissues from three biological replicates in two species.  Certain tissue types in one of the species are expected to exhibit particular expression differences.  The two groups are denoted by the variable name 'tissue state' (see table below).

What I want: A list of genes that are specifically differentially expressed between the two 'tissue state' groups, while controlling for both variation within each sample 'group', and between the 'group's that are nested within each 'tissue state'.

I have tried to work through some examples and pull together a script that addresses, but I'm still not sure I'm specifying the right combination of "design" and "block".  I am using the "double-voom" strategy recommended in the old mailing list forums.

Any advice would be appreciated.  (my script is posted below)

Thanks,

Art V.

sample   group   tissuestate individual  species
A1       A       1           1           1
A2       A       1           2           1
A3       A       1           3           1
B1       B       1           1           1
B2       B       1           2           1
B3       B       1           3           1
C1       C       1           1           1
C2       C       1           2           1
C3       C       1           3           1
D1       D       0           1           1
D2       D       0           2           1
D3       D       0           3           1
E1       E       0           4           0
E2       E       0           5           0
E3       E       0           6           0
F1       F       0           4           0
F2       F       0           5           0
F3       F       0           6           0



Here is my current working script:

library("limma")
library("statmod")
library("edgeR")
args <- commandArgs()
rownames(y) <- y[,1]
y <- y[,-1]
y <- DGEList(counts=y, group=metadata$group) keep <- rowSums(cpm(y)>1) >= 3 y <- y[keep,] y <- calcNormFactors(y, method="TMM") design <- model.matrix(~metadata$tissuestate, data=metadata)
v.data <- voom(y, design)
v.data$genes = rownames(y) block=metadata$group
corfit <- duplicateCorrelation(v.data,design,block=block)
v.data <- voom(y,design,block=block,correlation=corfit$consensus) v.fitlimma = lmFit(v.data, design,block=block,correlation=corfit$consensus)
v.fitbayes = eBayes(v.fitlimma)
v.pvalues = v.fitbayes\$p.value[,2]
write.table(toptable(v.fitbayes, coef=2, number=length(v.fitbayes), sort.by="p"), file=OutputFile, sep="\t")

Edit: the metadata table was confusing, added some information and renamed columns

modified 4.1 years ago by Aaron Lun21k • written 4.1 years ago by artvandelay0

Your column names in metadata are a little confusing; it's easy to get group and condition mixed up during a more general discussion of groups and conditions. I assume condition refers to the two species of interest, whereas group refers to the various tissue types.

Thanks for the feedback, I have updated the table to be hopefully less confusing.  The particular "tissue state" (a.k.a. condition) we are interested in occurs in only SOME of the tissue types in only ONE of the species.

That's better. Still, what does the group factor actually represent? For example, is it some kind of batch variable?

"group" are biological replicates of the same tissue type taken from three separate individuals.  I basically want to make sure that we are controlling for the fact that three individuals in a common environment with tissues collected at the same approximate stage SHOULD have correlated expression.  Without some kind of nested model or blocking all the individual samples on each side of "tissuestate" would be treated as completely independent.

1
4.1 years ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

There's a philosophical issue with your current analysis, in that you're comparing between tissue states of different tissue types. It's hard to judge without knowing exactly what tissuestate refers to, but for the time being, I'll pretend it corresponds to diseased versus normal. It makes sense to compare, e.g., normal heart to normal liver, or diseased heart to normal heart, but the comparison between diseased heart and normal liver is difficult to interpret.

To me, it seems like the tissue type is an important explanatory factor that shouldn't be ignored. Sure, duplicateCorrelation will account for the correlations between samples of the same tissue type, but that just hides the philosophical problem of interpreting the contrast. You also make some assumptions regarding the similarity of the correlations across different genes when you use this function. This can be problematic for heteroskedastic data where these assumptions fail.

I think that your best bet is to include your tissue type (i.e., group) in the design matrix. You can then specify your contrasts to get DE genes between tissue states. For example, if you do want to compare between tissue states 0 and 1 in species 1, you can test for D - (A + B + C)/3. Alternatively, you can do pairwise comparisons between D and each of A, B and C, and intersect the resulting lists. In both cases, whether the contrasts are sensible or not still depends on the biology of tissuestate. But with this approach, at least, the sensibility (or lack thereof) is made explicit in the specification of the contrast. It also avoids the need to use duplicateCorrelation as the group factor is now in the design matrix.

In addition, many of your samples are drawn from the same individual so you should also account for individual-specific effects. This can be done by including individual in the design matrix, like so (the second line just gets rid of an unestimable coefficient):

design <- model.matrix(~0 + group + factor(individual), metadata)
design <- design[,-match("factor(individual)4", colnames(design))]

This will probably improve the accuracy of your fitted model, while still allowing you to compare between tissue types/states in group. I should also add that the group factor absorbs tissuestate and species, so there's no need to specify the latter two explicitly in design.