limma+voom question for effect across multiple tissues
1
1
Entering edit mode
artvandelay ▴ 10
@artvandelay-6985
Last seen 9.3 years ago
United States

I have scoured the forums and the manual, but the specific answer to my question still eludes me.  Apologies if this is a repeated question.

What I have: RNA-Seq count data from various tissues from three biological replicates in two species.  Certain tissue types in one of the species are expected to exhibit particular expression differences.  The two groups are denoted by the variable name 'tissue state' (see table below).

What I want: A list of genes that are specifically differentially expressed between the two 'tissue state' groups, while controlling for both variation within each sample 'group', and between the 'group's that are nested within each 'tissue state'.  

I have tried to work through some examples and pull together a script that addresses, but I'm still not sure I'm specifying the right combination of "design" and "block".  I am using the "double-voom" strategy recommended in the old mailing list forums.

Any advice would be appreciated.  (my script is posted below)

Thanks,

Art V.


The metadata file is:

sample   group   tissuestate individual  species
A1       A       1           1           1
A2       A       1           2           1
A3       A       1           3           1
B1       B       1           1           1
B2       B       1           2           1
B3       B       1           3           1
C1       C       1           1           1
C2       C       1           2           1
C3       C       1           3           1
D1       D       0           1           1
D2       D       0           2           1
D3       D       0           3           1 
E1       E       0           4           0
E2       E       0           5           0
E3       E       0           6           0
F1       F       0           4           0
F2       F       0           5           0
F3       F       0           6           0

 

Here is my current working script:

library("limma")
library("statmod")
library("edgeR")
args <- commandArgs()
y <- read.table(CountFile, header=T, sep="\t", stringsAsFactors=F)
rownames(y) <- y[,1]
y <- y[,-1]
metadata <- read.table(MetadataFile, header=T, sep="\t")
y <- DGEList(counts=y, group=metadata$group)
keep <- rowSums(cpm(y)>1) >= 3
y <- y[keep,]
y <- calcNormFactors(y, method="TMM")
design <- model.matrix(~metadata$tissuestate, data=metadata)
v.data <- voom(y, design)
v.data$genes = rownames(y)
block=metadata$group
corfit <- duplicateCorrelation(v.data,design,block=block)
v.data <- voom(y,design,block=block,correlation=corfit$consensus)
v.fitlimma = lmFit(v.data, design,block=block,correlation=corfit$consensus)
v.fitbayes = eBayes(v.fitlimma)
v.pvalues = v.fitbayes$p.value[,2]
v.adjpvalues = p.adjust(v.pvalues, method="BH")
write.table(toptable(v.fitbayes, coef=2, number=length(v.fitbayes), sort.by="p"), file=OutputFile, sep="\t")

 

Edit: the metadata table was confusing, added some information and renamed columns

 

 

limma voom • 3.3k views
ADD COMMENT
0
Entering edit mode

Your column names in metadata are a little confusing; it's easy to get group and condition mixed up during a more general discussion of groups and conditions. I assume condition refers to the two species of interest, whereas group refers to the various tissue types.

ADD REPLY
0
Entering edit mode

Thanks for the feedback, I have updated the table to be hopefully less confusing.  The particular "tissue state" (a.k.a. condition) we are interested in occurs in only SOME of the tissue types in only ONE of the species.

ADD REPLY
0
Entering edit mode

That's better. Still, what does the group factor actually represent? For example, is it some kind of batch variable?

ADD REPLY
0
Entering edit mode

"group" are biological replicates of the same tissue type taken from three separate individuals.  I basically want to make sure that we are controlling for the fact that three individuals in a common environment with tissues collected at the same approximate stage SHOULD have correlated expression.  Without some kind of nested model or blocking all the individual samples on each side of "tissuestate" would be treated as completely independent.

ADD REPLY
0
Entering edit mode

Among the products mentioned on the list, which one is the best? https://knowallthe.com/difference-between-hiking-and-work-boots/

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 17 hours ago
The city by the bay

There's a philosophical issue with your current analysis, in that you're comparing between tissue states of different tissue types. It's hard to judge without knowing exactly what tissuestate refers to, but for the time being, I'll pretend it corresponds to diseased versus normal. It makes sense to compare, e.g., normal heart to normal liver, or diseased heart to normal heart, but the comparison between diseased heart and normal liver is difficult to interpret.

To me, it seems like the tissue type is an important explanatory factor that shouldn't be ignored. Sure, duplicateCorrelation will account for the correlations between samples of the same tissue type, but that just hides the philosophical problem of interpreting the contrast. You also make some assumptions regarding the similarity of the correlations across different genes when you use this function. This can be problematic for heteroskedastic data where these assumptions fail.

I think that your best bet is to include your tissue type (i.e., group) in the design matrix. You can then specify your contrasts to get DE genes between tissue states. For example, if you do want to compare between tissue states 0 and 1 in species 1, you can test for D - (A + B + C)/3. Alternatively, you can do pairwise comparisons between D and each of A, B and C, and intersect the resulting lists. In both cases, whether the contrasts are sensible or not still depends on the biology of tissuestate. But with this approach, at least, the sensibility (or lack thereof) is made explicit in the specification of the contrast. It also avoids the need to use duplicateCorrelation as the group factor is now in the design matrix.

In addition, many of your samples are drawn from the same individual so you should also account for individual-specific effects. This can be done by including individual in the design matrix, like so (the second line just gets rid of an unestimable coefficient):

design <- model.matrix(~0 + group + factor(individual), metadata)
design <- design[,-match("factor(individual)4", colnames(design))]

This will probably improve the accuracy of your fitted model, while still allowing you to compare between tissue types/states in group. I should also add that the group factor absorbs tissuestate and species, so there's no need to specify the latter two explicitly in design.

ADD COMMENT
0
Entering edit mode

Thanks the taking the time to give this so much thought and very helpful response. I will definitely try some of these suggestions and let you know how it goes.

ADD REPLY

Login before adding your answer.

Traffic: 972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6