Question

Multi-level RNAseq experimental design question

0

Entering edit mode

csmall • 0

@csmall-8497

Last seen 8.8 years ago

United States

Hi,

I’m working with an RNAseq study design similar to the multi-level experiment example from section 9.7 of the limma user guide. I have a 2-level factor called “Genotype” and a factor called “Family,” which is nested within Genotype (3 families of Genotype “F” and 2 families of Genotype “O”). I also have a 2-level factor called “Treatment.” I’m mainly interested in testing the effects of Treatment within the 2 Genotypes, and the Treatment-by-Genotype interaction, which I’ve managed to do following the example and treating Family as a random effect:

> Targets
Individual Genotype Family Treatment Cage
1 1A_02 F F1 c 1A
2 1A_03 F F1 c 1A
3 1A_04 F F1 c 1A
4 1A_05 F F1 c 1A
5 1A_08 F F1 c 1A
6 1A_10 F F1 c 1A
7 1D_01 F F1 g 1D
8 1D_02 F F1 g 1D
9 1D_04 F F1 g 1D
10 1D_06 F F1 g 1D
11 1D_07 F F1 g 1D
12 1D_08 F F1 g 1D
13 3A_01 F F3 c 3A
14 3A_02 F F3 c 3A
15 3A_04 F F3 c 3A
16 3A_06 F F3 c 3A
17 3A_08 F F3 c 3A
18 3A_10 F F3 c 3A
19 3C_01 F F3 g 3C
20 3C_03 F F3 g 3C
21 3C_04 F F3 g 3C
22 3C_08 F F3 g 3C
23 3C_09 F F3 g 3C
24 3C_10 F F3 g 3C
25 4B_04 F F4 c 4B
26 4B_06 F F4 c 4B
27 4B_07 F F4 c 4B
28 4B_08 F F4 c 4B
29 4B_09 F F4 c 4B
30 4B_10 F F4 c 4B
31 4C_03 F F4 c 4C
32 4C_04 F F4 c 4C
33 4C_05 F F4 c 4C
34 4C_07 F F4 c 4C
35 4C_08 F F4 c 4C
36 4C_14 F F4 c 4C
37 4D_01 F F4 g 4D
38 4D_05 F F4 g 4D
39 4D_07 F F4 g 4D
40 4D_08 F F4 g 4D
41 4D_09 F F4 g 4D
42 4D_13 F F4 g 4D
43 4F_01 F F4 g 4F
44 4F_02 F F4 g 4F
45 4F_03 F F4 g 4F
46 4F_04 F F4 g 4F
47 4F_06 F F4 g 4F
48 4F_08 F F4 g 4F
49 5A_01 O O5 c 5A
50 5A_02 O O5 c 5A
51 5A_03 O O5 c 5A
52 5A_04 O O5 c 5A
53 5A_05 O O5 c 5A
54 5A_08 O O5 c 5A
55 5B_01 O O5 c 5B
56 5B_02 O O5 c 5B
57 5B_04 O O5 c 5B
58 5B_06 O O5 c 5B
59 5B_08 O O5 c 5B
60 5B_10 O O5 c 5B
61 5C_01 O O5 g 5C
62 5C_02 O O5 g 5C
63 5C_03 O O5 g 5C
64 5C_05 O O5 g 5C
65 5C_07 O O5 g 5C
66 5C_09 O O5 g 5C
67 5D_01 O O5 g 5D
68 5D_04 O O5 g 5D
69 5D_05 O O5 g 5D
70 5D_07 O O5 g 5D
71 5D_09 O O5 g 5D
72 5D_11 O O5 g 5D
73 6A_01 O O6 c 6A
74 6A_02 O O6 c 6A
75 6A_04 O O6 c 6A
76 6A_05 O O6 c 6A
77 6A_07 O O6 c 6A
78 6A_10 O O6 c 6A
79 6C_01 O O6 g 6C
80 6C_03 O O6 g 6C
81 6C_04 O O6 g 6C
82 6C_06 O O6 g 6C
83 6C_09 O O6 g 6C
84 6C_14 O O6 g 6C

I set up the design as follows:

> GenTrt <- factor(paste(Targets$Genotype,Targets$Treatment,sep="."))

> design <- model.matrix(~0+GenTrt)

> colnames(design) <- levels(GenTrt)

> v <- voom(dge, design, plot=TRUE)

> corfit <- duplicateCorrelation(v,design,block=Targets$Family)

> corfit$consensus

[1] 0.08579732

I set up and tested the desired contrasts for the model as follows:

> fit <- lmFit(v,design,block=Targets$Family,correlation=corfit$consensus)

> cm <- makeContrasts(

+ cVSg_in_F = F.c-F.g,

+ cVSg_in_O = O.c-O.g,

+ GbyTint = (O.c-O.g)-(F.c-F.g),

+ levels=design)

> cm

Contrasts

Levels cVSg_in_F cVSg_in_O GbyTint

F.c 1 0 -1

F.g -1 0 1

O.c 0 1 1

O.g 0 -1 -1

> fit2 <- contrasts.fit(fit, cm)

> fit2 <- eBayes(fit2)

> topTable(fit2, coef="cVSg_in_F")

> topTable(fit2, coef="cVSg_in_O")

> topTable(fit2, coef="GbyTint")

There is, however, an additional level of complexity. Groups of individuals were raised together in cages (note the “Cage” factor in the Targets object). Originally we intended to have 2 replicate cages for every Family-Treatment combination, but only accomplished this for Family F4 from Genotype F and Family O5 from Genotype O.

Given that we do have replicate cages for both treatments in 2 of the families, I was hoping to include “Cage” as a random effect in the model to account for this potentially important source of variation. Does it make sense to try and accomplish this with limma, and if so, how might I go about it? Or, should I just settle for the analysis I’ve done, in which between-cage effects for a given treatment in Family F4 and in Family O5 are ignored?

I’d be grateful to get feedback on this design, and I thank anyone in advance for her/his time and interest.

Best,

Clay

limma multiple factor design • 935 views

ADD COMMENT • link updated 8.8 years ago by Aaron Lun ★ 28k • written 8.8 years ago by csmall • 0

score 2 · Answer 1 · 2015-07-28

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 9 hours ago

The city by the bay

It seems that each family is associated with one genotype, and has two sets of treatments. If so, you can just compare within each family to eliminate any family-specific effects on expression. I'd suggest a design like this:

> design <- model.matrix(~0 + Family + Treatment:Genotype, Targets)
> design <- design[,-c(6, 8)]
> colnames(design)
[1] "FamilyF1"             "FamilyF3"             "FamilyF4"            
[4] "FamilyO5"             "FamilyO6"             "Treatmentg:GenotypeF"
[7] "Treatmentg:GenotypeO"

The first five coefficients are family-specific terms that represent the log-expression in C-treated samples for each family. The 6th and 7th coefficients represent the log-fold change in G-treated samples over C-treated samples, averaged across families with the corresponding genotype. Dropping either will give you the effect of treatment in each genotype, and it's also easy to test for whether there are differential treatment effects between genotypes:

colnames(design)[6:7] <- c("G.in.F", "G.in.O")
con <- makeContrasts(G.in.O - G.in.F, levels=design)

So why bother doing it like this? Now that Family is in the design, we are free to supply Cage to duplicateCorrelation to block on the cage-specific effect. This approach is better than the two options you could have taken with the previous design matrix; either ignoring Cage completely, which would result in potential problems from cage-level correlations in the two relevant families; or blocking on Cage instead of Family in duplicateCorrelation, which would ignore the correlations between cages of the same family. Note that we can't put Cage in the design itself, as it is confounded with Treatment and would preclude estimation of the treatment effect.

ADD COMMENT • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Keep in mind that it's not a perfect model. The Family terms will absorb any differences in total expression of samples for each family, but will not accommodate differences in the treatment effect for each family. If the treatment effect in each family varies, the fitted values for each family/treatment combination may be consistently lower or higher than the observed expression values for the corresponding samples, i.e., the residuals will be correlated across all samples in the same combination. Using Cage in duplicateCorrelation will not properly account for this effect when multiple cages are present for each combination, as the residuals for each cage are treated as being independent (despite the cages being correlated within the same combination). I can't think of an obvious way to get around this, but hopefully, it shouldn't cause too much damage.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thank you very much for the suggestion and comments. Your points are well taken. As it turns out, correlation of residuals across samples in the same combination may in fact be an issue, as we have other data that suggest a family-by-treatment interaction, which is potentially more pronounced for one of the genotypes.

Thanks again for the help,

Clay

ADD REPLY • link 8.8 years ago csmall • 0

0

Entering edit mode

I should add that duplicateCorrelation should be able to deal with the situation where there's only one cage for each combination, as correlations between residuals within a single cage will be properly accounted for. It's only when there's several cages that we run into the problem described above. Of course, the GenTrt-based design in the original post does no better in this regard, so we're not losing anything by including Family into the design.

ADD REPLY • link 8.8 years ago Aaron Lun ★ 28k