Question

Correct formulation of contrasts of interest regarding an RNA-Seq gene expression analysis with edgeR

fotakisg • 0

Dear Bioconductor Support Site,

I'm currently working with an RNA-Seq dataset consisting of 84 samples (the Collaborative Cross mouse genetic panel, with normal, diabetic and obese phenotypes). The code I have used for the analysis so far is the following:

head(sampleInfo) # the phenotype data information

  SampleName Sex Obesity Diabetes
1      DT_77   M   Obese Diabetic
2      DT_39   F   Obese   Normal
3     DT_145   M  Normal   Normal
4     DT_130   F   Obese   Normal
5     DT_103   M  Normal   Normal
6     DT_124   M   Obese Diabetic

Next, after creating the DGEList object, filtering and normalization, I considered two separate approaches for constructing the design matrix. The aim is to find DE genes for the obesity phenotype while taking diabetes into account and blocking on sex as a confounder:

1) First Approach

# Create the design matrix
condition1.group <- as.factor(dge$samples$Obesity)
condition2.group <- as.factor(dge$samples$Diabetes)
sex.group <- as.factor(dge$samples$Sex)
design <- model.matrix(~condition1.group : condition2.group + sex.group)

y3 <- estimateDisp(dge, design, robust=TRUE)

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  :
  Design matrix not of full rank.  The following coefficients not estimable:
 condition1.groupObese:condition2.groupNormal
 head(design)
  (Intercept) sex.groupM condition1.groupNormal:condition2.groupDiabetic condition1.groupObese:condition2.groupDiabetic
1           1          1                                               0                                              1
2           1          0                                               0                                              0
3           1          1                                               0                                              0
4           1          0                                               0                                              0
5           1          1                                               0                                              0
6           1          1                                               0                                              1
  condition1.groupNormal:condition2.groupNormal condition1.groupObese:condition2.groupNormal
1                                             0                                            0
2                                             0                                            1
3                                             1                                            0
4                                             0                                            1
5                                             1                                            0
6                                             0                                            0

fit <- glmQLFit(y3, design, robust=TRUE)

fit2 <- glmQLFTest(fit, coef=2)

2) Second Approach

mult_con <- factor(paste(condition1.group, condition2.group, sep = "."))
sample2 <- cbind(sampleInfo, Group=mult_con)  # sampleInfo as shown by head() above
design <- model.matrix(~0+sample2$Group + sex.group)

head(design)
  sample2$GroupNormal.Diabetic sample2$GroupNormal.Normal sample2$GroupObese.Diabetic sample2$GroupObese.Normal sex.groupM
1                            0                          0                           1                         0          1
2                            0                          0                           0                         1          0
3                            0                          1                           0                         0          1
4                            0                          0                           0                         1          0
5                            0                          1                           0                         0          1
6                            0                          0                           1                         0          1

my.contrasts <- makeContrasts(OvsN = sample2$GroupObese.Normal- sample2$GroupNormal.Normal, ONvsOD = sample2$GroupObese.Normal-sample2$GroupObese.Diabetic, levels=design)

# Obese_Normal vs Obese_Diabetic example

qlf.ON.OD <- glmQLFTest(fit, contrast=my.contrasts[,"ONvsOD"])
de.ON.OD <- topTags(qlf.ON.OD,n=nrow(qlf.ON.OD),adjust.method="BH", p.value=0.05)
dat.ON.OD <- de.ON.OD$table

Thus, my questions are the following:

A) 1) For the first approach: regarding the initial full-rank error, should I manually remove coefficients that are not of interest? For example, the aforementioned "condition1.groupObese:condition2.groupNormal"?

2) Moreover, is the formulation and interpretation of the interaction approach correct? The coefficient condition1.groupNormal:condition2.groupDiabetic essentially compares the Normal samples from the Obesity group with the Diabetic samples from the Diabetes group, correct?

B) Overall, which of the two approaches should I use? Should I expect significant differences between the two approaches for the same comparisons?

Thank you in advance, and excuse my long post!

rnaseq edger design and contrast matrix multifactorial design • 831 views

updated 5.6 years ago by Aaron Lun ★ 28k • written 5.6 years ago by fotakisg • 0

score 2 · Accepted Answer · 2018-09-19

Your second model parametrization is much easier to use and interpret. The first four coefficients represent the average log-expression within each obesity-diabetes combination, while the last coefficient represents the log-fold change of males over females. You can then simply test for pairwise differences between combinations:

# Paraphrasing the column names to keep things simple:
con <- makeContrasts(GroupNormal.Diabetic - GroupNormal.Normal, levels=design)
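A note on the paraphrased names: the contrast above assumes columns called GroupNormal.Diabetic and so on, whereas the design in the question has names like sample2$GroupNormal.Diabetic. Those are not syntactically valid R names, and as far as I know makeContrasts() refuses such levels. Below is a minimal base-R sketch of two ways to end up with clean names. The toy table and the variable names Group and Sex are hypothetical stand-ins, not the asker's real objects.

```r
# Hypothetical toy phenotype table: every Sex x Obesity x Diabetes combination once.
toy <- expand.grid(Sex = c("F", "M"),
                   Obesity = c("Normal", "Obese"),
                   Diabetes = c("Normal", "Diabetic"),
                   stringsAsFactors = FALSE)

# Option 1: keep the factors as standalone variables, so that model.matrix()
# does not prefix the column names with "sample2$".
Group <- factor(paste(toy$Obesity, toy$Diabetes, sep = "."))
Sex <- factor(toy$Sex)
design <- model.matrix(~0 + Group + Sex)
colnames(design)

# Option 2: repair the names of an existing design by stripping the prefix.
bad <- c("sample2$GroupObese.Normal", "sex.groupM")
fixed <- sub("sample2$", "", bad, fixed = TRUE)

# Syntactically valid names are left unchanged by make.names().
valid_design <- all(colnames(design) == make.names(colnames(design)))
valid_bad <- all(bad == make.names(bad))
```

With either option the column names can be typed directly inside makeContrasts(), which is what the paraphrased snippets in this answer rely on.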

... or if you want to see the average effect of obesity:

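# As written, this is the Normal average minus the Obese average;
# swap the two terms to get log-fold changes of Obese over Normal.
con <- makeContrasts((GroupNormal.Diabetic + GroupNormal.Normal)/2
    - (GroupObese.Diabetic + GroupObese.Normal)/2, levels=design)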

... or you can test the obesity effect separately within the diabetic and the normal mice, and intersect the DE lists.
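For the "test separately and intersect" option, here is a rough sketch of how it could look. It assumes edgeR is installed and runs on simulated counts with no real signal, so the gene lists will most likely be empty. The toy data, the contrast names and the 0.05 FDR cut-off are illustrative choices only. I have left out robust=TRUE to keep the sketch light; it can be added back as in the question.

```r
library(edgeR)  # also attaches limma, which provides makeContrasts()

set.seed(1)
# Hypothetical toy layout: 2 sexes x 2 obesity x 2 diabetes states, 3 replicates each.
toy <- expand.grid(Rep = 1:3,
                   Sex = c("F", "M"),
                   Obesity = c("Normal", "Obese"),
                   Diabetes = c("Normal", "Diabetic"),
                   stringsAsFactors = FALSE)
Group <- factor(paste(toy$Obesity, toy$Diabetes, sep = "."))
Sex <- factor(toy$Sex)

# Simulated null counts for 500 genes (no true differential expression).
counts <- matrix(rnbinom(500 * nrow(toy), mu = 50, size = 10), nrow = 500,
                 dimnames = list(paste0("gene", 1:500), NULL))

dge <- DGEList(counts)
dge <- calcNormFactors(dge)
design <- model.matrix(~0 + Group + Sex)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)

# Obesity effect within each diabetes state.
con <- makeContrasts(
    ObesityInNormal   = GroupObese.Normal   - GroupNormal.Normal,
    ObesityInDiabetic = GroupObese.Diabetic - GroupNormal.Diabetic,
    levels = design)

# Helper (hypothetical name): gene IDs below the FDR cut-off for one contrast.
de_genes <- function(fit, contrast, fdr = 0.05) {
    tab <- topTags(glmQLFTest(fit, contrast = contrast), n = Inf)$table
    rownames(tab)[tab$FDR < fdr]
}

de.normal   <- de_genes(fit, con[, "ObesityInNormal"])
de.diabetic <- de_genes(fit, con[, "ObesityInDiabetic"])

# Genes called DE for obesity in both diabetes states.
de.both <- intersect(de.normal, de.diabetic)
```

Keep in mind that intersecting is conservative, since a gene has to pass the threshold in both comparisons, and you would probably also want to check that the log-fold changes go in the same direction.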

I won't bother trying to figure out what the coefficients mean in your first approach, as your second approach is much more usable. I will just mention that, if you were to remove the column corresponding to the unestimable coefficient, the first design matrix would be mathematically equivalent to that in your second approach. That is, I could get the second design matrix from linear combinations of the columns in the first design matrix. However, the second design matrix is much easier to interpret, so you might as well start with that.
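If you want to convince yourself of the equivalence, the following base-R check may help. It is only a sketch on a hypothetical toy layout (the variable names are mine, not the asker's), but the rank argument is the same for the real data as long as all four obesity-diabetes combinations are present.

```r
# Hypothetical toy layout: every Sex x Obesity x Diabetes combination, twice.
toy <- expand.grid(Rep = 1:2,
                   Sex = c("F", "M"),
                   Obesity = c("Normal", "Obese"),
                   Diabetes = c("Normal", "Diabetic"),
                   stringsAsFactors = FALSE)
Obesity <- factor(toy$Obesity)
Diabetes <- factor(toy$Diabetes)
Sex <- factor(toy$Sex)
Group <- factor(paste(toy$Obesity, toy$Diabetes, sep = "."))

# First approach: intercept + sex + four interaction indicator columns.
# The four indicators sum to the intercept column, hence the rank deficiency.
design1 <- model.matrix(~ Obesity:Diabetes + Sex)
rank1 <- qr(design1)$rank            # one less than ncol(design1)

# Second approach: one column per combination, plus sex.
design2 <- model.matrix(~0 + Group + Sex)
rank2 <- qr(design2)$rank            # full column rank

# Drop the last (unestimable) column of the first design. If stacking the
# two designs side by side does not raise the rank, they span the same
# column space, i.e. they describe the same model.
design1.reduced <- design1[, -ncol(design1)]
rank.combined <- qr(cbind(design1.reduced, design2))$rank
```

So dropping the column by hand would give the same fit, just with coefficients that are harder to read.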