Search
Question: Correct formulation of contrasts of interest regarding an RNA-Seq gene expression analysis with edgeR
0
5 weeks ago by
fotakisg0
fotakisg0 wrote:

Dear Bioconductor Support Site,

I'm currently working with an RNA-Seq dataset, which consists of 84 samples(Collaborative Cross mice genetic panel of normal, diabetic and obese phenotypes).  My code for the analysis used so far is the following:

head(sampleInfo) # the phenotype data information

SampleName Sex Obesity Diabetes 1      DT_77   M   Obese Diabetic 2      DT_39   F   Obese   Normal 3     DT_145   M  Normal   Normal 4     DT_130   F   Obese   Normal 5     DT_103   M  Normal   Normal 6     DT_124   M   Obese Diabetic

Next, after the creation of DGEList object/filtering/normalization, for the construction of the design matrix, I considered two separate approaches,to find DE genes concerning the obesity phenotype, but taking into account diabetes, while blocking on the sex confounder:

1) First Approach

# Create the design matrix
condition1.group <- as.factor(dge$samples$Obesity)
condition2.group <- as.factor(dge$samples$Diabetes)
sex.group <- as.factor(dge$samples$Sex)
design <- model.matrix(~condition1.group : condition2.group + sex.group)

y3 <- estimateDisp(dge, design, robust=TRUE)

Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  :
Design matrix not of full rank.  The following coefficients not estimable:
condition1.groupObese:condition2.groupNormal
(Intercept) sex.groupM condition1.groupNormal:condition2.groupDiabetic condition1.groupObese:condition2.groupDiabetic
1           1          1                                               0                                              1
2           1          0                                               0                                              0
3           1          1                                               0                                              0
4           1          0                                               0                                              0
5           1          1                                               0                                              0
6           1          1                                               0                                              1
condition1.groupNormal:condition2.groupNormal condition1.groupObese:condition2.groupNormal
1                                             0                                            0
2                                             0                                            1
3                                             1                                            0
4                                             0                                            1
5                                             1                                            0
6                                             0                                            0

fit <- glmQLFit(y3, design, robust=TRUE)

fit2 <- glmQLFTest(fit, coef=2)

2) Second Approach

mult_con <- factor(paste(condition1.group, condition2.group, sep = "."))
sample2 <- cbind(sampleinfo,Group=mult_con)
design <- model.matrix(~0+sample2$Group + sex.group) head(design) sample2$GroupNormal.Diabetic sample2$GroupNormal.Normal sample2$GroupObese.Diabetic sample2$GroupObese.Normal sex.groupM 1 0 0 1 0 1 2 0 0 0 1 0 3 0 1 0 0 1 4 0 0 0 1 0 5 0 1 0 0 1 6 0 0 1 0 1 my.contrasts <- makeContrasts(OvsN = sample2$GroupObese.Normal- sample2$GroupNormal.Normal, ONvsOD = sample2$GroupObese.Normal-sample2$GroupObese.Diabetic, levels=design) # Obese_Normal vs Obese_Diabetic example qlf.ON.OD <- glmQLFTest(fit, contrast=my.contrasts[,"ONvsOD"]) de.ON.OD <- topTags(qlf.ON.OD,n=nrow(qlf.ON.OD),adjust.method="BH", p.value=0.05) dat.ON.OD <- de.ON.OD$table

Thus, my questions are the following;

A) 1) For the 1rst approach: regarding the initial error of full Rank, should I remove manually some coefficients that are not of "interest " ? For example, the aforementioned "condition1.groupObese:condition2.groupNormal" ?

2) Moreover, is the formulation and interpretation of the interaction approach correct ? The coefficient condition1.groupNormal:condition2.groupDiabetic , essentially compares the Normal samples from the Obesity group, with the Diabetic samples from the Diabetes group, correct ?

B) Overall, which of the two approaches should I consider ? Should I expect significant differences between the two approaches for the same comparisons ?

Thank you in advance and excuse me for my long post !!

modified 4 weeks ago by Aaron Lun21k • written 5 weeks ago by fotakisg0
2
4 weeks ago by
Aaron Lun21k
Cambridge, United Kingdom
Aaron Lun21k wrote:

Your second model parametrization is much easier to use and interpret. The first four coefficients represent the average log-expression within each obesity-disease combination, while the last coefficient represents the log-fold change of males over females. You can then simply test for pairwise differences between combinations:

# Paraphrasing the column names to keep things simple:
con <- makeContrasts(GroupNormal.Diabetic - GroupNormal.Normal, levels=design)

... or if you want to see the average effect of obesity:

con <- makeContrasts((GroupNormal.Diabetic + GroupNormal.Normal)/2
- (GroupObese.Diabetic + GroupObese.Normal)/2, levels=design)

... or you can test the obesity effect separately between disease and normal patients, and intersect the DE lists.

I won't bother trying to figure out what the coefficients mean in your first approach, as your second approach is much more usable. I will just mention that, if you were to remove the column corresponding to the unestimable coefficient, the first design matrix would be mathematically equivalent to that in your second approach. That is, I could get the second design matrix from linear combinations of the columns in the first design matrix. However, the second design matrix is much easier to interpret, so you might as well start with that.

Dear Aaron,

Thank you for your reply and your input, especially for the average effect of obesity between patient and normal samples.

Best regards,
Fotakis Georgios