Question: DESeq2 package: Error in checkFullRank(modelMatrix) when specifying a multi-factor design
7 months ago, iammiso0 wrote:

Hi all,

I apologize if this has been asked before. I have searched many questions and documents, but none of them answered my question.

I am having difficulty with the design argument of DESeqDataSetFromMatrix.

Rather than comparing on group (case vs. control) alone, we want to include additional variables in the comparison.

So I put multiple factors into the design formula. I expected that including them would change the results compared with using only the Group (case, control) information.

This is my design:

Group    Project_Id    Gender  Age_at_Diagnosis
Case     TCGA-A6-6654  female  65
Case     TCGA-AA-3972  male    72
Case     TCGA-G4-6311  male    80
Case     TCGA-AA-3667  female  36
Case     TCGA-F4-6854  female  77
Case     TCGA-A6-6650  female  69
Control  TCGA-AA-3520  female  86
Control  TCGA-A6-5659  male    82
Control  TCGA-AA-3517  male    60
Control  TCGA-A6-2685  female  48
Control  TCGA-AA-3518  female  81
Control  TCGA-AA-3697  male    77

Below is the code I have been trying to run.

> coldata.clinic$Group <- factor(coldata.clinic$Group, levels = c("Control","Case"))
> coldata.clinic$Project_Id <- factor(coldata.clinic$Project_Id)
> coldata.clinic$Gender <- factor(coldata.clinic$Gender)
> coldata.clinic$Age_at_Diagnosis <- factor(coldata.clinic$Age_at_Diagnosis)
> dds.clinic <- DESeqDataSetFromMatrix(countData = cts,
+                              colData = coldata.clinic,
+                              design = ~Group+Project_Id+Gender+Age_at_Diagnosis)
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  vignette('DESeq2')

The design includes several factors, but I only need to compare case and control; I do not need comparisons for the other factors (Project_Id, Gender, Age_at_Diagnosis). I just want the other factors to be taken into account when comparing cases and controls.

And one of these factors cannot be excluded.

How should I proceed in this case?

While searching for related problems, I saw designs in which the factors are joined with * instead of +. With that design, the above error does not occur.

> dds.clinic <- DESeqDataSetFromMatrix(countData = cts,
+                               colData = coldata.clinic,
+                               design = ~Group*Project_Id*Gender*Age_at_Diagnosis)

What is the difference between + and * in the design formula?

And which should I use for my analysis?

I have not been able to solve this for a long time, and I would appreciate comments from experts.
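On the + versus * question: in R's formula syntax, a + b adds the two main effects only, while a * b is shorthand for a + b + a:b, that is, the main effects plus their interaction. The following is a minimal base-R sketch on a toy sample sheet. The Group and Gender values are copied from the 12 rows shown above, and the variable names additive, crossed and extra are made up for illustration. It only prints the model-matrix columns and does not call DESeq2.

```r
# Toy sample sheet: Group and Gender for the 12 rows listed in the question.
coldata <- data.frame(
  Group  = factor(rep(c("Case", "Control"), each = 6),
                  levels = c("Control", "Case")),
  Gender = factor(c("female", "male", "male", "female", "female", "female",
                    "female", "male", "male", "female", "female", "male"))
)

# "+" : main effects only.
additive <- colnames(model.matrix(~ Group + Gender, data = coldata))

# "*" : main effects plus the Group-by-Gender interaction.
crossed <- colnames(model.matrix(~ Group * Gender, data = coldata))

# Columns that appear only in the interaction design.
extra <- setdiff(crossed, additive)

print(additive)
print(crossed)
print(extra)
```

With the additive design, the Group coefficient is the case-versus-control effect adjusted for the other terms, which matches the stated goal of letting the covariates "affect" the comparison without testing them. The interaction design instead lets the case-versus-control effect differ by level of the other variable, and chaining * across many factors multiplies the number of coefficients.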

Answer: DESeq2 package: Error in checkFullRank(modelMatrix) when specifying a multi-factor design
7 months ago, Michael Love (United States) wrote:

Project ID seems to be a unique value for every sample. You can’t include such a term in the design because it removes any sense of replication.
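The point about a per-sample ID can be checked in base R without DESeq2. This is a minimal sketch using only the 12 rows shown in the question. The helper name is_full_rank is hypothetical, and treating Age_at_Diagnosis as numeric rather than as a factor is an assumption made here for illustration; the original code converted it to a factor.

```r
# Toy sample sheet copied from the 12 rows shown in the question.
coldata <- data.frame(
  Group = factor(rep(c("Case", "Control"), each = 6),
                 levels = c("Control", "Case")),
  Project_Id = factor(c("TCGA-A6-6654", "TCGA-AA-3972", "TCGA-G4-6311",
                        "TCGA-AA-3667", "TCGA-F4-6854", "TCGA-A6-6650",
                        "TCGA-AA-3520", "TCGA-A6-5659", "TCGA-AA-3517",
                        "TCGA-A6-2685", "TCGA-AA-3518", "TCGA-AA-3697")),
  Gender = factor(c("female", "male", "male", "female", "female", "female",
                    "female", "male", "male", "female", "female", "male")),
  # Assumption: age kept numeric, not converted to a factor.
  Age_at_Diagnosis = c(65, 72, 80, 36, 77, 69, 86, 82, 60, 48, 81, 77)
)

# Hypothetical helper: TRUE when the design matrix has full column rank.
is_full_rank <- function(design, data) {
  mm <- model.matrix(design, data = data)
  qr(mm)$rank == ncol(mm)
}

# One Project_Id level per sample gives more columns than samples.
with_id <- is_full_rank(~ Group + Project_Id + Gender + Age_at_Diagnosis,
                        coldata)

# Dropping the per-sample ID leaves four columns for twelve samples.
without_id <- is_full_rank(~ Group + Gender + Age_at_Diagnosis, coldata)

print(c(with_id = with_id, without_id = without_id))
```

With one ID level per sample, the ID columns alone can reproduce any pattern across samples, including the Group column, so no samples are left to act as replicates for estimating the case-versus-control effect.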

Hi Michael,

Some Project_Id values are duplicated. Do I still have to remove it?

Also, not all of the design factors describe replication. Can I not include them simply so that they are taken into account when comparing cases and controls?

I don’t really have enough information to answer the question. What kind of replication is there? I assume the table above is not the complete table. How many samples are there? You can’t include this variable as you have it, and I don’t have enough information to say much more.

I'm sorry if my explanation was not clear.

I have 470 case samples and 40 control samples. The goal is to compare expression between case and control.

Because I have clinical information for the case and control samples, I would like to include variables such as TCGA_id, gender, and age at diagnosis so that they are accounted for in the comparison of expression between case and control.
(Most TCGA_id values are distinct, but some are shared between samples.)

This information does not describe samples from a replication experiment.

I hope this is enough explanation for you.

I can’t answer the question without knowing what the nature of the replicates is.

By the way, with 400+ samples, I tend to use limma for such analyses, because it is much faster than a GLM approach. You will still have to figure out how to deal with the replicates, either collapsing or using the duplicateCorrelation function.
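Of the two options mentioned, collapsing can be sketched in base R by summing the count columns that share an ID. Everything below is hypothetical toy data (the gene names, sample names, patient labels and counts are invented). Whether summing is appropriate depends on the nature of the replicates, which this thread leaves open: it is usually reserved for technical replicates of the same library. DESeq2 also provides a collapseReplicates function for this purpose.

```r
# Hypothetical toy counts: 3 genes x 4 samples; s1 and s2 share patient P1.
cts <- matrix(c(10, 0, 5,
                12, 1, 4,
                 7, 3, 9,
                20, 2, 6),
              nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("s1", "s2", "s3", "s4")))
patient <- c("P1", "P1", "P2", "P3")

# rowsum() adds rows by group, so transpose, sum per patient, transpose back.
collapsed <- t(rowsum(t(cts), group = patient))

print(collapsed)
```

After collapsing there is one column per patient, so the sample sheet must be reduced to one row per patient as well before building the design.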