Question

DESeq2 package : Error in checkFullRank(modelMatrix) while designing multi factor

iammiso • 0

@iammiso-18808

Last seen 6.0 years ago

Hi all,

I apologize if this has been asked many times before. I have searched through many questions and documents, but none of them answered my question.

I am having difficulty with the design formula in DESeqDataSetFromMatrix.

Rather than comparing on group (case vs. control) information alone, we want to include additional variables in the comparison.

So I put multiple factors into the design formula. I expected that including them would change the results compared with using Group (case, control) alone.

This is my sample table:

Group    Project_Id    Gender  Age_at_Diagnosis
Case     TCGA-A6-6654  female  65
Case     TCGA-AA-3972  male    72
Case     TCGA-G4-6311  male    80
Case     TCGA-AA-3667  female  36
Case     TCGA-F4-6854  female  77
Case     TCGA-A6-6650  female  69
Control  TCGA-AA-3520  female  86
Control  TCGA-A6-5659  male    82
Control  TCGA-AA-3517  male    60
Control  TCGA-A6-2685  female  48
Control  TCGA-AA-3518  female  81
Control  TCGA-AA-3697  male    77

Below is the code I have been trying to run.

> coldata.clinic$Group <- factor(coldata.clinic$Group, levels = c("Control","Case"))
> coldata.clinic$Project_Id <- factor(coldata.clinic$Project_Id)
> coldata.clinic$Gender <- factor(coldata.clinic$Gender)
> coldata.clinic$Age_at_Diagnosis <- factor(coldata.clinic$Age_at_Diagnosis)
> dds.clinic <- DESeqDataSetFromMatrix(countData = cts,
+                                      colData = coldata.clinic,
+                                      design = ~Group+Project_Id+Gender+Age_at_Diagnosis)
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

  Please read the vignette section 'Model matrix not full rank':

  vignette('DESeq2')

The design includes several factors, but I only need to compare case and control; I do not want to compare the other factors (Project_Id, Gender, Age_at_Diagnosis) themselves. I just want them to be taken into account when comparing cases and controls.

And one of these factors cannot be excluded.

How should I proceed in this case?

I searched for related problems and saw designs in which the factors are joined with * instead of +. With such a design, the above error does not occur.

> dds.clinic <- DESeqDataSetFromMatrix(countData = cts,
+                                      colData = coldata.clinic,
+                                      design = ~Group*Project_Id*Gender*Age_at_Diagnosis)

What is the difference between + and *?

And what should I use for my analysis?

I have not been able to solve this for a long time, and I would appreciate comments from experts.
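On the + versus * question: in R's formula syntax, `A + B` asks for the main effects of A and B only, while `A * B` is shorthand for `A + B + A:B`, that is, the main effects plus their interaction. A minimal sketch using only base R (no DESeq2 required) that lists the terms each formula expands to; the two-variable formulas are illustrative, the four-variable one is the design from the question:

```r
# List the terms that each formula expands to (base R only).
additive_terms    <- attr(terms(~ Group + Gender), "term.labels")
interaction_terms <- attr(terms(~ Group * Gender), "term.labels")

print(additive_terms)     # main effects only
print(interaction_terms)  # main effects plus the Group:Gender interaction

# The four-way '*' design from the question expands to every main effect
# and every two-, three-, and four-way interaction among the variables.
four_way_terms <- attr(
  terms(~ Group * Project_Id * Gender * Age_at_Diagnosis),
  "term.labels"
)
print(length(four_way_terms))
```

Because `*` only adds interaction terms on top of the same main effects, it asks for more coefficients than `+`, not fewer, so it is unlikely to be a remedy for a design that already has too many terms for the number of samples.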

DESeq2 multiple factor design design model design matrix rna-seq • 2.0k views

written 6.0 years ago by iammiso • updated 6.0 years ago by Michael Love

score 0 · Answer 1 · 2018-12-13

Michael Love 43k

@mikelove

Last seen 6 days ago

United States

Project ID seems to be a unique value for every sample. You can’t include such a term in the design because it removes any sense of replication.
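The point can be checked without DESeq2 by building the model matrix directly. The sketch below uses only the 12 rows shown in the question (not the full data set) and base R; `is_full_rank` is a hypothetical helper written for this illustration, and the reduced design at the end is just one possible alternative, not a recommendation made in this thread.

```r
# Sample table copied from the 12 rows shown in the question.
coldata <- data.frame(
  Group = factor(rep(c("Case", "Control"), each = 6),
                 levels = c("Control", "Case")),
  Project_Id = factor(c(
    "TCGA-A6-6654", "TCGA-AA-3972", "TCGA-G4-6311",
    "TCGA-AA-3667", "TCGA-F4-6854", "TCGA-A6-6650",
    "TCGA-AA-3520", "TCGA-A6-5659", "TCGA-AA-3517",
    "TCGA-A6-2685", "TCGA-AA-3518", "TCGA-AA-3697"
  )),
  Gender = factor(c("female", "male", "male", "female", "female", "female",
                    "female", "male", "male", "female", "female", "male")),
  Age_at_Diagnosis = c(65, 72, 80, 36, 77, 69, 86, 82, 60, 48, 81, 77)
)

# Hypothetical helper: TRUE when the model matrix has as many linearly
# independent columns as it has columns in total.
is_full_rank <- function(design, data) {
  mm <- model.matrix(design, data)
  qr(mm)$rank == ncol(mm)
}

# Design from the question: one level per sample for Project_Id, and age
# treated as a factor, so there are more coefficients than samples.
full_ok <- is_full_rank(
  ~ Group + Project_Id + Gender + factor(Age_at_Diagnosis), coldata
)

# One possible alternative (an assumption): drop Project_Id and keep age
# as a numeric covariate.
reduced_ok <- is_full_rank(~ Group + Gender + Age_at_Diagnosis, coldata)

print(c(full = full_ok, reduced = reduced_ok))
```

With a per-sample identifier in the formula, every sample gets its own coefficient, which is what removes any sense of replication.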

written 6.0 years ago by Michael Love

Hi Michael,

Some of the Project_Id values are duplicated. Do I still have to remove it?

Also, not all of the design factors relate to replication. Is there no way to have the other factors taken into account when comparing cases and controls?

written 6.0 years ago by iammiso

I don’t really have enough information to answer the question. What kind of replication is there? I assume the table above is not the complete table. How many samples are there? You can’t include this variable as you have it, and I don’t have enough information to say much more.

written 6.0 years ago by Michael Love

I'm sorry if my explanation was not clear.

I have 470 case samples and 40 control samples. The goal is to compare expression between case and control.

Since I know the clinical information for the case and control samples, I would like to add variables such as TCGA_id, gender, and age at diagnosis so that they are taken into account in the case-versus-control comparison.
(The TCGA_id values look mostly distinct, but some are shared between samples.)

This information does not describe replicate experimental samples.

I hope this is enough explanation.

written 6.0 years ago by iammiso

I can’t answer the question without knowing what the nature of the replicates is.

By the way, with 400+ samples, I tend to use limma for such analyses, because it is much faster than a GLM approach. You will still have to figure out how to deal with the replicates, either collapsing or using the duplicateCorrelation function.
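As an illustration of the collapsing option only: if the duplicated IDs turn out to be technical replicates of the same sample (an assumption the thread does not confirm), their counts can be summed per ID before fitting. The sketch below uses base R with hypothetical toy counts and IDs (`cts_toy`, `sample_id`); DESeq2 also provides `collapseReplicates` for this purpose on a DESeqDataSet.

```r
# Hypothetical toy counts: 3 genes x 4 runs, where run1 and run2 share an ID.
cts_toy <- matrix(
  c(10, 0, 5,    # run1
    12, 1, 4,    # run2
     7, 3, 9,    # run3
     8, 2, 6),   # run4
  nrow = 3,
  dimnames = list(paste0("gene", 1:3), paste0("run", 1:4))
)
sample_id <- c("P1", "P1", "P2", "P3")

# rowsum() adds up rows that share a group label, so transpose to put runs
# in rows, sum per sample ID, then transpose back to genes x samples.
collapsed <- t(rowsum(t(cts_toy), group = sample_id))

print(collapsed)
```

The sample table would need to be reduced to one row per ID in the same order as the collapsed columns. If the repeated IDs are instead distinct biological samples from the same patient, summing is not appropriate and the duplicateCorrelation route mentioned above is the other option.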

written 6.0 years ago by Michael Love