Hi all,
I'm sorry if this has been asked many times before. I have searched many questions and documents, but none of them answered my question.
I am having difficulty with the design formula in DESeqDataSetFromMatrix.
Rather than comparing on group (case vs. control) information alone, we want to include additional variables in the comparison.
So I put multiple factors into the design formula. I expected the results to differ from those obtained with only the Group (case, control) information.
This is my design:
Group Project_Id Gender Age_at_Diagnosis
Case TCGA-A6-6654 female 65
Case TCGA-AA-3972 male 72
Case TCGA-G4-6311 male 80
Case TCGA-AA-3667 female 36
Case TCGA-F4-6854 female 77
Case TCGA-A6-6650 female 69
Control TCGA-AA-3520 female 86
Control TCGA-A6-5659 male 82
Control TCGA-AA-3517 male 60
Control TCGA-A6-2685 female 48
Control TCGA-AA-3518 female 81
Control TCGA-AA-3697 male 77
Below is the code I have been trying to run.
> coldata.clinic$Group <- factor(coldata.clinic$Group, levels = c("Control","Case"))
> coldata.clinic$Project_Id <- factor(coldata.clinic$Project_Id)
> coldata.clinic$Gender <- factor(coldata.clinic$Gender)
> coldata.clinic$Age_at_Diagnosis <- factor(coldata.clinic$Age_at_Diagnosis)
> dds.clinic <- DESeqDataSetFromMatrix(countData = cts,
+ colData = coldata.clinic,
+ design = ~Group+Project_Id+Gender+Age_at_Diagnosis)
Error in checkFullRank(modelMatrix) :
the model matrix is not full rank, so the model cannot be fit as specified.
One or more variables or interaction terms in the design formula are linear
combinations of the others and must be removed.
Please read the vignette section 'Model matrix not full rank':
vignette('DESeq2')
The design includes several factors, but I only need to compare case and control, not the other factors (Project_Id, Gender, Age_at_Diagnosis). I just want the other factors to be taken into account when comparing cases and controls.
And one of these factors cannot be excluded.
How should I proceed in this case?
I searched for related problems and saw design formulas in which the factors are joined with * instead of +. With that form, the above error does not occur.
> dds.clinic <- DESeqDataSetFromMatrix(countData = cts,
+ colData = coldata.clinic,
+ design = ~Group*Project_Id*Gender*Age_at_Diagnosis)
What is the difference between + and *?
And what should I use for my analysis?
I have not been able to solve this for a long time. I hope to get comments from the experts.
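For readers comparing the two operators: in an R model formula, `+` adds main effects only, while `*` adds the main effects plus their interaction. The sketch below uses base R's `model.matrix` on a small made-up sample table (the `toy` data frame is hypothetical and is not the data from this post) to show which coefficients each form creates.

```r
# Hypothetical 4-sample table, for illustration only
toy <- data.frame(
  Group  = factor(c("Control", "Control", "Case", "Case"),
                  levels = c("Control", "Case")),
  Gender = factor(c("female", "male", "female", "male"))
)

# '+' : one coefficient per main effect (plus the intercept)
additive <- model.matrix(~ Group + Gender, data = toy)
print(colnames(additive))

# '*' : shorthand for Group + Gender + Group:Gender,
# so the design gains an interaction coefficient
full <- model.matrix(~ Group * Gender, data = toy)
print(colnames(full))
```

With `+`, the Group coefficient is the case vs. control difference adjusted for Gender. With `*`, the Group coefficient is the difference within the reference Gender level only, and each additional `*` term multiplies the number of coefficients, so chaining `*` across many factors is rarely what an adjusted two-group comparison needs.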
Hi Michael,
Some of the Project_Id values are duplicated. Do I still have to remove it?
And not all of the design factors represent replication. Can't the other factors be taken into account when comparing cases and controls?
I don’t really have enough information to answer the question. What kind of replication is there? I assume the table above is not the complete table. How many samples are there? You can’t include this variable as you have it, and I don’t have enough information to say much more.
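As a rough illustration of why an ID variable breaks the design, here is a base R sketch that rebuilds the 12 rows shown in the question and checks the rank of the model matrix. The helper `is_full_rank` is a hypothetical name, not a DESeq2 function, and treating age as numeric in the second design is an assumption made for the example, since the original code converts it to a factor.

```r
# The 12 samples listed in the question
coldata <- data.frame(
  Group = factor(rep(c("Case", "Control"), each = 6),
                 levels = c("Control", "Case")),
  Project_Id = c("TCGA-A6-6654", "TCGA-AA-3972", "TCGA-G4-6311",
                 "TCGA-AA-3667", "TCGA-F4-6854", "TCGA-A6-6650",
                 "TCGA-AA-3520", "TCGA-A6-5659", "TCGA-AA-3517",
                 "TCGA-A6-2685", "TCGA-AA-3518", "TCGA-AA-3697"),
  Gender = factor(c("female", "male", "male", "female", "female", "female",
                    "female", "male", "male", "female", "female", "male")),
  Age_at_Diagnosis = c(65, 72, 80, 36, 77, 69, 86, 82, 60, 48, 81, 77)
)

# Hypothetical helper: TRUE when no column of the model matrix is a
# linear combination of the others
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data = data)
  qr(mm)$rank == ncol(mm)
}

# One ID level per sample: more coefficients than samples
with_id <- is_full_rank(~ Group + factor(Project_Id), coldata)

# Covariates only, age kept numeric (assumption), Group last
without_id <- is_full_rank(~ Gender + Age_at_Diagnosis + Group, coldata)

print(c(with_id = with_id, without_id = without_id))
```

In these 12 rows every Project_Id is unique, so the ID term alone needs 11 coefficients and Group is completely determined by it. Whether the same holds in the full table depends on how the duplicated IDs are distributed, which the thread does not show.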
I'm sorry if my explanation was not clear.
I have 470 case samples and 40 control samples. The goal is to compare expression between case and control.
Since I know the clinical information for the case and control samples, I would like to include variables such as TCGA_id, gender, and age at diagnosis so that they are accounted for in the comparison of expression between case and control.
(The TCGA_id values look all different, but some overlap.)
And this information does not describe replicate experimental samples.
I hope this is enough explanation for you.
I can’t answer the question without knowing what the nature of the replicates is.
By the way, with 400+ samples, I tend to use limma for such analyses, because it is much faster than a GLM approach. You will still have to figure out how to deal with the replicates, either collapsing or using the duplicateCorrelation function.
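To make the collapsing option concrete, here is a minimal base R sketch that sums the count columns belonging to the same patient. The matrix `cts_toy` and the `patient` vector are made up for illustration. Summing is only reasonable if the repeated IDs are technical replicates of the same library, which has not been established in this thread; DESeq2 also provides `collapseReplicates` for that situation.

```r
# Hypothetical counts: 3 genes x 4 samples, where s1 and s2 share a patient
cts_toy <- matrix(
  c(10, 0, 5,    # s1
    12, 1, 4,    # s2
     7, 3, 9,    # s3
     8, 2, 6),   # s4
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("s1", "s2", "s3", "s4"))
)
patient <- c("P1", "P1", "P2", "P3")

# rowsum() sums rows by group, so transpose to put samples in rows,
# sum per patient, then transpose back to genes x patients
collapsed <- t(rowsum(t(cts_toy), group = patient))
print(collapsed)
```

After collapsing, the sample table needs one row per patient in the same column order. If the repeated IDs are instead distinct samples from the same patient, keeping them separate and modeling the within-patient correlation, as with the duplicateCorrelation route mentioned above, is the alternative.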