I am facing a tricky problem of experimental design that I would like to share with you.
We are working with several groups of samples, each containing sorted cells of the same cell type. In groups A and B, a certain percentage of the cells has been modified, in a different way for each group (this percentage is the continuous value below). Group C is a control group without any modification.
The goal of this experiment is to assess the impact of the modification on the cells by comparing A vs C and B vs C while controlling for the number of modified cells in each group.
In addition, we don't know whether this modification has a linear impact on the cells: e.g., 1% of modified cells could have 10 times less impact than 10%, which could itself have 1000 times less impact than 20%.
The design table:
sample    group  continuous_value(%)
sample_a  A      35
sample_b  A      10
sample_c  B      1
sample_d  B      4
sample_e  C      0
sample_f  C      0
I have already tried two different approaches with DESeq2 to work on this dataset:
a. Use the continuous value as a numeric covariate in the design (with or without log2 transformation). I obtained some differentially expressed genes, but I have no way of telling whether this was the right thing to do.
b. Transform the value into small bins. Unfortunately this did not work ("Error in DESeqDataSet(se, design = design, ignoreRank) : the model matrix is not full rank, so the model cannot be fit as specified. one or more variables or interaction terms in the design formula are linear combinations of the others and must be removed"). As you can see, my controls are always at 0, my group B is quite low, and the % in group A is always higher than the rest. Moreover, we currently don't have enough biological evidence to say that cells from this % to that % can be placed in the same bin.
Since I don't understand the mechanics behind linear models well enough, I would like some advice on this design. How can I compare my groups while controlling for this variable number of cells (whether or not we assume that the % has a linear impact)?
Up to now I have used DESeq2, but I could also use other packages if they are more suitable for this kind of messy design.
Sorry for this long post,
Thanks in advance for your answers!
You should be clear on whether that is your entire design or just a sample of six rows, because it makes a big difference exactly how many samples you have, as well as how many discrete values of the continuous covariate. Also, you should show the analysis code that you tried and that produced your error, and in particular show the design you tried to use. Lastly, you should mention what hypotheses you wish to test. For instance, do you want to test for, or are you expecting, differences in the relationship between the continuous variable and expression in groups A & B?
A. The design shown above is only a sample of the full design; I wanted to keep it as simple as possible. In reality there are four groups: A to C are modified and D is the control (A[n=4], B[n=5], C[n=3], D[n=3]).
Samples    Group  Continuous Variable  Bins
sample_1   A      6.63                 E
sample_2   D      0                    A
sample_3   B      3.66                 D
sample_4   C      1.27                 C
sample_5   D      0                    A
sample_6   D      0                    A
sample_7   B      1.74                 C
sample_8   B      0.14                 B
sample_9   B      0.58                 B
sample_10  A      13.34                F
sample_11  C      3.23                 D
sample_12  C      0.52                 B
sample_13  A      7.49                 E
sample_14  B      0.06                 B
sample_15  A      41                   F
B. Here is my code:
library(DESeq2)
# colData is the table above. I have tried more than 20 different combinations for the bins.
colData <- read.table("~/../colData.txt", sep = "\t", header = TRUE)
# Drop the third column (the continuous variable in the table above).
colData <- colData[, -3]
design <- ~ Bins + Group
# Matrix is the table containing the number of reads calculated with featureCounts.
dds <- DESeqDataSetFromMatrix(countData = Matrix, colData = colData, design = design)
# The call above fails with:
# Error in DESeqDataSet(se, design = design, ignoreRank) :
#   the model matrix is not full rank, so the model cannot be fit as specified.
#   one or more variables or interaction terms in the design formula
#   are linear combinations of the others and must be removed
# so this step is never reached:
dds <- DESeq(dds)
C. I hope my answer is what you were expecting:
I would like to test for differences between modifications A and B, and all other combinations (A vs D, B vs C, ...). I am not currently interested in the impact of this continuous covariate itself. I just want to use it to normalise my read counts in the best way possible.
I am expecting that 1% in group A will be equivalent to 1% in group B (this is an assumption). However, I am expecting that going from 1% to 10% will not give a 10-fold relationship but something more complicated that I can't currently assess.
The full-rank error message refers to the fact that your groups are confounded by your bins. Group D has the same samples as bin A, while group A consists of all samples from bins E and F. This means that any DE between groups cannot be distinguished from an uninteresting bin-related effect (i.e., bin and group coefficients are redundant, such that design does not have full column rank). In essence, the problem is the same as that mentioned in my answer below.
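The confounding can be checked in base R without running DESeq2. The following is a minimal sketch that retypes the Group and Bins columns from the table above and compares the rank of the model matrix with its number of columns. The object names (col_data, mm, and so on) are only illustrative and are not part of the original code.

```r
# Group and Bins columns retyped from the table above (sample_1 to sample_15).
col_data <- data.frame(
  Group = factor(c("A", "D", "B", "C", "D", "D", "B", "B", "B", "A",
                   "C", "C", "A", "B", "A")),
  Bins  = factor(c("E", "A", "D", "C", "A", "A", "C", "B", "B", "F",
                   "D", "B", "E", "B", "F"))
)

# Cross-tabulation: bin A contains only group D, bins E and F only group A.
print(table(col_data$Bins, col_data$Group))

# Build the same design as in the failing call and compare rank to columns.
mm      <- model.matrix(~ Bins + Group, data = col_data)
n_coef  <- ncol(mm)
mm_rank <- qr(mm)$rank
cat("coefficients:", n_coef, " rank:", mm_rank, "\n")

# TRUE means some coefficients cannot be estimated separately.
is_rank_deficient <- mm_rank < n_coef
print(is_rank_deficient)
```

Any binning scheme can be screened this way before it is passed to DESeqDataSetFromMatrix: if the rank is lower than the number of coefficients, the same error will be raised.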
Yes, I understand why I got this error, but I am not sure I understand what it implies. Could you correct me if I am wrong? This clearly means that the bin solution cannot apply to this design, since I won't be able to come up with bins that are different enough from the groups to be relevant. But does this also mean that the transformation of the numeric values described by Aaron Lun won't be applicable?
A confounded experimental design is when two variables are identical and therefore redundant. If a bin is the same as a group, then it is impossible to tell whether any effect on those samples is due to the bin or the group, and this impossibility results in the rank-deficiency error. Aaron's solution using natural splines retains the continuous nature of the variable, which means that it won't be confounded with the discrete groups.
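As a rough sketch of the spline idea (not necessarily the exact model Aaron proposed), the continuous variable can enter the design through ns() from the splines package, alongside Group. The choice of df = 2, the column name Pct and the other object names are assumptions made for illustration; the values are retyped from the table above.

```r
library(splines)

# Group and continuous variable retyped from the table above.
col_data <- data.frame(
  Group = factor(c("A", "D", "B", "C", "D", "D", "B", "B", "B", "A",
                   "C", "C", "A", "B", "A")),
  Pct   = c(6.63, 0, 3.66, 1.27, 0, 0, 1.74, 0.14, 0.58, 13.34,
            3.23, 0.52, 7.49, 0.06, 41)
)

# Make the control group D the reference level (an illustrative choice).
col_data$Group <- relevel(col_data$Group, ref = "D")

# Natural cubic spline of the percentage plus the group factor.
# df = 2 is an assumed, deliberately small value given only 15 samples.
mm_spline   <- model.matrix(~ ns(Pct, df = 2) + Group, data = col_data)
spline_rank <- qr(mm_spline)$rank
cat("coefficients:", ncol(mm_spline), " rank:", spline_rank, "\n")

# TRUE means every coefficient, including the group effects, is estimable.
full_rank <- spline_rank == ncol(mm_spline)
print(full_rank)
```

Because the spline terms vary smoothly with the percentage instead of splitting the samples into blocks, they do not reproduce the group indicators the way the bins did. Each extra spline degree of freedom uses up a residual degree of freedom, so with this few samples a small df is probably the safer choice.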