Dear all,
I have trouble writing my design formula for an RNASeq experiment of 16 samples.
There are 5 controls and 11 tumors. The tumors come from 2 cell lines and have either low or high intensity:
    Sample name  Intensity  CellLine
1   CDX_1_1      Low_BLI    CDX_1
2   CDX_1_2      Low_BLI    CDX_1
3   CDX_2_1      High_BLI   CDX_2
4   CDX_2_2      High_BLI   CDX_2
5   CDX_2_3      Low_BLI    CDX_2
6   CDX_2_4      Low_BLI    CDX_2
7   CDX_2_5      Low_BLI    CDX_2
8   CDX_2_6      High_BLI   CDX_2
9   CDX_2_7      High_BLI   CDX_2
10  CDX_2_8      High_BLI   CDX_2
11  CDX_2_9      Low_BLI    CDX_2
12  CTL_1        0_BLI      Ctrl
13  CTL_2        0_BLI      Ctrl
14  CTL_3        0_BLI      Ctrl
15  CTL_4        0_BLI      Ctrl
16  CTL_5        0_BLI      Ctrl
I get the following error message:
dds=DESeqDataSetFromHTSeqCount(DataFrame,"/home/htseqCount_25082017", ~Intensity+CellLines+Intensity:CellLines)
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  Levels or combinations of levels without any samples have resulted in
  column(s) of zeros in the model matrix.
I read the “Model matrix not full rank” section and tried to find similar designs, but it did not help.
The experimental design seems simple, but I don't understand why I get this message. I agree that for Ctrl and CellLine1 the intensity variable is redundant with CellLine, but that is not the case for CellLine2.
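The rank problem described above can be reproduced without DESeq2. The following is only a sketch in base R: it rebuilds the sample table from this post (the column name `CellLine` and the object name `coldata` are my own choices), then compares the number of coefficients the formula asks for with the number that can actually be estimated.

```r
# Sample annotation copied from the table in the post (16 samples).
coldata <- data.frame(
  Intensity = factor(c("Low_BLI", "Low_BLI",
                       "High_BLI", "High_BLI", "Low_BLI", "Low_BLI", "Low_BLI",
                       "High_BLI", "High_BLI", "High_BLI", "Low_BLI",
                       rep("0_BLI", 5))),
  CellLine = factor(c(rep("CDX_1", 2), rep("CDX_2", 9), rep("Ctrl", 5)))
)

# Same structure as the design in the post: two main effects plus interaction.
mm <- model.matrix(~ Intensity + CellLine + Intensity:CellLine, data = coldata)

n_coef <- ncol(mm)      # coefficients the formula requests
mm_rank <- qr(mm)$rank  # coefficients that are estimable from these samples

# Cross-tabulation: every 0 is a combination of levels without any sample.
cells <- table(coldata$Intensity, coldata$CellLine)
print(cells)
cat("requested:", n_coef, " estimable:", mm_rank, "\n")
```

Only 4 of the 9 possible Intensity x CellLine combinations contain samples, so the matrix cannot have rank higher than 4, whatever happens within CellLine2.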
I need to answer questions like: What are the differences between the 5 controls and the 9 tumors of CellLine2?
Apart from combining both variables into a single one, like this:
    Sample name  Group
1   CDX_1_1      CDX_1_Low
2   CDX_1_2      CDX_1_Low
3   CDX_2_1      CDX_2_High
4   CDX_2_2      CDX_2_High
5   CDX_2_3      CDX_2_Low
6   CDX_2_4      CDX_2_Low
7   CDX_2_5      CDX_2_Low
8   CDX_2_6      CDX_2_High
9   CDX_2_7      CDX_2_High
10  CDX_2_8      CDX_2_High
11  CDX_2_9      CDX_2_Low
12  CTL_1        Ctrl
13  CTL_2        Ctrl
14  CTL_3        Ctrl
15  CTL_4        Ctrl
16  CTL_5        Ctrl
which does not seem clean, I don't see how to use only 1 variable or how to write this design differently.
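For what it is worth, the combined variable can be checked in base R. This is a minimal sketch, assuming the four group labels listed above; the variable name `group` is hypothetical.

```r
# Combined factor, one level per non-empty cell line / intensity combination.
group <- factor(c("CDX_1_Low", "CDX_1_Low",
                  "CDX_2_High", "CDX_2_High", "CDX_2_Low", "CDX_2_Low", "CDX_2_Low",
                  "CDX_2_High", "CDX_2_High", "CDX_2_High", "CDX_2_Low",
                  rep("Ctrl", 5)))

# One column per group; every column has at least one sample.
mm <- model.matrix(~ 0 + group)

group_sizes <- table(group)
print(group_sizes)
cat("columns:", ncol(mm), " rank:", qr(mm)$rank, "\n")
```

With such a factor, the design passed to DESeq2 would presumably be `~ group`, and comparisons become contrasts between its levels, for example `results(dds, contrast = c("group", "CDX_2_Low", "Ctrl"))`.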
Any suggestion would be appreciated.
Thank you
Ok, thank you for your answer
If I may ask one more thing, I would like a confirmation:
To look for the differences between the 5 controls and the 6 tumors with low intensity (whatever the cell type), I think I should use the contrast:
Thus I get 120 differentially expressed genes
I tried this:
Thus I get 270 differentially expressed genes.
Am I right with the first solution? And why would the second be incorrect?
Thank you in advance
If you want the average of the low intensity tumors, the coefficients for the two low intensity groups need to be 1/2 each.
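A small arithmetic sketch of the difference between weights of 1 and weights of 1/2 follows. Everything in it is an assumption made for illustration: the coefficient order (Intercept, CDX_1_Low, CDX_2_High, CDX_2_Low, Ctrl) is a guess, since it is not shown in this thread (check `resultsNames(dds)`), and the coefficient values are made up.

```r
# ASSUMPTION: hypothetical coefficient order and made-up log2-scale values
# for a single gene; neither comes from the actual analysis.
beta <- c(Intercept = 6, CDX_1_Low = 2, CDX_2_High = 5, CDX_2_Low = 4, Ctrl = 1)

sum_contrast <- c(0, 1, 0, 1, -1)      # adds the two low-intensity groups
avg_contrast <- c(0, 1/2, 0, 1/2, -1)  # averages the two low-intensity groups

# A numeric contrast is the weighted sum of the coefficients.
sum_effect <- sum(sum_contrast * beta)  # 2 + 4 - 1
avg_effect <- sum(avg_contrast * beta)  # (2 + 4) / 2 - 1

cat("sum of low groups vs Ctrl:    ", sum_effect, "\n")
cat("average of low groups vs Ctrl:", avg_effect, "\n")
```

With weights of 1, two tumor effects are set against a single control effect, so the result is not a tumor versus control difference. With weights of 1/2, the weights sum to zero and the result is the mean of the low intensity groups minus the control.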
Thank you for your reply.
I guess that the sample size is taken into account when the average of the low intensity tumors is estimated? There are 2 low intensity samples in CellLine1 and 4 in CellLine2. Since there are fewer samples in CellLine1, I want them to contribute less to the model.
Sorry, I don't clearly see the difference with contrast=c(0,1,0,1,-1). Am I looking there at the effect of CellLine1 + CellLine2 compared to the 5 controls?
To me, it is unclear whether I should look at the average of the low intensity tumors. My aim for this specific question is to look for the differences between the 5 controls and the 6 tumors with low intensity (whatever the cell type). I tried a model with the intensity information only (0, low, high) and got 209 differentially expressed genes, which falls between the results from contrasts c(0,1,0,1,-1) and c(0,1/2,0,1/2,-1), but I would prefer to keep the same model for all the comparisons I have to do.
Yes, the standard errors take into account the sample size, so the errors for a coefficient are reduced as the number of samples used to calculate that coefficient grows.
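To make the sample size point concrete, here is a rough sketch of the standard error of a contrast of group means. It is a simplification, not what DESeq2 computes internally: it assumes independent samples with one common standard deviation `sigma`, and the function name `contrast_se` is hypothetical. The group sizes are those of the low intensity tumors and controls in this experiment.

```r
# Standard error of a weighted combination of independent group means,
# assuming the same per-sample standard deviation sigma in every group.
contrast_se <- function(weights, n, sigma = 1) {
  sigma * sqrt(sum(weights^2 / n))
}

n <- c(CDX_1_Low = 2, CDX_2_Low = 4, Ctrl = 5)

equal_w  <- c(1/2, 1/2, -1)   # each cell line counts equally
pooled_w <- c(2/6, 4/6, -1)   # each sample counts equally (weights n / 6)

se_equal   <- contrast_se(equal_w, n)
se_pooled  <- contrast_se(pooled_w, n)
se_doubled <- contrast_se(equal_w, 2 * n)  # same design with twice the samples

cat("equal weights: ", se_equal, "\n")
cat("pooled weights:", se_pooled, "\n")
cat("doubled n:     ", se_doubled, "\n")
```

Under these assumptions the 1/2 and 1/2 weights treat both cell lines equally regardless of their size, while the sample size enters through the standard error. Weights proportional to group size would instead let each sample contribute equally.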
For further questions about why one numeric contrast is recommended, or what the statistical meaning of numeric contrasts is, I think you should meet with a statistical collaborator.
Thank you for your help.
Yes, I will try to clarify these points.