Hello,
I am trying to perform some test to see differential counts for different combinations of variables within an experiment.
I have 26 samples divided into "condition" (disease/healthy), additionally there is a "marker" which can be either "plus" or "minus" for healthy samples and "other" for disease samples. Finally there is a third variable: "batch" with the following distribution:
condition marker batch 17.4 disease other 1 24.11 disease other 2 24.7 disease other 3 24.8 disease other 2 26.10B disease other 4 26.12 disease other 4 26.13 disease other 4 26.15B disease other 4 26.18,26.20 healthy plus 4 26.19 healthy minus 4 26.6B disease other 4 17.5 disease other 1 26.8 disease other 4 26.9B disease other 4 27.12 disease other 5 27.18 disease other 5 27.20 disease other 5 27.21 healthy plus 5 27.22 healthy minus 5 17.9 disease other 6 19.1 disease other 7 19.2 disease other 7 19.5 disease other 7 19.6 disease other 7 19.8 disease other 7 24.10 disease other 8
In a general way I want to determine what are the differential counts between Disease and Healthy while accounting for contributions of these differences due to marker and batch effects. I don't know if there is a better way to do this, what I have come to is this:
-Differential counts Disease vs Healthy (condition effect). (1)
-Differential counts based on marker status on healthy condition (conditionHealthy.marker.nested2) (2)
-Differential counts based on batch status on any condition (healthy/disease) (conditiondisease:batchX and conditionhealthy:batchY). For all corresponding batches. (3)
Consider condition only effects (1) not intersecting [(2) union (3)].
~ condition + condition:marker.nested + condition:batch
# Make the base condition healthy coldata_df$condition=relevel(coldata_df$condition, "healthy") # Create a custom model.matrix with the nested markerstatus independently for each condition and the batch model_matrix = model.matrix(~ condition + condition:marker.nested + condition:batch, coldata_df)
Since the marker is a subtype for each condition which is not shared among the conditions, and following the advise given in With DESeq2 "Not full rank" Error with design ~ line + time + condition I created a marker subclass "marker.nested" with an independent id scheme for each condition. Similarly to the link given, the effect of marker.nested can't be distinguished from the condition effect for condition = disease (there is only one type of marker for disease condition and it is different from the markers in the healthy). Therefore after creating a model matrix I delete the column "conditiondisease:marker.nested2". Similarly, with respect to "batch" I delete any columns containing all zeros:
colnames(model_matrix) [1] "(Intercept)" "conditiondisease" "conditionhealthy:marker.nested2" [4] "conditiondisease:marker.nested2" "conditionhealthy:batch2" "conditiondisease:batch2" [7] "conditionhealthy:batch3" "conditiondisease:batch3" "conditionhealthy:batch4" [10] "conditiondisease:batch4" "conditionhealthy:batch5" "conditiondisease:batch5" [13] "conditionhealthy:batch6" "conditiondisease:batch6" "conditionhealthy:batch7" [16] "conditiondisease:batch7" "conditionhealthy:batch8" "conditiondisease:batch8" # Drop coefficients which have all 0's in the column (conditionDisease:marker.nested2) model_matrix = model_matrix[, colSums(model_matrix) > 0]
colnames(model_matrix) [1] "(Intercept)" "conditiondisease" "conditionhealthy:marker.nested2" [4] "conditiondisease:batch2" "conditiondisease:batch3" "conditionhealthy:batch4" [7] "conditiondisease:batch4" "conditionhealthy:batch5" "conditiondisease:batch5" [10] "conditiondisease:batch6" "conditiondisease:batch7" "conditiondisease:batch8"
The resulting model matrix:
(Intercept) conditiondisease conditionhealthy:marker.nested2 conditiondisease:batch2 17.4 1 1 0 0 24.11 1 1 0 1 24.7 1 1 0 0 24.8 1 1 0 1 26.10B 1 1 0 0 26.12 1 1 0 0 conditiondisease:batch3 conditionhealthy:batch4 conditiondisease:batch4 conditionhealthy:batch5 17.4 0 0 0 0 24.11 0 0 0 0 24.7 1 0 0 0 24.8 0 0 0 0 26.10B 0 0 1 0 26.12 0 0 1 0 conditiondisease:batch5 conditiondisease:batch6 conditiondisease:batch7 conditiondisease:batch8 17.4 0 0 0 0 24.11 0 0 0 0 24.7 0 0 0 0 24.8 0 0 0 0 26.10B 0 0 0 0 26.12 0 0 0 0
I create the DESeqDataSet, since I can't specify the design used for the model_matrix at this point (will result in full rank error), I just specify design = ~ condition and rectify it later (I am assuming this can be done??)
# Get the data into DESeqDataSet format, Using design ~ condition and overriding it later dds <- DESeqDataSetFromMatrix(countData = input_matrix, colData = coldata_df, design = ~ condition) # Perform differential expression using the model matrix specified dds = DESeq(dds, full=model_matrix, betaPrior=FALSE)
I get the "the model matrix is not full rank" error.
I am able to run the design:
~ condition + condition:marker.nested
without any issue.
How can I solve the issue with the rank and include batch? Also, is this the best approach to determine genuine condition only effects?
Thank you in advance and sorry for the long post.
Thank you for your input Michael. I have given it some thought, as you say because of the confounding one can't test simultaneously for disease, marker and batch. Additionally, something I hadn't taken into account is that the healthy samples are replicates from the same donor.id see "donor.id" column in the coldata down:
I performed the following strategy, please, if you can, let me know if it makes sense to you:
1) Create a condition.marker combination:
-condition disease -> "disease"
-condition healthy, marker plus -> "healthy_markerplus"
-condition healthy, marker minus -> "healthy_markerminus"
Similarly, create a condition.donor combination
-condition disease -> "disease"
-condition healthy, donor_id 160617 -> "healthy_160617"
-condition healthy, donor_id 271017-> "healthy_271017"
Find differentially expressed counts between condition.marker disease to the average of condition.marker healthy_markerplus and condition.marker healthy_markerminus:
2) Similarly, finds differentially expressed regions between condition.donor disease to the average of condition.donor healthy_160617 and condition.donor healthy_271017:
3) Then test for the effect of disease, controlling for batch through a Likelihood Ratio Test (LRT) obtaining counts changes in disease in the presence of batch effects:
4) Get significant results (padj not "NA", padj < 0.05 and log2FoldChange>=1) for the three results.
5) Intersect the significant features of all three results to get the significant common regions.
6) Only for these common regions in the three results tables:
The analysis strategy is up to you, I'm just here to explain how the software works if users are confused, not to provide analysis plans.
But, yes, you can use listValues=c(1,-1/2) to make the comparison of one level vs the average of the two in the denominator, as you have above. That's reasonable.
Thank you Michael.