Question

Deseq multifactor test design full rank error and how to determine one variable only effect.

0

Entering edit mode

jaime.alvarez.benayas • 0

@jaimealvarezbenayas-14506

Last seen 6.7 years ago

Hello,

I am trying to perform some test to see differential counts for different combinations of variables within an experiment.

I have 26 samples divided into "condition" (disease/healthy), additionally there is a "marker" which can be either "plus" or "minus" for healthy samples and "other" for disease samples. Finally there is a third variable: "batch" with the following distribution:

            condition marker batch
17.4          disease  other     1
24.11         disease  other     2
24.7          disease  other     3
24.8          disease  other     2
26.10B        disease  other     4
26.12         disease  other     4
26.13         disease  other     4
26.15B        disease  other     4
26.18,26.20   healthy   plus     4
26.19         healthy  minus     4
26.6B         disease  other     4
17.5          disease  other     1
26.8          disease  other     4
26.9B         disease  other     4
27.12         disease  other     5
27.18         disease  other     5
27.20         disease  other     5
27.21         healthy   plus     5
27.22         healthy  minus     5
17.9          disease  other     6
19.1          disease  other     7
19.2          disease  other     7
19.5          disease  other     7
19.6          disease  other     7
19.8          disease  other     7
24.10         disease  other     8

In a general way I want to determine what are the differential counts between Disease and Healthy while accounting for contributions of these differences due to marker and batch effects. I don't know if there is a better way to do this, what I have come to is this:

-Differential counts Disease vs Healthy (condition effect). (1)

-Differential counts based on marker status on healthy condition (conditionHealthy.marker.nested2) (2)

-Differential counts based on batch status on any condition (healthy/disease) (conditiondisease:batchX and conditionhealthy:batchY). For all corresponding batches. (3)

Consider condition only effects (1) not intersecting [(2) union (3)].

~ condition + condition:marker.nested + condition:batch

# Make the base condition healthy
coldata_df$condition=relevel(coldata_df$condition, "healthy")

# Create a custom model.matrix with the nested markerstatus independently for each condition and the batch
model_matrix = model.matrix(~ condition + condition:marker.nested + condition:batch, coldata_df)

Since the marker is a subtype for each condition which is not shared among the conditions, and following the advise given in With DESeq2 "Not full rank" Error with design ~ line + time + condition I created a marker subclass "marker.nested" with an independent id scheme for each condition. Similarly to the link given, the effect of marker.nested can't be distinguished from the condition effect for condition = disease (there is only one type of marker for disease condition and it is different from the markers in the healthy). Therefore after creating a model matrix I delete the column "conditiondisease:marker.nested2". Similarly, with respect to "batch" I delete any columns containing all zeros:

colnames(model_matrix)
[1] "(Intercept)"                     "conditiondisease"                "conditionhealthy:marker.nested2"
 [4] "conditiondisease:marker.nested2" "conditionhealthy:batch2"         "conditiondisease:batch2"        
 [7] "conditionhealthy:batch3"         "conditiondisease:batch3"         "conditionhealthy:batch4"        
[10] "conditiondisease:batch4"         "conditionhealthy:batch5"         "conditiondisease:batch5"        
[13] "conditionhealthy:batch6"         "conditiondisease:batch6"         "conditionhealthy:batch7"        
[16] "conditiondisease:batch7"         "conditionhealthy:batch8"         "conditiondisease:batch8"

# Drop coefficients which have all 0's in the column (conditionDisease:marker.nested2)
model_matrix = model_matrix[, colSums(model_matrix) > 0]

colnames(model_matrix)
 [1] "(Intercept)"                     "conditiondisease"                "conditionhealthy:marker.nested2"
 [4] "conditiondisease:batch2"         "conditiondisease:batch3"         "conditionhealthy:batch4"        
 [7] "conditiondisease:batch4"         "conditionhealthy:batch5"         "conditiondisease:batch5"        
[10] "conditiondisease:batch6"         "conditiondisease:batch7"         "conditiondisease:batch8"

The resulting model matrix:

       (Intercept) conditiondisease conditionhealthy:marker.nested2 conditiondisease:batch2
17.4             1                1                               0                       0
24.11            1                1                               0                       1
24.7             1                1                               0                       0
24.8             1                1                               0                       1
26.10B           1                1                               0                       0
26.12            1                1                               0                       0
       conditiondisease:batch3 conditionhealthy:batch4 conditiondisease:batch4 conditionhealthy:batch5
17.4                         0                       0                       0                       0
24.11                        0                       0                       0                       0
24.7                         1                       0                       0                       0
24.8                         0                       0                       0                       0
26.10B                       0                       0                       1                       0
26.12                        0                       0                       1                       0
       conditiondisease:batch5 conditiondisease:batch6 conditiondisease:batch7 conditiondisease:batch8
17.4                         0                       0                       0                       0
24.11                        0                       0                       0                       0
24.7                         0                       0                       0                       0
24.8                         0                       0                       0                       0
26.10B                       0                       0                       0                       0
26.12                        0                       0                       0                       0

I create the DESeqDataSet, since I can't specify the design used for the model_matrix at this point (will result in full rank error), I just specify design = ~ condition and rectify it later (I am assuming this can be done??)

# Get the data into DESeqDataSet format, Using design ~ condition and overriding it later
dds <- DESeqDataSetFromMatrix(countData = input_matrix,
                              colData = coldata_df,
                              design = ~ condition)

# Perform differential expression using the model matrix specified
dds = DESeq(dds, full=model_matrix, betaPrior=FALSE)

I get the "the model matrix is not full rank" error.

I am able to run the design:

~ condition + condition:marker.nested

without any issue.

How can I solve the issue with the rank and include batch? Also, is this the best approach to determine genuine condition only effects?

Thank you in advance and sorry for the long post.

deseq2 multiple factor design design matrix • 1.2k views

ADD COMMENT • link 7.1 years ago • updated 7.0 years ago jaime.alvarez.benayas • 0

0

Entering edit mode

Thank you for your input Michael. I have given it some thought, as you say because of the confounding one can't test simultaneously for disease, marker and batch. Additionally, something I hadn't taken into account is that the healthy samples are replicates from the same donor.id see "donor.id" column in the coldata down:

I performed the following strategy, please, if you can, let me know if it makes sense to you:

1) Create a condition.marker combination:
-condition disease -> "disease"
-condition healthy, marker plus -> "healthy_markerplus"
-condition healthy, marker minus -> "healthy_markerminus"

Similarly, create a condition.donor combination

-condition disease -> "disease"

-condition healthy, donor_id 160617 -> "healthy_160617"

-condition healthy, donor_id 271017-> "healthy_271017"

            condition marker batch donor.id condition.marker condition.donor

17.4          disease  other     1 1 disease disease

24.11         disease  other     2 2 disease disease

24.7          disease  other     3 3 disease disease

24.8          disease  other     2 4 disease disease

26.10B        disease  other     4 5 disease disease

26.12         disease  other     4 6 disease disease

26.13         disease  other     4 7 disease disease

26.15B        disease  other     4 8 disease disease

26.18,26.20   healthy   plus     4 160617 healthy_markerplus healthy_160617

26.19         healthy  minus     4 160617 healthy_markerminus healthy_160617

26.6B         disease  other     4 9 disease disease

17.5          disease  other     1 10 disease disease

26.8          disease  other     4 11 disease disease

26.9B         disease  other     4 12 disease disease

27.12         disease  other     5 13 disease disease

27.18         disease  other     5 14 disease disease

27.20         disease  other     5 15 disease disease

27.21         healthy   plus     5 271017 healthy_markerplus healthy_271017

27.22         healthy  minus     5 271017 healthy_markerminus healthy_271017

17.9          disease  other     6 16 disease disease

19.1          disease  other     7 17 disease disease

19.2          disease  other     7 18 disease disease

19.5          disease  other     7 19 disease disease

19.6          disease  other     7 20 disease disease

19.8          disease  other     7 21 disease disease

24.10         disease  other     8 22 disease disease

Find differentially expressed counts between condition.marker disease to the average of condition.marker healthy_markerplus and condition.marker healthy_markerminus:

design = ~ 0 + condition.marker

# Using contrasts:

numerator<-c("condition.markerdisease")
denominator<-c("condition.markerhealthy_markerplus","condition.markerhealthy_markerminus")
LC<-list(numerator,denominator)
results=as.data.frame(results(dds, contrast=LC, listValues = c(1,-1/2)))

2) Similarly, finds differentially expressed regions between condition.donor disease to the average of condition.donor healthy_160617 and condition.donor healthy_271017:

design = ~ 0 + condition.donor

3) Then test for the effect of disease, controlling for batch through a Likelihood Ratio Test (LRT) obtaining counts changes in disease in the presence of batch effects:

# Put the healthy condition as first level so the comparison results (only "log2 fold change") is disease vs healthy.
coldata_df$condition = relevel(coldata_df$condition, "healthy")

# Get the data into DESeqDataSet format
dds <- DESeqDataSetFromMatrix(countData = input_matrix,
                              colData = coldata_df,
                              design = ~ batch + condition)

dds.LRT <- DESeq(dds,

                  betaPrior=FALSE,

                  test="LRT",

                  full=~ batch + condition,

                  reduced=~ batch)

4) Get significant results (padj not "NA", padj < 0.05 and log2FoldChange>=1) for the three results.

5) Intersect the significant features of all three results to get the significant common regions.

6) Only for these common regions in the three results tables:

get the rank of each of the individual tables in terms of the padj value first and log2FoldChange second. (1-> lowest padj and highest foldchange).
Calculate the average rank (rank of the three tables).
Order the output by increasing average rank.

ADD REPLY • link 7.0 years ago jaime.alvarez.benayas • 0

0

Entering edit mode

The analysis strategy is up to you, I'm just here to explain how the software works if users are confused, not to provide analysis plans.

But, yes, you can use listValues=c(1,-1/2) to make the comparison of one level vs the average of the two in the denominator, as you have above. That's reasonable.

ADD REPLY • link 7.0 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael.

ADD REPLY • link 7.0 years ago jaime.alvarez.benayas • 0

score 1 · Answer 1 · 2018-03-07

Let me jump in at this question: "In a general way I want to determine what are the differential counts between Disease and Healthy while accounting for contributions of these differences due to marker and batch effects"

Because marker and condition are confounded, they can't both be included in a fixed effects model. You could use marker alone, and then test the average of plus vs other or minus vs other. But you cannot simultaneously control for marker and fit a condition effect, using fixed effects models, due to the confounding.