Question

Simultaneously adding two different covariates in the model gives error in DESeq2.

1

Entering edit mode

fizer ▴ 40

@fizer-12098

Last seen 8.3 years ago

US

Hello,

I have rnaseq data generated from brain tissues of 100 individuals, half of them were affected by a disease. All the samples were mix of two different brain regions, obtained from 4 different providers. I am trying to run DESeq2 to identify
differentially expressed genes using the model like: ~age + sex + tissue + PMI + condition, where PMI is post-mortem interval. My colData looks like this:

        condition age   sex   tissue PMI RIN brain_bank
   A016-93       disease 79 M FCBA89 48.00 6.400000       LND
   A032-95       disease 71 F FCBA89 65.00 6.550000       LND
   A046-94       disease 70 F FCBA89 100.00 5.400000       LND
   AN00216       disease 63 M FCBA89 26.16 6.150000       KBC
   AN02738       disease 85 F FCBA89 12.80 6.200000       KBC
   SD001-07    disease 37 F IF 46.00 6.200000       MII
   .
   .
   .
   AN03019       control 47 F FCBA89 31.85 5.300000       KBC
   SD004-10    control 50 M IF 63.00 5.400000       MII
   A046-91    control 41 M IF 49.00 5.450000       LND

Here condition, sex, tissue, brain_bank are as factors and age, RIN, PMI are numeric variables. The RIN values are averaged values because all the samples were sequenced in duplicates.

I want to include brain_bank as a covariate in the model but when I this: ~age + sex + tissue + brain_bank + PMI + condition, it gives me the following error:

   the design formula contains a numeric variable with integer values,
   specifying a model with increasing fold change for higher values.
   did you mean for this to be a factor? if so, first convert
   this variable to a factor using the factor() function
   Show Traceback

   Rerun with Debug
   Error in checkFullRank(modelMatrix) :
   the model matrix is not full rank, so the model cannot be fit as specified.
   One or more variables or interaction terms in the design formula are linear
   combinations of the others and must be removed.

I have seen this error several times but being novice in statistics I am never able to understand this error. If anyone can suggest a good reading material about it, would be great.

Anyways following are my questions:

1) Is there some way to incorporate brain_bank as a covariate in the analysis or I should simply go for the first model?

2) Should I use mean RIN values in the model to account for the effect? Any other suggestions?

3) Not sure if this should asked as a separate question. I also want to run DEXseq in order to find diff. exon usage between cases and controls. So can I use the following model:

~sample + exon + exon:condition + age:exon + sex:exon + tissue:exon + PMI:exon ## Full formula
~sample + exon + age:exon + sex:exon + tissue:exon + PMI:exon ## Reduced formula

Thanks in advance for helping.

deseq2 dexseq rnaseq differential gene expression differential exon usage • 3.9k views

ADD COMMENT • link updated 8.3 years ago by Michael Love 43k • written 8.3 years ago by fizer ▴ 40

1

Entering edit mode

"The model matrix is not full rank" means that brain_bank is not linearly independent from your other variables, so there is no way for the fitting to estimate the brain_bank effect if you also want to include all your other variables - it can be tricky to see why exactly this is, but from the excerpt of the data you provide, it looks like MII is only providing 'IF' tissues, so if this holds for the rest of the colData, then there is no way of knowing if the correlation is due to that bank or to the tissue. You'll have to omit the brain_bank variable: this means you're bundling up the brain_bank variability into the noise, but you may be susceptible to the error of claiming genes being differential due to tissue type, when they are actually partially due to systematic differences between MII and other brain_banks (unlikely, in my experience - you have multiple banks contributing IF samples, so the main concern is the slight increase in residual noise due to not correcting for bank, but that could well be negligible).

Similarly, inclusion of other terms such as RIN is part of the craft of model-building. Excluding it means that you might miss biologically relevant genes whose expression have a linear dependency on RIN (due to the increased residual noise); including the term means you might miss a gene with a spurious correlation with RIN but a real correlation with biology. It's hard to give generic advice, but if there's strong correlation between technical variables such as RIN and biological variables, then you need to think much harder about what's going on. If not, then inclusion/exclusion of the technical variables should make little practical difference (but if it does, then again, you'll need to think hard, and perhaps examine the behaviour of a few key genes that depend heavily on the choice of model).

ADD REPLY • link 8.3 years ago Gavin Kelly ▴ 690

score 1 · Answer 1 · 2017-01-04

For the DEXSeq question, I'd recommend a new post tagged with dexseq, just because there is a lot to parse here.

I can't tell from the information you've included which are the covariates that are colinear with the others in your model. If you're not aware of the problem of colinearity, you can read the section of the vignette that talks about this (if you are using the current version of DESeq2 it will point you to this section in the error). I'd also recommend you consult with a local statistician about which models you can fit and which not (there is nothing unique about DESeq2 in this context, you can just suppose that you are performing a simple linear model with gene expression data from a single gene).

The message about numeric covariates is not an error, just a message. We want users to be careful about using numeric covariates, and to make sure they know what they are doing. When you put in a numeric covariates in a model (+ RIN, or + PMI) you are making a strong assumption that the gene expression has constant multiplicative increases or decreases along units of that numeric covariate. Read the FAQ in the DESeq2 vignette about numeric covariates. So saturation or U-shaped trends are not possible to fit. In the vignette I suggest that a safer procedure is to bin the numeric covariate into meaningful intervals, which will then be fit it as a factor. This allows more flexibility. I don't have lots of experience with using + RIN in the model, but my instinct would be to just remove samples which have very low RIN, and not using RIN in the model.