Question

Converting numeric variables to factors in DESeq metadata

0

Entering edit mode

charles7744 • 0

@charles7744-23449

Last seen 4.0 years ago

Hi folks,

We are using pseudobulk DESeq2 (DESeq2_1.26.0) approach to perform DGE testing on single cell datasets.

We are interested in assessing condition induced differences, while controlling for other variables such as RIN, PMI, Age and tissue (shown in metadata).

   Age RIN PMI  Tissue  Condition
1   45 9.0  16      A   Disease
2   71 8.8  35      A   Control
3   47 9.3  25      A   Disease
4   38 7.3  14      A   Disease
5   58 7.4  21      A   Disease
6   85 8.8  26      A   Control
7   56 7.3  18      A   Disease
8   53 9.2  27      A   Disease
9   64 7.6  20      A   Control
10  51 9.4  26      A   Disease
11  54 9.3  15      A   Disease
12  67 7.8  23      A   Control
13  37 9.2  17      A   Disease
14  43 7.9  32      A   Disease
15  38 7.1  27      A   Control
16  43 8.2  32      A   Disease
17  37 8.6  32      A   Control
18  57 8.8  11      A   Control
19  42 8.2  21      A   Control
20  47 8.1  24      A   Control
21  53 9.2  27      A   Disease
22  67 7.2  23      B   Control
23  58 8.2  21      B   Disease
24  38 7.5  27      B   Control
25  67 7.2  12      B   Disease
26  64 7.3  20      B   Control
27  58 8.2  29      B   Disease
28  91 7.0  25      C   Control
29  89 7.4  34      C   Control
30  64 7.2  20      C   Control
31  81 6.9  10      C   Disease
32  74 6.8  29      C   Disease
33  84 7.6  28      C   Disease

However, when using:

dds <- DESeqDataSetFromMatrix(pseudobulk, sample_table, ~Age + RIN +PMI+tissue+disease)

following warning is shown:

  converting counts to integer mode
     the design formula contains one or more numeric variables with integer values,
     specifying a model with increasing fold change for higher values.
     did you mean for this to be a factor? if so, first convert
     this variable to a factor using the factor() function
     the design formula contains one or more numeric variables that have mean or
     standard deviation larger than 5 (an arbitrary threshold to trigger this message).
     it is generally a good idea to center and scale numeric variables in the design
     to improve GLM convergence.
    Warning message:
    In DESeqDataSet(se, design = design, ignoreRank) :
     some variables in design formula are characters, converting to factors

Going further with this approach, we were able to obtain some results

dds  <- DESeq(dds, test = "LRT", reduced = ~Age+ RIN +PMI+tissue, sfType = "poscounts", useT = T, minmu = 1e-6, minReplicatesForReplace=Inf)

As a second approach, in order to avoid above warning, we converted numeric variables to factors using factor() and call the same function

 dds <- DESeqDataSetFromMatrix(pseudobulk, sample_table, ~Age + RIN +PMI+tissue+disease)

following error is thrown:

 Error in checkFullRank(modelMatrix) : 
     the model matrix is not full rank, so the model cannot be fit as specified.
     One or more variables or interaction terms in the design formula are linear
     combinations of the others and must be removed.

In the first approach where some variables were numeric, the warning was given but it worked and we obtained some results.

In the second approach it didn’t work at all. Now we are not quite sure, why there is no linear combination when dealing with numerics, but it appears when converting them to factors.

So the question is, is it right, to convert the numeric values here as factors and if so, how we can check if there is really a linear combination? As we are dealing with more then 30 replicates, it is not obvious to check, if there are some, as most of the numeric variables are highly diverse between the replicates.

The second question is, is it ok to regress so many different variables to test only condition induced differences?

Thanks a lot for your help!

deseq2 • 4.2k views

ADD COMMENT • link updated 22 months ago by krushnach80 • 0 • written 4.0 years ago by charles7744 • 0

score 4 · Answer 1 · 2020-04-29

4

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 1 day ago

Republic of Ireland

When you convert the numeric data to factors, there will definitely be an issue, as reported by DESeq2. The problem is that, once converted to factors, each individual number is treated as its own category / level, and the model becomes impossible to fit; hence, the checkFullRank errror.

As per the initial warning message, you could try to scale these variables via the scale() function, which will, per variable, center the variable (subtract the mean), and then divide by the standard deviation - this is essentially a Z transform, and it makes the distribution of each variable more amenable to a regression.

A question back to you, though: do you need these extra covariates in the design formula? Does there exist preliminary or previously published evidence that they are likely confounders of gene expression in your disease / condition under study? I ask because I frequently see people filling up their design formula with many parameters (and I used to do it all of the time, too), when the best solution is to keep the model relatively simple and mitigate biasing factors via a solid study design.

Kevin

ADD COMMENT • link 4.0 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Hi Kevin,

thanks a lot for your answer.

I'm sorry if this is maybe basic, but does DESeq take into account how similar numeric values are? E.g. having 3 replicates with age 80,79 and 20, will the first two be treated somehow more similar? Then I understand, why scaling makes sense. I assumed, regression in DESeq just handles values, if they are same or not, so the difference between the three replicates in the example is always 1. This is, why I thought, having own categories to ages is good, so only exactly same aged replicates will be treated together.

I tried running it with scaling all numeric variables and it worked without warnings, thanks!

And regarding your question: Thats a good question. I saw in some related papers, that people are regressing everything, what can be extracted from the metadata. We have sex of the replicates as another value. When not regressing that, genes like XIST, which are clearly sex related, come up. I assume, that biological values like age and sex, quality ones like RIN and batch, and also tissue difference can somehow confound the analysis. So actually I'm not quite sure, what is the best approach. I tried with your scale suggestion now, regressing nothing (using Wald test), and then add different metadata columns. Overall the results are somehow pretty similar, however, using everything to reduce, genes like XIST come up again, so probably there are too many variables to fit then. But I'm not sure, how to weigh then the importance, what to regress and what not. Guess, I need to try a bit and compare results. Thanks again!

ADD REPLY • link 4.0 years ago charles7744 • 0

0

Entering edit mode

DESEq2 will just take whatever is the result of factor() applied to the continuous variable; so, even, for example, 20.2 will be different from 20.22.

You could dichotomise your continuous variables into meaningful groups, if desired. For example:

ages <- c(37.2, 35, 55, 45, 49.9, 75, 37.3, 76, 55.5)
factor(ifelse(ages > 55, 'GreaterThan55', 'LessThanOrEqual55'),
  levels = c('LessThanOrEqual55','GreaterThan55'))

[1] LessThanOrEqual55 LessThanOrEqual55 LessThanOrEqual55 LessThanOrEqual55
[5] LessThanOrEqual55 GreaterThan55     LessThanOrEqual55 GreaterThan55    
[9] GreaterThan55

-------------

Yes, sex can confound certain RNA-seq studies, with XIST and TSIX being among the primary genes that drive this. So, including sex is generally regarded as fine, if necessary. I am not too sure about including RIN in the model - I have heard mixed opinions on this, and have no opinion of my own.

With regard to tissue, sometimes, tissue-specific differences can be too great such that stratifying an analysis by tissue may be more appropriate. For example, I did a previous study where it became too difficult to analyse blood and synovial tissue together due to their vast transcriptomal differences. The globin genes will be the markers of blood tissue, unless these are depleted in the wet-lab protocol (blood samples in red, with haemoglobin alpha-1 subunit expression plotted):

ADD REPLY • link 4.0 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Thanks again for your answer! One last question then, from your first answer, I got the impression that you suggest using scaling for numeric variables with high range (like age), instead of factorizing it. And now you had this suggestion of making categories (I was already thinking of something similar). Do you have a hint, what might be more appropriate here?

I agree, sometimes the differences between tissues can be too high, I also experienced that. Thanks!

ADD REPLY • link 4.0 years ago charles7744 • 0

2

Entering edit mode

From my position, I cannot really say which is more appropriate for your data and study area. I would check how the samples are distributed via a PCA bi-plot based on age, though. In doing this, it may be inferred that age is indeed something for which we must control by including it in the design formula.

You could also simply regress age against your outcome to see if there is a statistically significant association, which would again merit the inclusion of age in the design formula. e.g., outside DESeq2:

summary(glm(outcome ~ age, data = mydata, family = binomial(link = 'logit')))

ADD REPLY • link 4.0 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

Perfect, thanks very much Kevin!

ADD REPLY • link 4.0 years ago charles7744 • 0

0

Entering edit mode

Hello kevin lets say I have a single factor in my design such as age how do I interpret the result do we need to do this "You could dichotomise your continuous variables into meaningful groups, if desired. " in case of numerical variable?

ADD REPLY • link 22 months ago krushnach80 • 0