DESeq2: deduce variable for design formula based on sample_info columns (Rscript)
1
0
Entering edit mode
@wouterdecoster-9406
Last seen 10.0 years ago
Belgium

Hello,

I often use DESeq2 for DE analysis and prefer to automate the processing using an Rscript.
The function would look like this:
deseq2 <- function(counts, sampleInfo) {
                suppressMessages(library(DESeq2))
                deseqdata <- DESeqDataSetFromMatrix(countData=counts, colData=sampleInfo, design=~condition)
                dds <- deseqdata[rowSums(counts(deseqdata)) > 1,]
                dds$condition <- relevel(dds$condition, ref="CON")
                
res <- results(DESeq(dds))

                ...

                ...           

                }

 

I would like to be able to deduce the design formula based on the sample info file I provide.For example, I sometimes should take batch effects into account or other group variables. Therefore, it would be the best if I can specify the design formula based on the columns (and their relative order) in the sample info. e.g. ‘sample’ ‘batch’ ‘condition’ would result in ~batch + condition and ‘sample’ ‘condition’ would result just in ~condition (etc.)

I don’t see an obvious way to achieve this. I tried to just use a variable in the design formula e.g 
Mydesign <- colnames(sampleInfo)[2:]
design=~mydesign

This is perhaps due to my quite limited knowledge of R, since I just started using R for writing scripts and have more experience with Python. Could someone point me in the direction to accomplish this goal?

Thanks in advance.

Kind regards,
Wouter De Coster

deseq2 • 2.1k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 4 days ago
United States

hi Wouter,

While DESeq2 automates a lot of steps, this is one really important step that we leave un-automated on purpose. Given a single sample table (colData), there are in fact many designs, as the design encodes modeling assumptions that the investigator makes and the biological question she wants to answer. For example, with continuous variables, it is in many cases preferable to use the cut() function to break into multi-level factor. This often makes more sense than the assumption that a single unit in the continuous variable corresponds to a constant 2^beta fold change in gene expression. The decision whether to include technical covariates, or whether to have interactions between variables can't be automated. colData is also used to store information we don't want to use for modeling, such as sample ID, or RIN values, etc.

That's all to explain why we don't provide this for DESeq2. I'm not sure how I would go about implementing this if I had to. Note that you can pass a matrix constructed by model.matrix to the design argument, which is supported when setting betaPrior=FALSE.

ADD COMMENT
0
Entering edit mode
Also, note you can provide a character string to formula()
ADD REPLY
0
Entering edit mode

Thanks for your reply. I understand that this part is not automated by design, but I would in my case use the order of the column in my sample info file to make the formula, without having to hard-code anything a priori.

Perhaps I could have "~batch+condition" hard-coded, and when no batch effect is to be corrected just use the same value in  sampleInfo$batch? I will have a look at the matrix and formula() options to specify the design.

ADD REPLY

Login before adding your answer.

Traffic: 973 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6