Possible factor-related error message bug in DESeq2
thadryan • 0
@thadryan-23400
New England

Hello, I am posting to report what I think is a bug in an error message in DESeq2. I suspect that DESeqDataSetFromMatrix misdiagnoses the input type when the user passes a factor, telling them that they passed a character.

I have a Google Drive share link to a PDF write-up here, in which I reproduce and inspect the issue, or a Git repo here.

Edit: there is a better-formatted version of the PDF report here.

Note: seemingly random, but the reason the earlier commits in the repo refer to a blog post is that I was writing a tutorial on avoiding type errors in RNA-Seq workflows when I fully understood what was happening.

@mikelove
United States

I'm trying to follow the post but I'm actually getting lost. Can you briefly summarize for me:

You are inputting a data.frame of factors instead of integers as countData to DESeqDataSetFromMatrix, right? Then what happens?

I get this:

> cts <- data.frame(x=factor(1:4),y=factor(4:1),z=factor(1:4))
> coldata <- data.frame(x=factor(c(1,1,2)))
> dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
Error in DESeqDataSet(se, design = design, ignoreRank) :
  counts matrix should be numeric, currently it has mode: character
Calls: DESeqDataSetFromMatrix -> DESeqDataSet

Thanks for getting back to me so quickly!

Yes, that is what I get as well. I just wrapped it in tryCatch for the purposes of making the PDF/repo which contain what I think is a reproducible example.

The error message is what I wasn't sure about: is it intended behaviour to report that the user passed characters when they have actually passed factors? The reason it spooked me is that my first instinct upon reading the message would be to convert to integers or numerics. However, converting factors directly to numerics or integers corrupts your data, because R returns the internal level codes rather than the values shown in the labels.

Consider this toy example, which emulates getting factors from reading in the counts file:

> x <- c("1", "10", "100")
> x
[1] "1"   "10"  "100"
> x <- as.factor(x)
> x
[1] 1   10  100
Levels: 1 10 100
> x <- as.numeric(x)
> x
[1] 1 2 3

If someone forgot to use the legendary stringsAsFactors=FALSE, and then saw an error message saying they are giving characters and should be passing numerics/integers, attempting the implied conversion would silently ruin the dataset.

They should absolutely check their types themselves, but the error message seems to give incorrect feedback (they passed a factor, not a character). In this particular case, it could lead them into a gotcha that runs fine and produces no warnings, but gives bogus results.
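For anyone landing here with factors already in hand, here is a minimal sketch of the unsafe conversion next to two conversions that go through the labels. It uses base R only; the variable names are just for illustration.

```r
# Toy vector standing in for a count column that was read in as a factor.
x <- factor(c("1", "10", "100"))

# Direct conversion returns the internal level codes, not the labels.
codes <- as.numeric(x)

# Going through character first recovers the values shown in the labels.
values <- as.numeric(as.character(x))

# Equivalent form suggested on the ?factor help page:
# convert the levels once, then index them with the factor.
values_alt <- as.numeric(levels(x))[x]

print(codes)
print(values)
print(values_alt)
```

The first result is the silent corruption described above; the other two give back the original counts.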


I see now, this is about what the user does outside of DESeq2. I appreciate your report.

On the other hand, R will have stringsAsFactors set to FALSE in the latest release (R 4.0.0), so it seems like users will not accidentally have factors in their data read in with base R functions in the future.

https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/index.html
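As a rough illustration of why the default matters, the sketch below reads a made-up count table in which one stray non-numeric entry forces a column away from numeric. The text and column names are invented for the example; setting stringsAsFactors explicitly makes the result the same on R versions before and after 4.0.0.

```r
# Invented toy count file with one stray non-numeric entry in column s1.
txt <- "gene,s1,s2\ng1,10,5\ng2,n/a,7\ng3,100,9"

# Old default behaviour: non-numeric columns become factors.
as_factor <- read.csv(text = txt, stringsAsFactors = TRUE)

# R >= 4.0.0 default behaviour: non-numeric columns stay character.
as_char <- read.csv(text = txt, stringsAsFactors = FALSE)

print(class(as_factor$s1))
print(class(as_char$s1))
print(class(as_char$s2))
```

Either way the affected column is not numeric, so checking column types after reading is still worthwhile; the difference is that a character column cannot be silently turned into level codes.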

I can try to sneak in a fix for what is current release, but it will be a past release already in one week (Wed April 29).


I am happy to help in any way I can.

I'd heard rumors of that default changing - good news indeed! This will certainly make the worst case scenario much less likely. I don't think that will fix the error message, however. If you can stand any more of my tedium I will elaborate below. If not:

TL;DR - the change in R is a big help and will decrease the odds of the worst-case scenario I describe in the report (a silent, bogus analysis). That said, I don't think the update to R will prevent incorrect error messages from being served.

I don't mean to harp on this or seem like I have an axe to gRind; I worry that in my enthusiasm for thoroughness I obfuscated my core point that I think the DESeq2 error message states the wrong type. Including what users might do with this misinformation was perhaps misguided on my part.

Re: "outside of DESeq2": Respectfully, I see that as a 'Yes and No' - the report contains pure DESeq2 elements and elements of what could go wrong if a user was served an incorrect error message - I used an example that included common user issues with R to make my (perhaps overly involved) report as thorough as possible by including issues that could result from the bug.

The No - Focusing solely on DESeq2: I think the type in the error message is incorrect, users aside. Based on my experiments, it is my understanding that passing a factor, no matter how it got there, will result in the coercion on line 342 of AllClasses.R by as.matrix (a silent cast from factor to character). There is a reproducible example in my repo/report. Using the snippet you provided:

> cts <- data.frame(x=factor(1:4),y=factor(4:1),z=factor(1:4))
> sapply(cts, class)
       x        y        z 
"factor" "factor" "factor" 
> m <- as.matrix(cts)
> sapply(m, class)
          1           2           3           4           4           3 
"character" "character" "character" "character" "character" "character" 
          2           1           1           2           3           4 
"character" "character" "character" "character" "character" "character"

Because this coercion happens in DESeqDataSetFromMatrix (line 342) before the call to DESeqDataSet (line 365), the reporting done by DESeqDataSet is done on a version of the input that has been changed since the user passed it. Hence, they pass a factor, it gets modified behind the scenes, and they're told they passed a character.
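To make the ordering concrete, here is a small sketch that records the column classes before coercion and the mode afterwards. The helper describe_count_input is hypothetical and is not part of DESeq2; it only shows the kind of check that would see the factor before as.matrix changes it.

```r
# Hypothetical helper (not part of DESeq2): report the classes of the
# input as the user passed it, before any as.matrix() coercion.
describe_count_input <- function(countData) {
  if (is.data.frame(countData)) {
    vapply(countData, function(col) class(col)[1], character(1))
  } else {
    class(countData)[1]
  }
}

cts <- data.frame(x = factor(1:4), y = factor(4:1), z = factor(1:4))

before <- describe_count_input(cts)  # classes as passed by the user
after <- mode(as.matrix(cts))        # mode after the coercion

print(before)
print(after)
```

A check placed where `before` is computed can name the type the user actually supplied; a check placed where `after` is computed can only see the coerced character matrix.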

The Yes - in the worst-case scenario I documented, the real damage is done by factors being factors, well outside the scope of DESeq2. DESeq2 is not what derails the analysis at all, and I don't mean to imply that it does. I simply propose that fixing the seemingly incorrect error message decreases the chance of someone creating a fictitious result. There is nothing to lose by kicking factors out, and it decreases the odds of error, especially if the error message from DESeq2 implies a safe conversion is possible when it isn't.

Anyway, thanks for your time, patience, and DESeq2!

Good luck out there,
Thad


To be extra safe, I've pushed this into the devel branch so it will be in the release. The new behavior is this:

> dds <- DESeqDataSetFromMatrix(cts, coldata, ~x)
Error in DESeqDataSet(se, design = design, ignoreRank) :
  counts matrix should be numeric, currently it has mode: character
Calls: DESeqDataSetFromMatrix -> DESeqDataSet
In addition: Warning message:
In DESeqDataSetFromMatrix(cts, coldata, ~x) :

  'countData' is a data.frame with one or more factor columns.
  Be aware that converting directly to numeric can lead to errors.
  Provide matrices of integers to avoid this error.

This way no one should be surprised by any conversion of data they are doing outside of DESeq2.
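For users who see this warning, one possible way to follow its advice is sketched below. The helper counts_to_integer_matrix is hypothetical (it is not a DESeq2 function) and assumes every label in every column is a whole number; it stops rather than continuing if any value fails to convert.

```r
# Hypothetical helper (not part of DESeq2): build an integer matrix from a
# data.frame whose columns may be factors, using labels rather than codes.
counts_to_integer_matrix <- function(df) {
  cols <- lapply(df, function(col) {
    suppressWarnings(as.integer(as.character(col)))
  })
  m <- do.call(cbind, cols)
  # Assumption: the input has no missing counts, so any NA here means a
  # label could not be read as an integer.
  if (anyNA(m)) stop("some entries could not be converted to integers")
  rownames(m) <- rownames(df)
  m
}

cts <- data.frame(x = factor(c(1, 10, 100)), y = factor(c(100, 10, 1)))
m <- counts_to_integer_matrix(cts)

print(m)
print(storage.mode(m))
```

The resulting matrix holds the original counts as integers, which is the kind of input the warning asks for.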


Thanks for sticking with me! If the original message can't be changed, I think this will help. And having a warning about factors is big.
