My counts matrix has 100+ transcripts with at least one value over the maximum R can represent as an integer (2^31 - 1 = 2,147,483,647). Some of these transcripts don't have any counts small enough for R to handle as an integer. We've tried to think of ways to shift the values down, but since so many of them have much smaller counts for certain participants, that isn't really feasible. We've considered truncating all the massive values to the maximum integer, but that eliminates any potential nuance in those transcripts. We even talked about filtering out transcripts with counts that high, but, again, we're concerned about losing valuable information.
I've tried searching for a work-around that lets me pass integer64 values instead, but haven't found anything. Is there something I'm missing or can DESeq2 just not handle counts this large?
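As a quick sanity check before handing anything to DESeq2, something like this base-R sketch (with a hypothetical toy matrix standing in for the real counts) would show which genes contain values above R's integer limit:

```r
# Toy counts matrix standing in for the real one (hypothetical gene/sample names).
counts <- matrix(c(100, 3e9, 250, 5e9), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("sample1", "sample2")))

# .Machine$integer.max is 2^31 - 1 = 2147483647, the largest value
# as.integer() can represent; anything above it coerces to NA.
over_limit <- rowSums(counts > .Machine$integer.max) > 0
names(which(over_limit))  # genes with at least one overflowing count
```

Here only `geneB` is flagged; on real data this gives you the list of affected genes and, via `colSums()` on the same comparison, the affected samples.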
Michael Love, I didn't do any of the data collection or anything; I'm just developing the pipeline for the expression analysis. I believe my PI is trying to look into that, just to make sure everything was done correctly, but I'm not sure when, or if, I'll find out.
Do you have any suggestions for alternative ways to handle large values? The only thing I can think of is rescaling the counts into the 1–2,147,483,647 range, but I know DESeq2 specifically relies on raw counts, because it handles all the normalization internally.
Maybe use a log-linear approach like limma.
I do not know how one gene could have a count of a billion. Rather than trying to push this data through, I would take a step back and check whether you have any problems in the processing.
Oh, it's around 140 genes out of roughly 16,000 that have counts over 2.1 billion. The number of massive values per gene ranges from 1 to 285, which is how many individuals we have sequencing data for. Everything else seems fine; it's just this subgroup of genes that shows these massive counts. I'll bring up limma in my next meeting and see how it goes. Thanks for the help!
I recommend solving the underlying problem. RNA-seq does not produce values like these, so something must be wrong in the preprocessing pipeline. Maybe one sample in thousands would show odd artifacts, but across your entire cohort? That is indicative of a general processing problem. How did you do the gene quantification?
Hi, I am facing a similar problem with my DESeq2 analysis. Large counts were introduced into the matrix when correcting for batch effects with ComBat-seq. I cannot create the DESeqDataSet object and get an error about NAs introduced by the large counts in the count matrix. How can I rectify this situation?
I don't actually recommend modifying the count values for batch correction; instead, use RUV or SVA factors in the design (see the DESeq2 workflow).
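For concreteness, the design-based approach might look roughly like this. This is a sketch only, following the sva section of the DESeq2 workflow; it assumes you have a `DESeqDataSet` named `dds` with a `~ condition` design and the `sva` package installed, and it keeps the raw counts untouched:

```r
library(DESeq2)
library(sva)

# Estimate surrogate variables from normalized counts instead of
# "correcting" the counts themselves.
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)
mod  <- model.matrix(~ condition, colData(dds))  # full model
mod0 <- model.matrix(~ 1, colData(dds))          # null model
svseq <- svaseq(norm_counts, mod, mod0, n.sv = 2)

# Add the surrogate variables to the design so DESeq2 adjusts for
# the unwanted variation during model fitting.
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + condition
dds <- DESeq(dds)
```

The key point is that the batch adjustment happens in the model, so the count matrix stays as raw non-negative integers and nothing can overflow.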
The fact that count values over a billion were created isn't reassuring.
I am facing exactly the same problem and am stuck. I don't know how to make it work.
Did you use ComBat-seq?