DESeq2 unable to handle large count values
robennet • Last seen 5 months ago • United States

My counts matrix has 100+ transcripts with at least one value over the maximum integer R can represent (.Machine$integer.max, i.e. 2,147,483,647). Some of these transcripts don't have any counts small enough for R to store as integers. We've thought about shifting the values down, but because so many of these transcripts also have much smaller counts for certain participants, that isn't really feasible. We've considered truncating all of the massive values to the maximum integer value, but that eliminates any potential nuance in those transcripts. We even talked about filtering out transcripts with counts that high, but, again, we're concerned about losing valuable information.

I've searched for a workaround that would let me pass integer64 values instead, but haven't found anything. Is there something I'm missing, or can DESeq2 just not handle counts this large?
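(As an illustrative aside, not from the original post: this is a minimal sketch of the underlying limitation. Values above .Machine$integer.max become NA when coerced to integer, so they cannot populate the integer count matrix DESeq2 expects. The object and sample names below are made up.)

    # Counts above .Machine$integer.max (2147483647) cannot be stored as
    # R integers and become NA on coercion to integer.
    cnt <- matrix(c(10, 2500000000, 42, 7), nrow = 2,
                  dimnames = list(c("tx1", "tx2"), c("sample1", "sample2")))
    storage.mode(cnt)                  # "double" -- too large for integer storage
    as.integer(cnt["tx2", "sample1"])  # NA, with a coercion warning
    suppressWarnings(storage.mode(cnt) <- "integer")
    cnt                                # the 2.5e9 count is now NA
    # DESeqDataSetFromMatrix() would then refuse the matrix because of the NA values.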

Michael Love (@mikelove) • Last seen 15 hours ago • United States

There is an issue on GitHub to deal with this, but for the moment we can't accommodate it. Out of curiosity, how did you generate a count of 1 billion for one gene in one sample?


Michael Love, I didn't do any of the data collection; I'm just developing the pipeline for the expression analysis. I believe my PI is looking into that, just to make sure everything was done correctly, but I'm not sure when, or if, I'll find out.

Do you have any suggestions on alternative ways to handle large values? The only thing I can think of is some type of rescaling into the 1 to 2,147,483,647 range, but I know DESeq2 specifically requires raw counts, because it handles all of the normalization internally.


Maybe use a log-linear approach like limma.

I do not know how one gene could have a count of a billion. Rather than trying to push this data through, I would take a step back and check whether anything went wrong in the processing.
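(For reference, a minimal limma-voom sketch, not from the thread, of the kind of log-linear analysis suggested above. It does not require integer storage; `counts` and `condition` are hypothetical names for a numeric count matrix and a two-level grouping factor.)

    library(edgeR)   # DGEList, calcNormFactors
    library(limma)   # voom, lmFit, eBayes, topTable

    dge <- DGEList(counts = counts)      # counts can remain double precision
    dge <- calcNormFactors(dge)          # TMM normalization factors
    design <- model.matrix(~ condition)  # simple two-group design
    v <- voom(dge, design)               # log2-CPM values with precision weights
    fit <- eBayes(lmFit(v, design))
    topTable(fit, coef = 2)              # top-ranked genes for the condition effect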


Oh, it's around 140 genes out of the roughly 16,000 that have counts over 2.1 billion. The number of massive values per gene ranges from 1 to 285, which is how many individuals we have sequencing data for. Everything else seems fine; it's just this subgroup of genes that is massively expressed. I'll bring up limma in my next meeting and see how it goes. Thanks for the help!


I recommend solving the underlying problem. RNA-seq does not produce values like these, so something must be wrong in the preprocessing pipeline. Maybe one sample in thousands would show odd artifacts, but across your entire cohort? That is indicative of a general processing problem. How did you do the gene quantification?


Hi, I am facing a similar problem with my DESeq2 analysis. Large count values were introduced into the matrix when correcting for batch effects using ComBat-seq. I cannot create the DESeqDataSet object and am getting an error due to NAs introduced by the large counts in the count matrix. How can I rectify this situation?


I don't actually recommend modifying the count values for batch correction, but instead including RUV or SVA factors in the design (see the workflow).

The fact that count values over a billion were created isn't reassuring.
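(A minimal sketch of the design-based approach suggested above, using svaseq as in the RNA-seq workflow; it is not from the thread. `counts`, `coldata`, and `condition` are hypothetical names, and using two surrogate variables is an arbitrary choice for illustration.)

    library(DESeq2)
    library(sva)

    # Build the DESeqDataSet from unmodified raw counts (no ComBat-seq adjustment).
    dds <- DESeqDataSetFromMatrix(countData = counts,
                                  colData   = coldata,
                                  design    = ~ condition)
    dds <- estimateSizeFactors(dds)
    dat <- counts(dds, normalized = TRUE)
    dat <- dat[rowMeans(dat) > 1, ]                  # drop very low-count genes

    mod  <- model.matrix(~ condition, colData(dds))  # full model
    mod0 <- model.matrix(~ 1, colData(dds))          # null model
    svseq <- svaseq(dat, mod, mod0, n.sv = 2)        # estimate 2 surrogate variables

    # Put the surrogate variables in the design instead of altering the counts.
    dds$SV1 <- svseq$sv[, 1]
    dds$SV2 <- svseq$sv[, 2]
    design(dds) <- ~ SV1 + SV2 + condition
    dds <- DESeq(dds)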
