Question

DESeq2 estimateSizeFactors iterate takes too long

Karthik • 0

@959b4cc0

Sweden

Hello,

The data are related to my previous post. We decided to remove 3 genes from the sample count matrix because they were also present at very high counts in the negative controls. When running DESeq(dds), we got the following error:

estimating size factors
Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  : 
  every gene contains at least one zero, cannot compute log geometric means

It is worth noting that our data are very sparse, and most of the counts are zero. Before running DESeq2, we filtered our data slightly differently. Instead of the commonly suggested pre-filtering, which is:

smallestGroupSize <- 3
keep <- rowSums(counts(dds) >= 10) >= smallestGroupSize
dds <- dds[keep,]

We decided to remove genes based on how many samples showed no expression, i.e., based on the number of zero counts per gene:

# Keep genes with zero counts in at most half of the samples
more_than_50_pct <- rowSums(counts(dds) == 0) <= ncol(dds) / 2
dds <- dds[more_than_50_pct, ]

This reduced the number of genes to 1058. (We have 200 samples in one group, around 104 samples in another group, along with 18 negative controls and 4 blank water samples)
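A minimal base-R sketch of why a filter like this can still lead to the error above. It assumes, consistent with the error message, that the default size-factor estimator relies on per-gene log geometric means, which are only finite for genes without any zero. The toy matrix `toy_counts` is hypothetical and is not the data from this post.

```r
# Hypothetical toy counts: 4 genes x 6 samples, every gene has at least one zero
toy_counts <- matrix(c(0, 5, 3, 0, 8, 2,
                       4, 0, 0, 7, 1, 9,
                       0, 0, 6, 2, 0, 3,
                       1, 2, 0, 4, 5, 0),
                     nrow = 4, byrow = TRUE)

# Same rule as the filter above: zeros in at most half of the samples
keep <- rowSums(toy_counts == 0) <= ncol(toy_counts) / 2

# Log geometric mean per gene; log(0) is -Inf, so one zero makes the mean -Inf
log_geo_means <- rowMeans(log(toy_counts))
usable <- is.finite(log_geo_means)

sum(keep)    # number of genes that pass the zero-fraction filter
sum(usable)  # number of genes with a finite log geometric mean
```

In this toy case every gene passes the filter, yet none has a finite log geometric mean, which is the situation the error message describes.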

There are recommendations to add a pseudocount of 1 to the count table and to use estimateSizeFactors(dds, type = 'iterate'), but I have two concerns:

Because the data are so sparse, I'm afraid this will skew the results.
Even with a smaller subset of samples, it ran for 2+ hours and still couldn't finish the step :/

RNASeq DESeq2 • 456 views

5 months ago • Karthik • 0

I wouldn't add 1 to the matrix.

I would follow ATpoint's advice and use a single-cell (sc) method, or use type="poscounts".

5 months ago • Michael Love 41k
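For reference, a minimal sketch of the type="poscounts" suggestion, assuming DESeq2 is installed. The simulated matrix `cts`, the `coldata` table, and the `condition` design are hypothetical stand-ins, not the data from this thread. Every gene is forced to contain at least one zero so that the default estimator fails in the same way as in the question.

```r
library(DESeq2)

set.seed(1)
n_genes <- 200
n_samples <- 12

# Hypothetical sparse counts drawn from a negative binomial
cts <- matrix(rnbinom(n_genes * n_samples, mu = 2, size = 0.2),
              nrow = n_genes)
# Force at least one zero in every gene (one random sample per gene)
cts[cbind(seq_len(n_genes), sample(n_samples, n_genes, replace = TRUE))] <- 0L

coldata <- data.frame(condition = factor(rep(c("A", "B"), each = n_samples / 2)))
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ condition)

# The default estimator stops because no gene is free of zeros
default_attempt <- try(estimateSizeFactors(dds), silent = TRUE)
inherits(default_attempt, "try-error")

# "poscounts" uses a modified geometric mean over the non-zero counts
dds <- estimateSizeFactors(dds, type = "poscounts")
sizeFactors(dds)
```

Whether these size factors are appropriate for a dataset with negative controls and blanks is a separate question that this sketch does not address.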

score 1 · Answer 1 · 2023-11-23

ATpoint ★ 4.0k

@atpoint-13662

Germany

You have quite a custom dataset here, and the support site is not meant to guide you through your analysis. So, generally:

If the data are counts and sparse, look into the single-cell RNA-seq literature (such as the great OSCA book from Aaron Lun) and see which methods have been developed for normalizing sparse data.
Since you have some sort of input controls, you could read up on ChIP-seq methodologies, as this is conceptually similar to ChIP-seq IP vs. chromatin/IgG input controls.

Please understand that with such custom experiments one would essentially need to see the data at hand and then develop a strategy together. Consider looking for a local collaboration if you need guidance, but the support site is not suited to or intended for this.

5 months ago • ATpoint ★ 4.0k

Hello,

I agree with you, and we are in the process of working with collaborators, but I wanted to get a general opinion on the data and get a feel for it. My past experience has mostly been in rare disorder diagnosis with exome/genome data, so I am still getting used to RNA-seq analysis. I came across the OSCA and SingleR resources when I was trying to assign cell types from the experiment data (another aspect of the analysis we're working on), and I will go through them both.

I will try to keep the discussions to technical support aspects! Thanks!

5 months ago • Karthik • 0