Question

Pre-filtering genes: counts vs normalised counts in DESeq2

0

bekah ▴ 40

@bekah-12633

Last seen 6.9 years ago

Hi,

I am working on RNA-seq data and want to pre-filter it. I was using the sum of the read counts > 100, but then read that I could use the normalised read counts instead. Is this a better filter, given that it is based on data normalised across all samples? I am struggling to view the data after running dss <- estimateSizeFactors(dss), which I need in order to choose a suitable threshold.

Best wishes,

Rebekah

deseq2 pre-filtering • 7.1k views

ADD COMMENT • link updated 7.8 years ago by Michael Love 43k • written 7.8 years ago by bekah ▴ 40

0

Oh sorry, I think I found it:

counts(dss, normalized = TRUE)

Posting it just in case anyone else was looking for this too.

ADD REPLY • link 7.8 years ago bekah ▴ 40

score 1 · Answer 1 · 2018-05-08

1

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

Hi Bekah,

Yes, you can use the normalized counts for pre-filtering.
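
For anyone wanting to see what filtering on the normalised scale amounts to, here is a minimal base-R sketch. It assumes the default median-of-ratios size factors and that counts(dss, normalized = TRUE) is the raw counts divided column-wise by those factors. The toy matrix, the object names and the choice to apply the threshold of 100 to row sums are all made up for illustration; with a real dataset you would work from the DESeqDataSet itself.

```r
# Toy count matrix (made-up values): 4 genes x 4 samples
counts <- matrix(
  c(10, 20, 12, 24,
    100, 200, 90, 210,
    0, 3, 0, 1,
    50, 100, 55, 95),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4))
)

# Median-of-ratios size factors, written out by hand
log_geo_means <- rowMeans(log(counts))   # -Inf for any gene containing a zero
usable <- is.finite(log_geo_means)       # genes with a zero are skipped
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col[usable]) - log_geo_means[usable]))
})

# Normalised counts: each column divided by its size factor
norm_counts <- sweep(counts, 2, size_factors, "/")

# Pre-filter on the normalised scale (threshold is illustrative)
keep <- rowSums(norm_counts) > 100
filtered <- counts[keep, , drop = FALSE]
```

Note that the filter is computed from the normalised values but applied to the raw counts, since the raw counts are what gets passed on to the model.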

ADD COMMENT • link 7.8 years ago Michael Love 43k

0

Hi Michael,

I have now read several posts on pre-filtering and have confused myself.
I understand that pre-filtering isn't necessary when using DESeq2 due to the filtering step that occurs in the DESeq function.

I have the following script, comparing across 20 samples, for removing the genes with very low counts:

library(DESeq2)
dss <- DESeqDataSetFromMatrix(countData = countsall, colData = samplesall, design = ~ condition)
colnames(dss) <- colnames(countsall)
# keep genes whose total count across all samples is above 10
dss <- dss[rowSums(counts(dss)) > 10, ]
dss <- DESeq(dss)

Is this a valid filter to be using? I have seen many posts where instead the filter is applied after running DESeq, but doesn't this mean that low counts are then still included?

Rebekah

ADD REPLY • link 7.5 years ago bekah ▴ 40

0

hi,

That pre-filter is fine.

You could also do:

keep <- rowSums(counts(dds) >= x) >= y

where x and y are meaningful for your data, e.g. x may be a count around 5, and y may be the smallest group size. But our LFC shrinkage methods and the fitting going on inside DESeq() don't technically require filtering.

I wouldn't manually filter after DESeq(). results() does an optimal filter for power when you call it, using either of two published methods (genefilter or IHW). The results() filtering can be turned off with independentFiltering=FALSE.
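
As a rough illustration of the keep filter above, here is a base-R sketch on a toy matrix. The counts, gene names and the values x = 5 and y = 3 are made up; on real data the matrix would come from counts(dds).

```r
# Toy count matrix (made-up values): two groups of 3 samples each
counts <- matrix(
  c(0, 0, 0, 40, 55, 38,    # expressed in the second group only
    2, 0, 1, 0, 3, 1,       # low everywhere
    6, 9, 7, 5, 8, 12,      # modest but consistent
    0, 0, 30, 0, 0, 0),     # one outlying sample
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("a1", "a2", "a3", "b1", "b2", "b3"))
)

x <- 5  # minimum count in a sample (illustrative)
y <- 3  # smallest group size (illustrative)

# Keep genes with a count of at least x in at least y samples
keep <- rowSums(counts >= x) >= y
filtered <- counts[keep, , drop = FALSE]

# For comparison: the plain row-sum filter from earlier in the thread
keep_rowsum <- rowSums(counts) > 10
```

The two filters differ on geneD: its single count of 30 passes the row-sum filter, but it fails the x/y filter because only one sample reaches the minimum count.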

ADD REPLY • link 7.5 years ago Michael Love 43k

0

Cheers for clearing up my confusion!

ADD REPLY • link 7.5 years ago bekah ▴ 40

0

Hi Michael,

If I filter out rows with a total read count below 50

dst27 <- dst27[rowSums(counts(dst27)) >= 50, ]

I get a slightly higher number of DEGs with padj < 0.05 than when filtering with dst27 <- dst27[rowSums(counts(dst27)) > 10, ]

Is this still a valid filter, or am I undermining the assumptions on which DESeq2 runs by applying a row-sum filter of 50 before passing the data through the package?

Best wishes,

Rebekah

ADD REPLY • link 7.5 years ago bekah ▴ 40

2

You can filter at whatever mean count you want; this doesn't disturb the statistical assumptions.

Remember, if you pre-filter too high, you could remove rows which look like: [0, 0, 0] vs [high, high, high].
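
A small base-R sketch of that last point, with made-up counts, gene names and thresholds: a gene that is off in one group and clearly on in the other can still fall below an aggressive row-sum cut-off.

```r
# Toy counts (made-up values): 3 control and 3 treated samples
counts <- rbind(
  on_off = c(0, 0, 0, 30, 28, 35),      # [0, 0, 0] vs [high, high, high]
  steady = c(60, 70, 65, 58, 72, 66)    # similar in both groups
)
group <- factor(c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt"))

# Per-group mean counts, one row per gene
group_means <- t(apply(counts, 1, function(r) tapply(r, group, mean)))

# Two illustrative row-sum thresholds
modest <- rowSums(counts) >= 10
strict <- rowSums(counts) >= 100
```

Here on_off passes the modest threshold but is dropped by the strict one, even though its group means differ far more than those of steady.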

ADD REPLY • link 7.5 years ago Michael Love 43k