Question: Pre-filtering gene counts vs normalised counts in DESeq2

bekah20 wrote, 7 months ago:

Hi,

I am working on RNA-seq data and looking to pre-filter my data. I was using the sum of the read counts > 100, but then read that I could use the normalised read counts instead. Is this a better filter, as it is based on data normalised across all samples? I am struggling to view the data after running dss <- estimateSizeFactors(dss) in order to choose a suitable threshold.

Best wishes,

Rebekah

modified 7 months ago by Michael Love20k • written 7 months ago by bekah20

Oh sorry, I think I found it:

counts(dss, normalized = TRUE)

just in case anyone else was looking for this also.

written 7 months ago by bekah20
Michael Love20k (United States) wrote, 7 months ago:

Hi Bekah, 

Yes, you can use the normalized counts for pre-filtering.
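A minimal sketch of such a filter, assuming `dss` is the DESeqDataSet from the question (the threshold of 10 is an arbitrary placeholder you would tune for your own data):

```r
library(DESeq2)

# size factors must be estimated before normalized counts are available
dss <- estimateSizeFactors(dss)

# keep genes whose summed normalized counts exceed the chosen threshold
norm.counts <- counts(dss, normalized = TRUE)
dss <- dss[rowSums(norm.counts) > 10, ]
```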

written 7 months ago by Michael Love20k

Hi Michael,

I have now read several posts on pre-filtering and have confused myself.
I understand that pre-filtering isn't necessary when using DESeq2 due to the filtering step that occurs in the DESeq function.

I have this script for comparing across 20 samples, which removes the genes with very low counts:

dss <- DESeqDataSetFromMatrix(countData = countsall, colData = samplesall, design = ~ condition)
colnames(dss) <- colnames(countsall)
dss <- dss[rowSums(counts(dss)) > 10, ]
dss <- DESeq(dss)

Is this a valid filter to be using? I have seen many posts where the filter is instead applied after running DESeq, but doesn't that mean that low counts are still included?

Rebekah

written 4 months ago by bekah20

hi,

That pre-filter is fine.

You could also do:

keep <- rowSums(counts(dds) >= x) >= y

where x and y are meaningful for your data, e.g. x may be a count around 5, and y may be the smallest group size. But our LFC shrinkage methods and the fitting going on inside DESeq() don't technically require filtering.
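As one illustration, with 20 samples split into two groups of 10 (the group sizes and the value of x here are made up for the example), the filter could be applied as:

```r
# keep genes with a count of at least 5 in at least 10 samples
# (5 and 10 stand in for the x and y described above)
keep <- rowSums(counts(dds) >= 5) >= 10
dds <- dds[keep, ]
```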

I wouldn't manually filter after DESeq(). results() does an optimal filter for power when you call it, using either of two published methods (genefilter or IHW). The results() filtering can be turned off with independentFiltering=FALSE.
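A sketch of the two results() calls, assuming a DESeqDataSet `dds` that has already been run through DESeq():

```r
# default: results() applies independent filtering to maximize power
res <- results(dds)

# the automatic filtering can be turned off
res.unfiltered <- results(dds, independentFiltering = FALSE)
```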

written 4 months ago by Michael Love20k

Cheers for clearing up my confusion!

written 4 months ago by bekah20

Hi Michael,

If I filter out rows with a total read count of less than 50,

dst27 <- dst27[rowSums(counts(dst27)) >= 50, ]

I get a slightly higher number of DEGs with padj < 0.05 than when filtering with dst27 <- dst27[rowSums(counts(dst27)) > 10, ].

Is this still a valid filter, or am I undermining the assumptions on which DESeq2 runs by applying a rowSums filter of 50 before passing the data through the package?

Best wishes,

Rebekah

modified 4 months ago • written 4 months ago by bekah20

You can filter at whatever mean count you want; this doesn't disturb the statistical assumptions.

Remember, if you pre-filter too high, you could remove rows which look like: [0, 0, 0] vs [high, high, high].
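A toy illustration of that kind of row, with made-up counts for a single gene that is off in one group of three samples and high in the other:

```r
# counts for one gene: absent in group A, high in group B
gene <- c(0, 0, 0, 90, 95, 100)

# a rowSums filter at 50 keeps this clearly differential gene
sum(gene) > 50

# but requiring a count of at least 5 in all 6 samples would discard it
sum(gene >= 5) >= 6
```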

written 4 months ago by Michael Love20k