Question: Pre-filtering on gene counts vs normalised counts in DESeq2
bekah10 wrote, 3 months ago:

Hi,

I am working on RNA-seq data and looking to pre-filter it. I was filtering on the sum of the read counts being > 100, but then read that I could use the normalised counts instead. Is that a better filter, since it is based on data normalised across all samples? I am also struggling to view the data after running dss <- estimateSizeFactors(dss) in order to choose a suitable threshold.

Best wishes,

Rebekah

modified 3 months ago by Michael Love • written 3 months ago by bekah10

Oh sorry, I think I found it:

counts(dss, normalized = TRUE)

just in case anyone else was looking for this too.
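A minimal sketch of how one might pick a threshold from these values (assuming a DESeqDataSet named dss that has already been through estimateSizeFactors; the cutoffs here are only illustrative):

```r
library(DESeq2)

# dss is assumed to be a DESeqDataSet with size factors already estimated
norm.counts <- counts(dss, normalized = TRUE)

# Look at the distribution of per-gene totals to pick a sensible cutoff
summary(rowSums(norm.counts))
hist(log10(rowSums(norm.counts) + 1), breaks = 50)
```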

written 3 months ago by bekah10
Michael Love wrote, 3 months ago, United States:

Hi Bekah, 

Yes, you can use the normalized counts for pre-filtering.

written 3 months ago by Michael Love

Hi Michael,

I have now read several posts on pre-filtering and have confused myself.
I understand that pre-filtering isn't necessary when using DESeq2 due to the filtering step that occurs in the DESeq function.

I have this script, comparing across 20 samples, for removing the genes with very low counts:

dss <- DESeqDataSetFromMatrix(countData = countsall, colData = samplesall, design = ~ condition)
colnames(dss) <- colnames(countsall)
dss <- dss[rowSums(counts(dss)) > 10, ]
dss <- DESeq(dss)

Is this a valid filter to be using? I have seen many posts where the filter is instead applied after running DESeq, but doesn't that mean low counts are still included?

Rebekah

written 16 days ago by bekah10

hi,

That pre-filter is fine.

You could also do:

keep <- rowSums(counts(dds) >= x) >= y

where x and y are meaningful for your data, e.g. x may be a count around 5 and y may be the smallest group size. But our LFC shrinkage methods and the fitting going on inside DESeq() don't strictly require filtering.
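A concrete instance of that filter (toy values: x = 5 as the count cutoff, and 3 standing in for the smallest group size; substitute your own):

```r
# Keep genes with a count of at least 5 in at least 3 samples
keep <- rowSums(counts(dds) >= 5) >= 3
dds <- dds[keep, ]
```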

I wouldn't manually filter after DESeq(). results() performs an optimal filter for power when you call it, using either of two published methods (genefilter or IHW). This filtering can be turned off with independentFiltering=FALSE.
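For reference, a sketch of what that looks like in practice (assuming dds has been through DESeq()):

```r
# Default: independent filtering is on
res <- results(dds)

# Disable the automatic filtering if you want all genes tested
res.nofilt <- results(dds, independentFiltering = FALSE)

# The mean-count threshold that results() chose is recorded in the metadata
metadata(res)$filterThreshold
```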

written 15 days ago by Michael Love

Cheers for clearing up my confusion!

written 14 days ago by bekah10

Hi Michael,

If I filter out rows with a sum of read counts less than 50,

dst27 <- dst27[rowSums(counts(dst27)) >= 50, ]

I get a slightly higher number of DEGs with padj < 0.05 than when filtering with dst27 <- dst27[rowSums(counts(dst27)) > 10, ].


Is this still a valid filter, or am I undermining the assumptions on which DESeq2 runs by applying a row-sum filter of 50 before passing the data through the package?

Best wishes,

Rebekah

modified 7 days ago • written 7 days ago by bekah10

You can filter at whatever mean count you want; this doesn't disturb the statistical assumptions.

Remember, if you pre-filter too high, you could remove rows which look like: [0, 0, 0] vs [high, high, high].
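To illustrate that pitfall with toy numbers: a gene expressed in only one group can have a modest overall mean even though it is clearly differential, so an aggressive mean-count filter would throw it away.

```r
# Toy example: 3 control samples with 0 counts, 3 treated samples with 100
x <- c(0, 0, 0, 100, 100, 100)
mean(x)  # 50 -- a filter requiring mean count > 50 would discard this gene
```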

written 6 days ago by Michael Love
Powered by Biostar version 2.2.0