Question

DESeq2, Still a most genes get padj=NA after declare cooksCutoff=FALSE in the results () function

0

Entering edit mode

colaneri ▴ 30

@colaneri-7770

Last seen 5.1 years ago

United States

Hi Im doing DGE analysis with DESeq2 to test difference in gene expression under 2 conditions. As other people reported in this foro I got a bunch of genes with NA. However in my case the NA value is in their majority for padj. (see table below) In other words I have a bunch of genes with very low pvalue but got a NA por padj.

I read in the vignette that, at least for p-values the NA represent genes with outliers count in one or more groups. I set the cooksCutoff=FALSE in the results() function but I still got the same results.

When I plot counts for one of these genes:

plotCounts(ddsIAAinCol0,"AT1G11090", intgroup = "treatment")

it does not seems to be an outlier

Any suggestion?

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
AT5G47435	185.6364162	2.115139359	0.366914049	5.764672585	8.18E-09	1.37E-05
AT5G66055	185.3092117	1.644995185	0.293873843	5.597623681	2.17E-08	1.81E-05
AT1G11090	47.78777684	1.94133198	0.364829883	5.321197823	1.03E-07	NA
AT2G29100	15.17214648	1.666409849	0.450228259	3.701255565	0.000214535	NA
AT1G72600	13.04997183	-1.767013802	0.479422186	-3.685715542	0.000228061	NA
AT5G57290	75.34066235	-1.224003948	0.332385985	-3.682477613	0.000230978	NA
AT2G39460	33.69752389	-1.517438704	0.412598155	-3.677764148	0.000235287	NA
AT3G07050	62.78888747	-1.318916065	0.359422	-3.669547401	0.00024298	NA
AT1G10400	8.787243217	-1.793799153	0.489431764	-3.665064845	0.000247276	NA
AT1G48920	128.8451058	-1.121043683	0.306312462	-3.659804357	0.000252408	0.011708921
AT5G51690	56.90610637	1.385338028	0.380859561	3.637398583	0.000275406	NA
AT2G33060	82.5823932	-1.080766384	0.297955375	-3.627276014	0.000286427	0.012927924
AT4G16630	71.83981053	-1.231184086	0.341456276	-3.605685917	0.00031133	NA
AT5G55730	27.34342486	-1.483301796	0.412955145	-3.591919884	0.000328251	NA
AT5G65360	37.24544013	-1.324569311	0.369607195	-3.583721667	0.000338733	NA
AT1G09210	67.97424741	-1.169578499	0.327204788	-3.574454109	0.000350959	NA
AT2G18270	2119.114644	0.967748299	0.271131214	3.569298738	0.000357938	0.015522311
AT1G01160	153.078235	1.016068318	0.285036805	3.564691653	0.000364284	0.015522311
AT2G34480	103.1868193	-1.066463342	0.299624136	-3.559337231	0.000371792	0.015522311
AT4G14890	47.15691661	1.259644836	0.354034502	3.557971978	0.000373729	NA
AT4G30280	46.33922434	1.270915217	0.358566567	3.544433127	0.000393458	NA
AT1G06430	5.003252608	1.744025568	0.493321579	3.535271195	0.000407357	NA
AT2G18290	544.7909982	0.891251585	0.253231577	3.519512043	0.000432341	0.017610004
AT4G37910	42.46839645	-1.389330143	0.396446696	-3.504456353	0.000457541	NA
AT5G23420	22.05896603	-1.579334138	0.450931699	-3.502379943	0.000461122	NA
AT4G14320	54.99615495	-1.230414257	0.351421993	-3.501244323	0.000463091	NA
AT1G29310	10.29744763	-1.675254517	0.479076434	-3.496841832	0.000470801	NA
AT3G15960	26.15442415	-1.598265985	0.45751882	-3.493333858	0.00047703	NA
AT3G62600	51.08063801	-1.271109756	0.364746638	-3.48491151	0.0004923	NA

deseq2 NA cooksCutoff • 4.6k views

ADD COMMENT • link updated 8.1 years ago by Michael Love 41k • written 8.1 years ago by colaneri ▴ 30

score 0 · Answer 1 · 2016-03-29

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 day ago

United States

Please see the FAQ in the vignette.

ADD COMMENT • link 8.1 years ago Michael Love 41k

0

Entering edit mode

Dear Michael, according to the vignette the automatic independent filtering is setting the adjusted p values to NA because it considerer that those genes have a low mean normalized count. I found that only 1600 genes were considered to have a good number of counts. When I look at the data I found that 3000 genes with counts between 20 to 70 are considered low. It is that right? is a value of 20 a low mean normalized count?

I understand that if I remove the if I remove the independent filtering (independentFiltering=FALSE)) I have not more NA, but then I'm worry about where differences found in genes with mean normalized count between 20 to 70 can be trusted. Can you give me some insight?

I will appreciate your help

ADD REPLY • link 8.1 years ago colaneri ▴ 30

0

Entering edit mode

The threshold is chosen using the genefilter package. You can read more about the strategy in the citation for the genefilter package, or the section of the DESeq2 paper which discusses it.

Briefly: it is not that the genes with mean normalized count less than the data-driven threshold cannot be trusted.
"Trusted" implies that there would be a problem with specificity / false positives / Type I error if we included them. That is not the case. You can filter or not, and this won't affect the specificity (we have a plot showing the uniformity of p-values for various thresholds in the DESeq2 paper).

The reason for filtering low count genes in DESeq2 is to increase sensitivity for the dataset as a whole (i.e. reduces false negatives / Type II error).

However, it's up to you in the end as the data analyst. You can use the filtering or you can set independentFiltering=FALSE and remove the genes with low counts, or keep all the genes with at least a single count per row, etc.

ADD REPLY • link 8.1 years ago Michael Love 41k