Question

Adjusted p-values become NA when sub-setting samples

0

Entering edit mode

txema.heredia • 0

@01ec307d

Last seen 11 months ago

Netherlands

Hi,

I've asked this in Biostars but I've been told I might have better luck here. I'll copy my question below:

~~(I'm currently editing the post to paste it because the editors is giving me trouble during the first posting)~~

I have a bulk RNA dataset. Mouse data. 2 genetic backgrounds (WT and mutant(Dmin)) which are a model of human disease. For each background I have:

Young Control (6wk, AL)
Adult Control (22wk, AL)
Adult Treatment (22wk, DR)

4-5 samples each.

Although this data comes from a single sequencing experiment, the analysis was split into 2 papers led by different people (there is a lot of additional wet lab work on each of them, the RNA-seq is just a fraction of the analyses). Because of this, the first analysis was done using only the Young vs Adult (Control) samples (excluding Adult Treated samples from the DESeq2 object to prevent their interference and keep a balanced Young/Adult ratio) and has already been published.

The second analysis focuses on the Treatment part in Adults, but builds up on the results from the first analysis. It was decided that, similarly to what was done on the first analysis, here we would exclude the Young samples, and limit the DESeq2 object to only the Adult ones (both Treated and Control). However, an important part of the analysis focuses on comparing the DEG that changed in Mutant mice as they grew older (data generated ignoring adult+treatment mice), and how the treatment in adults affects some of those genes (data generated ignoring young mice). You can see where I'm going...

I have a couple of very interesting biologically relevant genes. If I analyze the data using all the samples (from now on deemed "ALL") both are significant. If I analyze the data using only the adult samples (now deemed "ADULT"), both genes have padj = NA. I am focusing mostly in the comparison 22DminDR vs 22DminAL (Adult Mutant samples, Treated vs Control)

This is how the normalized counts look like:

Normalized counts

The dashed line is the baseMean in ADULT. The solid line is in ALL. The color of the boxplots is the sample's age. the ADULT analysis uses only the blue ones. The groups are significant when using a naive and simple T-test.

And these are the results of DESeq2 for the 22DminDR vs 22DminAL comparison:

ALL samples:

> res_all[c("Spp1","Cd68"),]
log2 fold change (MAP): modelvar 22DminDR vs 22DminAL 
Wald test p-value: modelvar 22DminDR vs 22DminAL 
DataFrame with 2 rows and 5 columns
      baseMean log2FoldChange     lfcSE      pvalue        padj
     <numeric>      <numeric> <numeric>   <numeric>   <numeric>
Spp1   35.4312       -2.72427  0.473436 2.76300e-10 1.79236e-06
Cd68   39.0621       -1.08386  0.320647 2.81564e-05 4.35931e-03

ADULT samples:

> res_adult[c("Spp1","Cd68"),]
log2 fold change (MAP): modelvar 22DminDR vs 22DminAL 
Wald test p-value: modelvar 22DminDR vs 22DminAL 
DataFrame with 2 rows and 5 columns
      baseMean log2FoldChange     lfcSE      pvalue      padj
     <numeric>      <numeric> <numeric>   <numeric> <numeric>
Spp1   48.1062       -2.80547  0.336887 1.39669e-18        NA
Cd68   44.6694       -1.03332  0.305645 2.64104e-05        NA

I have been reading about this from several sources:

https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pvaluesNA

Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:

If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.

If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook's distance. Customization of this outlier filtering and description of functionality for replacement of outlier counts and refitting is described below

If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below

And following a few links,

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilt

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#indfilttheory

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers

DESeq2 cooksCutoff ON/OFF: pairwise contrasts for many groups, some with only 2 reps

I have been checking all these values for both comparisons:

Cooks distance:

ADULT samples:

Control (Adult, Mutant)

> assays(dds_adult)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminAL",]$sample]
            005        013          004         026
Spp1 0.16294652 0.08470974 5.871258e-01 0.003260566
Cd68 0.09220205 0.03160875 6.519192e-05 0.025208679

Treatment (Adult, Mutant)

> assays(dds_adult)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminDR",]$sample]
             006          014        029        016          022
Spp1 0.097962191 0.0001170788 0.03350797 0.11415755 0.3792535252
Cd68 0.008037438 0.0421175078 0.04893443 0.08676744 0.0006556893

ALL samples:

Control (Adult, Mutant)

 > assays(dds_all)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminAL",]$sample]
            005        013         004         026
Spp1 0.05697454 0.04023993 0.250256327 0.002499021
Cd68 0.06793757 0.02701774 0.001844374 0.023258608

Treatment (Adult, Mutant)

> assays(dds_all)[["cooks"]][c("Spp1","Cd68"), md[md$modelvar=="22DminDR",]$sample]
             006          014        029        016         022
Spp1 0.052105796 1.310099e-05 0.02087917 0.06202494 0.165106367
Cd68 0.001446295 1.449497e-02 0.03282256 0.03355575 0.005948603

I don't think the cooks distance differences are large enough to justify removing these genes from the analysis

Independent Filtering

Number of rejections: (ADULT vs ALL)

plot(metadata(res_all)$filterNumRej, 
     type="b", ylab="number of rejections",
     xlab="quantiles of filter", main="ALL")
lines(metadata(res_all)$lo.fit, col="red")
abline(v=metadata(res_all)$filterTheta)

Rejections Adult

Rejections All

I honestly don't know what I'm really looking at here, but both these figures look wildly different with the example in the Bioconductor vignette.

Rejections bioconductor

Filter threshold

ADULT samples

> metadata(res_adult)$filterThreshold
75.5017% 
49.8858

# "natural" pval = NAs
> summary(res_adult[is.na(res_adult$pvalue),]$baseMean)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
  0.00000   0.00000   0.00000   0.08045   0.00000 133.13102 

# pval != NA, but padj = NA
> summary(res_adult[!is.na(res_adult$pvalue) & is.na(res_adult$padj),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.01763  0.13640  0.73736  6.59878  6.21863 49.88463 

# padj != NA
> summary(res_adult[!is.na(res_adult$pvalue) & !is.na(res_adult$padj),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   49.89    87.80   153.62   350.21   296.26 44713.67

ALL samples

> metadata(res_all)$filterThreshold
61.8497% 
5.917652

# "natural" pval = NAs
> summary(res_all[is.na(res_all$pvalue),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00000  0.00000  0.00000  0.06754  0.00000 74.84557 

# pval != NA, but padj = NA
> summary(res_all[!is.na(res_all$pvalue) & is.na(res_all$padj),]$baseMean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01110 0.06538 0.23794 0.82457 0.97549 5.91727 

# padj != NA
> summary(res_all[!is.na(res_all$pvalue) & !is.na(res_all$padj),]$baseMean)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    5.92    30.31    80.68   231.81   196.71 42934.59

I don't know what I should be expecting here, but having the baseMean threshold jump from 5.9 to 49.8 looks like the exact reason why these genes are being excluded from the analysis.

It seems it is clearly driven by the independent filtering.

use <- res_all$baseMean > metadata(res_all)$filterThreshold
h1 <- hist(res_all$pvalue[!use], breaks=0:50/50, plot=FALSE)
h2 <- hist(res_all$pvalue[use], breaks=0:50/50, plot=FALSE)
colori <- c(`do not pass`="khaki", `pass`="powderblue")
barplot(height = rbind(h1$counts, h2$counts), beside = FALSE,
        col = colori, space = 0, main = "ALL", ylab="frequency")
text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)),
     adj = c(0.5,1.7), xpd=NA)
legend("topleft", fill=rev(colori), legend=rev(names(colori)))

pass baseMean threshold Adult

pass baseMean threshold All

Which are, again, wildly different from what is described in the vignette

pass baseMean threshold bioConductor

It seems clear that the choice of sample subsetting (or not) causes a change in the baseMean of the gene, but, most importantly, in the minimum baseMean threshold that the independent filter uses to remove genes from the analysis.

What am I really looking at here? What numbers should I be expecting here? Is a baseMean filter of 49.8 reasonable or is it extremely high? Should I keep using for the analysis only the ADULT samples, or should I convince my coauthors to use ALL the samples? (that might bias other interesting genes, though, I think that's in fact why we ended up choosing the ADULT subset).

Or is there something weird in this data that caused the independentFiltering to go crazy and I should simply ignore it and re-run the analysis with results(dds, independentFiltering=FALSE) ? Should I just leave it be and mention that these two genes are significantly up/down-regulated but their dispersion is too high to give an accurate log2FC estimate?

Thanks in advance

DESeq2 • 1.4k views

ADD COMMENT • link 14 months ago • updated 13 months ago txema.heredia • 0

0

Entering edit mode

I have been back to this after a break. I have run a series of pairwise comparisons using either all the samples together, or only a subset-of-interest of them. Then I looked at how filterThreshold changed on each of them:

comparison	Untreated	Adult	all_samples
modelvar_22DminAL_vs_22WTAL	1.3812616	1.762603	1.22434
modelvar_22DminAL_vs_6DminAL	2.8012694	NA	1.981619
modelvar_22DminDR_vs_22DminAL	NA	49.885802	5.917652
modelvar_22DminDR_vs_22WTDR	NA	1.376374	1.558095
modelvar_22WTAL_vs_6WTAL	6.2974953	NA	2.53396
modelvar_22WTDR_vs_22WTAL	NA	3.663104	3.335337
modelvar_6DminAL_vs_6WTAL	2.1852701	NA	1.981619
sex_F_vs_M	0.8826338	8.378841	4.394078

(modelvar is how I named the grouping variable, and I used a design = modelvar+sex because there is a (balanced) mix of male and female mice)

(The Untreated group only includes AL (control) samples both young and adult; the Adult group only includes Adult (22wk) samples both control and treated)

As you can see, only the modelvar_22DminDR_vs_22DminAL shows a relevant change in filterThreshold, but the new threshold is set to levels that I would deem unacceptably high. The rest of comparisons show very stable (and low) threshold levels.

Sex also shows an increase in that threshold, but it is 5x smaller, and it makes sense because sex-associated genes are more highly expressed in adults than in younger mice.

I have worked around this by applying the threshold I got when using All samples to the results of the same comparison in the Adult subset, and then calculating the adjusted p-value for the genes passing the new (lower) threshold:

res_adult$new_padj <- NA
res_adult[res_adult$baseMean > 5.917652,]$new_padj <- p.adjust(res_adult[res_adult$baseMean > 5.917652,]$pvalue, method = "BH")

This gives me:

Out of 378 significant genes (padj), 57 were significant with the old (high) threshold, but are no longer significant after multiple correction. They all had weak adjusted p-values very close to 0.05, that now are slightly above that.

> dim( res_adult[which(res_adult$padj < 0.05),])
[1] 378   6
> dim( res_adult[which(res_adult$padj < 0.05 & !(res_adult$new_padj < 0.05)),]) 
[1] 57  6
> summary( res_adult[which(res_adult$padj < 0.05 & !(res_adult$new_padj < 0.05)),]$padj) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03743 0.04008 0.04340 0.04303 0.04514 0.04937 
> summary( res_adult[which(res_adult$padj < 0.05 & !(res_adult$new_padj < 0.05)),]$new_padj) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05047 0.05406 0.05800 0.05780 0.06066 0.06579

Out of 372 significant genes (new_padj), 51 were previously deemed padj = <NA>, and are now significant, with 15 genes with new_padj < 0.01.

> dim( res_adult[which(res_adult$new_padj < 0.05),])
[1] 372   6
> dim(res_adult[which( is.na(res_adult$padj) & (res_adult$new_padj < 0.05)),])
[1] 51  6
> summary(res_adult[which( is.na(res_adult$padj) & (res_adult$new_padj < 0.05)),]$padj)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     NA      NA      NA     NaN      NA      NA      51 
> summary(res_adult[which( is.na(res_adult$padj) & (res_adult$new_padj < 0.05)),]$new_padj)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.007103 0.018914 0.020931 0.031943 0.048885 
> dim( res_adult[which( is.na(res_adult$padj) & (res_adult$new_padj < 0.01) ), ] )
[1] 15  6

Is this the correct way to go about it?

And, most importantly, how can I prevent this from happening again? I am afraid that some of the analyses I did in the past might be facing this issue? Should I be paranoid and store the filterThreshold of each comparison I run in DESeq2 moving forward and check them in case I find another super-high threshold like this one?

ADD REPLY • link 13 months ago txema.heredia • 0

0

Entering edit mode

Just saw this post, for some reason I didn't get it the first time.

ADD REPLY • link 13 months ago Michael Love 43k

score 1 · Answer 1 · 2024-10-28

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 9 hours ago

United States

One thing to do is just turn off independentFiltering altogether. If you know what are reasonable filters that maintain power by discarding genes with "too low" counts then go ahead and implement that with independentFiltering=FALSE. "Too low" is context dependent, so it's easier for an analyst to apply than for software to figure out. What this routine (from genefilter package) does is to compute too low based on power, but in some circumstances, like comparing across pairwise comparisons, it's more desirable to have consistent filtering.

ADD COMMENT • link 13 months ago Michael Love 43k

0

Entering edit mode

Thanks!

I've been looking into DESeq2's code and, indeed, I get matching results when running manually the code inherited from the genefilter package.

For completeness sake, I made some plots with the cutoff and numRej values generated by it:

cutoff x quantile

log110(cutoff) x quantile

density log10(baseMean)

numRej x quantile

And this is how the numRej fares in all the different pairwise comparisons using the same datasets:

numRej x quantile All comparisons

The solid lines are the two comparisons I showed in the plot above, the dashed lines are the rest.

The lines high on top are WT vs Mutant comparisons that have a ton of DEG.

theta x threshold All comparisons

And how the theta matches the baseMean threshold.

I suppose my samples-of-interest numRej distribution is not what the method expects, as it is the only one in which there is a "peak" very far removed from ~0.5.

I'll keep all this in mind for future analyses and double check it every time.

Thanks for your help

ADD REPLY • link 13 months ago txema.heredia • 0