competitive enrichment test yields too few significant results, self-contained too many
1
0
Entering edit mode
bbao ▴ 10
@8a90e79d
Last seen 11 weeks ago
United States

Hi,

I have RNA-seq data from two conditions for which I am trying to identify metabolic differences. I used DESeq2 to calculate log2FCs and thresholded to yield DEGs, which I then input into GO enrichment and GSEA. But not many metabolic gene sets emerged as statistically significant.

I reasoned that a self-contained test might reveal more differences than the competitive tests used by GO/GSEA, and so I decided to give fry a try. I re-processed the RNA-seq data starting from raw counts using the edgeR workflow, and then used fry. But out of ~11,000 gene sets that I tested, about 85% come out as statistically significant (FDR < 0.05). Is this reasonable/expected?

In the literature I have found a few papers that have said that self-contained tests can be too powerful and yield too many significant gene sets, which are not always relevant to the biology of interest. What should I do when a competitive test does not yield many results, and a self-contained test yields too many?

Thanks!

limma DESeq2 fry edgeR • 5.4k views
ADD COMMENT
1
Entering edit mode

Lets see your code. 8000 significant genesets indicate a flaw somewhere.

ADD REPLY
0
Entering edit mode

Hi ATpoint, I added my code in the comment below. Would appreciate any pointers if you see any glaring errors!

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

Actually, preranked GSEA is known to be highly anti-conservative, because it ignores inter-gene correlations, and can give significant results for gene sets that are not DE at all. edgeR's fry method on the other hand, accounts for inter-gene correlations and gives pretty good error rate control. Howevery, fry was never designed to be used with large databases of gene sets. Why not use edgeR's camera or cameraPR methods, which are designed for that situation?

As a general rule, competitive tests are most appropriate in situations where there are many DE genes and lots of gene sets, whereas self-contained are more appropriate when there is only modest DE or a focused collection of gene sets of key interest.

Getting 85% of gene sets as signficant from fry sounds very, very unusual, but it is possible depending on your data and on the gene sets, especially if many of the gene sets contain the same genes.

You haven't told us anything about your data, or about the gene sets, or the code you used, so really anything could be happening. You could, for example, have a dataset in which almost every gene is DE to the same extent. In that situation, competitive tests will never be significant but self-cointained tests will always be. Or, of course, there's always the possibility of a coding error, e.g. testing for an intercept instead of a contrast.

ADD COMMENT
0
Entering edit mode

Hi Gordon,

Thanks for the advice, I gave camera a try, and it yielded around ~1,000 statistically significant gene sets from the same 11,000 gene sets tested. This is still much larger than what I got with preranked GSEA, so it sounds like there might be an error in my code?

The gene sets I am using are the GO Biological Process terms, the KEGG gene sets, Reactome gene sets, and two custom lists. fry/camera are run for each of these groups of gene sets separately, and then FDR-controlled separately as well. Maybe the hierarchical nature of the GO terms is inflating the percentage of significant gene sets?

Regarding my data, from DESeq2 results I see that there is on the order of 1,000 genes up/down with a log2FC cutoff of 0.585 (equivalent to a FC cutoff of 1.5). I also have another contrast that yields much fewer DEGs (on the order of 100 up/down for the same 0.585 cutoff), but that contrast is also yielding about 1,000 significant gene sets, same as the first contrast.

Here is an edited version of my script with the general workflow:

# Create sample metadata data frame
# 4 replicates in each condition (16 samples total)
conditions <- c("Sample_A_Treatment_1", "Sample_A_Treatment_2", "Sample_B_Treatment_1", "Sample_B_Treatment_2")
run_list <- paste0("DW", sprintf("%02d", 1:16))    # sample IDs
repeated_conditions <- rep(conditions, each = 4)
sample_id_list <- rep(run_list, length.out = length(run_list) * length(conditions))
samples <- data.frame(
  run = run_list,
  condition = factor(rep(repeated_conditions, length.out = length(run_list))),
  sample_id = run_list)

# Import counts
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)

# Create DGEList object
y <- DGEList(counts = txi$counts, group = samples$condition)

# Filter lowly expressed genes
design_matrix <- model.matrix(~ 0 + group, data = y$samples)
colnames(design_matrix) <- levels(y$samples$group)
keep <- filterByExpr(y, design = design_matrix)
y <- y[keep, , keep.lib.sizes=FALSE]

# Normalize for library size, estimate dispersions, fit QL model
y <- calcNormFactors(y)
y <- estimateDisp(y, design_matrix, robust=TRUE)
fit <- glmQLFit(y, design_matrix, robust=TRUE)

# Define contrasts
contrast_1 = "Sample_A_Treatment_2-Sample_A_Treatment_1"
contrast_2 = "Sample_B_Treatment_2-Sample_B_Treatment_1"
all_contrasts <- c(contrast_1, contrast_2)
contrast_matrices <- makeContrasts(contrasts = all_contrasts, levels = design_matrix)

# Load gene sets from MSigDB and custom GMT files
all_gene_sets <- list()
# code to load gene sets (skipped for brevity)
indexed_gene_sets <- lapply(all_gene_sets, function(db) {
  ids2indices(db, identifiers = rownames(y))
})

# run fry
fry_results_list <- list() # Initialize an empty list to store all results

# Loop through each defined contrast
for (i in 1:length(all_contrasts)) {

  current_contrast_name <- all_contrasts[i]
  cat(paste("\n--- Processing Contrast: ", current_contrast_name, " ---\n"))

  # Get the specific contrast matrix for the current contrast
  current_contrast_matrix <- contrast_matrices[, i, drop = FALSE] # drop=FALSE keeps it a matrix

  # Loop through each gene set database (e.g., GO:BP, KEGG, Custom)
  for (db_name in names(indexed_gene_sets)) {

    cat(paste("  Running fry for Database:", db_name, "...\n"))

    # Run the fry test for the current contrast and database
    fry_res <- fry(y,
                   index = indexed_gene_sets[[db_name]],
                   design = design_matrix,
                   contrast = current_contrast_matrix)

    # Add the database name, pathway names, and the contrast name for clarity
    fry_res <- fry_res %>%
      rownames_to_column(var = "Pathway") %>%
      mutate(
        Database = db_name,
        Contrast = current_contrast_name # Add the contrast name here
      )

    # Append the results to main list
    fry_results_list[[paste(db_name, current_contrast_name, sep="_")]] <- fry_res
  }
}

# Combine all results into one table, calculate FDR, and sort by it.
final_fry_results <- bind_rows(fry_results_list) %>%
  # Group by Contrast and Database before adjusting p-values.
  group_by(Contrast, Database) %>%
  mutate(FDR = p.adjust(PValue, method = "BH")) %>%
  ungroup() %>% # Ungroup for easier downstream manipulation
  select(Contrast, Database, Pathway, NGenes, Direction, PValue, FDR) %>%
  arrange(Contrast, PValue) # Sort by contrast and then by p-value

# save fry results
write.csv(final_fry_results, output_file, row.names = FALSE)

# run camera using same loop as for fry
ADD REPLY
1
Entering edit mode

The heirarchical nature of GO tends to reduce significant results from competitive tests (because parents compete with children) and increase significant results from self-contained tests (because parents duplicate children). But the MSigDB has mitigated that to a large extent by removing GO terms with over 500 genes or with less than a minimum number of genes.

ADD REPLY
0
Entering edit mode

That makes sense!

ADD REPLY
1
Entering edit mode

Does it make sense to give the DGEList `y` to fry (or camera, roast etc) directly? It expects logcpms, so I would assume that you should be giving `edgeR::cpm(y, log = TRUE)` instead. Not sure it internally has a method for using the DGEList and the effective lib.size to produce those. It would sort of surprise me since the method lives in limma which predates edgeR and does not depend on it. Probably you get the odd results because counts are not log2 and not normalized. I would give it a try with the logcpms and see if that gives more realistic results.

Unrelated: I might be paranoid in these things, but for loops in R (and probably elsewhere) have the caveat that intermediate output goes to the global environment, so if the loop errors or produces unwanted output anywhere it can be that the variables from the previous iteration for the downstream operations. I will die on my hill suggesting people to almost always use lapply which stores intermediates in its local environment, avoiding these side effects.

ADD REPLY
2
Entering edit mode

fry() is a generic function with a method for DGEList objects, see ?fry.DGEList. It doesn't convert to logCPMs but uses another edgeR-specific method of computing normal-like deviates. There's a couple of examples of its use in the edgeR User's Guide.

ADD REPLY
0
Entering edit mode

Thanks ATpoint, I will keep this in mind!

ADD REPLY

Login before adding your answer.

Traffic: 466 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6