Question

Assist me in determining whether the analysis process using the limma package has been executed correctly.

0

Entering edit mode

SSSJec • 0

@21f1d1c7

Last seen 11 months ago

United States

I utilized the R package GEOquery to download data from the GEO database, followed by DFG analysis using the limma package. However, I am uncertain whether this process is accurate. The DFG output indicates that each P-value is significant, but each adjusted P-value tends towards 1. I attempted this process with several datasets using the limma R package and encountered the same outcome consistently. Thus, I suspect an error in my analysis code. Any assistance in resolving this issue would be greatly appreciated.

Here is my analysis code and its corresponding output:

options(stringsAsFactors = F) 
suppressMessages(library(GEOquery))
library(stringr)
library(ggplot2)
library(reshape2)
library(limma)

gset <- getGEO('GSE98421', destdir=".", AnnotGPL = F, getGPL = F)
gse <- gset[[1]]
exp <- exprs(gse)

# Clinic data 
meta.data <- pData(gse) 
group_list <- sapply(meta.data$description, function(x) 
    unlist(strsplit(x," "))[1])
group_list <- factor(group_list,levels = c("NL","PCOS"))
table(group_list)

# change ID names
gpl <- getGEO('GPL570', destdir=".")
probe.gene <- Table(gpl)[,c("ID","Gene Symbol")]
colnames(probe.gene) <- c("ID","Symbol")
probe.gene <- subset(probe.gene, Symbol != '')
ids <- probe.gene
ids$Symbol <- data.frame(sapply(ids$Symbol, function(x) 
    unlist(strsplit(x,"///"))[1]), stringsAsFactors = F)[,1]
head(ids)

library(tidyverse)
exp <- as.data.frame(exp) 
exp <- exp %>% 
  mutate(ID = rownames(exp)) %>% 
  inner_join(ids, by = "ID") %>%
  select(ID, Symbol, everything())

table(duplicated(exp$Symbol))
exp.unique <- avereps(exp[, -c(1,2)], ID = exp$Symbol)

#############################################################
# DFG analysis
design <- model.matrix(~0+factor(group_list))
colnames(design) <- levels(factor(group_list))
rownames(design) <- colnames(exp.unique)

fit <- lmFit(exp.unique, design)
contrast.matrix <- makeContrasts(paste0(c("NL","PCOS"), collapse = "-"), levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix) 
fit2 <- eBayes(fit2) 
DiffEG <- topTable(fit2, coef=1, n=Inf) %>% na.omit()
head(DiffEG)

enter image description here

GEOquery ArrayExpress limma DifferentialExpression • 700 views

ADD COMMENT • link updated 12 months ago by James W. MacDonald 68k • written 12 months ago by SSSJec • 0

score 0 · Answer 1 · 2024-04-29

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 3 minutes ago

United States

A p-value>0.00012 isn't going to survive BH adjustment with 54k tests. Put another way, in most cases you will need at least one gene that achieves Bonferroni significance, which requires a p-value just under 2e-5. Your smallest p-value is an order of magnitude larger than that, so you won't have any FDR values that aren't essentially 1.

ADD COMMENT • link 12 months ago James W. MacDonald 68k

0

Entering edit mode

Thanks for your reply. I understand the principle of adjusted P-values, but I still want to address that issue. The description of the GSE98421 dataset indicates there are 120 differentially expressed genes in the analysis. So, I'm curious if there's an alternative method to analyze the array expression data.

ADD REPLY • link 12 months ago SSSJec • 0

0

Entering edit mode

The description of the analysis says there are 120 genes at a 1.5 fold change and a p<0.05. They don't mention any multiplicity correction. And if I analyze the data, I get the same results as you. I can't come up with 120 genes at a 1.5-fold difference and unadjusted p<0.05, but the data are apparently unpublished, so it's hard to say for sure what they actually did.

ADD REPLY • link 12 months ago James W. MacDonald 68k