Assist me in determining whether the analysis process using the limma package has been executed correctly.
1
0
Entering edit mode
SSSJec • 0
@21f1d1c7
Last seen 11 days ago
United States

I utilized the R package GEOquery to download data from the GEO database, followed by DFG analysis using the limma package. However, I am uncertain whether this process is accurate. The DFG output indicates that each P-value is significant, but each adjusted P-value tends towards 1. I attempted this process with several datasets using the limma R package and encountered the same outcome consistently. Thus, I suspect an error in my analysis code. Any assistance in resolving this issue would be greatly appreciated.

Here is my analysis code and its corresponding output:

options(stringsAsFactors = F) 
suppressMessages(library(GEOquery))
library(stringr)
library(ggplot2)
library(reshape2)
library(limma)

gset <- getGEO('GSE98421', destdir=".", AnnotGPL = F, getGPL = F)
gse <- gset[[1]]
exp <- exprs(gse)

# Clinic data 
meta.data <- pData(gse) 
group_list <- sapply(meta.data$description, function(x) 
    unlist(strsplit(x," "))[1])
group_list <- factor(group_list,levels = c("NL","PCOS"))
table(group_list)

# change ID names
gpl <- getGEO('GPL570', destdir=".")
probe.gene <- Table(gpl)[,c("ID","Gene Symbol")]
colnames(probe.gene) <- c("ID","Symbol")
probe.gene <- subset(probe.gene, Symbol != '')
ids <- probe.gene
ids$Symbol <- data.frame(sapply(ids$Symbol, function(x) 
    unlist(strsplit(x,"///"))[1]), stringsAsFactors = F)[,1]
head(ids)

library(tidyverse)
exp <- as.data.frame(exp) 
exp <- exp %>% 
  mutate(ID = rownames(exp)) %>% 
  inner_join(ids, by = "ID") %>%
  select(ID, Symbol, everything())

table(duplicated(exp$Symbol))
exp.unique <- avereps(exp[, -c(1,2)], ID = exp$Symbol)

#############################################################
# DFG analysis
design <- model.matrix(~0+factor(group_list))
colnames(design) <- levels(factor(group_list))
rownames(design) <- colnames(exp.unique)

fit <- lmFit(exp.unique, design)
contrast.matrix <- makeContrasts(paste0(c("NL","PCOS"), collapse = "-"), levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix) 
fit2 <- eBayes(fit2) 
DiffEG <- topTable(fit2, coef=1, n=Inf) %>% na.omit()
head(DiffEG)

enter image description here

GEOquery ArrayExpress limma DifferentialExpression • 272 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 days ago
United States

A p-value>0.00012 isn't going to survive BH adjustment with 54k tests. Put another way, in most cases you will need at least one gene that achieves Bonferroni significance, which requires a p-value just under 2e-5. Your smallest p-value is an order of magnitude larger than that, so you won't have any FDR values that aren't essentially 1.

0
Entering edit mode

Thanks for your reply. I understand the principle of adjusted P-values, but I still want to address that issue. The description of the GSE98421 dataset indicates there are 120 differentially expressed genes in the analysis. So, I'm curious if there's an alternative method to analyze the array expression data.

ADD REPLY
0
Entering edit mode

The description of the analysis says there are 120 genes at a 1.5 fold change and a p<0.05. They don't mention any multiplicity correction. And if I analyze the data, I get the same results as you. I can't come up with 120 genes at a 1.5-fold difference and unadjusted p<0.05, but the data are apparently unpublished, so it's hard to say for sure what they actually did.

ADD REPLY

Login before adding your answer.

Traffic: 739 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6