Thanks in advance for any help someone more experienced with edgeR and DE analysis can offer.
I am new in a lab doing this kind of work and I have only been working in R for about a month. My team tells me this code performs differential expression analysis of two groups via a t-test and edgeR's Exact Test. We also report both sets of results in presentations of the analysis. This was my first concern. In the interest of responsible analysis I compared our code to the edgeR documentation and vignettes and I noticed a handful of what I would consider serious divergences from the suggested workflow. I am worried we are not doing the analysis we think we are doing or at least we are not doing it in the most efficient or statistically responsible way. Am I on track by being concerned or is this some other equally valid way to do this type of analysis?
Would anyone be willing to do me a serious solid and check out the code and let me know where you see problems? Thanks much!
selected_samples=c("Abc_04","Abc_05","Abc_06","Abc_07","Abc_08","Abc_09")
metafile=("metadata.txt")
sample_list <- read.table(metafile, header=TRUE, stringsAsFactors=FALSE)
samples <- sample_list[order(sample_list$Sample_ID),]
#sort by ID to organize data
#import counts from CountMerged.txt data from Step3-2
counts_file=("CountMerged.txt")
counts <- read.table(counts_file, header=TRUE, sep ="\t", stringsAsFactors = FALSE)
#filter "counts" and "samples" to match properly
counts <- counts[,c("id",selected_samples)]
samples=subset(samples, Sample_ID %in% selected_samples)
row.names(counts) <- counts$id
counts$id <- NULL
colnames(counts) <- samples$Sample_ID
#gene filtering only keep samples with >=10 counts in about 1/3 of the data
keep <- rowSums(counts >= 10) >= 3
counts <- counts[keep,]
#Function definition to be called later
tmm <- function(counts, groups=NA){
d <- DGEList(counts=counts, group=factor(groups))
d <- calcNormFactors(d, method="TMM")
return(d)
}
#normalize
dgList = tmm(counts, samples$Group)
#Reset library sizes by multiplying lib.size and norm.factors
dgList$samples$lib.size <- colSums(dgList$counts)
samples[samples$Group == "CONTROL","dm"] <- "1"
samples[samples$Group == "CASE","dm"] <- "2"
samples$dm
#create a matrix of data to do more testing and apply
#Fisher's exact test on
design <- model.matrix(~0+samples$dm)
colnames(design) <- levels(samples$dm)
design
#dg will be our dispersion estimate on the data set is the exact test calling
dg <- estimateDisp(dgList,design)
et <- exactTest(dg, pair=2:1)
#capture results in a variable
results_edgeR <- topTags(et, n = nrow(dgList$counts), sort.by = "none")
#CPM, Count Per Million
#log2 - transform the data and convert to CPM
d1 <- cpm(dgList, normalized.lib.sizes=TRUE,log=TRUE)
#Group1 sample ids (control should go here if using)
ctrl <- samples[samples$Group == "CONTROL","Sample_ID"]
#Group2 sample ids (case should go here if only case/control)
case <- samples[samples$Group == "CASE","Sample_ID"]
#T-test.
test_fun <- function(x) {
t.test(x[ctrl], x[case], alternative = c("two.sided"), paired = FALSE, var.equal=TRUE)$p.value
}
#call T-test function
all.p.genes = apply(d1, 1, test_fun)
#calculate FDR (adjusted p value in this case)
fdr.value = p.adjust(all.p.genes, method="fdr")
#Fold Change calculation
ctrl_mean <- rowMeans(d1[,ctrl])
grp_mean <- rowMeans(d1[,case])
fc.value <- grp_mean-ctrl_mean
#Combine results
fc.plus.pvals = cbind(fc.value,all.p.genes, fdr.value)
Why is making the DGEList and calculating size factors in its own function? dgList$samples$lib.size <- colSums(dgList$counts) doesn't seem to do anything. I don't think the code after is doing much either.
Thank you for your feedback, I appreciate you taking the time to review this. You have absolutely confirmed my worst fears and it's super helpful. I feel better since you confirmed some of what I noticed too and I am working on overhauling our workflow to adhere to the documentation.
Thank you for your help on this. I am in the process of addressing this with my team and making sure we are doing it right.
To clarify, you should log transform the CPM for visualization only and not for analysis and only use the raw counts for analysis steps like the exactTest and QLFtest?
It is unclear from the User Guide since the Guide suggests filtering on CPM but doesn't show transforming the raw counts to CPM anywhere. The results of the QLF test also show logCPM. None of the examples show raw counts being transformed to CPM prior to filtering or normalization so my question is, should counts be transformed to CPM before creation of the DGEList object or before analysis or is this transformation performed as part of the function? Thanks again.