Question: is our edgeR code doing what we think it does?
1
3 months ago by
acs199010
acs199010 wrote:

Thanks in advance for any help someone more experienced with edgeR and DE analysis can offer.

I am new in a lab doing this kind of work and I have only been working in R for about a month. My team tells me this code performs differential expression analysis of two groups via a t-test and edgeR's Exact Test. We also report both sets of results in presentations of the analysis. This was my first concern. In the interest of responsible analysis I compared our code to the edgeR documentation and vignettes and I noticed a handful of what I would consider serious divergences from the suggested workflow. I am worried we are not doing the analysis we think we are doing or at least we are not doing it in the most efficient or statistically responsible way. Am I on track by being concerned or is this some other equally valid way to do this type of analysis?

Would anyone be willing to do me a serious solid and check out the code and let me know where you see problems? Thanks much!

selected_samples=c("Abc_04","Abc_05","Abc_06","Abc_07","Abc_08","Abc_09")
samples <- sample_list[order(sample_list$Sample_ID),] #sort by ID to organize data #import counts from CountMerged.txt data from Step3-2 counts_file=("CountMerged.txt") counts <- read.table(counts_file, header=TRUE, sep ="\t", stringsAsFactors = FALSE) #filter "counts" and "samples" to match properly counts <- counts[,c("id",selected_samples)] samples=subset(samples, Sample_ID %in% selected_samples) row.names(counts) <- counts$id
counts$id <- NULL colnames(counts) <- samples$Sample_ID

#gene filtering only keep samples with >=10 counts in about 1/3 of the data
keep <- rowSums(counts >= 10) >= 3
counts <- counts[keep,]

#Function definition to be called later
tmm <- function(counts, groups=NA){
d <- DGEList(counts=counts, group=factor(groups))
d <- calcNormFactors(d, method="TMM")
return(d)
}

#normalize
dgList = tmm(counts, samples$Group) #Reset library sizes by multiplying lib.size and norm.factors dgList$samples$lib.size <- colSums(dgList$counts)

samples[samples$Group == "CONTROL","dm"] <- "1" samples[samples$Group == "CASE","dm"] <- "2"
samples$dm #create a matrix of data to do more testing and apply #Fisher's exact test on design <- model.matrix(~0+samples$dm)
colnames(design) <- levels(samples$dm) design #dg will be our dispersion estimate on the data set is the exact test calling dg <- estimateDisp(dgList,design) et <- exactTest(dg, pair=2:1) #capture results in a variable results_edgeR <- topTags(et, n = nrow(dgList$counts), sort.by = "none")

#CPM, Count Per Million
#log2 - transform the data and convert to CPM
d1 <- cpm(dgList, normalized.lib.sizes=TRUE,log=TRUE)

#Group1 sample ids (control should go here if using)
ctrl <- samples[samples$Group == "CONTROL","Sample_ID"] #Group2 sample ids (case should go here if only case/control) case <- samples[samples$Group == "CASE","Sample_ID"]

#T-test.
test_fun <- function(x) {
t.test(x[ctrl], x[case], alternative = c("two.sided"), paired = FALSE, var.equal=TRUE)$p.value } #call T-test function all.p.genes = apply(d1, 1, test_fun) #calculate FDR (adjusted p value in this case) fdr.value = p.adjust(all.p.genes, method="fdr") #Fold Change calculation ctrl_mean <- rowMeans(d1[,ctrl]) grp_mean <- rowMeans(d1[,case]) fc.value <- grp_mean-ctrl_mean #Combine results fc.plus.pvals = cbind(fc.value,all.p.genes, fdr.value)  ADD COMMENTlink modified 3 months ago by Yunshun Chen540 • written 3 months ago by acs199010 Answer: is our edgeR code doing what we think it does? 2 3 months ago by swbarnes2340 swbarnes2340 wrote: Doing T-tests on logged CPM data...I guess if you really have nothing but Excel, it will give you an idea..but this is not how you are supposed to do it. You can't publish based on results like this. You should follow what the EdgeR tutorials say, and let that software's algorithms do their stuff. You give EdgeR raw counts, it does its normalization, and it calculates fold changes and p-values. Here's a hint as to why T-test aren't right...The t-test result between (1,2,3) and (4,5,6) is the same as (10,20,30) and (40,50,60)...but in real life, with those raw counts, the latter is way more meaningful. CPM normalization has its uses for visualization...but it's not appropriate for calculating DE genes. Since you have only two categories, it looks like the exact test is appropriate. Your script should just stop at that point, and not bother with t-test on normalized counts. ADD COMMENTlink modified 3 months ago • written 3 months ago by swbarnes2340 Why is making the DGEList and calculating size factors in its own function? dgList$samples$lib.size <- colSums(dgList$counts) doesn't seem to do anything. I don't think the code after is doing much either.

Thank you for your feedback, I appreciate you taking the time to review this. You have absolutely confirmed my worst fears and it's super helpful. I feel better since you confirmed some of what I noticed too and I am working on overhauling our workflow to adhere to the documentation.

Thank you for your help on this. I am in the process of addressing this with my team and making sure we are doing it right.

To clarify, you should log transform the CPM for visualization only and not for analysis and only use the raw counts for analysis steps like the exactTest and QLFtest?

It is unclear from the User Guide since the Guide suggests filtering on CPM but doesn't show transforming the raw counts to CPM anywhere. The results of the QLF test also show logCPM. None of the examples show raw counts being transformed to CPM prior to filtering or normalization so my question is, should counts be transformed to CPM before creation of the DGEList object or before analysis or is this transformation performed as part of the function? Thanks again.

Answer: is our edgeR code doing what we think it does?
2
3 months ago by
Yunshun Chen540
Australia
Yunshun Chen540 wrote:

You should always refer to the edgeR user's guide. There are several detailed case studies in the user's guide with complete R code and explanation which you can easily follow.

We usually recommend the quasi-likelihood approach for differential expression analysis as we found it gives better FDR control.