Question

is our edgeR code doing what we think it does?

1

Entering edit mode

acs1990 ▴ 10

@acs1990-21798

Last seen 4.4 years ago

Thanks in advance for any help someone more experienced with edgeR and DE analysis can offer.

I am new in a lab doing this kind of work and I have only been working in R for about a month. My team tells me this code performs differential expression analysis of two groups via a t-test and edgeR's Exact Test. We also report both sets of results in presentations of the analysis. This was my first concern. In the interest of responsible analysis I compared our code to the edgeR documentation and vignettes and I noticed a handful of what I would consider serious divergences from the suggested workflow. I am worried we are not doing the analysis we think we are doing or at least we are not doing it in the most efficient or statistically responsible way. Am I on track by being concerned or is this some other equally valid way to do this type of analysis?

Would anyone be willing to do me a serious solid and check out the code and let me know where you see problems? Thanks much!

selected_samples=c("Abc_04","Abc_05","Abc_06","Abc_07","Abc_08","Abc_09")
metafile=("metadata.txt")
sample_list <- read.table(metafile, header=TRUE, stringsAsFactors=FALSE)
samples <- sample_list[order(sample_list$Sample_ID),]

#sort by ID to organize data
#import counts from CountMerged.txt data from Step3-2
counts_file=("CountMerged.txt")
counts <- read.table(counts_file, header=TRUE, sep ="\t", stringsAsFactors = FALSE)

#filter "counts" and "samples" to match properly
counts <- counts[,c("id",selected_samples)]
samples=subset(samples, Sample_ID %in% selected_samples)

row.names(counts) <- counts$id
counts$id <- NULL
colnames(counts) <- samples$Sample_ID

#gene filtering only keep samples with >=10 counts in about 1/3 of the data
keep <- rowSums(counts >= 10) >= 3
counts <- counts[keep,]

#Function definition to be called later
tmm <- function(counts, groups=NA){
  d <- DGEList(counts=counts, group=factor(groups))
  d <- calcNormFactors(d, method="TMM") 
  return(d)
}

#normalize
dgList = tmm(counts, samples$Group)

#Reset library sizes by multiplying lib.size and norm.factors
dgList$samples$lib.size <- colSums(dgList$counts)


samples[samples$Group == "CONTROL","dm"] <- "1"
samples[samples$Group == "CASE","dm"] <- "2"
samples$dm

#create a matrix of data to do more testing and apply 
#Fisher's exact test on
design <- model.matrix(~0+samples$dm)
colnames(design) <- levels(samples$dm)
design

#dg will be our dispersion estimate on the data set is the exact test calling
dg <- estimateDisp(dgList,design)
et <- exactTest(dg, pair=2:1)

#capture results in a variable
results_edgeR <- topTags(et, n = nrow(dgList$counts), sort.by = "none")

#CPM, Count Per Million
#log2 - transform the data and convert to CPM
d1 <- cpm(dgList, normalized.lib.sizes=TRUE,log=TRUE)

#Group1 sample ids (control should go here if using)
ctrl <- samples[samples$Group == "CONTROL","Sample_ID"]

#Group2 sample ids (case should go here if only case/control)
case <- samples[samples$Group == "CASE","Sample_ID"]

#T-test.
test_fun <- function(x) {
  t.test(x[ctrl], x[case], alternative = c("two.sided"), paired = FALSE, var.equal=TRUE)$p.value
}

#call T-test function
all.p.genes = apply(d1, 1, test_fun)

#calculate FDR (adjusted p value in this case)
fdr.value = p.adjust(all.p.genes, method="fdr")

#Fold Change calculation
ctrl_mean <- rowMeans(d1[,ctrl])
grp_mean <- rowMeans(d1[,case])
fc.value <-  grp_mean-ctrl_mean

#Combine results 
fc.plus.pvals = cbind(fc.value,all.p.genes, fdr.value)

edger differential expression rna-seq • 1.2k views

ADD COMMENT • link updated 4.6 years ago by Yunshun Chen ▴ 840 • written 4.6 years ago by acs1990 ▴ 10

score 2 · Answer 1 · 2019-09-03

2

Entering edit mode

swbarnes2 ★ 1.3k

@swbarnes2-14086

Last seen 12 hours ago

San Diego

Doing T-tests on logged CPM data...I guess if you really have nothing but Excel, it will give you an idea..but this is not how you are supposed to do it. You can't publish based on results like this. You should follow what the EdgeR tutorials say, and let that software's algorithms do their stuff. You give EdgeR raw counts, it does its normalization, and it calculates fold changes and p-values.

Here's a hint as to why T-test aren't right...The t-test result between (1,2,3) and (4,5,6) is the same as (10,20,30) and (40,50,60)...but in real life, with those raw counts, the latter is way more meaningful.

CPM normalization has its uses for visualization...but it's not appropriate for calculating DE genes.

Since you have only two categories, it looks like the exact test is appropriate. Your script should just stop at that point, and not bother with t-test on normalized counts.

ADD COMMENT • link 4.6 years ago swbarnes2 ★ 1.3k

0

Entering edit mode

Why is making the DGEList and calculating size factors in its own function? dgList$samples$lib.size <- colSums(dgList$counts) doesn't seem to do anything. I don't think the code after is doing much either.

ADD REPLY • link 4.6 years ago swbarnes2 ★ 1.3k

0

Entering edit mode

Thank you for your feedback, I appreciate you taking the time to review this. You have absolutely confirmed my worst fears and it's super helpful. I feel better since you confirmed some of what I noticed too and I am working on overhauling our workflow to adhere to the documentation.

ADD REPLY • link 4.6 years ago acs1990 ▴ 10

0

Entering edit mode

Thank you for your help on this. I am in the process of addressing this with my team and making sure we are doing it right.

To clarify, you should log transform the CPM for visualization only and not for analysis and only use the raw counts for analysis steps like the exactTest and QLFtest?

It is unclear from the User Guide since the Guide suggests filtering on CPM but doesn't show transforming the raw counts to CPM anywhere. The results of the QLF test also show logCPM. None of the examples show raw counts being transformed to CPM prior to filtering or normalization so my question is, should counts be transformed to CPM before creation of the DGEList object or before analysis or is this transformation performed as part of the function? Thanks again.

ADD REPLY • link 4.6 years ago acs1990 ▴ 10

score 2 · Answer 2 · 2019-09-03

2

Entering edit mode

Yunshun Chen ▴ 840

@yunshun-chen-5451

Last seen 5 weeks ago

Australia

You should always refer to the edgeR user's guide. There are several detailed case studies in the user's guide with complete R code and explanation which you can easily follow.

We usually recommend the quasi-likelihood approach for differential expression analysis as we found it gives better FDR control.

ADD COMMENT • link 4.6 years ago Yunshun Chen ▴ 840

0

Entering edit mode

Thank you so much. I really appreciate how detailed and clear the documentation is, especially the case studies. Even as an R noob I have been able to follow it very easily and the QL approach was the conclusion I came to as well. I am in the process of overhauling our workflow so that it more closely adheres to the documentation.

ADD REPLY • link 4.6 years ago acs1990 ▴ 10