edgeR: Combining time series analysis with mock and treated samples
2
0
Entering edit mode
venura • 0
@venura-22066
Last seen 10 months ago
United States

Hi,

I have two questions regarding the current dataset (output is coming from htseq) I am analyzing (I included the R code below). My experiment has two-time points (2hrs and 5hrs) and for each time point, there are two conditions (mock and treated). I want to compare the differential expression patterns compared to mock between different time points.

  1. Currently, I am doing the analysis for each time point and comparing between time points separately. Is there a way to combine both into a single run (Please note that I have mocks for both time points)?
  2. This may be a stupid question. When I defined the class I put TM (treated) first. The question is when I put TM first the output I am getting is compared to mock (reference) right? not the other way around. I always get scared thinking I am doing the analysis the opposite way.

Thank you in advance!

###R code###

# Read in count matrix
rawdata=read.table("/scratch/user/UPR/RNA_allign/RNA_ALIGN/htseq_counts_clean/gene_read_counts_table_all_2hr_final.tsv", header=TRUE, stringsAsFactors=FALSE, row.names=1)

# Check dimensions
dim(rawdata)

# Require at least 1/6 of samples to have expressed count >= 10
sample_cutoff <- (1/6)
count_cutoff <- 10

#Define a function to calculate the fraction of values expressed above the count cutoff
getFE <- function(data,count_cutoff){
  FE <- (sum(data >= count_cutoff)/length(data))
  return(FE)
}

#Apply the function to all genes, and filter out genes not meeting the sample cutoff
fraction_expressed <- apply(rawdata, 1, getFE, count_cutoff)
keep <- which(fraction_expressed >= sample_cutoff)
rawdata <- rawdata[keep,]

# Check dimensions again to see effect of filtering
dim(rawdata)

#################
# Running edgeR #
#################

# load edgeR
library('edgeR')

# make class labels
class <- c( rep("TM",3), rep("Mock",3) )

#Get common gene names
#Gene=rownames(rawdata)
#Symbol=mapping[Gene,1]
#gene_annotations=cbind(Gene,Symbol)

# Make DGEList object #y <- DGEList(counts=rawdata, genes=gene_annotations, group=class) genes component is removed
y <- DGEList(counts=rawdata, group=class)
nrow(y)

# TMM Normalization
y <- calcNormFactors(y)

# Estimate dispersion
y <- estimateCommonDisp(y, verbose=TRUE)
y <- estimateTagwiseDisp(y)

# Differential expression test
et <- exactTest(y)

# Extract raw counts to add back onto DE results
counts <- getCounts(y)


# Print number of up/down significant genes at FDR = 0.05  significance level
summary(de <- decideTestsDGE(et, adjust.method="BH", p=.05))

#Get output with BH-adjusted FDR values - all genes, any p-value, unsorted
out <- topTags(et, n = "Inf", adjust.method="BH", sort.by="none", p.value=1)$table

#Add raw counts back onto results for convenience (make sure sort and total number of elements allows proper join)
out2 <- cbind(out, counts)

#Limit to significantly DE genes
out3 <- out2[as.logical(de),]

# Order by p-value
o <- order(et$table$PValue[as.logical(de)],decreasing=FALSE)
out4 <- out3[o,]

# Save table
write.table(out4, file="DE_genes_2hr.txt", quote=FALSE, row.names=TRUE, sep="\t")
edgeR • 1.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

Take a look at section 3.3 in the edgeR User's Guide. That's a very similar experiment to yours, and you should be using the quasi-likelihood pipeline rather than the exact test anyway. It's not mentioned in that example, but you should do

y <- estimateDisp(y, design)

Which does all the dispersion estimation. Remember to include the design matrix though, so it knows you are fitting a GLM.

ADD COMMENT
0
Entering edit mode

Thank you, James 🙏. I was focusing on case studies and somehow missed that. I will get back if I come across any issues.

ADD REPLY
0
Entering edit mode

Hi James, I did my analysis as you & Gordon Smyth suggested using the following script. Thank you again. I have a small question regarding the identification of genes among both time points. After analyzing each time point (at p=0.05, logFC >= 1), I already know that there are 42 common genes. My question is why It didn't work when I tried below (was following page 38). It ended up giving the error "incorrect number of dimensions". (Only difference associated with mine compared to the example is that I don't use 0 hr mock/treatment).

qlfcommon <- glmQLFTest(fit, contrast = my.contrasts[, "TM.2hvsMock.2h", "TM.5hvsMock.5h"])
topTags(qlfcommon, adjust.method = "BH", sort.by = "PValue", p.value = 1)

I really appreciate it if you can shed some insight about it.

# Read in count matrix
rawdata=read.table("gene_read_counts_table_all_hours_final.tsv.txt", header=TRUE, stringsAsFactors=FALSE, row.names=1)

#check dimensions
dim(rawdata)

#require at least 1/12 samples to have expressed count >=10
sample_cutoff <- (1/12)
count_cutoff <- 10

#Define a function to calculate the fraction of values expressed above the count cutoff
getFE <- function(data,count_cutoff){
  FE <- (sum(data >= count_cutoff)/length(data))
  return(FE)
}

#Apply the function to all genes, and filter out genes not meeting the sample cutoff
fraction_expressed <- apply(rawdata, 1, getFE, count_cutoff)
keep <- which(fraction_expressed >= sample_cutoff)
rawdata <- rawdata[keep,]

#check dimensions again to see the effect of filtering

dim(rawdata)

# make labels/grouping

targets <- read.csv("groups.csv", header=TRUE, row.names=1 )

#grouping targets
group <- factor(paste(targets$Treatment, targets$Time, sep = "."))
cbind(targets,group=group)

#Make DEG List

y <- DGEList(counts=rawdata,group=group)
nrow(y)

# TMM Normalization
y <- calcNormFactors(y)


#making the design

design <- model.matrix(~0+group)
colnames(design) <- levels(group)

# Estimate dispersion
y <- estimateDisp(y, design)

#model fitting

fit <- glmQLFit(y,design)

#contrasts

my.contrasts <- makeContrasts(
                                TM.2hvsMock.2h = TM.2h-Mock.2h,
                                TM.5hvsMock.5h = TM.5h-Mock.5h,
levels=design)

#genes expressed at 2hr TM, 5hr

qlf2hr <- glmQLFTest(fit, contrast = my.contrasts[,"TM.2hvsMock.2h"])
topTags(qlf2hr, adjust.method = "BH", sort.by = "PValue", p.value = 1)


qlf5hr <- glmQLFTest(fit, contrast = my.contrasts[,"TM.5hvsMock.5h"])
topTags(qlf5hr, adjust.method = "BH", sort.by = "PValue", p.value = 1)
ADD REPLY
1
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

First question

edgeR always analyses all the samples together. As James has pointed out, the easiest approach for your data is to combine your two factors time and condition into one factor with four levels. But any experimental design can be accommodated using generalized linear models (glms).

Second question

No, the order of the comparison is determined the pair argument of exactTest. It doesn't have anything to do with the order of the conditions in your class vector. Just type ?exactTest to see how it works.

If you want to make sure that you are comparing TM to Mock (TM - Mock) then you would specify pair=c("Mock", "TM"). Alternatively you can specify the order when you make class into a factor by:

class <- factor(class, levels=c("Mock", "TM"))

In both cases, the first level is treated as the reference or control and the second level is compared back to it. By default, the conditions will be sorted in alphabetical order, so you are getting the right order in this case, but only by accident it seems.

ADD COMMENT
0
Entering edit mode

Thank you, Gordon. Good thing I asked about the order. I will rerun the whole analysis as you and James suggested and get back if I run into any issues 🙏.

ADD REPLY

Login before adding your answer.

Traffic: 641 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6