Question

Help with edgeR code - RNA-Seq Differental Gene Expression

0

Entering edit mode

joseph.landry • 0

@josephlandry-17916

Last seen 5.5 years ago

Hi All,

I was wondering if you could look over my R code for differential gene expression using EdgeR. I am looking to determine differential gene expression between wild type (WT) cells and knockout cells (KO). Three biological replicates were grown for each cell line and RNA was harvested. The paired end reads were mapped using STAR. Exon counts were obtained using feature counts. The exon counts were then used for the R code. Would this be sufficient to determine differential gene expression between WT and KO? Am I doing anything invalid with this code?

library("edgeR")
library("gdata")
library("heatmaply")
library("ggplot2")
library("genefilter")
library("methylumi")
setwd("/Users/jwlandry2/Desktop/RNA-Seq\ ESC\ Data\ Analysis/R_Studio_Data_Sets")
counts=read.delim("BPTFKD_ESC_RNA_Seq_Counts_Final2.txt", header=T, row.names="Geneid")
head(counts)
group <- factor(c("WT","WT","WT","KO","KO","KO"))
dge = DGEList(counts=counts,group=group)
keep <- rowSums(cpm(dge)>2) >= 3
dge <- dge[keep, , keep.lib.sizes=FALSE]
#Normalization
y <- calcNormFactors(dge)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
y$common.dispersion
plotMDS(y)
plotBCV(y)
#quasi-likelihood F-test
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef=2)
topTags(qlf)
#output to text file
df <- qlf$table
write.csv(df, 'qlf.csv')
#Get normalizaed CPMs
mtx <- cpm(y, log = TRUE, normalized.lib.sizes = TRUE)
mtx_to_plot <- varFilter(mtx, var.cutoff= 0.95)
#Correlation matrix
IAC <- mtx_to_plot %>% cor(. , use = "pairwise.complete.obs", method = "pearson")
heatmaply(IAC) %>% layout(margin=list(l=200,b=150))
#Dendrogram
plot(hclust(as.dist(1-IAC), method="ward"))

edger rnaseq • 668 views

ADD COMMENT • link updated 5.5 years ago by Gordon Smyth 50k • written 5.5 years ago by joseph.landry • 0

score 0 · Answer 1 · 2018-10-24

Looks fine to me, though it makes more sense to call topTags(qlf, n=Inf) and save that to file. Otherwise the DE list isn't sorted by p-value, which is less useful for interpretation. (Of course, this depends on what you're using it for; if you're integrating with other results, then not sorting is better if it preserves the gene order.) I would also consider using robust=TRUE in glmQLFit, which can improve power by avoiding overestimation of the variance of the dispersions and unnecessarily conservative EB shrinkage.

score 0 · Answer 2 · 2018-10-25

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

As Aaron says, the your code looks ok, but you'd be slightly better of with

keep <- filterByExpr(dge)

in place of the code you have now. You might also like to try

mtx <- cpm(y, log=TRUE, prior.count=2)
coolmap(mtx)

instead of the complicated last 7 lines you have.

ADD COMMENT • link 5.5 years ago Gordon Smyth 50k