Help with edgeR code - RNA-Seq Differental Gene Expression
2
0
Entering edit mode
@josephlandry-17916
Last seen 5.5 years ago

Hi All,

I was wondering if you could look over my R code for differential gene expression using EdgeR. I am looking to determine differential gene expression between wild type (WT) cells and knockout cells (KO). Three biological replicates were grown for each cell line and RNA was harvested. The paired end reads were mapped using STAR. Exon counts were obtained using feature counts. The exon counts were then used for the R code. Would this be sufficient to determine differential gene expression between WT and KO? Am I doing anything invalid with this code?

library("edgeR")
library("gdata")
library("heatmaply")
library("ggplot2")
library("genefilter")
library("methylumi")
setwd("/Users/jwlandry2/Desktop/RNA-Seq\ ESC\ Data\ Analysis/R_Studio_Data_Sets")
counts=read.delim("BPTFKD_ESC_RNA_Seq_Counts_Final2.txt", header=T, row.names="Geneid")
head(counts)
group <- factor(c("WT","WT","WT","KO","KO","KO"))
dge = DGEList(counts=counts,group=group)
keep <- rowSums(cpm(dge)>2) >= 3
dge <- dge[keep, , keep.lib.sizes=FALSE]
#Normalization
y <- calcNormFactors(dge)
design <- model.matrix(~group)
y <- estimateDisp(y,design)
y$common.dispersion
plotMDS(y)
plotBCV(y)
#quasi-likelihood F-test
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef=2)
topTags(qlf)
#output to text file
df <- qlf$table
write.csv(df, 'qlf.csv')
#Get normalizaed CPMs
mtx <- cpm(y, log = TRUE, normalized.lib.sizes = TRUE)
mtx_to_plot <- varFilter(mtx, var.cutoff= 0.95)
#Correlation matrix
IAC <- mtx_to_plot %>% cor(. , use = "pairwise.complete.obs", method = "pearson")
heatmaply(IAC) %>% layout(margin=list(l=200,b=150))
#Dendrogram
plot(hclust(as.dist(1-IAC), method="ward"))
edger rnaseq • 668 views
ADD COMMENT
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 10 hours ago
The city by the bay

Looks fine to me, though it makes more sense to call topTags(qlf, n=Inf) and save that to file. Otherwise the DE list isn't sorted by p-value, which is less useful for interpretation. (Of course, this depends on what you're using it for; if you're integrating with other results, then not sorting is better if it preserves the gene order.) I would also consider using robust=TRUE in glmQLFit, which can improve power by avoiding overestimation of the variance of the dispersions and unnecessarily conservative EB shrinkage.

ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

As Aaron says, the your code looks ok, but you'd be slightly better of with

keep <- filterByExpr(dge)

in place of the code you have now. You might also like to try

mtx <- cpm(y, log=TRUE, prior.count=2)
coolmap(mtx)

instead of the complicated last 7 lines you have.

ADD COMMENT

Login before adding your answer.

Traffic: 1048 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6