RNA-seq analysis using EdgeR- I need to make sure that the codes that I use to analyze the data are correct
3
0
Entering edit mode
MEM • 0
@mem-9915
Last seen 8.0 years ago

Hi everyone I am have RNA seq data that I am trying to analyze using EdgeR. Here is the codes I used to analyze the data I just want to make sure that the codes are right and that I can rely on the output for the analysis. The summary of the experiments: I have 4 replicates of infected cells with Cryptosporidium and 4 replicates of control (uninfected)

rawdata <- read.csv("celldata.csv", check.names=FALSE, sep=",", header=T, stringsAsFactors=FALSE)
library(edgeR)
library(limma)
rownames( counts ) <- raw.data[ , 1 ]
> y <- DGEList(counts=rawdata[,2:9], genes=rawdata[,1])

filtering 

> o <- order(rowSums(y$counts), decreasing=TRUE)
> y <- y[o,]
> d <- duplicated(y$genes)
> y <- y[!d,]
> nrow(y)
> y$samples$lib.size <- colSums(y$counts)
>  rownames(y$counts) <- rownames(y$genes) <- y$genes$EntrezGene
> y$genes$EntrezGene <- NULL
>  y <- calcNormFactors(y)
>  y$samples
> plotMDS(y)

Design matrix

cells <- factor(c(1,2,3,4,5,6,7,8))
> status<- factor(c("i","i","i","i","c","c","c","c"))
> data.frame(Sample=colnames(y),cells,status)
> design <- model.matrix(~cells)
>  rownames(design) <- colnames(y)
> Estimating the dispersion 
>y d1$common.dispersion
> y <- estimateTrendedDisp(y)
> y <- estimateTrendedDisp(y)
>y <- estimateCommonDisp(y)
> y<- estimateCommonDisp(y)
> y <- estimateTrendedDisp(y)
> y<- estimateTagwiseDisp(y)
> plotBCV(y)
> fit <- glmFit(y, design)
>  lrt <- glmLRT(fit)
> topTags(lrt)

 

Really appreciate any feedback

 

rnaseq edger differential gene expression • 3.5k views
ADD COMMENT
3
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 18 hours ago
The city by the bay

There seem to be multiple problems here. Some are harmless, others are less so:

  • Your "filtering" step just removes duplicated genes. The real purpose of filtering is to remove low abundance genes, which you have not done. This will lead to suboptimal performance, as the overabundance of genes with near-zero counts will skew the mean-dispersion trend fit; reduce power by increasing the number of tests (and severity of the multiple testing correction) without increasing the number of rejections, as the counts are too low to reject the null; and increase computational time.
  • Your design matrix is constructed using the cells factor. This is almost certainly wrong, as you're basically saying that there are no replicates in your experiment (as each sample corresponds to a different level of cells). You should probably be constructing your design matrix from status.
  • I don't know why you're mixing classic edgeR (estimateTrendedDisp, estimateCommonDisp) with GLM edgeR (glmFit, glmLRT). While this doesn't cause too many problems for your two-group setup, you should generally stick with one or the other. Also, the line y d1$common.dispersion can't be syntactically valid.

Other things are probably okay but can be improved. In particular, you don't need to reorder by abundance; recalculation of the library sizes can be done with keep.lib.sizes=FALSE during subsetting; and there's no need to repeat the dispersion estimation calls.

ADD COMMENT
0
Entering edit mode

Many thanks for your quick response. I realized there is a problem with filtering since I get the same number before and after filtering. This means filtering did not make any changes or remove any reads. How do you recommend to change this codes? Regarding cells factor I understand the reason why it is wrong but I don't know how to fix it. 1,2,3 nd four belong to the infected group and 5,6,7,8 belong to the control group. Thanks, 

ADD REPLY
0
Entering edit mode

Something like the below would be sufficient:

keep <- aveLogCPM(y) > 0 # depending on your library size.
y <- y[keep,,keep.lib.sizes=FALSE]

Alternatively, read some of the case studies in the user's guide for an "at least X" approach to filtering.

As for the factor in the design matrix, just replace cells with status as I mentioned.

ADD REPLY
0
Entering edit mode
MEM • 0
@mem-9915
Last seen 8.0 years ago

Many thanks for your quick response. I realized there is a problem with filtering since I get the same number before and after filtering. This means filtering did not make any changes or remove any reads. How do you recommend to change this codes? Regarding cells factor I understand the reason why it is wrong but I don't know how to fix it. 1,2,3 nd four belong to the infected group and 5,6,7,8 belong to the control group. Thanks, 

 

 

 

ADD COMMENT
0
Entering edit mode

In order to fix the "cells factor" problem, make another covariate named "group" or whatever where you explicitly encode "treatment" or "control" as you've outlined in your answer here.

ADD REPLY
0
Entering edit mode

This is what I get                     

infected<- factor(c(1,1,1,1,0,0,0,0))
> cells<- factor(c(1,2,3,4,5,6,7,8))

design <- model.matrix(~cells+infected)
> design
  (Intercept) cells2 cells3 cells4 cells5 cells6 cells7 cells8 infected1
1           1      0      0      0      0      0      0      0         1
2           1      1      0      0      0      0      0      0         1
3           1      0      1      0      0      0      0      0         1
4           1      0      0      1      0      0      0      0         1
5           1      0      0      0      1      0      0      0         0
6           1      0      0      0      0      1      0      0         0
7           1      0      0      0      0      0      1      0         0
8           1      0      0      0      0      0      0      1         0

 

ADD REPLY
0
Entering edit mode

If I exclude the cells factor I get this output...design <- model.matrix(~infected)
> design
  (Intercept) infected1
1           1         1
2           1         1
3           1         1
4           1         1
5           1         0
6           1         0
7           1         0
8           1         0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$infected
[1] "contr.treatment"

 

ADD REPLY
1
Entering edit mode

So now you should be all setup to do your DGE analysis.

To put all the suggestions in one place, you would do:

y <- DGEList(counts=rawdata[,2:9], genes=rawdata[,1])

## your duplicate filtering
o <- order(rowSums(y$counts), decreasing=TRUE)
y <- y[o,]
d <- duplicated(y$genes)
y <- y[!d,]

## Aaron's expression filtering
keep <- aveLogCPM(y) > 0 # depending on your library size.
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)

## Proper design setup
design <- model.matrix(~infected)
y <- estimateDisp(y, design)

Now you are ready to run the tests. I like to use the quasi-likelihood framework, and a good tutorial on it can be found in Aaron's "It's DE-licious" paper

fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef='infected1')
topTags(res)

... and the rest will be gravy. You definitely should read through (and take the time to understand) the edgeRUserGuide and the DE-licious paper I linked to for more details around each step before you trust these results, though.

ADD REPLY
0
Entering edit mode

 

 

Thanks Steve! I have read the EdgeR user guide and followed the instructions there but I found it difficult to interpret the output and understand the codings. I also went through the article you sent which was quite useful. However comparing a couple of my output with their case study make me confused. For instance I created the infected factor as it appears above. Here is the article case study output after fitting QL glm in page 13 you see each column represent each sample. In my case it shows everything in one column 

it <- glmQLFit(y, design, robust=TRUE)
>  head(fit$coefficients)
  (Intercept)     infected1
1  -12.284949 -1.476186e-02
3  -14.962891  3.926424e-01
5  -13.812587 -1.519542e-01
6  -10.063114  7.659401e-05
7   -9.508792 -6.364900e-02
9  -10.678536  1.593839e-03

I cannot follow the exact instructions in that article as I am working with a table of raw reads and annotated genes. Thanks!

 

ADD REPLY
1
Entering edit mode

There is a column-per-sample on the example from the page you reference because the design matrix they built was one without an intercept (eg. design <- model.matrix(~ 0 + whatever)), whereas yours has an Intercept (notice that the formula that constructs your design is missing the `0 +`).

The difference between the two different designs has been explained dozens of times on these forums, and is also explained in both the limmaUsersGuide and (pretty sure) the edgeRUsersGuide. Both of these guides are great reads for people who are trying to get a better handle on the basics of linear modeling.

Instead of re-explaining it here, I'll just recommend (again) to read these valuable resources and (further) browse through the extensive history of help on this forum. You might start by searching the forums for "intercept" and reading through those results.

ADD REPLY
0
Entering edit mode

Thanks!I have done GO analysis and am now interested in metabolic pathway analysis. Aside from EdgeR user guide do you have any other ref to go through in that regards?

 

 

ADD REPLY
0
Entering edit mode

Look through the packages available in the GeneSetEnrichment view.

ADD REPLY
0
Entering edit mode

Thanks! 

ADD REPLY
0
Entering edit mode

Thanks for providing useful information. Here is my finalized codes. I got differentially expressed genes, GO terms and metabolic pathways based on these codes. How can I conform that these are correct findings?

rawdata <- read.csv("celldata.csv", check.names=FALSE, sep=",", header=T, stringsAsFactors=FALSE)

> y <- DGEList(counts=rawdata[,2:9], genes=rawdata[,1])

keep <- aveLogCPM(y) > 2

y <- y[keep,,keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

 design <- model.matrix(~infected)

y <- estimateDisp(y, design)

fit <- glmQLFit(y, design, robust=TRUE)

res <- glmQLFTest(fit, coef='infected1')

topTags(res)

keg <- kegga(qlf)

topKEGG(keg)

go <- goana(qlf)

> topGO(go)

ADD REPLY
0
Entering edit mode
MEM • 0
@mem-9915
Last seen 8.0 years ago

Thanks for providing useful information. Here is my finalized codes. I got differentially expressed genes, GO terms and metabolic pathways based on these codes. How can I conform that these are correct findings?

rawdata <- read.csv("celldata.csv", check.names=FALSE, sep=",", header=T, stringsAsFactors=FALSE)

> y <- DGEList(counts=rawdata[,2:9], genes=rawdata[,1])

keep <- aveLogCPM(y) > 2

y <- y[keep,,keep.lib.sizes=FALSE]

y <- calcNormFactors(y)

 design <- model.matrix(~infected)

y <- estimateDisp(y, design)

fit <- glmQLFit(y, design, robust=TRUE)

res <- glmQLFTest(fit, coef='infected1')

topTags(res)

keg <- kegga(qlf)

topKEGG(keg)

go <- goana(qlf)

> topGO(go)

 

 

 

ADD COMMENT
2
Entering edit mode

I don't mean to sound too snarky, but now it's time to actually look at your results, cross reference with literature, talk to your collaborators, etc. to see if these make sense.

It seems you are quite new to all of this, so I would even start by plotting the individual expression estimates per sample among a sampling of your differentially expressed genes to convince yourself that your code worked. Plot the expression estimates of genes within a pathway (vs not) to convince yourself that your pathway analysis is correct, etc.

There is an infinite number of ways to explore the results of your analysis, and no one can give you a few lines of code that should be proof positive enough for you to say what what you've done is correct.

ADD REPLY
0
Entering edit mode

 

 

ADD REPLY
0
Entering edit mode

This forum is the only source of support I have at present!!!!

ADD REPLY

Login before adding your answer.

Traffic: 795 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6