Dear all
I 'm trying to run a differential expression analysis with edgeR/voom (combining guidelines from https://www.bioconductor.org/help/workflows/RNAseq123/ and http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html) and using count tables from 10 RNAseq samples (5 "EXP" (treatment) samples and 5 "CON" (control) samples). These count tables were obtained by mapping RNA reads against a de novo transcriptome (116886 transcripts) and counting with htseqcount. Please find below the dashed line the workflow of my diff. expression analysis.
Several question came to my mind while running this analysis:
- overall: is this workflow sound, any other parameter I should take into account for the fact that reads were mapped against a de novo transcriptome with > 100000 transcripts?
- based on the PCAplot and heatmap BEFORE filtering for transcripts with low number of reads, I would consider R4 as an outlier and exclude it from further analysis while based on the heatmap AFTER filtering for transcripts with low number of reads and taking into account that S1 has the lowest sample weight (0.17) I would consider both S1 and R4 as outliers and exclude them from the DE analysis. Is this correct?
- using voomwithQualityweights (see outputgraph here) and limma eBayes I obtained 559 DE transcripts (fc>2, adj. p value < 0.05), however, using voom (not included in the workflow above) I only have 3 DE transcripts...making me wonder whether I can trust the DE analysis (but I guess only qPCR validation could provide an answer for this question)
-last, in the RNAseq123 guidelines it is mentioned to use the treat method to calculate p-values from empirical Bayes moderated t-statistics with a minimum log-FC requirement. However, when I use the treat method (tfit <- treat(vfit, lfc=1), not included in the workflow above) I only obtained 3 DE transcripts compared to 559 DE transcripts using efit as explained here and here.
Any suggestion/answer would be much appreciated!
Regards
Wannes
-----------------------------------------------------------------
> directory="/Users/WD/"
> ##import htseq count files
> files <- grep("counts.txt", list.files(directory), value = TRUE)
> x <- readDGE(files, columns=c(1,2),header=FALSE)
Meta tags detected: __no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
> #remove HTSeq Metatags:__no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
> MetaTags <- grep("^__", rownames(x))
> x <- x[-MetaTags, ]
> dim(x)
[1] 116886 10
> ##organisation of samples
> #abbreviate samplenames
> samplenames <- substring(colnames(x),1,2)
> samplenames
[1] "R1" "R2" "R3" "R4" "R5" "S1" "S2" "S3" "S4" "S5"
> #assign abbreviated samplenames to colnames of x
> colnames(x) <- samplenames
> #make groups
> group <- as.factor(c("EXP", "EXP", "EXP", "EXP", "EXP", "CON","CON", "CON", "CON", "CON"))
> x$samples$group <- group
> x$samples
> ##heatplot 500 most variable genes before filtering
> lcpm<-cpm(x,log=TRUE)
> #select most variable genes
> var_genes <- apply(lcpm, 1, var)
> select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
> #subsetting
> highly_variable_lcpm <- lcpm[select_var,]
> dim(highly_variable_lcpm)
[1] 500 10
> #setup colors
> mypalette <- brewer.pal(11,"RdYlBu")
> morecols <- colorRampPalette(mypalette)
> library(gplots) > heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",scale="row")
> ##MDS_plot_before_filtering
> col.group <- group
> levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
> col.group <- as.character(col.group)
> pdf("MDS_plot_before_filtering.pdf")
> plotMDS(lcpm, labels=samplenames, col=col.group)
> title(main="A. Sample groups")
> ##remove transcripts with low reads (cpm above 1 in at least 5 out of 10 samples)
> cpm <- cpm(x)
> keep.exprs <- rowSums(cpm>1)>=5
> x <- x[keep.exprs, keep.lib.sizes=FALSE]
> cpm <- cpm(x)
> dim(x)
[1] 15926 10
> ##heatplot 500 most variable genes after filtering
> lcpm<-cpm(x,log=TRUE)
> #select most variable genes
> var_genes <- apply(lcpm, 1, var)
> select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
> #subsetting
> highly_variable_lcpm <- lcpm[select_var,]
> dim(highly_variable_lcpm)
[1] 500 10
> #setup colors
> mypalette <- brewer.pal(11,"RdYlBu")
> morecols <- colorRampPalette(mypalette)
> library(gplots)
> heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",scale="row")
> ##calculate normalisation factors
> x <- calcNormFactors(x, method = "TMM")
> ##setup design
> design <- model.matrix(~0+group)
> #remove group in design colnames
> colnames(design) <- gsub("group", "",colnames(design))
> design
CON EXP
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 1 0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
> ##setup contrasts
> contr.matrix <- makeContrasts(EXPvsCON = EXP-CON,levels = colnames(design))
> contr.matrix
Contrasts
Levels EXPvsCON
CON -1
EXP 1
> ##voom WithQualityWeights
> v <- voomWithQualityWeights(counts = x, design = design, normalize.method = "none", plot = TRUE)
> v
An object of class "EList"
$targets
files group lib.size norm.factors
R1 R1_Mixed_no_strand_counts.txt EXP 17826523 0.7893778
R2 R2_Mixed_no_strand_counts.txt EXP 18369554 0.8584186
R3 R3_Mixed_no_strand_counts.txt EXP 19888650 1.0453065
R4 R4_Mixed_no_strand_counts.txt EXP 31397203 1.7394248
R5 R5_Mixed_no_strand_counts.txt EXP 9456578 0.5311490
S1 S1_Mixed_no_strand_counts.txt CON 17908499 0.8544335
S2 S2_Mixed_no_strand_counts.txt CON 19757373 1.0549119
S3 S3_Mixed_no_strand_counts.txt CON 36983226 1.8358870
S4 S4_Mixed_no_strand_counts.txt CON 13900378 0.7013372
S5 S5_Mixed_no_strand_counts.txt CON 26794838 1.3166885
$E
Samples
Tags R1 R2 R3 R4 R5 S1 S2 S3 S4 S5
locus_000001.1 4.317752 4.619337 4.455964 4.693660 4.619768 4.960255 4.645507 4.746124 4.216968 4.790420
locus_000002.1 13.983810 13.686776 14.098252 13.247817 15.196448 13.060765 12.863594 12.685275 14.536459 13.116237
locus_000003.1 2.695796 2.817563 2.820553 2.601083 2.918553 3.234032 2.890438 3.381788 2.073312 3.303241
locus_000005.1 4.856671 4.827279 4.772263 4.816154 4.601032 2.951170 4.539602 4.478576 4.549461 4.407133
locus_000006.1 2.311652 2.167077 2.434319 2.326644 2.629046 2.414856 1.990301 1.785554 2.620800 2.250470
15921 more rows ...
$weights
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 17.718627 39.148292 41.651738 34.292450 21.710199 3.6379280 8.329925 57.114196 5.771753 64.777525
[2,] 1.942678 4.192361 4.204877 2.672345 4.173634 0.5159848 1.100322 5.185488 1.001878 7.011152
[3,] 9.552051 21.208064 22.935908 21.786614 11.197957 2.2399819 5.267825 44.535453 3.388637 45.161314
[4,] 18.620894 41.036334 43.500842 35.012678 23.480366 3.3555401 7.749195 55.712939 5.218605 61.707257
[5,] 8.225308 18.257714 19.727219 18.585605 9.399346 1.4550842 3.410234 29.234620 2.192492 29.042470
15921 more rows ...
$design
CON EXP
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 1 0
7 1 0
8 1 0
9 1 0
10 1 0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
$sample.weights
1 2 3 4 5 6 7 8 9 10
0.8846822 1.9333165 2.0043199 1.4800311 1.4496108 0.1738191 0.3869143 2.3825648 0.3018141 2.8114087
> vfit <- lmFit(v, design)
> vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
> efit <- eBayes(vfit)
> plotSA(efit, main="Final model Mean variance trend")
> #differential expression analysis with efit (see e.g. https://www.biostars.org/p/175792/)
> summary(decideTests(efit))
EXPvsCON
-1 1314
0 13041
1 1571
> top <- topTable(efit, coef = "EXPvsCON", adjust = "fdr", sort.by = "P", lfc = 1, p.value = 0.05, number = Inf)
-----------------------------------------------------------------
A couple of additional pointers. Unless you care about differential transcript expression (as compared to differential gene expression), you might want to collapse the transcripts to genes using the tximport package. If you are using Trinity to do the de novo assembly (and as Aaron suggests, you should pool all your samples for that part), then you can use the Trinity IDs to map transcripts to genes.
Unfortunately, a Trinity 'gene' isn't really what one would normally call a gene, as there can be multiple Trinity genes that map to the same gene if you use blast. My usual pipeline for de novo stuff is to use tximport to collapse to genes, then filter out the low expressing genes. This usually gets you to a tractable number of genes that you can then blast against the nt database (I have a locally installed blast - if you are doing this sort of thing, I would recommend the same for you). I just use a representative transcript from each gene for the blast step.
After you blast, you will almost surely have more duplicated genes, and you can reformulate your tx2gene matrix to accommodate the updated information, and re-collapse. You can also remove any genes that don't align to anything in nt, if you like, as those are things that are by definition unknown, and possibly FAKE NEWS. You end up with GI numbers, which are semi-useful, as you can use the reutils package from CRAN to do queries to the nuccore database to get the gene name and symbol.
Thank you Aaron and James for your explanations/help. Much Appreciated!