Question: differential expression analysis of htseq data with edgeR/voom
2
2.4 years ago by
wd30
Germany
wd30 wrote:

Dear all

I 'm trying to run a differential expression analysis with edgeR/voom (combining guidelines from https://www.bioconductor.org/help/workflows/RNAseq123/  and http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html) and using count tables from 10 RNAseq samples (5 "EXP" (treatment) samples and 5 "CON" (control) samples). These count tables were obtained by mapping RNA reads against a de novo transcriptome (116886 transcripts) and counting with htseqcount. Please find below the dashed line the workflow of my diff. expression analysis.

Several question came to my mind while running this analysis:

- overall: is this workflow sound, any other parameter I should take into account for the fact that reads were mapped against a de novo transcriptome with > 100000  transcripts?

- based on the PCAplot and heatmap BEFORE filtering for transcripts with low number of reads, I would consider R4 as an outlier and exclude it from further analysis while based on the heatmap AFTER filtering for transcripts with low number of reads and taking into account that S1 has the lowest sample weight (0.17) I would consider both S1 and R4 as outliers and exclude them from the DE analysis. Is this correct?

- using voomwithQualityweights (see outputgraph here) and limma eBayes I obtained 559 DE transcripts (fc>2, adj. p value < 0.05), however, using voom (not included in the workflow above) I only have 3 DE transcripts...making me wonder whether I can trust the DE analysis (but I guess only qPCR validation could provide an answer for this question)

-last, in the RNAseq123 guidelines it is mentioned to use the treat method to calculate p-values from empirical Bayes moderated t-statistics with a minimum log-FC requirement. However, when I use the treat method (tfit <- treat(vfit, lfc=1), not included in the workflow above) I only obtained 3 DE transcripts compared to 559 DE transcripts using efit as explained here and here.

Any suggestion/answer would be much appreciated!

Regards

Wannes

-----------------------------------------------------------------

> directory="/Users/WD/"
> ##import htseq count files
> files <- grep("counts.txt", list.files(directory), value = TRUE)
Meta tags detected: __no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
> #remove HTSeq Metatags:__no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
> MetaTags <- grep("^__", rownames(x))
> x <- x[-MetaTags, ]
> dim(x)
[1] 116886     10

> ##organisation of samples
> #abbreviate samplenames
> samplenames <- substring(colnames(x),1,2)
> samplenames
[1] "R1" "R2" "R3" "R4" "R5" "S1" "S2" "S3" "S4" "S5"
> #assign abbreviated samplenames to colnames of x
> colnames(x) <- samplenames
> #make groups
> group <- as.factor(c("EXP", "EXP", "EXP", "EXP", "EXP", "CON","CON", "CON", "CON", "CON"))
> x$samples$group <- group
> x$samples > ##heatplot 500 most variable genes before filtering > lcpm<-cpm(x,log=TRUE) > #select most variable genes > var_genes <- apply(lcpm, 1, var) > select_var <- names(sort(var_genes, decreasing=TRUE))[1:500] > #subsetting > highly_variable_lcpm <- lcpm[select_var,] > dim(highly_variable_lcpm) [1] 500 10 > #setup colors > mypalette <- brewer.pal(11,"RdYlBu") > morecols <- colorRampPalette(mypalette) > library(gplots) > heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",scale="row") > ##MDS_plot_before_filtering > col.group <- group > levels(col.group) <- brewer.pal(nlevels(col.group), "Set1") > col.group <- as.character(col.group) > pdf("MDS_plot_before_filtering.pdf") > plotMDS(lcpm, labels=samplenames, col=col.group) > title(main="A. Sample groups") > ##remove transcripts with low reads (cpm above 1 in at least 5 out of 10 samples) > cpm <- cpm(x) > keep.exprs <- rowSums(cpm>1)>=5 > x <- x[keep.exprs, keep.lib.sizes=FALSE] > cpm <- cpm(x) > dim(x) [1] 15926 10 > ##heatplot 500 most variable genes after filtering > lcpm<-cpm(x,log=TRUE) > #select most variable genes > var_genes <- apply(lcpm, 1, var) > select_var <- names(sort(var_genes, decreasing=TRUE))[1:500] > #subsetting > highly_variable_lcpm <- lcpm[select_var,] > dim(highly_variable_lcpm) [1] 500 10 > #setup colors > mypalette <- brewer.pal(11,"RdYlBu") > morecols <- colorRampPalette(mypalette) > library(gplots) > heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",scale="row") > ##calculate normalisation factors > x <- calcNormFactors(x, method = "TMM") > ##setup design > design <- model.matrix(~0+group) > #remove group in design colnames > colnames(design) <- gsub("group", "",colnames(design)) > design CON EXP 1 0 1 2 0 1 3 0 1 4 0 1 5 0 1 6 1 0 7 1 0 8 1 0 9 1 0 10 1 0 attr(,"assign") [1] 1 1 attr(,"contrasts") attr(,"contrasts")$group
[1] "contr.treatment"

> ##setup contrasts
> contr.matrix <- makeContrasts(EXPvsCON = EXP-CON,levels = colnames(design))
> contr.matrix
Contrasts
Levels EXPvsCON
CON       -1
EXP        1

> ##voom WithQualityWeights
> v <- voomWithQualityWeights(counts = x, design = design, normalize.method = "none", plot = TRUE)
> v
An object of class "EList"
$targets files group lib.size norm.factors R1 R1_Mixed_no_strand_counts.txt EXP 17826523 0.7893778 R2 R2_Mixed_no_strand_counts.txt EXP 18369554 0.8584186 R3 R3_Mixed_no_strand_counts.txt EXP 19888650 1.0453065 R4 R4_Mixed_no_strand_counts.txt EXP 31397203 1.7394248 R5 R5_Mixed_no_strand_counts.txt EXP 9456578 0.5311490 S1 S1_Mixed_no_strand_counts.txt CON 17908499 0.8544335 S2 S2_Mixed_no_strand_counts.txt CON 19757373 1.0549119 S3 S3_Mixed_no_strand_counts.txt CON 36983226 1.8358870 S4 S4_Mixed_no_strand_counts.txt CON 13900378 0.7013372 S5 S5_Mixed_no_strand_counts.txt CON 26794838 1.3166885$E
Samples
Tags                    R1        R2        R3        R4        R5        S1        S2        S3        S4        S5
locus_000001.1  4.317752  4.619337  4.455964  4.693660  4.619768  4.960255  4.645507  4.746124  4.216968  4.790420
locus_000002.1 13.983810 13.686776 14.098252 13.247817 15.196448 13.060765 12.863594 12.685275 14.536459 13.116237
locus_000003.1  2.695796  2.817563  2.820553  2.601083  2.918553  3.234032  2.890438  3.381788  2.073312  3.303241
locus_000005.1  4.856671  4.827279  4.772263  4.816154  4.601032  2.951170  4.539602  4.478576  4.549461  4.407133
locus_000006.1  2.311652  2.167077  2.434319  2.326644  2.629046  2.414856  1.990301  1.785554  2.620800  2.250470
15921 more rows ...

$weights [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 17.718627 39.148292 41.651738 34.292450 21.710199 3.6379280 8.329925 57.114196 5.771753 64.777525 [2,] 1.942678 4.192361 4.204877 2.672345 4.173634 0.5159848 1.100322 5.185488 1.001878 7.011152 [3,] 9.552051 21.208064 22.935908 21.786614 11.197957 2.2399819 5.267825 44.535453 3.388637 45.161314 [4,] 18.620894 41.036334 43.500842 35.012678 23.480366 3.3555401 7.749195 55.712939 5.218605 61.707257 [5,] 8.225308 18.257714 19.727219 18.585605 9.399346 1.4550842 3.410234 29.234620 2.192492 29.042470 15921 more rows ...$design
CON EXP
1    0   1
2    0   1
3    0   1
4    0   1
5    0   1
6    1   0
7    1   0
8    1   0
9    1   0
10   1   0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group [1] "contr.treatment"$sample.weights
1         2         3         4         5         6         7         8         9        10
0.8846822 1.9333165 2.0043199 1.4800311 1.4496108 0.1738191 0.3869143 2.3825648 0.3018141 2.8114087

> vfit <- lmFit(v, design)
> vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
> efit <- eBayes(vfit)
> plotSA(efit, main="Final model Mean variance trend")

> #differential expression analysis with efit (see e.g. https://www.biostars.org/p/175792/)
> summary(decideTests(efit))
EXPvsCON
-1     1314
0     13041
1      1571
> top <- topTable(efit, coef = "EXPvsCON", adjust = "fdr", sort.by = "P", lfc = 1, p.value = 0.05, number = Inf)
-----------------------------------------------------------------

limma edger voom htseq rnaseq123 • 1.8k views
modified 9 months ago by Gordon Smyth38k • written 2.4 years ago by wd30
Answer: differential expression analysis of htseq data with edgeR/voom
5
2.4 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

The specifics of the read alignment shouldn't have a major effect on the performance of the DE analysis, provided that the alignment (and the prior de novo assembly) was done "blind", i.e., without using any information about the experimental design. Otherwise, you could get one assembly for one group and another assembly for another group, in which case you would obviously get (meaningless) differential expression.

Don't do more outlier removal than you have to. That's a deep rabbit hole to go down; every time you remove outliers, the remaining samples look more like outliers (by virtue of having nothing else to compare to). And once you start removing outliers iteratively, when do you stop? All replicate samples will have some measure of variability, so obviously some samples will be further away from the group mean than others - yet these samples would not be considered to be outliers. Unless you have a good technical reason for removing them (e.g., not sequenced enough), I would leave them in. Rely on arrayWeights to downweight them in the linear model fit.

voomWithQualityWeights, as the name suggests, will run voom with quality weights (computed by arrayWeights) where a sample is downweighted if its expression values consistently deviate more from the group mean more than that of its replicates. Presumably the S1 sample is quite low quality and gets downweighted by arrayWeights to reduce its impact on the linear model fit. In contrast, voom by itself will not use quality weights, so the variability of S1 will inflate the variance estimate and reduce detection power. If your samples were all high quality, these two functions should give similar results; clearly this is not the case.

Finally, treat will give you fewer DE genes than applying an ad hoc log-fold change threshold. This is obvious because treat tests whether the absolute log-fold change is significantly greater than the specified threshold. Applying a threshold just asks whether or not the estimated absolute log-fold change is greater than the threshold. For example, if I got a gene with a log fold change of 1.00001, it would pass my ad hoc threshold of 1 but not get detected by treat, because 1.00001 would not be significantly greater than 1 in typical experiments. I usually set lfc to log2(1.5), with the intent of avoiding detection of DE genes with low log-fold changes.

2

A couple of additional pointers. Unless you care about differential transcript expression (as compared to differential gene expression), you might want to collapse the transcripts to genes using the tximport package. If you are using Trinity to do the de novo assembly (and as Aaron suggests, you should pool all your samples for that part), then you can use the Trinity IDs to map transcripts to genes.

Unfortunately, a Trinity 'gene' isn't really what one would normally call a gene, as there can be multiple Trinity genes that map to the same gene if you use blast. My usual pipeline for de novo stuff is to use tximport to collapse to genes, then filter out the low expressing genes. This usually gets you to a tractable number of genes that you can then blast against the nt database (I have a locally installed blast - if you are doing this sort of thing, I would recommend the same for you). I just use a representative transcript from each gene for the blast step.

After you blast, you will almost surely have more duplicated genes, and you can reformulate your tx2gene matrix to accommodate the updated information, and re-collapse. You can also remove any genes that don't align to anything in nt, if you like, as those are things that are by definition unknown, and possibly FAKE NEWS. You end up with GI numbers, which are semi-useful, as you can use the reutils package from CRAN to do queries to the nuccore database to get the gene name and symbol.

Thank you Aaron and James for your explanations/help. Much Appreciated!

Answer: differential expression analysis of htseq data with edgeR/voom
3
9 months ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

It isn't correct to use htseq-count with transcript-level annotation. This is because reads cannot be reliably assigned to a transcript when transcripts overlap. You have three choices:

1. If you know which gene each transcript is associated with, then you could run htseq-count or (better) featureCounts to get gene-level counts rather than transcript-level counts.

2. You could run Salmon on the transcript annotation and then us tximport to assemble into gene-level counts. Again, you need to know which gene each transript belongs to.

3. You could run Salmon with bootstrapping, then use edgeR::catchSalmon to read the results, then use edgeR or voom to do a transcript-level DE analysis.