Question

differential expression analysis of htseq data with edgeR/voom

2

Entering edit mode

wd ▴ 30

@wd-7410

Last seen 5.3 years ago

Germany

Dear all

I 'm trying to run a differential expression analysis with edgeR/voom (combining guidelines from https://www.bioconductor.org/help/workflows/RNAseq123/ and http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html) and using count tables from 10 RNAseq samples (5 "EXP" (treatment) samples and 5 "CON" (control) samples). These count tables were obtained by mapping RNA reads against a de novo transcriptome (116886 transcripts) and counting with htseqcount. Please find below the dashed line the workflow of my diff. expression analysis.

Several question came to my mind while running this analysis:

- overall: is this workflow sound, any other parameter I should take into account for the fact that reads were mapped against a de novo transcriptome with > 100000 transcripts?

- based on the PCAplot and heatmap BEFORE filtering for transcripts with low number of reads, I would consider R4 as an outlier and exclude it from further analysis while based on the heatmap AFTER filtering for transcripts with low number of reads and taking into account that S1 has the lowest sample weight (0.17) I would consider both S1 and R4 as outliers and exclude them from the DE analysis. Is this correct?

- using voomwithQualityweights (see outputgraph here) and limma eBayes I obtained 559 DE transcripts (fc>2, adj. p value < 0.05), however, using voom (not included in the workflow above) I only have 3 DE transcripts...making me wonder whether I can trust the DE analysis (but I guess only qPCR validation could provide an answer for this question)

-last, in the RNAseq123 guidelines it is mentioned to use the treat method to calculate p-values from empirical Bayes moderated t-statistics with a minimum log-FC requirement. However, when I use the treat method (tfit <- treat(vfit, lfc=1), not included in the workflow above) I only obtained 3 DE transcripts compared to 559 DE transcripts using efit as explained here and here.

Any suggestion/answer would be much appreciated!

Regards

Wannes

-----------------------------------------------------------------

> directory="/Users/WD/"
> ##import htseq count files
> files <- grep("counts.txt", list.files(directory), value = TRUE)
> x <- readDGE(files, columns=c(1,2),header=FALSE)
Meta tags detected: __no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
> #remove HTSeq Metatags:__no_feature, __ambiguous, __too_low_aQual, __not_aligned, __alignment_not_unique
> MetaTags <- grep("^__", rownames(x))
> x <- x[-MetaTags, ]
> dim(x)
[1] 116886 10

> ##organisation of samples
> #abbreviate samplenames
> samplenames <- substring(colnames(x),1,2)
> samplenames
[1] "R1" "R2" "R3" "R4" "R5" "S1" "S2" "S3" "S4" "S5"
> #assign abbreviated samplenames to colnames of x
> colnames(x) <- samplenames
> #make groups
> group <- as.factor(c("EXP", "EXP", "EXP", "EXP", "EXP", "CON","CON", "CON", "CON", "CON"))
> x$samples$group <- group
> x$samples

> ##heatplot 500 most variable genes before filtering
> lcpm<-cpm(x,log=TRUE)
> #select most variable genes
> var_genes <- apply(lcpm, 1, var)
> select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
> #subsetting
> highly_variable_lcpm <- lcpm[select_var,]
> dim(highly_variable_lcpm)
[1] 500 10
> #setup colors
> mypalette <- brewer.pal(11,"RdYlBu")
> morecols <- colorRampPalette(mypalette)
> library(gplots)                                                                                                                                                                            > heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",scale="row")

> ##MDS_plot_before_filtering
> col.group <- group
> levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")
> col.group <- as.character(col.group)
> pdf("MDS_plot_before_filtering.pdf")
> plotMDS(lcpm, labels=samplenames, col=col.group)
> title(main="A. Sample groups")

> ##remove transcripts with low reads (cpm above 1 in at least 5 out of 10 samples)
> cpm <- cpm(x)
> keep.exprs <- rowSums(cpm>1)>=5
> x <- x[keep.exprs, keep.lib.sizes=FALSE]
> cpm <- cpm(x)
> dim(x)
[1] 15926    10

> ##heatplot 500 most variable genes after filtering
> lcpm<-cpm(x,log=TRUE)
> #select most variable genes
> var_genes <- apply(lcpm, 1, var)
> select_var <- names(sort(var_genes, decreasing=TRUE))[1:500]
> #subsetting
> highly_variable_lcpm <- lcpm[select_var,]
> dim(highly_variable_lcpm)
[1] 500 10
> #setup colors
> mypalette <- brewer.pal(11,"RdYlBu")
> morecols <- colorRampPalette(mypalette)
> library(gplots)
> heatmap.2(highly_variable_lcpm,col=rev(morecols(50)),trace="none", main="Top 500 most variable genes across samples",scale="row")

> ##calculate normalisation factors
> x <- calcNormFactors(x, method = "TMM")

> ##setup design
> design <- model.matrix(~0+group)
> #remove group in design colnames
> colnames(design) <- gsub("group", "",colnames(design))
> design
   CON EXP
1    0   1
2    0   1
3    0   1
4    0   1
5    0   1
6    1   0
7    1   0
8    1   0
9    1   0
10   1   0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

> ##setup contrasts
> contr.matrix <- makeContrasts(EXPvsCON = EXP-CON,levels = colnames(design))
> contr.matrix
      Contrasts
Levels EXPvsCON
   CON       -1
   EXP        1

> ##voom WithQualityWeights
> v <- voomWithQualityWeights(counts = x, design = design, normalize.method = "none", plot = TRUE)
> v
An object of class "EList"
$targets
                           files group lib.size norm.factors
R1 R1_Mixed_no_strand_counts.txt   EXP 17826523    0.7893778
R2 R2_Mixed_no_strand_counts.txt   EXP 18369554    0.8584186
R3 R3_Mixed_no_strand_counts.txt   EXP 19888650    1.0453065
R4 R4_Mixed_no_strand_counts.txt   EXP 31397203    1.7394248
R5 R5_Mixed_no_strand_counts.txt   EXP 9456578    0.5311490
S1 S1_Mixed_no_strand_counts.txt   CON 17908499    0.8544335
S2 S2_Mixed_no_strand_counts.txt   CON 19757373    1.0549119
S3 S3_Mixed_no_strand_counts.txt   CON 36983226    1.8358870
S4 S4_Mixed_no_strand_counts.txt   CON 13900378    0.7013372
S5 S5_Mixed_no_strand_counts.txt   CON 26794838    1.3166885

$E
Samples
Tags R1 R2 R3 R4 R5 S1 S2 S3 S4 S5
locus_000001.1 4.317752 4.619337 4.455964 4.693660 4.619768 4.960255 4.645507 4.746124 4.216968 4.790420
locus_000002.1 13.983810 13.686776 14.098252 13.247817 15.196448 13.060765 12.863594 12.685275 14.536459 13.116237
locus_000003.1 2.695796 2.817563 2.820553 2.601083 2.918553 3.234032 2.890438 3.381788 2.073312 3.303241
locus_000005.1 4.856671 4.827279 4.772263 4.816154 4.601032 2.951170 4.539602 4.478576 4.549461 4.407133
locus_000006.1 2.311652 2.167077 2.434319 2.326644 2.629046 2.414856 1.990301 1.785554 2.620800 2.250470
15921 more rows ...

$weights
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 17.718627 39.148292 41.651738 34.292450 21.710199 3.6379280 8.329925 57.114196 5.771753 64.777525
[2,] 1.942678 4.192361 4.204877 2.672345 4.173634 0.5159848 1.100322 5.185488 1.001878 7.011152
[3,] 9.552051 21.208064 22.935908 21.786614 11.197957 2.2399819 5.267825 44.535453 3.388637 45.161314
[4,] 18.620894 41.036334 43.500842 35.012678 23.480366 3.3555401 7.749195 55.712939 5.218605 61.707257
[5,] 8.225308 18.257714 19.727219 18.585605 9.399346 1.4550842 3.410234 29.234620 2.192492 29.042470
15921 more rows ...

$design
   CON EXP
1    0   1
2    0   1
3    0   1
4    0   1
5    0   1
6    1   0
7    1   0
8    1   0
9    1   0
10   1   0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

$sample.weights
1 2 3 4 5 6 7 8 9 10
0.8846822 1.9333165 2.0043199 1.4800311 1.4496108 0.1738191 0.3869143 2.3825648 0.3018141 2.8114087

> vfit <- lmFit(v, design)
> vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
> efit <- eBayes(vfit)
> plotSA(efit, main="Final model Mean variance trend")

> #differential expression analysis with efit (see e.g. https://www.biostars.org/p/175792/)
> summary(decideTests(efit))
   EXPvsCON
-1     1314
0     13041
1      1571
> top <- topTable(efit, coef = "EXPvsCON", adjust = "fdr", sort.by = "P", lfc = 1, p.value = 0.05, number = Inf)
-----------------------------------------------------------------

voom htseq edgeR limma RNASeq123 • 5.1k views

ADD COMMENT • link updated 6.5 years ago by Gordon Smyth 52k • written 8.1 years ago by wd ▴ 30

score 5 · Answer 1 · 2017-06-08

The specifics of the read alignment shouldn't have a major effect on the performance of the DE analysis, provided that the alignment (and the prior de novo assembly) was done "blind", i.e., without using any information about the experimental design. Otherwise, you could get one assembly for one group and another assembly for another group, in which case you would obviously get (meaningless) differential expression.

Don't do more outlier removal than you have to. That's a deep rabbit hole to go down; every time you remove outliers, the remaining samples look more like outliers (by virtue of having nothing else to compare to). And once you start removing outliers iteratively, when do you stop? All replicate samples will have some measure of variability, so obviously some samples will be further away from the group mean than others - yet these samples would not be considered to be outliers. Unless you have a good technical reason for removing them (e.g., not sequenced enough), I would leave them in. Rely on arrayWeights to downweight them in the linear model fit.

voomWithQualityWeights, as the name suggests, will run voom with quality weights (computed by arrayWeights) where a sample is downweighted if its expression values consistently deviate more from the group mean more than that of its replicates. Presumably the S1 sample is quite low quality and gets downweighted by arrayWeights to reduce its impact on the linear model fit. In contrast, voom by itself will not use quality weights, so the variability of S1 will inflate the variance estimate and reduce detection power. If your samples were all high quality, these two functions should give similar results; clearly this is not the case.

Finally, treat will give you fewer DE genes than applying an ad hoc log-fold change threshold. This is obvious because treat tests whether the absolute log-fold change is significantly greater than the specified threshold. Applying a threshold just asks whether or not the estimated absolute log-fold change is greater than the threshold. For example, if I got a gene with a log fold change of 1.00001, it would pass my ad hoc threshold of 1 but not get detected by treat, because 1.00001 would not be significantly greater than 1 in typical experiments. I usually set lfc to log2(1.5), with the intent of avoiding detection of DE genes with low log-fold changes.

score 3 · Answer 2 · 2019-01-09

It isn't correct to use htseq-count with transcript-level annotation. This is because reads cannot be reliably assigned to a transcript when transcripts overlap. You have three choices:

If you know which gene each transcript is associated with, then you could run htseq-count or (better) featureCounts to get gene-level counts rather than transcript-level counts.
You could run Salmon on the transcript annotation and then us tximport to assemble into gene-level counts. Again, you need to know which gene each transript belongs to.
You could run Salmon with bootstrapping, then use edgeR::catchSalmon to read the results, then use edgeR or voom to do a transcript-level DE analysis.