Question

negative values for trended dispersion with edgeR

0

Entering edit mode

Emmanouela Repapi ▴ 20

@emmanouela-repapi-6515

Last seen 3.3 years ago

United Kingdom

Hello,

I have a dataset with very high coverage on each sample for which I want to run edgeR on to get the differentially expressed transcripts between 2 types of cells with 3 replicates each.
The analysis runs smoothly and I have many DE transcripts, but when I plot the BCV I get a plot (which is slightly weird with the trended dispersion line being bumpy - see attached) and the warning:

Warning message: In sqrt(y$trended.dispersion) : NaNs produced

When I look into it, it turns out I have many negative values for the trended dispersion (for 160 transcripts) but the actual counts for these transcripts don't look unusual and I can't tell what is wrong.

My code looks like this:

counts_data <- DGEList(counts=round(all_samples[,2:7]),group=rep(c("PC","GS"), each=3), genes=my_genes) 
counts_cpm <- cpm(counts_data)
filter_cpm <- rowSums(counts_cpm> 1)>= min(table(group))
counts_data <- counts_data[filter_cpm,]
counts_data$samples$lib.size <- colSums(counts_data$counts)
counts_cpm <- cpm(counts_data) 
counts_data_norm <- calcNormFactors(counts_data)
set_d <- estimateCommonDisp(counts_data_norm, verbose=T)
set_d <- estimateTrendedDisp(set_d)
set_d <- estimateTagwiseDisp(set_d)

Some of the transcripts with the negative trended dispersion look like this:

set_d[set_d$trended.dispersion<0]
An object of class "DGEList"
$counts
PC1_EstimatedNumReads PC2_EstimatedNumReads PC3_EstimatedNumReads GS1_EstimatedNumReads GS2_EstimatedNumReads GS3_EstimatedNumReads
ENST00000196551 178336 170990 162887 100756 77829 81842
ENST00000202773 164480 154960 155601 102093 84593 89253
ENST00000216181 247587 287651 278037 285772 258325 262454
ENST00000216281 829577 717654 834776 504985 388358 412351
ENST00000216962 66190 73460 60030 165816 152397 175372
155 more rows ...

$samples
group lib.size norm.factors
PC1_EstimatedNumReads PC 2.17e+08 0.982
PC2_EstimatedNumReads PC 2.21e+08 0.996
PC3_EstimatedNumReads PC 2.18e+08 0.986
GS1_EstimatedNumReads GS 2.18e+08 1.031
GS2_EstimatedNumReads GS 1.89e+08 1.013
GS3_EstimatedNumReads GS 1.97e+08 0.993

$genes
Transcript Length Geneid Transcript_name
359 ENST00000196551 740 RPS5 RPS5-201
401 ENST00000202773 947 RPL6 RPL6-001
585 ENST00000216181 7501 MYH9 MYH9-001
607 ENST00000216281 3379 HSP90AA1 HSP90AA1-002
679 ENST00000216962 4134 PYGB PYGB-001
155 more rows ...

$common.dispersion
[1] 0.0511

$pseudo.counts
PC1_EstimatedNumReads PC2_EstimatedNumReads PC3_EstimatedNumReads GS1_EstimatedNumReads GS2_EstimatedNumReads GS3_EstimatedNumReads
ENST00000196551 175182 162868 159302 94074 85077 87631
ENST00000202773 161571 147600 152177 95323 92471 95566
ENST00000216181 243208 273988 271918 266821 282381 281018
ENST00000216281 814904 683566 816405 471496 424523 441517
ENST00000216962 65019 69971 58709 154820 166589 187776
155 more rows ...

$pseudo.lib.size
[1] 2.1e+08

$AveLogCPM
[1] 9.25 9.21 10.33 11.50 9.12
155 more elements ...

$trended.dispersion
[1] -0.00180 -0.00159 -0.00782 -0.01435 -0.00112
155 more elements ...

$prior.n
[1] 2.5

$tagwise.dispersion
[1] 0.00601 0.00559 0.00604 0.00720 0.00785
155 more elements ...

How could it be that I am getting negative values for the trended dispersion (especially when the respective tagwise dispersions are small but positive). Is this behaviour alarming or is it OK to be using the results from the analysis? Its probably worth mentioning that using a cpm>5 cutoff for the filtering reduces the problem ( I only get 5 negative trended dispersion values) but does not get rid of it. Increasing it more also doesn't help.

Any help would be very much appreciated!

Best wishes,

Emmanouela

Session Info:

R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit)

locale: [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1 [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 LC_PAPER=en_GB.ISO-8859-1 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C

attached base packages: [1] splines stats graphics grDevices utils datasets methods base

other attached packages: [1] edgeR_3.8.5 limma_3.22.1

loaded via a namespace (and not attached): [1] tools_3.1.1

rnaseq estimatedispersions edgeR • 1.8k views

ADD COMMENT • link updated 9.8 years ago by Yunshun Chen ▴ 900 • written 9.8 years ago by Emmanouela Repapi ▴ 20

0

Entering edit mode

You're a version out of date. See if the problem shows up in the latest version of edgeR (3.10.2) and Bioconductor (3.2). You'll also have to update R to version 3.2.0 or higher.

ADD REPLY • link 9.8 years ago Aaron Lun ★ 28k

score 2 · Accepted Answer · 2015-09-21

The way estimateTrendedDisp() works is as follows:

It first splits the data into a number of bins by their expression level. It then estimates the common dispersion for each bin and fits a spline curve through all those bin-wise common dispersions to get a dispersion trend.

The tagwise dispersion values are quite large for the lowly expressed genes in your data. One would suspect the unusual behaviour of the dispersion trend might be driven by the extremely high common dispersion estimate in one (or some) of the bins at the lower end. You also mentioned that increasing the filtering threshold reduces the problem.

However, the problem still remains after the threshold is increased. That indicates the lack of filtering might not be the reason why the trended dispersion goes below zero. I can't tell what the reason is without seeing the data. Besides, the library sizes are all around 200 million. Personally, I would decrease the filtering threshold instead.

We've noticed that estimating the trended dispersion by the spline method is not always stable. That's why we recommend the weighted likelihood approach in which the common, trended and tagwise dispersions can be estimated in one run. It is implemented in estimateDisp(). Try

set_d <- estimateDisp(counts_data_norm)

and see whether it solves the problem.