Question

DESeq2: Extremely large log2 expression values with little or no read evidence

0

Entering edit mode

wlorenz • 0

@wlorenz-14522

Last seen 3.0 years ago

United States

I've tried to find if anyone else has seen this, but none describe what I'm seeing. I have a pretty simple/small data set: 6 groups, 3 reps/group (2 ctrl and 4 exptl). Read data were mapped to a Trinity assembly w/bt2 and the expected counts generated with RSEM (gene_results) were used to generate the matrix.

I like to bring over the normalized counts and group average fpkms (calc from RSEM fpkms) to use for downstream filtering/decisions as to which genes to pursue. In this dataset there are usually one or two with high log2 and low FDR, but with no or little read evidence. I'd like to know why this is happening and how to prevent it. Pehaps there is something to add to my code that will prevent it? I have also run the analysis with and without pre-filtering the dds and i see the same thing. I have also checked (several times) import of the norm'd read and avg fpkm values, so no mistake there.

Finally, exploratory analysis of this dataset shows that there is a HUGE amount of variance and PCA looks terrible... so there's that.

If I'm missing something or if anyone has ideas or suggestions as to how to mitigate this, I'd love to know.

My core DESeq2 code is shown below the these outputs.

gene_id	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	CTL_FPKM_AVG	PA_FPKM_AVG	CTL1	CTL2	CTL3	PA1	PA2	PA3
TRINITY_DN79696_c1_g4	43.9	28.7	5.3	5.4	5.31E-008	2.08E-004	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
TRINITY_DN78128_c3_g1	260.8	25.8	4.1	6.3	3.47E-010	1.92E-006	30.7	0.0	673.2	643.0	886.2	0.0	0.0	0.0
gene_id	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj	CTL_FPKM_AVG	KS_FPKM_AVG	CTL1	CTL2	CTL3	KS1	KS2	KS3
TRINITY_DN71439_c1_g5	32.6	29.0	5.3	5.5	3.86E-008	1.10E-004	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
TRINITY_DN68768_c2_g1	19.7	26.8	5.3	5.1	3.69E-007	7.17E-004	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
TRINITY_DN85457_c12_g6	22.3	25.0	5.3	4.8	2.02E-006	2.62E-003	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
TRINITY_DN69591_c4_g1	137.2	24.5	5.1	4.8	1.75E-006	2.41E-003	30.5	0.0	468.2	0.0	578.2	0.0	0.0	0.0

DESeq2 code is as follows:

COUNTS <- as.matrix(read.table(file="count_matrix.txt", sep="\t", header=TRUE, row.names=1))
COUNTS_rd<-round(COUNTS)
mode(COUNTS_rd) <- "integer"
META <- read.table(file ="metadata.txt", sep="\t", header=TRUE, row.names=1)

META

sample_id	Comparison
CTL1	CTL
CTL2	CTL
CTL3	CTL
CTLAG1	CTLAG
CTLAG2	CTLAG
CTLAG3	CTLAG
PA1	PA
PA2	PA
PA3	PA
PAAG1	PAAG
PAAG2	PAAG
PAAG3	PAAG
KS1	KS
KS2	KS
KS3	KS
KSAG1	KSAG
KSAG2	KSAG
KSAG3	KSAG

dds <- DESeqDataSetFromMatrix(countData=COUNTS_rd, colData=META, design = ~Comparison)

nrow(dds) #rows = 136177

idx <- rowSums( counts(dds, normalized=TRUE) >= 5 ) >= 3 # prefiltering

dds <- dds[idx,]

nrow(dds) ##rows = 74283

dds <- DESeq(dds)
resultsNames(dds)

counts_norm <- counts(dds,normalized=TRUE)

write.table(counts_norm,file="normed_counts.txt",row.names=TRUE,col.names=TRUE,quote=FALSE,sep="\t")

res1 <- results(dds, contrast=c("Comparison","CTL", "PA")) # first one of four comparisons.

Not sure if it's needed, but here's my session info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] vsn_3.44.0 ggplot2_2.2.1
[3] dplyr_0.7.2 amap_0.8-14
[5] ape_4.1 genefilter_1.58.1
[7] PoiClaClu_1.0.2 RColorBrewer_1.1-2
[9] pheatmap_1.0.8 DESeq2_1.16.1
[11] SummarizedExperiment_1.6.3 DelayedArray_0.2.7
[13] matrixStats_0.52.2 Biobase_2.36.2
[15] GenomicRanges_1.28.4 GenomeInfoDb_1.12.2
[17] IRanges_2.10.2 S4Vectors_0.14.3
[19] BiocGenerics_0.22.0

loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.4.2 Formula_1.2-2
[4] assertthat_0.2.0 affy_1.54.0 latticeExtra_0.6-28
[7] blob_1.1.0 GenomeInfoDbData_0.99.0 RSQLite_2.0
[10] backports_1.1.0 lattice_0.20-35 limma_3.32.3
[13] glue_1.1.1 digest_0.6.12 XVector_0.16.0
[16] checkmate_1.8.3 colorspace_1.3-2 preprocessCore_1.38.1
[19] htmltools_0.3.6 Matrix_1.2-10 plyr_1.8.4
[22] pkgconfig_2.0.1 XML_3.98-1.9 zlibbioc_1.22.0
[25] xtable_1.8-2 scales_0.4.1 affyio_1.46.0
[28] BiocParallel_1.10.1 htmlTable_1.9 tibble_1.3.3
[31] annotate_1.54.0 nnet_7.3-12 lazyeval_0.2.0
[34] survival_2.41-3 magrittr_1.5 memoise_1.1.0
[37] nlme_3.1-131 foreign_0.8-69 BiocInstaller_1.26.1
[40] tools_3.4.2 data.table_1.10.4 stringr_1.2.0
[43] munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.6
[46] AnnotationDbi_1.38.1 bindrcpp_0.2 compiler_3.4.2
[49] rlang_0.1.1 grid_3.4.2 RCurl_1.95-4.8
[52] htmlwidgets_0.9 labeling_0.3 bitops_1.0-6
[55] base64enc_0.1-3 gtable_0.2.0 DBI_0.7
[58] R6_2.2.2 gridExtra_2.2.1 knitr_1.16
[61] bit_1.1-12 bindr_0.1 Hmisc_4.0-3
[64] stringi_1.1.5 Rcpp_0.12.12 geneplotter_1.54.0
[67] rpart_4.1-11 acepack_1.4.1

deseq2 • 845 views

ADD COMMENT • link updated 6.4 years ago by Michael Love 41k • written 6.4 years ago by wlorenz • 0

score 0 · Answer 1 · 2017-12-01

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 14 hours ago

United States

Can you send me the 'dds' to maintainer("DESeq2") and I'll take a look at what results table I get? I'd be interested to see if I can reproduce on my end.

ADD COMMENT • link 6.4 years ago Michael Love 41k

0

Entering edit mode

Thanks, Michael. Just sent you an e-mail with DropBox link to data.

ADD REPLY • link 6.4 years ago wlorenz • 0

0

Entering edit mode

I took a look, and for these genes there are large counts combined with 0's, within a single group, which are enough of a problem for the negative binomial fit, and then there are not enough samples to allow for the outlier removal techniques. The bad fit is throwing off the Wald statistic.

The problem can be avoided if you use likelihood ratio tests instead here.

So you can run the following code one time, then re-use 'dds' for multiple comparisons:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds) # this can use a design=~Comparison

Then for example to contrast PA vs CTL, you would use the following code:

full <- model.matrix(~Comparison, colData(dds))
dds.pa.vs.ctl <- nbinomLRT(dds, full=full,
                           reduced=subset(full, select=-ComparisonPA))
res.pa.vs.ctl <- results(dds.pa.vs.ctl, name="ComparisonPA")

The LFC will still be large, but the p-values will be appropriate.

For comparing against CTLAG, you just re-level (you don't need to re-run dispersion estimation):

Comp2 <- dds$Comparison
Comp2 <- relevel(Comp2, "CTLAG")
full2 <- model.matrix(~Comp2)
subset(full2, select=-ComparisonKSAG) # now use this as reduced

ADD REPLY • link 6.4 years ago Michael Love 41k

0

Entering edit mode

Thank you for the code, I'll try it.

So, if I understand what you're saying (maybe not), the first gene in my example has other matrix groups (not being compared) with norm count values of 0 1 300, 0 57 0 and 0 133 1 and it is these groups and their lack of sufficient replication (is this impacting/preventing proper Cook's filtering?) that force a bad fit for this particular gene in the comparison groups? And, when I see this occur, I should handle those affected group comparisons independently/differently as in above and relevel for all others?

ADD REPLY • link 6.4 years ago wlorenz • 0

0

Entering edit mode

The mix of zeros and large counts within groups, and the small replicate number means I would use the LRT for all comparisons in this dataset.

Also recommend LRT for similarly highly noisy datasets of similar small replicate number in the future.

ADD REPLY • link 6.4 years ago Michael Love 41k