Question

DESeq2 input: normalized counts that were processed further


karenchait841 • 0


Hello,

I have data that were processed further after normalization with DESeq2. The samples were contaminated with melanoma cells, so for each sample we subtracted that sample's contamination percentage from the count of every gene.

What is the best way to continue the DE analysis with DESeq2 using these processed data rather than the raw data?

1- Round the values and use them as input to DESeq2.

2- Convert the values back (approximately) to the raw counts using the size factors.

3- Other options...

Thank you for your help,

Karen

Session info:

R version 3.3.2 (2016-10-31)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows >= 8 x64 (build 9200)

locale:

[1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255

[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C

[5] LC_TIME=Hebrew_Israel.1255

attached base packages:

[1] parallel stats4 stats graphics grDevices utils datasets methods

[9] base

other attached packages:

[1] BiocInstaller_1.24.0 ggplot2_2.2.1 gplots_3.0.1

[4] RColorBrewer_1.1-2 DESeq2_1.14.1 SummarizedExperiment_1.4.0

[7] Biobase_2.34.0 GenomicRanges_1.26.4 GenomeInfoDb_1.10.3

[10] IRanges_2.8.2 S4Vectors_0.12.2 BiocGenerics_0.20.0

loaded via a namespace (and not attached):

[1] genefilter_1.56.0 gtools_3.5.0 locfit_1.5-9.1

[4] splines_3.3.2 lattice_0.20-34 colorspace_1.3-2

[7] htmltools_0.3.5 base64enc_0.1-3 survival_2.41-2

[10] XML_3.98-1.5 foreign_0.8-67 DBI_0.6

[13] BiocParallel_1.8.1 plyr_1.8.4 stringr_1.2.0

[16] zlibbioc_1.20.0 munsell_0.4.3 gtable_0.2.0

[19] caTools_1.17.1 htmlwidgets_0.8 memoise_1.0.0

[22] labeling_0.3 latticeExtra_0.6-28 knitr_1.15.1

[25] geneplotter_1.52.0 AnnotationDbi_1.36.2 htmlTable_1.9

[28] Rcpp_0.12.9 KernSmooth_2.23-15 acepack_1.4.1

[31] xtable_1.8-2 scales_0.4.1 backports_1.0.5

[34] checkmate_1.8.2 gdata_2.17.0 Hmisc_4.0-2

[37] annotate_1.52.1 XVector_0.14.1 gridExtra_2.2.1

[40] digest_0.6.12 stringi_1.1.2 grid_3.3.2

[43] tools_3.3.2 bitops_1.0-6 magrittr_1.5

[46] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.2

[49] RSQLite_1.1-2 Formula_1.2-1 cluster_2.0.5

[52] Matrix_1.2-7.1 data.table_1.10.4 assertthat_0.1

[55] rpart_4.1-10 nnet_7.3-12

deseq2

updated 7.1 years ago by Ryan C. Thompson • written 7.1 years ago by karenchait841

score 1 · Answer 1 · 2017-03-23

I'm not sure what the point of scaling all the counts in a sample is. If you're subtracting the same percent from every gene's count for a given sample, it will have no net effect on the fold change calculations, since the size factors will undo this scaling. The only effect will be to make the dispersion estimation less accurate, reducing your statistical power. You should definitely analyze the original raw counts, not any transformation of them.

If you want to control for the effect of contamination, you should include it as a covariate in your model. I'm not sure exactly what the right way to do this is, since the percent contamination should be linearly related to gene expression, while the negative binomial GLM coefficients are fit on a log scale. You could use the ns function from the splines package to fit a non-linear function of contamination percent, or you could use the sva package to estimate the confounding effect on the proper log scale from the data itself.

Perhaps others will weigh in on the best way to incorporate the contamination effect into your model, or perhaps you have a statistician in your lab who can advise you.
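The point about size factors undoing a per-sample scaling can be checked with a small base-R sketch. It rests on assumptions that are not in the original post: the counts are simulated, the subtraction is taken to mean multiplying every count in sample j by (1 - contamination fraction of j), and `size_factors` is a hypothetical helper that reimplements the median-of-ratios estimator (DESeq2's default) rather than calling DESeq2 itself.

```r
# Hypothetical helper: median-of-ratios size factors, written in base R.
size_factors <- function(counts) {
  log_counts   <- log(counts)
  log_geo_mean <- rowMeans(log_counts)        # per-gene log geometric mean
  keep <- is.finite(log_geo_mean)             # drop genes with a zero count
  apply(log_counts[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

set.seed(1)
n_genes <- 200
mu  <- runif(n_genes, 50, 2000)               # simulated mean expression
raw <- matrix(rpois(n_genes * 4, lambda = rep(mu, 4)), nrow = n_genes)

# Assumed contamination fractions, one per sample (illustrative values only).
contam   <- c(0.05, 0.10, 0.20, 0.30)
adjusted <- sweep(raw, 2, 1 - contam, `*`)    # same percent removed from every gene

sf_raw <- size_factors(raw)
sf_adj <- size_factors(adjusted)

norm_raw <- sweep(raw,      2, sf_raw, `/`)
norm_adj <- sweep(adjusted, 2, sf_adj, `/`)

# Per-gene log2 fold change between sample 1 and sample 3, before and after.
lfc_raw <- log2(norm_raw[, 1] / norm_raw[, 3])
lfc_adj <- log2(norm_adj[, 1] / norm_adj[, 3])
print(all.equal(lfc_raw, lfc_adj))

# The size factors simply absorb the scaling, up to one constant shared by all samples.
print((sf_adj / sf_raw) / (1 - contam))
```

In this sketch the between-sample log fold changes are the same whether or not the counts were scaled, which is why the adjustment does not remove a contamination effect. The values stop being integer counts, though, which is what can hurt dispersion estimation. A covariate-based alternative along the lines suggested above might use a design such as `~ ns(contamination, df = 3) + condition` on the raw counts, where the column names and the `df` value are illustrative assumptions.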