Question

DESeq analysis starting from DESeq normalized counts

smithnickh • 0

@smithnickh-12435

I'm currently trying to work with data from GSE82227, which provides DESeq-normalized counts as part of its supplementary data. I'd like to use these for the final differential expression analysis as well as other analyses. However, I can't create a DESeq object because the normalized counts are non-integers, and since I have neither the raw counts nor the size factor estimates, I can't work backwards to the raw counts either.

Any suggestions would be very helpful. Thanks in advance.

deseq2 normalization • 1.8k views

updated 7.2 years ago by Ryan C. Thompson ★ 7.9k • written 7.2 years ago by smithnickh • 0

score 1 · Answer 1 · 2017-02-23

If you're absolutely sure that your table was generated by starting with a table of integer counts and scaling each column by a single normalization factor, you can sort the values in each column and use the intervals between consecutive unique values to infer the normalization factor that was used. Very importantly, this method assumes that each column has a sufficient density of counts that consecutive integers appear very often. This should generally be true of most RNA-seq count data. But as an example, if all the counts were somehow multiples of 10, this method would fail and give the wrong answers, since it would infer 10 counts as 1. If the original counts were non-integers, such as estimated gene counts generated from RSEM, Kallisto, or Salmon, then this method may also fail.
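
To make the idea concrete, here is a minimal sketch on simulated data (not GSE82227). The raw counts, the size factor `1.0737`, and all variable names are made up for illustration. It scales integer counts by one factor and then recovers that factor from the most common gap between consecutive distinct values.

```r
## Toy illustration on simulated data; every number here is hypothetical.
set.seed(1)
raw <- rnbinom(2000, mu = 50, size = 2)  # made-up raw integer counts
size.factor <- 1.0737                    # made-up normalization factor
normalized <- raw / size.factor          # what a "normalized" column looks like

## Gaps between consecutive distinct values. When consecutive integers are
## common in the raw data, the most frequent gap corresponds to one count.
gaps <- diff(sort(unique(normalized)))

## Group gaps by their value rounded to 3 significant digits, compared as
## strings to avoid floating-point equality problems.
rounded <- as.character(signif(gaps, 3))
modal <- names(which.max(table(rounded)))

## Average the unrounded gaps in the most frequent group.
unit.guess <- mean(gaps[rounded == modal])

## Rescale so that one unit equals one count, then snap to whole numbers.
recovered <- round(normalized / unit.guess)

message("Guessed 1 count = ", unit.guess,
        " (the value used in the simulation is ", 1 / size.factor, ")")
message("All counts recovered: ", all(recovered == raw))
```

If the simulated counts were instead all multiples of 10, the most common gap would be `10 / size.factor`, and the same code would silently return counts that are 10 times too small.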

Luckily, your data set seems to satisfy all the above requirements, so here's the code to recover the original counts (probably):

library(assertthat)
library(magrittr)

infer.counts <- function(x, digits=3) {
    assert_that(all(x >= 0))
    assert_that(digits >= 2)
    ## Get all diffs between successive unique values
    diffs <- x %>% sort %>% unique %>% diff
    ## Round to a few digits to work around inexact representation
    approxdiffs <- signif(diffs, digits)
    ## Find the rounded interval that occurs most often
    approxguess <- approxdiffs %>% table %>% .[which.max(.)] %>% names %>% as.numeric
    ## Find all the intervals that were rounded to the selected one, and take their mean
    unit.guess <- diffs[approxdiffs == approxguess] %>% mean
    message("Guessing 1 count = ", unit.guess)
    ## Divide the original vector by the unit guess. If the guess is
    ## right, every value is already (nearly) a whole number, so
    ## rounding to `digits` decimal places only removes numeric noise.
    round(x / unit.guess, digits)
}

file.url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE82227&format=file&file=GSE82227%5Fcounts%5FDESeq%2Enormalized%2Ecsv%2Egz"
normcounts <- read.csv(textConnection(readLines(gzcon(url(file.url)))), row.names="id")
counts <- apply(normcounts, 2, infer.counts)

And here are the first few rows and columns of the data before and after:

> normcounts[1:5,1:5]
data.frame with 5 rows and 5 columns
                   SR094_01    SR094_02     SR094_03     SR094_05     SR094_06
                  <numeric>   <numeric>    <numeric>    <numeric>    <numeric>
ENSG00000000003     2.80095     1.78999     2.171321     4.691401     2.887229
ENSG00000000419  1125.98170  1024.76918   976.008977   933.588869  1022.078941
ENSG00000000457    81.22754    63.54464   260.558570   302.595387   341.655390
ENSG00000000460   171.79157   191.52891   169.363071   145.433442   110.677098
ENSG00000000938 15206.35491 14216.99427 12689.202370 18185.044490 20616.737360
> counts[1:5,1:5]
                SR094_01 SR094_02 SR094_03 SR094_05 SR094_06
ENSG00000000003        3        2        2        4        3
ENSG00000000419     1206     1145      899      796     1062
ENSG00000000457       87       71      240      258      355
ENSG00000000460      184      214      156      124      115
ENSG00000000938    16287    15885    11688    15505    21422

I'll stress again that it doesn't take much to fool this simplistic method. A few non-integer or even just badly-rounded values could cause it to give the wrong answer. So any time you use this, give the inferred counts some thorough scrutiny to make sure they look like real counts.
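
As a starting point for that scrutiny, here is a rough sketch of one automated check. The helper `check.inferred.counts`, its `tol` argument, and the toy matrix are hypothetical and not part of the code above. It reports, per column, how far the inferred values stray from whole numbers, which is the symptom you would expect from a wrong unit guess.

```r
## Hypothetical helper: per-column plausibility report for inferred counts.
check.inferred.counts <- function(counts, tol = 0.01) {
    counts <- as.matrix(counts)
    stopifnot(all(counts >= 0))
    ## Distance of every value from the nearest whole number
    dev <- abs(counts - round(counts))
    data.frame(
        max.dev  = apply(dev, 2, max),   # worst deviation in the column
        frac.bad = colMeans(dev > tol)   # share of values outside tolerance
    )
}

## Toy input: a clean column, and one where the unit guess was 2x too large,
## so every odd count lands on a half-integer.
toy <- cbind(good = c(0, 1, 2, 5, 17, 130),
             bad  = c(0, 0.5, 1, 2.5, 8.5, 65))
report <- check.inferred.counts(toy)
print(report)

## Only after a column passes: coerce to true integers, since the DESeq
## object in the question could not be built from non-integer values.
int.counts <- round(toy[, "good", drop = FALSE])
storage.mode(int.counts) <- "integer"
```

This only catches guesses that are too large. A guess that is too small (for example, half the true unit) still yields whole numbers, so it is also worth checking that small values such as 1, 2, and 3 actually occur in every column.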

(I'm sure there's a more robust method that would work on data that was originally only mostly and/or approximately integer counts, such as RSEM output, but the above is quick and dirty and seems to work for this particular data set.)