Question

Validity of using DESeq2 with directly normalized EDASeq counts rather than offset

0

Entering edit mode

vizio12341337 • 0

@vizio12341337-22616

Last seen 4.2 years ago

Hi,

I realize that when correcting for GC bias via EDAseq and doing DE in DESeq2, the recommended method is to pass EDAseq's offsets to your DESeq object as NormalizationFactors as per the DESeq2 manual. However, we have a previous analysis that was done a few years ago, where the EDAseq normalized counts were passed directly into tximports' count slot prior to creation of the DESeq object. While I know this is not the current recommended pipeline, is this completely wrong or just less ideal (I know this was recommended in original DESeq prior to NormalizationFactors)? I only ask because we have since validated a number of these targets that are significant in the old analysis but are not when doing it the recommended way.

Thanks, I appreciate your time!

deseq2 edaseq • 1.1k views

ADD COMMENT • link updated 4.3 years ago by Michael Love 41k • written 4.3 years ago by vizio12341337 • 0

score 1 · Answer 1 · 2019-12-28

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 5 hours ago

United States

Can you give more details? There’s not enough here for me to say if it’s valid. Why is tximport involved? What was passed to EDASeq? Please post the code so I can look it over.

ADD COMMENT • link 4.3 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

Thanks for the quick reply. I realize how the above could be a bit confusing without some code:

Tximport was used to import RSEM genes.results to txi.rsem and uCovar contains the gc percents.

> head(uCovar)
                gccontent
ENSG00000000003     40.40
ENSG00000000419     39.85
ENSG00000000457     40.14
ENSG00000000460     39.22
ENSG00000000938     52.97
ENSG00000000971     35.20

> roundedCounts <- as.matrix(round(txi.rsem$counts,0))
> eda <- newSeqExpressionSet(roundedCounts,featureData=uCovar, phenoData=data.frame(group = sampleTableFull$group, row.names = row.names(sampleTableFull)))
> dataWithin <- withinLaneNormalization(eda,"gccontent", which="full")
> #Replace RSEM object's counts with EDASeq normalized counts
> txi.rsem$counts <- dataWithin@assayData$normalizedCounts
> #Run initial DESeq2 analysis
> dds36 <- DESeqDataSetFromTximport(txi.rsem, sampleTableFull, ~group)
using counts and average transcript lengths from tximport
> dds36 <- DESeq(dds36)
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

The total library size changed only very slightly with what was done:

colSums(originalCounts)/colSums(NormalizedCounts)
  CTR-A   CTR-B   CTR-C   CTR-D   CTR-E   CTR-F   MUT-A   MUT-B   MUT-C   MUT-D   MUT-E   MUT-F
0.9997960 0.9998470 0.9995003 0.9994305 0.9989726 0.9989360 0.9997949 0.9999131 0.9995712 0.9997287 0.9991146 0.9992895

These data just seem to be finicky, but this analysis being run this way lines up much better with our more recent scRNA-Seq datasets and nanostring validation. Thanks again!

ADD REPLY • link 4.3 years ago vizio12341337 • 0

0

Entering edit mode

And how did you run the quantification tool and tximport — this is also relevant.

ADD REPLY • link 4.3 years ago Michael Love 41k

0

Entering edit mode

For running RSEM, STAR transcriptome aligned BAMs were quantified with:

rsem-calculate-expression --bam --no-bam-output -p 8 --paired-end \
        --forward-prob 0.5  ${bamList[$counter]} /annotation/human-genome-rsem91/human \
        RSEM-out/$currentFile

The genes.results files were imported with tximport via:

> #Read in RSEM gene output
> temp = list.files(pattern="genes.results")
> txi.rsem <- tximport(temp, type = "rsem")
It looks like you are importing RSEM genes.results files, setting txIn=FALSE
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 
> txi.rsem$countsFromAbundance
[1] "no"

Thanks again and sorry for lacking important details. Let me know if anything else would be helpful.

ADD REPLY • link 4.3 years ago vizio12341337 • 0

0

Entering edit mode

Ok this looks fine. Tximport here is just calculating a transcript-usage based offset, and EDASeq is fixing GC in place.

So i dont have any issue with your pipeline that doesn’t use EDASeq offsets.

For other users with other cases, its a bit more complicated, e.g. depending on whether counts are generated from abundance, whether GC bias correction is applied by Salmon, what input goes into the R based bias modeler, etc.