Question: abnormal deseq2 normalization
0
9 weeks ago by
Charls0
Charls0 wrote:

Hi everyone,

I've been trying to use DESeq2 to look at expression differences across different gene, with 17 samples across 2 groups.

My count data looks as follows:

> head(phylum_top10)
NGm1 NGm2  NGm3  NGm4  NGm6 NGm7  NGm8 NGm9 NGs1  NGs2  NGs3  NGs4  NGs5  NGs6  NGs7  NGs8 NGs9
gen1 18185 9787 15737 21988 15669 8490 13135 6740 9324 23279 14517 17634 21207 16377 10719 10721 6750
gen2     0    0     0     0     0    0     0    0    0     0     0     0     0     0     0     0    0
gen3     8    0   332   678     0    0     0    0    0     0     0     0     0     0     0    79    0
gen4     0    0   175   203   123    0     0    0    0     0     0    64     0     0     0     0    0
gen5     0    0     0     0     0    0     0    0    0     0     0     0     0     0     0     0    0
gen6     0    0   162    22   616    0     0    0    0     0     0     0     0     0     0     6    0

> colData
Site Species
NGm1 Site1      Gm
NGm2 Site1      Gm
NGm3 Site1      Gm
NGm4 Site2      Gm
NGm6 Site2      Gm
NGm7 Site3      Gm
NGm8 Site3      Gm
NGm9 Site3      Gm
NGs1 Site1      Gs
NGs2 Site1      Gs
NGs3 Site1      Gs
NGs4 Site2      Gs
NGs5 Site2      Gs
NGs6 Site2      Gs
NGs7 Site3      Gs
NGs8 Site3      Gs
NGs9 Site3      Gs


I tried:

dds <- DESeqDataSetFromMatrix(phylum_top10, colData, design = ~Species)
htseq_rawData <- counts(dds, F)
dds <- dds[rowSums(counts(dds)) > 0]
NGm1 NGm2  NGm3  NGm4  NGm6 NGm7  NGm8 NGm9 NGs1  NGs2  NGs3  NGs4  NGs5  NGs6  NGs7  NGs8 NGs9
gen1  18185 9787 15737 21988 15669 8490 13135 6740 9324 23279 14517 17634 21207 16377 10719 10721 6750
gen3      8    0   332   678     0    0     0    0    0     0     0     0     0     0     0    79    0
gen4      0    0   175   203   123    0     0    0    0     0     0    64     0     0     0     0    0
gen6      0    0   162    22   616    0     0    0    0     0     0     0     0     0     0     6    0
gen10     0    0     0    19     0    0     0    0    0     0     0     0     0     0     0     0    0


the above is normal, but the following normalization is not.

> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 3 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
Warning message:
In lfproc(x, y, weights = weights, cens = cens, base = base, geth = geth,  :
Estimated rdf < 1.0; not estimating variance
> deseq_rawData <- counts(dds, T)
> deseq_rawData <- as.data.frame(deseq_rawData)
NGm1     NGm2       NGm3        NGm4       NGm6     NGm7     NGm8     NGm9     NGs1     NGs2     NGs3
gen1  13174.911005 13174.91 13174.9110 13174.91101 13174.9110 13174.91 13174.91 13174.91 13174.91 13174.91 13174.91
gen3      5.795947     0.00   277.9482   406.24839     0.0000     0.00     0.00     0.00     0.00     0.00     0.00
gen4      0.000000     0.00   146.5088   121.63484   103.4217     0.00     0.00     0.00     0.00     0.00     0.00
gen6      0.000000     0.00   135.6253    13.18210   517.9491     0.00     0.00     0.00     0.00     0.00     0.00
gen10     0.000000     0.00     0.0000    11.38454     0.0000     0.00     0.00     0.00     0.00     0.00     0.00
NGs4     NGs5     NGs6     NGs7        NGs8     NGs9
gen1  13174.91101 13174.91 13174.91 13174.91 13174.91101 13174.91
gen3      0.00000     0.00     0.00     0.00    97.08217     0.00
gen4     47.81639     0.00     0.00     0.00     0.00000     0.00
gen6      0.00000     0.00     0.00     0.00     7.37333     0.00
gen10     0.00000     0.00     0.00     0.00     0.00000     0.00


I don't know why the count of gene 1 in all samples are the same after normalization. I can't figure out what the issue is, and would really appreciate any help! I'd just like to investigate differential expression across different species. Thanks so much.

deseq2 • 106 views
modified 9 weeks ago by Michael Love25k • written 9 weeks ago by Charls0
1
9 weeks ago by
Michael Love25k
United States
Michael Love25k wrote:

This gene is being removed essentially with the outlier replacement procedure. I would suggest to set minReplicatesForReplace=Inf, and then examine the top genes to make sure they aren't being driven by outliers. If outliers appear to be a problem, I would suggest the following approaches: We've developed a new method in the lab called swish in the fishpond Bioconductor package which offers better protection against outliers than our DESeq2 procedure I think. Or you could use SAMseq here as well.