This is more of a conceptual question. I am trying to study the effects of a large-scale (~100 Mb; >1,000 genes) chromosomal inversion on differential ATAC-seq peaks in two genotypes: animals that are heterozygous for the inversion ("hets") and animals homozygous for the standard arrangement. We see substantial allele-specific expression (via RNA-seq) along the inversion in the hets, and accordingly they have higher within-genotype variance than the homozygotes. I see the same pattern in my ATAC-seq data. In other words, I expect differential within-group variance in a subset of clustered genes in the genome as a direct consequence of my experimental design. A second major concern is that we sequenced our hets at 2x the depth of the homozygotes in order to have equivalent coverage of each allele.
At a whole-genome level, I see relatively few significantly different peaks (<1,000 in a 1.1e+9 genome) running a model with standard settings:

dds <- DESeqDataSetFromMatrix(countData = ctsmatrix, colData = manifest, design = ~ genotype)

However, when I instead use the size factors from estimateSizeFactors to normalize the raw counts in my count matrix, I see many more differential peaks (>10,000, as tested by a basic t-test) on the chromosome with the inversion that are significant after FDR correction but are not even close to significant (unadjusted p-values > 0.3) in DESeq2. Visualizing these discrepant peaks in my dds object with plotCounts suggests this is probably due to flattened variation in the hets after DESeq2's processing, which I do not see with the simpler size-factor adjustment. I suspect this comes from the assumption of equal variance between groups and, potentially, from outlier handling. I have read the vignette sections on the VST and rlog transformations and on keeping outliers in the dataset, but I'm not sure whether that is malpractice or appropriate given my unusual experimental design. What are best practices for testing genotype differences in this unusual case? Any advice is appreciated.
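For concreteness, the "simple size-factor adjustment" pipeline described above can be sketched roughly as follows. This is a hypothetical illustration in Python with simulated counts (the actual data and R objects from the thread aren't reproducible here): compute median-of-ratios size factors as in Anders & Huber (2010), divide the raw counts by them, and run a per-peak Welch t-test between the 12 homozygotes and 9 hets.

```python
import numpy as np
from scipy import stats

def size_factors(counts):
    """Median-of-ratios size factors (Anders & Huber 2010).
    counts: (n_peaks, n_samples) array of raw counts."""
    keep = (counts > 0).all(axis=1)                 # peaks with no zero counts
    logc = np.log(counts[keep])
    log_geomean = logc.mean(axis=1, keepdims=True)  # per-peak geometric mean, log scale
    return np.exp(np.median(logc - log_geomean, axis=0))

rng = np.random.default_rng(0)
# simulated raw ATAC-seq counts: 500 peaks x 21 samples, no real signal
counts = rng.poisson(50, size=(500, 21)).astype(float) + 1
sf = size_factors(counts)
norm = counts / sf                                  # the simple size-factor adjustment

# per-peak Welch t-test: first 12 columns homozygotes, last 9 hets
hom, het = norm[:, :12], norm[:, 12:]
tstat, pvals = stats.ttest_ind(hom, het, axis=1, equal_var=False)
```

Whether a plain t-test on normalized counts is appropriate here is exactly the question at issue: DESeq2 instead fits a negative-binomial GLM with shared dispersion estimation across peaks, which is one reason the two sets of p-values can diverge.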
I have 12 homozygotes and 9 hets. As an example, here is a peak in a gene we're very interested in: p = 0.002 by a standard t-test (EstimateSizeFactorsNormalization.png). For this analysis I simply divided the raw read count by the estimateSizeFactors value calculated for each sample. Here is exactly the same peak and data plotted with DESeq2's plotCounts
, unadjusted p = 0.120 (PlotCounts.png).

Sorry for double posting!
You can see why it’s not DE according to the normalized counts as calculated by DESeq2. I wonder if you’re calculating your normalized counts differently from ours.
Of course -- upon plotting, it is obvious why this is not significant under the DESeq2 model. Following Anders & Huber (2010), I simply divided the raw read counts of that peak by the size factor calculated by DESeq2 for each sample. I believe DESeq2 performs this same calculation (https://hbctraining.github.io/DGEworkshop/lessons/02DGEcountnormalization.html).
My understanding is that DESeq2 takes additional steps beyond size-factor division (e.g., outlier replacement), which is apparent when comparing the two plots for the example peak above. My question is whether there are underlying assumptions about the data in DESeq2 that are violated by my experimental design. Why could there be such a large discrepancy between these two normalization methods?
Can you turn off outlier replacement to test the assumption that this is the cause?
Setting minReplicatesForReplace=Inf when running DESeq() will work.

Thanks! There were a few outliers being removed, but the actual cause of the issue was a downstream coding error. I reran the pipeline and my results now nearly perfectly match the estimates from DESeq2. Sorry for the error!
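As an aside on why turning off replacement is a useful diagnostic: a single extreme sample can move a per-peak p-value substantially in either direction. DESeq2's actual mechanism flags outliers by Cook's distance and replaces them with a trimmed-mean-based value; the toy numbers and the crude replacement rule below are invented purely to illustrate the effect, not to reproduce DESeq2's calculation.

```python
import numpy as np
from scipy import stats

# hypothetical normalized counts for one peak (values invented for illustration)
hom = np.array([98, 101, 99, 102, 100, 97, 103, 100, 99, 101, 100, 100.])
het = np.array([130, 131, 129, 132, 128, 130, 131, 129, 400.])  # one extreme sample

# Welch t-test with the outlier left in: the inflated within-group
# variance masks the ~30-count shift between genotypes
p_raw = stats.ttest_ind(hom, het, equal_var=False).pvalue

# crude stand-in for outlier replacement: swap the extreme value
# for the mean of the remaining het samples
het_replaced = het.copy()
het_replaced[-1] = het[:-1].mean()
p_replaced = stats.ttest_ind(hom, het_replaced, equal_var=False).pvalue
```

In DESeq2 itself, the equivalent experiment is minReplicatesForReplace=Inf in DESeq() (no replacement) and/or cooksCutoff=FALSE in results() (no outlier-based filtering), which isolates whether outlier handling, rather than normalization, drives a discrepant p-value.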