This is more of a conceptual question. I am trying to study the effects of a large-scale (~100 Mb; >1,000 genes) chromosomal inversion on differential ATAC-seq peaks in two genotypes: animals that are heterozygous for the inversion ("hets") and animals homozygous for the standard arrangement. We see substantial allele-specific expression (via RNA-seq) along the inversion in the hets, and accordingly they have higher within-genotype variance than the homozygotes. I see the same pattern in my ATAC-seq data. In other words, I expect differential within-group variance in a subset of clustered genes in the genome as a direct consequence of my experimental design. A second major concern is that we sequenced our hets at 2x the depth of the homozygotes in order to have equivalent coverage of each allele.
At a whole-genome level, I see relatively few significantly different peaks (<1,000 in a 1.1e+9 genome) running a model with standard settings:

dds <- DESeqDataSetFromMatrix(countData = ctsmatrix, colData = manifest, design = ~ genotype)

However, when I instead use the size factors from estimateSizeFactors to normalize the raw counts in my count matrix, I see many more differential peaks (>10,000, as tested by a basic t-test) on the chromosome with the inversion that are significant after FDR correction but are not even close to significant (unadjusted p-values > 0.3) in DESeq2. Visualizing these discrepant peaks in my dds object with plotCounts suggests this is probably due to flattened variation in the hets after DESeq2's processing, which I do not see with the simpler size-factor adjustment. I suspect this comes from the assumption of equal variance between groups and, potentially, from outlier handling. I have read the vignette sections on the VST and rlog transformations and on keeping outliers in the dataset, but I'm not sure whether that is malpractice or appropriate given my unusual experimental design. What are best practices for testing genotype differences in this unusual case? Any advice is appreciated.
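For concreteness, the "simple size-factor adjustment" pipeline described above can be sketched roughly as follows. This is a hypothetical illustration in Python with simulated counts (the actual data and R objects from the thread aren't reproducible here): compute median-of-ratios size factors as in Anders & Huber (2010), divide the raw counts by them, and run a per-peak Welch t-test between the 12 homozygotes and 9 hets.

```python
import numpy as np
from scipy import stats

def size_factors(counts):
    """Median-of-ratios size factors (Anders & Huber 2010).
    counts: (n_peaks, n_samples) array of raw counts."""
    keep = (counts > 0).all(axis=1)                 # peaks with no zero counts
    logc = np.log(counts[keep])
    log_geomean = logc.mean(axis=1, keepdims=True)  # per-peak geometric mean, log scale
    return np.exp(np.median(logc - log_geomean, axis=0))

rng = np.random.default_rng(0)
# simulated raw ATAC-seq counts: 500 peaks x 21 samples, no real signal
counts = rng.poisson(50, size=(500, 21)).astype(float) + 1
sf = size_factors(counts)
norm = counts / sf                                  # the simple size-factor adjustment

# per-peak Welch t-test: first 12 columns homozygotes, last 9 hets
hom, het = norm[:, :12], norm[:, 12:]
tstat, pvals = stats.ttest_ind(hom, het, axis=1, equal_var=False)
```

Whether a plain t-test on normalized counts is appropriate here is exactly the question at issue: DESeq2 instead fits a negative-binomial GLM with shared dispersion estimation across peaks, which is one reason the two sets of p-values can diverge.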
I have 12 homozygotes and 9 hets. As an example, here is a peak in a gene we're very interested in: p = 0.002 by a standard t-test (EstimateSizeFactorsNormalization.png). For this analysis I simply divided the raw read count by the estimateSizeFactors value calculated for each sample. Here is exactly the same peak and data plotted with DESeq2's plotCounts
, unadjusted p = 0.120 (PlotCounts.png).

Sorry for double posting!
You can see why it’s not DE according to the normalized counts as calculated by DESeq2. I wonder if you’re calculating your normalized counts differently from ours.
Of course -- upon plotting, it is obvious why this is not significant under the DESeq2 model. Following Anders & Huber (2010), I simply divided the raw read counts of that peak by the size factor calculated by DESeq2 for each sample. I believe DESeq2 performs this same calculation (https://hbctraining.github.io/DGEworkshop/lessons/02DGEcountnormalization.html).
My understanding is that DESeq2 takes additional steps beyond size-factor division (e.g., outlier replacement), which is apparent when comparing the two plots for the example peak above. My question is whether there are underlying assumptions about the data in DESeq2 that are violated by my experimental design. Why could there be such a large discrepancy between these two normalization methods?
Can you turn off outlier replacement to test the assumption that this is the cause?
Setting minReplicatesForReplace=Inf when running DESeq() will work.

Thanks! There were a few outliers being removed, but the actual cause of the issue was a downstream coding error. I reran the pipeline and my results now nearly perfectly match the estimates from DESeq2. Sorry for the error!
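As an aside on why turning off replacement is a useful diagnostic: a single extreme sample can move a per-peak p-value substantially in either direction. DESeq2's actual mechanism flags outliers by Cook's distance and replaces them with a trimmed-mean-based value; the toy numbers and the crude replacement rule below are invented purely to illustrate the effect, not to reproduce DESeq2's calculation.

```python
import numpy as np
from scipy import stats

# hypothetical normalized counts for one peak (values invented for illustration)
hom = np.array([98, 101, 99, 102, 100, 97, 103, 100, 99, 101, 100, 100.])
het = np.array([130, 131, 129, 132, 128, 130, 131, 129, 400.])  # one extreme sample

# Welch t-test with the outlier left in: the inflated within-group
# variance masks the ~30-count shift between genotypes
p_raw = stats.ttest_ind(hom, het, equal_var=False).pvalue

# crude stand-in for outlier replacement: swap the extreme value
# for the mean of the remaining het samples
het_replaced = het.copy()
het_replaced[-1] = het[:-1].mean()
p_replaced = stats.ttest_ind(hom, het_replaced, equal_var=False).pvalue
```

In DESeq2 itself, the equivalent experiment is minReplicatesForReplace=Inf in DESeq() (no replacement) and/or cooksCutoff=FALSE in results() (no outlier-based filtering), which isolates whether outlier handling, rather than normalization, drives a discrepant p-value.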