Question

DESeq2: Sequencing depth as confounding factor after size factor normalization

0

Entering edit mode

ipso • 0

@ipso-9761

Last seen 5.6 years ago

Hi,

I am performing RNA-Seq analysis on a bulk dataset comprising 4 different groups (a, b, c, d) with 5-6 biological replicates per group. When performing QC plotting, I find that PCA and clustering of vst-transformed values identifies some samples as quite different from the others:

# Read dds
dds <- readRDS(dds.path)

# Variance stabilizing transformation
dds.transform <- varianceStabilizingTransformation(dds, blind=TRUE)

# Plotting PCA
p.pca <- plotPCA(dds.transform,
                 intgroup=c("group"), ntop=500)
nudge <- position_nudge(y=5, x=5)
p.pca + geom_label(aes(label = name), position = nudge)

# Plotting heatmap + clustering
df <- data.frame(group = colData(dds.transform)[,"group"],
                 row.names = colnames(dds.transform))
select.200 = order(rowMeans(counts(dds, normalized=TRUE)),
                   decreasing=TRUE)[1:200]
pheatmap(assay(dds.transform)[select.200,], cluster_rows=F, show_rownames=F,
         cluster_cols=T, annotation_col=df)

Interestingly the samples are the ones with the highest amount of reads (and largely belonging to the same batch). Looking at the heatmaps it also is apparent that they cluster differently because of generally higher reads:

I thought the size factor normalization should take care of this? Removing the batch effect using limma's removeBatchEffect() is also not really helping (makes it worse even):

# Remove batch effect
assay(dds.transform) <- removeBatchEffect(assay(dds.transform), batch=dds.transform$batch)

Is this kind of variation to be expected or do I need to take care of this? Both intra- and intergroup differences theoretically should not be that big. What I am eventually interested in are mainly the differentially expressed genes.

Thank you!
Best,
Joe

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] limma_3.36.2                vsn_3.48.1                  RColorBrewer_1.1-2          gridExtra_2.3              
 [5] pheatmap_1.0.10             gplots_3.0.1                ggplot2_3.0.0               DESeq2_1.20.0              
 [9] SummarizedExperiment_1.10.1 DelayedArray_0.6.5          BiocParallel_1.14.2         matrixStats_0.54.0         
[13] Biobase_2.40.0              GenomicRanges_1.32.6        GenomeInfoDb_1.16.0         IRanges_2.14.10            
[17] S4Vectors_0.18.3            BiocGenerics_0.26.0

deseq2 sizefactors pca normalization library size factor • 1.5k views

ADD COMMENT • link updated 5.6 years ago by Michael Love 41k • written 5.6 years ago by ipso • 0

score 0 · Answer 1 · 2018-09-07

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 6 hours ago

United States

What kind of RNA-seq is this by the way? I see you have .5 million reads for most samples.

Can you plot just raw counts from b1 vs b6, c1 vs c6, d1 vs d2? I'm wondering if something else is different about those samples beyond x4 higher depth.

ADD COMMENT • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

It's data from hematopoietic stem cells. The dataset is already some years old so the depth is really not that great. Still I need to reanalyse it as it has some unique treatment.

I wasn't really sure what you meant by plotting the raw counts so I attached some basic density plots of the count distribution (log2).

ADD REPLY • link 5.6 years ago ipso • 0

0

Entering edit mode

Sorry, I meant plot(b1, b6, log="xy"). Scaling normalization would help you when these plots of replicates mostly indicate a shift (a multiplicative factor on the raw count scale).

ADD REPLY • link 5.6 years ago Michael Love 41k

0

Entering edit mode

Of course, here you go:

For me, c1 vs. c6 and d1 vs. d2 look somewhat linearly correlated. b1 vs. b6 however looks in fact quite weird.

ADD REPLY • link 5.6 years ago ipso • 0

0

Entering edit mode

All of these look highly variable for replicates. This isn't a sequencing depth issue I think, but some other quality control problem. I'd look into if something was special about these high depth samples.

What do you see with e.g. b5 vs b6?

ADD REPLY • link 5.6 years ago Michael Love 41k

0

Entering edit mode

You were right. The variability is much less when comparing between the libraries with lower depth. It seems that b1, c1 and d1 are highly different from everything else. b2 to a lesser extend also. a2, c2 and d5 maybe as well, but I believe their deviation should be still acceptable.

What could be the reason for this high variation?