Question

[DESeq2] Estimating size factors using samples which have been excluded for QC

owenchapman1 • 0

I'm analyzing a small number of samples using RNA-seq and ATAC-seq, which produce reads mapping to genes and open chromatin, respectively. I use DESeq2 to estimate a size factor for each sample, so that I can compare e.g. the number of reads for region r in sample A to region r in sample B. However, several samples failed ATAC-seq QC and were excluded from downstream analysis. Now I have 2 questions:

1. Is it valid to include the failed QC samples when estimating size factors for the ATAC-seq? From my reading of note S1 of doi:10.1101/gr.133744.111, this would probably have a modest effect on the counts for all bins in the virtual reference, yielding a slightly better estimate of each size factor (because there are more samples) but a slight rescaling of all size factors (because the virtual reference is shifted). Should adding these samples give me a better estimate of size factors?

2. Is it valid to include the samples that failed ATAC-seq QC when estimating size factors for the RNA-seq? These samples all have fine RNA-seq descriptive statistics, so I wouldn't see a problem with using them to estimate the virtual reference. But I'd still want to make sure that it's a good idea, given that I'm not using the failed QC samples in downstream analysis.

deseq2 estimatesizefactors • 1.1k views

updated 6.1 years ago by Ryan C. Thompson ★ 7.9k • written 6.1 years ago by owenchapman1 • 0

score 0 · Answer 1 · 2018-11-10

If you have samples that have failed QC, I would be careful about including them in the normalization. Presumably they failed QC because they have an abnormal distribution of read counts, so normalizing them against normal samples might not make sense. The main concern I would have is that the failed samples will affect the counts in the virtual reference. Contrary to what you say, I don't believe this would improve the size factor estimation, since it would make the virtual reference less representative of a read distribution from an acceptable-quality sample. The fact that the virtual reference, and therefore the size factors, will also be shifted slightly is of no concern: all normalized log counts or log CPMs are only comparable within a single dataset anyway. So including the samples that failed QC in the normalization probably won't help anything, and it might hurt, depending on how badly the count distributions are distorted. Of course, you don't need to rely on theory. Just run it both ways and compare the size factors to see how much of a difference it makes.
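To make the "run it both ways" comparison concrete, here is a minimal sketch in plain Python. It is not DESeq2's code: `size_factors` is a hypothetical helper that re-implements the basic median-of-ratios idea (per-feature geometric mean as the virtual reference, then the median ratio per sample), and the count table, including the sample named `failed`, is made-up toy data. In a real analysis you would run DESeq2's `estimateSizeFactors()` on both sample sets and compare the results in the same way.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative re-implementation).

    counts: dict mapping sample name -> list of counts, same feature order.
    Features with a zero count in any sample are left out of the reference.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])

    # Virtual reference: log of the per-feature geometric mean across samples.
    log_ref = []
    for i in range(n_features):
        values = [counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_ref.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_ref.append(None)  # feature unusable for the reference

    usable = [i for i in range(n_features) if log_ref[i] is not None]
    if not usable:
        raise ValueError("every feature has a zero count in some sample")

    # Size factor: median ratio of a sample's counts to the reference.
    return {
        s: math.exp(median(math.log(counts[s][i]) - log_ref[i] for i in usable))
        for s in samples
    }


# Toy data (assumption): three passing samples and one distorted sample.
counts = {
    "A": [100, 200, 50, 400, 80],
    "B": [110, 190, 55, 420, 75],
    "C": [200, 400, 100, 800, 160],
    "failed": [900, 20, 300, 40, 10],
}

with_all = size_factors(counts)
without = size_factors({s: c for s, c in counts.items() if s != "failed"})

for s in without:
    ratio = with_all[s] / without[s]
    print(f"{s}: all={with_all[s]:.3f} passed_only={without[s]:.3f} ratio={ratio:.3f}")
```

If the extra sample only shifted the virtual reference uniformly, the `ratio` column would be the same for every sample, which is the harmless rescaling mentioned above. Ratios that differ between samples indicate that the failed sample changed the relative normalization of the samples you are keeping.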

For your second question, as long as the RNA-seq libraries for the samples seem good, including them in the normalization should be fine. Again, you can try normalizing with and without them to see how much difference it makes.