[Deseq2] Estimating size factors using samples which have been excluded for QC
@owenchapman1-18252

I'm analyzing a small number of samples with RNA-seq and ATAC-seq, which produce reads mapping to genes and to open chromatin, respectively. I use DESeq2 to estimate a size factor for each sample, so that I can compare, e.g., the number of reads for a region in sample A to the number for the same region in sample B. However, several samples failed ATAC-seq QC and were excluded from downstream analysis. Now I have 2 questions:

1. Is it valid to include the failed QC samples when estimating size factors for the ATAC-seq? Reading note S1 of doi:10.1101/gr.133744.111, this would probably have a modest effect on the counts for all bins in the virtual reference, yielding a slightly better estimate of each size factor (because there are more samples) but a slight rescaling of all size factors (because the virtual reference is shifted). Would adding these samples give me a better estimate of the size factors?

2. Is it valid to include the samples that failed ATAC-seq QC when estimating size factors for the RNA-seq? These samples all have fine RNA-seq descriptive statistics, so I don't see a problem with using them to estimate the virtual reference. But I'd still like to confirm that this is a good idea, given that I'm not using them in downstream analysis.
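
For reference, the median-of-ratios estimator that both questions refer to can be sketched as below. The notation is an assumption introduced here, not taken from the cited note: $k_{ij}$ is the read count for feature $i$ (gene or bin) in sample $j$, $m$ is the number of samples, and $k_i^{\mathrm{ref}}$ is the virtual reference.

```latex
\hat{s}_j = \operatorname*{median}_{i \,:\, k_i^{\mathrm{ref}} > 0} \frac{k_{ij}}{k_i^{\mathrm{ref}}},
\qquad
k_i^{\mathrm{ref}} = \Bigl( \prod_{v=1}^{m} k_{iv} \Bigr)^{1/m}
```

Because every sample enters each $k_i^{\mathrm{ref}}$, adding or dropping a sample changes the reference for all features, which is the shift described in question 1.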

deseq2 estimatesizefactors

@ryan-c-thompson-5618
Icahn School of Medicine at Mount Sinai…

If you have samples that have failed QC, I would be careful about including them in the normalization. Presumably they failed QC because they have an abnormal distribution of read counts, so normalizing them against good-quality samples might not make sense. The main concern I would have is that the failed samples will affect the counts in the virtual reference. Contrary to what you say, I don't believe this would improve the size factor estimation, since it would make the virtual reference less representative of the read distribution of an acceptable-quality sample. The fact that the virtual reference, and therefore the size factors, will also be shifted slightly is of no concern: normalized log counts or log CPMs are only comparable within a single dataset anyway. So including the samples that failed QC in the normalization probably won't help anything, and it might hurt, depending on how badly the count distributions are distorted. Of course, you don't need to rely on theory. Just run it both ways and compare the size factors to see how much of a difference it makes.
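
To make the "run it both ways" check concrete, here is a minimal sketch in plain Python rather than DESeq2 itself. It assumes the usual median-of-ratios estimator restricted to features with nonzero counts in every sample; the function names `size_factors` and `compare_with_without` and the toy counts are hypothetical, made up for illustration. In DESeq2, the equivalent check would be to run `estimateSizeFactors` on the full object and on the QC-filtered subset, then compare `sizeFactors` for the retained samples.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch, not DESeq2 code).

    counts: list of feature rows, each a list of per-sample read counts.
    """
    n_samples = len(counts[0])
    # Keep only features with nonzero counts in every sample, so that the
    # log geometric mean (the "virtual reference") is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no feature has nonzero counts in every sample")
    log_ref = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - ref for row, ref in zip(usable, log_ref)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def _center(values):
    """Rescale to geometric mean 1, removing a global shift of the reference."""
    g = math.exp(sum(math.log(v) for v in values) / len(values))
    return [v / g for v in values]


def compare_with_without(counts, keep):
    """log2 change in the size factors of the kept samples when the other
    samples are included in the estimation. keep: column indices passing QC."""
    sf_all = size_factors(counts)
    sf_kept_only = size_factors([[row[j] for j in keep] for row in counts])
    with_failed = _center([sf_all[j] for j in keep])
    without_failed = _center(sf_kept_only)
    return [math.log2(a / b) for a, b in zip(with_failed, without_failed)]


# Hypothetical toy counts: columns 0-2 passed QC, column 3 failed.
toy = [
    [100, 210, 150, 10],
    [50, 95, 80, 400],
    [30, 60, 45, 5],
    [80, 170, 115, 900],
    [20, 40, 30, 2],
    [60, 118, 92, 700],
]
for sample, change in zip("ABC", compare_with_without(toy, keep=[0, 1, 2])):
    print(f"sample {sample}: log2 change = {change:+.3f}")
```

Both sets of size factors are rescaled to geometric mean 1 before comparison, so the reported values reflect only changes in the relative scaling between the retained samples, not the overall shift of the virtual reference. Values near zero would suggest that including the failed samples makes little practical difference.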

For your second question, as long as the RNA-seq libraries for the samples seem good, including them in the normalization should be fine. Again, you can try normalizing with and without them to see how much difference it makes.
