Question

Scaling before normalisation in RNASeq

1

Entering edit mode

tg369 ▴ 10

@tg369-9320

Last seen 4.7 years ago

United Kingdom

I have one sample that we ran for sequencing separately and having more coverage and more reads. At the end this sample is coming as an outlier when I prepared a cluster diagram. This problem still persist after I did normalisation (DESeq method).

This can be biological reason, but we also suspect this is happening because of much more coverage or sequencing depth.

One option is to test that, we can select reads randomly in that sample and at the end make the total reads in that sample similar for rest of my samples. Then do normalisation and differential expression analysis.

I wonder if any such scripts are available. I would highly appreciate any of your advice.

Thank you in advance.

Tanay

normalization • 2.7k views

ADD COMMENT • link 8.8 years ago tg369 ▴ 10

0

Entering edit mode

I don't know if your outlier sample is a biological replicate of other samples in your experiment? And I don't know exactly what your experimental design is, but probably adding a "batch effect" to your model would be useful.

ADD REPLY • link 8.8 years ago b.nota ▴ 370

score 4 · Answer 1 · 2016-01-21

I would be surprised if the difference was due to library size alone. Normalization by library size (which the size factors from DESeq should do, amongst other things) should take care of library size differences between samples. In particular, the more sophisticated normalization methods (e.g., size factors in DESeq, TMM in edgeR) should only fail if the majority of genes are DE in your data set, or if you have many, many zeroes in your count matrix after abundance-based filtering. This is usually not the case for most analyses.

A more common explanation for non-biological differences between samples involves a batch effect due to separate processing. This means that your affected sample has read counts that are systematically different from the other samples - however, these differences are not consistent across genes and cannot be removed by scaling normalization with a global scaling factor. As a result, your sample will cluster separately from everyone else, even after normalization. Such batch effects are frequently introduced in genomics experiments, e.g., if you send two identical samples to different sequencing centers, or to the same sequencing center at different times, or for processing by different operators, or at different phases of the moon, etc. and will be present regardless of whether you downsample the library.

b.nota's suggestion of blocking on the batch is only relevant if you have multiple samples in each batch. If you only have one sample in a batch, then all information in that sample will be used to estimate the blocking term for the batch effect. This means that the sample will effectively be useless for estimating the variance, detecting DE, etc. However, you can't treat it as a direct replicate either, because the batch effect will inflate the estimated variance and distort all of your downstream statistics. You're stuck between a rock and a hard place here; it's a fundamental flaw in the experimental design that cannot be fixed at the analysis stage.

score 2 · Answer 2 · 2016-01-21

"select reads randomly in that sample and at the end make the total reads in that sample similar for rest of my samples"

The edgeR function thinCounts() does this. If x is a matrix of counts, and you want to reduce the library size for the 10th column to be 20 million, you would use:

x[,10] <- thinCounts(x[,10], target.size=20e6)

The thinning is done in such a way that the distributional assumptions about the counts are preserved.

However a sample shouldn't appear as an outlier just because it was sequenced to a larger depth. If you are seeing this, then it is likely that there is a biological explanation or else you have made a mistake in the way you have prepared the data for clustering.

score 1 · Answer 3 · 2016-01-21

1

Entering edit mode

Hotz, Hans-Rudolf ▴ 400

@hotz-hans-rudolf-3951

Last seen 4.1 years ago

Switzerland

Hi Tanay

You don't need a script to select reads randomly, try the 'FastqSampler' function from the "ShortRead" package.

Regards, Hans-Rudolf

ADD COMMENT • link 8.8 years ago Hotz, Hans-Rudolf ▴ 400

score 0 · Answer 4 · 2016-01-22

0

Entering edit mode

tg369 ▴ 10

@tg369-9320

Last seen 4.7 years ago

United Kingdom

Thank you b.nota, Aaron Lun, Gordon Smyth and Hans-Rudolf for discussing this issue and your advice and suggestions. We plan to repeat these experiments. However, in the meantime, I just try to select reads randomly and let us see the result. Will let you know as soon as I have my result. Tanay

ADD COMMENT • link 8.8 years ago tg369 ▴ 10

score 0 · Answer 5 · 2016-01-29

Dear All,

I did down sampling of reads for the outlier sample. Cluster diag generated after normalisation, using normalised raw counts, looks like the outlier issue is resolved. However, that was bit confusing. Therefore, I performed PCA; also generated a heat map by taking 1000 highly variable genes and found that my sample is still appeared as an outlier in all cases. I therefore agree that this type of batch effect (where one sample was run in a separate batch and rest 5 samples were run together in another batch) might not be fixed at the analysis stage.

Kind regards,

Tanay