Do I have clustering by library size?
Hi,

I have a question regarding large differences in library size among my samples. These samples are from an in vivo animal infection experiment, and some samples had to be resequenced to get enough reads. There is no case/infected versus control/uninfected comparison in this experiment; all samples are infected. Library sizes range from 100,000 to 81 million reads. I generated a PCA plot, colored the samples by sequencing depth, and wanted to check with the experts whether samples with similar sequencing depth appear to be clustering together. I also tried filtering out genes that have 0 reads in >50% of the samples and replotted the PCA, but it looked exactly the same.

Greatly appreciate help in this regard.

PCA DESeq2 Normalization librarysize
To me it doesn't look like it. A couple of notes, though: (i) if you end up doing any sort of differential expression analysis, you might want to add the sequencing batch of the libraries as a covariate (and colour the samples by batch in a separate PCA) to rule out batch effects, if any exist; (ii) the range of library sizes is quite large, so you might need to take special care when analysing.

Thank you Antonio for your suggestions. Greatly appreciate it.

@mikelove

I don't think there is too much clustering by library size, but the one very highly sequenced sample may continue to drive PC1. You could see what happens if you down-sample that one.
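If you want a rough numeric check to go with the plot, you can correlate the first PCs with log library size. This is only a sketch in base R on simulated counts: `cts`, the sample names, the library sizes and the log2(CPM + 1) transform are all assumptions for illustration. With a real DESeq2 object you would more likely run the PCA on `vst()` or `rlog()` output.

```r
set.seed(1)
n.genes <- 2000
n.samples <- 8
# Hypothetical library sizes spanning a wide range
lib.target <- c(1e5, 5e5, 1e6, 2e6, 5e6, 1e7, 2e7, 8e7)
# Shared gene proportions, so depth is the main difference between samples
props <- rgamma(n.genes, shape = 0.5)
props <- props / sum(props)
cts <- sapply(lib.target, function(n) rpois(n.genes, lambda = n * props))
colnames(cts) <- paste0("S", seq_len(n.samples))

lib.size <- colSums(cts)
# Simple log2(CPM + 1) transform, standing in for vst()/rlog()
logcpm <- log2(t(t(cts) / lib.size) * 1e6 + 1)

# PCA with samples as rows
pca <- prcomp(t(logcpm))
pct.var <- 100 * pca$sdev^2 / sum(pca$sdev^2)
# Spearman correlation of PC1 and PC2 scores with sequencing depth
depth.cor <- cor(pca$x[, 1:2], log10(lib.size), method = "spearman")
print(round(pct.var[1:2], 1))
print(depth.cor)
```

A large absolute correlation for PC1 would suggest that depth is contributing to that axis, though with few samples this should be read together with the plot rather than instead of it.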

This code generates a downsampled version of sample j, keeping each read independently with probability p. For example, to bring the counts for sample j down to about 1/10 of their original value, set p=.1.

new.cts <- rbinom( nrow(dds), prob=p, size=counts(dds)[ , j] )


Then you can do:

mode(new.cts) <- "integer"
counts(dds)[,j] <- new.cts
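To see what this does without touching your real object, here is the same binomial downsampling on a toy matrix. It is a minimal sketch: the matrix `cts` stands in for `counts(dds)`, and the sample name "C", the value of `p` and the simulated counts are made up for illustration. Setting a seed makes the downsampled counts reproducible.

```r
set.seed(42)
# Toy count matrix standing in for counts(dds): 1000 genes x 3 samples
cts <- matrix(rnbinom(3000, mu = 200, size = 2), nrow = 1000,
              dimnames = list(NULL, c("A", "B", "C")))
j <- "C"   # sample to downsample (hypothetical)
p <- 0.1   # fraction of reads to keep

orig <- cts[, j]
# Binomial thinning: each read is kept independently with probability p
new.cts <- rbinom(nrow(cts), size = orig, prob = p)
storage.mode(new.cts) <- "integer"

before <- sum(orig)
cts[, j] <- new.cts
after <- sum(cts[, j])
print(c(before = before, after = after, ratio = after / before))
```

The ratio printed at the end should be close to p, and no gene can end up with more reads than it started with.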

Thanks a lot, Michael. Downsampling is a great idea. Thank you so much for the code; I will give it a try and report back.

I have regenerated the PCA after downsampling the sample "P7" by 1/3rd (to a similar sequencing depth as its replicate). I do not see any major improvement. What do you think? In our experiment, we had to resequence some samples because they were obtained from total RNA (host+pathogen) and we needed only the pathogen reads. Since the pathogen accounted for only a small percentage, we had to resequence multiple samples to bump up the reads. In your opinion, how much variability in sequencing depth across samples is okay to have?

Did you put P7 in twice? I would just replace the original P7 with the downsampled one.

Apologies for the confusion. The 2 P7s on the PCA are biological replicates. I have updated the "before" and "after" downsampling PCAs to reflect this.

So then I agree that P7 is still driving PC1 to some extent.

Indeed. Thanks a lot for the code. I have other datasets that may have the same issue with library size variation. Moving forward, in order to maintain consistency across datasets, how do I decide if downsampling is needed? Should I look at the PCAs, or should I look at the range and downsample all samples that exceed the minimum sequencing depth by a factor of 10? I am hoping to apply the same rule across the datasets, as they will be presented in the same manuscript.

Hmm, I'm not sure I have a hard-and-fast rule. I look at the PCA and the library size distribution for each dataset.
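For the library size part, something like the following gives a quick summary per dataset. It is only a sketch: the values in `lib.size` are made up (apart from the 100,000 and 81 million extremes mentioned above), and the 10x cutoff relative to the median is an arbitrary illustration, not a recommended rule. With a DESeq2 object, the library sizes would come from `colSums(counts(dds))`.

```r
# Hypothetical library sizes (total reads per sample); in DESeq2 these
# would come from colSums(counts(dds))
lib.size <- c(P1 = 1.0e5, P2 = 4.2e5, P3 = 2.5e6, P4 = 9.0e6,
              P5 = 1.2e7, P6 = 3.0e7, P7 = 8.1e7)

print(summary(lib.size))
# Ratio of deepest to shallowest library
fold.range <- max(lib.size) / min(lib.size)
# Depth of each sample relative to the median sample
rel.depth <- lib.size / median(lib.size)
# Flag samples far from the median; the 10x cutoff is arbitrary
flagged <- names(lib.size)[rel.depth > 10 | rel.depth < 1 / 10]
print(fold.range)
print(flagged)
```

Any samples flagged this way are the ones worth checking on the PCA before deciding whether to downsample.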