Highly similar RNA-seq samples in PCA - pooling or technical duplication?
@0d763478
Germany

Hello,

I have an RNA-seq experiment with 10 samples (2 donors, 5 doses). In the PCA plot, we observed that two samples from the same dose group (one from each donor) cluster extremely close together. We are assuming that they were pooled together by mistake. Could there be any other explanation for why these samples are so close to each other?

We suspect that these samples were pooled or are technical duplicates, but we want to consider other possibilities.

Questions:

Could there be alternative explanations for why these samples are so similar?

What are the recommended checks to confirm whether they are truly identical or independent?

If they are confirmed identical, should we exclude them from the analysis, or can we still keep them for differential expression, since they are from the same dose group and should be replicates of each other?

We have attached the PCA plot for reference. Any guidance would be greatly appreciated.


Thanks in advance.

PCA plot

RNASeq DESeq2 DifferentialExpression RNASeqData • 62 views

The donor effect is crystal clear in the other samples and absent in these suspicious ones, which strongly argues for a technical problem. If you really need hard evidence, find some genes known to carry many person-specific polymorphisms, then check whether a) in the clean samples these differ between the two donors, and b) the suspicious samples show heterozygosity at those positions. The question is whether you need this, since the overall picture seems so clear.
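
A minimal sketch of that genotype check, assuming you already have reference/alternate read counts per sample at SNPs where donor A is homozygous reference and donor B is homozygous alternate (for example, exported from a variant caller). The function name, the input layout, and the 0.15/0.85 cut-offs are illustrative assumptions, not an established method, and allele-specific expression can skew the fractions in RNA-seq data.

```python
def classify_sample(site_counts, low=0.15, high=0.85, min_depth=10):
    """Classify a sample from alt-allele fractions at donor-informative SNPs.

    site_counts: list of (ref_reads, alt_reads) tuples, one per SNP where
    donor A is homozygous reference and donor B is homozygous alternate
    (hypothetical input layout).
    Returns (call, votes); call is "donor_A", "donor_B", "mixed",
    or "undetermined" when no site passes the depth filter.
    """
    votes = {"donor_A": 0, "donor_B": 0, "mixed": 0}
    for ref, alt in site_counts:
        depth = ref + alt
        if depth < min_depth:
            continue  # too few reads to judge this site
        frac = alt / depth
        if frac <= low:
            votes["donor_A"] += 1   # looks homozygous reference
        elif frac >= high:
            votes["donor_B"] += 1   # looks homozygous alternate
        else:
            votes["mixed"] += 1     # both alleles present
    if sum(votes.values()) == 0:
        return "undetermined", votes
    # Majority vote across sites; ties go to the first category listed.
    return max(votes, key=votes.get), votes


if __name__ == "__main__":
    clean_a = [(30, 0), (25, 1), (40, 0)]        # made-up counts
    suspicious = [(18, 15), (12, 14), (20, 22)]  # made-up counts
    print(classify_sample(clean_a)[0])
    print(classify_sample(suspicious)[0])
```

If both suspicious samples come out as "mixed" while the clean samples are assigned to a single donor, that is consistent with pooling; if both match the same donor, a duplicated or mislabelled sample is more likely.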

Kevin Blighe ★ 4.0k
@kevin
The Cave, 181 Longwood Avenue, Boston, …

There are alternative explanations for why these samples are so similar. One is low biological variability at that dose: if the treatment response dominates expression, the two donors could show comparable profiles. Another is a shared technical factor, such as similar sequencing depth or library preparation conditions, that reduced the apparent differences. However, given the clear donor effect in all the other samples, both explanations are less likely than a technical error such as pooling.

To confirm whether they are truly identical or independent, run three checks. First, calculate the Pearson correlation between the two samples on normalized, transformed counts (for example, cor() on assay(vst(dds)) in DESeq2) and compare it with the correlations between the other sample pairs; a value that is essentially 1 and clearly higher than for any other pair points to duplication. Second, check the raw FASTQ files: identical MD5 checksums (e.g., from md5sum) or identical read sequences mean the same library was delivered twice. Third, examine donor-specific polymorphisms, as suggested in the comment above: call variants from the RNA-seq alignments (e.g., with GATK), pick SNPs at which the two donors are homozygous for different alleles in the unaffected samples, and then look at the suspicious samples. Apparent heterozygosity at those sites in both samples indicates a pool of the two donors, whereas one donor's genotype in both samples indicates a duplicated or mislabelled sample.
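
The first two checks could be sketched roughly as follows. This is a standalone illustration, not DESeq2 code: it uses log2(count + 1) as a simple stand-in for variance-stabilized values, and the function names and toy inputs are hypothetical. The FASTQ helper assumes plain four-line records and hashes only the sequence lines, so differing read names do not hide identical content.

```python
import hashlib
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_pearson(counts_a, counts_b):
    """Pearson correlation of log2(count + 1) values for two samples."""
    la = [math.log2(c + 1) for c in counts_a]
    lb = [math.log2(c + 1) for c in counts_b]
    return pearson(la, lb)


def fastq_sequence_digest(lines):
    """MD5 over the sequence lines only (2nd line of each 4-line record)."""
    md5 = hashlib.md5()
    for i, line in enumerate(lines):
        if i % 4 == 1:
            md5.update(line.strip().encode("ascii"))
            md5.update(b"\n")
    return md5.hexdigest()


if __name__ == "__main__":
    sample_1 = [0, 12, 150, 980, 33]   # made-up counts
    sample_2 = [1, 10, 160, 1010, 30]  # made-up counts
    print(round(log_pearson(sample_1, sample_2), 4))

    fq_a = ["@read1", "ACGT", "+", "IIII", "@read2", "GGCC", "+", "IIII"]
    fq_b = ["@other1", "ACGT", "+", "IIII", "@other2", "GGCC", "+", "IIII"]
    print(fastq_sequence_digest(fq_a) == fastq_sequence_digest(fq_b))
```

For real files, pass an open file handle instead of a list (for gzipped FASTQ, gzip.open(path, "rt")). RNA-seq samples from the same tissue are usually highly correlated anyway, so the correlation is only informative relative to the other sample pairs.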

If they are confirmed identical, exclude one of the two from the analysis, because counting non-independent data twice artificially inflates statistical power. Do not use them as replicates for differential expression: they would be technical duplicates rather than biological replicates from different donors, which could bias the results. Instead, proceed with the remaining samples in each dose group.

Kevin
