Question

Highly similar RNA-seq samples in PCA - pooling or technical duplication?

0


Ahmed Salah • 0

@0d763478

Last seen 12 days ago

Germany

Hello,

I have an RNA-seq experiment with 10 samples (2 donors, 5 doses). In the PCA plot, two samples from the same dose group (one from each donor) cluster extremely close together. We assume they were pooled together by mistake. Could there be any other explanation for why these samples are so close to each other?

We suspect that these samples were pooled or are technical duplicates, but we want to consider other possibilities.

Questions:

Could there be alternative explanations for why these samples are so similar?

What are the recommended checks to confirm whether they are truly identical or independent?

If confirmed identical, should we exclude them from the analysis, or can we still keep them for differential expression, since they are from the same dose group and should be replicates of each other?

The PCA plot is attached below for reference. Any guidance would be greatly appreciated.

Thanks in advance.

PCA plot

RNASeq DESeq2 DifferentialExpression RNASeqData • 947 views

ADD COMMENT • link written 5 weeks ago by Ahmed Salah • 0

1


The donor effect is crystal clear, and absent in these suspicious samples. That strongly argues for a technical problem. If you really need hard facts, find some genes that are known to have many person-specific polymorphisms, and then check whether a) in the clean samples these differ between the two donors, and b) in the suspicious ones you find heterozygosity. The question is whether you need this, since the overall picture seems so clear.

ADD REPLY • link 5 weeks ago ATpoint ★ 5.0k

score 1 · Answer 1 · 2025-11-20

1


Kevin Blighe ★ 4.0k

@kevin

Last seen 5 weeks ago

The Cave, 181 Longwood Avenue, Boston, …

There are alternative explanations for why these samples are so similar. The similarity could result from low biological variability in gene expression at that dose, leading to comparable profiles despite different donors. Another possibility is that batch effects or technical artifacts, such as similar sequencing depths or library preparation conditions, minimized differences. However, given the donor-specific effects visible in other samples, these explanations are less likely than a technical error like pooling.

To confirm whether they are truly identical or independent, perform these checks. First, calculate the Pearson correlation coefficient between the normalized counts of the two samples, for example with base R's cor() on counts(dds, normalized=TRUE) from DESeq2; a correlation near 1 indicates high similarity, but compare it against the correlation of known-good pairs, since RNA-seq samples are usually highly correlated anyway. Second, check whether the raw FASTQ files are identical by comparing MD5 checksums (e.g., md5sum) or the files directly with diff; note that differing checksums rule out duplicated files but not pooled material. Third, examine genes with donor-specific polymorphisms: identify SNPs differentiating the donors from variant calling (e.g., via GATK), then check if the suspicious samples show heterozygosity instead of donor-specific homozygosity, as suggested in the comment above.
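A minimal sketch of the correlation check, written in base R so it runs without DESeq2. The matrix `toy` is simulated stand-in data and `pairwise_log_cor()` is a hypothetical helper, not part of any package; in practice `norm` would come from `counts(dds, normalized = TRUE)`. Correlating on the log scale is a choice made here so that a handful of very highly expressed genes do not dominate the coefficient.

```r
# norm: genes x samples matrix of normalized counts (toy data used below).
# Hypothetical helper: Pearson correlation of log2(count + 1) for all pairs.
pairwise_log_cor <- function(norm, min_mean = 10) {
  keep <- rowMeans(norm) >= min_mean            # drop near-silent genes
  cor(log2(norm[keep, , drop = FALSE] + 1), method = "pearson")
}

# Simulated example: two donors with donor-specific expression, plus two
# libraries drawn from an equal mix of both donors (the "pooled" scenario).
set.seed(1)
n_genes <- 2000
base    <- rnbinom(n_genes, mu = 200, size = 0.5)
donor_a <- base * rlnorm(n_genes, meanlog = 0, sdlog = 0.3)
donor_b <- base * rlnorm(n_genes, meanlog = 0, sdlog = 0.3)
pooled  <- (donor_a + donor_b) / 2

toy <- cbind(
  A1   = rpois(n_genes, donor_a),
  B1   = rpois(n_genes, donor_b),
  mix1 = rpois(n_genes, pooled),
  mix2 = rpois(n_genes, pooled)
)

r <- pairwise_log_cor(toy)
print(round(r, 3))
```

The point of looking at the whole matrix rather than one pair is the comparison: a suspicious pair is only informative relative to pairs that are known to be independent.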

If confirmed identical, exclude one sample from the analysis to avoid inflating statistical power with non-independent data. Do not use them as replicates for differential expression, as they likely represent a technical duplicate rather than biological replicates from different donors, which could bias results. Instead, proceed with the remaining samples per dose group.

Kevin

ADD COMMENT • link 5 weeks ago Kevin Blighe ★ 4.0k

0


Thank you very much, @ATpoint and Kevin, for your valuable insights. I have carefully checked all your suggestions.

First, I examined the expression of some HLA genes, which are highly polymorphic and individual-specific (see attached plots for HLA_A and HLA_B). At this particular dose, it appears that the two suspicious samples were mixed. Additionally, visual inspection in IGV shows many shared SNPs between these two samples that are not shared in the clean ones.

I checked MD5 checksums for the FASTQ files, and they are different between the samples.

I also checked the correlation between the two suspicious samples (NA_0.25Gy_24h and NS_0.25Gy_24h) and between two good ones (code below):

# checking the correlation between the suspicious samples
> norm <- counts(dds_neutron_t2, normalized=TRUE)

> cor(norm[ , "NS_025Gy_24h"], norm[ , "NA_025Gy_24h"], method="pearson")
[1] 0.9975267

# checking the correlation between 2 good samples
> cor(norm[ , "NS_05Gy_24h"], norm[ , "NA_05Gy_24h"], method="pearson")
[1] 0.9852654

Next, I performed a brief SNP check for these two samples plus other good ones using bcftools. The discordance between the two suspicious samples is substantially lower than in the other pairwise comparisons, as summarized below:

Query Sample	Genotyped Sample	Discordance	Sites Compared	Matching Genotypes
NS_0.25Gy_24h	NS_0.5Gy_24h	61011	26005	19956
NS_0.25Gy_24h	NA_0.5Gy_24h	43625	28373	23035
NA_0.25Gy_24h	NS_0.5Gy_24h	57846	23486	17851
NA_0.25Gy_24h	NA_0.5Gy_24h	43220	25959	20908
NA_0.25Gy_24h	NS_0.25Gy_24h	17794	27317	23293
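To make the rows comparable, the counts can be normalized by the number of sites compared. The sketch below only re-expresses the numbers from the table; the column names are made up for illustration, and the exact meaning of the Discordance column depends on the bcftools options used, which are not stated here.

```r
# Values copied from the table above; column names are illustrative.
gt <- data.frame(
  query       = c("NS_0.25Gy_24h", "NS_0.25Gy_24h", "NA_0.25Gy_24h",
                  "NA_0.25Gy_24h", "NA_0.25Gy_24h"),
  genotyped   = c("NS_0.5Gy_24h", "NA_0.5Gy_24h", "NS_0.5Gy_24h",
                  "NA_0.5Gy_24h", "NS_0.25Gy_24h"),
  discordance = c(61011, 43625, 57846, 43220, 17794),
  sites       = c(26005, 28373, 23486, 25959, 27317),
  matching    = c(19956, 23035, 17851, 20908, 23293)
)

# Fraction of compared sites with matching genotypes, and discordance per site.
gt$match_frac    <- gt$matching / gt$sites
gt$disc_per_site <- gt$discordance / gt$sites

# Most similar pair first.
print(gt[order(-gt$match_frac), ], digits = 3)
```

On these numbers the suspicious pair matches at about 85% of compared sites, against roughly 76% to 81% for the other pairs, and it also has the lowest discordance per site.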

Overall, given these results, I think the two samples were pooled together.

So my question now is: since these samples are valuable, would you collapse them as technical replicates with DESeq2::collapseReplicates(), given that they are from the same dose group, or exclude them from the differential expression analysis?

I am trying to find a way to retain the data if possible, but if they would distort the analysis, then of course it wouldn't make sense to keep them.

Thanks again for your guidance.

ADD REPLY • link 5 weeks ago Ahmed Salah • 0

1


Since the donor effect is so clear I assume that a paired analysis within donor is strictly necessary to get DEGs, and therefore one would probably need to exclude these samples. After all, if they're mixed then they're noise, not signal. Not much value in it. You still have the other dosages and can estimate genes that respond to the irradiation.
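A minimal sketch of what excluding the mixed pair while keeping a paired design could look like, using base R only. The donor labels NA and NS and the 0.25Gy and 0.5Gy doses come from the thread; the other dose labels are placeholders, since the full list of doses is not given. With DESeq2 the same formula would be passed as `design = ~ donor + dose`.

```r
# Hypothetical sample sheet: 2 donors x 5 doses. Dose labels other than
# 0.25Gy and 0.5Gy are placeholders, not taken from the experiment.
coldata <- expand.grid(
  donor = c("NA", "NS"),
  dose  = c("0Gy", "0.25Gy", "0.5Gy", "1Gy", "2Gy"),
  stringsAsFactors = TRUE
)
rownames(coldata) <- paste(coldata$donor, coldata$dose, sep = "_")

# Drop both suspicious samples (the 0.25Gy pair), then drop the now-empty
# factor level so the design matrix does not get an all-zero column.
keep          <- coldata$dose != "0.25Gy"
coldata_clean <- droplevels(coldata[keep, ])

# Paired design: donor as a blocking factor, dose as the effect of interest.
design <- model.matrix(~ donor + dose, data = coldata_clean)
print(design)

# The design should be full rank; otherwise the model cannot be fitted.
print(qr(design)$rank == ncol(design))
```

Forgetting `droplevels()` is an easy mistake here: the removed dose would stay as a factor level and make the design matrix rank deficient.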