Highly similar RNA-seq samples in PCA - pooling or technical duplication?
@0d763478
Last seen 33 minutes ago
Germany

Hello,

I have an RNA-seq experiment with 10 samples (2 donors, 5 doses). In the PCA plot, two samples from the same dose group (one from each donor) cluster extremely close together. We assume that they were pooled by mistake. Could there be any other explanation for why these samples are so close to each other?

We suspect that these samples were pooled or are technical duplicates, but we want to consider other possibilities.

Questions:

Could there be alternative explanations for why these samples are so similar?

What are the recommended checks to confirm whether they are truly identical or independent?

If confirmed identical, should we exclude them from the analysis, or can we still keep them for differential expression, since they are from the same dose group and should be replicates of each other?

We have attached the PCA plot for reference. Any guidance would be greatly appreciated.

Thanks in advance.

[Figure: PCA plot]

RNASeq DESeq2 DifferentialExpression RNASeqData • 90 views

The donor effect is crystal clear in the other samples and absent in these suspicious ones, which strongly argues for a technical problem. If you really need hard evidence, find some genes that are known to carry many person-specific polymorphisms, and then check whether a) in the clean samples these differ between the two donors, and b) in the suspicious ones you find apparent heterozygosity. The question is whether you need this, since the overall picture seems so clear.
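
As a rough illustration of point b), the sketch below simulates alternate-allele read counts at sites where the two donors are homozygous for different alleles. Everything in it is an assumption made for illustration: the depth, the error rate, the 50:50 mixing ratio and the sample labels are invented, and real RNA-seq allele fractions are also shaped by expression differences between donors and by allele-specific expression.

```r
# Toy simulation: alternate-allele read counts at sites where the two donors
# are homozygous for different alleles. All numbers are invented.
set.seed(1)
n_sites <- 200
depth   <- rep(40, n_sites)   # assumed read depth per site
err     <- 0.01               # assumed per-read error rate

# Donor A is hom-ref (alt fraction near 0), donor B is hom-alt (near 1).
alt_clean_A <- rbinom(n_sites, depth, err)
alt_clean_B <- rbinom(n_sites, depth, 1 - err)

# An assumed 50:50 mixture of both donors gives alt fractions near 0.5.
alt_mixed <- rbinom(n_sites, depth, 0.5)

# Alternate-allele fraction per site
alt_fraction <- function(alt, depth) alt / depth

summary_tab <- data.frame(
  sample          = c("clean_A", "clean_B", "mixed"),
  median_alt_frac = c(median(alt_fraction(alt_clean_A, depth)),
                      median(alt_fraction(alt_clean_B, depth)),
                      median(alt_fraction(alt_mixed,   depth)))
)
print(summary_tab)
```

With real data, the per-site counts would come from the allelic depth reported by the variant caller. Fractions that sit between 0 and 1 in the suspicious samples, but near 0 or 1 in the clean ones, would point towards a mixture.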

Kevin Blighe ★ 4.0k
@kevin
Last seen 8 hours ago
The Cave, 181 Longwood Avenue, Boston, …

There are alternative explanations for why these samples are so similar. The similarity could result from low biological variability in gene expression at that dose, leading to comparable profiles despite different donors. Another possibility is that batch effects or technical artifacts, such as similar sequencing depths or library preparation conditions, minimized differences. However, given the donor-specific effects visible in other samples, these explanations are less likely than a technical error like pooling.

To confirm whether they are truly identical or independent, perform these checks. First, calculate the Pearson correlation coefficient between the normalized counts of the two samples with base R's cor() (for example on counts(dds, normalized=TRUE), ideally log-transformed so that a few highly expressed genes do not dominate), and compare it with the correlation between the two donors at other doses; a value near 1 on its own is not informative, because RNA-seq samples of the same type are usually highly correlated. Second, check whether the raw FASTQ files are identical by comparing MD5 checksums or the read sequences themselves. Third, examine genes with donor-specific polymorphisms: call variants (e.g., with GATK) to identify SNPs that differentiate the donors, then check whether the suspicious samples show apparent heterozygosity at sites where the clean samples are homozygous for different alleles, as suggested in the comment above.
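
The first check can be sketched as a full pairwise correlation matrix on log-transformed values, so that the suspicious pair is judged against the other donor pairs rather than against 1. This is a minimal sketch on simulated data: the count model, the sample names (A_clean, B_clean, A_susp, B_susp) and the effect sizes are all assumptions, and in a real analysis the matrix would be the normalized counts from the DESeq2 object (or variance-stabilized values, e.g. from vst()).

```r
# Simulated stand-in for a matrix of normalized counts (genes x samples).
set.seed(42)
n_genes <- 2000
base <- rgamma(n_genes, shape = 0.5, scale = 200)  # skewed gene-level means

# Each donor gets its own gene-wise multiplicative deviation (donor effect).
donor_A <- base * exp(rnorm(n_genes, sd = 0.3))
donor_B <- base * exp(rnorm(n_genes, sd = 0.3))
pooled  <- (donor_A + donor_B) / 2                 # assumed 50:50 mixture

sim <- function(mu) rpois(n_genes, mu)             # sampling noise only
norm <- cbind(
  A_clean = sim(donor_A),
  B_clean = sim(donor_B),
  A_susp  = sim(pooled),
  B_susp  = sim(pooled)
)

# Log-transform so that a few highly expressed genes do not dominate,
# then compute all pairwise Pearson correlations.
log_norm <- log2(norm + 1)
cor_mat  <- cor(log_norm, method = "pearson")
print(round(cor_mat, 3))
```

In this simulation the A_susp/B_susp pair correlates more strongly than the A_clean/B_clean pair, because only sampling noise separates the two mixed samples. How large that gap is in real data depends on the size of the donor effect.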

If confirmed identical, exclude one sample from the analysis to avoid inflating statistical power with non-independent data. Do not use them as replicates for differential expression, as they likely represent a technical duplicate rather than biological replicates from different donors, which could bias results. Instead, proceed with the remaining samples per dose group.

Kevin


Thank you very much, @ATpoint and Kevin, for your valuable insights. I have worked through all your suggestions.

First, I examined the expression of some HLA genes, which are highly polymorphic and individual-specific (see attached figures: HLA_A, HLA_B). At this particular dose, it appears that the two suspicious samples were mixed. Additionally, visual inspection in IGV shows many SNPs shared between these two samples that are not shared between the clean ones.

I checked the MD5 checksums of the FASTQ files, and they differ between the samples, so the files are not byte-identical copies of each other.
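
For reference, a checksum comparison like this can be done from within R using tools::md5sum(). The sketch below is only a toy: it writes three tiny FASTQ-like files with placeholder names to a temporary directory and flags those with identical content. Note that differing checksums only rule out file-level duplication, not mixing of material before library preparation, and that gzip-compressed files can differ in checksum even with identical reads, because the gzip header may store a timestamp and file name.

```r
# Write three tiny FASTQ-like files: two with identical content, one different.
# File names and reads are placeholders.
tmp <- tempfile("fastq_demo_")
dir.create(tmp)
rec1 <- c("@read1", "ACGTACGT", "+", "IIIIIIII")
rec2 <- c("@read1", "ACGTACGA", "+", "IIIIIIII")
writeLines(rec1, file.path(tmp, "sample_A.fastq"))
writeLines(rec1, file.path(tmp, "sample_A_copy.fastq"))
writeLines(rec2, file.path(tmp, "sample_B.fastq"))

# Compute one MD5 checksum per file
files <- list.files(tmp, full.names = TRUE)
sums  <- tools::md5sum(files)
names(sums) <- basename(files)
print(sums)

# Flag every file whose checksum also occurs for another file
dup <- duplicated(sums) | duplicated(sums, fromLast = TRUE)
print(names(sums)[dup])
```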

I also checked the correlation between the two suspicious samples (NA_0.25Gy_24h and NS_0.25Gy_24h) and between two good ones (code attached):

# checking the correlation between the suspicious samples
> norm <- counts(dds_neutron_t2, normalized=TRUE)

> cor(norm[ , "NS_025Gy_24h"], norm[ , "NA_025Gy_24h"], method="pearson")
[1] 0.9975267

# checking the correlation between 2 good samples
> cor(norm[ , "NS_05Gy_24h"], norm[ , "NA_05Gy_24h"], method="pearson")
[1] 0.9852654

Next, I performed a brief SNP check on these two samples and some good ones using bcftools. The discordance between the two suspicious samples is markedly lower than in the other pairwise comparisons, as summarized below:

Query Sample    Genotyped Sample  Discordance  Sites Compared  Matching Genotypes
NS_0.25Gy_24h   NS_0.5Gy_24h            61011           26005               19956
NS_0.25Gy_24h   NA_0.5Gy_24h            43625           28373               23035
NA_0.25Gy_24h   NS_0.5Gy_24h            57846           23486               17851
NA_0.25Gy_24h   NA_0.5Gy_24h            43220           25959               20908
NA_0.25Gy_24h   NS_0.25Gy_24h           17794           27317               23293
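
Because the number of compared sites differs between pairs, it can help to normalize the table before comparing rows. The sketch below takes the values from the table above and computes the fraction of matching genotypes and the discordance per compared site. The column names in the data frame are my own, and the per-site discordance is only a rough scale, since the exact definition of the discordance score depends on the bcftools version and options used.

```r
# Values copied from the table above; column names are chosen for this sketch.
gt <- data.frame(
  query       = c("NS_0.25Gy_24h", "NS_0.25Gy_24h", "NA_0.25Gy_24h",
                  "NA_0.25Gy_24h", "NA_0.25Gy_24h"),
  target      = c("NS_0.5Gy_24h", "NA_0.5Gy_24h", "NS_0.5Gy_24h",
                  "NA_0.5Gy_24h", "NS_0.25Gy_24h"),
  discordance = c(61011, 43625, 57846, 43220, 17794),
  sites       = c(26005, 28373, 23486, 25959, 27317),
  matching    = c(19956, 23035, 17851, 20908, 23293)
)

# Normalize by the number of sites compared in each pair
gt$match_fraction       <- gt$matching / gt$sites
gt$discordance_per_site <- gt$discordance / gt$sites

# Most similar pair first
gt <- gt[order(-gt$match_fraction), ]
print(gt[, c("query", "target", "match_fraction", "discordance_per_site")],
      digits = 3, row.names = FALSE)
```

On these numbers, the NA_0.25Gy_24h / NS_0.25Gy_24h pair has the highest match fraction (about 0.85, against 0.76 to 0.81 for the other pairs) and the lowest discordance per site.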

Overall, given these results, the two samples appear to have been pooled.

So my question now is: since these samples are valuable, would you collapse the two samples as technical replicates with DESeq2::collapseReplicates(), given that they are from the same dose group, or exclude them from the differential expression analysis?

I am trying to find a way to retain the data if possible, but if they would compromise the analysis, then of course it wouldn't make sense to keep them.

Thanks again for your guidance.
