DE Analysis on cells from a patient derived mouse xenograft with high levels of mouse count "contamination"
1
0
Entering edit mode
jropa ▴ 10
@jropa-22911
Last seen 22 months ago
United States

I am performing a differential expression analysis for collaborators. The overall biological design from my collaborators is as follows: 1) Received patient sample. 2) Amplified patient sample using patient derived xenograft (PDX) in a mouse host. 3) Extracted cells from mouse and enriched for human cells by positive selection using a human specific antibody. 4) Treated cells ex vivo with Treatments A, B, or A+B (and control). 5) Harvested RNA and submitted for poly-A selected RNA-seq.

Here's where I came in. Knowing where the sample came from, I aligned the RNA-seq reads with a combined human+mouse genome. After a quick preliminary analysis, I found that despite the enrichment the collaborators performed on their cells, the libraries only contained ~40-60% human reads, with the remainder of the reads being assigned to mouse genes.

My question is: for DESeq2 analysis, when should I remove the mouse reads from the analysis? I only care about the effects on the human cells, but I don't want to inappropriately influence the library composition. Does the RNA composition bias which the normalization addresses arise primarily from library preparation and sequencing artifacts or is it a biological phenomena, or a combination of the two? My thinking is that if it is a sequencing artifact I have to keep the mouse reads present for normalization, but if it is due to an underlying biological phenomena I can probably remove them prior to normalization.

I performed the analysis both ways (see sample code below) and there are clear differences not only in the number of significantly DE genes found, but also in the magnitude of many changes, which is unsurprising but confirms that I need to determine which is the more appropriate manner to perform the analysis.

Any input, corrections to my line of thinking, or advice would be greatly appreciated!

> ########Sample Code
> library(DESeq2)
> 
> #Generate sample table
> directory <- "counts"
> sampleFiles <- grep("counts",list.files(directory),value=TRUE)
> sampleTreat <- factor(c(rep("con",3),rep("A",3),rep("B",3),rep("AB",3)),levels=c("con","A","B","AB"))
> sampleRep <- c(rep(c(1,2,3),4))
> sampleName <- paste(sampleTreat,sampleRep,sep="_")
> sample_df <- data.frame(sampleName,sampleFiles,sampleTreat)
> 
> ####DE analysis with mouse and human counts present
> dds <- DESeqDataSetFromHTSeqCount(sample_df,directory=directory,design=~sampleTreat)
> dds <- estimateSizeFactors(dds)
> 
> idx <- rowSums(counts(dds,normalized=TRUE)>=5)>=3
> dds <- dds[idx,]
> dds <- DESeq(dds)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> 
> res.A <- results(dds, alpha=0.05,contrast=c("sampleTreat","A","con"))
> summary(res.A)

out of 28817 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 489, 1.7%
LFC < 0 (down)     : 2309, 8%
outliers [1]       : 18, 0.062%
low counts [2]     : 3353, 12%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> res.A.df <- data.frame(res.A)
> 
> #Extract results from human genes only
> res.A.df$ensembl <- gsub("\\..*","",rownames(res.A.df))
> res.A.df.human <- res.A.df[grep("^ENSG",res.A.df$ensembl),]
> 
> ####DE analysis with after removing mouse counts
> dds2 <- DESeqDataSetFromHTSeqCount(sample_df,directory=directory,design=~sampleTreat)
> dds2 <- dds2[grep("^ENSG",rownames(dds2)),]
> dds2 <- estimateSizeFactors(dds2)
> 
> idx2 <- rowSums(counts(dds2,normalized=TRUE)>=5)>=3
> dds2 <- dds2[idx2,]
> dds2 <- DESeq(dds2)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
> 
> res.2.A <- results(dds2, alpha=0.05,contrast=c("sampleTreat","A","con"))
> 
> ###Compare significant genes from different analyses
> ##DE analysis with human and mouse counts present
> #Total
> length(subset(res.A.df.human,padj<0.05)[[1]])
[1] 504
> #Up
> length(subset(res.A.df.human,padj<0.05&log2FoldChange>0)[[1]])
[1] 351
> #Down
> length(subset(res.A.df.human,padj<0.05&log2FoldChange<0)[[1]])
[1] 153
> ##DE analysis with mouse counts removed
> summary(res.2.A)

out of 14589 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 89, 0.61%
LFC < 0 (down)     : 242, 1.7%
outliers [1]       : 3, 0.021%
low counts [2]     : 2263, 16%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

> 
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.3
DESeq2 • 775 views
ADD COMMENT
1
Entering edit mode
@wolfgang-huber-3550
Last seen 4 months ago
EMBL European Molecular Biology Laborat…

The "Ockham's razor" principle says that if something can be explained by a technical bias (like, differences in enrichment efficiency for human cells) and explained away by reasonable normalization, then one should not invoke more complicated reasons (like, true differential expression in the human cells). I think this points towards your second analysis.

I would do some EDA (scatterplots, MA plots), stratified by species, & for human counts only, do better understand the data.

Perhaps this kind of experiment should nowadays be conducted using a single-cell resolution.

ADD COMMENT
1
Entering edit mode

Thanks so much for the assistance and really rapid response.

First off, I agree 100% that this type of experiment would benefit greatly from a single-cell approach. I have passed that advice on to my collaborators as well.

I think your advice for doing some exploratory plot generation was spot on. I've included a few here. First for Treatment A, there is a modest difference when mouse and human counts are both present for DE analysis- human genes are in red, mouse genes in blue (this is also confirmed by a slight shift in the linear correlation of fold-change values between the two different analyses, which I didn't include here for the sake of space).

MAplot_DE includes mouse and human countsMAplot_DE excludes mouse counts

Treatment B had a much more striking shift, again human genes in red, mouse genes in blue:

MAplot_DE excludes mouse counts TreatmentBMAplot_DE includes mouse and human counts TreatmentB

For the sake of not flooding the page with plots, I won't show all the scatter plots I also looked at but basically what I found was that while there was not a large shift in normalized counts, there was a modest shift in fold-changes calculated in the two different analyses for all genes, which is most dramatic in Treatment B. The biggest differences I found were in statistical power, where for instance Treatment B had 126 genes significantly unregulated when mouse counts were excluded but >4000 upregulated (human genes only) when mouse counts were included in DE analysis. I'm not sure if this sheds more light on the problem or not. I do think it also points to the second analysis as fitting the data better, and suggests some kind of technical artifact instead of a true biological phenomena as you suggested, though I'm still having a bit of a hard time understanding from a biological and technical perspective why the second analysis might be the more appropriate. But anyways, I appreciate the help and will update this post if I have any insights as I continue to dig into the data.

ADD REPLY

Login before adding your answer.

Traffic: 602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6