I am performing a differential expression analysis for collaborators. The overall biological design from my collaborators is as follows: 1) Received patient sample. 2) Amplified patient sample using patient derived xenograft (PDX) in a mouse host. 3) Extracted cells from mouse and enriched for human cells by positive selection using a human specific antibody. 4) Treated cells ex vivo with Treatments A, B, or A+B (and control). 5) Harvested RNA and submitted for poly-A selected RNA-seq.
Here's where I came in. Knowing where the sample came from, I aligned the RNA-seq reads with a combined human+mouse genome. After a quick preliminary analysis, I found that despite the enrichment the collaborators performed on their cells, the libraries only contained ~40-60% human reads, with the remainder of the reads being assigned to mouse genes.
My question is: for DESeq2 analysis, when should I remove the mouse reads from the analysis? I only care about the effects on the human cells, but I don't want to inappropriately influence the library composition. Does the RNA composition bias which the normalization addresses arise primarily from library preparation and sequencing artifacts or is it a biological phenomena, or a combination of the two? My thinking is that if it is a sequencing artifact I have to keep the mouse reads present for normalization, but if it is due to an underlying biological phenomena I can probably remove them prior to normalization.
I performed the analysis both ways (see sample code below) and there are clear differences not only in the number of significantly DE genes found, but also in the magnitude of many changes, which is unsurprising but confirms that I need to determine which is the more appropriate manner to perform the analysis.
Any input, corrections to my line of thinking, or advice would be greatly appreciated!
> ########Sample Code
> library(DESeq2)
>
> #Generate sample table
> directory <- "counts"
> sampleFiles <- grep("counts",list.files(directory),value=TRUE)
> sampleTreat <- factor(c(rep("con",3),rep("A",3),rep("B",3),rep("AB",3)),levels=c("con","A","B","AB"))
> sampleRep <- c(rep(c(1,2,3),4))
> sampleName <- paste(sampleTreat,sampleRep,sep="_")
> sample_df <- data.frame(sampleName,sampleFiles,sampleTreat)
>
> ####DE analysis with mouse and human counts present
> dds <- DESeqDataSetFromHTSeqCount(sample_df,directory=directory,design=~sampleTreat)
> dds <- estimateSizeFactors(dds)
>
> idx <- rowSums(counts(dds,normalized=TRUE)>=5)>=3
> dds <- dds[idx,]
> dds <- DESeq(dds)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>
> res.A <- results(dds, alpha=0.05,contrast=c("sampleTreat","A","con"))
> summary(res.A)
out of 28817 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 489, 1.7%
LFC < 0 (down) : 2309, 8%
outliers [1] : 18, 0.062%
low counts [2] : 3353, 12%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
> res.A.df <- data.frame(res.A)
>
> #Extract results from human genes only
> res.A.df$ensembl <- gsub("\\..*","",rownames(res.A.df))
> res.A.df.human <- res.A.df[grep("^ENSG",res.A.df$ensembl),]
>
> ####DE analysis with after removing mouse counts
> dds2 <- DESeqDataSetFromHTSeqCount(sample_df,directory=directory,design=~sampleTreat)
> dds2 <- dds2[grep("^ENSG",rownames(dds2)),]
> dds2 <- estimateSizeFactors(dds2)
>
> idx2 <- rowSums(counts(dds2,normalized=TRUE)>=5)>=3
> dds2 <- dds2[idx2,]
> dds2 <- DESeq(dds2)
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
>
> res.2.A <- results(dds2, alpha=0.05,contrast=c("sampleTreat","A","con"))
>
> ###Compare significant genes from different analyses
> ##DE analysis with human and mouse counts present
> #Total
> length(subset(res.A.df.human,padj<0.05)[[1]])
[1] 504
> #Up
> length(subset(res.A.df.human,padj<0.05&log2FoldChange>0)[[1]])
[1] 351
> #Down
> length(subset(res.A.df.human,padj<0.05&log2FoldChange<0)[[1]])
[1] 153
> ##DE analysis with mouse counts removed
> summary(res.2.A)
out of 14589 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up) : 89, 0.61%
LFC < 0 (down) : 242, 1.7%
outliers [1] : 3, 0.021%
low counts [2] : 2263, 16%
(mean count < 7)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
>
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.3
Thanks so much for the assistance and really rapid response.
First off, I agree 100% that this type of experiment would benefit greatly from a single-cell approach. I have passed that advice on to my collaborators as well.
I think your advice for doing some exploratory plot generation was spot on. I've included a few here. First for Treatment A, there is a modest difference when mouse and human counts are both present for DE analysis- human genes are in red, mouse genes in blue (this is also confirmed by a slight shift in the linear correlation of fold-change values between the two different analyses, which I didn't include here for the sake of space).
Treatment B had a much more striking shift, again human genes in red, mouse genes in blue:
For the sake of not flooding the page with plots, I won't show all the scatter plots I also looked at but basically what I found was that while there was not a large shift in normalized counts, there was a modest shift in fold-changes calculated in the two different analyses for all genes, which is most dramatic in Treatment B. The biggest differences I found were in statistical power, where for instance Treatment B had 126 genes significantly unregulated when mouse counts were excluded but >4000 upregulated (human genes only) when mouse counts were included in DE analysis. I'm not sure if this sheds more light on the problem or not. I do think it also points to the second analysis as fitting the data better, and suggests some kind of technical artifact instead of a true biological phenomena as you suggested, though I'm still having a bit of a hard time understanding from a biological and technical perspective why the second analysis might be the more appropriate. But anyways, I appreciate the help and will update this post if I have any insights as I continue to dig into the data.