Question

RNA-seq analysis on mouse data with human contamination in controls

0

Entering edit mode

Endre Sebestyén ▴ 70

@endre-sebestyen-2707

Last seen 4 months ago

Hungary, Budapest

Hi all,

I recently started analysing a mouse RNA-seq dataset, where I had the following data:

6 control samples, all of them contaminated with a human cell line
6 KD samples with no contamination

After seeing the "experimental setup" my first thought was to just trash it, take a deep breath, and tell the collaborators to redo the experiments/sequencing as the contamination and the control condition is perfectly confounded.

Later I still started doing some basic analysis. The pipeline is the following:

combine an up-to-date mouse and human transcriptome annotation from GENCODE
run Salmon using the combined transcriptome index
tximport to import gene level counts
calculate read number and % mapping to human transcripts
do limma/voom for differential expression only on the mouse genes, using:

model.matrix(~ human_read_percent + condition)

Of course those samples that have no human contamination, still have somewhere between 3-7% reads mapped to the human transcriptome, thanks to homology. There are no significantly differentially expressed genes after this, not even the gene that was knocked down.

Is there any hope of improving this somehow, or I should just give up? Thanks for any advice!

rna-seq salmon tximport limma • 2.6k views

ADD COMMENT • link updated 7.0 years ago by Aaron Lun ★ 28k • written 7.0 years ago by Endre Sebestyén ▴ 70

0

Entering edit mode

This may be an impossible effort due to the homology. One thing to note about your model is that you will have in actuality *linear* increases in counts/expression with human_read_percent, but all the statistical models are working on the log scale, whether modeling log CPM or using a GLM with log link. So it doesn't make sense to have human_read_percent in the model matrix as a continuous vector when it's being used to explain log count increases.

"There are no significantly differentially expressed genes after this, not even the gene that was knocked down." What do the TPMs look like for this gene across the samples?

ADD REPLY • link 7.0 years ago Michael Love 41k

0

Entering edit mode

OK, maybe I can drop human_read_percent from the model, and see results then.

The TPM is between 0 - 0.1 for all 12 samples, both for the mouse gene and the human ortholog. Wondering if I have other problems too... BTW, based on biomart, I have a one 2 one ortholog, with 98% sequence identity between mouse/human for the KD.

ADD REPLY • link 7.0 years ago Endre Sebestyén ▴ 70

0

Entering edit mode

Yeah, I'd say there are issues pre-Bioconductor packages, which you'll have to look into first.

ADD REPLY • link 7.0 years ago Michael Love 41k

score 1 · Answer 1 · 2017-05-01

The goal during quantification would be to remove ambiguous reads that could map to human or mouse. In a conventional aligner, you could just remove multi-mapping reads based on the MAPQ score or other flags. This will reduce your power due to homologies, but the alternative would be to get "differential expression" due to human contamination, which would be bad. I'm not sure that Salmon will explicitly remove these reads - they might still contribute to quantification, so best to check that they're being tossed out. Better safe than sorry here.

Moving onto your model; in addition to Mike's comment, putting the human contamination as a covariate will invariably reduce power because it's confounded with the experimental condition. Any differences in expression between the control/KD conditions can be partially absorbed into the coefficient for the contamination covariate under the null model, reducing the evidence to reject the null. As Mike says, check whether the lack of DE for your positive control is actually due to an absence of change in expression. If the target gene has no expression or constant expression across conditions, due to homologous mapping/read pruning, then you're out of luck.