RNA-seq analysis on mouse data with human contamination in controls
1
0
Entering edit mode
@endre-sebestyen-2707
Last seen 13 months ago
Hungary, Budapest

Hi all,

I recently started analysing a mouse RNA-seq dataset, where I had the following data:

  • 6 control samples, all of them contaminated with a human cell line
  • 6 KD samples with no contamination

After seeing the "experimental setup" my first thought was to just trash it, take a deep breath, and tell the collaborators to redo the experiments/sequencing as the contamination and the control condition is perfectly confounded.

Later I still started doing some basic analysis. The pipeline is the following:

  • combine an up-to-date mouse and human transcriptome annotation from GENCODE
  • run Salmon using the combined transcriptome index
  • tximport to import gene level counts
  • calculate read number and % mapping to human transcripts
  • do limma/voom for differential expression only on the mouse genes, using:
model.matrix(~ human_read_percent + condition)

Of course those samples that have no human contamination, still have somewhere between 3-7% reads mapped to the human transcriptome, thanks to homology. There are no significantly differentially expressed genes after this, not even the gene that was knocked down.

Is there any hope of improving this somehow, or I should just give up? Thanks for any advice!

 

rna-seq salmon tximport limma • 3.1k views
ADD COMMENT
0
Entering edit mode

This may be an impossible effort due to the homology. One thing to note about your model is that you will have in actuality *linear* increases in counts/expression with human_read_percent, but all the statistical models are working on the log scale, whether modeling log CPM or using a GLM with log link. So it doesn't make sense to have human_read_percent in the model matrix as a continuous vector when it's being used to explain log count increases.

"There are no significantly differentially expressed genes after this, not even the gene that was knocked down." What do the TPMs look like for this gene across the samples?

ADD REPLY
0
Entering edit mode

OK, maybe I can drop human_read_percent from the model, and see results then.

The TPM is between 0 - 0.1 for all 12 samples, both for the mouse gene and the human ortholog. Wondering if I have other problems too... BTW, based on biomart, I have a one 2 one ortholog, with 98% sequence identity between mouse/human for the KD.

ADD REPLY
0
Entering edit mode

Yeah, I'd say there are issues pre-Bioconductor packages, which you'll have to look into first.

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 6 hours ago
The city by the bay

The goal during quantification would be to remove ambiguous reads that could map to human or mouse. In a conventional aligner, you could just remove multi-mapping reads based on the MAPQ score or other flags. This will reduce your power due to homologies, but the alternative would be to get "differential expression" due to human contamination, which would be bad. I'm not sure that Salmon will explicitly remove these reads - they might still contribute to quantification, so best to check that they're being tossed out. Better safe than sorry here.

Moving onto your model; in addition to Mike's comment, putting the human contamination as a covariate will invariably reduce power because it's confounded with the experimental condition. Any differences in expression between the control/KD conditions can be partially absorbed into the coefficient for the contamination covariate under the null model, reducing the evidence to reject the null. As Mike says, check whether the lack of DE for your positive control is actually due to an absence of change in expression. If the target gene has no expression or constant expression across conditions, due to homologous mapping/read pruning, then you're out of luck.

ADD COMMENT
0
Entering edit mode

Salmon does not remove the species ambiguous reads, but tries to resolve them and assign to a given transcript, using the combined human/mouse transcriptome. The human ortholog of the KD has 98% sequence identity, based on biomart, so yeah, maybe I should give up.

ADD REPLY

Login before adding your answer.

Traffic: 616 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6