Search
Question: RNA-seq analysis on mouse data with human contamination in controls
0
gravatar for Endre Sebestyen
6 months ago by
Endre Sebestyen60 wrote:

Hi all,

I recently started analysing a mouse RNA-seq dataset, where I had the following data:

  • 6 control samples, all of them contaminated with a human cell line
  • 6 KD samples with no contamination

After seeing the "experimental setup" my first thought was to just trash it, take a deep breath, and tell the collaborators to redo the experiments/sequencing as the contamination and the control condition is perfectly confounded.

Later I still started doing some basic analysis. The pipeline is the following:

  • combine an up-to-date mouse and human transcriptome annotation from GENCODE
  • run Salmon using the combined transcriptome index
  • tximport to import gene level counts
  • calculate read number and % mapping to human transcripts
  • do limma/voom for differential expression only on the mouse genes, using:
model.matrix(~ human_read_percent + condition)

Of course those samples that have no human contamination, still have somewhere between 3-7% reads mapped to the human transcriptome, thanks to homology. There are no significantly differentially expressed genes after this, not even the gene that was knocked down.

Is there any hope of improving this somehow, or I should just give up? Thanks for any advice!

 

ADD COMMENTlink modified 6 months ago by Aaron Lun17k • written 6 months ago by Endre Sebestyen60

This may be an impossible effort due to the homology. One thing to note about your model is that you will have in actuality *linear* increases in counts/expression with human_read_percent, but all the statistical models are working on the log scale, whether modeling log CPM or using a GLM with log link. So it doesn't make sense to have human_read_percent in the model matrix as a continuous vector when it's being used to explain log count increases.

"There are no significantly differentially expressed genes after this, not even the gene that was knocked down." What do the TPMs look like for this gene across the samples?

ADD REPLYlink modified 6 months ago • written 6 months ago by Michael Love14k

OK, maybe I can drop human_read_percent from the model, and see results then.

The TPM is between 0 - 0.1 for all 12 samples, both for the mouse gene and the human ortholog. Wondering if I have other problems too... BTW, based on biomart, I have a one 2 one ortholog, with 98% sequence identity between mouse/human for the KD.

ADD REPLYlink written 6 months ago by Endre Sebestyen60

Yeah, I'd say there are issues pre-Bioconductor packages, which you'll have to look into first.

ADD REPLYlink written 6 months ago by Michael Love14k
1
gravatar for Aaron Lun
6 months ago by
Aaron Lun17k
Cambridge, United Kingdom
Aaron Lun17k wrote:

The goal during quantification would be to remove ambiguous reads that could map to human or mouse. In a conventional aligner, you could just remove multi-mapping reads based on the MAPQ score or other flags. This will reduce your power due to homologies, but the alternative would be to get "differential expression" due to human contamination, which would be bad. I'm not sure that Salmon will explicitly remove these reads - they might still contribute to quantification, so best to check that they're being tossed out. Better safe than sorry here.

Moving onto your model; in addition to Mike's comment, putting the human contamination as a covariate will invariably reduce power because it's confounded with the experimental condition. Any differences in expression between the control/KD conditions can be partially absorbed into the coefficient for the contamination covariate under the null model, reducing the evidence to reject the null. As Mike says, check whether the lack of DE for your positive control is actually due to an absence of change in expression. If the target gene has no expression or constant expression across conditions, due to homologous mapping/read pruning, then you're out of luck.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Aaron Lun17k

Salmon does not remove the species ambiguous reads, but tries to resolve them and assign to a given transcript, using the combined human/mouse transcriptome. The human ortholog of the KD has 98% sequence identity, based on biomart, so yeah, maybe I should give up.

ADD REPLYlink written 6 months ago by Endre Sebestyen60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 173 users visited in the last hour