Hi all,
I recently started analysing a mouse RNA-seq dataset, where I had the following data:
- 6 control samples, all of them contaminated with a human cell line
- 6 KD samples with no contamination
After seeing the "experimental setup", my first thought was to just trash it, take a deep breath, and tell the collaborators to redo the experiments/sequencing, as the contamination and the control condition are perfectly confounded.
I went ahead with some basic analysis anyway. The pipeline is the following:
- combine an up-to-date mouse and human transcriptome annotation from GENCODE
- run Salmon using the combined transcriptome index
- tximport to import gene level counts
- calculate read number and % mapping to human transcripts
- do limma/voom for differential expression only on the mouse genes, using:
model.matrix(~ human_read_percent + condition)
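The last two steps above can be sketched in base R as follows. This is only an illustration under assumptions: the count matrix, gene IDs, and sample names are hypothetical placeholders, and the human/mouse split relies on GENCODE gene IDs starting with "ENSG" (human) and "ENSMUSG" (mouse). The voom/limma fit itself is not shown.

```r
# Hypothetical gene-level counts: rows are genes, columns are samples.
# The IDs below are placeholders that only mimic the GENCODE prefixes.
gene_ids <- c("ENSMUSG00000000001", "ENSMUSG00000000002",
              "ENSG00000000001", "ENSG00000000002")
samples  <- c("ctrl_1", "ctrl_2", "kd_1", "kd_2")

# matrix() fills column by column, so each row of numbers below is one sample.
counts <- matrix(
  c(600, 300,  80, 20,   # ctrl_1
    500, 300, 150, 50,   # ctrl_2
    700, 250,  40, 10,   # kd_1
    640, 300,  50, 10),  # kd_2
  nrow = 4, dimnames = list(gene_ids, samples)
)

# "^ENSG" matches human IDs only; mouse IDs start with "ENSM".
is_human <- grepl("^ENSG", rownames(counts))

# Percentage of each sample's reads assigned to human genes.
human_read_percent <- 100 * colSums(counts[is_human, , drop = FALSE]) /
  colSums(counts)

# Design matrix as in the post, with control as the reference level.
condition <- factor(c("ctrl", "ctrl", "kd", "kd"), levels = c("ctrl", "kd"))
design <- model.matrix(~ human_read_percent + condition)

# Only the mouse genes go on to the differential expression step.
mouse_counts <- counts[!is_human, , drop = FALSE]

print(round(human_read_percent, 1))
print(design)
```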
Of course, the samples that have no human contamination still have somewhere between 3-7% of reads mapped to the human transcriptome, thanks to homology. With this model there are no significantly differentially expressed genes, not even the gene that was knocked down.
Is there any hope of improving this somehow, or should I just give up? Thanks for any advice!
This may be an impossible effort due to the homology. One thing to note about your model: contamination will in reality produce *linear* increases in counts/expression with human_read_percent, but all the statistical models work on the log scale, whether modeling log CPM or using a GLM with a log link. So it doesn't make sense to have human_read_percent in the model matrix as a continuous covariate when it's being used to explain increases in log counts.
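A toy illustration of the scale mismatch, using made-up numbers: if contamination adds reads to a gene in proportion to human_read_percent, the count is exactly linear in that covariate, but log2 of the count is not, so a single slope on the log scale cannot absorb the effect. The values of mouse_signal and reads_per_percent are assumptions chosen only to make the point visible.

```r
# Hypothetical contamination levels (percent of reads that are human).
pct <- c(5, 10, 20, 40, 80)

# Assumed additive model on the count scale: a fixed mouse signal plus
# misassigned homologous reads that grow linearly with contamination.
mouse_signal      <- 100
reads_per_percent <- 10
gene_counts <- mouse_signal + reads_per_percent * pct

# On the count scale a straight line fits exactly.
fit_linear <- lm(gene_counts ~ pct)

# On the log scale (what log-CPM models or log-link GLMs use) it does not.
fit_log <- lm(log2(gene_counts) ~ pct)

max_resid_linear <- max(abs(residuals(fit_linear)))
max_resid_log    <- max(abs(residuals(fit_log)))

print(max_resid_linear)  # numerically zero
print(max_resid_log)     # clearly non-zero: systematic misfit on the log scale
```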
"There are no significantly differentially expressed genes after this, not even the gene that was knocked down." What do the TPMs look like for this gene across the samples?
OK, maybe I can drop human_read_percent from the model and see what the results look like then.
The TPM is between 0 and 0.1 for all 12 samples, both for the mouse gene and for its human ortholog. I'm wondering if I have other problems too... BTW, based on BioMart, the KD gene has a one-to-one ortholog with 98% sequence identity between mouse and human.
Yeah, I'd say there are issues upstream of the Bioconductor packages, which you'll have to look into first.