I recently started analysing a mouse RNA-seq dataset, where I had the following data:
- 6 control samples, all of them contaminated with a human cell line
- 6 KD samples with no contamination
After seeing the "experimental setup", my first thought was to just trash it, take a deep breath, and tell the collaborators to redo the experiments/sequencing, as the contamination and the control condition are perfectly confounded.
I nevertheless went ahead with some basic analysis. The pipeline is as follows:
- combine an up-to-date mouse and human transcriptome annotation from GENCODE
- run Salmon using the combined transcriptome index
- tximport to import gene level counts
- calculate read number and % mapping to human transcripts
- do limma/voom for differential expression only on the mouse genes, using:
model.matrix(~ human_read_percent + condition)
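Roughly, the steps above could be sketched in R as follows. This is only an outline under several assumptions: the function name `run_mouse_de` and its arguments (`quant_files`, `tx2gene`, `sample_info`) are hypothetical, human genes are identified by the GENCODE `ENSG` prefix, `countsFromAbundance = "lengthScaledTPM"` and the `filterByExpr` step are choices not stated in the pipeline above, and `sample_info$condition` is assumed to be a factor with the control level first.

```r
# Hypothetical helper wrapping the pipeline; not the exact code that was run.
# quant_files: named vector of Salmon quant.sf paths (combined mouse + human index)
# tx2gene:     data.frame mapping transcript IDs to gene IDs for both species
# sample_info: data.frame with one row per sample and a `condition` factor
run_mouse_de <- function(quant_files, tx2gene, sample_info) {
  # Import Salmon quantifications and summarise to gene level
  txi <- tximport::tximport(quant_files, type = "salmon", tx2gene = tx2gene,
                            countsFromAbundance = "lengthScaledTPM")
  counts <- txi$counts

  # Assumption: human gene IDs start with "ENSG", mouse ones with "ENSMUSG"
  is_human <- grepl("^ENSG", rownames(counts))

  # Percentage of each sample's reads assigned to human genes
  sample_info$human_read_percent <-
    100 * colSums(counts[is_human, , drop = FALSE]) / colSums(counts)

  # Keep only the mouse genes for the differential expression test
  dge <- edgeR::DGEList(counts[!is_human, , drop = FALSE])
  design <- model.matrix(~ human_read_percent + condition, data = sample_info)

  # Assumed extra step: drop lowly expressed genes before voom
  keep <- edgeR::filterByExpr(dge, design)
  dge <- dge[keep, , keep.lib.sizes = FALSE]
  dge <- edgeR::calcNormFactors(dge)

  # limma/voom fit; the last design column is the condition (KD vs control) effect
  v <- limma::voom(dge, design)
  fit <- limma::eBayes(limma::lmFit(v, design))
  limma::topTable(fit, coef = ncol(design), number = Inf)
}
```

Calling `run_mouse_de(quant_files, tx2gene, sample_info)` would return the limma table for the condition coefficient, adjusted for the human read percentage.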
Of course, the samples without human contamination still have somewhere between 3-7% of reads mapped to the human transcriptome, thanks to homology. After this, there are no significantly differentially expressed genes, not even the gene that was knocked down.
Is there any hope of improving this somehow, or should I just give up? Thanks for any advice!
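A minimal sketch of why this result is expected, using made-up percentages rather than the real values: because only the controls are contaminated, `human_read_percent` is almost a proxy for `condition`, so the two columns of the design matrix are close to collinear and the condition effect is estimated with very little precision. The contaminated percentages below are hypothetical; only the 3-7% range for the clean samples comes from the data described above.

```r
# Illustrative data only: 6 contaminated controls, 6 clean KD samples
sample_info <- data.frame(
  condition = factor(rep(c("control", "KD"), each = 6),
                     levels = c("control", "KD")),
  human_read_percent = c(35, 42, 28, 50, 39, 44,  # hypothetical contaminated controls
                         3, 5, 7, 4, 6, 3)        # homology background in clean samples
)

# Same design as in the pipeline
design <- model.matrix(~ human_read_percent + condition, data = sample_info)

# Correlation between the contamination covariate and the KD indicator
r <- cor(design[, "human_read_percent"], design[, "conditionKD"])

# With two predictors, the variance inflation factor is 1 / (1 - r^2);
# large values mean the condition coefficient has a strongly inflated variance
vif <- 1 / (1 - r^2)

print(round(c(correlation = r, vif = vif), 2))
```

The closer the correlation is to -1, the less information is left to separate the knockdown effect from the contamination effect, which fits with even the knocked-down gene not reaching significance.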