Question

Unexpected gene polymorphism using Salmon-tximeta-DESeq2

Ray ▴ 20

@Ray-24558

Hong Kong

We're analyzing RNAseq data with a pipeline consisting of Salmon, tximeta, and DESeq2.

We have a multi-factorial experimental design, and the experiment was performed on cell lines.

One thing that surprised us is that the results contain many apparent gene polymorphisms.

For example, for the gene NLRP2 we observed multiple entries associated with different Ensembl IDs: ENSG00000022556, ENSG00000275082, ENSG00000275843, etc.

[Image: entries of NLRP2 from one particular RNA-seq experiment result]

How should we interpret data like this, and how should we deal with this kind of situation? Can we add or average the different entries associated with the same gene?

tximeta DESeq2 • 1.3k views

3.3 years ago • updated 3.2 years ago • Ray ▴ 20

score 0 · Answer 1 · 2021-01-14

Michael Love 41k

@mikelove

United States

This is a consequence of the transcriptome you used for quantification. I recommend that people working with human data use the GENCODE reference transcripts, because GENCODE does not duplicate genes on haplotype chromosomes (which Ensembl does in its transcript FASTA files). Look at the chromosome for the entries other than the first: they are listed as "Chromosome CHR_HSCHR19...", which is a haplotype of chr19.

Another reason to prefer GENCODE is that it provides a single transcripts file, while for Ensembl you need to combine the cDNA and ncRNA files to produce a transcriptome.
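
To check whether an Ensembl transcript FASTA carries this kind of haplotype duplication, one option is to scan its header lines and look for gene symbols that map to several gene IDs, at least one of them on a sequence whose name starts with "CHR_". The sketch below is only an illustration. It assumes Ensembl-style headers containing `chromosome:<assembly>:<sequence name>:...`, `gene:` and `gene_symbol:` fields. The scaffold name `CHR_HSCHR19_EXAMPLE_CTG1`, the coordinates, and the transcript IDs in the example are hypothetical placeholders, and the function names are made up for this example.

```python
import re
from collections import defaultdict

# Assumed Ensembl-style header fields (an assumption, not taken from the thread):
# >ENST... cdna chromosome:GRCh38:<seq_name>:<start>:<end>:<strand> gene:ENSG... gene_symbol:NAME
LOCATION = re.compile(r"\b(?:chromosome|scaffold):[^:\s]+:([^:\s]+):")
GENE_ID = re.compile(r"\bgene:(\S+)")
SYMBOL = re.compile(r"\bgene_symbol:(\S+)")


def genes_by_symbol(headers):
    """Map gene symbol -> {gene ID without version: sequence name}."""
    table = defaultdict(dict)
    for line in headers:
        if not line.startswith(">"):
            continue  # skip sequence lines
        loc = LOCATION.search(line)
        gene = GENE_ID.search(line)
        symbol = SYMBOL.search(line)
        if loc and gene and symbol:
            gene_id = gene.group(1).split(".")[0]  # drop the version suffix
            table[symbol.group(1)][gene_id] = loc.group(1)
    return dict(table)


def duplicated_on_haplotypes(table):
    """Keep symbols with several gene IDs, at least one on a CHR_ sequence."""
    return {
        symbol: ids
        for symbol, ids in table.items()
        if len(ids) > 1 and any(name.startswith("CHR_") for name in ids.values())
    }


# Hypothetical headers: the gene IDs come from the question, everything else is a placeholder.
EXAMPLE_HEADERS = [
    ">ENST_EXAMPLE_1.1 cdna chromosome:GRCh38:19:100:200:1 "
    "gene:ENSG00000022556.1 gene_biotype:protein_coding gene_symbol:NLRP2",
    ">ENST_EXAMPLE_2.1 cdna chromosome:GRCh38:CHR_HSCHR19_EXAMPLE_CTG1:100:200:1 "
    "gene:ENSG00000275082.1 gene_biotype:protein_coding gene_symbol:NLRP2",
    ">ENST_EXAMPLE_3.1 cdna chromosome:GRCh38:7:100:200:-1 "
    "gene:ENSG_EXAMPLE_ACTB.1 gene_biotype:protein_coding gene_symbol:ACTB",
]

if __name__ == "__main__":
    for symbol, ids in duplicated_on_haplotypes(genes_by_symbol(EXAMPLE_HEADERS)).items():
        print(symbol, ids)
```

On a real file you would pass the lines of the (decompressed) FASTA instead of `EXAMPLE_HEADERS`. Any symbol reported this way is a candidate for the duplication described above, not proof of it.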

3.3 years ago • Michael Love 41k

Thanks so much for the clarification Michael.

I was indeed confused by the alternative scaffolds included in the Ensembl genome.

Now that you've mentioned it, I will rebuild the Salmon index with the GENCODE reference transcriptome.

3.2 years ago • Ray ▴ 20

Oh, and a further recommendation: when you build the Salmon index, specify --gencode, which will clean up the transcript names in the Salmon output.
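
For readers wondering what "clean" means here: as far as I understand the Salmon documentation, --gencode makes the indexer split each transcript name at the first '|' character, so only the transcript ID remains in the output. The Python sketch below just mimics that effect on a single header. It is not Salmon's code, the function name is made up, and the header is an illustrative GENCODE-style example.

```python
def strip_gencode_name(header: str) -> str:
    """Return the part of a FASTA header before the first '|' (without '>')."""
    return header.lstrip(">").split("|", 1)[0]


# Illustrative GENCODE-style header (an assumption about the file layout).
header = (
    ">ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
    "OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"
)

if __name__ == "__main__":
    print(strip_gencode_name(header))  # ENST00000456328.2
```

Without the flag, the full pipe-delimited string would end up as the transcript name in quant.sf, which makes it harder to match against a GTF or a transcript-to-gene table.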

3.2 years ago • Michael Love 41k

Thank you!!

Indeed I included the --gencode flag by following a tutorial from here https://biocorecrg.github.io/RNAseq_course_2019/salmon.html :)

Right now I'm trying to extract some extra information (i.e. gene symbol, description, etc.) from the rowRanges slot. When I was using the Ensembl reference, this was automatically added to the SummarizedExperiment object from AnnotationHub, but with the GENCODE reference this information is missing.

I've tried the makeLinkedTxome() function to link a local GENCODE GTF file, but it didn't seem to work.

Now I'm reading this vignette https://biodatascience.github.io/compbio/bioc/SE.html to see if I can add this information back directly from the GENCODE GTF file. Any suggestions?

3.2 years ago • Ray ▴ 20

Have you tried addIds from the tximeta package?

3.2 years ago • Michael Love 41k

Just tried addIds and it worked, thanks a lot Michael!

3.2 years ago • Ray ▴ 20