Question

sumarizeToGene is dropping transcripts, how to debug

0

Entering edit mode

aedavids ▴ 20

@aa611017

Last seen 3.8 years ago

United States

I have been using DESeq2 for a while. works great. I am running into trouble with a custom reference.

I am using DESeq2 to normalize transcript counts. I noticed that about half of the transcripts are getting dropped. Using print statements and the debugger I think there maybe be a problem with my tx2Gene data frame. I think it matches the salmon quant.sf files. using debug(), browse(), trace() I am able to step into tximport(). I am not able to step into sumarizeToGene() to figure out where the match goes wrong.

I tried playing around with the ignoreAfterBar and txOut arguments to tximport() v 1.17.3

When I run salmon I am using a custom reference. The counts have basically to formats

first format example

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|   632 386.000 0.000000    0.000

second format example. This is the custom part of our reference.

hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5'pad=0_3'pad=0_strand=-_repeatMasking=none   23915   23653.000   0.000000    0.000

my tx2Gene file has entries that match both formats. It may not be easy to see but this file has 2 columns, txid, gene. The txid is really long. for ENST transcript we use the entire salmon quant.sf 'name' column. I suspect that summarizeToGene make some sort of assumption about the txid that I am violating.

grep ENST00000456328 gencode.v35.ucsc.rmsk.tx.to.gene.csv 
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|,DDX11L1

$ grep  hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5\'pad=0_3\'pad=0_strand=-_repeatMasking=none gencode.v35.ucsc.rmsk.tx.to.gene.csv 
hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5'pad=0_3'pad=0_strand=-_repeatMasking=none,ALR/Alpha

for unknown reasons all the transcripts that start with ENST are dropped.

attributes(txi$abundance)[2]$dimnames[[1]]

values are the gene name column, gene names like DDX11L1 are missing. That is to say, transcripts that started with ENST did not match

I have run tximport() on the same salmon quant.sf file using a tx2gene file that only supported the ENST like entries. If I set ignoreAfterBar=TRUE, it works as expected.

Any idea how I can debug this?

Kind regards

Andy

txi <- tximport(allSalmonQuantFiles, 
                  type="salmon", 
                  tx2gene=transcriptMapperDF, 
                  ignoreAfterBar=ignoreAfterBar,
                  txOut=FALSE)

sumarizeToGene tximport • 1.5k views

ADD COMMENT • link updated 5.0 years ago by Michael Love 43k • written 5.1 years ago by aedavids ▴ 20

score 0 · Answer 1 · 2020-12-08

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 26 minutes ago

United States

With tx2gene, you are summarizing transcripts to gene level, so there should not be ENST in the output. I'm confused by what you mean that transcripts are missing, because they should be missing (txOut=FALSE).

ADD COMMENT • link 5.1 years ago Michael Love 43k

0

Entering edit mode

Hi Michael

I did. not explain things as well as I might have. I do get gene names in my normalized csv output file. All the gene for transcripts that begin with 'ENST" are missing

here is an example of a dropped transcript, ie a missing gene in the output

$ head  pancreas.plasma.ev.long.RNA/data/PDAC/SRR10080545/salmon.out/quant.sf
Name    Length  EffectiveLength TPM NumReads
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|    1657    1284.505    7.992595    149.828

I am able to find the salmon 'name' in the file I use for tx2gene

$ grep 'ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|' gencode.v35.ucsc.rmsk.tx.to.gene.csv 
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|,DDX11L1

given the TMP is 149.828 I expected to see an entry in the normalized output file for gene DDX11L1

If I look at the normalized output file all of the genes correspond to the custom entries in our reference. That is to say the transcripts in the salmon file that matched what I called the "second formate", "(AAAG)n" is one of our custom gene names

$ head pancreas.plasma.ev.long.RNA.normalized.deseq.gene.counts.csv
"","healthy_SRR10080507","healthy_SRR10080508","PDAC_SRR10080544","PDAC_SRR10080545"
"(AAAG)n",39.5866270935895,88.1897690237628,47.0826968945544,1.14471999278944

I really appreciate any suggestion you have for debugging

Andy

ADD REPLY • link 5.1 years ago aedavids ▴ 20

0

Entering edit mode

You should check two things:

1) is this character string in the first column of tx2gene?

or

ENST00000456328.2

...if you are using ignoreAfterBar=TRUE

2) if so, what is the string in the second column?

ADD REPLY • link 5.1 years ago Michael Love 43k

0

Entering edit mode

Hi Michael

many thanks. Turns out you were correct there was a parsing bug in the code that created tx2gene. It was hard to find in RStudio it was displaying truncated lines

Happy new year

Andy

ADD REPLY • link 5.0 years ago aedavids ▴ 20

0

Entering edit mode

Thanks for reporting back!

Happy new year :)

ADD REPLY • link 5.0 years ago Michael Love 43k