sumarizeToGene is dropping transcripts, how to debug
1
0
Entering edit mode
aedavids ▴ 20
@aa611017
Last seen 4 months ago
United States

I have been using DESeq2 for a while. works great. I am running into trouble with a custom reference.

I am using DESeq2 to normalize transcript counts. I noticed that about half of the transcripts are getting dropped. Using print statements and the debugger I think there maybe be a problem with my tx2Gene data frame. I think it matches the salmon quant.sf files. using debug(), browse(), trace() I am able to step into tximport(). I am not able to step into sumarizeToGene() to figure out where the match goes wrong.

I tried playing around with the ignoreAfterBar and txOut arguments to tximport() v 1.17.3

When I run salmon I am using a custom reference. The counts have basically to formats

first format example

ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene|   632 386.000 0.000000    0.000


second format example. This is the custom part of our reference.

hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5'pad=0_3'pad=0_strand=-_repeatMasking=none   23915   23653.000   0.000000    0.000


my tx2Gene file has entries that match both formats. It may not be easy to see but this file has 2 columns, txid, gene. The txid is really long. for ENST transcript we use the entire salmon quant.sf 'name' column. I suspect that summarizeToGene make some sort of assumption about the txid that I am violating.

grep ENST00000456328 gencode.v35.ucsc.rmsk.tx.to.gene.csv
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|,DDX11L1

$grep hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5\'pad=0_3\'pad=0_strand=-_repeatMasking=none gencode.v35.ucsc.rmsk.tx.to.gene.csv hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5'pad=0_3'pad=0_strand=-_repeatMasking=none,ALR/Alpha  for unknown reasons all the transcripts that start with ENST are dropped. attributes(txi$abundance)[2]$dimnames[[1]]  values are the gene name column, gene names like DDX11L1 are missing. That is to say, transcripts that started with ENST did not match I have run tximport() on the same salmon quant.sf file using a tx2gene file that only supported the ENST like entries. If I set ignoreAfterBar=TRUE, it works as expected. Any idea how I can debug this? Kind regards Andy txi <- tximport(allSalmonQuantFiles, type="salmon", tx2gene=transcriptMapperDF, ignoreAfterBar=ignoreAfterBar, txOut=FALSE)  sumarizeToGene tximport • 505 views ADD COMMENT 0 Entering edit mode @mikelove Last seen 11 hours ago United States With tx2gene, you are summarizing transcripts to gene level, so there should not be ENST in the output. I'm confused by what you mean that transcripts are missing, because they should be missing (txOut=FALSE). ADD COMMENT 0 Entering edit mode Hi Michael I did. not explain things as well as I might have. I do get gene names in my normalized csv output file. All the gene for transcripts that begin with 'ENST" are missing here is an example of a dropped transcript, ie a missing gene in the output $ head  pancreas.plasma.ev.long.RNA/data/PDAC/SRR10080545/salmon.out/quant.sf
Name    Length  EffectiveLength TPM NumReads
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|    1657    1284.505    7.992595    149.828


I am able to find the salmon 'name' in the file I use for tx2gene

$grep 'ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|' gencode.v35.ucsc.rmsk.tx.to.gene.csv ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|,DDX11L1  given the TMP is 149.828 I expected to see an entry in the normalized output file for gene DDX11L1 If I look at the normalized output file all of the genes correspond to the custom entries in our reference. That is to say the transcripts in the salmon file that matched what I called the "second formate", "(AAAG)n" is one of our custom gene names $ head pancreas.plasma.ev.long.RNA.normalized.deseq.gene.counts.csv
"","healthy_SRR10080507","healthy_SRR10080508","PDAC_SRR10080544","PDAC_SRR10080545"
"(AAAG)n",39.5866270935895,88.1897690237628,47.0826968945544,1.14471999278944


I really appreciate any suggestion you have for debugging

Andy

ADD REPLY
0
Entering edit mode

You should check two things:

1) is this character string in the first column of tx2gene?

ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|

or

ENST00000456328.2

...if you are using ignoreAfterBar=TRUE

2) if so, what is the string in the second column?

ADD REPLY
0
Entering edit mode

Hi Michael

many thanks. Turns out you were correct there was a parsing bug in the code that created tx2gene. It was hard to find in RStudio it was displaying truncated lines

Happy new year

Andy

ADD REPLY
0
Entering edit mode

Thanks for reporting back!

Happy new year :)

ADD REPLY

Login before adding your answer.

Traffic: 487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6