I have been using DESeq2 for a while. works great. I am running into trouble with a custom reference.
I am using DESeq2 to normalize transcript counts. I noticed that about half of the transcripts are getting dropped. Using print statements and the debugger I think there maybe be a problem with my tx2Gene data frame. I think it matches the salmon quant.sf files. using debug(), browse(), trace() I am able to step into tximport(). I am not able to step into sumarizeToGene() to figure out where the match goes wrong.
I tried playing around with the ignoreAfterBar and txOut arguments to tximport() v 1.17.3
When I run salmon I am using a custom reference. The counts have basically to formats
first format example
ENST00000450305.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000002844.2|DDX11L1-201|DDX11L1|632|transcribed_unprocessed_pseudogene| 632 386.000 0.000000 0.000
second format example. This is the custom part of our reference.
hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5'pad=0_3'pad=0_strand=-_repeatMasking=none 23915 23653.000 0.000000 0.000
my tx2Gene file has entries that match both formats. It may not be easy to see but this file has 2 columns, txid, gene. The txid is really long. for ENST transcript we use the entire salmon quant.sf 'name' column. I suspect that summarizeToGene make some sort of assumption about the txid that I am violating.
grep ENST00000456328 gencode.v35.ucsc.rmsk.tx.to.gene.csv
ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|,DDX11L1
$ grep hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5\'pad=0_3\'pad=0_strand=-_repeatMasking=none gencode.v35.ucsc.rmsk.tx.to.gene.csv
hg38_rmsk_ALR/Alpha_range=chr22_KI270739v1_random:50071-73985_5'pad=0_3'pad=0_strand=-_repeatMasking=none,ALR/Alpha
for unknown reasons all the transcripts that start with ENST are dropped.
attributes(txi$abundance)[2]$dimnames[[1]]
values are the gene name column, gene names like DDX11L1 are missing. That is to say, transcripts that started with ENST did not match
I have run tximport() on the same salmon quant.sf file using a tx2gene file that only supported the ENST like entries. If I set ignoreAfterBar=TRUE, it works as expected.
Any idea how I can debug this?
Kind regards
Andy
txi <- tximport(allSalmonQuantFiles,
type="salmon",
tx2gene=transcriptMapperDF,
ignoreAfterBar=ignoreAfterBar,
txOut=FALSE)
Hi Michael
I did. not explain things as well as I might have. I do get gene names in my normalized csv output file. All the gene for transcripts that begin with 'ENST" are missing
here is an example of a dropped transcript, ie a missing gene in the output
I am able to find the salmon 'name' in the file I use for tx2gene
given the TMP is 149.828 I expected to see an entry in the normalized output file for gene DDX11L1
If I look at the normalized output file all of the genes correspond to the custom entries in our reference. That is to say the transcripts in the salmon file that matched what I called the "second formate", "(AAAG)n" is one of our custom gene names
I really appreciate any suggestion you have for debugging
Andy
You should check two things:
1) is this character string in the first column of
tx2gene
?ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|
or
ENST00000456328.2
...if you are using
ignoreAfterBar=TRUE
2) if so, what is the string in the second column?
Hi Michael
many thanks. Turns out you were correct there was a parsing bug in the code that created tx2gene. It was hard to find in RStudio it was displaying truncated lines
Happy new year
Andy
Thanks for reporting back!
Happy new year :)