Salmon/DESeq2: tx2gene discrepancy in DESeq object
1
0
Entering edit mode
gmchaput • 0
@gmchaput-22280
Last seen 19 months ago
United States

I have been working with a RNAseq dataset and wanted to change the names of my GENEID during tximport so I could match it with a dataframe in a downstream analysis. However, if I change the GENEID all of a sudden DESeq2 says I have 30200 genes rather than 25900. I thought tximport only is matching the transcript IDs between the gtf file and the quant.sf file?

Just to explain the code below:

  • I get my gff3 file from JGI Phytozome database for Panicum hallii var. hallii v2.1 (Hall's panicgrass) and convert it to a gtf file
  • The transcript ID does not match my quant.sf file so I remove the ".v2.1" in the TXNAME
  • The discrepancy only occurs when I change the GENEID and nothing else

Here is the code:

TxDb = makeTxDbFromGFF("PhalliiHAL_annot.gtf", format = "gtf")
k <- keys(TxDb, keytype = "GENEID")
df <- ensembldb::select(TxDb, keys = k, keytype = "GENEID", columns = "TXNAME")
tx2gene <- df[, 2:1]
head(tx2gene) #should have 2 columns with transcript ID as the first column
#because the ".v2.1" won't match with the other file, I need to remove these for both gene and transcript
tx2gene$TXNAME = as.character(tx2gene$TXNAME) #make sure it's a character vector
TXNAME = as.vector(tx2gene$TXNAME)
GENEID = as.vector(tx2gene$GENEID)
n <- 3 # keep first 3 fields only
tx2gene$TXNAME = do.call(paste, c(read.table(text = TXNAME, sep = ".")[1:n], sep = "."))
tx2gene$GENEID = do.call(paste, c(read.table(text = TXNAME, sep = ".")[1:n], sep = "."))
samples = read.table("samples.csv", header =TRUE, sep = ",")
files = file.path("PhalliiHAL_quants", samples$sample, "quant.sf") 
names(files) <- samples$sample[1:42]
all(file.exists(files))txi = tximport(files, type ="salmon", tx2gene = tx2gene)
deseq2 tximport salmon • 1.1k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 day ago
United States

Yes, tximport is matching the transcript IDs in the quantification files with column 1 of tx2gene.

To debug what's going on, I'd recommend to first import data with tximport using the table that is created from the GFF (with only modifications necessary to match the transcript IDs, e.g. stripping version numbers in the GFF which are not present in FASTA).

Then as a second step, you could rename the gene IDs.

Separating into two steps will help make it easier for you and others to know what's happening at each step.

ADD COMMENT

Login before adding your answer.

Traffic: 396 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6