Question

tximport gene name

1

tanyabioinfo ▴ 20

@tanyabioinfo-14091

Hi, I am doing the following to get the tximport count matrix with gene names as the row names:

library(tximport)
library(ensembldb)
library(EnsDb.Mmusculus.v79)
txdf <- transcripts(EnsDb.Mmusculus.v79, return.type = "DataFrame")
txdf$symbol <- mapIds(EnsDb.Mmusculus.v79, txdf$gene_id, "GENENAME", "GENEID")
tx2gene <- as.data.frame(txdf[, c("tx_id", "symbol")])

txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE, dropInfReps = TRUE)

However, when I run head(txi$abundance), I get:

                  0 wt      0 wt     0 wt     0 wt     6 wt     6 wt     6 wt
              71.50353 112.29713 73.64570 73.13216 60.17879 56.01880 57.25439
0610007P14Rik  0.00000  16.73136 69.46050 60.45882 86.66511 27.10330 48.84700
0610009B22Rik  0.00000  16.34480 29.00857 26.11050  0.00000 18.28440 25.29169

I am getting an extra row at the top. Can someone help me rectify this, or let me know if I am doing anything wrong?

Tanya

tximport • 1.9k views

ADD COMMENT • link updated 6.5 years ago by Ed Siefker ▴ 230 • written 6.5 years ago by tanyabioinfo ▴ 20

score 0 · Answer 1 · 2017-10-26

0

Michael Love 41k

@mikelove

United States

I believe that’s not a row of data, it’s the column names. Check what is in position [1,1] if you want to see the data values alone.
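One way to tell a header line from a data row is to inspect the values and the row names directly. The following is a minimal sketch on a toy matrix: `abund` is a hypothetical stand-in for txi$abundance, built from a few of the values printed in the question, and it assumes the first row really does have an empty row name.

```r
# Toy stand-in for txi$abundance (hypothetical object, two samples only).
# matrix() fills column by column, so each line below is one sample.
abund <- matrix(
  c(71.50353,  0.00000,  0.00000,
    112.29713, 16.73136, 16.34480),
  nrow = 3,
  dimnames = list(c("", "0610007P14Rik", "0610009B22Rik"),
                  c("0 wt", "0 wt"))
)

# The value in position [1, 1]: a number here means row 1 is data, not a header.
abund[1, 1]

# Indices of rows whose name is the empty string.
empty_rows <- which(rownames(abund) == "")
empty_rows
```

If `empty_rows` is non-empty on the real object, the unnamed line is a genuine data row rather than the column header.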

ADD COMMENT • link 6.5 years ago Michael Love 41k

0

I believe the "extra row" she means is the one under the column names.
71.50353 112.29713 73.64570 73.13216 60.17879 56.01880 57.25439

I did the same analysis with the same annotation this week and also have a row with no row name. I was worried there might be an off-by-one error somewhere, but my results look similar to those from another tool, so that's probably not the case. Just one tx_id that doesn't have a corresponding symbol, I guess. Or could it be many tx_ids that are collapsed to the symbol "" during summarization?
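The collapsing hypothesis can be illustrated with a small base R sketch. This is not tximport's own code; it only shows how a per-gene sum behaves when several transcripts share the empty string as their gene label. All transcript IDs, symbols, and values below are made up.

```r
# Hypothetical transcript-level abundances for a single sample.
tx_abund <- c(tx1 = 10, tx2 = 20, tx3 = 5, tx4 = 7)

# Hypothetical tx2gene table: tx1 and tx2 have no symbol.
tx2gene <- data.frame(
  tx_id  = c("tx1", "tx2", "tx3", "tx4"),
  symbol = c("", "", "GeneA", "GeneA"),
  stringsAsFactors = FALSE
)

# Look up the gene label for each transcript, then sum per label.
gene_label <- tx2gene$symbol[match(names(tx_abund), tx2gene$tx_id)]
gene_abund <- rowsum(tx_abund, group = gene_label)
gene_abund
```

In this toy case tx1 and tx2 end up in a single row whose name is "", which is the pattern seen in the output above.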

ADD REPLY • link 6.5 years ago Ed Siefker ▴ 230

1

Re “many tx_ids that are collapsed to the symbol ""...”: if so, you can go looking in your tx2gene.

ADD REPLY • link 6.5 years ago Michael Love 41k

score 0 · Answer 2 · 2017-10-31

Good point Michael. They weren't hard to find.

> tx2gene <- transcripts(EnsDb.Mmusculus.v79, columns=c("gene_name"), return.type="data.frame")[c(2,1)]
> head(tx2gene,n=12)
                tx_id     gene_name
1  ENSMUST00000077235
2  ENSMUST00000179505
3  ENSMUST00000178343
4  ENSMUST00000187028
5  ENSMUST00000186475
6  ENSMUST00000161472
7  ENSMUST00000182513
8  ENSMUST00000130094 0610005C13Rik
9  ENSMUST00000145208 0610005C13Rik
10 ENSMUST00000133678 0610005C13Rik
11 ENSMUST00000123549 0610005C13Rik
12 ENSMUST00000132138 0610005C13Rik
> which(tx2gene$gene_name =="")
[1] 1 2 3 4 5 6 7

tx2gene[1,] is DHRSX. Both of tx2gene[2:3,] are AC149090. tx2gene[4,] is Zfp383 and so on.

So, we are collapsing multiple tx_ids to "". That is unsurprising, since Ensembl is on v90 and we're using v79 annotations. I haven't tested it, but I'd imagine building a current EnsDb with ensembldb, as documented at http://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#1021_directly_from_ensembl_databases, would fix this.
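A possible stopgap that avoids rebuilding the annotation is to fall back to the gene ID wherever the symbol is empty, so that unrelated transcripts no longer share the label "". The sketch below is untested against the real EnsDb and runs on a made-up table; `tx2gene_full` and its IDs are hypothetical, and it assumes a `gene_id` column was kept alongside `gene_name`.

```r
# Hypothetical mapping that keeps both the gene ID and the symbol.
tx2gene_full <- data.frame(
  tx_id     = c("txA", "txB", "txC", "txD"),
  gene_id   = c("gene1", "gene2", "gene3", "gene3"),
  gene_name = c("", "", "SymbolX", "SymbolX"),
  stringsAsFactors = FALSE
)

# Flag transcripts with a missing or empty symbol.
missing <- is.na(tx2gene_full$gene_name) | tx2gene_full$gene_name == ""

# Use the gene ID as the label for those transcripts only.
tx2gene_full$gene_name[missing] <- tx2gene_full$gene_id[missing]

# Two-column table in the transcript-then-gene order that tximport expects.
tx2gene <- tx2gene_full[, c("tx_id", "gene_name")]
tx2gene
```

With this table, txA and txB would be summarized under their own gene IDs instead of being merged into one unnamed row.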