Question

tx2gene for xenopus LAEVIS

0

Entering edit mode

Daria • 0

@daria-21676

Last seen 5.7 years ago

I am working with xenopus LAEVIS. For alignment I used mRNA sequence from genebank. Is it possible to do tx2gene for this organism? it is absent here ftp://ftp.ebi.ac.uk/pub/databases/gencode/ and even here https://bioconductor.org/packages/release/bioc/manuals/AnnotationDbi/man/AnnotationDbi.pdf .

The problem is that in reference mRNA transcript names look like >gi|163915948|gb|BC157470| Xenopus laevis surfactant protein C, mRNA (cDNA clone MGC:180004 IMAGE:4056930), complete cds ; but at .sf files name of the transcript comes as gi|163915948|gb|BC157470| and it is hard to analyze the results, putting aside that it is impossible to attribute all the transcripts to each particular gene.

deseq2 • 1.7k views

ADD COMMENT • link updated 5.7 years ago by James W. MacDonald 68k • written 5.7 years ago by Daria • 0

James W. MacDonald · Answer 1 · 2019-08-20

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 10 hours ago

United States

Non-model organisms can be tough. You do apparently have GenBank IDs, so hypothetically you could map to Gene ID using an OrgDb package.

> library(AnnotationHub)
> hub <- AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2018-10-24
> query(hub, c("laevis","orgdb"))
AnnotationHub with 4 records
# snapshotDate(): 2018-10-24 
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Cynoglossus (Arelia) semilaevis, Cynoglossus semilaevis, Xenopus...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH66162"]]' 

            title                                        
  AH66162 | org.Xl.eg.db.sqlite                          
  AH66542 | org.Xenopus_laevis_tropicalis.eg.sqlite      
  AH66623 | org.Cynoglossus_(Arelia)_semilaevis.eg.sqlite
  AH66624 | org.Cynoglossus_semilaevis.eg.sqlite         
> z <- hub[["AH66162"]]

> z
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: XENOPUS_DB
| ORGANISM: Xenopus laevis
| SPECIES: Xenopus
| EGSOURCEDATE: 2018-Oct11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 8355
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 2018-Oct10
| GOEGSOURCEDATE: 2018-Oct11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15

Please see: help('select') for usage information

> select(z, "BC157470", "ENTREZID", "ACCNUM")
'select()' returned 1:1 mapping between keys and columns
    ACCNUM ENTREZID
1 BC157470   779071
>

You would need to do some parsing of your reference mRNA transcript names. I tend to use something like

nam <- sapply(strsplit(<transcript names go here>, "|"), "[", 4)

But there are any number of ways to accomplish that task. You might consider using mapIds rather than select. See the help pages for those functions to see the differences.

ADD COMMENT • link 5.7 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks James.

A note to dshk:

I don't have any specific solutions, but I'll just say you're somewhat "on your own" to generate a table of transcripts IDs to gene IDs for tximport summarization.

I strongly recommend to obtain the sequence (FASTA) and transcript to gene mapping (GTF) from the same source if possible.

ADD REPLY • link 5.7 years ago Michael Love 43k

0

Entering edit mode

Thank you very much for answer! I will try to perform it and get back.

ADD REPLY • link 5.7 years ago Daria • 0

0

Entering edit mode

If I understand correctly tximport allows to import all .sf files from the experiment together and connect transcript IDs to gene IDs and join the counts for the same gene that come from different transcripts. To connect transcript IDs to geneIDs there should be tx2gene data frame that contains that connections.

This tx2gene should be made up of two columns: 1) transcript ID 2) gene ID.

Transcript ID should be the same as in abundance (.sf) files. In this case transcript ID is GeneBank number (eg.BC157470). tx2gene can be created from OrgDb object=z (that you created for me) and select function. How is it possible to connect all of GeneBank numbers to Gene ID with select function without putting the particular key corresponding to the gene (like BC157470) every time. Should i next do as it is shown in the tximport vignette, but instead of txdb use z?

k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
and then apply it to .sf files? 
library(tximport)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
head(txi$counts)

The command

nam <- sapply(strsplit(<transcript names go here>, "|"), "[", 4)

should be applied to mRNA.fasta that I used for salmon mapping or for quant.sf files? I am sorry for don't understanding what should be instead of <transcript names="" go="" here="">. This should be done before creating tx2gene data frame?

ADD REPLY • link updated 5.7 years ago by James W. MacDonald 68k • written 5.7 years ago by Daria • 0

0

Entering edit mode

Please try to format your posts reasonably. I had to go through and fix the blob of text you wrote in order to be able to see what you are asking.

There are ways of importing the FASTA names into R, but probably the easiest is to use the Biostrings package:

fastas <- readDNAStringSet(<FASTA file name goes here>)

In which case the transcript names can be extracted by

names(fastas)

And you can parse as I showed. Do note that the transcripts column has to be the same as the first column in the quant.sf file, which will be similar to

gi|163915948|gb|BC157470| Xenopus laevis surfactant protein C, mRNA (cDNA clone MGC:180004 IMAGE:4056930), 
complete cds

So the columns of your tx2gene should have the big long FASTA names in the first column and the Entrez Gene ID in the second column. A (probably good) alternative would be to

## replace FASTA names with just GenBank ID
names(fastas) <- sapply(strsplit(names(fastas), "|"), "[", 4)
## write out FASTA with truncated names
writeXStringSet(fastas, <reasonable filename goes here>)

And then align using the 'fixed' FASTA file, so you don't have to deal with those uber-long names, and you can just have a two-column data.frame with GenBank IDs in the first column and Gene IDs in the second.

ADD REPLY • link 5.7 years ago James W. MacDonald 68k

0

Entering edit mode

Thank you, James!

ADD REPLY • link 5.7 years ago Daria • 0

0

Entering edit mode

## replace FASTA names with just GenBank ID
names(fastas) <- sapply(strsplit(names(fastas), "|"), "[", 4)
## write out FASTA with truncated names
writeXStringSet(fastas, <reasonable filename goes here>)

Strangely, this code changes all FASTA names to "1".

ADD REPLY • link 5.7 years ago Daria • 0

0

Entering edit mode

And could you please clarify, if in principle it is possible for R to change FASTA names to GenBank IDs in .fasta files, why the same operation can't be made for the column "name" in .sf files and the alignment should be re-done?

ADD REPLY • link 5.7 years ago Daria • 0

0

Entering edit mode

It could. But why are you asking me? It's your analysis and in the end you are responsible for doing it.

ADD REPLY • link 5.7 years ago James W. MacDonald 68k

0

Entering edit mode

If I succeed to rename FASATA and re-do the alignment and get .sf files with normal names - GenBank IDs, how then I associate the data with OrgDb object?

Would it come like this?

k <- keys(z, keytype = "ACCNUM")
tx2gene <- select(z, k, "ENTREZID", "ACCNUM")

and then

txi <- tximport(<files .sf>, type="salmon", tx2gene=tx2gene)

ADD REPLY • link 5.7 years ago Daria • 0

0

Entering edit mode

This is the point at which I inevitably ask why you are doing this analysis yourself instead of finding someone local to do it for you? You are stumbling on one of the easiest steps in the pipeline, so that doesn't bode well for whatever you end up doing with these data.

You could hypothetically re-wire your house, but most people get an electrician for that. You could also re-plumb the house, or rebuild the engine in your car, or do your own surgery when you need it. But nobody does that. They hire the expertise and let the expert handle it. Which is what experts are for.

You can certainly persist in analyzing the data yourself, but do note that this is by far the easiest part! Getting R to do your bidding is only part of the equation. You also need to have some familiarity with statistics in order to know if what you did is defensible or not. Presumably you will be presenting these results to someone else, and will need to explain what you did, why you did it, and what the pros and cons are. Can you do that? If not, I would recommend finding someone local who can and have them do the analysis for you.

ADD REPLY • link 5.7 years ago James W. MacDonald 68k