Question: tx2gene for xenopus LAEVIS
0
29 days ago by
Daria0
Daria0 wrote:

I am working with xenopus LAEVIS. For alignment I used mRNA sequence from genebank. Is it possible to do tx2gene for this organism? it is absent here ftp://ftp.ebi.ac.uk/pub/databases/gencode/ and even here https://bioconductor.org/packages/release/bioc/manuals/AnnotationDbi/man/AnnotationDbi.pdf .

The problem is that in reference mRNA transcript names look like >gi|163915948|gb|BC157470| Xenopus laevis surfactant protein C, mRNA (cDNA clone MGC:180004 IMAGE:4056930), complete cds ; but at .sf files name of the transcript comes as gi|163915948|gb|BC157470| and it is hard to analyze the results, putting aside that it is impossible to attribute all the transcripts to each particular gene.

deseq2 • 109 views
modified 28 days ago by James W. MacDonald51k • written 29 days ago by Daria0
0
28 days ago by
United States
James W. MacDonald51k wrote:

Non-model organisms can be tough. You do apparently have GenBank IDs, so hypothetically you could map to Gene ID using an OrgDb package.

> library(AnnotationHub)
> hub <- AnnotationHub()
|======================================================================| 100%

snapshotDate(): 2018-10-24
> query(hub, c("laevis","orgdb"))
AnnotationHub with 4 records
# snapshotDate(): 2018-10-24
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ #$species: Cynoglossus (Arelia) semilaevis, Cynoglossus semilaevis, Xenopus...
# $rdataclass: OrgDb # additional mcols(): taxonomyid, genome, description, # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, # rdatapath, sourceurl, sourcetype # retrieve records with, e.g., 'object[["AH66162"]]' title AH66162 | org.Xl.eg.db.sqlite AH66542 | org.Xenopus_laevis_tropicalis.eg.sqlite AH66623 | org.Cynoglossus_(Arelia)_semilaevis.eg.sqlite AH66624 | org.Cynoglossus_semilaevis.eg.sqlite > z <- hub[["AH66162"]] > z OrgDb object: | DBSCHEMAVERSION: 2.1 | Db type: OrgDb | Supporting package: AnnotationDbi | DBSCHEMA: XENOPUS_DB | ORGANISM: Xenopus laevis | SPECIES: Xenopus | EGSOURCEDATE: 2018-Oct11 | EGSOURCENAME: Entrez Gene | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA | CENTRALID: EG | TAXID: 8355 | GOSOURCENAME: Gene Ontology | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/ | GOSOURCEDATE: 2018-Oct10 | GOEGSOURCEDATE: 2018-Oct11 | GOEGSOURCENAME: Entrez Gene | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA | KEGGSOURCENAME: KEGG GENOME | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes | KEGGSOURCEDATE: 2011-Mar15 Please see: help('select') for usage information > select(z, "BC157470", "ENTREZID", "ACCNUM") 'select()' returned 1:1 mapping between keys and columns ACCNUM ENTREZID 1 BC157470 779071 >  You would need to do some parsing of your reference mRNA transcript names. I tend to use something like nam <- sapply(strsplit(<transcript names go here>, "|"), "[", 4)  But there are any number of ways to accomplish that task. You might consider using mapIds rather than select. See the help pages for those functions to see the differences. ADD COMMENTlink written 28 days ago by James W. MacDonald51k Thanks James. A note to dshk: I don't have any specific solutions, but I'll just say you're somewhat "on your own" to generate a table of transcripts IDs to gene IDs for tximport summarization. I strongly recommend to obtain the sequence (FASTA) and transcript to gene mapping (GTF) from the same source if possible. ADD REPLYlink written 28 days ago by Michael Love25k Thank you very much for answer! I will try to perform it and get back. ADD REPLYlink written 28 days ago by Daria0 If I understand correctly tximport allows to import all .sf files from the experiment together and connect transcript IDs to gene IDs and join the counts for the same gene that come from different transcripts. To connect transcript IDs to geneIDs there should be tx2gene data frame that contains that connections. This tx2gene should be made up of two columns: 1) transcript ID 2) gene ID. Transcript ID should be the same as in abundance (.sf) files. In this case transcript ID is GeneBank number (eg.BC157470). tx2gene can be created from OrgDb object=z (that you created for me) and select function. How is it possible to connect all of GeneBank numbers to Gene ID with select function without putting the particular key corresponding to the gene (like BC157470) every time. Should i next do as it is shown in the tximport vignette, but instead of txdb use z? k <- keys(txdb, keytype = "TXNAME") tx2gene <- select(txdb, k, "GENEID", "TXNAME") and then apply it to .sf files? library(tximport) txi <- tximport(files, type = "salmon", tx2gene = tx2gene) head(txi$counts)


The command

nam <- sapply(strsplit(<transcript names go here>, "|"), "[", 4)


should be applied to mRNA.fasta that I used for salmon mapping or for quant.sf files? I am sorry for don't understanding what should be instead of <transcript names="" go="" here="">. This should be done before creating tx2gene data frame?

Please try to format your posts reasonably. I had to go through and fix the blob of text you wrote in order to be able to see what you are asking.

There are ways of importing the FASTA names into R, but probably the easiest is to use the Biostrings package:

fastas <- readDNAStringSet(<FASTA file name goes here>)



In which case the transcript names can be extracted by

names(fastas)


And you can parse as I showed. Do note that the transcripts column has to be the same as the first column in the quant.sf file, which will be similar to

gi|163915948|gb|BC157470| Xenopus laevis surfactant protein C, mRNA (cDNA clone MGC:180004 IMAGE:4056930),
complete cds


So the columns of your tx2gene should have the big long FASTA names in the first column and the Entrez Gene ID in the second column. A (probably good) alternative would be to

## replace FASTA names with just GenBank ID
names(fastas) <- sapply(strsplit(names(fastas), "|"), "[", 4)
## write out FASTA with truncated names
writeXStringSet(fastas, <reasonable filename goes here>)


And then align using the 'fixed' FASTA file, so you don't have to deal with those uber-long names, and you can just have a two-column data.frame with GenBank IDs in the first column and Gene IDs in the second.

Thank you, James!

## replace FASTA names with just GenBank ID
names(fastas) <- sapply(strsplit(names(fastas), "|"), "[", 4)
## write out FASTA with truncated names
writeXStringSet(fastas, <reasonable filename goes here>)


Strangely, this code changes all FASTA names to "1".

And could you please clarify, if in principle it is possible for R to change FASTA names to GenBank IDs in .fasta files, why the same operation can't be made for the column "name" in .sf files and the alignment should be re-done?

It could. But why are you asking me? It's your analysis and in the end you are responsible for doing it.

If I succeed to rename FASATA and re-do the alignment and get .sf files with normal names - GenBank IDs, how then I associate the data with OrgDb object?

Would it come like this?

k <- keys(z, keytype = "ACCNUM")
tx2gene <- select(z, k, "ENTREZID", "ACCNUM")


and then

txi <- tximport(<files .sf>, type="salmon", tx2gene=tx2gene)


This is the point at which I inevitably ask why you are doing this analysis yourself instead of finding someone local to do it for you? You are stumbling on one of the easiest steps in the pipeline, so that doesn't bode well for whatever you end up doing with these data.

You could hypothetically re-wire your house, but most people get an electrician for that. You could also re-plumb the house, or rebuild the engine in your car, or do your own surgery when you need it. But nobody does that. They hire the expertise and let the expert handle it. Which is what experts are for.

You can certainly persist in analyzing the data yourself, but do note that this is by far the easiest part! Getting R to do your bidding is only part of the equation. You also need to have some familiarity with statistics in order to know if what you did is defensible or not. Presumably you will be presenting these results to someone else, and will need to explain what you did, why you did it, and what the pros and cons are. Can you do that? If not, I would recommend finding someone local who can and have them do the analysis for you.