The problem is that in reference mRNA transcript names look like >gi|163915948|gb|BC157470| Xenopus laevis surfactant protein C, mRNA (cDNA clone MGC:180004 IMAGE:4056930), complete cds ; but at .sf files name of the transcript comes as gi|163915948|gb|BC157470| and it is hard to analyze the results, putting aside that it is impossible to attribute all the transcripts to each particular gene.
You would need to do some parsing of your reference mRNA transcript names. I tend to use something like
nam <- sapply(strsplit(<transcript names go here>, "|"), "[", 4)
But there are any number of ways to accomplish that task. You might consider using mapIds rather than select. See the help pages for those functions to see the differences.
I don't have any specific solutions, but I'll just say you're somewhat "on your own" to generate a table of transcripts IDs to gene IDs for tximport summarization.
I strongly recommend to obtain the sequence (FASTA) and transcript to gene mapping (GTF) from the same source if possible.
If I understand correctly tximport allows to import all .sf files from the experiment together and connect transcript IDs to gene IDs and join the counts for the same gene that come from different transcripts. To connect transcript IDs to geneIDs there should be tx2gene data frame that contains that connections.
This tx2gene should be made up of two columns:
1) transcript ID
2) gene ID.
Transcript ID should be the same as in abundance (.sf) files. In this case transcript ID is GeneBank number (eg.BC157470). tx2gene can be created from OrgDb object=z (that you created for me) and select function. How is it possible to connect all of GeneBank numbers to Gene ID with select function without putting the particular key corresponding to the gene (like BC157470) every time.
Should i next do as it is shown in the tximport vignette, but instead of txdb use z?
k <- keys(txdb, keytype ="TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
and then apply it to .sf files?
library(tximport)
txi <- tximport(files, type ="salmon", tx2gene = tx2gene)
head(txi$counts)
The command
nam <- sapply(strsplit(<transcript names go here>, "|"), "[", 4)
should be applied to mRNA.fasta that I used for salmon mapping or for quant.sf files? I am sorry for don't understanding what should be instead of <transcript names="" go="" here="">. This should be done before creating tx2gene data frame?
Please try to format your posts reasonably. I had to go through and fix the blob of text you wrote in order to be able to see what you are asking.
There are ways of importing the FASTA names into R, but probably the easiest is to use the Biostrings package:
fastas <- readDNAStringSet(<FASTA file name goes here>)
In which case the transcript names can be extracted by
names(fastas)
And you can parse as I showed. Do note that the transcripts column has to be the same as the first column in the quant.sf file, which will be similar to
So the columns of your tx2gene should have the big long FASTA names in the first column and the Entrez Gene ID in the second column. A (probably good) alternative would be to
## replace FASTA names with just GenBank ID
names(fastas)<- sapply(strsplit(names(fastas), "|"), "[", 4)## write out FASTA with truncated names
writeXStringSet(fastas, <reasonable filename goes here>)
And then align using the 'fixed' FASTA file, so you don't have to deal with those uber-long names, and you can just have a two-column data.frame with GenBank IDs in the first column and Gene IDs in the second.
## replace FASTA names with just GenBank ID
names(fastas)<- sapply(strsplit(names(fastas), "|"), "[", 4)## write out FASTA with truncated names
writeXStringSet(fastas, <reasonable filename goes here>)
Strangely, this code changes all FASTA names to "1".
And could you please clarify, if in principle it is possible for R to change FASTA names to GenBank IDs in .fasta files, why the same operation can't be made for the column "name" in .sf files and the alignment should be re-done?
If I succeed to rename FASATA and re-do the alignment and get .sf files with normal names - GenBank IDs, how then I associate the data with OrgDb object?
This is the point at which I inevitably ask why you are doing this analysis yourself instead of finding someone local to do it for you? You are stumbling on one of the easiest steps in the pipeline, so that doesn't bode well for whatever you end up doing with these data.
You could hypothetically re-wire your house, but most people get an electrician for that. You could also re-plumb the house, or rebuild the engine in your car, or do your own surgery when you need it. But nobody does that. They hire the expertise and let the expert handle it. Which is what experts are for.
You can certainly persist in analyzing the data yourself, but do note that this is by far the easiest part! Getting R to do your bidding is only part of the equation. You also need to have some familiarity with statistics in order to know if what you did is defensible or not. Presumably you will be presenting these results to someone else, and will need to explain what you did, why you did it, and what the pros and cons are. Can you do that? If not, I would recommend finding someone local who can and have them do the analysis for you.
Thanks James.
A note to dshk:
I don't have any specific solutions, but I'll just say you're somewhat "on your own" to generate a table of transcripts IDs to gene IDs for tximport summarization.
I strongly recommend to obtain the sequence (FASTA) and transcript to gene mapping (GTF) from the same source if possible.
Thank you very much for answer! I will try to perform it and get back.
If I understand correctly tximport allows to import all .sf files from the experiment together and connect transcript IDs to gene IDs and join the counts for the same gene that come from different transcripts. To connect transcript IDs to geneIDs there should be tx2gene data frame that contains that connections.
This tx2gene should be made up of two columns: 1) transcript ID 2) gene ID.
Transcript ID should be the same as in abundance (.sf) files. In this case transcript ID is GeneBank number (eg.BC157470). tx2gene can be created from OrgDb object=z (that you created for me) and select function. How is it possible to connect all of GeneBank numbers to Gene ID with select function without putting the particular key corresponding to the gene (like BC157470) every time. Should i next do as it is shown in the tximport vignette, but instead of txdb use z?
The command
should be applied to mRNA.fasta that I used for salmon mapping or for quant.sf files? I am sorry for don't understanding what should be instead of <transcript names="" go="" here="">. This should be done before creating tx2gene data frame?
Please try to format your posts reasonably. I had to go through and fix the blob of text you wrote in order to be able to see what you are asking.
There are ways of importing the FASTA names into R, but probably the easiest is to use the Biostrings package:
In which case the transcript names can be extracted by
And you can parse as I showed. Do note that the transcripts column has to be the same as the first column in the quant.sf file, which will be similar to
So the columns of your tx2gene should have the big long FASTA names in the first column and the Entrez Gene ID in the second column. A (probably good) alternative would be to
And then align using the 'fixed' FASTA file, so you don't have to deal with those uber-long names, and you can just have a two-column data.frame with GenBank IDs in the first column and Gene IDs in the second.
Thank you, James!
Strangely, this code changes all FASTA names to "1".
And could you please clarify, if in principle it is possible for R to change FASTA names to GenBank IDs in .fasta files, why the same operation can't be made for the column "name" in .sf files and the alignment should be re-done?
It could. But why are you asking me? It's your analysis and in the end you are responsible for doing it.
If I succeed to rename FASATA and re-do the alignment and get .sf files with normal names - GenBank IDs, how then I associate the data with OrgDb object?
Would it come like this?
and then
This is the point at which I inevitably ask why you are doing this analysis yourself instead of finding someone local to do it for you? You are stumbling on one of the easiest steps in the pipeline, so that doesn't bode well for whatever you end up doing with these data.
You could hypothetically re-wire your house, but most people get an electrician for that. You could also re-plumb the house, or rebuild the engine in your car, or do your own surgery when you need it. But nobody does that. They hire the expertise and let the expert handle it. Which is what experts are for.
You can certainly persist in analyzing the data yourself, but do note that this is by far the easiest part! Getting R to do your bidding is only part of the equation. You also need to have some familiarity with statistics in order to know if what you did is defensible or not. Presumably you will be presenting these results to someone else, and will need to explain what you did, why you did it, and what the pros and cons are. Can you do that? If not, I would recommend finding someone local who can and have them do the analysis for you.