GOstats and microRNA pipeline using biomaRt
David ▴ 860
@david-3335
Hi,

I am opening a new discussion so as not to confuse it with the previous one.

The objective here is to look for overrepresented GO terms among microRNA targets. One microRNA can have several targets (genes), and a single gene can be targeted by several microRNAs. The aim is to check, for a specific microRNA, which GO terms are overrepresented.

Ok, so let's say my microRNA of interest is miR-A.

Step 1: using my favorite prediction algorithm, I have obtained a list of genes targeted by miR-A. The genes are Ensembl transcripts and, as I said before, miR-A can target the same transcript several times (at different locations), so I need to account for this.

miR-A targets -> ENST001, ENST001, ENST001, ENST0025, ENST089, ENST099, ENST0099, ... (up to 300 different transcripts)

I use biomaRt to get the corresponding GO IDs for these transcripts:

    ....
    library(biomaRt)

    # Select the mart database
    mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

    # Get the GO terms for specific transcripts
    gomir <- getBM(attributes=c(
                     'go_biological_process_id',
                     'go_biological_process_linkage_type',
                     'go_cellular_component_id',
                     'go_cellular_component_linkage_type',
                     'go_molecular_function_id',
                     'go_molecular_function_linkage_type'),
                   filters="ensembl_transcript_id",
                   values=c("ENST00000347770", "ENST00000347770", ...),
                   mart=mart)
    ....

First problem: biomaRt will not return the GO terms twice for duplicated transcripts. In the example above, for c("ENST00000347770","ENST00000347770") I get the same GO terms as for c("ENST00000347770") alone.

As I said before, a microRNA can target the same transcript several times, so I want twice the number of GO terms associated with that transcript. Can we force biomaRt to return redundant GO terms?

I will complete the rest of the pipeline with GOstats once I get this first step sorted out.
GO GOstats biomaRt microRNA
@steve-lianoglou-2771
Hi,

On Wed, Mar 30, 2011 at 9:43 AM, David martin <vilanew at gmail.com> wrote:
> miR-A targets ->
> ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......) up to 300
> different transcripts.

I don't get why you'd want to have the same transcript multiple times as a target for the miRNA -- if the miRNA targets the same transcript in two different locations, you then want to double count the GO terms associated with that transcript?

Somehow that seems wrong to me -- if the "hit count" of the miRNA to the transcript is important to you, one thing you can do is store your miR-A vector as its "table()", so the names will be the transcripts and the values will be the number of hits.

> Can we force biomart to return redundant GoTerms ????

I'm actually still not sure what you want to do, but if you follow my advice above, you can manipulate the data.frame you get from getBM to replicate rows (or whatever you're trying to do).

You will also want to add "ensembl_transcript_id" to your vector of attributes so you can reassociate the rows in the returned table with the original Ensembl transcripts you are querying for, e.g.:

    R> gomir <- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
                      filters='ensembl_transcript_id', values=c("ENST..."), mart=mart)

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
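The table() and row-replication idea from this reply can be sketched in base R as below. This is only an illustration under stated assumptions: the target vector, the mock `go` data.frame (standing in for a getBM() result that includes 'ensembl_transcript_id') and the GO IDs are made-up placeholders, and the single `go_id` column is a simplification of the per-ontology columns used in the thread.

```r
# Hypothetical miR-A target vector; a repeated ID means several predicted sites.
targets <- c("ENST001", "ENST001", "ENST001", "ENST0025", "ENST089")

# Hit count per transcript: names are transcript IDs, values are site counts.
hits <- table(targets)

# Mock stand-in for a getBM() result queried with unique(targets).
# A real result would come from biomaRt; these rows are invented.
go <- data.frame(
  ensembl_transcript_id = c("ENST001", "ENST001", "ENST0025", "ENST089"),
  go_id = c("GO:0000001", "GO:0000002", "GO:0000001", "GO:0000003"),
  stringsAsFactors = FALSE
)

# Look up the hit count of each row's transcript, then repeat the row that often.
n <- as.integer(hits[go$ensembl_transcript_id])
go_rep <- go[rep(seq_len(nrow(go)), times = n), ]
rownames(go_rep) <- NULL
```

Keeping the counts in `hits` and replicating only at the end means the biomaRt query itself can stay on unique IDs.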
On 03/30/2011 04:56 PM, Steve Lianoglou wrote:
> I don't get why you'd want to have the same transcript multiple times
> as a target for the miRNA -- if the miRNA targets the same transcript
> in two different locations, you then want to double count the GO terms
> associated with that transcript?

That's correct. The idea is that a transcript with target sites at several locations is more likely to be targeted, and therefore the GO terms associated with this transcript have to be replicated. This sounds good to me, but I do not expect you to agree.

I have managed to get all the GO IDs with a small function. Basically, you input one transcript ID at a time in a loop:

    # Fetch GO IDs and evidence codes for one transcript ID
    getgoids <- function(id) {
      getBM(attributes=c(
              'go_biological_process_id',
              'go_biological_process_linkage_type',
              'go_cellular_component_id',
              'go_cellular_component_linkage_type',
              'go_molecular_function_id',
              'go_molecular_function_linkage_type'),
            filters="ensembl_transcript_id", values=id, mart=mart)
    }

    # genes: vector of all Ensembl transcript IDs
    goid <- vector("list", length(genes))
    for (i in seq_along(genes)) {
      goid[[i]] <- getgoids(genes[i])  # one data.frame per transcript
    }

I agree with you that I need to add the transcript ID so that GOstats can map between transcripts and GO IDs.

Now I want to use that as the universe set for GOstats and run hyperG to compare it with the GO terms for a specific microRNA.

I guess:

    # All GO IDs from all transcripts targeted by all microRNAs
    goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)

    goFrame = GOFrame(goframeData, organism = "Homo sapiens")
    goAllFrame = GOAllFrame(goFrame)  # gene ID to ALL GO ID mapping

Can you correct my understanding of the GSEAGOHyperGParams arguments below?

geneSetCollection = all GO IDs of all transcripts targeted by all microRNAs
single_mir_transcripts_ids = Ensembl transcript IDs targeted by a specific microRNA
universeGeneIds = transcript-to-GO mapping

Is this correct?

    gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
    params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot Params",
                                 geneSetCollection = gsc,
                                 geneIds = single_mir_transcripts_ids,
                                 universeGeneIds = universe,
                                 ontology = "BP", pvalueCutoff = 0.05,
                                 conditional = FALSE, testDirection = "over")
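To make the roles of the selected set and the universe concrete, here is a minimal base-R sketch of a hypergeometric over-representation test for one GO term at a time. It is not GOstats code and does not reproduce its internals; it assumes, as the GOstats examples appear to, that the selected genes and the universe are plain vectors of IDs rather than mappings. All transcript and GO IDs are invented placeholders.

```r
# Mock mapping: one row per (transcript, GO term) pair; IDs are placeholders.
go <- data.frame(
  transcript = c("T1", "T2", "T3", "T4", "T5", "T6", "T1", "T2"),
  go_id      = c("GO:A", "GO:A", "GO:A", "GO:B", "GO:B", "GO:B", "GO:B", "GO:C"),
  stringsAsFactors = FALSE
)

universe <- unique(go$transcript)  # every annotated transcript (a vector of IDs)
selected <- c("T1", "T2", "T3")    # targets of one hypothetical miRNA

# P(X >= k): chance of drawing at least k annotated transcripts by luck alone.
over_p <- function(term) {
  in_term <- unique(go$transcript[go$go_id == term])
  k <- length(intersect(selected, in_term))  # selected AND annotated
  m <- length(in_term)                       # annotated in the universe
  n <- length(universe) - m                  # not annotated in the universe
  phyper(k - 1, m, n, length(selected), lower.tail = FALSE)
}

p <- sapply(unique(go$go_id), over_p)
```

Because the test works on sets, a transcript listed twice in `selected` still counts once here; any weighting by number of sites would have to be built in separately.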
Hi David,

I understand your reasoning for counting the number of miRNA binding sites within the 3' UTR of a predicted target: you are trying to include the 'combinatorial' effect of miRNA targeting.
I would, however, try to include the length of each UTR (some kind of normalization, if you wish), since the longer the UTR, the higher the chance that a miRNA will bind.
Does this make sense?

Best,
J.

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
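The thread does not settle on a normalization, so the following is only one simple possibility, sketched under assumptions: express the hit count as predicted sites per kilobase of 3' UTR. The transcript IDs, site counts and UTR lengths are invented, and the sketch says nothing about how such a density should enter the GO enrichment test.

```r
# Hypothetical per-transcript data: predicted miR-A sites and 3' UTR length (nt).
utr <- data.frame(
  transcript = c("ENST001", "ENST0025", "ENST089"),
  n_sites    = c(3L, 1L, 1L),
  utr_length = c(3000L, 500L, 2000L),
  stringsAsFactors = FALSE
)

# Site density: predicted sites per kilobase of 3' UTR.
utr$sites_per_kb <- utr$n_sites / (utr$utr_length / 1000)

# Rank transcripts from highest to lowest density.
utr <- utr[order(-utr$sites_per_kb), ]
rownames(utr) <- NULL
```

With these made-up numbers, the transcript with three sites no longer ranks first, because its UTR is also the longest.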
Yes, absolutely. A few Ensembl releases ago UTRs tended to be shorter, but this is getting better now. How would you normalize based on length?

Hi David,

On 03/30/2011 08:31 PM, David martin wrote:
> How would you normalize that based on length ?

I'm afraid I don't have a simple answer to this; it would need thinking through, especially with respect to your GO enrichment analysis.
Any ideas from the members of the list?

Best,
J.
Ok thanks, Any idea on how to turn the biomart output into a valid GOFrame input ?? For example : I wrote this function getgoids <- function (id) { getBM(attributes=c( 'entrezgene', 'ensembl_transcript_id', 'go_biological_process_id', 'go_biological_process_linkage_type', 'go_cellular_component_id', 'go_cellular_component_linkage_type', 'go_molecular_function_id', 'go_molecular_function_linkage_type') ,filters="ensembl_transcript_id", values=id, mart=mart) } getgoids('ENST00000306434') How do i turn this into a valid GOFrame Object ? thanks, david On 03/31/2011 10:10 AM, James F. Reid wrote: > Hi David, > > On 03/30/2011 08:31 PM, David martin wrote: > > Yes absolutly. A few ensembl releases ago UTR tend to be smaller but > > this is getting better now. How would you normalize that based on > length ? > > I'm afraid that I don't have a simple answer to this it would need > thinking out especially wrt to your GO enrichment analysis. > Any ideas from the members of the list? > > Best, > J. > >> On 03/30/2011 07:00 PM, James F. Reid wrote: >>> Hi David, >>> >>> I understand your reasoning for counting the number of miRNA binding >>> sites with the 3' UTR of a predicted target, you are trying to include >>> the 'combinatorial' effect of miRNA targeting. >>> I would try to include the length of any UTR however (some kind of >>> normalization if you wish) since the longer the UTR the more chances are >>> that miRNA will bind. >>> Does this make sense? >>> >>> Best, >>> J. >>> >>> On 03/30/2011 05:23 PM, David martin wrote: >>>> On 03/30/2011 04:56 PM, Steve Lianoglou wrote: >>>>> Hi, >>>>> >>>>> On Wed, Mar 30, 2011 at 9:43 AM, David >>>>> martin<vilanew at="" gmail.com=""> wrote: >>>>>> Hi, >>>>>> I open this new discussion so not to confuse with the previous one. >>>>>> >>>>>> The objective here is to look for overrepresented GoTerms from >>>>>> microRNA >>>>>> targets. 
>>>>>> One microRNA can have several targets (genes) and one single gene
>>>>>> can be targeted by several microRNAs. The aim is to check, for a
>>>>>> specific microRNA, which GO terms are overrepresented.
>>>>>>
>>>>>> Ok so let's say my microRNA of interest is miR-A.
>>>>>>
>>>>>> Step 1: based on my favorite prediction algorithm I have managed to get a
>>>>>> list of genes targeted by miR-A. The genes are Ensembl transcripts and, as I
>>>>>> said before, miR-A can target the same transcript several times (at different
>>>>>> locations), so I need to account for this.
>>>>>>
>>>>>> miR-A targets ->
>>>>>> ENST001,ENST001,ENST001,ENST0025,ENST089,ENST099,ENST0099......) up to 300
>>>>>> different transcripts.
>>>>>
>>>>> I don't get why you'd want to have the same transcript multiple times
>>>>> as a target for the miRNA -- if the miRNA targets the same transcript
>>>>> in two different locations, you then want to double count the GO terms
>>>>> associated with that transcript?
>>>>
>>>> That's correct. The idea behind that is that a transcript targeted at
>>>> different locations is more "likely to be twice targeted" and therefore
>>>> the GO terms associated with this transcript have to be replicated. This
>>>> sounds good to me, but I do not expect you to agree with that.
>>>>
>>>> I have managed to get all GO ids with a small function. Basically you
>>>> input one transcript id in a loop:
>>>>
>>>> l = length(genes) # list of all ensembl transcripts
>>>> for (l in 1:l)
>>>> {
>>>>   goid[l] <- getgoids("ENST...")
>>>> }
>>>> getgoids <- function (id) {
>>>>   getBM(attributes=c(
>>>>           'go_biological_process_id',
>>>>           'go_biological_process_linkage_type',
>>>>           'go_cellular_component_id',
>>>>           'go_cellular_component_linkage_type',
>>>>           'go_molecular_function_id',
>>>>           'go_molecular_function_linkage_type'),
>>>>         filters="ensembl_transcript_id", values=id, mart=mart)
>>>> }
>>>>
>>>> I agree with you that I might need to add the transcript_id so that
>>>> GOstats can map between transcripts and GO ids.
>>>>
>>>> Now I want to use that as the universe set for GOstats and do hyperG to
>>>> compare with the GO terms for a specific microRNA.
>>>>
>>>> I guess:
>>>>
>>>> goframeData = data.frame(frame$go_id, frame$Evidence, frame$gene_id)
>>>> # list of all GO ids from all transcripts targeted by all microRNAs
>>>>
>>>> goFrame = GOFrame(goframeData, organism = "Homo sapiens")
>>>> goAllFrame = GOAllFrame(goFrame) # gene id to ALL GO id mapping
>>>>
>>>> Can you correct me on the GSEAGOHyperGParams call below?
>>>> geneSetCollection = list of all GO ids of all transcripts targeted by
>>>> all microRNAs
>>>> single_mir_transcript_ids = list of Ensembl transcript ids targeted by
>>>> a specific microRNA
>>>> universeGeneIds = list of transcript to GO mapping
>>>> Is this correct?
>>>>
>>>> gsc <- GeneSetCollection(goAllFrame, setType = GOCollection())
>>>> params <- GSEAGOHyperGParams(name = "My Custom GSEA based annot Params",
>>>>             geneSetCollection = gsc, geneIds = single_mir_transcript_ids,
>>>>             universeGeneIds = universe, ontology = "BP", pvalueCutoff = 0.05,
>>>>             conditional = FALSE, testDirection = "over")
>>>>
>>>>> Somehow that seems wrong to me -- if the "hit count" of the miRNA to
>>>>> the transcript is important to you, one thing you can do is store your
>>>>> miR-A vector as its "table()" so the names will be the transcripts,
>>>>> and the values will be the number of hits.
>>>>>
>>>>>> I use biomart to get the corresponding GO ids for these transcripts
>>>>>>
>>>>>> ....
>>>>>> # Select mart database
>>>>>> mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>>>>>
>>>>>> # Get GO terms for a specific transcript
>>>>>> # First problem: Biomart will not return GO terms twice for duplicated
>>>>>> # transcripts. The example below shows that for transcripts
>>>>>> # c("ENST00000347770","ENST00000347770") I get the same GO terms as for
>>>>>> # transcript c("ENST00000347770").
>>>>>> # As I said before, a microRNA can target the same transcript several
>>>>>> # times, so twice the number of GO terms should be associated with this
>>>>>> # particular microRNA. Can we force biomart to return redundant GO terms ????
>>>>>
>>>>> I'm actually still not sure what you want to do, but if you follow my
>>>>> advice above, you can manipulate the data.frame you get from getBM to
>>>>> replicate rows (or whatever you're trying to do).
>>>>>
>>>>> You will also want to add "ensembl_transcript_id" to your vector of
>>>>> attributes so you can reassociate the rows in the table that is
>>>>> returned to you with the original Ensembl transcripts you are
>>>>> querying for, e.g.:
>>>>>
>>>>> R> gomir <- getBM(attributes=c('ensembl_transcript_id', 'go..', ...),
>>>>>                   filters='ensembl_transcript_id', values=c("ENST..."), mart=mart)
>>>>>
>>>>> Hope that helps,
>>>>> -steve
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor

Hi David,

If this were your function, you would first of all want to pass in one big vector (with your universe of transcript IDs in it) and get all the data out in a single call. Then making the GOFrame is just a matter of putting all the gene IDs (Entrez Gene IDs), all the GO IDs (from any of the three ontologies), and the evidence codes into a single data.frame, as outlined in this document:

http://www.bioconductor.org/packages/2.7/bioc/vignettes/GOstats/inst/doc/GOstatsForUnsupportedOrganisms.pdf

But if it were me, I would try to save a little headache when making the final table by getting only the data I needed from getBM (and since they keep the three ontologies separate, that means I would make three calls to getBM). So like this:

getBioProcgoids <- function (id) {
  getBM(attributes=c(
          'go_biological_process_id',
          'go_biological_process_linkage_type',
          'entrezgene'),
        filters="ensembl_transcript_id", values=id, mart=mart)
}

BioGOs <- getBioProcgoids( yourBigUniverseVectorOfEnsemblTranscriptIDsGoesHere )

Then write separate small functions to get the other two ontologies and call them, etc. Then something like this:

# rbind() needs matching column names, so rename the columns of the
# three tables to a common set first
myGOFrame <- rbind(BioGOs, CCGOs, MFGOs)

to stick them all together.

Does that help?

Marc

On 03/31/2011 02:47 AM, David martin wrote:
> Ok thanks,
> Any idea on how to turn the biomart output into a valid GOFrame input?
>
> For example, I wrote this function:
>
> getgoids <- function (id) {
>   getBM(attributes=c(
>           'entrezgene',
>           'ensembl_transcript_id',
>           'go_biological_process_id',
>           'go_biological_process_linkage_type',
>           'go_cellular_component_id',
>           'go_cellular_component_linkage_type',
>           'go_molecular_function_id',
>           'go_molecular_function_linkage_type'),
>         filters="ensembl_transcript_id", values=id, mart=mart)
> }
> foo
>
> How do I turn this into a valid GOFrame object?
>
> thanks,
> david
>
> On 03/31/2011 10:10 AM, James F. Reid wrote:
>> Hi David,
>>
>> On 03/30/2011 08:31 PM, David martin wrote:
>> > Yes, absolutely. A few Ensembl releases ago UTRs tended to be shorter, but
>> > this is getting better now. How would you normalize that based on length?
>>
>> I'm afraid that I don't have a simple answer to this; it would need
>> thinking out, especially with respect to your GO enrichment analysis.
>> Any ideas from the members of the list?
>>
>> Best,
>> J.
>>
>>> On 03/30/2011 07:00 PM, James F. Reid wrote:
>>>> Hi David,
>>>>
>>>> I understand your reasoning for counting the number of miRNA binding
>>>> sites within the 3' UTR of a predicted target: you are trying to include
>>>> the 'combinatorial' effect of miRNA targeting.
>>>> I would try to include the length of each UTR, however (some kind of
>>>> normalization, if you wish), since the longer the UTR, the greater the
>>>> chance that a miRNA will bind.
>>>> Does this make sense?
>>>>
>>>> Best,
>>>> J.
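The two suggestions in this exchange (keep the hit counts with table(), and stack per-ontology tables into one data.frame for GOFrame) can be tried out without querying BioMart. The sketch below is only an illustration under stated assumptions: the three small data frames are mock stand-ins for getBM() results, all IDs are placeholders, and the helper to_go_rows() and the common column names go_id, evidence and gene_id are hypothetical names chosen here, not part of biomaRt or GOstats.

```r
# Hypothetical miR-A target list: a transcript appears once per predicted site.
targets <- c("ENST001", "ENST001", "ENST001", "ENST0025", "ENST089")

# Hit counts: names are the transcripts, values are the number of sites.
hits <- table(targets)

# Mock stand-ins for three per-ontology getBM() results (placeholder IDs).
bp <- data.frame(ensembl_transcript_id = c("ENST001", "ENST0025"),
                 go_biological_process_id = c("GO:0000001", "GO:0000002"),
                 go_biological_process_linkage_type = c("IEA", "IDA"),
                 entrezgene = c(101, 102), stringsAsFactors = FALSE)
cc <- data.frame(ensembl_transcript_id = "ENST001",
                 go_cellular_component_id = "GO:0000003",
                 go_cellular_component_linkage_type = "IEA",
                 entrezgene = 101, stringsAsFactors = FALSE)
mf <- data.frame(ensembl_transcript_id = "ENST089",
                 go_molecular_function_id = "",      # no annotation returned
                 go_molecular_function_linkage_type = "",
                 entrezgene = 103, stringsAsFactors = FALSE)

# Give every table the same column names so that rbind() accepts them.
to_go_rows <- function(df, prefix) {
  data.frame(go_id = df[[paste0(prefix, "_id")]],
             evidence = df[[paste0(prefix, "_linkage_type")]],
             gene_id = df$entrezgene,
             ensembl_transcript_id = df$ensembl_transcript_id,
             stringsAsFactors = FALSE)
}
go_rows <- rbind(to_go_rows(bp, "go_biological_process"),
                 to_go_rows(cc, "go_cellular_component"),
                 to_go_rows(mf, "go_molecular_function"))

# Drop rows that carry no GO ID or no gene ID.
keep <- !is.na(go_rows$go_id) & go_rows$go_id != "" & !is.na(go_rows$gene_id)
go_rows <- go_rows[keep, ]

# Optional: repeat each annotation row once per predicted site.
n_sites <- as.integer(hits[go_rows$ensembl_transcript_id])
weighted <- go_rows[rep(seq_len(nrow(go_rows)), times = n_sites), ]

# Three columns in the order GO ID, evidence code, gene ID.
goframeData <- unique(go_rows[, c("go_id", "evidence", "gene_id")])
```

With real getBM() output in place of the mock tables, goframeData has the column order that David's GOFrame(goframeData, organism = "Homo sapiens") call above expects, while weighted keeps the per-site replication separate so it can be used or ignored as needed.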
@iain-gallagher-2532
Last seen 6.2 years ago
United Kingdom
Hi David, I'm not sure that you need to / should normalise based on UTR length. miRNA target repression is mostly thought to result from the miRNA binding to the 3'UTR and guiding the target mRNA to an AGO-containing processing unit for destruction. This may not be the only mechanism. In any case, the targeting of an mRNA by a miRNA depends on complementarity between the seed region of the miRNA (bases 2-7 or 8) and the 3'UTR. There's also some evidence that complementarity at bases 12/13 - 16/17 of the miRNA is important in binding. Furthermore, the position of the binding site in the UTR is also important (see the Bartel group's TargetScan papers for more info here - Friedman et al I think, and Grimson et al), so some binding sites may be spurious. Thus the targeting of a given mRNA by a miRNA is not really a function of UTR length per se but of UTR sequence. Presumably the algorithm you have used to select miRNA targets takes the sequence into account (i.e. you know that there is complementarity already).
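To make the seed-complementarity point concrete, here is a minimal base R sketch that looks for perfect matches to the reverse complement of miRNA bases 2-8 in a UTR. It is a toy illustration, not how any particular prediction algorithm works: the function names revcomp_rna() and seed_sites() are hypothetical, the input sequences are illustrative, and real tools also weigh site position, context and conservation.

```r
# Reverse complement of an RNA string (A<->U, C<->G).
revcomp_rna <- function(x) {
  bases <- strsplit(chartr("ACGU", "UGCA", x), "")[[1]]
  paste(rev(bases), collapse = "")
}

# Start positions in the UTR of perfect matches to the seed (bases from..to).
seed_sites <- function(mirna, utr, from = 2L, to = 8L) {
  site <- revcomp_rna(substr(mirna, from, to))
  last_start <- nchar(utr) - nchar(site) + 1L
  if (last_start < 1L) return(integer(0))   # UTR shorter than the site
  starts <- seq_len(last_start)
  starts[substring(utr, starts, starts + nchar(site) - 1L) == site]
}

# Illustrative inputs, both written 5' to 3' in the RNA alphabet.
mirna <- "UGAGGUAGUAGGUUGUAUAGUU"
utr   <- "AAACUACCUCGGGCUACCUCA"

sites <- seed_sites(mirna, utr)   # two sites, starting at positions 4 and 14
```

The number of sites found this way depends on what the UTR sequence contains rather than directly on how long it is, which is the distinction drawn above.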