Problem using the %in% command
@paul-christophschroder-1940
@martin-morgan-1513
Hi Paul -- I saw this on the R mailing list, too. Such 'cross-posting' is discouraged (though in this case you get answers that you wouldn't have got if you'd restricted yourself to just one list!)

I wonder if your problem was splitting the 'genes' character string with "," rather than ", *" or ",[[:blank:]]*" ? Whatever the case, if you have a data frame (from jim holtman's reply on the R list)

> func_gen
   Function                             x
1 Function1 gene5, gene19, gene22, gene23
2 Function2          gene1, gene7, gene19
3 Function3   gene2, gene3, gene7, gene23

I would have created a named list associating function and gene name:

> fids <- sapply(func_gen[["x"]], strsplit, ",[[:blank:]]*")
> names(fids) <- func_gen[["Function"]]

and converted this to an incidence matrix:

> uids <- unique(unlist(fids))
> incidence <- sapply(fids, "%in%", x=uids)
> rownames(incidence) <- uids

Since these seem like gene sets, and your work flow might continue along these lines, it might be convenient to represent your data as a gene set collection

> library(GSEABase)
> gs <- mapply(GeneSet, fids, setName=names(fids))
> gsc <- GeneSetCollection(gs)

and then let the package do the clever operation

> incidence(gsc)
          gene5 gene19 gene22 gene23 gene1 gene7 gene2 gene3
Function1     1      1      1      1     0     0     0     0
Function2     0      1      0      0     1     1     0     0
Function3     0      0      0      1     0     1     1     1

Martin

Paul Christoph Schröder <pschrode at alumni.unav.es> writes:

> Hello all!
>
> I have the following problem with the %in% command:
>
> 1) I have a data frame that consists of functions (rows) and genes
> (columns). The whole has been loaded with the "read.delim" command
> because of gene-duplications between the different rows.
> 2) Now, there is another data frame that contains all the genes (only
> the genes and without duplicates) from all the functions of the above
> data frame.
>
> What I want to do now is to use the "%in%" command to obtain a
> TRUE-FALSE data frame. This should be a data frame, where for every
> function some genes are TRUE and some are FALSE depending if they were
> or not in the specific function when matched against the "all genes"
> data frame.
>
> The main problem I have is the way how the genes are in the first data
> frame. I used the "unlist" command to separate them through commas ",".
> But every time I do the match between the first and second data frame it
> returns out FALSE for every gene in every function.
>
> Can anyone please give me a hint how to handle the problem?
> Thank you very much in advance!
>
> Paul

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
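To make the suspected whitespace issue concrete, here is a minimal base-R sketch. It assumes the gene strings look like the `func_gen` example quoted in the reply above; the data frame is rebuilt by hand, and the names `naive` and `clean` are only illustrative. It is not a diagnosis of Paul's actual file, which was not posted.

```r
# Hand-built stand-in for the func_gen data frame shown in the thread
# (an assumption; the real data were never posted).
func_gen <- data.frame(
  Function = c("Function1", "Function2", "Function3"),
  x = c("gene5, gene19, gene22, gene23",
        "gene1, gene7, gene19",
        "gene2, gene3, gene7, gene23"),
  stringsAsFactors = FALSE  # strsplit() needs character input, not factors
)

# Splitting on "," alone keeps the blank that follows each comma,
# so every gene after the first carries a leading space.
naive <- strsplit(func_gen$x[1], ",")[[1]]
"gene19" %in% naive   # FALSE, because the element is " gene19"

# Splitting on a comma followed by optional blanks removes that space.
clean <- strsplit(func_gen$x[1], ",[[:blank:]]*")[[1]]
"gene19" %in% clean   # TRUE

# Base-R incidence matrix: rows are genes, columns are functions.
fids <- strsplit(func_gen$x, ",[[:blank:]]*")
names(fids) <- func_gen$Function
uids <- unique(unlist(fids))
incidence <- sapply(fids, function(genes) uids %in% genes)
rownames(incidence) <- uids
```

Note that this matrix is genes by functions, the transpose of the layout printed by `incidence(gsc)`; `t(incidence)` would give functions as rows.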
It might be reasonable to split on space (" "), then paste/collapse together with "" and then split on ",". This will ensure that all spaces (before or after comma) are removed at once.

Oleg

--
Dr Oleg Sklyar * EBI-EMBL, Cambridge CB10 1SD, UK * +44-1223-494466
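The split/collapse/split idea from this reply could look roughly like the following in base R. This is only a sketch: the input string `x` is made up for illustration, and the `gsub()` shortcut is an equivalent alternative rather than something proposed in the thread.

```r
# Made-up input with irregular spacing around the commas
x <- "gene1 ,  gene7, gene19"

# Step by step as described: split on space, collapse with "", split on ","
parts    <- strsplit(x, " ", fixed = TRUE)[[1]]
squeezed <- paste(parts, collapse = "")        # "gene1,gene7,gene19"
genes    <- strsplit(squeezed, ",", fixed = TRUE)[[1]]

# Equivalent shortcut: delete every space first, then split on ","
genes2 <- strsplit(gsub(" ", "", x, fixed = TRUE), ",", fixed = TRUE)[[1]]

identical(genes, genes2)  # TRUE
```

Both variants also delete spaces inside an identifier, so they are only safe when gene names themselves never contain spaces.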