Problem using the %in% command


Paul ChristophSchröder ▴ 70

@paul-christophschroder-1940

Last seen 9.6 years ago

An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080220/76033284/attachment.pl


ADD COMMENT • link updated 16.2 years ago by Martin Morgan 25k • written 16.2 years ago by Paul ChristophSchröder ▴ 70


Martin Morgan 25k

@martin-morgan-1513

Last seen 4 days ago

United States

Hi Paul -- I saw this on the R mailing list, too. Such 'cross-posting' is discouraged (though in this case you get answers that you wouldn't have got if you'd restricted yourself to just one list!)

I wonder if your problem was splitting the 'genes' character string with "," rather than ", *" or ",[[:blank:]]*" ? Whatever the case, if you have a data frame (from Jim Holtman's reply on the R list)

    > func_gen
       Function                             x
    1 Function1 gene5, gene19, gene22, gene23
    2 Function2          gene1, gene7, gene19
    3 Function3   gene2, gene3, gene7, gene23

I would have created a named list associating function and gene name:

    > fids <- sapply(func_gen[["x"]], strsplit, ",[[:blank:]]*")
    > names(fids) <- func_gen[["Function"]]

and converted this to an incidence matrix:

    > uids <- unique(unlist(fids))
    > incidence <- sapply(fids, "%in%", x=uids)
    > rownames(incidence) <- uids

Since these seem like gene sets, and your work flow might continue along these lines, it might be convenient to represent your data as a gene set collection

    > library(GSEABase)
    > gs <- mapply(GeneSet, fids, setName=names(fids))
    > gsc <- GeneSetCollection(gs)

and then let the package do the clever operation

    > incidence(gsc)
              gene5 gene19 gene22 gene23 gene1 gene7 gene2 gene3
    Function1     1      1      1      1     0     0     0     0
    Function2     0      1      0      0     1     1     0     0
    Function3     0      0      0      1     0     1     1     1

Martin

Paul Christoph Schröder <pschrode at alumni.unav.es> writes:

> Hello all!
>
> I have the following problem with the %in% command:
>
> 1) I have a data frame that consists of functions (rows) and genes
> (columns). The whole has been loaded with the "read.delim" command
> because of gene-duplications between the different rows.
> 2) Now, there is another data frame that contains all the genes (only
> the genes and without duplicates) from all the functions of the above
> data frame.
>
> What I want to do now is to use the "% in %" command to obtain a
> TRUE-FALSE data frame. This should be a data frame, where for every
> function some genes are TRUE and some are FALSE depending if they were
> or not in the specific function when matched against the "all genes"
> data frame.
>
> The main problem I have is the way how the genes are in the first data
> frame. I used the "unlist" command to separate them through commas ",".
> But every time I do the match between the first and second data frame it
> returns out FALSE for every gene in every function.
>
> Can anyone please give me a hint how to handle the problem?
> Thank you very much in advance!
>
> Paul
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
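A minimal base R sketch of why the separator can matter, using the three-row example from the reply above. The object names (`genes`, `bad`, `good`, `all_genes`) are illustrative and not from the original post, and splitting on a bare comma is only one plausible cause of the reported behaviour. With a bare "," every gene after the first keeps a leading space, so it no longer matches the clean gene name under `%in%`:

```r
# Illustrative data mirroring the func_gen example (names are assumptions).
genes <- c(Function1 = "gene5, gene19, gene22, gene23",
           Function2 = "gene1, gene7, gene19",
           Function3 = "gene2, gene3, gene7, gene23")

# Splitting on a bare comma keeps the space that follows each comma.
bad  <- strsplit(genes, ",", fixed = TRUE)
# Splitting on a comma plus any trailing blanks removes that space.
good <- strsplit(genes, ",[[:blank:]]*")

# Reference vector of unique, clean gene names.
all_genes <- unique(unlist(good))

# One logical column per function, one row per gene.
bad_hits  <- sapply(bad,  function(set) all_genes %in% set)
good_hits <- sapply(good, function(set) all_genes %in% set)
rownames(bad_hits)  <- all_genes
rownames(good_hits) <- all_genes

colSums(bad_hits)   # only the first gene of each set still matches
colSums(good_hits)  # every gene of each set matches
```

If the gene column was read in as a factor, it may also need `as.character()` before `strsplit()`, which only accepts character input.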

ADD COMMENT • link 16.2 years ago Martin Morgan 25k


It might be reasonable to split on space (" "), then paste/collapse together with "" and then split on ",". This will ensure that all spaces (before or after comma) are removed at once.

Oleg

--
Dr Oleg Sklyar * EBI-EMBL, Cambridge CB10 1SD, UK * +44-1223-494466
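The split, collapse, split idea from this reply could be sketched in base R roughly as follows. The input string `x` and the object names are hypothetical, chosen only to show spaces on both sides of the commas:

```r
# Hypothetical input with irregular spacing around the commas.
x <- "gene5 , gene19,  gene22 ,gene23"

# 1. Split on single spaces.
pieces <- strsplit(x, " ", fixed = TRUE)[[1]]
# 2. Glue the pieces back together with no separator, dropping all spaces.
no_space <- paste(pieces, collapse = "")
# 3. Split the space-free string on commas.
genes <- strsplit(no_space, ",", fixed = TRUE)[[1]]

# gsub() strips the spaces in one call and gives the same result.
genes_gsub <- strsplit(gsub(" ", "", x, fixed = TRUE), ",", fixed = TRUE)[[1]]
```

Note that both variants remove every space, including any inside a gene name, so they only suit identifiers that contain no spaces themselves.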

ADD REPLY • link 16.2 years ago Oleg Sklyar ▴ 260


An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20080221/94cce6f0/attachment.pl

ADD REPLY • link 16.2 years ago Paul ChristophSchröder ▴ 70
