extracting character string
Hari Easwaran ▴ 240
@hari-easwaran-3510
Last seen 9.5 years ago
United States
Hi all,

I am working with Agilent microarray data and trying to extract only the accession numbers from the output probe annotation. I have a column describing each probe as follows:

ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158
ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239
mgc|BC034752:79
ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790
...

I am trying to extract only the RefSeq IDs (in this case NM_004564, NM_007266, NM_057094, NM_005209, NM_194302, ...) and create a new column with the IDs. I have not been able to figure out how to do this. I tried using the function 'strsplit', but it doesn't work.

I am a newbie to R/Bioconductor and would appreciate it if someone could help.

Thanks.
Hari
Microarray Annotation probe • 927 views
Mark Robinson ★ 1.1k
@mark-robinson-2171
Last seen 10.2 years ago
Hi Hari.

strsplit() will work, it's just sensitive: by default its split argument is interpreted as a regular expression, so the | has to be escaped. For starters, you might try:

> x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
+ "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239","mgc|BC034752:79")
>
> strsplit(x,"\\|")
[[1]]
[1] "ref"           "NM_004564"     "ref"           "PET112L:2131"
[5] "mgc"           "BC130348:2158"

[[2]]
[1] "ref"           "NM_007266"     "ref"           "XAB1:2255"
[5] "mgc"           "BC007451:2239"

[[3]]
[1] "mgc"         "BC034752:79"

And, for extracting the first 2 columns, maybe you'll want to migrate towards something like:

> t(sapply(x, FUN=function(u) strsplit(u, "\\|")[[1]][1:2], USE.NAMES=FALSE))
     [,1]  [,2]
[1,] "ref" "NM_004564"
[2,] "ref" "NM_007266"
[3,] "mgc" "BC034752:79"

Hope that gets you started.

Cheers,
Mark

On 17/06/2009, at 7:54 AM, Hari Easwaran wrote:
> I am trying to extract only the Refseq IDs (in this case NM_004564,
> NM_007266, NM_057094, NM_005209, NM_194302.....) and create a new
> column with the IDs. I am not able to figure out how to do this. I tried
> using the function 'strsplit', but it doesn't work.

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
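Going one step beyond the split shown above, here is a minimal sketch of one way to pull out every RefSeq ID and store it as a new column. It is not part of the original answer. It assumes that an accession of interest always looks like "NM_" followed by digits, and the names `hits`, `refseq` and `annot` are made up for illustration.

```r
# Probe annotation strings copied from the question.
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79",
       "ref|NM_057094|ref|CRYBA2:-2513|ref|NM_005209:-2519|ref|NM_194302:45605|mirna|hsa-mir-375:5790")

# Assumption: a RefSeq ID of interest is "NM_" followed by one or more digits.
# gregexpr() finds every match per string; regmatches() extracts the matched text.
hits <- regmatches(x, gregexpr("NM_[0-9]+", x))

# Collapse the matches to one string per probe; probes without a match get NA.
refseq <- vapply(
  hits,
  function(h) if (length(h) > 0) paste(h, collapse = ";") else NA_character_,
  character(1)
)

# New column next to the original annotation (hypothetical data frame).
annot <- data.frame(probe = x, refseq = refseq, stringsAsFactors = FALSE)
print(annot$refseq)
```

If other RefSeq prefixes (for example NR_ or XM_) matter for your array, the pattern would need to be widened accordingly.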
Hi Hari, Mark,

Mark Robinson wrote:
> > strsplit(x,"\\|")

Note that it's better here to use strsplit() with fixed=TRUE. Then there is no need to escape the |, and in addition strsplit() will be much faster...

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
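As a small illustration of the fixed=TRUE suggestion, the sketch below splits the same example strings both ways and checks that the results agree. The variable names `parts_fixed` and `parts_regex` are made up for this example, and no timing is attempted here.

```r
# Example strings from the thread.
x <- c("ref|NM_004564|ref|PET112L:2131|mgc|BC130348:2158",
       "ref|NM_007266|ref|XAB1:2255|mgc|BC007451:2239",
       "mgc|BC034752:79")

# Literal split: with fixed = TRUE the "|" is a plain character, not a regex operator.
parts_fixed <- strsplit(x, "|", fixed = TRUE)

# Regex split with the pipe escaped, as in the earlier answer.
parts_regex <- strsplit(x, "\\|")

# Both approaches should give the same list of pieces.
print(identical(parts_fixed, parts_regex))
```

For these inputs the two calls return the same pieces, so the choice comes down to readability and speed.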
Hi Mark and Hervé,

Thanks a lot. I will try that. I was using strsplit(x,"|"), without the backslashes.

Thanks again.

Sincerely,
Hari

2009/6/16 Hervé Pagès <hpages@fhcrc.org>
> Note that it's better here to use strsplit() with fixed=TRUE. Then no
> need to escape the | and in addition strsplit() will be much faster...
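For anyone wondering why the call without backslashes appears not to work, here is a short sketch of what it does. As a regular expression, a bare | is an alternation between two empty patterns, so it matches the empty string and the input is cut between every character. The name `chars` is made up for this example.

```r
# One of the probe strings from the question.
x <- "mgc|BC034752:79"

# Unescaped "|" is treated as a regex that matches the empty string,
# so strsplit() breaks the input into single characters.
chars <- strsplit(x, "|")[[1]]

# The result has one element per character, including the "|" itself.
print(length(chars) == nchar(x))
print("|" %in% chars)
```

Escaping the pipe ("\\|") or passing fixed=TRUE avoids this.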