Gene names


Narendra Kaushik ▴ 150


I have a gene file in this format, with everything in one column (no spaces at all):

SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B

Is there any way to convert it into this format (four columns) other than manually?

SFTPB    NM_000542.1    4506904    surfactant, pulmonary-associated protein B

Any suggestions?

Narendra

Dr. Narendra Kaushik
School of Biosciences, University of Cardiff, Museum Avenue, Cardiff CF10 3US


John Zhang ★ 2.9k


> I have gene file in this format, everything in one column (no spaces at all):
> SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
> Is there any way to convert it in this format (into four columns) except manually?
>
> SFTPB NM_000542.1 4506904 surfactant, pulmonary-associated protein B

Try:

    unlist(strsplit(yourString, "\\|"))

> Any suggestions?
>
> Narendra
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor

Jianhua Zhang
Department of Medical Oncology, Dana-Farber Cancer Institute, 44 Binney Street, Boston, MA 02115-6084


James W. MacDonald 65k


Narendra Kaushik wrote:
> I have gene file in this format, everything in one column (no spaces at all):
> SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
> Is there any way to convert it in this format (into four columns) except manually?
> [...]
> Any suggestions?

Does

    data.frame(scan("filename", what = "c", sep = "|"))

do what you want?

Best,

Jim

--
James W. MacDonald
Affymetrix and cDNA Microarray Core, University of Michigan Cancer Center, 1500 E. Medical Center Drive, 7410 CCGC, Ann Arbor MI 48109
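For comparison, here is a minimal sketch of the table-reading route. scan() with a separator returns one flat character vector, so wrapping it in data.frame() yields a single column; read.table() with sep = "|" keeps one row per record. The column names below are hypothetical labels (the original post does not name the fields), and the input is the question's example record repeated twice and passed through the text argument instead of a file.

```r
# Example input: the record from the question, repeated so there are two rows.
lines <- rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 2)

genes <- read.table(
  text = lines,              # for a real file, use file = "yourfile" instead
  sep = "|",                 # the field separator is taken literally here
  quote = "",                # keep any quote characters in descriptions as-is
  comment.char = "",         # do not treat '#' as the start of a comment
  header = FALSE,
  stringsAsFactors = FALSE,
  # Hypothetical column names, chosen only for this illustration.
  col.names = c("symbol", "accession", "id", "description")
)

str(genes)
```

Setting quote = "" and comment.char = "" is a defensive choice, since free-text gene descriptions can contain apostrophes or '#' characters.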


J.delasHeras@ed.ac.uk ★ 1.9k


Quoting Narendra Kaushik <kaushiknk at cardiff.ac.uk>:
> I have gene file in this format, everything in one column (no spaces at all):
> SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
> Is there any way to convert it in this format (into four columns) except manually?
> [...]
> Any suggestions?
>
> Narendra

Maybe too obvious, but Excel is very good for this sort of thing. Functions like SEARCH give you the position of a particular character (like "|"), and knowing that, you can select the text to the left or right of it. If you do that consecutively for each "|", you can split the fields into separate columns. It'll take a minute.

Jose

--
Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology, Institute for Cell & Molecular Biology, Swann Building, Mayfield Road, University of Edinburgh, Edinburgh EH9 3JR, UK
Email: J.delasHeras at ed.ac.uk

Wolfgang Huber ★ 13k

Hi Narendra,

R is also very good for this sort of thing. Have a look at the strsplit function:

    x  = readLines("yourfile")
    sp = strsplit(x, split="|", fixed=TRUE)

(see the man page of strsplit; "|" is a regular-expression metacharacter, so it has to be matched literally with fixed=TRUE or escaped as "\\|"), and from this you can construct e.g. a vector with the j-th column through

    sapply(sp, "[", j)

Cheers

Wolfgang

-------------------------------------
Wolfgang Huber
European Bioinformatics Institute, European Molecular Biology Laboratory, Cambridge CB10 1SD, England
Http: www.ebi.ac.uk/huber
-------------------------------------

J.delasHeras at ed.ac.uk wrote:
> Maybe too obvious, but Excel is very good for this sort of thing.
> [...]
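The readLines/strsplit approach can be carried through to a tab-delimited file roughly as follows. This is only a sketch under stated assumptions: the input and output files are temporary files created for the demonstration, the input is the question's example record repeated three times, and every line is assumed to contain exactly four fields.

```r
# Temporary stand-ins for the real input and output files.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".tsv")
writeLines(
  rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 3),
  infile
)

x  <- readLines(infile)
sp <- strsplit(x, split = "|", fixed = TRUE)  # literal "|", no regex escaping

# Guard against malformed lines before binding the pieces into a matrix.
stopifnot(all(lengths(sp) == 4))

mat     <- do.call(rbind, sp)   # one row per input line, four columns
symbols <- sapply(sp, "[", 1)   # the j-th column as a vector, here j = 1

# Write the four columns separated by tabs, without quotes or headers.
write.table(mat, outfile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

do.call(rbind, sp) is used here instead of sapply() plus t() simply because it avoids carrying the full input strings along as row names.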


Seth Falcon ★ 7.4k


On 6 Nov 2005, christopher.wilkinson at adelaide.edu.au wrote:
> If you want to do this in R, the function you want is strsplit,
> telling it to split on the "|" character. However "|" is special in
> character splitting (regular expressions) so we have to protect it
> with backslashes.

For using strsplit in this way, you can also pass the fixed=TRUE option, and then you do not need to do any escaping.

+ seth
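A small sketch of the two spellings side by side, using the example record from the question. The variable names are illustrative only.

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"

# Regular-expression split: the metacharacter "|" is escaped with backslashes.
escaped <- strsplit(geneName, "\\|")[[1]]

# Literal split: fixed = TRUE turns off regular-expression interpretation.
literal <- strsplit(geneName, "|", fixed = TRUE)[[1]]

# Both spellings give the same four fields.
identical(escaped, literal)

# Pitfall: an unescaped "|" is read as a regular expression (an alternation
# of empty patterns), so the string is not split into the four fields.
unescaped <- strsplit(geneName, "|")[[1]]
length(unescaped)
```

fixed = TRUE is usually the easier option to read when the separator is a plain character, and it avoids the double-backslash escaping entirely.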


Christopher Wilkinson ▴ 140


If you want to do this in R, the function you want is strsplit, telling it to split on the "|" character. However, "|" is special in character splitting (regular expressions), so we have to protect it with backslashes. As a word of advice, look up regular expressions (?regexp); they are extremely powerful for manipulating strings.

    > geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
    > strsplit(geneName,"\\|")
    [[1]]
    [1] "SFTPB"       "NM_000542.1"
    [3] "4506904"     "surfactant, pulmonary-associated protein B"

Note that it returns a list, where you probably want a vector or array, so something like

    t(as.matrix(strsplit(geneName,"\\|")[[1]]))

or

    unlist(strsplit(geneName,"\\|"))

will give

    "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"

Now let's assume you have a vector of gene names to be split; you can use the sapply function.

    > geneNames <- rep(geneName,3)
    > geneNamesAsMatrix <- t(sapply(geneNames,function(x){unlist(strsplit(x,"\\|"))}))
    > rownames(geneNamesAsMatrix) <- NULL ## otherwise whole str is the row name
    > geneNamesAsMatrix
         [,1]    [,2]          [,3]      [,4]
    [1,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
    [2,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
    [3,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"

Of course you could do this on the command line with perl using something like

    perl -ne 'my @F=split /\|/,$_;print join("\t",@F)' infile > outfile

Cheers

Chris

> Date: Sun, 06 Nov 2005 02:13:39 +0000
> From: J.delasHeras at ed.ac.uk
> Subject: Re: [BioC] Gene names
> To: bioconductor at stat.math.ethz.ch
>
> Maybe too obvious, but Excel is very good for this sort of thing.
> [...]
>
> Jose

--
Dr Chris Wilkinson
Senior Research Officer, Child Health Research Institute (CHRI), Women's and Children's Hospital, North Adelaide
ARC Research Associate, Microarray Analysis Group, School of Mathematical Sciences, The University of Adelaide
Christopher.Wilkinson at adelaide.edu.au
http://mag.maths.adelaide.edu.au/crwilkinson.html
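If the goal is only a tab-delimited text file, as with the perl one-liner, a rough R equivalent is to swap the delimiter without splitting at all. This is a sketch, not a drop-in replacement: the input vector stands in for readLines("infile"), and the commented-out output call uses a placeholder file name.

```r
# Stand-in for readLines("infile"): the example record, repeated twice.
x <- rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 2)

# Replace every literal "|" with a tab; fixed = TRUE avoids regex escaping.
tabbed <- gsub("|", "\t", x, fixed = TRUE)

# To save the result, something like the following would do
# ("outfile" is a placeholder name):
# writeLines(tabbed, "outfile")
```

This keeps each record as a single string, so it suits file conversion better than further analysis in R, where a matrix or data frame is more convenient.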
