If you want to do this in R, the function you want is strsplit, telling
it to split on the "|" character. However, "|" is special in regular
expressions (which strsplit uses for its split pattern), so we have to
protect it with backslashes. As a word of advice, look up regular
expressions (?regexp); they are extremely powerful for manipulating
strings.
> geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
> strsplit(geneName, "\\|")
[[1]]
[1] "SFTPB"   "NM_000542.1"
[3] "4506904" "surfactant, pulmonary-associated protein B"
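If the doubled backslash looks awkward, there are a couple of other ways to make strsplit treat "|" literally. The sketch below compares the escaped pattern with a character class and with fixed = TRUE; the variable names (escaped, bracket, literal) are just illustrative.

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"

# Escaped regular expression: "\\|" in R source is the two-character regex \|
escaped <- strsplit(geneName, "\\|")[[1]]

# Character class: inside [] the pipe has no special meaning
bracket <- strsplit(geneName, "[|]")[[1]]

# fixed = TRUE treats the split pattern as a literal string, not a regex
literal <- strsplit(geneName, "|", fixed = TRUE)[[1]]

# All three approaches should give the same four fields
identical(escaped, bracket) && identical(escaped, literal)
```

For a plain single-character delimiter like this one, fixed = TRUE is probably the easiest to read, since no escaping is needed at all.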
Note that strsplit returns a list, where you probably want a vector or
array, so something like
t(as.matrix(strsplit(geneName,"\\|")[[1]])) or
unlist(strsplit(geneName,"\\|")) will give
"SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
Now let's assume you have a vector of gene names to be split; you can
use the sapply function.
> geneNames <- rep(geneName, 3)
> geneNamesAsMatrix <- t(sapply(geneNames, function(x) { unlist(strsplit(x, "\\|")) }))
> rownames(geneNamesAsMatrix) <- NULL  ## otherwise the whole string is the row name
> geneNamesAsMatrix
     [,1]    [,2]          [,3]      [,4]
[1,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[2,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[3,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
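Since strsplit is already vectorised over its first argument, a possible alternative is to split the whole vector in one call and bind the pieces row-wise. This is only a sketch: it assumes every entry has exactly four fields, and the column names are made up for illustration.

```r
geneName  <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
geneNames <- rep(geneName, 3)

# strsplit accepts a character vector and returns one list element per entry
parts <- strsplit(geneNames, "\\|")

# rbind would silently recycle short rows, so check the field count first
stopifnot(all(lengths(parts) == 4))

# Stack the split vectors into a 3 x 4 character matrix (no row names to clear)
geneTable <- do.call(rbind, parts)

# Hypothetical column names, not taken from the original file
colnames(geneTable) <- c("symbol", "accession", "id", "description")
geneTable
```

If some lines might have a different number of fields, the stopifnot check will stop early rather than produce a misaligned matrix.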
Of course, you could do this on the command line with perl using
something like
perl -ne 'my @F=split /\|/,$_;print join("\t",@F)' infile > outfile
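For comparison, roughly the same file-to-file conversion can be done from R. This is a minimal sketch: infile and outfile here are temporary files created just so the example runs on its own, and the sample contents are the record from the question repeated twice.

```r
# Stand-in files for illustration; substitute your real paths
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 2),
           infile)

# Read each line, swap every literal "|" for a tab, and write the result
records <- readLines(infile)
writeLines(gsub("|", "\t", records, fixed = TRUE), outfile)

# Read the converted file back to inspect it
result <- readLines(outfile)
result
```

The resulting tab-delimited file should then open as four columns in a spreadsheet or via read.delim.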
Cheers
Chris
>Date: Sun, 06 Nov 2005 02:13:39 +0000
>From: J.delasHeras at ed.ac.uk
>Subject: Re: [BioC] Gene names
>To: bioconductor at stat.math.ethz.ch
>Message-ID: <20051106021339.3x6viekhogs0w8w0 at www.staffmail.ed.ac.uk>
>Content-Type: text/plain; charset=ISO-8859-1; format="flowed"
>
>Quoting Narendra Kaushik <kaushiknk at cardiff.ac.uk>:
>
>
>
>>I have gene file in this format, everything in one column (no spaces
>>at all):
>>SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
>>Is there any way to convert it in this format (into four columns)
>>except manually?
>>
>>SFTPB NM_000542.1 4506904
>>surfactant, pulmonary-associated protein B
>>
>>Any suggestions?
>>
>>Narendra
>>
>>
>
>Maybe too obvious, but Excel is very good for this sort of thing.
>Functions like Search allow you to obtain the position of a particular
>character (like "|"), and knowing that you can select the text to the
>left or right of it... if you do that consecutively you can sort it
>like that. It'll take a minute.
>
>Jose
>
>
>
--
Dr Chris Wilkinson
Senior Research Officer | ARC Research Associate
Child Health Research Institute (CHRI)| Microarray Analysis Group
7th floor, Clarence Rieger Building | Room 121
Women's and Children's Hospital | School of Mathematical Sciences
72 King William Rd, | The University of Adelaide, 5005
North Adelaide, 5006 | CRICOS Provider Number 00123M
Math's Office (Room 121) Ph: 8303 3714
CHRI Office (CR2 52A) Ph: 8161 6363
Christopher.Wilkinson at adelaide.edu.au
http://mag.maths.adelaide.edu.au/crwilkinson.html
Organising Committee Member, 5th Australian Microarray Conference
29th Sept to 1st Oct 2005, Novatel Barossa Valley Resort
http://www.sapmea.asn.au/conventions/microarray/index.html