I have a gene file in this format, with everything in one column (no
spaces at all):
SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
Is there any way, other than manually, to convert it into this format
(four columns)?
SFTPB   NM_000542.1   4506904   surfactant, pulmonary-associated protein B
Any suggestions?
Narendra
Dr. Narendra Kaushik
School of Biosciences,
University of Cardiff,
Museum Avenue,
Cardiff CF10 3US
Tel: 029 20 875 153
>I have gene file in this format, everything in one column (no spaces at all):
>SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
>Is there any way to convert it in this format (into four columns) except
>manually?
>
>SFTPB NM_000542.1 4506904 surfactant, pulmonary-associated protein B
try:
unlist(strsplit(yourString, "\\|"))
Jianhua Zhang
Department of Medical Oncology
Dana-Farber Cancer Institute
44 Binney Street
Boston, MA 02115-6084
Narendra Kaushik wrote:
> I have gene file in this format, everything in one column (no spaces at all):
> SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
> Is there any way to convert it in this format (into four columns) except
> manually?
>
> SFTPB NM_000542.1 4506904 surfactant, pulmonary-associated protein B
>
> Any suggestions?
Does
data.frame(matrix(scan("filename", what = "character", sep = "|", quote = ""), ncol = 4, byrow = TRUE))
do what you want?
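
A related option, sketched here under the assumption that every line has exactly four "|"-separated fields, is to let read.table do the splitting. The temporary file and the column names (symbol, refseq, gi, description) are illustrative only and are not from the original post.

```r
# Stand-in for the real gene file: the example record from the question, twice.
record <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
infile <- tempfile(fileext = ".txt")
writeLines(rep(record, 2), infile)

# sep = "|" is taken literally by read.table (no regular expressions involved).
# quote = "" and comment.char = "" stop apostrophes or '#' in descriptions
# from being interpreted; colClasses keeps the numeric-looking ID as text.
genes <- read.table(infile, sep = "|", quote = "", comment.char = "",
                    colClasses = "character", stringsAsFactors = FALSE,
                    col.names = c("symbol", "refseq", "gi", "description"))
str(genes)
```

This gives a data frame with one row per gene and one column per field, which can then be written out with write.table if a tab-delimited file is wanted.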
Best,
Jim
--
James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
Quoting Narendra Kaushik <kaushiknk at cardiff.ac.uk>:
> I have gene file in this format, everything in one column (no spaces at all):
> SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B
> Is there any way to convert it in this format (into four columns) except
> manually?
>
> SFTPB NM_000542.1 4506904 surfactant, pulmonary-associated protein B
>
> Any suggestions?
>
> Narendra
Maybe too obvious, but Excel is very good for this sort of thing.
Functions like Search let you obtain the position of a particular
character (like "|"), and knowing that, you can select the text to the
left or right of it. If you do that consecutively for each "|" you can
sort it out that way. It'll take a minute.
Jose
--
Dr. Jose I. de las Heras                     Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology   Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology       Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
Hi Narendra,
R is also very good for this sort of thing. Have a look at the strsplit
function.
x = readLines("yourfile")
sp = strsplit(x, split="|", fixed=TRUE)
(see the man page of strsplit; fixed=TRUE makes "|" match literally
rather than as a regular-expression operator) and from this you can
construct e.g. a vector with the j-th column through
sapply(sp, "[", j)
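
As a rough sketch of how the per-column extraction could be assembled into a table, assuming four fields per record and matching the delimiter literally with fixed = TRUE: the input vector below stands in for readLines("yourfile"), and the column names are made up for illustration.

```r
# Stand-in for readLines("yourfile"): the example record repeated three times.
x <- rep("SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B", 3)

# One character vector of fields per input line.
sp <- strsplit(x, split = "|", fixed = TRUE)

# sapply(sp, "[", j) pulls the j-th field out of every record.
genes <- data.frame(symbol      = sapply(sp, "[", 1),
                    refseq      = sapply(sp, "[", 2),
                    gi          = sapply(sp, "[", 3),
                    description = sapply(sp, "[", 4),
                    stringsAsFactors = FALSE)
genes$symbol
```

A record with fewer than four fields would give NA in the missing positions, so it may be worth checking the field counts before relying on the result.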
Cheers
Wolfgang
-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber
-------------------------------------
On 6 Nov 2005, christopher.wilkinson at adelaide.edu.au wrote:
>
> If you want to do this in R, the function you want is strsplit,
> telling it to split on the "|" character. However "|" is special in
> character splitting (regular expressions) so we have to protect it
> with backslashes.
For using strsplit in this way, you can also pass the fixed=TRUE
option and then you do not need to do any escaping.
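
A small sketch comparing the two spellings on the example record from the thread; the last call uses a made-up string "a|b" only to show what an unescaped "|" does.

```r
geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"

# Regular-expression split: "|" must be escaped.
escaped <- strsplit(geneName, "\\|")[[1]]

# Literal split: no escaping needed.
literal <- strsplit(geneName, "|", fixed = TRUE)[[1]]

identical(escaped, literal)

# Without escaping or fixed = TRUE, "|" is an alternation of two empty
# patterns, so the string is split into single characters instead.
unescaped <- strsplit("a|b", "|")[[1]]
unescaped
```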
+ seth
If you want to do this in R, the function you want is strsplit, telling
it to split on the "|" character. However "|" is special in character
splitting (regular expressions) so we have to protect it with
backslashes. As a word of advice look up regular expressions - they are
extremely powerful for manipulating strings (?regexp)
> geneName <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
> strsplit(geneName,"\\|")
[[1]]
[1] "SFTPB"       "NM_000542.1"
[3] "4506904"     "surfactant, pulmonary-associated protein B"
Note that it returns a list, where you probably want a vector or array,
so something like
t(as.matrix(strsplit(geneName,"\\|")[[1]])) or
unlist(strsplit(geneName,"\\|")) will give
"SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
Now let's assume you have a vector of gene names to be split; you can
use the sapply function.
> geneNames <- rep(geneName,3)
> geneNamesAsMatrix <- t(sapply(geneNames,function(x){unlist(strsplit(x,"\\|"))}))
> rownames(geneNamesAsMatrix) <- NULL ## otherwise whole str is the row name
> geneNamesAsMatrix
     [,1]    [,2]          [,3]      [,4]
[1,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[2,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
[3,] "SFTPB" "NM_000542.1" "4506904" "surfactant, pulmonary-associated protein B"
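
One possible variation on the sapply approach, sketched under the assumption that every record really has four fields: strsplit is already vectorised, so the list it returns can be bound row-wise with do.call(rbind, ...). The explicit length check is an addition here, since a record with a different number of fields would otherwise give a misleading result.

```r
geneName  <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
geneNames <- rep(geneName, 3)

# strsplit accepts the whole vector and returns one element per record.
fields <- strsplit(geneNames, "|", fixed = TRUE)

# Fail early if any record does not have exactly four fields.
stopifnot(all(lengths(fields) == 4))

# Bind the records together as rows of a character matrix.
geneMatrix <- do.call(rbind, fields)
dim(geneMatrix)
```

Because the list is unnamed, the resulting matrix has no row names, so the rownames(...) <- NULL step is not needed with this variant.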
Of course you could also do this on the command line with perl using
something like
perl -ne 'my @F=split /\|/,$_;print join("\t",@F)' infile > outfile
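
For comparison, a rough R counterpart of that one-liner, assuming the goal is simply a tab-delimited copy of the file. The infile and outfile paths below are temporary stand-ins, not real file names from the thread.

```r
# Temporary stand-ins for the real input and output files.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".tsv")
record  <- "SFTPB|NM_000542.1|4506904|surfactant, pulmonary-associated protein B"
writeLines(rep(record, 2), infile)

# Replace every literal "|" with a tab and write the lines back out.
writeLines(gsub("|", "\t", readLines(infile), fixed = TRUE), outfile)

readLines(outfile)
```

The resulting file can be opened directly in a spreadsheet or read back with read.delim.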
Cheers
Chris
--
Dr Chris Wilkinson
Senior Research Officer | ARC Research Associate
Child Health Research Institute (CHRI)| Microarray Analysis Group
7th floor, Clarence Rieger Building | Room 121
Women's and Children's Hospital | School of Mathematical Sciences
72 King William Rd, | The University of Adelaide, 5005
North Adelaide, 5006 | CRICOS Provider Number 00123M
Math's Office (Room 121) Ph: 8303 3714
CHRI Office (CR2 52A) Ph: 8161 6363
Christopher.Wilkinson at adelaide.edu.au
http://mag.maths.adelaide.edu.au/crwilkinson.html
Organising Committee Member, 5th Australian Microarray Conference
29th Sept to 1st Oct 2005, Novatel Barossa Valley Resort
http://www.sapmea.asn.au/conventions/microarray/index.html