Formatting in R
Voke AO
@voke-ao-4830
Hi all,

I have two files I need to merge, but first I need to reformat one of them. It looks like this:

GENE1
snp001
snp002
snp003
snp004
END

GENE2
snp005
snp006
snp007
snp008
END

GENE3
snp009
snp010
snp011
snp012
END

It's pretty much a set file from Plink. What I'd like is to have the file look like this:

GENE1 snp001
GENE1 snp002
GENE1 snp003
GENE1 snp004
GENE2 snp005
GENE2 snp006
GENE2 snp007
GENE2 snp008
GENE3 snp009
GENE3 snp010
...

The second file contains most of the same SNPs, with additional data, so I want to do something like what can be done in Microsoft Access: merge the two files using the common unique identifier, the SNPs.

SNP     Pval     Fst
snp001  0.0005   0.25
snp002  0.0003   0.75
snp003  0.0001   0.65
snp004  0.00001  0.3
snp005  0.00006  0.5
snp006  0.0004   0.1
snp007  0.00003  0.6
snp008  0.0002   0.75

Any help with this in R will be greatly appreciated.

Thanks.

Avoks
@stephanie-m-gogarten-5121
University of Washington
Hi Avoks,

This is more of an R-help question than a Bioconductor question, but here's something that I think will work for you. Step 1 does some indexing of your first input file, and step 2 uses the "merge" function in R - see the man page for that function for more details.

> f1 <- readLines("f1.txt")
> gene.index <- grep("^GENE", f1)
> snp.index <- grep("^snp", f1)
> end.index <- grep("^END", f1)
> snps <- f1[snp.index]
> genes <- character(length(snps))
> for (i in 1:length(gene.index)) {
+   snps.in.gene <- snp.index > gene.index[i] & snp.index < end.index[i]
+   genes[snps.in.gene] <- f1[gene.index[i]]
+ }
> snp.by.gene <- data.frame(GENE=genes, SNP=snps, stringsAsFactors=FALSE)
> snp.by.gene
    GENE    SNP
1  GENE1 snp001
2  GENE1 snp002
3  GENE1 snp003
4  GENE1 snp004
5  GENE2 snp005
6  GENE2 snp006
7  GENE2 snp007
8  GENE2 snp008
9  GENE3 snp009
10 GENE3 snp010
11 GENE3 snp011
12 GENE3 snp012

> f2 <- read.table("f2.txt", as.is=TRUE, header=TRUE)
> f2
     SNP  Pval  Fst
1 snp001 5e-04 0.25
2 snp002 3e-04 0.75
3 snp003 1e-04 0.65
4 snp004 1e-05 0.30
5 snp005 6e-05 0.50
6 snp006 4e-04 0.10
7 snp007 3e-05 0.60
8 snp008 2e-04 0.75

> snp.table <- merge(snp.by.gene, f2, by="SNP", all.x=TRUE)
> snp.table
      SNP  GENE  Pval  Fst
1  snp001 GENE1 5e-04 0.25
2  snp002 GENE1 3e-04 0.75
3  snp003 GENE1 1e-04 0.65
4  snp004 GENE1 1e-05 0.30
5  snp005 GENE2 6e-05 0.50
6  snp006 GENE2 4e-04 0.10
7  snp007 GENE2 3e-05 0.60
8  snp008 GENE2 2e-04 0.75
9  snp009 GENE3    NA   NA
10 snp010 GENE3    NA   NA
11 snp011 GENE3    NA   NA
12 snp012 GENE3    NA   NA

Stephanie

(In reply to Ovokeraye Achinike-Oduaran, 10/3/13 12:23 AM.)
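A possible variation on the answer above, offered as a sketch rather than a drop-in replacement: the grep() patterns rely on set names starting with "GENE" and SNP IDs starting with "snp", which real data may not follow. The version below uses only the file structure instead. It assumes that every set is closed by an END line and that the first line of each set is its name. The data are typed inline (copied from the example in the question) so the block runs without f1.txt or f2.txt, and the helper names is.end, is.name, set.id, set.name and keep are made up for illustration.

```r
# Inline copies of the two example files from the question.
f1 <- c("GENE1", "snp001", "snp002", "snp003", "snp004", "END", "",
        "GENE2", "snp005", "snp006", "snp007", "snp008", "END", "",
        "GENE3", "snp009", "snp010", "snp011", "snp012", "END")

f2 <- read.table(text = "SNP Pval Fst
snp001 0.0005 0.25
snp002 0.0003 0.75
snp003 0.0001 0.65
snp004 0.00001 0.3
snp005 0.00006 0.5
snp006 0.0004 0.1
snp007 0.00003 0.6
snp008 0.0002 0.75", header = TRUE, as.is = TRUE)

# Strip surrounding whitespace and drop blank lines.
f1 <- trimws(f1)
f1 <- f1[nzchar(f1)]

# Assumption: a set name is the first line overall and every line
# that directly follows an END line.
is.end  <- f1 == "END"
is.name <- c(TRUE, is.end[-length(is.end)])

# cumsum() numbers the sets 1, 2, 3, ...; indexing the set names by
# that number labels every line with the set it belongs to.
set.id   <- cumsum(is.name)
set.name <- f1[is.name][set.id]

# Keep only the SNP lines (neither a set name nor END).
keep <- !is.name & !is.end
snp.by.gene <- data.frame(GENE = set.name[keep], SNP = f1[keep],
                          stringsAsFactors = FALSE)

# all.x = TRUE keeps SNPs that are missing from f2 (filled with NA);
# leaving it out would drop those rows instead.
snp.table <- merge(snp.by.gene, f2, by = "SNP", all.x = TRUE)
print(snp.table)
```

With the example data this should give the same 12 rows as the loop version, with NA in Pval and Fst for the GENE3 SNPs. If only SNPs present in both files are wanted, dropping all.x = TRUE turns the merge into an inner join.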
Thank you Stephanie.

-Avoks

(In reply to Stephanie M. Gogarten, Thu, Oct 3, 2013 at 8:28 PM.)