Question

snpStats, read.long, alleles in two columns


Liz Hare ▴ 30

@liz-hare-5148

Last seen 10.6 years ago

Hello,

I am trying to read an Illumina final format .txt file (tab-delimited) into snpStats. The file contains 4 columns: snp, sample, allele 1, and allele 2. Some sample lines:

    BICF2G630100019 04-0677/J279 C C
    BICF2G630100032 04-0677/J279 T T
    BICF2G630100034 04-0677/J279 G G
    BICF2G630100043 04-0677/J279 A A
    BICF2G630100054 04-0677/J279 T T
    BICF2G630100063 04-0677/J279 T C
    BICF2G630100075 04-0677/J279 T T
    BICF2G63010009 04-0677/J279 G G
    BICF2G630100090 04-0677/J279 C C

I can't figure out from the documentation or vignette on data input how to specify that the alleles are in two columns.

This doesn't work:

    > CanineHD <- read.long(file="filename",
    + fields=c(snp=1, sample=2, genotype=3, genotype=4),
    + verbose=TRUE)
    Data to be read from the file filename
    No confidence thresholds specified
    Genotype read as a single field of two characters (which specify the alleles)
    Initial scan of file
    First sample: 04-0677/J279
    First snp: BICF2G630100019
    Last snp: YNp1-608
    Last sample: 10-1160
    96x173662 matrix to be read
    Reading genotypes from file
    20% 40% 60% 80% 100%
    .........|.........|.........|.........|.........|
    -Error in read.long(file = "filename", :
      at line 1: C (expecting a 2-character genotype field)
    In addition: Warning message:
    closing unused connection 3 (filename)

So I tried:

    > CanineHD <- read.long(file="filename",
    + fields=c(snp=1, sample=2, genotype=3),
    + gcodes="\t", codes="nucleotide", verbose=TRUE)
    Error in read.long(file = "filename", :
      unused argument(s) (codes = "nucleotide")

    > CanineHD <- read.long(file="filename",
    + fields=c(snp=1, sample=2, genotype=3),
    + split="\t", verbose=TRUE)
    Data to be read from the file filename
    No confidence thresholds specified
    Genotype read as a single field of two characters (which specify the alleles)
    Initial scan of file
    First sample: 04-0677/J279
    First snp: BICF2G630100019
    Last snp: YNp1-608
    Last sample: 10-1160
    96x173662 matrix to be read
    Reading genotypes from file
    20% 40% 60% 80% 100%
    .........|.........|.........|.........|.........|
    -Error in read.long(file = "filename", :
      at line 1: C (expecting a 2-character genotype field)
    In addition: Warning message:
    closing unused connection 12 (filename)

Is there a keyword for alleles rather than genotypes? I tried substituting the word 'allele' but didn't get anywhere. I suspect I'm not understanding something in the Details section of the documentation.

Thanks,
Liz

--
Liz Hare PhD
Dog Genetics LLC
doggene at earthlink.net
http://www.doggenetics.com

SNP Genetics snpStats • 1.9k views

Updated 13.1 years ago by David Clayton ▴ 20 • written 13.1 years ago by Liz Hare ▴ 30

score 0 · Answer 1 · 2012-03-07

I think you might have read the documentation more closely, but it is a somewhat tricky function, so here are some pointers.

Suppose we have dem2.txt as a tab-delimited file with the contents you indicate:

    > cat(readLines("dem2.txt"), sep="\n")
    BICF2G630100019 04-0677/J279 C C
    BICF2G630100032 04-0677/J279 T T
    BICF2G630100034 04-0677/J279 G G
    BICF2G630100043 04-0677/J279 A A
    BICF2G630100054 04-0677/J279 T T
    BICF2G630100063 04-0677/J279 T C
    BICF2G630100075 04-0677/J279 T T
    BICF2G63010009 04-0677/J279 G G
    BICF2G630100090 04-0677/J279 C C
    > dput(id)
    c("04-0677/J279", "04-0677/J279", "04-0677/J279", "04-0677/J279",
    "04-0677/J279", "04-0677/J279", "04-0677/J279", "04-0677/J279",
    "04-0677/J279")
    > dput(snp)
    c("BICF2G630100019", "BICF2G630100032", "BICF2G630100034",
    "BICF2G630100043", "BICF2G630100054", "BICF2G630100063",
    "BICF2G630100075", "BICF2G63010009", "BICF2G630100090")

Then

    > nn = read.snps.long("dem2.txt", unique(id), snp,
    +     fields=c(snp=1, sample=2, allele1=3, allele2=4),
    +     codes="nucleotide", sep="\t")
    9 genotypes successfully read
    > nn
    A SnpMatrix with 1 rows and 9 columns
    Row name: 04-0677/J279
    Col names: BICF2G630100019 ... BICF2G630100090

    > sessionInfo()
    R Under development (unstable) (2012-02-04 r58266)
    Platform: x86_64-apple-darwin10.8.0/x86_64 (64-bit)

    locale:
    [1] en_US.US-ASCII/en_US.US-ASCII/en_US.US-ASCII/C/en_US.US-ASCII/en_US.US-ASCII

    attached base packages:
    [1] splines stats graphics grDevices datasets utils tools
    [8] methods base

    other attached packages:
    [1] snpStats_1.5.4 Matrix_1.0-4 lattice_0.20-0
    [4] survival_2.36-12 BiocInstaller_1.3.7 weaver_1.21.0
    [7] codetools_0.2-8 digest_0.5.1

    loaded via a namespace (and not attached):
    [1] grid_2.15.0
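The answer above prints `id` and `snp` with dput() but does not show how they were built. One plausible way to derive them from the file itself is sketched below in base R. This is only an illustration: the temporary file stands in for dem2.txt, and the column names (`snp`, `sample`, `allele1`, `allele2`) are labels chosen here, not anything the file or snpStats requires.

```r
# Rebuild the nine demo records from the thread in a temporary file
# (a stand-in for dem2.txt; tab-delimited, no header).
demo_lines <- c(
  "BICF2G630100019\t04-0677/J279\tC\tC",
  "BICF2G630100032\t04-0677/J279\tT\tT",
  "BICF2G630100034\t04-0677/J279\tG\tG",
  "BICF2G630100043\t04-0677/J279\tA\tA",
  "BICF2G630100054\t04-0677/J279\tT\tT",
  "BICF2G630100063\t04-0677/J279\tT\tC",
  "BICF2G630100075\t04-0677/J279\tT\tT",
  "BICF2G63010009\t04-0677/J279\tG\tG",
  "BICF2G630100090\t04-0677/J279\tC\tC"
)
demo_file <- tempfile(fileext = ".txt")
writeLines(demo_lines, demo_file)

# Read every column as character so allele letters are never
# reinterpreted (a column of only "T" could otherwise become logical).
long <- read.table(demo_file, sep = "\t", header = FALSE,
                   colClasses = "character", quote = "", comment.char = "",
                   col.names = c("snp", "sample", "allele1", "allele2"))

id  <- long$sample   # one entry per record, like dput(id) in the answer
snp <- long$snp      # one entry per record, like dput(snp) in the answer

# De-duplicated identifiers; with many samples per SNP, both vectors
# would need unique() before being used as row and column names.
sample_ids <- unique(id)
snp_ids    <- unique(snp)
```

With a single sample, as in the demo, `snp` is already free of duplicates, which is presumably why the answer passes it to read.snps.long without unique().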

score 0 · Answer 2 · 2012-03-07


David Clayton ▴ 20

@david-clayton-4729

Last seen 10.6 years ago

What _should_ work is

    ..., fields=c(genotype=NA, allele.A=3, allele.B=4), ...

but I have to agree that the documentation is distinctly lacking.

Let me know if this doesn't work.

David Clayton

--
Professor David Clayton
Wellcome Trust/Juvenile Diabetes Research Foundation Principal Research Fellow
Diabetes and Inflammation Laboratory
Cambridge University, Department of Medical Genetics
Cambridge Institute for Medical Research
Wellcome Trust/MRC Building
Addenbrooke's Hospital
Hills Road
Cambridge CB2 0XY
Tel: (44) 1223 762669
Email: david.clayton at cimr.cam.ac.uk
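Before mapping fields 3 and 4 to allele.A and allele.B, it can help to confirm that the file really splits into four tab-separated fields and that the last two hold single characters, since the error in the question complains about a one-character value where a 2-character genotype was expected. The following is a base R sketch on a two-line temporary file made up from the sample records; it does not call snpStats.

```r
# Two records from the thread, written to a temporary tab-delimited file.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("BICF2G630100019\t04-0677/J279\tC\tC",
             "BICF2G630100063\t04-0677/J279\tT\tC"), demo_file)

# Number of tab-separated fields on each line; quoting and comment
# handling are switched off so IDs are taken literally.
n_fields <- count.fields(demo_file, sep = "\t", quote = "", comment.char = "")
table(n_fields)

# Width of each field on the first line: fields 3 and 4 should be
# one character wide if alleles are stored in separate columns.
first <- strsplit(readLines(demo_file, n = 1), "\t", fixed = TRUE)[[1]]
field_widths <- nchar(first)
field_widths
```

If every line has four fields and the last two are one character wide, the two-column allele mapping suggested above matches the file layout.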

13.1 years ago • David Clayton ▴ 20


So I missed the distinction between read.snps.long and read.long.... With the materials I gave in the previous mail, we have

    CanineHD <- read.long(file="dem2.txt", gcodes=NA,
        fields=c(snp=1, sample=2, genotype=NA, allele.A=3, allele.B=4),
        split="\t", verbose=TRUE)

    > CanineHD
    $genotypes
    A SnpMatrix with 1 rows and 9 columns
    Row name: 04-0677/J279
    Col names: BICF2G630100019 ... BICF2G630100090

    $alleles
                    allele.A allele.B
    BICF2G630100019 C        <NA>
    BICF2G630100032 T        <NA>
    BICF2G630100034 G        <NA>
    BICF2G630100043 A        <NA>
    BICF2G630100054 T        <NA>
    BICF2G630100063 T        C
    BICF2G630100075 T        <NA>
    BICF2G63010009  G        <NA>
    BICF2G630100090 C        <NA>

The gcodes setting is necessary -- but here the doc may need some amplification.

13.1 years ago • Vincent J. Carey, Jr. 6.7k


Hi David and Vincent,

Thanks so much for the quick responses to my problem!

When I did:

    CanineHD <- read.long(file="filename",
        fields=c(snp=1, sample=2, genotype=NA, allele.A=3, allele.B=4),
        verbose=TRUE)

I got:

    Data to be read from the file filename
    No confidence thresholds specified
    Initial scan of file
    First sample: 04-0677/J279
    First snp: BICF2G630100019
    Last snp: YNp1-608
    Last sample: 10-1160
    96x173662 matrix to be read
    Error in length(gcodes) : 'gcodes' is missing

It looks from the help page like gcodes should only be used if the genotype is in one field. My allele fields contain either A, C, T, G or -. What should I tell gcodes?

Thanks,
Liz
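Since the allele fields are said to contain A, C, T, G or "-", it may be worth tabulating the symbols actually present before settling on how to call read.long. The base R sketch below does that on a made-up three-line file; the third record, with "-" in both allele columns, is invented here purely to illustrate a no-call. Elsewhere in the thread, gcodes=NA is reported to work when the alleles are read from two fields. The read.long help pages I have seen also list a no.call argument for the string that marks a missing call, but that is an assumption about your installed version, so check ?read.long before relying on it.

```r
# Temporary demo file: two records from the thread plus one invented
# no-call record ("-" in both allele columns) for illustration.
demo_file <- tempfile(fileext = ".txt")
writeLines(c("BICF2G630100019\t04-0677/J279\tC\tC",
             "BICF2G630100063\t04-0677/J279\tT\tC",
             "BICF2G630100075\t04-0677/J279\t-\t-"), demo_file)

# Read all fields as character so allele symbols stay as written.
long <- read.table(demo_file, sep = "\t", header = FALSE,
                   colClasses = "character", quote = "", comment.char = "",
                   col.names = c("snp", "sample", "allele1", "allele2"))

# Count every symbol that occurs in either allele column.
allele_counts <- table(c(long$allele1, long$allele2))
allele_counts

# Anything outside the expected alphabet deserves a look before import.
unexpected <- setdiff(names(allele_counts), c("A", "C", "G", "T", "-"))
unexpected
```

On a full 96 x 173662 file this reads everything into memory, so running it on a subset of lines first may be more practical.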

13.1 years ago • Liz Hare ▴ 30