snpStats, read.long, alleles in two columns
Liz Hare ▴ 30
@liz-hare-5148
Last seen 9.7 years ago
Hello,

I am trying to read an Illumina final format .txt file (tab-delimited) into snpStats. The file contains 4 columns: snp, sample, allele 1, and allele 2. Some sample lines:

BICF2G630100019 04-0677/J279 C C
BICF2G630100032 04-0677/J279 T T
BICF2G630100034 04-0677/J279 G G
BICF2G630100043 04-0677/J279 A A
BICF2G630100054 04-0677/J279 T T
BICF2G630100063 04-0677/J279 T C
BICF2G630100075 04-0677/J279 T T
BICF2G63010009 04-0677/J279 G G
BICF2G630100090 04-0677/J279 C C

I can't figure out from the documentation or vignette on data input how to specify that the alleles are in two columns.

This doesn't work:

> CanineHD <- read.long(file="filename",
+ fields=c(snp=1, sample=2, genotype=3, genotype=4),
+ verbose=TRUE)
Data to be read from the file filename
No confidence thresholds specified
Genotype read as a single field of two characters (which specify the alleles)
Initial scan of file
First sample: 04-0677/J279
First snp: BICF2G630100019
Last snp: YNp1-608
Last sample: 10-1160
96x173662 matrix to be read
Reading genotypes from file
20% 40% 60% 80% 100%
.........|.........|.........|.........|.........|
-Error in read.long(file = "filename", :
  at line 1: C (expecting a 2-character genotype field)
In addition: Warning message:
closing unused connection 3 (filename)

So I tried:

> CanineHD <- read.long(file="filename",
+ fields=c(snp=1, sample=2, genotype=3),
+ gcodes="\t", codes="nucleotide", verbose=TRUE)
Error in read.long(file = "filename", :
  unused argument(s) (codes = "nucleotide")

> CanineHD <- read.long(file="filename",
+ fields=c(snp=1, sample=2, genotype=3),
+ split="\t", verbose=TRUE)
Data to be read from the file filename
No confidence thresholds specified
Genotype read as a single field of two characters (which specify the alleles)
Initial scan of file
First sample: 04-0677/J279
First snp: BICF2G630100019
Last snp: YNp1-608
Last sample: 10-1160
96x173662 matrix to be read
Reading genotypes from file
20% 40% 60% 80% 100%
.........|.........|.........|.........|.........|
-Error in read.long(file = "filename", :
  at line 1: C (expecting a 2-character genotype field)
In addition: Warning message:
closing unused connection 12 (filename)

Is there a keyword for alleles rather than genotypes? I tried substituting the word 'allele' but didn't get anywhere. I suspect I'm not understanding something in the Details section of the documentation.

Thanks,
Liz

--
Liz Hare PhD
Dog Genetics LLC
doggene at earthlink.net
http://www.doggenetics.com
SNP Genetics snpStats • 1.5k views
@vincent-j-carey-jr-4
Last seen 15 hours ago
United States
I think you might have read the documentation more closely, but it is a somewhat tricky function, so here are some pointers.

Suppose we have dem2.txt as a tab-delimited file with the contents you indicate:

> cat(readLines("dem2.txt"), sep="\n")
BICF2G630100019 04-0677/J279 C C
BICF2G630100032 04-0677/J279 T T
BICF2G630100034 04-0677/J279 G G
BICF2G630100043 04-0677/J279 A A
BICF2G630100054 04-0677/J279 T T
BICF2G630100063 04-0677/J279 T C
BICF2G630100075 04-0677/J279 T T
BICF2G63010009 04-0677/J279 G G
BICF2G630100090 04-0677/J279 C C
> dput(id)
c("04-0677/J279", "04-0677/J279", "04-0677/J279", "04-0677/J279",
"04-0677/J279", "04-0677/J279", "04-0677/J279", "04-0677/J279",
"04-0677/J279")
> dput(snp)
c("BICF2G630100019", "BICF2G630100032", "BICF2G630100034",
"BICF2G630100043", "BICF2G630100054", "BICF2G630100063",
"BICF2G630100075", "BICF2G63010009", "BICF2G630100090")

Then

> nn = read.snps.long("dem2.txt", unique(id), snp,
+ fields=c(snp=1, sample=2, allele1=3, allele2=4),
+ codes="nucleotide", sep="\t")
9 genotypes successfully read
> nn
A SnpMatrix with 1 rows and 9 columns
Row name: 04-0677/J279
Col names: BICF2G630100019 ... BICF2G630100090

> sessionInfo()
R Under development (unstable) (2012-02-04 r58266)
Platform: x86_64-apple-darwin10.8.0/x86_64 (64-bit)

locale:
[1] en_US.US-ASCII/en_US.US-ASCII/en_US.US-ASCII/C/en_US.US-ASCII/en_US.US-ASCII

attached base packages:
[1] splines stats graphics grDevices datasets utils tools
[8] methods base

other attached packages:
[1] snpStats_1.5.4 Matrix_1.0-4 lattice_0.20-0
[4] survival_2.36-12 BiocInstaller_1.3.7 weaver_1.21.0
[7] codetools_0.2-8 digest_0.5.1

loaded via a namespace (and not attached):
[1] grid_2.15.0

(In reply to Liz Hare's message of Wed, Mar 7, 2012, 10:16 AM.)
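The `id` and `snp` vectors in the answer above are shown only as `dput()` output. One way they could be derived from the file itself is sketched below. This is a minimal illustration, not the method used in the answer: the temporary file path, the `rows` and `long` objects, and the column names passed to `read.delim()` are assumptions made for the example, and the `read.snps.long()` call simply repeats the arguments given in the answer, guarded so that the base R part runs even when snpStats is not installed.

```r
# Sample lines from the question, written out as a tab-delimited file.
rows <- c(
  "BICF2G630100019\t04-0677/J279\tC\tC",
  "BICF2G630100032\t04-0677/J279\tT\tT",
  "BICF2G630100034\t04-0677/J279\tG\tG",
  "BICF2G630100043\t04-0677/J279\tA\tA",
  "BICF2G630100054\t04-0677/J279\tT\tT",
  "BICF2G630100063\t04-0677/J279\tT\tC",
  "BICF2G630100075\t04-0677/J279\tT\tT",
  "BICF2G63010009\t04-0677/J279\tG\tG",
  "BICF2G630100090\t04-0677/J279\tC\tC"
)
demo_file <- file.path(tempdir(), "dem2.txt")  # assumed location for the demo
writeLines(rows, demo_file)

# colClasses = "character" keeps a column of "T" alleles from being
# parsed as logical TRUE.
long <- read.delim(demo_file, header = FALSE, colClasses = "character",
                   col.names = c("snp", "sample", "allele1", "allele2"))
id  <- long$sample
snp <- long$snp

# Same call as in the answer; only attempted if snpStats is available.
if (requireNamespace("snpStats", quietly = TRUE)) {
  nn <- snpStats::read.snps.long(demo_file, unique(id), snp,
          fields = c(snp = 1, sample = 2, allele1 = 3, allele2 = 4),
          codes = "nucleotide", sep = "\t")
  print(nn)
}
```

For a full-size file, reading all rows with `read.delim()` just to collect the identifiers may be slow; the sketch is meant for checking the approach on a small extract first.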
@david-clayton-4729
Last seen 9.7 years ago
What _should_ work is

..., fields=c(genotype=NA, allele.A=3, allele.B=4), ...

but I have to agree that the documentation is distinctly lacking.

Let me know if this doesn't work.

David Clayton

(In reply to Liz Hare's message of 07/03/12, 15:16.)

--
Professor David Clayton
Wellcome Trust/Juvenile Diabetes Research Foundation Principal Research Fellow
Diabetes and Inflammation Laboratory
Cambridge University, Department of Medical Genetics
Cambridge Institute for Medical Research
Wellcome Trust/MRC Building
Addenbrooke's Hospital
Hills Road
Cambridge CB2 0XY
Tel: (44) 1223 762669
Email: david.clayton at cimr.cam.ac.uk
So I missed the distinction between read.snps.long and read.long. With the materials I gave in the previous mail, we have

CanineHD <- read.long(file="dem2.txt", gcodes=NA,
  fields=c(snp=1, sample=2, genotype=NA, allele.A=3, allele.B=4),
  split="\t", verbose=TRUE)

> CanineHD
$genotypes
A SnpMatrix with 1 rows and 9 columns
Row name: 04-0677/J279
Col names: BICF2G630100019 ... BICF2G630100090

$alleles
                allele.A allele.B
BICF2G630100019 C        <NA>
BICF2G630100032 T        <NA>
BICF2G630100034 G        <NA>
BICF2G630100043 A        <NA>
BICF2G630100054 T        <NA>
BICF2G630100063 T        C
BICF2G630100075 T        <NA>
BICF2G63010009  G        <NA>
BICF2G630100090 C        <NA>

The gcodes setting is necessary, but here the doc may need some amplification.

(In reply to David Clayton's message of Wed, Mar 7, 2012, 11:17 AM.)
Hi David and Vincent,

Thanks so much for the quick responses to my problem!

When I did:

CanineHD <- read.long(file="filename",
  fields=c(snp=1, sample=2, genotype=NA, allele.A=3, allele.B=4),
  verbose=TRUE)

I got:

Data to be read from the file filename
No confidence thresholds specified
Initial scan of file
First sample: 04-0677/J279
First snp: BICF2G630100019
Last snp: YNp1-608
Last sample: 10-1160
96x173662 matrix to be read
Error in length(gcodes) : 'gcodes' is missing

It looks from the help page like gcodes should only be used if the genotype is in one field. My allele fields contain either A, C, T, G or -. What should I tell gcodes?

Thanks,
Liz

(In reply to David Clayton's message of 3/7/2012, 11:17 AM.)
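Since the allele columns are said to contain A, C, T, G or -, it may help to confirm which symbols actually occur before settling on the `read.long()` arguments. The sketch below does that check in base R only. The file name, the `SNP_NOCALL_DEMO` row, and the object names are hypothetical and exist purely for illustration; nothing here calls snpStats.

```r
# Small demo file: two lines from the question plus one hypothetical
# no-call row (SNP_NOCALL_DEMO is not a real marker name).
demo_lines <- c(
  "BICF2G630100019\t04-0677/J279\tC\tC",
  "BICF2G630100063\t04-0677/J279\tT\tC",
  "SNP_NOCALL_DEMO\t04-0677/J279\t-\t-"
)
check_file <- file.path(tempdir(), "allele_check.txt")  # assumed path
writeLines(demo_lines, check_file)

# Read everything as character so "T" is not converted to logical TRUE.
calls <- read.delim(check_file, header = FALSE, colClasses = "character",
                    col.names = c("snp", "sample", "allele.A", "allele.B"))

# Count every symbol that appears in either allele column.
symbols <- table(c(calls$allele.A, calls$allele.B))
print(symbols)

# Anything outside the expected alphabet deserves a closer look.
unexpected <- setdiff(names(symbols), c("A", "C", "G", "T", "-"))

# Number of rows where at least one allele is a no-call.
n_nocall <- sum(calls$allele.A == "-" | calls$allele.B == "-")
```

Elsewhere in this thread, the working `read.long()` call combines `genotype=NA` in `fields` with `gcodes=NA`. How the `-` symbol should be declared to `read.long()` is not settled here, so `?read.long` is the place to check that.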