Problems Reading SNPs into snpMatrix

0

Entering edit mode

Chris K. Fuller ▴ 10

@chris-k-fuller-4641

Last seen 9.6 years ago

Hello List, I'm at a loss to explain why the read.snps.long function in snpStats refuses to properly read my SNPs. I've attempted this several different ways without success. I have 195 samples (individuals) and 440,000 SNPs, arranged in the "one call per line" format. I'm running this on Linux Ubuntu, and all R packages are up to date. I've reproduced these cases below. Example 1: Reading SNPs with Nucleotide Genotype With a file like: 1110043 rs2847443 TT 1110059 rs2847443 TT 1110060 rs2847443 TT 1110063 rs2847443 TA 1110066 rs2847443 TT 1110067 rs2847443 TT 1110070 rs2847443 TT I use the command: foo = read.snps.long('one_call_per_line.txt', sample.id=as.character(sample_id[[1]]), snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2, genotype = 3), codes = "nucleotide", sep='\t', verbose=TRUE, in.order=TRUE) And receive the result: Error in read.snps.long("one_call_per_line.txt", sample.id = as.character(sample_id[[1]]), : Nucleotide coded genotype should be 2 character string: line 1 Example 2: Perform my own SNP genotype coding and try again With a file like: 110043 rs2847443 3 1110059 rs2847443 3 1110060 rs2847443 3 1110063 rs2847443 2 1110066 rs2847443 3 1110067 rs2847443 3 1110070 rs2847443 3 I use this command: foo = read.snps.long('one_call_per_line_coded.txt', sample.id=as.character(sample_id[[1]]), snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2, genotype = 3), codes = c("1", "2", "3"), sep='\t', verbose=TRUE, in.order=TRUE) And receive this result: Reading SnpMatrix with 195 rows and 443136 columns Cumulative totals ----------------------------------- File Line Accepted Rejected No call Skipped File name 1 9142000 0 0 9141999 0 0 ...er_line_coded.txt In the first example, it apparently doesn't treat the third column of my tab-delimited text file as two characters. Why? I also generated a file with quotes around each nucleotide (e.g. 'AA'), but this did not change the behavior. In the second example, it seems to treat everything as no call. Why? I've been at this for far longer than I care to admit, so any insight would be greatly appreciated. Thank you, Chris Fuller chris at genome.ucsf.edu

SNP snpMatrix SNP snpMatrix • 1.1k views

ADD COMMENT • link updated 13.0 years ago by Vincent J. Carey, Jr. 6.7k • written 13.0 years ago by Chris K. Fuller ▴ 10

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 6 weeks ago

United States

rex-bash-3.2$ cat demo.txt # tab deilmited 1110043 rs2847443 TT 1110059 rs2847443 TT 1110060 rs2847443 TT 1110063 rs2847443 TA 1110066 rs2847443 TT 1110067 rs2847443 TT 1110070 rs2847443 TT rex-bash-3.2$ fg R214 > ids = read.table("demo.txt", colClasses=c("character", "NULL", "NULL"), h=FALSE)[[1]] > snp.ids = read.table("demo.txt", colClasses=c("NULL", "character", "NULL"), h=FALSE)[[1]] > x = read.snps.long("demo.txt", sample.id=ids, snp.id=snp.ids[1], codes="nucleotide", fields=c(sample=1, snp=2, genotype=3), sep="\t") 7 genotypes successfully read > as(x, "character") [,1] [1,] "B/B" [2,] "B/B" [3,] "B/B" [4,] "A/B" [5,] "B/B" [6,] "B/B" [7,] "B/B" > sessionInfo() R version 2.14.0 Under development (unstable) (2011-04-22 r55596) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] splines stats graphics grDevices datasets utils methods [8] base other attached packages: [1] snpStats_1.1.15 Matrix_0.999375-50 lattice_0.19-23 survival_2.36-5 loaded via a namespace (and not attached): [1] grid_2.14.0 tools_2.14.0 Warning message: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible On Wed, May 11, 2011 at 2:38 PM, Chris K. Fuller <chris at="" genome.ucsf.edu=""> wrote: > Hello List, > > I'm at a loss to explain why the read.snps.long function in snpStats > refuses to properly read my SNPs. ?I've attempted this several > different ways without success. ?I have 195 samples (individuals) and > 440,000 SNPs, arranged in the "one call per line" format. ?I'm running > this on Linux Ubuntu, and all R packages are up to date. ?I've > reproduced these cases below. > > Example 1: ?Reading SNPs with Nucleotide Genotype > > With a file like: > 1110043 rs2847443 ? ? ? TT > 1110059 rs2847443 ? ? ? TT > 1110060 rs2847443 ? ? ? TT > 1110063 rs2847443 ? ? ? TA > 1110066 rs2847443 ? ? ? TT > 1110067 rs2847443 ? ? ? TT > 1110070 rs2847443 ? ? ? TT > > I use the command: > foo = read.snps.long('one_call_per_line.txt', > sample.id=as.character(sample_id[[1]]), > snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2, > genotype = 3), codes = "nucleotide", sep='\t', verbose=TRUE, > in.order=TRUE) > > And receive the result: > Error in read.snps.long("one_call_per_line.txt", sample.id = > as.character(sample_id[[1]]), ?: > Nucleotide coded genotype should be 2 character string: line 1 > > > Example 2: ?Perform my own SNP genotype coding and try again > > With a file like: > 110043 ?rs2847443 ? ? ? 3 > 1110059 rs2847443 ? ? ? 3 > 1110060 rs2847443 ? ? ? 3 > 1110063 rs2847443 ? ? ? 2 > 1110066 rs2847443 ? ? ? 3 > 1110067 rs2847443 ? ? ? 3 > 1110070 rs2847443 ? ? ? 3 > > I use this command: > foo = read.snps.long('one_call_per_line_coded.txt', > sample.id=as.character(sample_id[[1]]), > snp.id=as.character(snp_id_mod[[1]]), fields = c(sample = 1, snp = 2, > genotype = 3), codes = c("1", "2", "3"), sep='\t', verbose=TRUE, > in.order=TRUE) > > And receive this result: > Reading SnpMatrix with 195 rows and 443136 columns > ? ? ? ? ? ? ? ? ? ? ? ? ? ? Cumulative totals > ? ? ? ? ? ? ? ? ? ?----------------------------------- > ? ?File ? ? ?Line ? ? Accepted ? Rejected ? ? No call ? ? Skipped ? ?File name > ? ? 1 ? ?9142000 ? ? ? ? ?0 ? ? ? ? ? ?0 ? ? ? ? 9141999 ? ? ? ? 0 0 > ...er_line_coded.txt > > > In the first example, it apparently doesn't treat the third column of > my tab-delimited text file as two characters. ?Why? ?I also generated > a file with quotes around each nucleotide (e.g. 'AA'), but this did > not change the behavior. ?In the second example, it seems to treat > everything as no call. ?Why? > > I've been at this for far longer than I care to admit, so any insight > would be greatly appreciated. > > Thank you, > > Chris Fuller > chris at genome.ucsf.edu > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 13.0 years ago Vincent J. Carey, Jr. 6.7k

Login before adding your answer.