how to convert genotype snp matrix to nucleotide genotypes?
1
0
Entering edit mode
@tereza-jezkova-unlv-5934
Last seen 7.1 years ago
I created a Snp matrix using a genotypeToSnpMatrix command The matrix looks like this: > mat $genotypes A SnpMatrix with 10 rows and 50581 columns Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ... Rodriguez_Lizard_623_sequence_1_pileup.txt Col names: RADid_0000001_depth_39:0000000058 ... RADid_0078132_depth_33:0000000081 $map DataFrame with 50581 rows and 4 columns snp.names allele.1 allele.2 ignore <character> <dnastringset> <dnastringsetlist> <logical> 1 RADid_0000001_depth_39:0000000058 C A FALSE 2 RADid_0000003_depth_152:0000000007 G A,T TRUE 3 RADid_0000003_depth_152:0000000034 G T,C TRUE 4 RADid_0000003_depth_152:0000000046 T C FALSE 5 RADid_0000010_depth_57:0000000010 T C FALSE ... ... ... ... ... 50577 RADid_0078129_depth_31:0000000062 C T FALSE 50578 RADid_0078132_depth_33:0000000025 T C FALSE 50579 RADid_0078132_depth_33:0000000033 C T FALSE 50580 RADid_0078132_depth_33:0000000044 C A FALSE 50581 RADid_0078132_depth_33:0000000081 C T FALSE How do I convert my matrix to a a nucleotide genotype matrix? I would like my data to look something like: Sample 1 Snp 1 T/T Sample 2 Snp 1 T/A Sample 3 Snp 1 A/A etc. I noticed that people have been creating their Snp matrix from a txt file. I am importing from a vcf file and I can’t figure out how to get the desired format. Thanks a lot for your kind help, Tereza [[alternative HTML version deleted]]
SNP convert snpMatrix SNP convert snpMatrix • 2.7k views
ADD COMMENT
0
Entering edit mode
@vincent-j-carey-jr-4
Last seen 3 days ago
United States
On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV < jezkovat@unlv.nevada.edu> wrote: > I created a Snp matrix using a genotypeToSnpMatrix command > > The matrix looks like this: > > > mat > $genotypes > A SnpMatrix with 10 rows and 50581 columns > Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ... > Rodriguez_Lizard_623_sequence_1_pileup.txt > Col names: RADid_0000001_depth_39:0000000058 ... > RADid_0078132_depth_33:0000000081 > > $map > DataFrame with 50581 rows and 4 columns > snp.names allele.1 allele.2 > ignore > <character> <dnastringset> <dnastringsetlist> > <logical> > 1 RADid_0000001_depth_39:0000000058 C A > FALSE > 2 RADid_0000003_depth_152:0000000007 G A,T > TRUE > 3 RADid_0000003_depth_152:0000000034 G T,C > TRUE > 4 RADid_0000003_depth_152:0000000046 T C > FALSE > 5 RADid_0000010_depth_57:0000000010 T C > FALSE > ... ... ... ... > ... > 50577 RADid_0078129_depth_31:0000000062 C T > FALSE > 50578 RADid_0078132_depth_33:0000000025 T C > FALSE > 50579 RADid_0078132_depth_33:0000000033 C T > FALSE > 50580 RADid_0078132_depth_33:0000000044 C A > FALSE > 50581 RADid_0078132_depth_33:0000000081 C T > FALSE > > > How do I convert my matrix to a a nucleotide genotype matrix? > I would like my data to look something like: > > Sample 1 Snp 1 T/T > Sample 2 Snp 1 T/A > Sample 3 Snp 1 A/A etc. > > as(mat, "character") will yield a conforming matrix with "A/B" notation in each cell. the representation is only useful for diallelic SNP from ?read.snps.long For nucleotide coding, nucleotides are assigned to the nominal alleles in alphabetic order. Thus, for a SNP with either "T" and "A" nucleotides in the variant position, the nominal genotypes AA, AB and BB will refer to A/A, A/T and T/T. provided this convention is observed in the VCF translation, you could use string substitutions to transform the A/B notation to nucleotide notation. oftentimes this is not really needed. > I noticed that people have been creating their Snp matrix from a txt file. > I am importing from a vcf file and I can’t figure out how to get the > desired format. > > Thanks a lot for your kind help, > > Tereza > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
0
Entering edit mode
Hi Vincent, Thanks so much for your help. I did what you suggested but the resulting matrix has only two values: NA (majority of cells) or A/B. There is no A/A or B/B. SO I know I did something wrong. This is my entire code: > fl <- system.file("extdata", "Lizard_std.vcf", package="VariantAnnotation") > vcf <- readVcf(fl, "1342gen_fasta") > mat <- genotypeToSnpMatrix(vcf) Warning messages: 1: In .local(x, ...) : variants with >1 ALT allele are set to NA 2: In .local(x, ...) : non-single nucleotide variations are set to NA 3: In .local(x, ...) : non-diploid variants are set to NA > MAT <- as(mat$genotypes, "character") > MAT_TRAN <- t(MAT) > write.csv(MAT_TRAN, "mat.csv") The resulting matrix (when opened in excel looks like this. I assume it has something to do with the warning messages I received but I am not sure what to do. 1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA Will be very grateful for any advice. Thanks, Tereza From: Vincent Carey Sent: Friday, May 10, 2013 7:51 PM To: Tereza Jezkova UNLV Cc: bioconductor@r-project.org Subject: Re: [BioC] how to convert genotype snp matrix to nucleotide genotypes? On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV <jezkovat@unlv.nevada.edu> wrote: I created a Snp matrix using a genotypeToSnpMatrix command The matrix looks like this: > mat $genotypes A SnpMatrix with 10 rows and 50581 columns Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ... Rodriguez_Lizard_623_sequence_1_pileup.txt Col names: RADid_0000001_depth_39:0000000058 ... RADid_0078132_depth_33:0000000081 $map DataFrame with 50581 rows and 4 columns snp.names allele.1 allele.2 ignore <character> <dnastringset> <dnastringsetlist> <logical> 1 RADid_0000001_depth_39:0000000058 C A FALSE 2 RADid_0000003_depth_152:0000000007 G A,T TRUE 3 RADid_0000003_depth_152:0000000034 G T,C TRUE 4 RADid_0000003_depth_152:0000000046 T C FALSE 5 RADid_0000010_depth_57:0000000010 T C FALSE ... ... ... ... ... 50577 RADid_0078129_depth_31:0000000062 C T FALSE 50578 RADid_0078132_depth_33:0000000025 T C FALSE 50579 RADid_0078132_depth_33:0000000033 C T FALSE 50580 RADid_0078132_depth_33:0000000044 C A FALSE 50581 RADid_0078132_depth_33:0000000081 C T FALSE How do I convert my matrix to a a nucleotide genotype matrix? I would like my data to look something like: Sample 1 Snp 1 T/T Sample 2 Snp 1 T/A Sample 3 Snp 1 A/A etc. as(mat, "character") will yield a conforming matrix with "A/B" notation in each cell. the representation is only useful for diallelic SNP from ?read.snps.long For nucleotide coding, nucleotides are assigned to the nominal alleles in alphabetic order. Thus, for a SNP with either "T" and "A" nucleotides in the variant position, the nominal genotypes AA, AB and BB will refer to A/A, A/T and T/T. provided this convention is observed in the VCF translation, you could use string substitutions to transform the A/B notation to nucleotide notation. oftentimes this is not really needed. I noticed that people have been creating their Snp matrix from a txt file. I am importing from a vcf file and I can’t figure out how to get the desired format. Thanks a lot for your kind help, Tereza [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV < jezkovat@unlv.nevada.edu> wrote: > Hi Vincent, > > Thanks so much for your help. I did what you suggested but the resulting > matrix has only two values: NA (majority of cells) or A/B. There is no A/A > or B/B. SO I know I did something wrong. This is my entire code: > > > fl <- system.file("extdata", "Lizard_std.vcf", > package="VariantAnnotation") > > vcf <- readVcf(fl, "1342gen_fasta") > > mat <- genotypeToSnpMatrix(vcf) > Warning messages: > 1: In .local(x, ...) : variants with >1 ALT allele are set to NA > 2: In .local(x, ...) : non-single nucleotide variations are set to NA > 3: In .local(x, ...) : non-diploid variants are set to NA > could it be that your genome does not have many diallelic loci? you may need to work directly with the genotype data and not the SnpMatrix representation. > > MAT <- as(mat$genotypes, "character") > > MAT_TRAN <- t(MAT) > > write.csv(MAT_TRAN, "mat.csv") > > > The resulting matrix (when opened in excel looks like this. I assume it > has something to do with the warning messages I received but I am not sure > what to do. > 1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA > NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA > NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA > RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA > RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA > RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA > RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA > RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA > RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA > > > Will be very grateful for any advice. > > Thanks, Tereza > > > > > > *From:* Vincent Carey <stvjc@channing.harvard.edu> > *Sent:* Friday, May 10, 2013 7:51 PM > *To:* Tereza Jezkova UNLV <jezkovat@unlv.nevada.edu> > *Cc:* bioconductor@r-project.org > *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide > genotypes? > > > > On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV < > jezkovat@unlv.nevada.edu> wrote: > >> I created a Snp matrix using a genotypeToSnpMatrix command >> >> The matrix looks like this: >> >> > mat >> $genotypes >> A SnpMatrix with 10 rows and 50581 columns >> Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ... >> Rodriguez_Lizard_623_sequence_1_pileup.txt >> Col names: RADid_0000001_depth_39:0000000058 ... >> RADid_0078132_depth_33:0000000081 >> >> $map >> DataFrame with 50581 rows and 4 columns >> snp.names allele.1 >> allele.2 ignore >> <character> <dnastringset> >> <dnastringsetlist> <logical> >> 1 RADid_0000001_depth_39:0000000058 C >> A FALSE >> 2 RADid_0000003_depth_152:0000000007 G >> A,T TRUE >> 3 RADid_0000003_depth_152:0000000034 G >> T,C TRUE >> 4 RADid_0000003_depth_152:0000000046 T >> C FALSE >> 5 RADid_0000010_depth_57:0000000010 T >> C FALSE >> ... ... ... >> ... ... >> 50577 RADid_0078129_depth_31:0000000062 C >> T FALSE >> 50578 RADid_0078132_depth_33:0000000025 T >> C FALSE >> 50579 RADid_0078132_depth_33:0000000033 C >> T FALSE >> 50580 RADid_0078132_depth_33:0000000044 C >> A FALSE >> 50581 RADid_0078132_depth_33:0000000081 C >> T FALSE >> >> >> How do I convert my matrix to a a nucleotide genotype matrix? >> I would like my data to look something like: >> >> Sample 1 Snp 1 T/T >> Sample 2 Snp 1 T/A >> Sample 3 Snp 1 A/A etc. >> >> > > as(mat, "character") will yield a conforming matrix with "A/B" notation > in each cell. the > representation is only useful for diallelic SNP > > from ?read.snps.long > > For nucleotide coding, nucleotides are assigned to the nominal alleles > in alphabetic order. Thus, for a SNP with either "T" and "A" > nucleotides in the variant position, > the nominal genotypes AA, AB and BB will refer to A/A, > A/T and T/T. > > provided this convention is observed in the VCF translation, you could use > string substitutions > to transform the A/B notation to nucleotide notation. oftentimes this is > not really needed. > > >> I noticed that people have been creating their Snp matrix from a txt >> file. I am importing from a vcf file and I can’t figure out how to get the >> desired format. >> >> Thanks a lot for your kind help, >> >> Tereza >> [[alternative HTML version deleted]] >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
You might have better luck using the methods in VariantAnnotation to access the genotype matrix, rather than converting to SnpMatrix. For example: fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") geno <- geno(vcf)$GT geno NA00001 NA00002 NA00003 rs6054257 "0|0" "1|0" "1/1" 20:17330 "0|0" "0|1" "0/0" rs6040355 "1|2" "2|1" "2/2" 20:1230237 "0|0" "0|0" "0/0" microsat1 "0/1" "0/2" "1/1" ref <- ref(vcf) alt <- alt(vcf) geno2 <- geno for (i in 1:nrow(geno)) { geno2[i,] <- gsub("0", as.character(ref[i]), geno[i,]) for (j in 1:elementLengths(alt[i])) { geno2[i,] <- gsub(as.character(j), as.character(alt[[i]][j]), geno2[i,]) } } geno2 NA00001 NA00002 NA00003 rs6054257 "G|G" "A|G" "A/A" 20:17330 "T|T" "T|A" "T/T" rs6040355 "G|T" "T|G" "T/T" 20:1230237 "T|T" "T|T" "T/T" microsat1 "GTC/G" "GTC/GTCT" "G/G" There is probably a much more efficient way to do the string substitution than my nested for loop, but this gives you the idea. Stephanie On 5/12/13 6:47 PM, Vincent Carey wrote: > On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV < > jezkovat at unlv.nevada.edu> wrote: > >> Hi Vincent, >> >> Thanks so much for your help. I did what you suggested but the resulting >> matrix has only two values: NA (majority of cells) or A/B. There is no A/A >> or B/B. SO I know I did something wrong. This is my entire code: >> >>> fl <- system.file("extdata", "Lizard_std.vcf", >> package="VariantAnnotation") >>> vcf <- readVcf(fl, "1342gen_fasta") >>> mat <- genotypeToSnpMatrix(vcf) >> Warning messages: >> 1: In .local(x, ...) : variants with >1 ALT allele are set to NA >> 2: In .local(x, ...) : non-single nucleotide variations are set to NA >> 3: In .local(x, ...) : non-diploid variants are set to NA >> > > could it be that your genome does not have many diallelic loci? you may > need to work directly with > the genotype data and not the SnpMatrix representation. > > >>> MAT <- as(mat$genotypes, "character") >>> MAT_TRAN <- t(MAT) >>> write.csv(MAT_TRAN, "mat.csv") >> >> >> The resulting matrix (when opened in excel looks like this. I assume it >> has something to do with the warning messages I received but I am not sure >> what to do. >> 1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA >> NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA >> NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA >> RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA >> RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA >> RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA >> RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA >> RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA >> RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA >> >> >> Will be very grateful for any advice. >> >> Thanks, Tereza >> >> >> >> >> >> *From:* Vincent Carey <stvjc at="" channing.harvard.edu=""> >> *Sent:* Friday, May 10, 2013 7:51 PM >> *To:* Tereza Jezkova UNLV <jezkovat at="" unlv.nevada.edu=""> >> *Cc:* bioconductor at r-project.org >> *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide >> genotypes? >> >> >> >> On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV < >> jezkovat at unlv.nevada.edu> wrote: >> >>> I created a Snp matrix using a genotypeToSnpMatrix command >>> >>> The matrix looks like this: >>> >>>> mat >>> $genotypes >>> A SnpMatrix with 10 rows and 50581 columns >>> Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ... >>> Rodriguez_Lizard_623_sequence_1_pileup.txt >>> Col names: RADid_0000001_depth_39:0000000058 ... >>> RADid_0078132_depth_33:0000000081 >>> >>> $map >>> DataFrame with 50581 rows and 4 columns >>> snp.names allele.1 >>> allele.2 ignore >>> <character> <dnastringset> >>> <dnastringsetlist> <logical> >>> 1 RADid_0000001_depth_39:0000000058 C >>> A FALSE >>> 2 RADid_0000003_depth_152:0000000007 G >>> A,T TRUE >>> 3 RADid_0000003_depth_152:0000000034 G >>> T,C TRUE >>> 4 RADid_0000003_depth_152:0000000046 T >>> C FALSE >>> 5 RADid_0000010_depth_57:0000000010 T >>> C FALSE >>> ... ... ... >>> ... ... >>> 50577 RADid_0078129_depth_31:0000000062 C >>> T FALSE >>> 50578 RADid_0078132_depth_33:0000000025 T >>> C FALSE >>> 50579 RADid_0078132_depth_33:0000000033 C >>> T FALSE >>> 50580 RADid_0078132_depth_33:0000000044 C >>> A FALSE >>> 50581 RADid_0078132_depth_33:0000000081 C >>> T FALSE >>> >>> >>> How do I convert my matrix to a a nucleotide genotype matrix? >>> I would like my data to look something like: >>> >>> Sample 1 Snp 1 T/T >>> Sample 2 Snp 1 T/A >>> Sample 3 Snp 1 A/A etc. >>> >>> >> >> as(mat, "character") will yield a conforming matrix with "A/B" notation >> in each cell. the >> representation is only useful for diallelic SNP >> >> from ?read.snps.long >> >> For nucleotide coding, nucleotides are assigned to the nominal alleles >> in alphabetic order. Thus, for a SNP with either "T" and "A" >> nucleotides in the variant position, >> the nominal genotypes AA, AB and BB will refer to A/A, >> A/T and T/T. >> >> provided this convention is observed in the VCF translation, you could use >> string substitutions >> to transform the A/B notation to nucleotide notation. oftentimes this is >> not really needed. >> >> >>> I noticed that people have been creating their Snp matrix from a txt >>> file. I am importing from a vcf file and I can?t figure out how to get the >>> desired format. >>> >>> Thanks a lot for your kind help, >>> >>> Tereza >>> [[alternative HTML version deleted]] >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY
0
Entering edit mode
On 05/13/2013 08:58 AM, Stephanie M. Gogarten wrote: > You might have better luck using the methods in VariantAnnotation to access the > genotype matrix, rather than converting to SnpMatrix. For example: > > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > geno <- geno(vcf)$GT > geno > NA00001 NA00002 NA00003 > rs6054257 "0|0" "1|0" "1/1" > 20:17330 "0|0" "0|1" "0/0" > rs6040355 "1|2" "2|1" "2/2" > 20:1230237 "0|0" "0|0" "0/0" > microsat1 "0/1" "0/2" "1/1" > > ref <- ref(vcf) > alt <- alt(vcf) > geno2 <- geno > for (i in 1:nrow(geno)) { > geno2[i,] <- gsub("0", as.character(ref[i]), geno[i,]) > for (j in 1:elementLengths(alt[i])) { > geno2[i,] <- gsub(as.character(j), > as.character(alt[[i]][j]), > geno2[i,]) > } > } > geno2 > NA00001 NA00002 NA00003 > rs6054257 "G|G" "A|G" "A/A" > 20:17330 "T|T" "T|A" "T/T" > rs6040355 "G|T" "T|G" "T/T" > 20:1230237 "T|T" "T|T" "T/T" > microsat1 "GTC/G" "GTC/GTCT" "G/G" > > There is probably a much more efficient way to do the string substitution than > my nested for loop, but this gives you the idea. Maybe the following gets away from iterating? The idea is to update the "0|0" elements in a vectorized fashion, then the "0|1", then the "1|1" (dealing with elementLength(alt(vcf)) != 1 is left as an exercise...!). geno2geno <- function(vcf) { ## standardize idx <- elementLengths(alt(vcf)) == 1L if (!all(idx)) { warning("only coercing single-element 'alt' records") vcf <- vcf[idx] } geno(vcf)$GT <- sub("/", "|", geno(vcf)$GT) ## replacement rule .replace <- function(GT, patterns, alleleA, alleleB) { repl <- paste(alleleA, alleleB, sep="|") # replacement value idx0 <- which(GT %in% patterns) # update these entries in GT... idx <- (idx0 - 1L) %% nrow(GT) + 1L # ...with these entries in repl GT[idx0] <- repl[idx] GT } ## replace GT <- geno(vcf)$GT GT <- .replace(GT, "0|0", ref(vcf), ref(vcf)) ## FIXME: loop over subset with alt(vcf) > 1 ? GT <- .replace(GT, c("0|1", "1|0"), ref(vcf), unlist(alt(vcf))) GT <- .replace(GT, "1|1", unlist(alt(vcf)), unlist(alt(vcf))) geno(vcf)$GT <- GT vcf } Mostly I tried not to separate out, say, unlist(alt(vcf)) into a separate variable, even though this might be more efficient / less typing, because it opens the door for mixing up the ordering of genotype and alt rows. > geno(geno2geno(vcf))$GT NA00001 NA00002 NA00003 rs6054257 "G|G" "G|A" "A|A" 20:17330 "T|T" "T|A" "T|T" 20:1230237 "T|T" "T|T" "T|T" Warning message: In geno2geno(vcf) : only coercing single-element 'alt' records Martin > > Stephanie > > On 5/12/13 6:47 PM, Vincent Carey wrote: >> On Sat, May 11, 2013 at 9:18 PM, Tereza Jezkova UNLV < >> jezkovat at unlv.nevada.edu> wrote: >> >>> Hi Vincent, >>> >>> Thanks so much for your help. I did what you suggested but the resulting >>> matrix has only two values: NA (majority of cells) or A/B. There is no A/A >>> or B/B. SO I know I did something wrong. This is my entire code: >>> >>>> fl <- system.file("extdata", "Lizard_std.vcf", >>> package="VariantAnnotation") >>>> vcf <- readVcf(fl, "1342gen_fasta") >>>> mat <- genotypeToSnpMatrix(vcf) >>> Warning messages: >>> 1: In .local(x, ...) : variants with >1 ALT allele are set to NA >>> 2: In .local(x, ...) : non-single nucleotide variations are set to NA >>> 3: In .local(x, ...) : non-diploid variants are set to NA >>> >> >> could it be that your genome does not have many diallelic loci? you may >> need to work directly with >> the genotype data and not the SnpMatrix representation. >> >> >>>> MAT <- as(mat$genotypes, "character") >>>> MAT_TRAN <- t(MAT) >>>> write.csv(MAT_TRAN, "mat.csv") >>> >>> >>> The resulting matrix (when opened in excel looks like this. I assume it >>> has something to do with the warning messages I received but I am not sure >>> what to do. >>> 1 2 3 4 5 6 7 8 9 10 RADid_0000001_depth_39:0000000058 NA NA NA NA NA >>> NA NA NA A/B NA RADid_0000003_depth_152:0000000007 NA NA NA NA NA NA NA NA >>> NA NA RADid_0000003_depth_152:0000000034 NA NA NA NA NA NA NA NA NA NA >>> RADid_0000003_depth_152:0000000046 NA NA NA A/B NA A/B NA NA NA NA >>> RADid_0000010_depth_57:0000000010 NA NA NA NA NA NA NA A/B NA NA >>> RADid_0000010_depth_57:0000000019 A/B A/B NA NA NA NA NA NA NA NA >>> RADid_0000010_depth_57:0000000020 A/B A/B A/B A/B NA NA NA A/B NA NA >>> RADid_0000010_depth_57:0000000059 A/B A/B NA A/B NA NA A/B NA NA NA >>> RADid_0000010_depth_57:0000000062 NA NA A/B NA NA NA NA NA NA NA >>> >>> >>> Will be very grateful for any advice. >>> >>> Thanks, Tereza >>> >>> >>> >>> >>> >>> *From:* Vincent Carey <stvjc at="" channing.harvard.edu=""> >>> *Sent:* Friday, May 10, 2013 7:51 PM >>> *To:* Tereza Jezkova UNLV <jezkovat at="" unlv.nevada.edu=""> >>> *Cc:* bioconductor at r-project.org >>> *Subject:* Re: [BioC] how to convert genotype snp matrix to nucleotide >>> genotypes? >>> >>> >>> >>> On Fri, May 10, 2013 at 10:02 PM, Tereza Jezkova UNLV < >>> jezkovat at unlv.nevada.edu> wrote: >>> >>>> I created a Snp matrix using a genotypeToSnpMatrix command >>>> >>>> The matrix looks like this: >>>> >>>>> mat >>>> $genotypes >>>> A SnpMatrix with 10 rows and 50581 columns >>>> Row names: Rodriguez_Lizard_1240_sequence_1_pileup.txt ... >>>> Rodriguez_Lizard_623_sequence_1_pileup.txt >>>> Col names: RADid_0000001_depth_39:0000000058 ... >>>> RADid_0078132_depth_33:0000000081 >>>> >>>> $map >>>> DataFrame with 50581 rows and 4 columns >>>> snp.names allele.1 >>>> allele.2 ignore >>>> <character> <dnastringset> >>>> <dnastringsetlist> <logical> >>>> 1 RADid_0000001_depth_39:0000000058 C >>>> A FALSE >>>> 2 RADid_0000003_depth_152:0000000007 G >>>> A,T TRUE >>>> 3 RADid_0000003_depth_152:0000000034 G >>>> T,C TRUE >>>> 4 RADid_0000003_depth_152:0000000046 T >>>> C FALSE >>>> 5 RADid_0000010_depth_57:0000000010 T >>>> C FALSE >>>> ... ... ... >>>> ... ... >>>> 50577 RADid_0078129_depth_31:0000000062 C >>>> T FALSE >>>> 50578 RADid_0078132_depth_33:0000000025 T >>>> C FALSE >>>> 50579 RADid_0078132_depth_33:0000000033 C >>>> T FALSE >>>> 50580 RADid_0078132_depth_33:0000000044 C >>>> A FALSE >>>> 50581 RADid_0078132_depth_33:0000000081 C >>>> T FALSE >>>> >>>> >>>> How do I convert my matrix to a a nucleotide genotype matrix? >>>> I would like my data to look something like: >>>> >>>> Sample 1 Snp 1 T/T >>>> Sample 2 Snp 1 T/A >>>> Sample 3 Snp 1 A/A etc. >>>> >>>> >>> >>> as(mat, "character") will yield a conforming matrix with "A/B" notation >>> in each cell. the >>> representation is only useful for diallelic SNP >>> >>> from ?read.snps.long >>> >>> For nucleotide coding, nucleotides are assigned to the nominal alleles >>> in alphabetic order. Thus, for a SNP with either "T" and "A" >>> nucleotides in the variant position, >>> the nominal genotypes AA, AB and BB will refer to A/A, >>> A/T and T/T. >>> >>> provided this convention is observed in the VCF translation, you could use >>> string substitutions >>> to transform the A/B notation to nucleotide notation. oftentimes this is >>> not really needed. >>> >>> >>>> I noticed that people have been creating their Snp matrix from a txt >>>> file. I am importing from a vcf file and I can?t figure out how to get the >>>> desired format. >>>> >>>> Thanks a lot for your kind help, >>>> >>>> Tereza >>>> [[alternative HTML version deleted]] >>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >>> >> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY

Login before adding your answer.

Traffic: 270 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6