Question: smlSet
0
9.7 years ago by
claire pujoll30 wrote:
Dear Prof Carey, dear Bioconductor user, Could you please tel me where can I find a clear tutorial on how creating an smlSet class so I can use GGtools (I read the vignettes but still not able to find clear information regarding snp data). For gene expression I can create an eSet but how can I create SNP data slot from my genotype file. I know That I should use snp.matrix but could we have an example (not based on HapMap data). also it would be useful to clarify regarding racExSet and smlist as this is realy very confusing I have 3 matrices: gene expression (log2 scale) indiv1 indiv2 ... indiv 1000 probe1 ..... probe2 probe 48,000 ..... SNP data SNP1 SNP2 SNP 700,000 indiv1 AA AB BB indiv2 ..... indiv1000 ..... ....... ...... and pheno data thanks for your help. best regards, Claire Pujoll, PhD [[alternative HTML version deleted]]
snp hapmap ggtools • 580 views
modified 9.7 years ago by Vincent J. Carey, Jr.6.3k • written 9.7 years ago by claire pujoll30
0
9.7 years ago by
United States
Vincent J. Carey, Jr.6.3k wrote:
Dear Claire, If you are seeing references to racExSet, you are looking at an outdated version of the GGtools package. The GGBase package includes a function make_smlSet that helps create the structure. As you note, the snpMatrix package defines the snp.matrix class. The snpMatrix package has functions that generate snp.matrix instances from ped format, hapmap-formatted data, plink-formatted data, or chiamo-formatted data. If you do not use any of these formats, you will have to create a list of snp.matrix objects, each element created according to specifications in the snpMatrix package. If you have up-to-date versions of the software and would like further guidance, provide a small excerpt of your genotyping data file and I will provide illustration of how to create the associated smlSet, using some dummy expression and phenotype data. Clearly the most convenient approach will be based on a conventional format for genotyping data such as those enumerated above. On Wed, Mar 31, 2010 at 8:29 PM, claire pujoll <cpujoll@gmail.com> wrote: > Dear Prof Carey, dear Bioconductor user, > > Could you please tel me where can I find a clear tutorial on how creating > an smlSet class so I can use GGtools (I read the vignettes but still not > able to find clear information regarding snp data). For gene expression I > can create an eSet but how can I create SNP data slot from my genotype file. > I know That I should use snp.matrix but could we have an example (not based > on HapMap data). > > also it would be useful to clarify regarding racExSet and smlist as this is > realy very confusing > > > I have 3 matrices: > > gene expression (log2 scale) > indiv1 indiv2 ... indiv 1000 > > probe1 ..... > > probe2 > > probe 48,000 ..... > > > SNP data > SNP1 SNP2 SNP 700,000 > indiv1 AA AB BB > indiv2 > ..... > indiv1000 ..... ....... ...... > > and pheno data > > thanks for your help. > > > best regards, > > Claire Pujoll, PhD > > > [[alternative HTML version deleted]]
0
9.7 years ago by
United States
Vincent J. Carey, Jr.6.3k wrote:
Please read the posting guide and do not break the thread from the list. More will learn from the dialogue, and you may have the benefit of expertise of other readers who know more about SNP file formats than I do. Clearly this is not a problem with GGtools. You have a ped file that read.snps.pedfile cannot process. This is one of the drawbacks of using file formats that lack formal validators that can give specific indications on format violations. I cannot see the flaw in the ped file based on your excerpt, but see below. On Wed, Mar 31, 2010 at 10:38 PM, claire pujoll <cpujoll@gmail.com> wrote: > > Dear Prof Carey, > > many thanks for your time and quick response. > > I read your two papers discussing the data structures for genetics of gene > expression. I think it will be useful to add a note in the new manuel to > tell the reader that smlSet has be replaced by racExSet as there as several > tutorial and vignette on the web discussing racExSet. > I will revise the vignette to indicate that the former racExSet class is superseded by the smlSet class. > > # GWAS data > setwd("home/hamming/SNPmatrix") > map <- read.pedfile.map("Filt_CommonSNPChrom24.map") > ped <- read.snps.pedfile("Filt_CommonSNPChrom24.ped") > > > # Gene expression data > library("wp5Express") > data(wp5Express) > es <- exprs(wp5Express) > > ## Create smlSet class > smlSetTest <- make_smlSet(es, # Gene expression > sml, # SNP > data > organism = "Homo sapiens") > > > I tried to read SNP data but I had an error. Please note that I have PED > and MAP by chromosome and > also .txt genotype file with SNP as columon and row as individuals. > > > ped <- read.snps.pedfile("Filt_CommonSNPChrom24.ped") > Found accompanying map file, reading it first > Reading Filt_CommonSNPChrom24.ped ... > Can take a while... > Found 32 fields in the header, i.e. 13 snps, hopefully... > malformed input line: [CXMA-01-001 CXMA-01-001 0 0 > 1 2 C T A C C C T T G G A G G T G > T C T G G C T C C A C > ...] > Error in read.snps.pedfile("Filt_CommonSNPChrom24.ped") : > ped file invalid - see error message above > > You must try to make a valid ped file for your data. Here is an example of a valid one: 1 1 0 0 1 1 A A G T 2 1 0 0 1 1 A C T G 3 1 0 0 1 1 C C G G 4 1 0 0 1 2 A C T T 5 1 0 0 1 2 C C G T 6 1 0 0 1 2 C C T T it is test.ped from the plink distribution It seems to me that having hyphenated text in the second column, as in your example, may be problematic. the associated map file is: 1 snp1 0 1 1 snp2 0 2 if we apply read.snps.pedfile to test.ped, we see > mys = read.snps.pedfile("test.ped") Found accompanying map file, reading it first Reading test.ped ... Can take a while... Found 10 fields in the header, i.e. 2 snps, hopefully... last line [5] : 6 1 0 0 1 2 C C T... EOF reached after 6 samples Read 6 samples from input, now converting... ...done, 6 samples with 2 snps > mys $snp.data A snp.matrix with 6 rows and 2 columns Row names: Family.1.Individual.1 ... Family.6.Individual.1 Col names: snp1 ... snp2$subject.support Family Member Father Mother Sex Affected Family.1.Individual.1 1 1 NA NA Male Unaffected Family.2.Individual.1 2 1 NA NA Male Unaffected Family.3.Individual.1 3 1 NA NA Male Unaffected Family.4.Individual.1 4 1 NA NA Male Affected Family.5.Individual.1 5 1 NA NA Male Affected Family.6.Individual.1 6 1 NA NA Male Affected $snp.support snp.names position chromosome assignment snp1 snp1 1 1 A/C snp2 snp2 2 1 G/T Let us make an smlSet using this and related information. First we need an ExpressionSet instance ex = matrix(rnorm(12), nr=2) # fake expression data rownames(ex) = c("A", "B") es = new("ExpressionSet", exprs=ex) es$x = rnorm(6) # fake covariate of course more metadata should be supplied, but at this point > es ExpressionSet (storageMode: lockedEnvironment) assayData: 2 features, 6 samples element names: exprs protocolData: none phenoData sampleNames: 1, 2, ..., 6 (6 total) varLabels and varMetadata description: x: NA featureData: none experimentData: use 'experimentData(object)' Annotation: now make the smlSet: msl = make_smlSet(es, list(1=mys$snp.data)) sampleNames(msl) = rownames(mys$snp.data) we need a name on the snp matrix list elements, to enumerate chromosomes -- you can use any relevant tokenization. for most examples, we have one named element per chromosome, but other organizations (such as a single snp.matrix that is genome-wide) are possible once this is in hand, we can perform tests: > f1 = gwSnpTests(probeId("A")~x, msl, chrnum("1")) Warning messages: 1: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible 2: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible > topSnps(f1) p.val snp2 0.1470954 snp1 0.9257000 ignore the warnings > sessionInfo() R version 2.11.0 Under development (unstable) (2010-03-24 r51388) x86_64-apple-darwin10.2.0 locale: [1] C attached base packages: [1] splines stats graphics grDevices datasets tools utils [8] methods base other attached packages: [1] GGtools_3.5.71 ff_2.1-2 bit_1.1-3 [4] annotate_1.25.0 AnnotationDbi_1.9.6 GGBase_3.7.7 [7] RSQLite_0.8-4 DBI_0.2-5 Biobase_2.7.5 [10] snpMatrix_1.11.2 survival_2.35-7 weaver_1.13.0 [13] codetools_0.2-2 digest_0.4.1 loaded via a namespace (and not attached): [1] BSgenome_1.15.18 Biostrings_2.15.25 GSEABase_1.9.0 [4] GenomicRanges_0.0.9 IRanges_1.5.74 RCurl_1.3-0 [7] XML_2.6-0 annaffy_1.19.0 graph_1.25.1 [10] multicore_0.1-3 rtracklayer_1.7.12 xtable_1.5-5 Warning messages: 1: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible 2: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible > > > many thanks for your help, > > best regards, > Claire > > > > > > > > On Thu, Apr 1, 2010 at 3:16 AM, Vincent Carey <stvjc@channing.harvard.edu>wrote: > >> Dear Claire, >> >> If you are seeing references to racExSet, you are looking at an outdated >> version of the GGtools package. The GGBase package includes a function >> make_smlSet that helps create the structure. As you note, the snpMatrix >> package defines the snp.matrix class. The snpMatrix package has functions >> that generate snp.matrix instances from ped format, hapmap- formatted data, >> plink-formatted data, or chiamo-formatted data. If you do not use any of >> these formats, you will have to create a list of snp.matrix objects, each >> element created according to specifications in the snpMatrix package. >> >> If you have up-to-date versions of the software and would like further >> guidance, provide a small excerpt of your genotyping data file and I will >> provide illustration of how to create the associated smlSet, using some >> dummy expression and phenotype data. Clearly the most convenient approach >> will be based on a conventional format for genotyping data such as those >> enumerated above. >> >> >> On Wed, Mar 31, 2010 at 8:29 PM, claire pujoll <cpujoll@gmail.com> wrote: >> >>> Dear Prof Carey, dear Bioconductor user, >>> >>> Could you please tel me where can I find a clear tutorial on how >>> creating an smlSet class so I can use GGtools (I read the vignettes but >>> still not able to find clear information regarding snp data). For gene >>> expression I can create an eSet but how can I create SNP data slot from my >>> genotype file. I know That I should use snp.matrix but could we have an >>> example (not based on HapMap data). >>> >>> also it would be useful to clarify regarding racExSet and smlist as this >>> is realy very confusing >>> >>> >>> I have 3 matrices: >>> >>> gene expression (log2 scale) >>> indiv1 indiv2 ... indiv 1000 >>> >>> probe1 ..... >>> >>> probe2 >>> >>> probe 48,000 ..... >>> >>> >>> SNP data >>> SNP1 SNP2 SNP 700,000 >>> indiv1 AA AB BB >>> indiv2 >>> ..... >>> indiv1000 ..... ....... ...... >>> >>> and pheno data >>> >>> thanks for your help. >>> >>> >>> best regards, >>> >>> Claire Pujoll, PhD >>> >>> >>> >> > [[alternative HTML version deleted]]
Many thanks for your help and aplogies for not cc'ing " bioconductor@stat.math.ethz.ch", I forgot to select reponse to all. I will try to resolve the problem with my PED files. I also found this link which is very useful http://www.biostat.harvard.edu/~carey/copenGG.html best regards, C On Thu, Apr 1, 2010 at 5:43 AM, Vincent Carey <stvjc@channing.harvard.edu>wrote: > Please read the posting guide and do not break the thread from the list. > More will learn from the dialogue, and you may have the benefit of expertise > of other readers who know more about SNP file formats than I do. > > Clearly this is not a problem with GGtools. You have a ped file that > read.snps.pedfile cannot process. This is one of the drawbacks of using > file formats that lack formal validators that can give specific indications > on format violations. I cannot see the flaw in the ped file based on your > excerpt, but see below. > > > On Wed, Mar 31, 2010 at 10:38 PM, claire pujoll <cpujoll@gmail.com> wrote: > >> >> Dear Prof Carey, >> >> many thanks for your time and quick response. >> >> I read your two papers discussing the data structures for genetics of gene >> expression. I think it will be useful to add a note in the new manuel to >> tell the reader that smlSet has be replaced by racExSet as there as several >> tutorial and vignette on the web discussing racExSet. >> > > I will revise the vignette to indicate that the former racExSet class is > superseded by the smlSet class. > > >> >> # GWAS data >> setwd("home/hamming/SNPmatrix") >> map <- read.pedfile.map("Filt_CommonSNPChrom24.map") >> ped <- read.snps.pedfile("Filt_CommonSNPChrom24.ped") >> >> >> # Gene expression data >> library("wp5Express") >> data(wp5Express) >> es <- exprs(wp5Express) >> >> ## Create smlSet class >> smlSetTest <- make_smlSet(es, # Gene expression >> sml, # SNP >> data >> organism = "Homo sapiens") >> >> >> I tried to read SNP data but I had an error. Please note that I have PED >> and MAP by chromosome and >> also .txt genotype file with SNP as columon and row as individuals. >> >> > ped <- read.snps.pedfile("Filt_CommonSNPChrom24.ped") >> Found accompanying map file, reading it first >> Reading Filt_CommonSNPChrom24.ped ... >> Can take a while... >> Found 32 fields in the header, i.e. 13 snps, hopefully... >> malformed input line: [CXMA-01-001 CXMA-01-001 0 0 >> 1 2 C T A C C C T T G G A G G T G >> T C T G G C T C C A C >> ...] >> Error in read.snps.pedfile("Filt_CommonSNPChrom24.ped") : >> ped file invalid - see error message above >> >> > You must try to make a valid ped file for your data. Here is an example of > a valid one: > > 1 1 0 0 1 1 A A G T > 2 1 0 0 1 1 A C T G > 3 1 0 0 1 1 C C G G > 4 1 0 0 1 2 A C T T > 5 1 0 0 1 2 C C G T > 6 1 0 0 1 2 C C T T > > it is test.ped from the plink distribution > > It seems to me that having hyphenated text in the second column, as in your > example, may be problematic. > > the associated map file is: > > 1 snp1 0 1 > 1 snp2 0 2 > > > if we apply read.snps.pedfile to test.ped, we see > > > mys = read.snps.pedfile("test.ped") > > Found accompanying map file, reading it first > Reading test.ped ... > Can take a while... > Found 10 fields in the header, i.e. 2 snps, hopefully... > last line [5] : 6 1 0 0 1 2 C C T... > EOF reached after 6 samples > Read 6 samples from input, now converting... > ...done, 6 samples with 2 snps > > mys > $snp.data > A snp.matrix with 6 rows and 2 columns > Row names: Family.1.Individual.1 ... Family.6.Individual.1 > Col names: snp1 ... snp2 > >$subject.support > Family Member Father Mother Sex Affected > Family.1.Individual.1 1 1 NA NA Male Unaffected > Family.2.Individual.1 2 1 NA NA Male Unaffected > Family.3.Individual.1 3 1 NA NA Male Unaffected > Family.4.Individual.1 4 1 NA NA Male Affected > Family.5.Individual.1 5 1 NA NA Male Affected > Family.6.Individual.1 6 1 NA NA Male Affected > > $snp.support > snp.names position chromosome assignment > snp1 snp1 1 1 A/C > snp2 snp2 2 1 G/T > > Let us make an smlSet using this and related information. First we need an > ExpressionSet > instance > > ex = matrix(rnorm(12), nr=2) # fake expression data > rownames(ex) = c("A", "B") > es = new("ExpressionSet", exprs=ex) > es$x = rnorm(6) # fake covariate > > of course more metadata should be supplied, but at this point > > > es > ExpressionSet (storageMode: lockedEnvironment) > assayData: 2 features, 6 samples > element names: exprs > protocolData: none > phenoData > sampleNames: 1, 2, ..., 6 (6 total) > varLabels and varMetadata description: > x: NA > featureData: none > experimentData: use 'experimentData(object)' > Annotation: > > now make the smlSet: > > msl = make_smlSet(es, list(1=mys$snp.data)) > sampleNames(msl) = rownames(mys$snp.data) > > we need a name on the snp matrix list elements, to enumerate chromosomes -- > you can use any relevant tokenization. for most examples, we have one named > element per chromosome, but other organizations (such as a single snp.matrix > that is genome-wide) are possible > > once this is in hand, we can perform tests: > > > f1 = gwSnpTests(probeId("A")~x, msl, chrnum("1")) > Warning messages: > 1: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible > 2: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible > > topSnps(f1) > p.val > snp2 0.1470954 > snp1 0.9257000 > > ignore the warnings > > > sessionInfo() > R version 2.11.0 Under development (unstable) (2010-03-24 r51388) > x86_64-apple-darwin10.2.0 > > locale: > [1] C > > attached base packages: > [1] splines stats graphics grDevices datasets tools utils > [8] methods base > > other attached packages: > [1] GGtools_3.5.71 ff_2.1-2 bit_1.1-3 > [4] annotate_1.25.0 AnnotationDbi_1.9.6 GGBase_3.7.7 > [7] RSQLite_0.8-4 DBI_0.2-5 Biobase_2.7.5 > [10] snpMatrix_1.11.2 survival_2.35-7 weaver_1.13.0 > [13] codetools_0.2-2 digest_0.4.1 > > loaded via a namespace (and not attached): > [1] BSgenome_1.15.18 Biostrings_2.15.25 GSEABase_1.9.0 > [4] GenomicRanges_0.0.9 IRanges_1.5.74 RCurl_1.3-0 > [7] XML_2.6-0 annaffy_1.19.0 graph_1.25.1 > [10] multicore_0.1-3 rtracklayer_1.7.12 xtable_1.5-5 > Warning messages: > 1: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible > 2: 'DESCRIPTION' file has 'Encoding' field and re-encoding is not possible > > > > > > > >> >> >> many thanks for your help, >> >> best regards, >> Claire >> >> >> >> >> >> >> >> On Thu, Apr 1, 2010 at 3:16 AM, Vincent Carey <stvjc@channing.harvard.edu>> > wrote: >> >>> Dear Claire, >>> >>> If you are seeing references to racExSet, you are looking at an outdated >>> version of the GGtools package. The GGBase package includes a function >>> make_smlSet that helps create the structure. As you note, the snpMatrix >>> package defines the snp.matrix class. The snpMatrix package has functions >>> that generate snp.matrix instances from ped format, hapmap- formatted data, >>> plink-formatted data, or chiamo-formatted data. If you do not use any of >>> these formats, you will have to create a list of snp.matrix objects, each >>> element created according to specifications in the snpMatrix package. >>> >>> If you have up-to-date versions of the software and would like further >>> guidance, provide a small excerpt of your genotyping data file and I will >>> provide illustration of how to create the associated smlSet, using some >>> dummy expression and phenotype data. Clearly the most convenient approach >>> will be based on a conventional format for genotyping data such as those >>> enumerated above. >>> >>> >>> On Wed, Mar 31, 2010 at 8:29 PM, claire pujoll <cpujoll@gmail.com>wrote: >>> >>>> Dear Prof Carey, dear Bioconductor user, >>>> >>>> Could you please tel me where can I find a clear tutorial on how >>>> creating an smlSet class so I can use GGtools (I read the vignettes but >>>> still not able to find clear information regarding snp data). For gene >>>> expression I can create an eSet but how can I create SNP data slot from my >>>> genotype file. I know That I should use snp.matrix but could we have an >>>> example (not based on HapMap data). >>>> >>>> also it would be useful to clarify regarding racExSet and smlist as this >>>> is realy very confusing >>>> >>>> >>>> I have 3 matrices: >>>> >>>> gene expression (log2 scale) >>>> indiv1 indiv2 ... indiv 1000 >>>> >>>> probe1 ..... >>>> >>>> probe2 >>>> >>>> probe 48,000 ..... >>>> >>>> >>>> SNP data >>>> SNP1 SNP2 SNP 700,000 >>>> indiv1 AA AB BB >>>> indiv2 >>>> ..... >>>> indiv1000 ..... ....... ...... >>>> >>>> and pheno data >>>> >>>> thanks for your help. >>>> >>>> >>>> best regards, >>>> >>>> Claire Pujoll, PhD >>>> >>>> >>>> >>> >> > [[alternative HTML version deleted]]