GEOquery and Sample Subsets


Thomas Hampton ▴ 750


I am using GEOquery to establish sample subsets of GEO data; that is, I would like to know which samples are replicates.

I am doing it something like this:

    gds505 <- getGEO("GDS505")
    Columns(gds505)

    > str(Columns(gds505))
    'data.frame': 17 obs. of 4 variables:
     $ sample       : Factor w/ 17 levels "GSM11805","GSM11814",..: 2 4 5 7 9 10 12 14 16 1 ...
     $ disease.state: Factor w/ 2 levels "normal","RCC": 2 2 2 2 2 2 2 2 2 1 ...
     $ individual   : Factor w/ 10 levels "001","005","011",..: 6 4 1 2 3 5 8 9 10 6 ...
     $ description  : chr "Value for GSM11814: C035 Renal Clear Cell Carcinoma U133A; src: Trizol...

The problem I have is that the getGEO command retrieves a rather large object:

    > print(object.size(gds505), units="Mb")
    12.6 Mb

This takes up a lot of time and bandwidth if you plan to do it for thousands of accessions.

Is there a way to retrieve less?

I am happy to use R, Bioconductor, BioPerl, or whatever.

Best,

Tom

GEOquery GEOquery • 1.5k views

ADD COMMENT • link updated 11.6 years ago by Sean Davis 21k • written 11.6 years ago by Thomas Hampton ▴ 750


Sean Davis 21k


On Tue, Jun 4, 2013 at 12:38 PM, Thomas H. Hampton wrote:
> The problem I have is that the getGEO command retrieves a rather large object [...]
>
> Is there a way to retrieve less?

Hi, Tom. Are you saying that you really want just the metadata to start; in other words, you just want the sample information without the expression values?

Sean

ADD COMMENT • link 11.6 years ago Sean Davis 21k


Exactly! Thanks.

ADD REPLY • link 11.6 years ago Thomas Hampton ▴ 750


On Tue, Jun 4, 2013 at 1:14 PM, Thomas H. Hampton wrote:
> Exactly!

This might help:

http://www.bioconductor.org/packages/release/bioc/html/GEOmetadb.html

Let us know if you have questions.

Sean

ADD REPLY • link 11.6 years ago Sean Davis 21k
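
As a rough illustration of the GEOmetadb suggestion, the sketch below pulls the sample subsets for one GDS accession out of the package's SQLite file instead of downloading the expression data. The helper name `gds_sample_subsets` is hypothetical, and the code assumes that the `gds_subset` table has the columns `gds`, `type`, `description`, and `sample_id` (a comma-separated list of GSM accessions); check the column names with `dbListFields(con, "gds_subset")` before relying on it.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: one row per (subset, sample) for a GDS accession.
# Assumes gds_subset has columns gds, type, description, and sample_id,
# where sample_id is a comma-separated string of GSM accessions.
gds_sample_subsets <- function(con, gds_id) {
  res <- dbGetQuery(
    con,
    "SELECT gds, type, description, sample_id FROM gds_subset WHERE gds = ?",
    params = list(gds_id)
  )
  # Split each comma-separated sample list and repeat the subset
  # annotation once per sample.
  samples <- strsplit(res$sample_id, ",", fixed = TRUE)
  n <- lengths(samples)
  data.frame(
    gds = rep(res$gds, n),
    type = rep(res$type, n),
    description = rep(res$description, n),
    sample = trimws(as.character(unlist(samples))),
    stringsAsFactors = FALSE
  )
}

# Usage against the real database (one-time download of a large file):
# library(GEOmetadb)
# sqlfile <- getSQLiteFile()            # fetches and unpacks GEOmetadb.sqlite
# con <- dbConnect(SQLite(), sqlfile)
# gds_sample_subsets(con, "GDS505")
# dbDisconnect(con)
```

The database is downloaded once and then queried locally, so looking up thousands of accessions should cost no further bandwidth, at the price of the metadata being only as current as the downloaded file.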


This looks totally cool.

Is there a place where one can view the schema of the relational db?

In any case -- Thanks tons!

Tom

ADD REPLY • link 11.6 years ago Thomas Hampton ▴ 750


On Tue, Jun 4, 2013 at 2:02 PM, Thomas H. Hampton wrote:
> This looks totally cool.
>
> Is there a place where one can view the schema of the relational db?

Hi, Tom. See the vignette for a diagram and for examples. We obviously also assume some familiarity with SQL.

Sean

ADD REPLY • link 11.6 years ago Sean Davis 21k
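
The vignette remains the place to look for the schema diagram. As a quick complement, the tables and their columns can also be listed directly from the SQLite file with generic DBI calls. This is a minimal sketch; `describe_schema` is a hypothetical helper name, and the file name `GEOmetadb.sqlite` in the usage comment assumes the database has already been downloaded to the working directory.

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: named list mapping each table in the database
# to a character vector of its column names.
describe_schema <- function(con) {
  tables <- dbListTables(con)
  fields <- lapply(tables, function(tbl) dbListFields(con, tbl))
  setNames(fields, tables)
}

# Usage (assumes GEOmetadb.sqlite is already in the working directory):
# con <- dbConnect(SQLite(), "GEOmetadb.sqlite")
# str(describe_schema(con))
# dbDisconnect(con)
```

This only shows table and column names; the meaning of each column and the relationships between tables are documented in the vignette.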
