downloading different kinds of microarray data
@sean-davis-490
Last seen 4 months ago
United States
On Tue, Aug 24, 2010 at 10:17 AM, Alex Levitchi <alex.levitchi@cbm.fvg.it> wrote:

> Dear Sean Davis,
> Since my last letter, I managed to do almost everything. Unfortunately, I
> am not definitely understanding the aim of organizing microarray data in
> GSEs and GDSs, in sense that GEOquery uses different tactics to load data
> and convert them. So, probably, creating a tool I also must take into
> consideration all these aspects and allow different steps to load data,
> corresponding to the level of data organization, GSM to GPL, converting them
> in ExpressionSet type.

Hi, Alex.

Your understanding is correct. GSE and GDS contain different information and so are dealt with differently by GEOquery.

> Also, there is another problem, regarding the fact that GPLs, GDSs and GSEs
> can contain different tables by their size (different number of probes /
> rows) which do not allow the analysis straightforward. I am not sure, but I
> suppose that, e.g, if a GSE consists of GSMs from different platforms,
> expression and phenotypic data are structured in several parts according to
> the GPL. Thus, in the example I've sent

Again, I think your understanding is correct.

> gse=getGEO(idata,GSEMatrix=TRUE) #'idata' the name of the dataset, especially GSE or user created table
> columns=c('title','type','source_name_ch1','platform_id')

This will probably be about right for 1-color data, but may not be directly useful for 2-color data or for sequencing data. Also, this minimal information may not allow one to capture the appropriate information for every experiment. If all the phenotype data is carried ONLY in the source_name_ch1, then you will be fine, but that will not be the case for many experimental designs.

> pdata=pData(gse[[1]])[,columns]
> expression=exprs(gse[[1]])
> colnames(expression)=as.vector(pdata[colnames(expression),3])

This assumes that the source_name_ch1 column has unique entries. They need not be unique.

> I suppose gse[[1]] represent the information extracted only for the first
> GPL from 'platform_id' column, which was extracted from phenodata, and, if
> there are 2 or more GPLs, it should be 'gse[[2]]' and so on.
> Unfortunately, I did not find any article or manual which describe these
> peculiarities.

This is described in the help page for getGEO. getGEO with GSEMatrix=TRUE returns a list of ExpressionSets.

> Please, give me a hint if I am right and I use a correct way to interpret
> microarray data structure in order to prepare the data for the later
> analysis.
> The informations I always need to get are:
> 1 - expression values table, with
> 2 - rows - probe_ids and columns - the name of each sample
> 3 - GPL name, to use it for the downloading if the corresponding
> Bioconductor annotation package.

In fact, what you are asking for is an ExpressionSet. getGEO() returns a list of those directly, so there is no need to do any further post-processing when getting GSEs. For GDS data, you can simply use GDS2eSet(getGEO("GDSXXXX")) and you will get an ExpressionSet. Both methods will load the featureData slot with the full GPL data table, so you can use that for annotation. If you want to use the Bioconductor annotation packages instead, see the GEOmetadb package, which has mappings from GPL accessions to Bioconductor data packages.
Sean

> Kind regards,
> Alex Levitchi
> PhD in Genetics,
> Bioinformatician at Laboratory of Bioinformatics
> CBM, Area Science Park, Trieste, Italy
> http://www.cbm.fvg.it/laboratories/bioinformatics_research
>
> scientific researcher,
> Center of Molecular Biology,
> University of Academy of Sciences of Moldova
> www.edu.asm.md
>
> ----- Original message -----
> From: "Sean Davis" <sdavis2@mail.nih.gov>
> To: "Alex Levitchi" <alex.levitchi@cbm.fvg.it>
> Cc: bioconductor@stat.math.ethz.ch
> Sent: Friday, 23 July 2010, 19:53:47 GMT +01:00 Amsterdam, Berlin, Bern, Vienna, Rome, Stockholm
> Subject: Re: [BioC] downloading different kinds of microarray data
>
> Hi, Alex. You are definitely thinking correctly that you want to be using
> ExpressionSets. I would focus your attention on learning to construct an
> ExpressionSet for each case you outline.
>
> Sean
>
> On Jul 23, 2010 10:12 AM, "Alex Levitchi" <alex.levitchi@cbm.fvg.it> wrote:
>
> Dear Bioconductors,
> I am working on the development of a tool which use to download microarray
> data and then make the connection to Bioconductor annotation packages.
> My specific answer is about the way to manage downloading different kinds
> of microarrays, which can be:
> - GSE
> - several GSMs
> - users data (excel or tab delimiter file).
> I use GEOquery package.
> My tool works fine if I am using just GSE file, which has a good structure
> and I know how to extract expression values, platform (GPL) and samples
> names.
>
> gse=getGEO(idata,GSEMatrix=TRUE)
> columns=c('title','type','source_name_ch1','platform_id')
> pdata=pData(gse[[1]])[,columns]
> expression=exprs(gse[[1]])
> colnames(expression)=as.vector(pdata[colnames(expression),3])
>
> But I feel confused, when I think about the way to handle with several GSMs
> or user data.
> applying getGEO function for GSM I have to use then Table(gse)$VALUE to
> extract expression values and Meta(gse)$platform_id to know the GPL. I
> understand how to do this easy when I have just 1 GSM. How should I manage
> several GSMs?
> from the start I supposed to use smth like this:
>
> gse=do.call("cbind",lapply('list_of_GSMs'),function(x) {
> getGEO(as.character(x),GSEMatrix=TRUE)
> }
>
> but, thus, I get just expression values matrix, and I still don't know what
> is the GPL and sample names.
>
> Another idea (I did not check it yet, as I am not sure it is correct) is to
> try to create an ExpressionSet (also for user data, after downloading them
> through 'read.table'), but I also don't know how to create a phenoData file,
> simply manually or there is a possibility to make it through the code.
> having ExpressionSet I suppose I will can to use "pData" function like in
> case of a GSE.
> Doing all this I would like to be able to download and arrange the data in
> the way, to use the rest of the functions which comes after 'gse=....' in
> the up presented example.
>
> Please, give me some hints at least at one of this points.
>
> Thank's for you nice job.
> Cheers
>
> Alexei Levitchi
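To illustrate the caveat about non-unique source_name_ch1 entries, here is a minimal base-R sketch. The matrix, the phenotype table, and the GSM accessions are all invented stand-ins for exprs(gse[[1]]) and pData(gse[[1]]); using make.unique() is one possible workaround, not something suggested in the thread.

```r
# Toy expression matrix: rows are probes, columns are GSM accessions.
# All values and accessions are made up for illustration.
expression <- matrix(
  c(7.1, 8.3, 6.9,
    5.2, 5.0, 5.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("probe_1", "probe_2"), c("GSM1", "GSM2", "GSM3"))
)

# Toy phenotype table, indexed by GSM accession like pData(gse[[1]]).
pdata <- data.frame(
  source_name_ch1 = c("liver", "liver", "kidney"),
  row.names = c("GSM1", "GSM2", "GSM3"),
  stringsAsFactors = FALSE
)

# Look up the label for each column, then force the labels to be unique
# so that no two samples end up with the same column name.
labels <- pdata[colnames(expression), "source_name_ch1"]
colnames(expression) <- make.unique(labels, sep = "_")
print(colnames(expression))
```

Keeping the GSM accessions as column names and storing the descriptive label only in the phenotype table avoids the problem altogether, since accessions are unique by construction.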
Sequencing Microarray Genetics Annotation gpls convert GEOquery GEOmetadb Sequencing gpls • 2.3k views
@sean-davis-490
Last seen 4 months ago
United States
On Tue, Aug 24, 2010 at 11:07 AM, Alex Levitchi <alex.levitchi@cbm.fvg.it> wrote:

> Dear Sean,
> Thanks a lot.
> I did it in the following way (the idea I took from 'getGEOfile'):
>
> columns=c('title','type','source_name_ch1','platform_id')
> geo=getGEO(idata,GSEMatrix=TRUE)
>
> if (idata_name=="GSM") {
>   expression=matrix(Table(geo)$VALUE)
>   pdata=Meta(geo)[columns]
>   colnames(expression)=pdata[3]
>   expression=apply(expression,2,function(x){
>     as.numeric(as.character(x))})
>   rownames(expression)=Table(geo)$ID_REF
> }
>
> if (idata_name=="GDS") {
>   gds_set=GDS2eSet(geo)
>   pdata=data.frame(title="", type="",
>     source_name_ch1=pData(gds_set)$tissue, platform_id=Meta(geo)$platform)
>   expression=exprs(gds_set)
>   colnames(expression)=pdata[,3]
> }
>
> if (idata_name=="GSE") {
>   pdata=pData(geo[[1]])[,columns]
>   expression=exprs(geo[[1]])
>   colnames(expression)=as.vector(pdata[colnames(expression),3])
> }
>
> The thing I still need to figure out is the way to extract informations in
> case of multiple platforms.
> It seems a little bit huge for me, but I hope it is the way it should be.

As I mentioned, the return value from getGEO() with a GSE is, by default, a list of ExpressionSets. You just need to use lapply or another looping method on your "geo" object above. How you design the data structure to hold the resulting data is going to be up to you.
Sean
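The lapply suggestion can be sketched as follows. This is only an illustration under stated assumptions: it needs the Biobase package, the helper make_toy_eset() and the accessions "GPL_A" and "GPL_B" are hypothetical, and the toy list stands in for what getGEO("GSE...", GSEMatrix=TRUE) would return. It also assumes that the annotation slot of each ExpressionSet carries the platform name; if it does not, the platform_id column of pData() can be used instead.

```r
library(Biobase)

# Hypothetical helper that builds a small stand-in ExpressionSet.
make_toy_eset <- function(samples, probes, platform) {
  values <- matrix(
    as.numeric(seq_len(length(samples) * length(probes))),
    nrow = length(probes),
    dimnames = list(probes, samples)
  )
  pheno <- data.frame(
    title = paste("sample", samples),
    platform_id = platform,
    row.names = samples,
    stringsAsFactors = FALSE
  )
  ExpressionSet(
    assayData = values,
    phenoData = AnnotatedDataFrame(pheno),
    annotation = platform
  )
}

# Stand-in for the list returned for a GSE with two platforms.
geo <- list(
  make_toy_eset(c("GSM1", "GSM2"), c("p1", "p2", "p3"), "GPL_A"),
  make_toy_eset(c("GSM3"), c("q1", "q2"), "GPL_B")
)

# One entry per platform: expression matrix, phenotype table, platform name.
per_platform <- lapply(geo, function(eset) {
  list(
    expression = exprs(eset),
    pdata = pData(eset),
    platform = annotation(eset)
  )
})
names(per_platform) <- vapply(per_platform, function(x) x$platform, character(1))
```

Keeping the platforms in a named list, rather than binding them into one matrix, sidesteps the fact that different platforms have different probe sets.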
@alex-levitchi-4179
Last seen 10.3 years ago
Dear Sean Davis,

Since my last letter, I managed to do almost everything. Unfortunately, I am not definitely understanding the aim of organizing microarray data in GSEs and GDSs, in sense that GEOquery uses different tactics to load data and convert them. So, probably, creating a tool I also must take into consideration all these aspects and allow different steps to load data, corresponding to the level of data organization, GSM to GPL, converting them in ExpressionSet type.

Also, there is another problem, regarding the fact that GPLs, GDSs and GSEs can contain different tables by their size (different number of probes / rows) which do not allow the analysis straightforward. I am not sure, but I suppose that, e.g, if a GSE consists of GSMs from different platforms, expression and phenotypic data are structured in several parts according to the GPL. Thus, in the example I've sent

gse=getGEO(idata,GSEMatrix=TRUE) #'idata' the name of the dataset, especially GSE or user created table
columns=c('title','type','source_name_ch1','platform_id')
pdata=pData(gse[[1]])[,columns]
expression=exprs(gse[[1]])
colnames(expression)=as.vector(pdata[colnames(expression),3])

I suppose gse[[1]] represent the information extracted only for the first GPL from 'platform_id' column, which was extracted from phenodata, and, if there are 2 or more GPLs, it should be 'gse[[2]]' and so on. Unfortunately, I did not find any article or manual which describe these peculiarities.

Please, give me a hint if I am right and I use a correct way to interpret microarray data structure in order to prepare the data for the later analysis.

The informations I always need to get are:
1 - expression values table, with
2 - rows - probe_ids and columns - the name of each sample
3 - GPL name, to use it for the downloading if the corresponding Bioconductor annotation package.
Kind regards,
Alex Levitchi
PhD in Genetics,
Bioinformatician at Laboratory of Bioinformatics
CBM, Area Science Park, Trieste, Italy
http://www.cbm.fvg.it/laboratories/bioinformatics_research

scientific researcher,
Center of Molecular Biology,
University of Academy of Sciences of Moldova
www.edu.asm.md
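The three items Alex lists (an expression table, probe and sample names, and a platform name) are what an ExpressionSet holds, including for user-supplied data. Below is a minimal sketch of building one by hand with Biobase. The table contents, the sample and probe names, and the label "GPL_PLACEHOLDER" are invented for illustration; in practice the table might come from read.table() on a user file.

```r
library(Biobase)

# Stand-in for a user table, e.g. the result of
# read.table("my_data.txt", header = TRUE, row.names = 1) on a hypothetical file.
user_table <- data.frame(
  sample_A = c(5.1, 6.2, 7.3),
  sample_B = c(5.0, 6.5, 7.1),
  row.names = c("probe_1", "probe_2", "probe_3")
)
expression <- as.matrix(user_table)

# Phenotype table: one row per sample, row names matching the matrix columns.
pheno <- data.frame(
  source_name_ch1 = c("control", "treated"),
  row.names = colnames(expression),
  stringsAsFactors = FALSE
)

eset <- ExpressionSet(
  assayData = expression,
  phenoData = AnnotatedDataFrame(pheno),
  annotation = "GPL_PLACEHOLDER"  # hypothetical platform label
)

# The same accessors used for GEO data now work on the user data.
exprs(eset)
pData(eset)
annotation(eset)
```

The phenotype data frame is created in code here, which answers the question of whether phenoData has to be prepared manually: any data frame whose row names match the sample columns will do.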
Marc Noguera ▴ 100
@marc-noguera-3883
Last seen 10.3 years ago
Hi all,

I have been learning to use the ChIPpeakAnno package to annotate the peaks I obtain from ChIP-seq experiments. I see that, using biomaRt, I can get some annotation from Ensembl and annotate the peaks according to TSS, exon, miRNA and some other features.

I would also like to be able to search for repetitive elements, such as Alus, and for CpG islands. Is that kind of information also accessible using biomaRt? I can't seem to find it. If not, how should I proceed?

Thanks in advance,
Marc

--
Marc Noguera i Julian, PhD
Genomics unit / Bioinformatics
Institut de Medicina Predictiva i Personalitzada del Càncer (IMPPC)
B-10 Office
Carretera de Can Ruti
Camí de les Escoles s/n
08916 Badalona, Barcelona
Marc,

Use "rtracklayer", which interfaces with the UCSC Genome Browser. You have access to all the tables with the "track" function; have a look at the help for "track". This corresponds to using the following web page:
http://genome.ucsc.edu/cgi-bin/hgTables?command=start

In your case, you need the "RepeatMasker" track. It returns all repeats, so from the data you get back you need to select the different Alu types. For CpG islands, take the "CpG Islands" track.

Hans
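The selection step Hans describes can be sketched in base R once the repeat table is in hand. The data frame below is an invented stand-in, not real UCSC output; its column names follow the UCSC "rmsk" table (genoName, genoStart, genoEnd, repName, repClass, repFamily), and the download itself via rtracklayer is deliberately left out because its exact call depends on the package version.

```r
# Toy stand-in for a RepeatMasker table; the rows are invented.
repeats <- data.frame(
  genoName  = c("chr1", "chr1", "chr2", "chr2"),
  genoStart = c(100L, 900L, 250L, 4000L),
  genoEnd   = c(400L, 1200L, 560L, 4300L),
  repName   = c("AluY", "L1PA3", "AluSx", "MIR"),
  repClass  = c("SINE", "LINE", "SINE", "SINE"),
  repFamily = c("Alu", "L1", "Alu", "MIR"),
  stringsAsFactors = FALSE
)

# Keep only Alu elements, then count each Alu subtype.
alu <- repeats[repeats$repFamily == "Alu", ]
alu_counts <- table(alu$repName)
print(alu_counts)
```

Filtering on repFamily rather than on repClass matters here, because the SINE class also contains non-Alu families such as MIR.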
@alex-levitchi-4179
Last seen 10.3 years ago
Dear Sean,

Thanks a lot. I did it in the following way (the idea I took from 'getGEOfile'):

columns=c('title','type','source_name_ch1','platform_id')
geo=getGEO(idata,GSEMatrix=TRUE)

if (idata_name=="GSM") {
  expression=matrix(Table(geo)$VALUE)
  pdata=Meta(geo)[columns]
  colnames(expression)=pdata[3]
  expression=apply(expression,2,function(x){
    as.numeric(as.character(x))})
  rownames(expression)=Table(geo)$ID_REF
}

if (idata_name=="GDS") {
  gds_set=GDS2eSet(geo)
  pdata=data.frame(title="", type="",
    source_name_ch1=pData(gds_set)$tissue, platform_id=Meta(geo)$platform)
  expression=exprs(gds_set)
  colnames(expression)=pdata[,3]
}

if (idata_name=="GSE") {
  pdata=pData(geo[[1]])[,columns]
  expression=exprs(geo[[1]])
  colnames(expression)=as.vector(pdata[colnames(expression),3])
}

The thing I still need to figure out is the way to extract informations in case of multiple platforms. It seems a little bit huge for me, but I hope it is the way it should be.

Regards,
Alexei Levitchi
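The GSM branch above handles a single sample. For several GSMs, one possible approach is to align the per-sample tables on ID_REF and bind the VALUE columns into one matrix. The sketch below uses invented data frames as stand-ins for Table(getGEO("GSM...")), so the accessions and values are hypothetical; it assumes all samples come from the same platform, which could be checked beforehand with Meta(gsm)$platform_id.

```r
# Toy stand-ins for the per-sample tables; each has ID_REF and VALUE columns.
# VALUE is stored as text here to mimic the conversion done in the thread.
gsm_tables <- list(
  GSM1 = data.frame(ID_REF = c("p1", "p2", "p3"),
                    VALUE = c("1.5", "2.5", "3.5"),
                    stringsAsFactors = FALSE),
  GSM2 = data.frame(ID_REF = c("p3", "p1", "p2"),
                    VALUE = c("3.0", "1.0", "2.0"),
                    stringsAsFactors = FALSE)
)

# Use the probes shared by all samples, in the order of the first table.
probes <- Reduce(intersect, lapply(gsm_tables, function(tab) tab$ID_REF))

# One numeric column per sample, aligned on ID_REF rather than on row order.
expression <- vapply(gsm_tables, function(tab) {
  as.numeric(tab$VALUE[match(probes, tab$ID_REF)])
}, numeric(length(probes)))
rownames(expression) <- probes
print(expression)
```

Matching on ID_REF instead of relying on cbind keeps the rows consistent even when the samples list their probes in different orders, and the list names become the column names, so the sample accessions are preserved.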