How to read a subset of the .CEL files
4
0
Entering edit mode
@alvord-greg-dms-contr-1603
Last seen 10.4 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060623/ d4cc919d/attachment.pl
• 931 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States
Hi Greg, Alvord, Greg (DMS) [Contr] wrote: > > > Dear List - > > > > I am new to BioConductor and R, working under Windows with a gig of RAM, > version R-2.2.1 of R. I have successfully read in six .CEL files and > created the following AffyBatch object. > > > > >>soy.ab > > > AffyBatch object > > size of arrays=1164x1164 features (63516 kb) > > cdf=Soybean (61170 affyids) > > number of samples=6 > > number of genes=61170 > > annotation=soybean > > > > The investigator for whom I'm working is interested in an analysis of > differential gene expression on a subset of affyids in this AffyBatch > object, specifically in 37,744 of the 61,170 affyids (indicated above) > that relate specifically to the soybean genome. I have learned that the > relevant species of interest is labeled 'Glycine max'. I obtained this > information from another source and have not (due to my ignorance) been > able to identify any slot in soy.ab AffyBatch object that identifies > this species. Here is a table of the species on the soy.ab AffyBatch > object (which I obtained from another source). > > > > >>cbind(table(Species)) > > > [,1] > > Alfalfa mosaic virus 3 > > Bean pod mottle virus strain G-7 2 > > Bean pod mottle virus strain K-Hancock1 1 > > Clover yellow vein virus 1 > > Glycine max 37744 > > Heterodera glycines 7539 > > Phytophthora sojae 15864 > > S. saman 4 > > Southern bean mosaic virus strain SBMV-S 1 > > Soybean mosaic virus 1 > > Soybean mosaic virus strain G5 3 > > Soybean mosaic virus strain G7 1 > > Soybean mosaic virus strain N 1 > > Tobacco ringspot virus 2 > > Tobacco streak virus 3 > > > > > > I want to select from the soy.ab AffyBatch object the relevant > information for the species 'Glycine max' only. I have created a data > frame containing those Affy.ID's for species 'Glycine max', e.g., > > > > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),] > > > Species Affy.ID > > 8 Glycine max AFFX-BioB-3_at > > 9 Glycine max AFFX-BioB-5_at > > 10 Glycine max AFFX-BioB-M_at > > 37749 Glycine max soybean_rRNA_838_RC_at > > 37750 Glycine max soybean_rRNA_918_at > > 37751 Glycine max soybean_rRNA_918_RC_at > > > > >>dim(Glycine.max.Species.AffyID.df) > > > [1] 37744 2 > > > > How do I extract/create an AffyBatch object containing only the > appropriate Affy.ID's related to the 'Glycine max' species? An AffyBatch object isn't the best for subsetting this way. Better would be to compute expression values using rma() or your favorite method, and then subset. eset <- rma(soy.ab) subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],] HTH, Jim -- James W. MacDonald University of Michigan Affymetrix and cDNA Microarray Core 1500 E Medical Center Drive Ann Arbor MI 48109 734-647-5623 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
ADD COMMENT
0
Entering edit mode
See the affxparser package, e.g. readCelUnits(filenames, units=c(1,600:612,45)). At the moment,you have to take it from there yourself. Henrik Bengtsson On 6/24/06, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > Hi Greg, > > Alvord, Greg (DMS) [Contr] wrote: > > > > > > Dear List - > > > > > > > > I am new to BioConductor and R, working under Windows with a gig of RAM, > > version R-2.2.1 of R. I have successfully read in six .CEL files and > > created the following AffyBatch object. > > > > > > > > > >>soy.ab > > > > > > AffyBatch object > > > > size of arrays=1164x1164 features (63516 kb) > > > > cdf=Soybean (61170 affyids) > > > > number of samples=6 > > > > number of genes=61170 > > > > annotation=soybean > > > > > > > > The investigator for whom I'm working is interested in an analysis of > > differential gene expression on a subset of affyids in this AffyBatch > > object, specifically in 37,744 of the 61,170 affyids (indicated above) > > that relate specifically to the soybean genome. I have learned that the > > relevant species of interest is labeled 'Glycine max'. I obtained this > > information from another source and have not (due to my ignorance) been > > able to identify any slot in soy.ab AffyBatch object that identifies > > this species. Here is a table of the species on the soy.ab AffyBatch > > object (which I obtained from another source). > > > > > > > > > >>cbind(table(Species)) > > > > > > [,1] > > > > Alfalfa mosaic virus 3 > > > > Bean pod mottle virus strain G-7 2 > > > > Bean pod mottle virus strain K-Hancock1 1 > > > > Clover yellow vein virus 1 > > > > Glycine max 37744 > > > > Heterodera glycines 7539 > > > > Phytophthora sojae 15864 > > > > S. saman 4 > > > > Southern bean mosaic virus strain SBMV-S 1 > > > > Soybean mosaic virus 1 > > > > Soybean mosaic virus strain G5 3 > > > > Soybean mosaic virus strain G7 1 > > > > Soybean mosaic virus strain N 1 > > > > Tobacco ringspot virus 2 > > > > Tobacco streak virus 3 > > > > > > > > > > > > I want to select from the soy.ab AffyBatch object the relevant > > information for the species 'Glycine max' only. I have created a data > > frame containing those Affy.ID's for species 'Glycine max', e.g., > > > > > > > > > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),] > > > > > > Species Affy.ID > > > > 8 Glycine max AFFX-BioB-3_at > > > > 9 Glycine max AFFX-BioB-5_at > > > > 10 Glycine max AFFX-BioB-M_at > > > > 37749 Glycine max soybean_rRNA_838_RC_at > > > > 37750 Glycine max soybean_rRNA_918_at > > > > 37751 Glycine max soybean_rRNA_918_RC_at > > > > > > > > > >>dim(Glycine.max.Species.AffyID.df) > > > > > > [1] 37744 2 > > > > > > > > How do I extract/create an AffyBatch object containing only the > > appropriate Affy.ID's related to the 'Glycine max' species? > > An AffyBatch object isn't the best for subsetting this way. Better would > be to compute expression values using rma() or your favorite method, and > then subset. > > eset <- rma(soy.ab) > subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],] > > HTH, > > Jim > > -- > James W. MacDonald > University of Michigan > Affymetrix and cDNA Microarray Core > 1500 E Medical Center Drive > Ann Arbor MI 48109 > 734-647-5623 > > > > ********************************************************** > Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
@roayaei-jean-dms-contr-1779
Last seen 10.4 years ago
Dear all, Henrik's explanation is correct. Similar queries made against NCBI soybean data sets yield the same number of genes. Jean Roayaei DMS, NCI-Frederick -----Original Message----- From: henrik.bengtsson@gmail.com [mailto:henrik.bengtsson@gmail.com] On Behalf Of Henrik Bengtsson Sent: Monday, June 26, 2006 6:31 AM To: James W. MacDonald Cc: Alvord, Greg (DMS) [Contr]; Roayaei, Jean (DMS) [Contr]; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] How to read a subset of the .CEL files See the affxparser package, e.g. readCelUnits(filenames, units=c(1,600:612,45)). At the moment,you have to take it from there yourself. Henrik Bengtsson On 6/24/06, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > Hi Greg, > > Alvord, Greg (DMS) [Contr] wrote: > > > > > > Dear List - > > > > > > > > I am new to BioConductor and R, working under Windows with a gig of RAM, > > version R-2.2.1 of R. I have successfully read in six .CEL files and > > created the following AffyBatch object. > > > > > > > > > >>soy.ab > > > > > > AffyBatch object > > > > size of arrays=1164x1164 features (63516 kb) > > > > cdf=Soybean (61170 affyids) > > > > number of samples=6 > > > > number of genes=61170 > > > > annotation=soybean > > > > > > > > The investigator for whom I'm working is interested in an analysis of > > differential gene expression on a subset of affyids in this AffyBatch > > object, specifically in 37,744 of the 61,170 affyids (indicated above) > > that relate specifically to the soybean genome. I have learned that the > > relevant species of interest is labeled 'Glycine max'. I obtained this > > information from another source and have not (due to my ignorance) been > > able to identify any slot in soy.ab AffyBatch object that identifies > > this species. Here is a table of the species on the soy.ab AffyBatch > > object (which I obtained from another source). > > > > > > > > > >>cbind(table(Species)) > > > > > > [,1] > > > > Alfalfa mosaic virus 3 > > > > Bean pod mottle virus strain G-7 2 > > > > Bean pod mottle virus strain K-Hancock1 1 > > > > Clover yellow vein virus 1 > > > > Glycine max 37744 > > > > Heterodera glycines 7539 > > > > Phytophthora sojae 15864 > > > > S. saman 4 > > > > Southern bean mosaic virus strain SBMV-S 1 > > > > Soybean mosaic virus 1 > > > > Soybean mosaic virus strain G5 3 > > > > Soybean mosaic virus strain G7 1 > > > > Soybean mosaic virus strain N 1 > > > > Tobacco ringspot virus 2 > > > > Tobacco streak virus 3 > > > > > > > > > > > > I want to select from the soy.ab AffyBatch object the relevant > > information for the species 'Glycine max' only. I have created a data > > frame containing those Affy.ID's for species 'Glycine max', e.g., > > > > > > > > > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),] > > > > > > Species Affy.ID > > > > 8 Glycine max AFFX-BioB-3_at > > > > 9 Glycine max AFFX-BioB-5_at > > > > 10 Glycine max AFFX-BioB-M_at > > > > 37749 Glycine max soybean_rRNA_838_RC_at > > > > 37750 Glycine max soybean_rRNA_918_at > > > > 37751 Glycine max soybean_rRNA_918_RC_at > > > > > > > > > >>dim(Glycine.max.Species.AffyID.df) > > > > > > [1] 37744 2 > > > > > > > > How do I extract/create an AffyBatch object containing only the > > appropriate Affy.ID's related to the 'Glycine max' species? > > An AffyBatch object isn't the best for subsetting this way. Better would > be to compute expression values using rma() or your favorite method, and > then subset. > > eset <- rma(soy.ab) > subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],] > > HTH, > > Jim > > -- > James W. MacDonald > University of Michigan > Affymetrix and cDNA Microarray Core > 1500 E Medical Center Drive > Ann Arbor MI 48109 > 734-647-5623 > > > > ********************************************************** > Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD COMMENT
0
Entering edit mode
Jenny Drnevich ★ 2.2k
@jenny-drnevich-382
Last seen 10.4 years ago
Hi Greg, I use some nice functions written by Ariel Chernomoretz (thanks!!) and posted to the Bioconductor mailing list on April 21, 2005: http://article.gmane.org/gmane.science.biology.informatics.conductor/4 258/match=, although I do have a caveat/question about the effects of removing probe sets on gcrma values (see below). You can use the main function 'RemoveProbes' to remove individual probes and/or entire probe sets, and I routinely use it to remove the non- soybean probe sets from Affy's soybean array. I made one change to the code of 'RemoveProbes', which was to add a call at the beginning to automatically load the cdf and probe packages if they weren't loaded already: require(cdfpackagename,character.only=T) require(probepackagename,character.only=T) If you contact me off-list I can send you my revised code and a few pointers if you're having trouble figuring out how to get 'RemoveProbes' to work. Basically, all you need to input is a list of the probe set names that you want to remove. CAVEAT: Ariel pointed out in a post on April 20, 2005 (http://article.gmane.org/gmane.science.biology.informatics.conductor/ 4242/match=) that removing even one probe will slightly affect the resulting gcrma values, because of the sampling of affinity values, likely in the internal function 'bg.parameters.ns'. I have found this effect to be greatly increased when I remove the non-soybean probe sets (~40% of the array) before running 'gcrma': 88% of the remaining probe sets have smaller gcrma values than before, and 25% have been decreased 1.4 fold or more. The changes occur in the background adjustment step, not in the normalization or summarization steps. If you remove a random 40% of the probe sets, there isn't the same magnitude (90% < 1.05 fold) or bias in direction (45% smaller) of changes . That's when I noticed that the distribution of expression values in my samples for the soybean genes was very different from the distribution of non-soybean genes. Nearly 80% of the soybean genes are called present in my samples but only 5% of the non-soybean genes are called present. My question/concern is why a change in the distribution of probe values can have a large effect on the gc-based background correction? The background corrections used by RMA and MAS5 aren't affected by removing the non-soybean genes, and neither is the background correction+normalization of VSN. It seems to be that there might be some inherent assumptions in the gc-based method that aren't in the other methods about the distribution of probe values. Which gcrma values are "better"? Oops - I just compared the limma results using both sets of gcrma values, and it appears that whatever changes are occurring don't really affect the significant gene lists nor the magnitude of direction of changes between two treatment groups, so perhaps this caveat of mine is unnecessary! It might be important for users of soybean and other large arrays, if they run 'gcrma' on random subsets of arrays instead of all together in order to overcome memory limitations - make sure you are consistent in cutting out the probe sets before running 'gcrma', else your values will not be comparable between subsets!! Cheers, Jenny At 05:30 AM 6/26/2006, Henrik Bengtsson wrote: >See the affxparser package, e.g. readCelUnits(filenames, >units=c(1,600:612,45)). At the moment,you have to take it from there >yourself. > >Henrik Bengtsson > >On 6/24/06, James W. MacDonald <jmacdon at="" med.umich.edu=""> wrote: > > Hi Greg, > > > > Alvord, Greg (DMS) [Contr] wrote: > > > > > > > > > Dear List - > > > > > > > > > > > > I am new to BioConductor and R, working under Windows with a gig of RAM, > > > version R-2.2.1 of R. I have successfully read in six .CEL files and > > > created the following AffyBatch object. > > > > > > > > > > > > > > >>soy.ab > > > > > > > > > AffyBatch object > > > > > > size of arrays=1164x1164 features (63516 kb) > > > > > > cdf=Soybean (61170 affyids) > > > > > > number of samples=6 > > > > > > number of genes=61170 > > > > > > annotation=soybean > > > > > > > > > > > > The investigator for whom I'm working is interested in an analysis of > > > differential gene expression on a subset of affyids in this AffyBatch > > > object, specifically in 37,744 of the 61,170 affyids (indicated above) > > > that relate specifically to the soybean genome. I have learned that the > > > relevant species of interest is labeled 'Glycine max'. I obtained this > > > information from another source and have not (due to my ignorance) been > > > able to identify any slot in soy.ab AffyBatch object that identifies > > > this species. Here is a table of the species on the soy.ab AffyBatch > > > object (which I obtained from another source). > > > > > > > > > > > > > > >>cbind(table(Species)) > > > > > > > > > [,1] > > > > > > Alfalfa mosaic virus 3 > > > > > > Bean pod mottle virus strain G-7 2 > > > > > > Bean pod mottle virus strain K-Hancock1 1 > > > > > > Clover yellow vein virus 1 > > > > > > Glycine max 37744 > > > > > > Heterodera glycines 7539 > > > > > > Phytophthora sojae 15864 > > > > > > S. saman 4 > > > > > > Southern bean mosaic virus strain SBMV-S 1 > > > > > > Soybean mosaic virus 1 > > > > > > Soybean mosaic virus strain G5 3 > > > > > > Soybean mosaic virus strain G7 1 > > > > > > Soybean mosaic virus strain N 1 > > > > > > Tobacco ringspot virus 2 > > > > > > Tobacco streak virus 3 > > > > > > > > > > > > > > > > > > I want to select from the soy.ab AffyBatch object the relevant > > > information for the species 'Glycine max' only. I have created a data > > > frame containing those Affy.ID's for species 'Glycine max', e.g., > > > > > > > > > > > > > > >>Glycine.max.Species.AffyID.df[c(1:3,37742:37744),] > > > > > > > > > Species Affy.ID > > > > > > 8 Glycine max AFFX-BioB-3_at > > > > > > 9 Glycine max AFFX-BioB-5_at > > > > > > 10 Glycine max AFFX-BioB-M_at > > > > > > 37749 Glycine max soybean_rRNA_838_RC_at > > > > > > 37750 Glycine max soybean_rRNA_918_at > > > > > > 37751 Glycine max soybean_rRNA_918_RC_at > > > > > > > > > > > > > > >>dim(Glycine.max.Species.AffyID.df) > > > > > > > > > [1] 37744 2 > > > > > > > > > > > > How do I extract/create an AffyBatch object containing only the > > > appropriate Affy.ID's related to the 'Glycine max' species? > > > > An AffyBatch object isn't the best for subsetting this way. Better would > > be to compute expression values using rma() or your favorite method, and > > then subset. > > > > eset <- rma(soy.ab) > > subsetted.exprset <- eset[Glycine.max.Species.AffyID.df[,2],] > > > > HTH, > > > > Jim > > > > -- > > James W. MacDonald > > University of Michigan > > Affymetrix and cDNA Microarray Core > > 1500 E Medical Center Drive > > Ann Arbor MI 48109 > > 734-647-5623 > > > > > > > > ********************************************************** > > Electronic Mail is not secure, may not be read every day, and should > not be used for urgent or sensitive issues. > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Jenny Drnevich, Ph.D. Functional Genomics Bioinformatics Specialist W.M. Keck Center for Comparative and Functional Genomics Roy J. Carver Biotechnology Center University of Illinois, Urbana-Champaign 330 ERML 1201 W. Gregory Dr. Urbana, IL 61801 USA ph: 217-244-7355 fax: 217-265-5066 e-mail: drnevich at uiuc.edu
ADD COMMENT
0
Entering edit mode
@alvord-greg-dms-contr-1603
Last seen 10.4 years ago
An embedded and charset-unspecified text was scrubbed... Name: not available Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20060629/ e36b64bc/attachment.pl
ADD COMMENT

Login before adding your answer.

Traffic: 657 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6