paCalls oligo package
1
1
Entering edit mode
@juan-fernandez-tajes-5273
Last seen 7.9 years ago
Dear list, I would like to use the paCalls from oligo package for filtering probe sets with absence of transcripts. My data are from Hugene 1.1 st array (Affymetrix) My data after reading CEL files is a GeneFeatureSet with 1178100 features and 23 samples >data GeneFeatureSet (storageMode: lockedEnvironment) assayData: 1178100 features, 23 samples element names: exprs protocolData rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) varLabels: exprs dates varMetadata: labelDescription channel phenoData rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) varLabels: Sample_ID INIBIC_ID ... Cluster2 (11 total) varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: pd.hugene.1.1.st.v1 I called the paCalls function: > dabgPS <- paCalls(data, "PSDABG") And I obtained a matrix of 257430x23, how can I used this information to filter those probes without transcript? My aim is to obtain an average expression value in only those probes with a "true" transcription. Many thanks in advance Juan my sessionInfo is: R version 2.15.1 (2012-06-22) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] affy_1.34.0 pd.hugene.1.1.st.v1_3.6.0 genefilter_1.38.0 [4] limma_3.12.1 annotate_1.34.1 multtest_2.12.0 [7] oligo_1.20.4 oligoClasses_1.18.0 hugene11sttranscriptcluster.db_4.0.1 [10] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 DBI_0.2-5 [13] AnnotationDbi_1.18.1 Biobase_2.16.0 BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] affxparser_1.28.1 affyio_1.24.0 BiocInstaller_1.4.7 Biostrings_2.24.1 bit_1.1-8 [6] codetools_0.2-8 ff_2.2-7 foreach_1.4.0 IRanges_1.14.4 iterators_1.0.6 [11] MASS_7.3-20 preprocessCore_1.18.0 splines_2.15.1 stats4_2.15.1 survival_2.36-14 [16] tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 --------------------------------------------------------------- Juan Fernandez Tajes, ph. D Grupo XENOMAR Departamento de BiologÃ­a Celular y Molecular Facultad de Ciencias-Universidade da CoruÃ±a Tlf. +34 981 167000 ext 2030 e-mail: jfernandezt@udc.es ---------------------------------------------------------------- [[alternative HTML version deleted]]
Transcription oligo Transcription oligo • 1.8k views
0
Entering edit mode
@james-w-macdonald-5106
Last seen 9 minutes ago
United States
Hi Juan, On 9/24/2012 11:09 AM, Juan Fern?ndez Tajes wrote: > Dear list, > > I would like to use the paCalls from oligo package for filtering probe sets with absence of transcripts. My data are from Hugene 1.1 st array (Affymetrix) > My data after reading CEL files is a GeneFeatureSet with 1178100 features and 23 samples > >> data > GeneFeatureSet (storageMode: lockedEnvironment) > assayData: 1178100 features, 23 samples > element names: exprs > protocolData > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > varLabels: exprs dates > varMetadata: labelDescription channel > phenoData > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > varLabels: Sample_ID INIBIC_ID ... Cluster2 (11 total) > varMetadata: labelDescription > featureData: none > experimentData: use 'experimentData(object)' > Annotation: pd.hugene.1.1.st.v1 > > I called the paCalls function: > >> dabgPS<- paCalls(data, "PSDABG") > And I obtained a matrix of 257430x23, how can I used this information to filter those probes without transcript? > > My aim is to obtain an average expression value in only those probes with a "true" transcription. I would argue that you can't actually do this with microarray data. What you can do is say if the probeset intensities for a given transcript are significantly brighter than background probesets. I think that is a very different thing from saying a transcript isn't expressed, but opinions differ on that point. Please note that the matrix that paCalls() returns is made up of p-values testing the hypothesis that the given probeset is not different from background probesets (so a small p-value causes you to reject the null hypothesis, and conclude that they *are* different). Also note that it appears you have summarized your data at the exon level, whereas you ran paCalls at the transcript level. This won't work, so you either have to do rma() using target = "core", or paCalls() using "DAGB" in order to be consistent. Personally I wouldn't use rma(target = "probeset") for the Gene ST arrays, because tons of the probesets only have one probe at that summarization level. So the next question is what should you do with these data? You could for instance say that at least N of the probesets for a given gene have to have a p-value < 0.05, where N = the number of samples in the smallest group you are comparing. That way, if the gene is transcribed in at least one sample, you retain it (e.g., if a gene is transcribed in one sample and not in any other, this is still an interesting result). Something like N <- 5 ## or whatever ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) data.filtered <- data[ind,] Best, Jim > > Many thanks in advance > > Juan > > my sessionInfo is: > > R version 2.15.1 (2012-06-22) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] affy_1.34.0 pd.hugene.1.1.st.v1_3.6.0 genefilter_1.38.0 > [4] limma_3.12.1 annotate_1.34.1 multtest_2.12.0 > [7] oligo_1.20.4 oligoClasses_1.18.0 hugene11sttranscriptcluster.db_4.0.1 > [10] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 DBI_0.2-5 > [13] AnnotationDbi_1.18.1 Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affxparser_1.28.1 affyio_1.24.0 BiocInstaller_1.4.7 Biostrings_2.24.1 bit_1.1-8 > [6] codetools_0.2-8 ff_2.2-7 foreach_1.4.0 IRanges_1.14.4 iterators_1.0.6 > [11] MASS_7.3-20 preprocessCore_1.18.0 splines_2.15.1 stats4_2.15.1 survival_2.36-14 > [16] tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 > > > --------------------------------------------------------------- > Juan Fernandez Tajes, ph. D > Grupo XENOMAR > Departamento de Biolog??a Celular y Molecular > Facultad de Ciencias-Universidade da Coru??a > Tlf. +34 981 167000 ext 2030 > e-mail: jfernandezt at udc.es > ---------------------------------------------------------------- > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
0
Entering edit mode
Hi James, Many thanks for your answer, for instance, I want to check if the probeset intensities for a given transcript are significantly brighter than background probesets, but I wasnÂ´t able to explain it properly. I have another question regarding your explanation: > Also note that it appears you have summarized your data at the exon >level, whereas you ran paCalls at the transcript level. This won't work, >so you either have to do rma() using target = "core", or paCalls() using >"DAGB" in order to be consistent I've obtained my object "data" running the following code: geneCELs <- list.celfiles("./CEL", full.names=T) affyGeneFS <- read.celfiles(geneCELs) data <- affyGeneFS sampleNames(data) <- sub("\\.CEL$", "", sampleNames(dat)) metadata_array <- read.delim(file="metadata_array_oligo.txt", header=T, sep="\t") rownames(metadata_array) <- metadata_array$Sample_ID phenoData(data) <- new("AnnotatedDataFrame", data=metadata_array) When you say that I should do rma() using target="core" or paCalls() using "DAGB" in order to be consistent, should I do the following? dabgPS<- paCalls(data, "PSDABG") dat.rma <- rma(data, target="core") ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) data.filtered <- dat.rma[ind,] Am I right? Juan --------------------------------------------------------------- Juan Fernandez Tajes, ph. D Grupo XENOMAR Departamento de BiologÃ­a Celular y Molecular Facultad de Ciencias-Universidade da CoruÃ±a Tlf. +34 981 167000 ext 2030 e-mail: jfernandezt@udc.es ---------------------------------------------------------------- De: "James W. MacDonald" <jmacdon@uw.edu> Para: "Juan FernÃ¡ndez Tajes" <jfernandezt@udc.es> CC: bioconductor@r-project.org Enviados: Lunes, 24 de Septiembre 2012 17:52:55 Asunto: Re: [BioC] paCalls oligo package Hi Juan, On 9/24/2012 11:09 AM, Juan FernÃ¡ndez Tajes wrote: > Dear list, > > I would like to use the paCalls from oligo package for filtering probe sets with absence of transcripts. My data are from Hugene 1.1 st array (Affymetrix) > My data after reading CEL files is a GeneFeatureSet with 1178100 features and 23 samples > >> data > GeneFeatureSet (storageMode: lockedEnvironment) > assayData: 1178100 features, 23 samples > element names: exprs > protocolData > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > varLabels: exprs dates > varMetadata: labelDescription channel > phenoData > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > varLabels: Sample_ID INIBIC_ID ... Cluster2 (11 total) > varMetadata: labelDescription > featureData: none > experimentData: use 'experimentData(object)' > Annotation: pd.hugene.1.1.st.v1 > > I called the paCalls function: > >> dabgPS<- paCalls(data, "PSDABG") > And I obtained a matrix of 257430x23, how can I used this information to filter those probes without transcript? > > My aim is to obtain an average expression value in only those probes with a "true" transcription. I would argue that you can't actually do this with microarray data. What you can do is say if the probeset intensities for a given transcript are significantly brighter than background probesets. I think that is a very different thing from saying a transcript isn't expressed, but opinions differ on that point. Please note that the matrix that paCalls() returns is made up of p-values testing the hypothesis that the given probeset is not different from background probesets (so a small p-value causes you to reject the null hypothesis, and conclude that they *are* different). Also note that it appears you have summarized your data at the exon level, whereas you ran paCalls at the transcript level. This won't work, so you either have to do rma() using target = "core", or paCalls() using "DAGB" in order to be consistent. Personally I wouldn't use rma(target = "probeset") for the Gene ST arrays, because tons of the probesets only have one probe at that summarization level. So the next question is what should you do with these data? You could for instance say that at least N of the probesets for a given gene have to have a p-value < 0.05, where N = the number of samples in the smallest group you are comparing. That way, if the gene is transcribed in at least one sample, you retain it (e.g., if a gene is transcribed in one sample and not in any other, this is still an interesting result). Something like N <- 5 ## or whatever ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) data.filtered <- data[ind,] Best, Jim > > Many thanks in advance > > Juan > > my sessionInfo is: > > R version 2.15.1 (2012-06-22) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] affy_1.34.0 pd.hugene.1.1.st.v1_3.6.0 genefilter_1.38.0 > [4] limma_3.12.1 annotate_1.34.1 multtest_2.12.0 > [7] oligo_1.20.4 oligoClasses_1.18.0 hugene11sttranscriptcluster.db_4.0.1 > [10] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 DBI_0.2-5 > [13] AnnotationDbi_1.18.1 Biobase_2.16.0 BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] affxparser_1.28.1 affyio_1.24.0 BiocInstaller_1.4.7 Biostrings_2.24.1 bit_1.1-8 > [6] codetools_0.2-8 ff_2.2-7 foreach_1.4.0 IRanges_1.14.4 iterators_1.0.6 > [11] MASS_7.3-20 preprocessCore_1.18.0 splines_2.15.1 stats4_2.15.1 survival_2.36-14 > [16] tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 > > > --------------------------------------------------------------- > Juan Fernandez Tajes, ph. D > Grupo XENOMAR > Departamento de BiologÃÂ­a Celular y Molecular > Facultad de Ciencias-Universidade da CoruÃÂ±a > Tlf. +34 981 167000 ext 2030 > e-mail: jfernandezt@udc.es > ---------------------------------------------------------------- > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]
0
Entering edit mode
Hi Juan, On 9/24/2012 12:24 PM, Juan Fern?ndez Tajes wrote: > Hi James, > > Many thanks for your answer, for instance, I want to check if the > probeset intensities for a given transcript are significantly brighter > than background probesets, but I wasn?t able to explain it properly. I > have another question regarding your explanation: > > >Also note that it appears you have summarized your data at the exon > >level, whereas you ran paCalls at the transcript level. This won't work, > >so you either have to do rma() using target = "core", or paCalls() using > >"DAGB" in order to be consistent > > I've obtained my object "data" running the following code: > > geneCELs <- list.celfiles("./CEL", full.names=T) > affyGeneFS <- read.celfiles(geneCELs) > data <- affyGeneFS > sampleNames(data) <- sub("\\.CEL$", "", sampleNames(dat)) > metadata_array <- read.delim(file="metadata_array_oligo.txt", > header=T, sep="\t") > rownames(metadata_array) <- metadata_array$Sample_ID > phenoData(data) <- new("AnnotatedDataFrame", data=metadata_array) > > When you say that I should do rma() using target="core" or paCalls() > using "DAGB" in order to be consistent, should I do the following? > > dabgPS<- paCalls(data, "PSDABG") > dat.rma <- rma(data, target="core") > ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) > data.filtered <- dat.rma[ind,] > > Am I right? Yes, contingent upon your smallest group having five samples. Best, Jim > > Juan > > > --------------------------------------------------------------- > Juan Fernandez Tajes, ph. D > Grupo XENOMAR > Departamento de Biolog?a Celular y Molecular > Facultad de Ciencias-Universidade da Coru?a > Tlf. +34 981 167000 ext 2030 > e-mail: jfernandezt at udc.es > ---------------------------------------------------------------- > > > -------------------------------------------------------------------- ---- > *De: *"James W. MacDonald" <jmacdon at="" uw.edu=""> > *Para: *"Juan Fern?ndez Tajes" <jfernandezt at="" udc.es=""> > *CC: *bioconductor at r-project.org > *Enviados: *Lunes, 24 de Septiembre 2012 17:52:55 > *Asunto: *Re: [BioC] paCalls oligo package > > Hi Juan, > > On 9/24/2012 11:09 AM, Juan Fern?ndez Tajes wrote: > > Dear list, > > > > I would like to use the paCalls from oligo package for filtering > probe sets with absence of transcripts. My data are from Hugene 1.1 st > array (Affymetrix) > > My data after reading CEL files is a GeneFeatureSet with 1178100 > features and 23 samples > > > >> data > > GeneFeatureSet (storageMode: lockedEnvironment) > > assayData: 1178100 features, 23 samples > > element names: exprs > > protocolData > > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > > varLabels: exprs dates > > varMetadata: labelDescription channel > > phenoData > > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > > varLabels: Sample_ID INIBIC_ID ... Cluster2 (11 total) > > varMetadata: labelDescription > > featureData: none > > experimentData: use 'experimentData(object)' > > Annotation: pd.hugene.1.1.st.v1 > > > > I called the paCalls function: > > > >> dabgPS<- paCalls(data, "PSDABG") > > And I obtained a matrix of 257430x23, how can I used this > information to filter those probes without transcript? > > > > My aim is to obtain an average expression value in only those probes > with a "true" transcription. > > I would argue that you can't actually do this with microarray data. What > you can do is say if the probeset intensities for a given transcript are > significantly brighter than background probesets. I think that is a very > different thing from saying a transcript isn't expressed, but opinions > differ on that point. > > Please note that the matrix that paCalls() returns is made up of > p-values testing the hypothesis that the given probeset is not different > from background probesets (so a small p-value causes you to reject the > null hypothesis, and conclude that they *are* different). > > Also note that it appears you have summarized your data at the exon > level, whereas you ran paCalls at the transcript level. This won't work, > so you either have to do rma() using target = "core", or paCalls() using > "DAGB" in order to be consistent. Personally I wouldn't use rma(target = > "probeset") for the Gene ST arrays, because tons of the probesets only > have one probe at that summarization level. > > So the next question is what should you do with these data? You could > for instance say that at least N of the probesets for a given gene have > to have a p-value < 0.05, where N = the number of samples in the > smallest group you are comparing. That way, if the gene is transcribed > in at least one sample, you retain it (e.g., if a gene is transcribed in > one sample and not in any other, this is still an interesting result). > > Something like > > N <- 5 ## or whatever > ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) > data.filtered <- data[ind,] > > > Best, > > Jim > > > > > > Many thanks in advance > > > > Juan > > > > my sessionInfo is: > > > > R version 2.15.1 (2012-06-22) > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > > locale: > > [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] affy_1.34.0 pd.hugene.1.1.st.v1_3.6.0 genefilter_1.38.0 > > [4] limma_3.12.1 annotate_1.34.1 multtest_2.12.0 > > [7] oligo_1.20.4 oligoClasses_1.18.0 > hugene11sttranscriptcluster.db_4.0.1 > > [10] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 DBI_0.2-5 > > [13] AnnotationDbi_1.18.1 Biobase_2.16.0 BiocGenerics_0.2.0 > > > > loaded via a namespace (and not attached): > > [1] affxparser_1.28.1 affyio_1.24.0 BiocInstaller_1.4.7 > Biostrings_2.24.1 bit_1.1-8 > > [6] codetools_0.2-8 ff_2.2-7 foreach_1.4.0 IRanges_1.14.4 > iterators_1.0.6 > > [11] MASS_7.3-20 preprocessCore_1.18.0 splines_2.15.1 stats4_2.15.1 > survival_2.36-14 > > [16] tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 > > > > > > --------------------------------------------------------------- > > Juan Fernandez Tajes, ph. D > > Grupo XENOMAR > > Departamento de Biolog??a Celular y Molecular > > Facultad de Ciencias-Universidade da Coru??a > > Tlf. +34 981 167000 ext 2030 > > e-mail: jfernandezt at udc.es > > ---------------------------------------------------------------- > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
0
Entering edit mode
Hi Jim, Sorry for bothering you, but I need to clarify my code because I think IÂ´ve misunderstood something, here is the code that IÂ´m using: > dat <- affyGeneFS > sampleNames(dat) <- sub("\\.CEL$", "", sampleNames(myAB)) > metadata_array <- read.delim(file="metadata_array_oligo.txt", header=T, sep="\t") > rownames(metadata_array) <- metadata_array$Sample_ID > phenoData(dat) <- new("AnnotatedDataFrame", data=metadata_array) > dabgPS <- paCalls(dat, "PSDABG") > N <- 6 > ind <- apply(dabgPS, 1, function(x) sum(x<0.05) > N) > dat.filtered <- dat[ind,] > dat.rma <- rma(dat.filtered, target="core") Is this right? or Should do I change "PSDABG" by "DABG", i.e: > dabgS <- paCalls(dat, "DABG") > N <- 6 > ind <- apply(dabgS, 1, function(x) sum(x<0.05) > N) > dat.filtered <- dat[ind,] > dat.rma <- rma(dat.filtered, target="core") Many thanks in advance BW, Juan --------------------------------------------------------------- Juan Fernandez Tajes, ph. D Grupo XENOMAR Departamento de BiologÃ­a Celular y Molecular Facultad de Ciencias-Universidade da CoruÃ±a Tlf. +34 981 167000 ext 2030 e-mail: jfernandezt@udc.es ---------------------------------------------------------------- De: "James W. MacDonald" <jmacdon@uw.edu> Para: "Juan FernÃ¡ndez Tajes" <jfernandezt@udc.es> CC: bioconductor@r-project.org Enviados: Lunes, 24 de Septiembre 2012 18:27:42 Asunto: Re: [BioC] paCalls oligo package Hi Juan, On 9/24/2012 12:24 PM, Juan FernÃ¡ndez Tajes wrote: > Hi James, > > Many thanks for your answer, for instance, I want to check if the > probeset intensities for a given transcript are significantly brighter > than background probesets, but I wasnÂ´t able to explain it properly. I > have another question regarding your explanation: > > >Also note that it appears you have summarized your data at the exon > >level, whereas you ran paCalls at the transcript level. This won't work, > >so you either have to do rma() using target = "core", or paCalls() using > >"DAGB" in order to be consistent > > I've obtained my object "data" running the following code: > > geneCELs <- list.celfiles("./CEL", full.names=T) > affyGeneFS <- read.celfiles(geneCELs) > data <- affyGeneFS > sampleNames(data) <- sub("\\.CEL$", "", sampleNames(dat)) > metadata_array <- read.delim(file="metadata_array_oligo.txt", > header=T, sep="\t") > rownames(metadata_array) <- metadata_array$Sample_ID > phenoData(data) <- new("AnnotatedDataFrame", data=metadata_array) > > When you say that I should do rma() using target="core" or paCalls() > using "DAGB" in order to be consistent, should I do the following? > > dabgPS<- paCalls(data, "PSDABG") > dat.rma <- rma(data, target="core") > ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) > data.filtered <- dat.rma[ind,] > > Am I right? Yes, contingent upon your smallest group having five samples. Best, Jim > > Juan > > > --------------------------------------------------------------- > Juan Fernandez Tajes, ph. D > Grupo XENOMAR > Departamento de BiologÃ­a Celular y Molecular > Facultad de Ciencias-Universidade da CoruÃ±a > Tlf. +34 981 167000 ext 2030 > e-mail: jfernandezt@udc.es > ---------------------------------------------------------------- > > > -------------------------------------------------------------------- ---- > *De: *"James W. MacDonald" <jmacdon@uw.edu> > *Para: *"Juan FernÃ¡ndez Tajes" <jfernandezt@udc.es> > *CC: *bioconductor@r-project.org > *Enviados: *Lunes, 24 de Septiembre 2012 17:52:55 > *Asunto: *Re: [BioC] paCalls oligo package > > Hi Juan, > > On 9/24/2012 11:09 AM, Juan FernÃ¡ndez Tajes wrote: > > Dear list, > > > > I would like to use the paCalls from oligo package for filtering > probe sets with absence of transcripts. My data are from Hugene 1.1 st > array (Affymetrix) > > My data after reading CEL files is a GeneFeatureSet with 1178100 > features and 23 samples > > > >> data > > GeneFeatureSet (storageMode: lockedEnvironment) > > assayData: 1178100 features, 23 samples > > element names: exprs > > protocolData > > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > > varLabels: exprs dates > > varMetadata: labelDescription channel > > phenoData > > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > > varLabels: Sample_ID INIBIC_ID ... Cluster2 (11 total) > > varMetadata: labelDescription > > featureData: none > > experimentData: use 'experimentData(object)' > > Annotation: pd.hugene.1.1.st.v1 > > > > I called the paCalls function: > > > >> dabgPS<- paCalls(data, "PSDABG") > > And I obtained a matrix of 257430x23, how can I used this > information to filter those probes without transcript? > > > > My aim is to obtain an average expression value in only those probes > with a "true" transcription. > > I would argue that you can't actually do this with microarray data. What > you can do is say if the probeset intensities for a given transcript are > significantly brighter than background probesets. I think that is a very > different thing from saying a transcript isn't expressed, but opinions > differ on that point. > > Please note that the matrix that paCalls() returns is made up of > p-values testing the hypothesis that the given probeset is not different > from background probesets (so a small p-value causes you to reject the > null hypothesis, and conclude that they *are* different). > > Also note that it appears you have summarized your data at the exon > level, whereas you ran paCalls at the transcript level. This won't work, > so you either have to do rma() using target = "core", or paCalls() using > "DAGB" in order to be consistent. Personally I wouldn't use rma(target = > "probeset") for the Gene ST arrays, because tons of the probesets only > have one probe at that summarization level. > > So the next question is what should you do with these data? You could > for instance say that at least N of the probesets for a given gene have > to have a p-value < 0.05, where N = the number of samples in the > smallest group you are comparing. That way, if the gene is transcribed > in at least one sample, you retain it (e.g., if a gene is transcribed in > one sample and not in any other, this is still an interesting result). > > Something like > > N <- 5 ## or whatever > ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) > data.filtered <- data[ind,] > > > Best, > > Jim > > > > > > Many thanks in advance > > > > Juan > > > > my sessionInfo is: > > > > R version 2.15.1 (2012-06-22) > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > > locale: > > [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8 > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] affy_1.34.0 pd.hugene.1.1.st.v1_3.6.0 genefilter_1.38.0 > > [4] limma_3.12.1 annotate_1.34.1 multtest_2.12.0 > > [7] oligo_1.20.4 oligoClasses_1.18.0 > hugene11sttranscriptcluster.db_4.0.1 > > [10] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 DBI_0.2-5 > > [13] AnnotationDbi_1.18.1 Biobase_2.16.0 BiocGenerics_0.2.0 > > > > loaded via a namespace (and not attached): > > [1] affxparser_1.28.1 affyio_1.24.0 BiocInstaller_1.4.7 > Biostrings_2.24.1 bit_1.1-8 > > [6] codetools_0.2-8 ff_2.2-7 foreach_1.4.0 IRanges_1.14.4 > iterators_1.0.6 > > [11] MASS_7.3-20 preprocessCore_1.18.0 splines_2.15.1 stats4_2.15.1 > survival_2.36-14 > > [16] tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 > > > > > > --------------------------------------------------------------- > > Juan Fernandez Tajes, ph. D > > Grupo XENOMAR > > Departamento de BiologÃÂ­a Celular y Molecular > > Facultad de Ciencias-Universidade da CoruÃÂ±a > > Tlf. +34 981 167000 ext 2030 > > e-mail: jfernandezt@udc.es > > ---------------------------------------------------------------- > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 [[alternative HTML version deleted]]
0
Entering edit mode
Hi Juan, On 9/25/2012 7:16 AM, Juan Fern?ndez Tajes wrote: > Hi Jim, > > Sorry for bothering you, but I need to clarify my code because I think > I?ve misunderstood something, here is the code that I?m using: > > > dat <- affyGeneFS > > sampleNames(dat) <- sub("\\.CEL$", "", sampleNames(myAB)) > > metadata_array <- read.delim(file="metadata_array_oligo.txt", > header=T, sep="\t") > > rownames(metadata_array) <- metadata_array$Sample_ID > > phenoData(dat) <- new("AnnotatedDataFrame", data=metadata_array) So I don't know what this ^^^^^^ business is all about. But it appears to me that you are trying to do something tricky when simple is the best recourse. dat <- read.celfiles(filenames = list.celfiles()) ## if celfiles aren't in working dir, add path to list.celfiles. eset <- rma(dat) dagbPS <- paCalls(dat, "PSDAGB") ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > 6) eset.filtered <- eset[ind,] Best, Jim > > > dabgPS <- paCalls(dat, "PSDABG") > > N <- 6 > > ind <- apply(dabgPS, 1, function(x) sum(x<0.05) > N) > > dat.filtered <- dat[ind,] > > dat.rma <- rma(dat.filtered, target="core") > > Is this right? or Should do I change "PSDABG" by "DABG", i.e: > > > dabgS <- paCalls(dat, "DABG") > > N <- 6 > > ind <- apply(dabgS, 1, function(x) sum(x<0.05) > N) > > dat.filtered <- dat[ind,] > > dat.rma <- rma(dat.filtered, target="core") > > Many thanks in advance > > BW, > > Juan > > > --------------------------------------------------------------- > Juan Fernandez Tajes, ph. D > Grupo XENOMAR > Departamento de Biolog?a Celular y Molecular > Facultad de Ciencias-Universidade da Coru?a > Tlf. +34 981 167000 ext 2030 > e-mail: jfernandezt at udc.es > ---------------------------------------------------------------- > > > -------------------------------------------------------------------- ---- > *De: *"James W. MacDonald" <jmacdon at="" uw.edu=""> > *Para: *"Juan Fern?ndez Tajes" <jfernandezt at="" udc.es=""> > *CC: *bioconductor at r-project.org > *Enviados: *Lunes, 24 de Septiembre 2012 18:27:42 > *Asunto: *Re: [BioC] paCalls oligo package > > Hi Juan, > > On 9/24/2012 12:24 PM, Juan Fern?ndez Tajes wrote: > > Hi James, > > > > Many thanks for your answer, for instance, I want to check if the > > probeset intensities for a given transcript are significantly brighter > > than background probesets, but I wasn?t able to explain it properly. I > > have another question regarding your explanation: > > > > >Also note that it appears you have summarized your data at the exon > > >level, whereas you ran paCalls at the transcript level. This won't > work, > > >so you either have to do rma() using target = "core", or paCalls() > using > > >"DAGB" in order to be consistent > > > > I've obtained my object "data" running the following code: > > > > geneCELs <- list.celfiles("./CEL", full.names=T) > > affyGeneFS <- read.celfiles(geneCELs) > > data <- affyGeneFS > > sampleNames(data) <- sub("\\.CEL$", "", sampleNames(dat)) > > metadata_array <- read.delim(file="metadata_array_oligo.txt", > > header=T, sep="\t") > > rownames(metadata_array) <- metadata_array$Sample_ID > > phenoData(data) <- new("AnnotatedDataFrame", data=metadata_array) > > > > When you say that I should do rma() using target="core" or paCalls() > > using "DAGB" in order to be consistent, should I do the following? > > > > dabgPS<- paCalls(data, "PSDABG") > > dat.rma <- rma(data, target="core") > > ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) > > data.filtered <- dat.rma[ind,] > > > > Am I right? > > Yes, contingent upon your smallest group having five samples. > > Best, > > Jim > > > > > > Juan > > > > > > --------------------------------------------------------------- > > Juan Fernandez Tajes, ph. D > > Grupo XENOMAR > > Departamento de Biolog?a Celular y Molecular > > Facultad de Ciencias-Universidade da Coru?a > > Tlf. +34 981 167000 ext 2030 > > e-mail: jfernandezt at udc.es > > ---------------------------------------------------------------- > > > > > > ------------------------------------------------------------------ ------ > > *De: *"James W. MacDonald" <jmacdon at="" uw.edu=""> > > *Para: *"Juan Fern?ndez Tajes" <jfernandezt at="" udc.es=""> > > *CC: *bioconductor at r-project.org > > *Enviados: *Lunes, 24 de Septiembre 2012 17:52:55 > > *Asunto: *Re: [BioC] paCalls oligo package > > > > Hi Juan, > > > > On 9/24/2012 11:09 AM, Juan Fern?ndez Tajes wrote: > > > Dear list, > > > > > > I would like to use the paCalls from oligo package for filtering > > probe sets with absence of transcripts. My data are from Hugene 1.1 st > > array (Affymetrix) > > > My data after reading CEL files is a GeneFeatureSet with 1178100 > > features and 23 samples > > > > > >> data > > > GeneFeatureSet (storageMode: lockedEnvironment) > > > assayData: 1178100 features, 23 samples > > > element names: exprs > > > protocolData > > > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > > > varLabels: exprs dates > > > varMetadata: labelDescription channel > > > phenoData > > > rowNames: 10SE191_2 10SE207 ... 10SE360 (23 total) > > > varLabels: Sample_ID INIBIC_ID ... Cluster2 (11 total) > > > varMetadata: labelDescription > > > featureData: none > > > experimentData: use 'experimentData(object)' > > > Annotation: pd.hugene.1.1.st.v1 > > > > > > I called the paCalls function: > > > > > >> dabgPS<- paCalls(data, "PSDABG") > > > And I obtained a matrix of 257430x23, how can I used this > > information to filter those probes without transcript? > > > > > > My aim is to obtain an average expression value in only those probes > > with a "true" transcription. > > > > I would argue that you can't actually do this with microarray data. What > > you can do is say if the probeset intensities for a given transcript are > > significantly brighter than background probesets. I think that is a very > > different thing from saying a transcript isn't expressed, but opinions > > differ on that point. > > > > Please note that the matrix that paCalls() returns is made up of > > p-values testing the hypothesis that the given probeset is not different > > from background probesets (so a small p-value causes you to reject the > > null hypothesis, and conclude that they *are* different). > > > > Also note that it appears you have summarized your data at the exon > > level, whereas you ran paCalls at the transcript level. This won't work, > > so you either have to do rma() using target = "core", or paCalls() using > > "DAGB" in order to be consistent. Personally I wouldn't use rma(target = > > "probeset") for the Gene ST arrays, because tons of the probesets only > > have one probe at that summarization level. > > > > So the next question is what should you do with these data? You could > > for instance say that at least N of the probesets for a given gene have > > to have a p-value < 0.05, where N = the number of samples in the > > smallest group you are comparing. That way, if the gene is transcribed > > in at least one sample, you retain it (e.g., if a gene is transcribed in > > one sample and not in any other, this is still an interesting result). > > > > Something like > > > > N <- 5 ## or whatever > > ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N) > > data.filtered <- data[ind,] > > > > > > Best, > > > > Jim > > > > > > > > > > Many thanks in advance > > > > > > Juan > > > > > > my sessionInfo is: > > > > > > R version 2.15.1 (2012-06-22) > > > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > > > > > locale: > > > [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8 > > > > > > attached base packages: > > > [1] stats graphics grDevices utils datasets methods base > > > > > > other attached packages: > > > [1] affy_1.34.0 pd.hugene.1.1.st.v1_3.6.0 genefilter_1.38.0 > > > [4] limma_3.12.1 annotate_1.34.1 multtest_2.12.0 > > > [7] oligo_1.20.4 oligoClasses_1.18.0 > > hugene11sttranscriptcluster.db_4.0.1 > > > [10] org.Hs.eg.db_2.7.1 RSQLite_0.11.1 DBI_0.2-5 > > > [13] AnnotationDbi_1.18.1 Biobase_2.16.0 BiocGenerics_0.2.0 > > > > > > loaded via a namespace (and not attached): > > > [1] affxparser_1.28.1 affyio_1.24.0 BiocInstaller_1.4.7 > > Biostrings_2.24.1 bit_1.1-8 > > > [6] codetools_0.2-8 ff_2.2-7 foreach_1.4.0 IRanges_1.14.4 > > iterators_1.0.6 > > > [11] MASS_7.3-20 preprocessCore_1.18.0 splines_2.15.1 stats4_2.15.1 > > survival_2.36-14 > > > [16] tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 > > > > > > > > > --------------------------------------------------------------- > > > Juan Fernandez Tajes, ph. D > > > Grupo XENOMAR > > > Departamento de Biolog??a Celular y Molecular > > > Facultad de Ciencias-Universidade da Coru??a > > > Tlf. +34 981 167000 ext 2030 > > > e-mail: jfernandezt at udc.es > > > ---------------------------------------------------------------- > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099