Question: Analysis of Affymetrix Human Gene 2.0 ST arrays
Guest User wrote (5.9 years ago):
Dear all,

I am analyzing a set of Affymetrix Human Gene 2.0 ST arrays. This is my first time working with this type of array, so I have a few general questions. I would very much appreciate any advice you could give.

(1) I have obtained several lists of differentially expressed genes (using eBayes() from limma). Some control transcripts are showing up in those lists (e.g. the normgene -> intron category, among others). I was not expecting this type of transcript at this point. In theory, no control transcripts should appear after normalization, am I right? Have you experienced this? I have read that one possibility is to use getMainProbes before the topTable selection, but I wonder whether something could be wrong from the beginning with my normalization process (I used rma() at the transcript level, from oligo). What is your opinion?

(2) This type of array also includes lincRNA transcripts, and I would like to include them in my analysis. However, I am using hugene20sttranscriptcluster.db for annotation and these lincRNAs are not included. Would this library be able to handle them?

(3) I tried to make my own annotation package through makeDBPackage, based on the .csv annotation file from Affy, but I got an error:

Error in [.data.frame(csvFile, , GenBank IDName) : undefined columns selected

I have already read on this mailing list that makeDBPackage may expect an HGU133plus2 annotation "style". Would the AnnotationForge library be able to handle this?

Many thanks in advance for any help!
María Maqueda
Biomedical Engineering Research Centre (CREB)
Universitat Politècnica de Catalunya (UPC)

-- output of sessionInfo():

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] human.db0_2.9.0                       AnnotationForge_1.2.2
 [3] hugene20sttranscriptcluster.db_2.12.1 org.Hs.eg.db_2.9.0
 [5] AnnotationDbi_1.22.6                  BiocInstaller_1.12.0
 [7] limma_3.16.8                          pd.hugene.2.0.st_3.8.0
 [9] oligo_1.24.2                          Biobase_2.20.1
[11] oligoClasses_1.22.0                   BiocGenerics_0.6.0
[13] RSQLite_0.11.4                        DBI_0.2-7

loaded via a namespace (and not attached):
 [1] affxparser_1.32.3     affyio_1.28.0         annotate_1.38.0
 [4] Biostrings_2.28.0     bit_1.1-10            codetools_0.2-8
 [7] ff_2.2-12             foreach_1.4.1         genefilter_1.42.0
[10] GenomicRanges_1.12.5  IRanges_1.18.4        iterators_1.0.6
[13] preprocessCore_1.22.0 splines_3.0.1         stats4_3.0.1
[16] survival_2.37-4       tools_3.0.1           XML_3.98-1.1
[19] xtable_1.7-1          zlibbioc_1.6.0

-- Sent via the guest posting facility at bioconductor.org.
Modified 5.9 years ago by James W. MacDonald • written 5.9 years ago by Guest User
Answer: Analysis of Affymetrix Human Gene 2.0 ST arrays
James W. MacDonald (United States) wrote, 5.9 years ago:
Hi Maria,

On 11/29/2013 6:18 AM, María Maqueda [guest] wrote:
> (1) I have obtained different lists of differentially expressed genes (using eBayes() from limma). In those lists, some control transcripts are popping up (i.e. normgene -> intron category among other categories). I was not expecting this type of transcripts at this point. In theory after normalization, no control transcripts should appear, am I right? Have you experienced this?
> I have read that one possibility is to use getMainProbes before topTable selection but I wonder if there could be something wrong from the beginning with my normalization process (I have used rma() - transcript level - from oligo). What is your opinion?

I don't think it has anything to do with the normalization. Instead, I think it is a combination of poorly designed probes and highly expressed genes for which there are enough unprocessed mRNA transcripts with their introns still intact. Remember that sample processing stops all enzymatic activity very quickly as a first step, so any mRNA that is still being transcribed, or has only just finished transcription, will likely still contain introns.

> (2) This type of arrays also includes lincRNA transcripts and I am interested in considering them for my analysis. The thing is that I am using hugene20sttranscriptcluster.db for annotation and these lincRNA are not included. Would this library be able to handle them?

Hypothetically yes; as of now, not really. It doesn't seem that many of them have been annotated with Entrez Gene IDs, and until that happens they won't appear in the annotation packages. And even for those that do have Entrez Gene IDs, the information stops there: you go to NCBI and it just says that the lincRNA is supposed to exist, but nothing else.

> (3) I tried to make my own annotation package thru makeDBPackage based on .csv annotation file from Affy but I got an error: Error in [.data.frame(csvFile, , GenBank IDName) : undefined columns selected
> I have already read in this mailing list that makeDBPackage may expect a HGU133plus2 annotation "style". Would the library annotationForge be able to handle this?

AnnotationForge cannot handle the csv files for these arrays directly, as they are completely different from those for the old-style 3'-biased arrays like the hgu133plus2 that you mention. I have a function I can give you to make the input file for the annotation package, but I don't think it is worth it, because it is the same function I already used to make the annotation package you can get from BioC. So you would go through all that effort to make something you can already get.

But if you want it, I will send it to you.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
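The getMainProbes route mentioned in the question could look roughly like the sketch below. This is not Jim's own code: it assumes that getMainProbes comes from the affycoretools package and accepts the ExpressionSet returned by oligo's rma(), and it assumes a simple two-group comparison. The names fit_main_probes, cel_dir and group are hypothetical.

```r
# Sketch only: drop control probesets before the limma fit.
# Assumes the oligoClasses, oligo, affycoretools and limma packages are
# installed; fit_main_probes, cel_dir and group are hypothetical names.
fit_main_probes <- function(cel_dir, group) {
  # Read the CEL files and summarize at the transcript ("core") level
  cel_files <- oligoClasses::list.celfiles(cel_dir, full.names = TRUE)
  raw <- oligo::read.celfiles(cel_files)
  eset <- oligo::rma(raw, target = "core")

  # Keep only probesets annotated as 'main'; this removes control
  # categories such as normgene->intron and rescue
  eset_main <- affycoretools::getMainProbes(eset)

  # Two-group comparison on the filtered data
  group <- factor(group)
  stopifnot(nlevels(group) == 2, length(group) == ncol(eset_main))
  design <- stats::model.matrix(~ group)
  fit <- limma::eBayes(limma::lmFit(eset_main, design))

  # Full ranked table; coefficient 2 is the group difference
  limma::topTable(fit, coef = 2, number = Inf)
}
```

Filtering before the fit also reduces the number of tests that enter the multiple-comparison adjustment, which is one reason to remove the controls up front rather than deleting them from the final table.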
Hi Jim,

Many thanks for your quick and very comprehensive response. Following up on your comments, I have one more related question:

(1) I understand your comments about the intron control transcripts, but I do not fully understand the rescue transcript category, which also appears among the transcripts in my topTable.

No need to send the function, but thanks in any case for offering.

Regards,

Maria
Hi Maria,

Please don't take messages off-list (use Reply-all). We like to think of the list archives as a repository of information that people can search, and if messages become private, that hampers the usefulness of the archives.

On 11/29/2013 1:38 PM, María Maqueda González wrote:
> (1) I understand your comments about the intron control transcripts, but I do not fully understand the rescue transcript category that I have also obtained in my topTable transcripts.

There are two things to think about here.

First, there is the issue of statistical significance versus biological significance. Note that the t-statistic is a fraction: the numerator is the difference between the means of the two groups, and the denominator is the standard error of that difference. The standard error is based on the intra-group variability. So if the intra-group variability for a particular probeset is extremely small, you can end up with a statistically significant result even if the fold change isn't very large at all. The eBayes step is intended to protect against this to some extent, by adjusting 'too small' standard errors towards the overall variance estimate, but protecting against something and completely eliminating it are two different things. So it may be that the differences for these controls aren't that great, and it is just happenstance that the intra-group variance is small enough to reach statistical significance. One thing you can do to protect against that sort of thing is to filter out probesets that don't really change expression very much in any samples (or just use getMainProbes and nuke all these controls in the first place, which is what I would do).

Second, just because something shows up in a topTable doesn't mean it is actually differentially expressed. I don't know how you are adjusting for multiple comparisons, but let's just assume you are using FDR. If you then take the probesets with an FDR < 0.05, you are accepting that up to 5% of the probesets in that list are expected to be false positives. In other words, roughly 5% of the probesets in that table may not really be differentially expressed; they just happen to have a large t-statistic by chance. Thus, the rescue probeset(s) that you have might just be false positives.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
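The two points above can be illustrated with a small base-R sketch. Everything here is made up for illustration: the expression values are invented, and the simulation uses simple z-tests with a Benjamini-Hochberg adjustment as a stand-in, not limma's moderated t-statistics.

```r
set.seed(1)

# (a) A small difference gives a large t-statistic when the
#     within-group variability is very small.
a <- c(8.00, 8.01, 8.02)  # log2 expression, group A (invented values)
b <- c(8.10, 8.11, 8.12)  # log2 expression, group B (invented values)

diff_means <- mean(b) - mean(a)            # numerator: 0.1 on the log2 scale
pooled_var <- (var(a) + var(b)) / 2        # pooled variance, equal group sizes
std_err <- sqrt(pooled_var * (1/3 + 1/3))  # denominator: SE of the difference
t_stat <- diff_means / std_err             # ordinary two-sample t-statistic

# (b) A list selected at FDR < 0.05 can still contain false positives.
n_null <- 900  # probesets with no true difference
n_true <- 100  # probesets with a real difference
z <- c(rnorm(n_null), rnorm(n_true, mean = 4))
is_null <- rep(c(TRUE, FALSE), c(n_null, n_true))

p <- 2 * pnorm(-abs(z))            # two-sided p-values
fdr <- p.adjust(p, method = "BH")  # Benjamini-Hochberg adjusted p-values
selected <- fdr < 0.05

false_pos <- sum(selected & is_null)      # nulls that made it into the list
fdp <- false_pos / max(sum(selected), 1)  # observed false discovery proportion
```

In part (a) the difference is only 0.1 on the log2 scale (a fold change of about 1.07), yet t_stat is large because the standard error is tiny. In part (b), fdp is the fraction of the selected list that is actually null; the FDR threshold controls its expected value, not its value in any single experiment.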

Hi Jim, I've just seen that there are many non-coding genes, such as miRNAs, in the latest CSV annotation files on the Affy website that are not present (or that I am not able to find) in the annotation package from BioC.

As an example, the transcript ID TC01002905.hg.1 corresponds to microRNA 137 according to the HTA-2_0.na36.hg19.transcript.csv annotation file, but I am not able to find it with the lookUp function when I use the hta20sttranscriptcluster.db package from BioC.

Could you tell me any way to circumvent this issue?

Thanks a lot
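
One possible way to check such IDs, sketched under stated assumptions, is to look them up directly in the vendor CSV instead of the annotation package. The function name lookup_transcripts is hypothetical, and the column names transcript_cluster_id and gene_assignment, the leading "#" comment lines, and the "---" placeholder for missing values are assumptions about the file layout that should be checked against the actual file.

```r
# Sketch only: look up transcript cluster IDs directly in the vendor CSV.
# lookup_transcripts is a hypothetical helper; the default column names
# are assumptions about the Affymetrix transcript annotation file.
lookup_transcripts <- function(csv_path, ids,
                               id_col = "transcript_cluster_id",
                               info_col = "gene_assignment") {
  # Header comment lines are assumed to start with '#';
  # '---' is assumed to mark a missing value
  annot <- read.csv(csv_path, comment.char = "#",
                    stringsAsFactors = FALSE, na.strings = "---")
  stopifnot(id_col %in% names(annot), info_col %in% names(annot))

  # Keep only the requested IDs and the two columns of interest
  hits <- annot[annot[[id_col]] %in% ids, c(id_col, info_col), drop = FALSE]
  rownames(hits) <- NULL
  hits
}
```

For the example in this post, the call would be lookup_transcripts("HTA-2_0.na36.hg19.transcript.csv", "TC01002905.hg.1"). The result could then be compared with what the annotation package returns for the same ID to see which transcripts are missing there.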