Hello,
I am currently working with data from 16 HTA 2.0 microarrays, which I have normalized with RMA using the following commands in R:
# create and verify a list of the celfiles before processing
celFiles <- list.celfiles()
# read in celfiles and verify
rawData <- read.celfiles(celFiles)
# for genes, can only pull core probeset summaries; for exons, can pull core, full, or extended
eset <- rma(rawData, target='core')
# connect annotation file to data set
con <- db(pd.hta.2.0)
# list the types of probesets in the dataset
dbGetQuery(con, "select * from type_dict;")
When I proceed to filtering, I am running into difficulty. I am doing the following:
antig <- dbGetQuery(con, "select core_mps.meta_fsetid from core_mps inner join featureSet on core_mps.fsetid=featureSet.man_fsetid where featureSet.type = '2';")
I do get a list in antig, but I would expect the entries to be in the general format aa000000000.hg.1.
Instead I get a list like this:
dbGetQuery(con, "select core_mps.meta_fsetid from core_mps inner join featureSet on core_mps.fsetid=featureSet.man_fsetid where featureSet.type = '2';")
   meta_fsetid
1     18677993
2     18677994
3     18677995
4     18677996
5     18677997
6     18677998
7     18677999
8     18678000
9     18678001
10    18678002
11    18678003
12    18678004
13    18678005
14    18678006
15    18678007
16    18678008
17    18678009
18    18678010
19    18678011
20    18678012
21    18678013
22    18678014
23    18678015
Strangely, when I look at all the columns in core_mps, I get:
> dbGetQuery(con, "select * from core_mps inner join featureSet on core_mps.fsetid=featureSet.man_fsetid where featureSet.type = '2';")
   meta_fsetid transcript_cluster_id   fsetid fsetid man_fsetid strand start stop transcript_cluster_id exon_id crosshyb_type level
 1    18677993     AFFX-BkGr-GC03_at 18677993   5054   18677993     NA    NA   NA                    NA      NA            NA    NA
 2    18677994     AFFX-BkGr-GC04_at 18677994   5055   18677994     NA    NA   NA                    NA      NA            NA    NA
 3    18677995     AFFX-BkGr-GC05_at 18677995   5056   18677995     NA    NA   NA                    NA      NA            NA    NA
 4    18677996     AFFX-BkGr-GC06_at 18677996   5057   18677996     NA    NA   NA                    NA      NA            NA    NA
 5    18677997     AFFX-BkGr-GC07_at 18677997   5058   18677997     NA    NA   NA                    NA      NA            NA    NA
 6    18677998     AFFX-BkGr-GC08_at 18677998   5059   18677998     NA    NA   NA                    NA      NA            NA    NA
 7    18677999     AFFX-BkGr-GC09_at 18677999   5060   18677999     NA    NA   NA                    NA      NA            NA    NA
 8    18678000     AFFX-BkGr-GC10_at 18678000   5061   18678000     NA    NA   NA                    NA      NA            NA    NA
 9    18678001     AFFX-BkGr-GC11_at 18678001   5062   18678001     NA    NA   NA                    NA      NA            NA    NA
10    18678002     AFFX-BkGr-GC12_at 18678002   5063   18678002     NA    NA   NA                    NA      NA            NA    NA
11    18678003     AFFX-BkGr-GC13_at 18678003   5064   18678003     NA    NA   NA                    NA      NA            NA    NA
12    18678004     AFFX-BkGr-GC14_at 18678004   5065   18678004     NA    NA   NA                    NA      NA            NA    NA
13    18678005     AFFX-BkGr-GC15_at 18678005   5066   18678005     NA    NA   NA                    NA      NA            NA    NA
14    18678006     AFFX-BkGr-GC16_at 18678006   5067   18678006     NA    NA   NA                    NA      NA            NA    NA
15    18678007     AFFX-BkGr-GC17_at 18678007   5068   18678007     NA    NA   NA                    NA      NA            NA    NA
16    18678008     AFFX-BkGr-GC18_at 18678008   5069   18678008     NA    NA   NA                    NA      NA            NA    NA
17    18678009     AFFX-BkGr-GC19_at 18678009   5070   18678009     NA    NA   NA                    NA      NA            NA    NA
18    18678010     AFFX-BkGr-GC20_at 18678010   5071   18678010     NA    NA   NA                    NA      NA            NA    NA
19    18678011     AFFX-BkGr-GC21_at 18678011   5072   18678011     NA    NA   NA                    NA      NA            NA    NA
20    18678012     AFFX-BkGr-GC22_at 18678012   5073   18678012     NA    NA   NA                    NA      NA            NA    NA
21    18678013     AFFX-BkGr-GC23_at 18678013   5074   18678013     NA    NA   NA                    NA      NA            NA    NA
22    18678014     AFFX-BkGr-GC24_at 18678014   5075   18678014     NA    NA   NA                    NA      NA            NA    NA
23    18678015     AFFX-BkGr-GC25_at 18678015   5076   18678015     NA    NA   NA                    NA      NA            NA    NA
   junction_start_edge junction_stop_edge junction_sequence has_cds chrom type
 1                  NA                 NA              <NA>      NA    NA    2
 2                  NA                 NA              <NA>      NA    NA    2
 3                  NA                 NA              <NA>      NA    NA    2
 4                  NA                 NA              <NA>      NA    NA    2
 5                  NA                 NA              <NA>      NA    NA    2
 6                  NA                 NA              <NA>      NA    NA    2
 7                  NA                 NA              <NA>      NA    NA    2
 8                  NA                 NA              <NA>      NA    NA    2
 9                  NA                 NA              <NA>      NA    NA    2
10                  NA                 NA              <NA>      NA    NA    2
11                  NA                 NA              <NA>      NA    NA    2
12                  NA                 NA              <NA>      NA    NA    2
13                  NA                 NA              <NA>      NA    NA    2
14                  NA                 NA              <NA>      NA    NA    2
15                  NA                 NA              <NA>      NA    NA    2
16                  NA                 NA              <NA>      NA    NA    2
17                  NA                 NA              <NA>      NA    NA    2
18                  NA                 NA              <NA>      NA    NA    2
19                  NA                 NA              <NA>      NA    NA    2
20                  NA                 NA              <NA>      NA    NA    2
21                  NA                 NA              <NA>      NA    NA    2
22                  NA                 NA              <NA>      NA    NA    2
23                  NA                 NA              <NA>      NA    NA    2
But when I just query core_mps by itself, I get a list with the terms in a very different format.
dbGetQuery(con, "select * from core_mps limit 10")
       meta_fsetid transcript_cluster_id   fsetid
1  TC01000001.hg.1       TC01000001.hg.1 19021059
2  TC01000001.hg.1       TC01000001.hg.1 19021060
3  TC01000001.hg.1       TC01000001.hg.1 19021061
4  TC01000001.hg.1       TC01000001.hg.1 19021062
5  TC01000001.hg.1       TC01000001.hg.1 19021063
6  TC01000002.hg.1       TC01000002.hg.1 19021064
7  TC01000002.hg.1       TC01000002.hg.1 19021065
8  TC01000002.hg.1       TC01000002.hg.1 19021066
9  TC01000002.hg.1       TC01000002.hg.1 19021067
10 TC01000002.hg.1       TC01000002.hg.1 19021068
I realize that in the join query above I am selecting just the type 2 probesets (the antigenomic ones), but how do I get it to output a list of meta_fsetid values that are not just 8-digit numbers? For the list to match what is in my RMA output from my microarray CEL files, I need meta_fsetid values like those returned by dbGetQuery(con, "select * from core_mps limit 10").
My R sessionInfo is:
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] limma_3.28.17       pd.hta.2.0_3.12.1   RSQLite_1.0.0       DBI_0.5             oligo_1.36.1        Biostrings_2.40.2
 [7] XVector_0.12.1      IRanges_2.6.1       S4Vectors_0.10.2    Biobase_2.32.0      oligoClasses_1.34.0 BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] affxparser_1.44.0          GenomicRanges_1.24.2       splines_3.3.1              zlibbioc_1.18.0
 [5] bit_1.1-12                 foreach_1.4.3              GenomeInfoDb_1.8.3         tools_3.3.1
 [9] SummarizedExperiment_1.2.3 ff_2.2-13                  iterators_1.0.8            preprocessCore_1.34.0
[13] affyio_1.42.0              codetools_0.2-14           BiocInstaller_1.22.3
Much thanks!
Susan Munster, Research Geneticist
Functional Genomics Group
Civil Aerospace Medical Institute, AAM-612
6500 S. MacArthur Blvd.
Oklahoma City OK 73169
405-954-8631
To comment on a post, please use the ADD COMMENT button and type in the dialog box that pops up. Using the answer box is confusing for future readers, as you aren't actually answering a question.
Anyway, to answer your question, consider the following.
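As a rough sketch of one way to check this yourself (base R only, on toy data, and not necessarily the commands used for this answer): tabulate, per probeset type, how many IDs actually show up among the summarized feature names. The helper name count_summarized_by_type is hypothetical. The toy IDs are copied from the output in the question, the type code 1 for the TC rows is an assumption, and the toy summarized_ids vector simply stands in for featureNames(eset).

```r
# Toy stand-ins for the real objects (values copied from the question's output).
# ASSUMPTION: the TC... rows are given type 1 here purely for illustration.
ids_by_type <- data.frame(
  transcript_cluster_id = c("AFFX-BkGr-GC03_at", "AFFX-BkGr-GC04_at",
                            "TC01000001.hg.1", "TC01000002.hg.1"),
  type = c(2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)
# ASSUMPTION: stands in for featureNames(eset) after rma(rawData, target = 'core').
summarized_ids <- c("TC01000001.hg.1", "TC01000002.hg.1")

# Hypothetical helper: cross-tabulate probeset type against presence
# in the summarized data.
count_summarized_by_type <- function(ids_by_type, summarized_ids) {
  present <- ids_by_type$transcript_cluster_id %in% summarized_ids
  table(type = ids_by_type$type,
        summarized = factor(present, levels = c(FALSE, TRUE)))
}

res <- count_summarized_by_type(ids_by_type, summarized_ids)
print(res)

# In a real session (untested here; the join mirrors the query in the question):
# ids_by_type <- dbGetQuery(con, "select distinct core_mps.transcript_cluster_id, featureSet.type from core_mps inner join featureSet on core_mps.fsetid = featureSet.man_fsetid;")
# summarized_ids <- featureNames(eset)
```

If a type has a zero in the TRUE column, no probeset of that type made it into the summarized data, so there is nothing of that type to filter out of eset.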
So none of the antigenomic probes actually get summarized into probesets at the transcript summary level. In fact, only main and NA type probesets (which are like, mysterious and stuff! If you search for them on NetAffx, no results are returned...) get summarized. Using getMainProbes from my affycoretools package: