pd.hugene.2.0.st missing normgene->exon mappings

0

Entering edit mode

Mark Cowley ▴ 910

@mark-cowley-2951

Last seen 9.6 years ago

Dear Benilton, James & Bioconductors, Thanks for providing the platform design packages for hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. I use SQL to query these packages & ultimately retain only 'main' probes in my analysis. This works well for 1.0 and 1.1 packages, but nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the normgene->exon control probes are misclassified as 'main' probes. evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st package lists 0, and assigns these 1626 probes to the 'main' category: # probe types: librarypd.hugene.2.0.st) conn <- dbpd.hugene.2.0.st) dbGetQuery(conn,"SELECT * from type_dict") type type_id 1 1 main 2 2 control->affx 3 3 control->chip 4 4 control->bgp->antigenomic 5 5 control->bgp->genomic 6 6 normgene->exon 7 7 normgene->intron 8 8 rescue->FLmRNA->unmapped 9 9 control->affx->bac_spike 10 10 oligo_spike_in 11 11 r1_bac_spike_at # probe counts for each of the probe categories: dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") type count(*) 1 NA 3728 2 1 345497 3 2 18 4 4 23 5 7 3575 6 9 18 NB: no type 6 probes. I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see this issue for all 2.0 and 2.1 arrays (see below) Can these mappings please be updated? PS, there's a bunch of probes with type = NA in the database. I haven't investigated these in any detail. cheers, Mark ----------------------------------------------------- Mark Cowley, PhD Genome Informatics Division & the Centre for Clinical Genomics The Kinghorn Cancer Centre, Garvan Institute of Medical Research, Sydney, Australia ----------------------------------------------------- All 12 packages below: pd.packages <- c( "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", "pd.hugene.2.1.st", "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", "pd.mogene.2.1.st", "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", "pd.ragene.2.1.st" ) a <- b <- list() forpd.pkg.name in pd.packages) { try({ requirepd.pkg.name, character.only=TRUE) || stop("Can't load the pd.package") conn <- dbgetpd.pkg.name)) a[[pd.pkg.name]] <- dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") b[[pd.pkg.name]] <- dbGetQuery(conn,"SELECT fsetid from featureSet WHERE type = 6")[,1] }) } dbGetQuery(conn,"SELECT * from type_dict") > a $pd.hugene.1.0.st.v1 type count(*) 1 NA 227 2 1 253002 3 2 57 4 4 45 5 6 1195 6 7 2904 $pd.hugene.1.1.st.v1 type count(*) 1 NA 227 2 1 253002 3 2 57 4 4 45 5 6 1195 6 7 2904 $pd.hugene.2.0.st type count(*) 1 NA 3728 2 1 345497 3 2 18 4 4 23 5 7 3575 6 9 18 $pd.hugene.2.1.st type count(*) 1 NA 3728 2 1 345497 3 2 18 4 4 23 5 7 3575 6 9 18 $pd.mogene.1.0.st.v1 type count(*) 1 NA 86 2 1 234878 3 2 21 4 4 45 5 6 1324 6 7 5222 $pd.mogene.1.1.st.v1 type count(*) 1 NA 86 2 1 234878 3 2 21 4 4 45 5 6 1324 6 7 5222 $pd.mogene.2.0.st type count(*) 1 NA 810 2 1 263551 3 2 18 4 4 23 5 7 5331 6 9 18 $pd.mogene.2.1.st type count(*) 1 NA 810 2 1 263551 3 2 18 4 4 23 5 7 5331 6 9 18 $pd.ragene.1.0.st.v1 type count(*) 1 NA 254 2 1 211195 3 2 21 4 4 45 5 6 399 6 7 1153 $pd.ragene.1.1.st.v1 type count(*) 1 NA 254 2 1 211195 3 2 21 4 4 45 5 6 399 6 7 1153 $pd.ragene.2.0.st type count(*) 1 NA 1071 2 1 214018 3 2 18 4 4 23 5 7 5083 6 9 18 $pd.ragene.2.1.st type count(*) 1 NA 1071 2 1 214018 3 2 18 4 4 23 5 7 5083 6 9 18 > sapply(b,length) pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st pd.hugene.2.1.st 1195 1195 0 0 pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st pd.mogene.2.1.st 1324 1324 0 0 pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st pd.ragene.2.1.st 399 399 0 0 > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 [13] oligo_1.24.0 Biobase_2.20.0 [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 [17] RSQLite_0.11.4 DBI_0.2-7 [19] BiocInstaller_1.10.2 loaded via a namespace (and not attached): [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 [[alternative HTML version deleted]]

Cancer probe Cancer probe • 3.1k views

ADD COMMENT • link updated 10.8 years ago by James W. MacDonald 65k • written 10.8 years ago by Mark Cowley ▴ 910

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

Hi Mark, Thanks for the heads-up. We already knew that Affy messed up the transcript and probeset annotation files (and had them fixed), but didn't think I needed to check the others. Famous last words, no? > x <- readPgf("HuGene-2_0-st.pgf") > table(x$probesetType) control->affx control->affx->bac_spike 18 18 control->affx->ercc_spike control->affx->polya_spike 92 39 control->bgp->antigenomic main 23 349012 normgene->intron reporter 3575 82 > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", comment.char = "#", stringsAsFactors=FALSE, header = TRUE) > table(y$category) control->affx control->affx->bac_spike 18 18 control->affx->ercc-spike control->affx->polya_spike 92 39 control->bgp->antigenomic main 23 44629 normgene->exon normgene->intron 1626 3575 reporter rescue 82 3515 I'll ping Affymetrix and see what they have to say. Best, Jim On 7/9/2013 3:29 AM, Mark Cowley wrote: > Dear Benilton, James& Bioconductors, > Thanks for providing the platform design packages for hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. > > I use SQL to query these packages& ultimately retain only 'main' probes in my analysis. This works well for 1.0 and 1.1 packages, but nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the normgene->exon control probes are misclassified as 'main' probes. > > evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st package lists 0, and assigns these 1626 probes to the 'main' category: > > # probe types: > librarypd.hugene.2.0.st) > conn<- dbpd.hugene.2.0.st) > dbGetQuery(conn,"SELECT * from type_dict") > type type_id > 1 1 main > 2 2 control->affx > 3 3 control->chip > 4 4 control->bgp->antigenomic > 5 5 control->bgp->genomic > 6 6 normgene->exon > 7 7 normgene->intron > 8 8 rescue->FLmRNA->unmapped > 9 9 control->affx->bac_spike > 10 10 oligo_spike_in > 11 11 r1_bac_spike_at > > # probe counts for each of the probe categories: > dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") > type count(*) > 1 NA 3728 > 2 1 345497 > 3 2 18 > 4 4 23 > 5 7 3575 > 6 9 18 > > NB: no type 6 probes. > I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see this issue for all 2.0 and 2.1 arrays (see below) > > Can these mappings please be updated? > > PS, there's a bunch of probes with type = NA in the database. I haven't investigated these in any detail. > > cheers, > Mark > ----------------------------------------------------- > Mark Cowley, PhD > > Genome Informatics Division& the Centre for Clinical Genomics > The Kinghorn Cancer Centre, Garvan Institute of Medical Research, Sydney, Australia > ----------------------------------------------------- > > All 12 packages below: > pd.packages<- c( > "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", "pd.hugene.2.1.st", > "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", "pd.mogene.2.1.st", > "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", "pd.ragene.2.1.st" > ) > > a<- b<- list() > forpd.pkg.name in pd.packages) { > try({ > requirepd.pkg.name, character.only=TRUE) || stop("Can't load the pd.package") > conn<- db(getpd.pkg.name)) > a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") > b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from featureSet WHERE type = 6")[,1] > }) > } > dbGetQuery(conn,"SELECT * from type_dict") > >> a > $pd.hugene.1.0.st.v1 > type count(*) > 1 NA 227 > 2 1 253002 > 3 2 57 > 4 4 45 > 5 6 1195 > 6 7 2904 > > $pd.hugene.1.1.st.v1 > type count(*) > 1 NA 227 > 2 1 253002 > 3 2 57 > 4 4 45 > 5 6 1195 > 6 7 2904 > > $pd.hugene.2.0.st > type count(*) > 1 NA 3728 > 2 1 345497 > 3 2 18 > 4 4 23 > 5 7 3575 > 6 9 18 > > $pd.hugene.2.1.st > type count(*) > 1 NA 3728 > 2 1 345497 > 3 2 18 > 4 4 23 > 5 7 3575 > 6 9 18 > > $pd.mogene.1.0.st.v1 > type count(*) > 1 NA 86 > 2 1 234878 > 3 2 21 > 4 4 45 > 5 6 1324 > 6 7 5222 > > $pd.mogene.1.1.st.v1 > type count(*) > 1 NA 86 > 2 1 234878 > 3 2 21 > 4 4 45 > 5 6 1324 > 6 7 5222 > > $pd.mogene.2.0.st > type count(*) > 1 NA 810 > 2 1 263551 > 3 2 18 > 4 4 23 > 5 7 5331 > 6 9 18 > > $pd.mogene.2.1.st > type count(*) > 1 NA 810 > 2 1 263551 > 3 2 18 > 4 4 23 > 5 7 5331 > 6 9 18 > > $pd.ragene.1.0.st.v1 > type count(*) > 1 NA 254 > 2 1 211195 > 3 2 21 > 4 4 45 > 5 6 399 > 6 7 1153 > > $pd.ragene.1.1.st.v1 > type count(*) > 1 NA 254 > 2 1 211195 > 3 2 21 > 4 4 45 > 5 6 399 > 6 7 1153 > > $pd.ragene.2.0.st > type count(*) > 1 NA 1071 > 2 1 214018 > 3 2 18 > 4 4 23 > 5 7 5083 > 6 9 18 > > $pd.ragene.2.1.st > type count(*) > 1 NA 1071 > 2 1 214018 > 3 2 18 > 4 4 23 > 5 7 5083 > 6 9 18 > >> sapply(b,length) > pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st pd.hugene.2.1.st > 1195 1195 0 0 > pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st pd.mogene.2.1.st > 1324 1324 0 0 > pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st pd.ragene.2.1.st > 399 399 0 0 > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 > [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 > [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 > [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 > [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 > [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 > [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 > [13] oligo_1.24.0 Biobase_2.20.0 > [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 > [17] RSQLite_0.11.4 DBI_0.2-7 > [19] BiocInstaller_1.10.2 > > loaded via a namespace (and not attached): > [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 > [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 > [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 > [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 > [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 > > > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.8 years ago James W. MacDonald 65k

0

Entering edit mode

Dear Jim, I did not really follow the discussion so I may be wrong, but if you mean that there is a difference between the number of 'main' types, please note that number of 'main' for pgf, i.e 349012, corresponds to the number of 'main' in the probeset annotation file and not in the transcript annotation file. But as I said, I may have misunderstood the problem. I am mainly replying because at the beginning of this year I had long discussions with DevNet to make sure that the annotation files for the 2.X arrays are correct, and in version na33.2 DevNet did correct everything what I have found. Best regards, Christian On 7/9/13 7:13 PM, James W. MacDonald wrote: > Hi Mark, > > Thanks for the heads-up. We already knew that Affy messed up the > transcript and probeset annotation files (and had them fixed), but > didn't think I needed to check the others. Famous last words, no? > > > x <- readPgf("HuGene-2_0-st.pgf") > > table(x$probesetType) > > control->affx control->affx->bac_spike > 18 18 > control->affx->ercc_spike control->affx->polya_spike > 92 39 > control->bgp->antigenomic main > 23 349012 > normgene->intron reporter > 3575 82 > > > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", > comment.char = "#", stringsAsFactors=FALSE, header = TRUE) > > table(y$category) > > control->affx control->affx->bac_spike > 18 18 > control->affx->ercc-spike control->affx->polya_spike > 92 39 > control->bgp->antigenomic main > 23 44629 > normgene->exon normgene->intron > 1626 3575 > reporter rescue > 82 3515 > > I'll ping Affymetrix and see what they have to say. > > Best, > > Jim > > > > On 7/9/2013 3:29 AM, Mark Cowley wrote: >> Dear Benilton, James& Bioconductors, >> Thanks for providing the platform design packages for >> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >> >> I use SQL to query these packages& ultimately retain only 'main' >> probes in my analysis. This works well for 1.0 and 1.1 packages, but >> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >> normgene->exon control probes are misclassified as 'main' probes. >> >> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >> package lists 0, and assigns these 1626 probes to the 'main' category: >> >> # probe types: >> librarypd.hugene.2.0.st) >> conn<- dbpd.hugene.2.0.st) >> dbGetQuery(conn,"SELECT * from type_dict") >> type type_id >> 1 1 main >> 2 2 control->affx >> 3 3 control->chip >> 4 4 control->bgp->antigenomic >> 5 5 control->bgp->genomic >> 6 6 normgene->exon >> 7 7 normgene->intron >> 8 8 rescue->FLmRNA->unmapped >> 9 9 control->affx->bac_spike >> 10 10 oligo_spike_in >> 11 11 r1_bac_spike_at >> >> # probe counts for each of the probe categories: >> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") >> type count(*) >> 1 NA 3728 >> 2 1 345497 >> 3 2 18 >> 4 4 23 >> 5 7 3575 >> 6 9 18 >> >> NB: no type 6 probes. >> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >> this issue for all 2.0 and 2.1 arrays (see below) >> >> Can these mappings please be updated? >> >> PS, there's a bunch of probes with type = NA in the database. I >> haven't investigated these in any detail. >> >> cheers, >> Mark >> ----------------------------------------------------- >> Mark Cowley, PhD >> >> Genome Informatics Division& the Centre for Clinical Genomics >> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >> Sydney, Australia >> ----------------------------------------------------- >> >> All 12 packages below: >> pd.packages<- c( >> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >> "pd.hugene.2.1.st", >> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >> "pd.mogene.2.1.st", >> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >> "pd.ragene.2.1.st" >> ) >> >> a<- b<- list() >> forpd.pkg.name in pd.packages) { >> try({ >> requirepd.pkg.name, character.only=TRUE) || stop("Can't load the >> pd.package") >> conn<- db(getpd.pkg.name)) >> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >> featureSet GROUP BY type") >> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from featureSet >> WHERE type = 6")[,1] >> }) >> } >> dbGetQuery(conn,"SELECT * from type_dict") >> >>> a >> $pd.hugene.1.0.st.v1 >> type count(*) >> 1 NA 227 >> 2 1 253002 >> 3 2 57 >> 4 4 45 >> 5 6 1195 >> 6 7 2904 >> >> $pd.hugene.1.1.st.v1 >> type count(*) >> 1 NA 227 >> 2 1 253002 >> 3 2 57 >> 4 4 45 >> 5 6 1195 >> 6 7 2904 >> >> $pd.hugene.2.0.st >> type count(*) >> 1 NA 3728 >> 2 1 345497 >> 3 2 18 >> 4 4 23 >> 5 7 3575 >> 6 9 18 >> >> $pd.hugene.2.1.st >> type count(*) >> 1 NA 3728 >> 2 1 345497 >> 3 2 18 >> 4 4 23 >> 5 7 3575 >> 6 9 18 >> >> $pd.mogene.1.0.st.v1 >> type count(*) >> 1 NA 86 >> 2 1 234878 >> 3 2 21 >> 4 4 45 >> 5 6 1324 >> 6 7 5222 >> >> $pd.mogene.1.1.st.v1 >> type count(*) >> 1 NA 86 >> 2 1 234878 >> 3 2 21 >> 4 4 45 >> 5 6 1324 >> 6 7 5222 >> >> $pd.mogene.2.0.st >> type count(*) >> 1 NA 810 >> 2 1 263551 >> 3 2 18 >> 4 4 23 >> 5 7 5331 >> 6 9 18 >> >> $pd.mogene.2.1.st >> type count(*) >> 1 NA 810 >> 2 1 263551 >> 3 2 18 >> 4 4 23 >> 5 7 5331 >> 6 9 18 >> >> $pd.ragene.1.0.st.v1 >> type count(*) >> 1 NA 254 >> 2 1 211195 >> 3 2 21 >> 4 4 45 >> 5 6 399 >> 6 7 1153 >> >> $pd.ragene.1.1.st.v1 >> type count(*) >> 1 NA 254 >> 2 1 211195 >> 3 2 21 >> 4 4 45 >> 5 6 399 >> 6 7 1153 >> >> $pd.ragene.2.0.st >> type count(*) >> 1 NA 1071 >> 2 1 214018 >> 3 2 18 >> 4 4 23 >> 5 7 5083 >> 6 9 18 >> >> $pd.ragene.2.1.st >> type count(*) >> 1 NA 1071 >> 2 1 214018 >> 3 2 18 >> 4 4 23 >> 5 7 5083 >> 6 9 18 >> >>> sapply(b,length) >> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >> pd.hugene.2.1.st >> 1195 1195 >> 0 0 >> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >> pd.mogene.2.1.st >> 1324 1324 >> 0 0 >> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >> pd.ragene.2.1.st >> 399 399 >> 0 0 >> >>> sessionInfo() >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >> [13] oligo_1.24.0 Biobase_2.20.0 >> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >> [17] RSQLite_0.11.4 DBI_0.2-7 >> [19] BiocInstaller_1.10.2 >> >> loaded via a namespace (and not attached): >> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >> >> >> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 10.8 years ago cstrato ★ 3.9k

0

Entering edit mode

Hi Christian, That's not the issue. Instead, the issue is that the pgf file lists the normgene->exon probeset IDs as being 'main'. I have received a response from Affy stating that the qcc file lists the normgene->exon probesets as pos_control, but that seems orthogonal to the issue at hand. > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", stringsAsFactors=F, header=T) > pgf <- readPgf("HuGene-2_0-st.pgf") > head(qcc) probeset_id group_name probeset_name quantification_in_header 1 16650001 neg_control 16650001 0 2 16650003 neg_control 16650003 0 ## get the positive controls (normgene->exon probesets) from the qcc file > pos_cont <- qcc[qcc[,2] == "pos_control",1] ## compare to pgf file > x <- pgf$probesetType[pgf$probesetId %in% pos_cont] > table(x) x main 1626 So in the pgf file, these probesets are being called 'main' instead of some sort of control. How do you handle this in xps? Do you use the pgf file? Best, Jim On 7/9/2013 2:06 PM, cstrato wrote: > Dear Jim, > > I did not really follow the discussion so I may be wrong, but if you > mean that there is a difference between the number of 'main' types, > please note that number of 'main' for pgf, i.e 349012, corresponds to > the number of 'main' in the probeset annotation file and not in the > transcript annotation file. > > But as I said, I may have misunderstood the problem. > > I am mainly replying because at the beginning of this year I had long > discussions with DevNet to make sure that the annotation files for the > 2.X arrays are correct, and in version na33.2 DevNet did correct > everything what I have found. > > Best regards, > Christian > > > On 7/9/13 7:13 PM, James W. MacDonald wrote: >> Hi Mark, >> >> Thanks for the heads-up. We already knew that Affy messed up the >> transcript and probeset annotation files (and had them fixed), but >> didn't think I needed to check the others. Famous last words, no? >> >> > x <- readPgf("HuGene-2_0-st.pgf") >> > table(x$probesetType) >> >> control->affx control->affx->bac_spike >> 18 18 >> control->affx->ercc_spike control->affx->polya_spike >> 92 39 >> control->bgp->antigenomic main >> 23 349012 >> normgene->intron reporter >> 3575 82 >> >> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >> > table(y$category) >> >> control->affx control->affx->bac_spike >> 18 18 >> control->affx->ercc-spike control->affx->polya_spike >> 92 39 >> control->bgp->antigenomic main >> 23 44629 >> normgene->exon normgene->intron >> 1626 3575 >> reporter rescue >> 82 3515 >> >> I'll ping Affymetrix and see what they have to say. >> >> Best, >> >> Jim >> >> >> >> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>> Dear Benilton, James& Bioconductors, >>> Thanks for providing the platform design packages for >>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>> >>> I use SQL to query these packages& ultimately retain only 'main' >>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>> normgene->exon control probes are misclassified as 'main' probes. >>> >>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>> package lists 0, and assigns these 1626 probes to the 'main' category: >>> >>> # probe types: >>> librarypd.hugene.2.0.st) >>> conn<- dbpd.hugene.2.0.st) >>> dbGetQuery(conn,"SELECT * from type_dict") >>> type type_id >>> 1 1 main >>> 2 2 control->affx >>> 3 3 control->chip >>> 4 4 control->bgp->antigenomic >>> 5 5 control->bgp->genomic >>> 6 6 normgene->exon >>> 7 7 normgene->intron >>> 8 8 rescue->FLmRNA->unmapped >>> 9 9 control->affx->bac_spike >>> 10 10 oligo_spike_in >>> 11 11 r1_bac_spike_at >>> >>> # probe counts for each of the probe categories: >>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") >>> type count(*) >>> 1 NA 3728 >>> 2 1 345497 >>> 3 2 18 >>> 4 4 23 >>> 5 7 3575 >>> 6 9 18 >>> >>> NB: no type 6 probes. >>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>> this issue for all 2.0 and 2.1 arrays (see below) >>> >>> Can these mappings please be updated? >>> >>> PS, there's a bunch of probes with type = NA in the database. I >>> haven't investigated these in any detail. >>> >>> cheers, >>> Mark >>> ----------------------------------------------------- >>> Mark Cowley, PhD >>> >>> Genome Informatics Division& the Centre for Clinical Genomics >>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>> Sydney, Australia >>> ----------------------------------------------------- >>> >>> All 12 packages below: >>> pd.packages<- c( >>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>> "pd.hugene.2.1.st", >>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>> "pd.mogene.2.1.st", >>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>> "pd.ragene.2.1.st" >>> ) >>> >>> a<- b<- list() >>> forpd.pkg.name in pd.packages) { >>> try({ >>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load the >>> pd.package") >>> conn<- db(getpd.pkg.name)) >>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>> featureSet GROUP BY type") >>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from featureSet >>> WHERE type = 6")[,1] >>> }) >>> } >>> dbGetQuery(conn,"SELECT * from type_dict") >>> >>>> a >>> $pd.hugene.1.0.st.v1 >>> type count(*) >>> 1 NA 227 >>> 2 1 253002 >>> 3 2 57 >>> 4 4 45 >>> 5 6 1195 >>> 6 7 2904 >>> >>> $pd.hugene.1.1.st.v1 >>> type count(*) >>> 1 NA 227 >>> 2 1 253002 >>> 3 2 57 >>> 4 4 45 >>> 5 6 1195 >>> 6 7 2904 >>> >>> $pd.hugene.2.0.st >>> type count(*) >>> 1 NA 3728 >>> 2 1 345497 >>> 3 2 18 >>> 4 4 23 >>> 5 7 3575 >>> 6 9 18 >>> >>> $pd.hugene.2.1.st >>> type count(*) >>> 1 NA 3728 >>> 2 1 345497 >>> 3 2 18 >>> 4 4 23 >>> 5 7 3575 >>> 6 9 18 >>> >>> $pd.mogene.1.0.st.v1 >>> type count(*) >>> 1 NA 86 >>> 2 1 234878 >>> 3 2 21 >>> 4 4 45 >>> 5 6 1324 >>> 6 7 5222 >>> >>> $pd.mogene.1.1.st.v1 >>> type count(*) >>> 1 NA 86 >>> 2 1 234878 >>> 3 2 21 >>> 4 4 45 >>> 5 6 1324 >>> 6 7 5222 >>> >>> $pd.mogene.2.0.st >>> type count(*) >>> 1 NA 810 >>> 2 1 263551 >>> 3 2 18 >>> 4 4 23 >>> 5 7 5331 >>> 6 9 18 >>> >>> $pd.mogene.2.1.st >>> type count(*) >>> 1 NA 810 >>> 2 1 263551 >>> 3 2 18 >>> 4 4 23 >>> 5 7 5331 >>> 6 9 18 >>> >>> $pd.ragene.1.0.st.v1 >>> type count(*) >>> 1 NA 254 >>> 2 1 211195 >>> 3 2 21 >>> 4 4 45 >>> 5 6 399 >>> 6 7 1153 >>> >>> $pd.ragene.1.1.st.v1 >>> type count(*) >>> 1 NA 254 >>> 2 1 211195 >>> 3 2 21 >>> 4 4 45 >>> 5 6 399 >>> 6 7 1153 >>> >>> $pd.ragene.2.0.st >>> type count(*) >>> 1 NA 1071 >>> 2 1 214018 >>> 3 2 18 >>> 4 4 23 >>> 5 7 5083 >>> 6 9 18 >>> >>> $pd.ragene.2.1.st >>> type count(*) >>> 1 NA 1071 >>> 2 1 214018 >>> 3 2 18 >>> 4 4 23 >>> 5 7 5083 >>> 6 9 18 >>> >>>> sapply(b,length) >>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>> pd.hugene.2.1.st >>> 1195 1195 >>> 0 0 >>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>> pd.mogene.2.1.st >>> 1324 1324 >>> 0 0 >>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>> pd.ragene.2.1.st >>> 399 399 >>> 0 0 >>> >>>> sessionInfo() >>> R version 3.0.0 (2013-04-03) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>> [13] oligo_1.24.0 Biobase_2.20.0 >>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>> [17] RSQLite_0.11.4 DBI_0.2-7 >>> [19] BiocInstaller_1.10.2 >>> >>> loaded via a namespace (and not attached): >>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>> >>> >>> >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 10.8 years ago James W. MacDonald 65k

0

Entering edit mode

Dear Jim, In xps I use as basic file for exon arrays the probeset annotation file and then compare the data to the data from the pgf-file. Any differences will be reported. I have just checked the different files for HuGene-2_0-st. If you check as an example the following probeset_ids: 16934607 16934608 16934609 16934610 Then you will see that the transcript annotation file lists these ids as 'normgene->exon' and 'pos_control'. However, the probeset annotation file lists these ids as 'main' belonging to gene EIF3D with transcript_cluster_id 16934583. Looking for this id in the transcript annotation file reveals that the number of 'total_probes' is 24. Indeed, the probeset annotation file lists 24 probesets including the four above mentioned probeset_ids. This means that although these four probesets are listed in the transcript annotation file as 'normgene->exon' the label 'main' in the pgf-file is correct since these probesets are part of the gene EIF3D. Interestingly, the pgf-file for HuGene-1_0-st has extra probesets listed as 'normgene->exon'. However, in this case these probesets are also listed as 'normgene->exon' in the probeset annotation file, i.e. these probesets do not belong to any transcript listed in the probeset annotation file. Best regards, Christian On 7/9/13 8:46 PM, James W. MacDonald wrote: > Hi Christian, > > That's not the issue. Instead, the issue is that the pgf file lists the > normgene->exon probeset IDs as being 'main'. I have received a response > from Affy stating that the qcc file lists the normgene->exon probesets > as pos_control, but that seems orthogonal to the issue at hand. > > > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", > stringsAsFactors=F, header=T) > > pgf <- readPgf("HuGene-2_0-st.pgf") > > head(qcc) > probeset_id group_name probeset_name quantification_in_header > 1 16650001 neg_control 16650001 0 > 2 16650003 neg_control 16650003 0 > > ## get the positive controls (normgene->exon probesets) from the qcc file > > pos_cont <- qcc[qcc[,2] == "pos_control",1] > > ## compare to pgf file > > x <- pgf$probesetType[pgf$probesetId %in% pos_cont] > > table(x) > x > main > 1626 > > So in the pgf file, these probesets are being called 'main' instead of > some sort of control. How do you handle this in xps? Do you use the pgf > file? > > Best, > > Jim > > > > > On 7/9/2013 2:06 PM, cstrato wrote: >> Dear Jim, >> >> I did not really follow the discussion so I may be wrong, but if you >> mean that there is a difference between the number of 'main' types, >> please note that number of 'main' for pgf, i.e 349012, corresponds to >> the number of 'main' in the probeset annotation file and not in the >> transcript annotation file. >> >> But as I said, I may have misunderstood the problem. >> >> I am mainly replying because at the beginning of this year I had long >> discussions with DevNet to make sure that the annotation files for the >> 2.X arrays are correct, and in version na33.2 DevNet did correct >> everything what I have found. >> >> Best regards, >> Christian >> >> >> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>> Hi Mark, >>> >>> Thanks for the heads-up. We already knew that Affy messed up the >>> transcript and probeset annotation files (and had them fixed), but >>> didn't think I needed to check the others. Famous last words, no? >>> >>> > x <- readPgf("HuGene-2_0-st.pgf") >>> > table(x$probesetType) >>> >>> control->affx control->affx->bac_spike >>> 18 18 >>> control->affx->ercc_spike control->affx->polya_spike >>> 92 39 >>> control->bgp->antigenomic main >>> 23 349012 >>> normgene->intron reporter >>> 3575 82 >>> >>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>> > table(y$category) >>> >>> control->affx control->affx->bac_spike >>> 18 18 >>> control->affx->ercc-spike control->affx->polya_spike >>> 92 39 >>> control->bgp->antigenomic main >>> 23 44629 >>> normgene->exon normgene->intron >>> 1626 3575 >>> reporter rescue >>> 82 3515 >>> >>> I'll ping Affymetrix and see what they have to say. >>> >>> Best, >>> >>> Jim >>> >>> >>> >>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>> Dear Benilton, James& Bioconductors, >>>> Thanks for providing the platform design packages for >>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>> >>>> I use SQL to query these packages& ultimately retain only 'main' >>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>> normgene->exon control probes are misclassified as 'main' probes. >>>> >>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>> package lists 0, and assigns these 1626 probes to the 'main' category: >>>> >>>> # probe types: >>>> librarypd.hugene.2.0.st) >>>> conn<- dbpd.hugene.2.0.st) >>>> dbGetQuery(conn,"SELECT * from type_dict") >>>> type type_id >>>> 1 1 main >>>> 2 2 control->affx >>>> 3 3 control->chip >>>> 4 4 control->bgp->antigenomic >>>> 5 5 control->bgp->genomic >>>> 6 6 normgene->exon >>>> 7 7 normgene->intron >>>> 8 8 rescue->FLmRNA->unmapped >>>> 9 9 control->affx->bac_spike >>>> 10 10 oligo_spike_in >>>> 11 11 r1_bac_spike_at >>>> >>>> # probe counts for each of the probe categories: >>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type") >>>> type count(*) >>>> 1 NA 3728 >>>> 2 1 345497 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 3575 >>>> 6 9 18 >>>> >>>> NB: no type 6 probes. >>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>> this issue for all 2.0 and 2.1 arrays (see below) >>>> >>>> Can these mappings please be updated? >>>> >>>> PS, there's a bunch of probes with type = NA in the database. I >>>> haven't investigated these in any detail. >>>> >>>> cheers, >>>> Mark >>>> ----------------------------------------------------- >>>> Mark Cowley, PhD >>>> >>>> Genome Informatics Division& the Centre for Clinical Genomics >>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>> Sydney, Australia >>>> ----------------------------------------------------- >>>> >>>> All 12 packages below: >>>> pd.packages<- c( >>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>> "pd.hugene.2.1.st", >>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>> "pd.mogene.2.1.st", >>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>> "pd.ragene.2.1.st" >>>> ) >>>> >>>> a<- b<- list() >>>> forpd.pkg.name in pd.packages) { >>>> try({ >>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load the >>>> pd.package") >>>> conn<- db(getpd.pkg.name)) >>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>> featureSet GROUP BY type") >>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from featureSet >>>> WHERE type = 6")[,1] >>>> }) >>>> } >>>> dbGetQuery(conn,"SELECT * from type_dict") >>>> >>>>> a >>>> $pd.hugene.1.0.st.v1 >>>> type count(*) >>>> 1 NA 227 >>>> 2 1 253002 >>>> 3 2 57 >>>> 4 4 45 >>>> 5 6 1195 >>>> 6 7 2904 >>>> >>>> $pd.hugene.1.1.st.v1 >>>> type count(*) >>>> 1 NA 227 >>>> 2 1 253002 >>>> 3 2 57 >>>> 4 4 45 >>>> 5 6 1195 >>>> 6 7 2904 >>>> >>>> $pd.hugene.2.0.st >>>> type count(*) >>>> 1 NA 3728 >>>> 2 1 345497 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 3575 >>>> 6 9 18 >>>> >>>> $pd.hugene.2.1.st >>>> type count(*) >>>> 1 NA 3728 >>>> 2 1 345497 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 3575 >>>> 6 9 18 >>>> >>>> $pd.mogene.1.0.st.v1 >>>> type count(*) >>>> 1 NA 86 >>>> 2 1 234878 >>>> 3 2 21 >>>> 4 4 45 >>>> 5 6 1324 >>>> 6 7 5222 >>>> >>>> $pd.mogene.1.1.st.v1 >>>> type count(*) >>>> 1 NA 86 >>>> 2 1 234878 >>>> 3 2 21 >>>> 4 4 45 >>>> 5 6 1324 >>>> 6 7 5222 >>>> >>>> $pd.mogene.2.0.st >>>> type count(*) >>>> 1 NA 810 >>>> 2 1 263551 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 5331 >>>> 6 9 18 >>>> >>>> $pd.mogene.2.1.st >>>> type count(*) >>>> 1 NA 810 >>>> 2 1 263551 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 5331 >>>> 6 9 18 >>>> >>>> $pd.ragene.1.0.st.v1 >>>> type count(*) >>>> 1 NA 254 >>>> 2 1 211195 >>>> 3 2 21 >>>> 4 4 45 >>>> 5 6 399 >>>> 6 7 1153 >>>> >>>> $pd.ragene.1.1.st.v1 >>>> type count(*) >>>> 1 NA 254 >>>> 2 1 211195 >>>> 3 2 21 >>>> 4 4 45 >>>> 5 6 399 >>>> 6 7 1153 >>>> >>>> $pd.ragene.2.0.st >>>> type count(*) >>>> 1 NA 1071 >>>> 2 1 214018 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 5083 >>>> 6 9 18 >>>> >>>> $pd.ragene.2.1.st >>>> type count(*) >>>> 1 NA 1071 >>>> 2 1 214018 >>>> 3 2 18 >>>> 4 4 23 >>>> 5 7 5083 >>>> 6 9 18 >>>> >>>>> sapply(b,length) >>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>> pd.hugene.2.1.st >>>> 1195 1195 >>>> 0 0 >>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>> pd.mogene.2.1.st >>>> 1324 1324 >>>> 0 0 >>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>> pd.ragene.2.1.st >>>> 399 399 >>>> 0 0 >>>> >>>>> sessionInfo() >>>> R version 3.0.0 (2013-04-03) >>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>> >>>> locale: >>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>> [7] LC_PAPER=C LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] parallel stats graphics grDevices utils datasets methods >>>> [8] base >>>> >>>> other attached packages: >>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>> [19] BiocInstaller_1.10.2 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>> >>>> >>>> >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >

ADD REPLY • link 10.8 years ago cstrato ★ 3.9k

0

Entering edit mode

Hi Christian, Thanks for pointing that out. There is still a bit of an inconsistency with the pd packages that should probably be corrected, as at the probeset level e.g., 16934607 is intended to measure an exon of EIF3D, whereas at the transcript level, this same probeset is intended to be a positive control (and as you note below, these probes are incorporated into a larger probeset intended to measure the EIF3D transcript). It would be nice to be able to filter out the controls like Mark attempted (and I do regularly as well). Mark - I talked with Benilton Carvalho about this, and he will take a look next week. Best, Jim On 7/9/2013 3:38 PM, cstrato wrote: > Dear Jim, > > In xps I use as basic file for exon arrays the probeset annotation > file and then compare the data to the data from the pgf-file. Any > differences will be reported. > > I have just checked the different files for HuGene-2_0-st. If you > check as an example the following probeset_ids: > 16934607 > 16934608 > 16934609 > 16934610 > > Then you will see that the transcript annotation file lists these ids > as 'normgene->exon' and 'pos_control'. However, the probeset > annotation file lists these ids as 'main' belonging to gene EIF3D with > transcript_cluster_id 16934583. Looking for this id in the transcript > annotation file reveals that the number of 'total_probes' is 24. > Indeed, the probeset annotation file lists 24 probesets including the > four above mentioned probeset_ids. > > This means that although these four probesets are listed in the > transcript annotation file as 'normgene->exon' the label 'main' in the > pgf-file is correct since these probesets are part of the gene EIF3D. > > Interestingly, the pgf-file for HuGene-1_0-st has extra probesets > listed as 'normgene->exon'. However, in this case these probesets are > also listed as 'normgene->exon' in the probeset annotation file, i.e. > these probesets do not belong to any transcript listed in the probeset > annotation file. > > Best regards, > Christian > > > On 7/9/13 8:46 PM, James W. MacDonald wrote: >> Hi Christian, >> >> That's not the issue. Instead, the issue is that the pgf file lists the >> normgene->exon probeset IDs as being 'main'. I have received a response >> from Affy stating that the qcc file lists the normgene->exon probesets >> as pos_control, but that seems orthogonal to the issue at hand. >> >> > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >> stringsAsFactors=F, header=T) >> > pgf <- readPgf("HuGene-2_0-st.pgf") >> > head(qcc) >> probeset_id group_name probeset_name quantification_in_header >> 1 16650001 neg_control 16650001 0 >> 2 16650003 neg_control 16650003 0 >> >> ## get the positive controls (normgene->exon probesets) from the qcc >> file >> > pos_cont <- qcc[qcc[,2] == "pos_control",1] >> >> ## compare to pgf file >> > x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >> > table(x) >> x >> main >> 1626 >> >> So in the pgf file, these probesets are being called 'main' instead of >> some sort of control. How do you handle this in xps? Do you use the pgf >> file? >> >> Best, >> >> Jim >> >> >> >> >> On 7/9/2013 2:06 PM, cstrato wrote: >>> Dear Jim, >>> >>> I did not really follow the discussion so I may be wrong, but if you >>> mean that there is a difference between the number of 'main' types, >>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>> the number of 'main' in the probeset annotation file and not in the >>> transcript annotation file. >>> >>> But as I said, I may have misunderstood the problem. >>> >>> I am mainly replying because at the beginning of this year I had long >>> discussions with DevNet to make sure that the annotation files for the >>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>> everything what I have found. >>> >>> Best regards, >>> Christian >>> >>> >>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>> Hi Mark, >>>> >>>> Thanks for the heads-up. We already knew that Affy messed up the >>>> transcript and probeset annotation files (and had them fixed), but >>>> didn't think I needed to check the others. Famous last words, no? >>>> >>>> > x <- readPgf("HuGene-2_0-st.pgf") >>>> > table(x$probesetType) >>>> >>>> control->affx control->affx->bac_spike >>>> 18 18 >>>> control->affx->ercc_spike control->affx->polya_spike >>>> 92 39 >>>> control->bgp->antigenomic main >>>> 23 349012 >>>> normgene->intron reporter >>>> 3575 82 >>>> >>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>> > table(y$category) >>>> >>>> control->affx control->affx->bac_spike >>>> 18 18 >>>> control->affx->ercc-spike control->affx->polya_spike >>>> 92 39 >>>> control->bgp->antigenomic main >>>> 23 44629 >>>> normgene->exon normgene->intron >>>> 1626 3575 >>>> reporter rescue >>>> 82 3515 >>>> >>>> I'll ping Affymetrix and see what they have to say. >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>> Dear Benilton, James& Bioconductors, >>>>> Thanks for providing the platform design packages for >>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>> >>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>> >>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>> category: >>>>> >>>>> # probe types: >>>>> librarypd.hugene.2.0.st) >>>>> conn<- dbpd.hugene.2.0.st) >>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>> type type_id >>>>> 1 1 main >>>>> 2 2 control->affx >>>>> 3 3 control->chip >>>>> 4 4 control->bgp->antigenomic >>>>> 5 5 control->bgp->genomic >>>>> 6 6 normgene->exon >>>>> 7 7 normgene->intron >>>>> 8 8 rescue->FLmRNA->unmapped >>>>> 9 9 control->affx->bac_spike >>>>> 10 10 oligo_spike_in >>>>> 11 11 r1_bac_spike_at >>>>> >>>>> # probe counts for each of the probe categories: >>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>> type") >>>>> type count(*) >>>>> 1 NA 3728 >>>>> 2 1 345497 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 3575 >>>>> 6 9 18 >>>>> >>>>> NB: no type 6 probes. >>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>> >>>>> Can these mappings please be updated? >>>>> >>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>> haven't investigated these in any detail. >>>>> >>>>> cheers, >>>>> Mark >>>>> ----------------------------------------------------- >>>>> Mark Cowley, PhD >>>>> >>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>> Sydney, Australia >>>>> ----------------------------------------------------- >>>>> >>>>> All 12 packages below: >>>>> pd.packages<- c( >>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>> "pd.hugene.2.1.st", >>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>> "pd.mogene.2.1.st", >>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>> "pd.ragene.2.1.st" >>>>> ) >>>>> >>>>> a<- b<- list() >>>>> forpd.pkg.name in pd.packages) { >>>>> try({ >>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>> the >>>>> pd.package") >>>>> conn<- db(getpd.pkg.name)) >>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>> featureSet GROUP BY type") >>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>> featureSet >>>>> WHERE type = 6")[,1] >>>>> }) >>>>> } >>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>> >>>>>> a >>>>> $pd.hugene.1.0.st.v1 >>>>> type count(*) >>>>> 1 NA 227 >>>>> 2 1 253002 >>>>> 3 2 57 >>>>> 4 4 45 >>>>> 5 6 1195 >>>>> 6 7 2904 >>>>> >>>>> $pd.hugene.1.1.st.v1 >>>>> type count(*) >>>>> 1 NA 227 >>>>> 2 1 253002 >>>>> 3 2 57 >>>>> 4 4 45 >>>>> 5 6 1195 >>>>> 6 7 2904 >>>>> >>>>> $pd.hugene.2.0.st >>>>> type count(*) >>>>> 1 NA 3728 >>>>> 2 1 345497 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 3575 >>>>> 6 9 18 >>>>> >>>>> $pd.hugene.2.1.st >>>>> type count(*) >>>>> 1 NA 3728 >>>>> 2 1 345497 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 3575 >>>>> 6 9 18 >>>>> >>>>> $pd.mogene.1.0.st.v1 >>>>> type count(*) >>>>> 1 NA 86 >>>>> 2 1 234878 >>>>> 3 2 21 >>>>> 4 4 45 >>>>> 5 6 1324 >>>>> 6 7 5222 >>>>> >>>>> $pd.mogene.1.1.st.v1 >>>>> type count(*) >>>>> 1 NA 86 >>>>> 2 1 234878 >>>>> 3 2 21 >>>>> 4 4 45 >>>>> 5 6 1324 >>>>> 6 7 5222 >>>>> >>>>> $pd.mogene.2.0.st >>>>> type count(*) >>>>> 1 NA 810 >>>>> 2 1 263551 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 5331 >>>>> 6 9 18 >>>>> >>>>> $pd.mogene.2.1.st >>>>> type count(*) >>>>> 1 NA 810 >>>>> 2 1 263551 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 5331 >>>>> 6 9 18 >>>>> >>>>> $pd.ragene.1.0.st.v1 >>>>> type count(*) >>>>> 1 NA 254 >>>>> 2 1 211195 >>>>> 3 2 21 >>>>> 4 4 45 >>>>> 5 6 399 >>>>> 6 7 1153 >>>>> >>>>> $pd.ragene.1.1.st.v1 >>>>> type count(*) >>>>> 1 NA 254 >>>>> 2 1 211195 >>>>> 3 2 21 >>>>> 4 4 45 >>>>> 5 6 399 >>>>> 6 7 1153 >>>>> >>>>> $pd.ragene.2.0.st >>>>> type count(*) >>>>> 1 NA 1071 >>>>> 2 1 214018 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 5083 >>>>> 6 9 18 >>>>> >>>>> $pd.ragene.2.1.st >>>>> type count(*) >>>>> 1 NA 1071 >>>>> 2 1 214018 >>>>> 3 2 18 >>>>> 4 4 23 >>>>> 5 7 5083 >>>>> 6 9 18 >>>>> >>>>>> sapply(b,length) >>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>> pd.hugene.2.1.st >>>>> 1195 1195 >>>>> 0 0 >>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>> pd.mogene.2.1.st >>>>> 1324 1324 >>>>> 0 0 >>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>> pd.ragene.2.1.st >>>>> 399 399 >>>>> 0 0 >>>>> >>>>>> sessionInfo() >>>>> R version 3.0.0 (2013-04-03) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>> [7] LC_PAPER=C LC_NAME=C >>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>> >>>>> attached base packages: >>>>> [1] parallel stats graphics grDevices utils datasets >>>>> methods >>>>> [8] base >>>>> >>>>> other attached packages: >>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>> [19] BiocInstaller_1.10.2 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >> -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 10.8 years ago James W. MacDonald 65k

0

Entering edit mode

Dear Jim, As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. Best regards, Christian On 7/9/13 11:46 PM, James W. MacDonald wrote: > Hi Christian, > > Thanks for pointing that out. There is still a bit of an inconsistency > with the pd packages that should probably be corrected, as at the > probeset level e.g., 16934607 is intended to measure an exon of EIF3D, > whereas at the transcript level, this same probeset is intended to be a > positive control (and as you note below, these probes are incorporated > into a larger probeset intended to measure the EIF3D transcript). > > It would be nice to be able to filter out the controls like Mark > attempted (and I do regularly as well). > > Mark - I talked with Benilton Carvalho about this, and he will take a > look next week. > > Best, > > Jim > > > > On 7/9/2013 3:38 PM, cstrato wrote: >> Dear Jim, >> >> In xps I use as basic file for exon arrays the probeset annotation >> file and then compare the data to the data from the pgf-file. Any >> differences will be reported. >> >> I have just checked the different files for HuGene-2_0-st. If you >> check as an example the following probeset_ids: >> 16934607 >> 16934608 >> 16934609 >> 16934610 >> >> Then you will see that the transcript annotation file lists these ids >> as 'normgene->exon' and 'pos_control'. However, the probeset >> annotation file lists these ids as 'main' belonging to gene EIF3D with >> transcript_cluster_id 16934583. Looking for this id in the transcript >> annotation file reveals that the number of 'total_probes' is 24. >> Indeed, the probeset annotation file lists 24 probesets including the >> four above mentioned probeset_ids. >> >> This means that although these four probesets are listed in the >> transcript annotation file as 'normgene->exon' the label 'main' in the >> pgf-file is correct since these probesets are part of the gene EIF3D. >> >> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >> listed as 'normgene->exon'. However, in this case these probesets are >> also listed as 'normgene->exon' in the probeset annotation file, i.e. >> these probesets do not belong to any transcript listed in the probeset >> annotation file. >> >> Best regards, >> Christian >> >> >> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>> Hi Christian, >>> >>> That's not the issue. Instead, the issue is that the pgf file lists the >>> normgene->exon probeset IDs as being 'main'. I have received a response >>> from Affy stating that the qcc file lists the normgene->exon probesets >>> as pos_control, but that seems orthogonal to the issue at hand. >>> >>> > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>> stringsAsFactors=F, header=T) >>> > pgf <- readPgf("HuGene-2_0-st.pgf") >>> > head(qcc) >>> probeset_id group_name probeset_name quantification_in_header >>> 1 16650001 neg_control 16650001 0 >>> 2 16650003 neg_control 16650003 0 >>> >>> ## get the positive controls (normgene->exon probesets) from the qcc >>> file >>> > pos_cont <- qcc[qcc[,2] == "pos_control",1] >>> >>> ## compare to pgf file >>> > x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>> > table(x) >>> x >>> main >>> 1626 >>> >>> So in the pgf file, these probesets are being called 'main' instead of >>> some sort of control. How do you handle this in xps? Do you use the pgf >>> file? >>> >>> Best, >>> >>> Jim >>> >>> >>> >>> >>> On 7/9/2013 2:06 PM, cstrato wrote: >>>> Dear Jim, >>>> >>>> I did not really follow the discussion so I may be wrong, but if you >>>> mean that there is a difference between the number of 'main' types, >>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>> the number of 'main' in the probeset annotation file and not in the >>>> transcript annotation file. >>>> >>>> But as I said, I may have misunderstood the problem. >>>> >>>> I am mainly replying because at the beginning of this year I had long >>>> discussions with DevNet to make sure that the annotation files for the >>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>> everything what I have found. >>>> >>>> Best regards, >>>> Christian >>>> >>>> >>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>> Hi Mark, >>>>> >>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>> transcript and probeset annotation files (and had them fixed), but >>>>> didn't think I needed to check the others. Famous last words, no? >>>>> >>>>> > x <- readPgf("HuGene-2_0-st.pgf") >>>>> > table(x$probesetType) >>>>> >>>>> control->affx control->affx->bac_spike >>>>> 18 18 >>>>> control->affx->ercc_spike control->affx->polya_spike >>>>> 92 39 >>>>> control->bgp->antigenomic main >>>>> 23 349012 >>>>> normgene->intron reporter >>>>> 3575 82 >>>>> >>>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>> > table(y$category) >>>>> >>>>> control->affx control->affx->bac_spike >>>>> 18 18 >>>>> control->affx->ercc-spike control->affx->polya_spike >>>>> 92 39 >>>>> control->bgp->antigenomic main >>>>> 23 44629 >>>>> normgene->exon normgene->intron >>>>> 1626 3575 >>>>> reporter rescue >>>>> 82 3515 >>>>> >>>>> I'll ping Affymetrix and see what they have to say. >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>> Dear Benilton, James& Bioconductors, >>>>>> Thanks for providing the platform design packages for >>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>> >>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>> >>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>> category: >>>>>> >>>>>> # probe types: >>>>>> librarypd.hugene.2.0.st) >>>>>> conn<- dbpd.hugene.2.0.st) >>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>> type type_id >>>>>> 1 1 main >>>>>> 2 2 control->affx >>>>>> 3 3 control->chip >>>>>> 4 4 control->bgp->antigenomic >>>>>> 5 5 control->bgp->genomic >>>>>> 6 6 normgene->exon >>>>>> 7 7 normgene->intron >>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>> 9 9 control->affx->bac_spike >>>>>> 10 10 oligo_spike_in >>>>>> 11 11 r1_bac_spike_at >>>>>> >>>>>> # probe counts for each of the probe categories: >>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>> type") >>>>>> type count(*) >>>>>> 1 NA 3728 >>>>>> 2 1 345497 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 3575 >>>>>> 6 9 18 >>>>>> >>>>>> NB: no type 6 probes. >>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>> >>>>>> Can these mappings please be updated? >>>>>> >>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>> haven't investigated these in any detail. >>>>>> >>>>>> cheers, >>>>>> Mark >>>>>> ----------------------------------------------------- >>>>>> Mark Cowley, PhD >>>>>> >>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>> Sydney, Australia >>>>>> ----------------------------------------------------- >>>>>> >>>>>> All 12 packages below: >>>>>> pd.packages<- c( >>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>> "pd.hugene.2.1.st", >>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>> "pd.mogene.2.1.st", >>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>> "pd.ragene.2.1.st" >>>>>> ) >>>>>> >>>>>> a<- b<- list() >>>>>> forpd.pkg.name in pd.packages) { >>>>>> try({ >>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>> the >>>>>> pd.package") >>>>>> conn<- db(getpd.pkg.name)) >>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>> featureSet GROUP BY type") >>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>> featureSet >>>>>> WHERE type = 6")[,1] >>>>>> }) >>>>>> } >>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>> >>>>>>> a >>>>>> $pd.hugene.1.0.st.v1 >>>>>> type count(*) >>>>>> 1 NA 227 >>>>>> 2 1 253002 >>>>>> 3 2 57 >>>>>> 4 4 45 >>>>>> 5 6 1195 >>>>>> 6 7 2904 >>>>>> >>>>>> $pd.hugene.1.1.st.v1 >>>>>> type count(*) >>>>>> 1 NA 227 >>>>>> 2 1 253002 >>>>>> 3 2 57 >>>>>> 4 4 45 >>>>>> 5 6 1195 >>>>>> 6 7 2904 >>>>>> >>>>>> $pd.hugene.2.0.st >>>>>> type count(*) >>>>>> 1 NA 3728 >>>>>> 2 1 345497 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 3575 >>>>>> 6 9 18 >>>>>> >>>>>> $pd.hugene.2.1.st >>>>>> type count(*) >>>>>> 1 NA 3728 >>>>>> 2 1 345497 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 3575 >>>>>> 6 9 18 >>>>>> >>>>>> $pd.mogene.1.0.st.v1 >>>>>> type count(*) >>>>>> 1 NA 86 >>>>>> 2 1 234878 >>>>>> 3 2 21 >>>>>> 4 4 45 >>>>>> 5 6 1324 >>>>>> 6 7 5222 >>>>>> >>>>>> $pd.mogene.1.1.st.v1 >>>>>> type count(*) >>>>>> 1 NA 86 >>>>>> 2 1 234878 >>>>>> 3 2 21 >>>>>> 4 4 45 >>>>>> 5 6 1324 >>>>>> 6 7 5222 >>>>>> >>>>>> $pd.mogene.2.0.st >>>>>> type count(*) >>>>>> 1 NA 810 >>>>>> 2 1 263551 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 5331 >>>>>> 6 9 18 >>>>>> >>>>>> $pd.mogene.2.1.st >>>>>> type count(*) >>>>>> 1 NA 810 >>>>>> 2 1 263551 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 5331 >>>>>> 6 9 18 >>>>>> >>>>>> $pd.ragene.1.0.st.v1 >>>>>> type count(*) >>>>>> 1 NA 254 >>>>>> 2 1 211195 >>>>>> 3 2 21 >>>>>> 4 4 45 >>>>>> 5 6 399 >>>>>> 6 7 1153 >>>>>> >>>>>> $pd.ragene.1.1.st.v1 >>>>>> type count(*) >>>>>> 1 NA 254 >>>>>> 2 1 211195 >>>>>> 3 2 21 >>>>>> 4 4 45 >>>>>> 5 6 399 >>>>>> 6 7 1153 >>>>>> >>>>>> $pd.ragene.2.0.st >>>>>> type count(*) >>>>>> 1 NA 1071 >>>>>> 2 1 214018 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 5083 >>>>>> 6 9 18 >>>>>> >>>>>> $pd.ragene.2.1.st >>>>>> type count(*) >>>>>> 1 NA 1071 >>>>>> 2 1 214018 >>>>>> 3 2 18 >>>>>> 4 4 23 >>>>>> 5 7 5083 >>>>>> 6 9 18 >>>>>> >>>>>>> sapply(b,length) >>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>> pd.hugene.2.1.st >>>>>> 1195 1195 >>>>>> 0 0 >>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>> pd.mogene.2.1.st >>>>>> 1324 1324 >>>>>> 0 0 >>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>> pd.ragene.2.1.st >>>>>> 399 399 >>>>>> 0 0 >>>>>> >>>>>>> sessionInfo() >>>>>> R version 3.0.0 (2013-04-03) >>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: >>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>> >>>>>> attached base packages: >>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>> methods >>>>>> [8] base >>>>>> >>>>>> other attached packages: >>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>> [19] BiocInstaller_1.10.2 >>>>>> >>>>>> loaded via a namespace (and not attached): >>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> [[alternative HTML version deleted]] >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>> >

ADD REPLY • link 10.8 years ago cstrato ★ 3.9k

0

Entering edit mode

Hi, I see the point you're making Christian. I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file. Cheers and thanks for looking into this Mark Sent from my iPhone On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at="" aon.at=""> wrote: > Dear Jim, > > As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. > > Best regards, > Christian > > > On 7/9/13 11:46 PM, James W. MacDonald wrote: >> Hi Christian, >> >> Thanks for pointing that out. There is still a bit of an inconsistency >> with the pd packages that should probably be corrected, as at the >> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >> whereas at the transcript level, this same probeset is intended to be a >> positive control (and as you note below, these probes are incorporated >> into a larger probeset intended to measure the EIF3D transcript). >> >> It would be nice to be able to filter out the controls like Mark >> attempted (and I do regularly as well). >> >> Mark - I talked with Benilton Carvalho about this, and he will take a >> look next week. >> >> Best, >> >> Jim >> >> >> >> On 7/9/2013 3:38 PM, cstrato wrote: >>> Dear Jim, >>> >>> In xps I use as basic file for exon arrays the probeset annotation >>> file and then compare the data to the data from the pgf-file. Any >>> differences will be reported. >>> >>> I have just checked the different files for HuGene-2_0-st. If you >>> check as an example the following probeset_ids: >>> 16934607 >>> 16934608 >>> 16934609 >>> 16934610 >>> >>> Then you will see that the transcript annotation file lists these ids >>> as 'normgene->exon' and 'pos_control'. However, the probeset >>> annotation file lists these ids as 'main' belonging to gene EIF3D with >>> transcript_cluster_id 16934583. Looking for this id in the transcript >>> annotation file reveals that the number of 'total_probes' is 24. >>> Indeed, the probeset annotation file lists 24 probesets including the >>> four above mentioned probeset_ids. >>> >>> This means that although these four probesets are listed in the >>> transcript annotation file as 'normgene->exon' the label 'main' in the >>> pgf-file is correct since these probesets are part of the gene EIF3D. >>> >>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>> listed as 'normgene->exon'. However, in this case these probesets are >>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>> these probesets do not belong to any transcript listed in the probeset >>> annotation file. >>> >>> Best regards, >>> Christian >>> >>> >>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>> Hi Christian, >>>> >>>> That's not the issue. Instead, the issue is that the pgf file lists the >>>> normgene->exon probeset IDs as being 'main'. I have received a response >>>> from Affy stating that the qcc file lists the normgene->exon probesets >>>> as pos_control, but that seems orthogonal to the issue at hand. >>>> >>>> > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>> stringsAsFactors=F, header=T) >>>> > pgf <- readPgf("HuGene-2_0-st.pgf") >>>> > head(qcc) >>>> probeset_id group_name probeset_name quantification_in_header >>>> 1 16650001 neg_control 16650001 0 >>>> 2 16650003 neg_control 16650003 0 >>>> >>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>> file >>>> > pos_cont <- qcc[qcc[,2] == "pos_control",1] >>>> >>>> ## compare to pgf file >>>> > x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>> > table(x) >>>> x >>>> main >>>> 1626 >>>> >>>> So in the pgf file, these probesets are being called 'main' instead of >>>> some sort of control. How do you handle this in xps? Do you use the pgf >>>> file? >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> >>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>> Dear Jim, >>>>> >>>>> I did not really follow the discussion so I may be wrong, but if you >>>>> mean that there is a difference between the number of 'main' types, >>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>>> the number of 'main' in the probeset annotation file and not in the >>>>> transcript annotation file. >>>>> >>>>> But as I said, I may have misunderstood the problem. >>>>> >>>>> I am mainly replying because at the beginning of this year I had long >>>>> discussions with DevNet to make sure that the annotation files for the >>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>> everything what I have found. >>>>> >>>>> Best regards, >>>>> Christian >>>>> >>>>> >>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>> Hi Mark, >>>>>> >>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>> >>>>>> > x <- readPgf("HuGene-2_0-st.pgf") >>>>>> > table(x$probesetType) >>>>>> >>>>>> control->affx control->affx->bac_spike >>>>>> 18 18 >>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>> 92 39 >>>>>> control->bgp->antigenomic main >>>>>> 23 349012 >>>>>> normgene->intron reporter >>>>>> 3575 82 >>>>>> >>>>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>> > table(y$category) >>>>>> >>>>>> control->affx control->affx->bac_spike >>>>>> 18 18 >>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>> 92 39 >>>>>> control->bgp->antigenomic main >>>>>> 23 44629 >>>>>> normgene->exon normgene->intron >>>>>> 1626 3575 >>>>>> reporter rescue >>>>>> 82 3515 >>>>>> >>>>>> I'll ping Affymetrix and see what they have to say. >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>> Dear Benilton, James& Bioconductors, >>>>>>> Thanks for providing the platform design packages for >>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>> >>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>> >>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>> category: >>>>>>> >>>>>>> # probe types: >>>>>>> librarypd.hugene.2.0.st) >>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>> type type_id >>>>>>> 1 1 main >>>>>>> 2 2 control->affx >>>>>>> 3 3 control->chip >>>>>>> 4 4 control->bgp->antigenomic >>>>>>> 5 5 control->bgp->genomic >>>>>>> 6 6 normgene->exon >>>>>>> 7 7 normgene->intron >>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>> 9 9 control->affx->bac_spike >>>>>>> 10 10 oligo_spike_in >>>>>>> 11 11 r1_bac_spike_at >>>>>>> >>>>>>> # probe counts for each of the probe categories: >>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>> type") >>>>>>> type count(*) >>>>>>> 1 NA 3728 >>>>>>> 2 1 345497 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 3575 >>>>>>> 6 9 18 >>>>>>> >>>>>>> NB: no type 6 probes. >>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>> >>>>>>> Can these mappings please be updated? >>>>>>> >>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>> haven't investigated these in any detail. >>>>>>> >>>>>>> cheers, >>>>>>> Mark >>>>>>> ----------------------------------------------------- >>>>>>> Mark Cowley, PhD >>>>>>> >>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>> Sydney, Australia >>>>>>> ----------------------------------------------------- >>>>>>> >>>>>>> All 12 packages below: >>>>>>> pd.packages<- c( >>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>>> "pd.hugene.2.1.st", >>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>>> "pd.mogene.2.1.st", >>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>>> "pd.ragene.2.1.st" >>>>>>> ) >>>>>>> >>>>>>> a<- b<- list() >>>>>>> forpd.pkg.name in pd.packages) { >>>>>>> try({ >>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>>> the >>>>>>> pd.package") >>>>>>> conn<- db(getpd.pkg.name)) >>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>>> featureSet GROUP BY type") >>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>> featureSet >>>>>>> WHERE type = 6")[,1] >>>>>>> }) >>>>>>> } >>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>> >>>>>>>> a >>>>>>> $pd.hugene.1.0.st.v1 >>>>>>> type count(*) >>>>>>> 1 NA 227 >>>>>>> 2 1 253002 >>>>>>> 3 2 57 >>>>>>> 4 4 45 >>>>>>> 5 6 1195 >>>>>>> 6 7 2904 >>>>>>> >>>>>>> $pd.hugene.1.1.st.v1 >>>>>>> type count(*) >>>>>>> 1 NA 227 >>>>>>> 2 1 253002 >>>>>>> 3 2 57 >>>>>>> 4 4 45 >>>>>>> 5 6 1195 >>>>>>> 6 7 2904 >>>>>>> >>>>>>> $pd.hugene.2.0.st >>>>>>> type count(*) >>>>>>> 1 NA 3728 >>>>>>> 2 1 345497 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 3575 >>>>>>> 6 9 18 >>>>>>> >>>>>>> $pd.hugene.2.1.st >>>>>>> type count(*) >>>>>>> 1 NA 3728 >>>>>>> 2 1 345497 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 3575 >>>>>>> 6 9 18 >>>>>>> >>>>>>> $pd.mogene.1.0.st.v1 >>>>>>> type count(*) >>>>>>> 1 NA 86 >>>>>>> 2 1 234878 >>>>>>> 3 2 21 >>>>>>> 4 4 45 >>>>>>> 5 6 1324 >>>>>>> 6 7 5222 >>>>>>> >>>>>>> $pd.mogene.1.1.st.v1 >>>>>>> type count(*) >>>>>>> 1 NA 86 >>>>>>> 2 1 234878 >>>>>>> 3 2 21 >>>>>>> 4 4 45 >>>>>>> 5 6 1324 >>>>>>> 6 7 5222 >>>>>>> >>>>>>> $pd.mogene.2.0.st >>>>>>> type count(*) >>>>>>> 1 NA 810 >>>>>>> 2 1 263551 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 5331 >>>>>>> 6 9 18 >>>>>>> >>>>>>> $pd.mogene.2.1.st >>>>>>> type count(*) >>>>>>> 1 NA 810 >>>>>>> 2 1 263551 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 5331 >>>>>>> 6 9 18 >>>>>>> >>>>>>> $pd.ragene.1.0.st.v1 >>>>>>> type count(*) >>>>>>> 1 NA 254 >>>>>>> 2 1 211195 >>>>>>> 3 2 21 >>>>>>> 4 4 45 >>>>>>> 5 6 399 >>>>>>> 6 7 1153 >>>>>>> >>>>>>> $pd.ragene.1.1.st.v1 >>>>>>> type count(*) >>>>>>> 1 NA 254 >>>>>>> 2 1 211195 >>>>>>> 3 2 21 >>>>>>> 4 4 45 >>>>>>> 5 6 399 >>>>>>> 6 7 1153 >>>>>>> >>>>>>> $pd.ragene.2.0.st >>>>>>> type count(*) >>>>>>> 1 NA 1071 >>>>>>> 2 1 214018 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 5083 >>>>>>> 6 9 18 >>>>>>> >>>>>>> $pd.ragene.2.1.st >>>>>>> type count(*) >>>>>>> 1 NA 1071 >>>>>>> 2 1 214018 >>>>>>> 3 2 18 >>>>>>> 4 4 23 >>>>>>> 5 7 5083 >>>>>>> 6 9 18 >>>>>>> >>>>>>>> sapply(b,length) >>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>> pd.hugene.2.1.st >>>>>>> 1195 1195 >>>>>>> 0 0 >>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>> pd.mogene.2.1.st >>>>>>> 1324 1324 >>>>>>> 0 0 >>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>> pd.ragene.2.1.st >>>>>>> 399 399 >>>>>>> 0 0 >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 3.0.0 (2013-04-03) >>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>> methods >>>>>>> [8] base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>> [19] BiocInstaller_1.10.2 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>

ADD REPLY • link 10.8 years ago Mark Cowley ▴ 910

0

Entering edit mode

Hi Mark, The issue here is that the fsetid AND the meta_fsetid for probeset 16934607 are the same. In other words, when you summarize at the probeset level, there is a probeset 16934607 that is intended to interrogate an exon of EIF3D. And when you summarize at the transcript level, there is a probeset 16934607 that is intended to be a positive control. So at one level, this probeset IS a main probeset, and that is reflected in the pgf file. Prior to this series of arrays, the positive controls were never also a part of an existing transcript, so this issue never came up. As you will note in my earlier response to Christian, I agree that this isn't ideal, and that it would be better to not tag a positive control as 'main' in the pd package. However, I don't know how this would be fixed. Note that > dbGetQuery(con, "select * from core_mps where meta_fsetid='16934607';") meta_fsetid transcript_cluster_id fsetid 1 16934607 16934607 16934607 and > dbGetQuery(con, "select * from featureSet where transcript_cluster_id='16934583';")[,c(1,5,10)] fsetid transcript_cluster_id type 1 16934584 16934583 1 2 16934585 16934583 1 3 16934586 16934583 1 4 16934587 16934583 1 5 16934588 16934583 1 6 16934589 16934583 1 7 16934590 16934583 1 8 16934591 16934583 1 9 16934592 16934583 1 10 16934593 16934583 1 11 16934594 16934583 1 12 16934595 16934583 1 13 16934596 16934583 1 14 16934597 16934583 1 15 16934598 16934583 1 16 16934600 16934583 1 17 16934601 16934583 1 18 16934602 16934583 1 19 16934603 16934583 1 20 16934604 16934583 1 21 16934607 16934583 1 22 16934608 16934583 1 23 16934609 16934583 1 24 16934610 16934583 1 and > dbGetQuery(con, "select * from featureSet where transcript_cluster_id='16934607';") [1] fsetid strand start [4] stop transcript_cluster_id exon_id [7] crosshyb_type level chrom [10] type So this seems like what you would want to happen. The 16934607 probeset isn't summarized at the transcript level, but is incorporated into the 16934583 probeset, in which case the 'main' designation is correct. Best, Jim On 7/10/2013 6:31 AM, Mark Cowley wrote: > Hi, > I see the point you're making Christian. > I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file. > > Cheers and thanks for looking into this > > Mark > > Sent from my iPhone > > On 10/07/2013, at 7:53 AM, "cstrato"<cstrato at="" aon.at=""> wrote: > >> Dear Jim, >> >> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. >> >> Best regards, >> Christian >> >> >> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>> Hi Christian, >>> >>> Thanks for pointing that out. There is still a bit of an inconsistency >>> with the pd packages that should probably be corrected, as at the >>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>> whereas at the transcript level, this same probeset is intended to be a >>> positive control (and as you note below, these probes are incorporated >>> into a larger probeset intended to measure the EIF3D transcript). >>> >>> It would be nice to be able to filter out the controls like Mark >>> attempted (and I do regularly as well). >>> >>> Mark - I talked with Benilton Carvalho about this, and he will take a >>> look next week. >>> >>> Best, >>> >>> Jim >>> >>> >>> >>> On 7/9/2013 3:38 PM, cstrato wrote: >>>> Dear Jim, >>>> >>>> In xps I use as basic file for exon arrays the probeset annotation >>>> file and then compare the data to the data from the pgf-file. Any >>>> differences will be reported. >>>> >>>> I have just checked the different files for HuGene-2_0-st. If you >>>> check as an example the following probeset_ids: >>>> 16934607 >>>> 16934608 >>>> 16934609 >>>> 16934610 >>>> >>>> Then you will see that the transcript annotation file lists these ids >>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>> annotation file lists these ids as 'main' belonging to gene EIF3D with >>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>> annotation file reveals that the number of 'total_probes' is 24. >>>> Indeed, the probeset annotation file lists 24 probesets including the >>>> four above mentioned probeset_ids. >>>> >>>> This means that although these four probesets are listed in the >>>> transcript annotation file as 'normgene->exon' the label 'main' in the >>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>> >>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>> listed as 'normgene->exon'. However, in this case these probesets are >>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>> these probesets do not belong to any transcript listed in the probeset >>>> annotation file. >>>> >>>> Best regards, >>>> Christian >>>> >>>> >>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>> Hi Christian, >>>>> >>>>> That's not the issue. Instead, the issue is that the pgf file lists the >>>>> normgene->exon probeset IDs as being 'main'. I have received a response >>>>> from Affy stating that the qcc file lists the normgene->exon probesets >>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>> >>>>>> qcc<- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>> stringsAsFactors=F, header=T) >>>>>> pgf<- readPgf("HuGene-2_0-st.pgf") >>>>>> head(qcc) >>>>> probeset_id group_name probeset_name quantification_in_header >>>>> 1 16650001 neg_control 16650001 0 >>>>> 2 16650003 neg_control 16650003 0 >>>>> >>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>> file >>>>>> pos_cont<- qcc[qcc[,2] == "pos_control",1] >>>>> ## compare to pgf file >>>>>> x<- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>> table(x) >>>>> x >>>>> main >>>>> 1626 >>>>> >>>>> So in the pgf file, these probesets are being called 'main' instead of >>>>> some sort of control. How do you handle this in xps? Do you use the pgf >>>>> file? >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> >>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>> Dear Jim, >>>>>> >>>>>> I did not really follow the discussion so I may be wrong, but if you >>>>>> mean that there is a difference between the number of 'main' types, >>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>> transcript annotation file. >>>>>> >>>>>> But as I said, I may have misunderstood the problem. >>>>>> >>>>>> I am mainly replying because at the beginning of this year I had long >>>>>> discussions with DevNet to make sure that the annotation files for the >>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>> everything what I have found. >>>>>> >>>>>> Best regards, >>>>>> Christian >>>>>> >>>>>> >>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>> Hi Mark, >>>>>>> >>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>> >>>>>>>> x<- readPgf("HuGene-2_0-st.pgf") >>>>>>>> table(x$probesetType) >>>>>>> control->affx control->affx->bac_spike >>>>>>> 18 18 >>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>> 92 39 >>>>>>> control->bgp->antigenomic main >>>>>>> 23 349012 >>>>>>> normgene->intron reporter >>>>>>> 3575 82 >>>>>>> >>>>>>>> y<- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>> table(y$category) >>>>>>> control->affx control->affx->bac_spike >>>>>>> 18 18 >>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>> 92 39 >>>>>>> control->bgp->antigenomic main >>>>>>> 23 44629 >>>>>>> normgene->exon normgene->intron >>>>>>> 1626 3575 >>>>>>> reporter rescue >>>>>>> 82 3515 >>>>>>> >>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>> Thanks for providing the platform design packages for >>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>> >>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>> >>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>> category: >>>>>>>> >>>>>>>> # probe types: >>>>>>>> librarypd.hugene.2.0.st) >>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>> type type_id >>>>>>>> 1 1 main >>>>>>>> 2 2 control->affx >>>>>>>> 3 3 control->chip >>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>> 5 5 control->bgp->genomic >>>>>>>> 6 6 normgene->exon >>>>>>>> 7 7 normgene->intron >>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>> 9 9 control->affx->bac_spike >>>>>>>> 10 10 oligo_spike_in >>>>>>>> 11 11 r1_bac_spike_at >>>>>>>> >>>>>>>> # probe counts for each of the probe categories: >>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>> type") >>>>>>>> type count(*) >>>>>>>> 1 NA 3728 >>>>>>>> 2 1 345497 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 3575 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> NB: no type 6 probes. >>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>> >>>>>>>> Can these mappings please be updated? >>>>>>>> >>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>> haven't investigated these in any detail. >>>>>>>> >>>>>>>> cheers, >>>>>>>> Mark >>>>>>>> ----------------------------------------------------- >>>>>>>> Mark Cowley, PhD >>>>>>>> >>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>> Sydney, Australia >>>>>>>> ----------------------------------------------------- >>>>>>>> >>>>>>>> All 12 packages below: >>>>>>>> pd.packages<- c( >>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>>>> "pd.hugene.2.1.st", >>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>>>> "pd.mogene.2.1.st", >>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>>>> "pd.ragene.2.1.st" >>>>>>>> ) >>>>>>>> >>>>>>>> a<- b<- list() >>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>> try({ >>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>>>> the >>>>>>>> pd.package") >>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>>>> featureSet GROUP BY type") >>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>> featureSet >>>>>>>> WHERE type = 6")[,1] >>>>>>>> }) >>>>>>>> } >>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>> >>>>>>>>> a >>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 227 >>>>>>>> 2 1 253002 >>>>>>>> 3 2 57 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1195 >>>>>>>> 6 7 2904 >>>>>>>> >>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 227 >>>>>>>> 2 1 253002 >>>>>>>> 3 2 57 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1195 >>>>>>>> 6 7 2904 >>>>>>>> >>>>>>>> $pd.hugene.2.0.st >>>>>>>> type count(*) >>>>>>>> 1 NA 3728 >>>>>>>> 2 1 345497 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 3575 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.hugene.2.1.st >>>>>>>> type count(*) >>>>>>>> 1 NA 3728 >>>>>>>> 2 1 345497 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 3575 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 86 >>>>>>>> 2 1 234878 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1324 >>>>>>>> 6 7 5222 >>>>>>>> >>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 86 >>>>>>>> 2 1 234878 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1324 >>>>>>>> 6 7 5222 >>>>>>>> >>>>>>>> $pd.mogene.2.0.st >>>>>>>> type count(*) >>>>>>>> 1 NA 810 >>>>>>>> 2 1 263551 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5331 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.mogene.2.1.st >>>>>>>> type count(*) >>>>>>>> 1 NA 810 >>>>>>>> 2 1 263551 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5331 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 254 >>>>>>>> 2 1 211195 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 399 >>>>>>>> 6 7 1153 >>>>>>>> >>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 254 >>>>>>>> 2 1 211195 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 399 >>>>>>>> 6 7 1153 >>>>>>>> >>>>>>>> $pd.ragene.2.0.st >>>>>>>> type count(*) >>>>>>>> 1 NA 1071 >>>>>>>> 2 1 214018 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5083 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.ragene.2.1.st >>>>>>>> type count(*) >>>>>>>> 1 NA 1071 >>>>>>>> 2 1 214018 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5083 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>>> sapply(b,length) >>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>> pd.hugene.2.1.st >>>>>>>> 1195 1195 >>>>>>>> 0 0 >>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>> pd.mogene.2.1.st >>>>>>>> 1324 1324 >>>>>>>> 0 0 >>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>> pd.ragene.2.1.st >>>>>>>> 399 399 >>>>>>>> 0 0 >>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>> methods >>>>>>>> [8] base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> [[alternative HTML version deleted]] >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 10.8 years ago James W. MacDonald 65k

0

Entering edit mode

Dear Mark, Yes, EIF3D and pos_controls mentioned have different transcript cluster id's. However, imho this means only that they have received these additional transcript cluster id's to be able to define them as normgene->exon. Furthermore, in the transcript annotation file the transcript_cluster_id is identical to the probeset_id. Thus e.g. 16934607 is used as probeset_id in the probeset annotation file. I have tested a couple of normgene->exon controls and it seems that all of them belong to real genes in the probeset csv file with the type 'main'. Mostly, these come in groups belonging to the same exon_id, e.g. the mentioned controls 16934607 - 16934610 all belong to exon_id 5192052 of the EIF3D gene. I suppose that you use exons as normgene->exon controls which e.g. in the case of alternative splicing are always expressed. Best regards, Christian On 7/10/13 12:31 PM, Mark Cowley wrote: > Hi, > I see the point you're making Christian. > I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file. > > Cheers and thanks for looking into this > > Mark > > Sent from my iPhone > > On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at="" aon.at=""> wrote: > >> Dear Jim, >> >> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. >> >> Best regards, >> Christian >> >> >> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>> Hi Christian, >>> >>> Thanks for pointing that out. There is still a bit of an inconsistency >>> with the pd packages that should probably be corrected, as at the >>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>> whereas at the transcript level, this same probeset is intended to be a >>> positive control (and as you note below, these probes are incorporated >>> into a larger probeset intended to measure the EIF3D transcript). >>> >>> It would be nice to be able to filter out the controls like Mark >>> attempted (and I do regularly as well). >>> >>> Mark - I talked with Benilton Carvalho about this, and he will take a >>> look next week. >>> >>> Best, >>> >>> Jim >>> >>> >>> >>> On 7/9/2013 3:38 PM, cstrato wrote: >>>> Dear Jim, >>>> >>>> In xps I use as basic file for exon arrays the probeset annotation >>>> file and then compare the data to the data from the pgf-file. Any >>>> differences will be reported. >>>> >>>> I have just checked the different files for HuGene-2_0-st. If you >>>> check as an example the following probeset_ids: >>>> 16934607 >>>> 16934608 >>>> 16934609 >>>> 16934610 >>>> >>>> Then you will see that the transcript annotation file lists these ids >>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>> annotation file lists these ids as 'main' belonging to gene EIF3D with >>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>> annotation file reveals that the number of 'total_probes' is 24. >>>> Indeed, the probeset annotation file lists 24 probesets including the >>>> four above mentioned probeset_ids. >>>> >>>> This means that although these four probesets are listed in the >>>> transcript annotation file as 'normgene->exon' the label 'main' in the >>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>> >>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>> listed as 'normgene->exon'. However, in this case these probesets are >>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>> these probesets do not belong to any transcript listed in the probeset >>>> annotation file. >>>> >>>> Best regards, >>>> Christian >>>> >>>> >>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>> Hi Christian, >>>>> >>>>> That's not the issue. Instead, the issue is that the pgf file lists the >>>>> normgene->exon probeset IDs as being 'main'. I have received a response >>>>> from Affy stating that the qcc file lists the normgene->exon probesets >>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>> >>>>>> qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>> stringsAsFactors=F, header=T) >>>>>> pgf <- readPgf("HuGene-2_0-st.pgf") >>>>>> head(qcc) >>>>> probeset_id group_name probeset_name quantification_in_header >>>>> 1 16650001 neg_control 16650001 0 >>>>> 2 16650003 neg_control 16650003 0 >>>>> >>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>> file >>>>>> pos_cont <- qcc[qcc[,2] == "pos_control",1] >>>>> >>>>> ## compare to pgf file >>>>>> x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>> table(x) >>>>> x >>>>> main >>>>> 1626 >>>>> >>>>> So in the pgf file, these probesets are being called 'main' instead of >>>>> some sort of control. How do you handle this in xps? Do you use the pgf >>>>> file? >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> >>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>> Dear Jim, >>>>>> >>>>>> I did not really follow the discussion so I may be wrong, but if you >>>>>> mean that there is a difference between the number of 'main' types, >>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>> transcript annotation file. >>>>>> >>>>>> But as I said, I may have misunderstood the problem. >>>>>> >>>>>> I am mainly replying because at the beginning of this year I had long >>>>>> discussions with DevNet to make sure that the annotation files for the >>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>> everything what I have found. >>>>>> >>>>>> Best regards, >>>>>> Christian >>>>>> >>>>>> >>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>> Hi Mark, >>>>>>> >>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>> >>>>>>>> x <- readPgf("HuGene-2_0-st.pgf") >>>>>>>> table(x$probesetType) >>>>>>> >>>>>>> control->affx control->affx->bac_spike >>>>>>> 18 18 >>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>> 92 39 >>>>>>> control->bgp->antigenomic main >>>>>>> 23 349012 >>>>>>> normgene->intron reporter >>>>>>> 3575 82 >>>>>>> >>>>>>>> y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>> table(y$category) >>>>>>> >>>>>>> control->affx control->affx->bac_spike >>>>>>> 18 18 >>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>> 92 39 >>>>>>> control->bgp->antigenomic main >>>>>>> 23 44629 >>>>>>> normgene->exon normgene->intron >>>>>>> 1626 3575 >>>>>>> reporter rescue >>>>>>> 82 3515 >>>>>>> >>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>> Thanks for providing the platform design packages for >>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>> >>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>> >>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>> category: >>>>>>>> >>>>>>>> # probe types: >>>>>>>> librarypd.hugene.2.0.st) >>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>> type type_id >>>>>>>> 1 1 main >>>>>>>> 2 2 control->affx >>>>>>>> 3 3 control->chip >>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>> 5 5 control->bgp->genomic >>>>>>>> 6 6 normgene->exon >>>>>>>> 7 7 normgene->intron >>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>> 9 9 control->affx->bac_spike >>>>>>>> 10 10 oligo_spike_in >>>>>>>> 11 11 r1_bac_spike_at >>>>>>>> >>>>>>>> # probe counts for each of the probe categories: >>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>> type") >>>>>>>> type count(*) >>>>>>>> 1 NA 3728 >>>>>>>> 2 1 345497 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 3575 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> NB: no type 6 probes. >>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>> >>>>>>>> Can these mappings please be updated? >>>>>>>> >>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>> haven't investigated these in any detail. >>>>>>>> >>>>>>>> cheers, >>>>>>>> Mark >>>>>>>> ----------------------------------------------------- >>>>>>>> Mark Cowley, PhD >>>>>>>> >>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>> Sydney, Australia >>>>>>>> ----------------------------------------------------- >>>>>>>> >>>>>>>> All 12 packages below: >>>>>>>> pd.packages<- c( >>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>>>> "pd.hugene.2.1.st", >>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>>>> "pd.mogene.2.1.st", >>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>>>> "pd.ragene.2.1.st" >>>>>>>> ) >>>>>>>> >>>>>>>> a<- b<- list() >>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>> try({ >>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>>>> the >>>>>>>> pd.package") >>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>>>> featureSet GROUP BY type") >>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>> featureSet >>>>>>>> WHERE type = 6")[,1] >>>>>>>> }) >>>>>>>> } >>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>> >>>>>>>>> a >>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 227 >>>>>>>> 2 1 253002 >>>>>>>> 3 2 57 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1195 >>>>>>>> 6 7 2904 >>>>>>>> >>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 227 >>>>>>>> 2 1 253002 >>>>>>>> 3 2 57 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1195 >>>>>>>> 6 7 2904 >>>>>>>> >>>>>>>> $pd.hugene.2.0.st >>>>>>>> type count(*) >>>>>>>> 1 NA 3728 >>>>>>>> 2 1 345497 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 3575 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.hugene.2.1.st >>>>>>>> type count(*) >>>>>>>> 1 NA 3728 >>>>>>>> 2 1 345497 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 3575 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 86 >>>>>>>> 2 1 234878 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1324 >>>>>>>> 6 7 5222 >>>>>>>> >>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 86 >>>>>>>> 2 1 234878 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 1324 >>>>>>>> 6 7 5222 >>>>>>>> >>>>>>>> $pd.mogene.2.0.st >>>>>>>> type count(*) >>>>>>>> 1 NA 810 >>>>>>>> 2 1 263551 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5331 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.mogene.2.1.st >>>>>>>> type count(*) >>>>>>>> 1 NA 810 >>>>>>>> 2 1 263551 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5331 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 254 >>>>>>>> 2 1 211195 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 399 >>>>>>>> 6 7 1153 >>>>>>>> >>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>> type count(*) >>>>>>>> 1 NA 254 >>>>>>>> 2 1 211195 >>>>>>>> 3 2 21 >>>>>>>> 4 4 45 >>>>>>>> 5 6 399 >>>>>>>> 6 7 1153 >>>>>>>> >>>>>>>> $pd.ragene.2.0.st >>>>>>>> type count(*) >>>>>>>> 1 NA 1071 >>>>>>>> 2 1 214018 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5083 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>> $pd.ragene.2.1.st >>>>>>>> type count(*) >>>>>>>> 1 NA 1071 >>>>>>>> 2 1 214018 >>>>>>>> 3 2 18 >>>>>>>> 4 4 23 >>>>>>>> 5 7 5083 >>>>>>>> 6 9 18 >>>>>>>> >>>>>>>>> sapply(b,length) >>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>> pd.hugene.2.1.st >>>>>>>> 1195 1195 >>>>>>>> 0 0 >>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>> pd.mogene.2.1.st >>>>>>>> 1324 1324 >>>>>>>> 0 0 >>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>> pd.ragene.2.1.st >>>>>>>> 399 399 >>>>>>>> 0 0 >>>>>>>> >>>>>>>>> sessionInfo() >>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>> >>>>>>>> locale: >>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>> >>>>>>>> attached base packages: >>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>> methods >>>>>>>> [8] base >>>>>>>> >>>>>>>> other attached packages: >>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>> >>>>>>>> loaded via a namespace (and not attached): >>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> [[alternative HTML version deleted]] >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >

ADD REPLY • link 10.8 years ago cstrato ★ 3.9k

0

Entering edit mode

Hi, is it feasible to add a 'type' column to the core_mps table, using info from the transcript csv file? since 'core' implies transcript- level analysis. cheers, Mark On 11/07/2013, at 4:29 AM, cstrato <cstrato at="" aon.at=""> wrote: > Dear Mark, > > Yes, EIF3D and pos_controls mentioned have different transcript cluster id's. However, imho this means only that they have received these additional transcript cluster id's to be able to define them as normgene->exon. Furthermore, in the transcript annotation file the transcript_cluster_id is identical to the probeset_id. Thus e.g. 16934607 is used as probeset_id in the probeset annotation file. > > I have tested a couple of normgene->exon controls and it seems that all of them belong to real genes in the probeset csv file with the type 'main'. Mostly, these come in groups belonging to the same exon_id, e.g. the mentioned controls 16934607 - 16934610 all belong to exon_id 5192052 of the EIF3D gene. > > I suppose that you use exons as normgene->exon controls which e.g. in the case of alternative splicing are always expressed. > > Best regards, > Christian > > > On 7/10/13 12:31 PM, Mark Cowley wrote: >> Hi, >> I see the point you're making Christian. >> I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file. >> >> Cheers and thanks for looking into this >> >> Mark >> >> Sent from my iPhone >> >> On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at="" aon.at=""> wrote: >> >>> Dear Jim, >>> >>> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. >>> >>> Best regards, >>> Christian >>> >>> >>> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>>> Hi Christian, >>>> >>>> Thanks for pointing that out. There is still a bit of an inconsistency >>>> with the pd packages that should probably be corrected, as at the >>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>>> whereas at the transcript level, this same probeset is intended to be a >>>> positive control (and as you note below, these probes are incorporated >>>> into a larger probeset intended to measure the EIF3D transcript). >>>> >>>> It would be nice to be able to filter out the controls like Mark >>>> attempted (and I do regularly as well). >>>> >>>> Mark - I talked with Benilton Carvalho about this, and he will take a >>>> look next week. >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> On 7/9/2013 3:38 PM, cstrato wrote: >>>>> Dear Jim, >>>>> >>>>> In xps I use as basic file for exon arrays the probeset annotation >>>>> file and then compare the data to the data from the pgf-file. Any >>>>> differences will be reported. >>>>> >>>>> I have just checked the different files for HuGene-2_0-st. If you >>>>> check as an example the following probeset_ids: >>>>> 16934607 >>>>> 16934608 >>>>> 16934609 >>>>> 16934610 >>>>> >>>>> Then you will see that the transcript annotation file lists these ids >>>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>>> annotation file lists these ids as 'main' belonging to gene EIF3D with >>>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>>> annotation file reveals that the number of 'total_probes' is 24. >>>>> Indeed, the probeset annotation file lists 24 probesets including the >>>>> four above mentioned probeset_ids. >>>>> >>>>> This means that although these four probesets are listed in the >>>>> transcript annotation file as 'normgene->exon' the label 'main' in the >>>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>>> >>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>>> listed as 'normgene->exon'. However, in this case these probesets are >>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>>> these probesets do not belong to any transcript listed in the probeset >>>>> annotation file. >>>>> >>>>> Best regards, >>>>> Christian >>>>> >>>>> >>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>>> Hi Christian, >>>>>> >>>>>> That's not the issue. Instead, the issue is that the pgf file lists the >>>>>> normgene->exon probeset IDs as being 'main'. I have received a response >>>>>> from Affy stating that the qcc file lists the normgene->exon probesets >>>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>>> >>>>>>> qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>>> stringsAsFactors=F, header=T) >>>>>>> pgf <- readPgf("HuGene-2_0-st.pgf") >>>>>>> head(qcc) >>>>>> probeset_id group_name probeset_name quantification_in_header >>>>>> 1 16650001 neg_control 16650001 0 >>>>>> 2 16650003 neg_control 16650003 0 >>>>>> >>>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>>> file >>>>>>> pos_cont <- qcc[qcc[,2] == "pos_control",1] >>>>>> >>>>>> ## compare to pgf file >>>>>>> x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>>> table(x) >>>>>> x >>>>>> main >>>>>> 1626 >>>>>> >>>>>> So in the pgf file, these probesets are being called 'main' instead of >>>>>> some sort of control. How do you handle this in xps? Do you use the pgf >>>>>> file? >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>>> Dear Jim, >>>>>>> >>>>>>> I did not really follow the discussion so I may be wrong, but if you >>>>>>> mean that there is a difference between the number of 'main' types, >>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>>> transcript annotation file. >>>>>>> >>>>>>> But as I said, I may have misunderstood the problem. >>>>>>> >>>>>>> I am mainly replying because at the beginning of this year I had long >>>>>>> discussions with DevNet to make sure that the annotation files for the >>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>>> everything what I have found. >>>>>>> >>>>>>> Best regards, >>>>>>> Christian >>>>>>> >>>>>>> >>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>>> Hi Mark, >>>>>>>> >>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>>> >>>>>>>>> x <- readPgf("HuGene-2_0-st.pgf") >>>>>>>>> table(x$probesetType) >>>>>>>> >>>>>>>> control->affx control->affx->bac_spike >>>>>>>> 18 18 >>>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>>> 92 39 >>>>>>>> control->bgp->antigenomic main >>>>>>>> 23 349012 >>>>>>>> normgene->intron reporter >>>>>>>> 3575 82 >>>>>>>> >>>>>>>>> y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>>> table(y$category) >>>>>>>> >>>>>>>> control->affx control->affx->bac_spike >>>>>>>> 18 18 >>>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>>> 92 39 >>>>>>>> control->bgp->antigenomic main >>>>>>>> 23 44629 >>>>>>>> normgene->exon normgene->intron >>>>>>>> 1626 3575 >>>>>>>> reporter rescue >>>>>>>> 82 3515 >>>>>>>> >>>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jim >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>>> Thanks for providing the platform design packages for >>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>>> >>>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>>> >>>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>>> category: >>>>>>>>> >>>>>>>>> # probe types: >>>>>>>>> librarypd.hugene.2.0.st) >>>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>> type type_id >>>>>>>>> 1 1 main >>>>>>>>> 2 2 control->affx >>>>>>>>> 3 3 control->chip >>>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>>> 5 5 control->bgp->genomic >>>>>>>>> 6 6 normgene->exon >>>>>>>>> 7 7 normgene->intron >>>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>>> 9 9 control->affx->bac_spike >>>>>>>>> 10 10 oligo_spike_in >>>>>>>>> 11 11 r1_bac_spike_at >>>>>>>>> >>>>>>>>> # probe counts for each of the probe categories: >>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>>> type") >>>>>>>>> type count(*) >>>>>>>>> 1 NA 3728 >>>>>>>>> 2 1 345497 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 3575 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> NB: no type 6 probes. >>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>>> >>>>>>>>> Can these mappings please be updated? >>>>>>>>> >>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>>> haven't investigated these in any detail. >>>>>>>>> >>>>>>>>> cheers, >>>>>>>>> Mark >>>>>>>>> ----------------------------------------------------- >>>>>>>>> Mark Cowley, PhD >>>>>>>>> >>>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>>> Sydney, Australia >>>>>>>>> ----------------------------------------------------- >>>>>>>>> >>>>>>>>> All 12 packages below: >>>>>>>>> pd.packages<- c( >>>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>>>>> "pd.hugene.2.1.st", >>>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>>>>> "pd.mogene.2.1.st", >>>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>>>>> "pd.ragene.2.1.st" >>>>>>>>> ) >>>>>>>>> >>>>>>>>> a<- b<- list() >>>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>>> try({ >>>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>>>>> the >>>>>>>>> pd.package") >>>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>>>>> featureSet GROUP BY type") >>>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>>> featureSet >>>>>>>>> WHERE type = 6")[,1] >>>>>>>>> }) >>>>>>>>> } >>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>> >>>>>>>>>> a >>>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 227 >>>>>>>>> 2 1 253002 >>>>>>>>> 3 2 57 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1195 >>>>>>>>> 6 7 2904 >>>>>>>>> >>>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 227 >>>>>>>>> 2 1 253002 >>>>>>>>> 3 2 57 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1195 >>>>>>>>> 6 7 2904 >>>>>>>>> >>>>>>>>> $pd.hugene.2.0.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 3728 >>>>>>>>> 2 1 345497 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 3575 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.hugene.2.1.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 3728 >>>>>>>>> 2 1 345497 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 3575 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 86 >>>>>>>>> 2 1 234878 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1324 >>>>>>>>> 6 7 5222 >>>>>>>>> >>>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 86 >>>>>>>>> 2 1 234878 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1324 >>>>>>>>> 6 7 5222 >>>>>>>>> >>>>>>>>> $pd.mogene.2.0.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 810 >>>>>>>>> 2 1 263551 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5331 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.mogene.2.1.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 810 >>>>>>>>> 2 1 263551 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5331 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 254 >>>>>>>>> 2 1 211195 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 399 >>>>>>>>> 6 7 1153 >>>>>>>>> >>>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 254 >>>>>>>>> 2 1 211195 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 399 >>>>>>>>> 6 7 1153 >>>>>>>>> >>>>>>>>> $pd.ragene.2.0.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 1071 >>>>>>>>> 2 1 214018 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5083 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.ragene.2.1.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 1071 >>>>>>>>> 2 1 214018 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5083 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>>> sapply(b,length) >>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>>> pd.hugene.2.1.st >>>>>>>>> 1195 1195 >>>>>>>>> 0 0 >>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>>> pd.mogene.2.1.st >>>>>>>>> 1324 1324 >>>>>>>>> 0 0 >>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>>> pd.ragene.2.1.st >>>>>>>>> 399 399 >>>>>>>>> 0 0 >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>> methods >>>>>>>>> [8] base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>

ADD REPLY • link 10.8 years ago Mark Cowley ▴ 910

0

Entering edit mode

Dear Mark, This needs probably to be answered by Jim, since for me everything is ok as is. Best regards, Christian On 7/11/13 1:55 AM, Mark Cowley wrote: > Hi, > is it feasible to add a 'type' column to the core_mps table, using info from the transcript csv file? since 'core' implies transcript- level analysis. > > cheers, > Mark > > On 11/07/2013, at 4:29 AM, cstrato <cstrato at="" aon.at=""> > wrote: > >> Dear Mark, >> >> Yes, EIF3D and pos_controls mentioned have different transcript cluster id's. However, imho this means only that they have received these additional transcript cluster id's to be able to define them as normgene->exon. Furthermore, in the transcript annotation file the transcript_cluster_id is identical to the probeset_id. Thus e.g. 16934607 is used as probeset_id in the probeset annotation file. >> >> I have tested a couple of normgene->exon controls and it seems that all of them belong to real genes in the probeset csv file with the type 'main'. Mostly, these come in groups belonging to the same exon_id, e.g. the mentioned controls 16934607 - 16934610 all belong to exon_id 5192052 of the EIF3D gene. >> >> I suppose that you use exons as normgene->exon controls which e.g. in the case of alternative splicing are always expressed. >> >> Best regards, >> Christian >> >> >> On 7/10/13 12:31 PM, Mark Cowley wrote: >>> Hi, >>> I see the point you're making Christian. >>> I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file. >>> >>> Cheers and thanks for looking into this >>> >>> Mark >>> >>> Sent from my iPhone >>> >>> On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at="" aon.at=""> wrote: >>> >>>> Dear Jim, >>>> >>>> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. >>>> >>>> Best regards, >>>> Christian >>>> >>>> >>>> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>>>> Hi Christian, >>>>> >>>>> Thanks for pointing that out. There is still a bit of an inconsistency >>>>> with the pd packages that should probably be corrected, as at the >>>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>>>> whereas at the transcript level, this same probeset is intended to be a >>>>> positive control (and as you note below, these probes are incorporated >>>>> into a larger probeset intended to measure the EIF3D transcript). >>>>> >>>>> It would be nice to be able to filter out the controls like Mark >>>>> attempted (and I do regularly as well). >>>>> >>>>> Mark - I talked with Benilton Carvalho about this, and he will take a >>>>> look next week. >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> On 7/9/2013 3:38 PM, cstrato wrote: >>>>>> Dear Jim, >>>>>> >>>>>> In xps I use as basic file for exon arrays the probeset annotation >>>>>> file and then compare the data to the data from the pgf-file. Any >>>>>> differences will be reported. >>>>>> >>>>>> I have just checked the different files for HuGene-2_0-st. If you >>>>>> check as an example the following probeset_ids: >>>>>> 16934607 >>>>>> 16934608 >>>>>> 16934609 >>>>>> 16934610 >>>>>> >>>>>> Then you will see that the transcript annotation file lists these ids >>>>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>>>> annotation file lists these ids as 'main' belonging to gene EIF3D with >>>>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>>>> annotation file reveals that the number of 'total_probes' is 24. >>>>>> Indeed, the probeset annotation file lists 24 probesets including the >>>>>> four above mentioned probeset_ids. >>>>>> >>>>>> This means that although these four probesets are listed in the >>>>>> transcript annotation file as 'normgene->exon' the label 'main' in the >>>>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>>>> >>>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>>>> listed as 'normgene->exon'. However, in this case these probesets are >>>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>>>> these probesets do not belong to any transcript listed in the probeset >>>>>> annotation file. >>>>>> >>>>>> Best regards, >>>>>> Christian >>>>>> >>>>>> >>>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>>>> Hi Christian, >>>>>>> >>>>>>> That's not the issue. Instead, the issue is that the pgf file lists the >>>>>>> normgene->exon probeset IDs as being 'main'. I have received a response >>>>>>> from Affy stating that the qcc file lists the normgene->exon probesets >>>>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>>>> >>>>>>>> qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>>>> stringsAsFactors=F, header=T) >>>>>>>> pgf <- readPgf("HuGene-2_0-st.pgf") >>>>>>>> head(qcc) >>>>>>> probeset_id group_name probeset_name quantification_in_header >>>>>>> 1 16650001 neg_control 16650001 0 >>>>>>> 2 16650003 neg_control 16650003 0 >>>>>>> >>>>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>>>> file >>>>>>>> pos_cont <- qcc[qcc[,2] == "pos_control",1] >>>>>>> >>>>>>> ## compare to pgf file >>>>>>>> x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>>>> table(x) >>>>>>> x >>>>>>> main >>>>>>> 1626 >>>>>>> >>>>>>> So in the pgf file, these probesets are being called 'main' instead of >>>>>>> some sort of control. How do you handle this in xps? Do you use the pgf >>>>>>> file? >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>>>> Dear Jim, >>>>>>>> >>>>>>>> I did not really follow the discussion so I may be wrong, but if you >>>>>>>> mean that there is a difference between the number of 'main' types, >>>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>>>> transcript annotation file. >>>>>>>> >>>>>>>> But as I said, I may have misunderstood the problem. >>>>>>>> >>>>>>>> I am mainly replying because at the beginning of this year I had long >>>>>>>> discussions with DevNet to make sure that the annotation files for the >>>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>>>> everything what I have found. >>>>>>>> >>>>>>>> Best regards, >>>>>>>> Christian >>>>>>>> >>>>>>>> >>>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>>>> Hi Mark, >>>>>>>>> >>>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>>>> >>>>>>>>>> x <- readPgf("HuGene-2_0-st.pgf") >>>>>>>>>> table(x$probesetType) >>>>>>>>> >>>>>>>>> control->affx control->affx->bac_spike >>>>>>>>> 18 18 >>>>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>>>> 92 39 >>>>>>>>> control->bgp->antigenomic main >>>>>>>>> 23 349012 >>>>>>>>> normgene->intron reporter >>>>>>>>> 3575 82 >>>>>>>>> >>>>>>>>>> y <- read.csv("HuGene- 2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>>>> table(y$category) >>>>>>>>> >>>>>>>>> control->affx control->affx->bac_spike >>>>>>>>> 18 18 >>>>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>>>> 92 39 >>>>>>>>> control->bgp->antigenomic main >>>>>>>>> 23 44629 >>>>>>>>> normgene->exon normgene->intron >>>>>>>>> 1626 3575 >>>>>>>>> reporter rescue >>>>>>>>> 82 3515 >>>>>>>>> >>>>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>>>> Thanks for providing the platform design packages for >>>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>>>> >>>>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>>>> >>>>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>>>> category: >>>>>>>>>> >>>>>>>>>> # probe types: >>>>>>>>>> librarypd.hugene.2.0.st) >>>>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>>> type type_id >>>>>>>>>> 1 1 main >>>>>>>>>> 2 2 control->affx >>>>>>>>>> 3 3 control->chip >>>>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>>>> 5 5 control->bgp->genomic >>>>>>>>>> 6 6 normgene->exon >>>>>>>>>> 7 7 normgene->intron >>>>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>>>> 9 9 control->affx->bac_spike >>>>>>>>>> 10 10 oligo_spike_in >>>>>>>>>> 11 11 r1_bac_spike_at >>>>>>>>>> >>>>>>>>>> # probe counts for each of the probe categories: >>>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>>>> type") >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 3728 >>>>>>>>>> 2 1 345497 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 3575 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>> NB: no type 6 probes. >>>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>>>> >>>>>>>>>> Can these mappings please be updated? >>>>>>>>>> >>>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>>>> haven't investigated these in any detail. >>>>>>>>>> >>>>>>>>>> cheers, >>>>>>>>>> Mark >>>>>>>>>> ----------------------------------------------------- >>>>>>>>>> Mark Cowley, PhD >>>>>>>>>> >>>>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>>>> Sydney, Australia >>>>>>>>>> ----------------------------------------------------- >>>>>>>>>> >>>>>>>>>> All 12 packages below: >>>>>>>>>> pd.packages<- c( >>>>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>>>>>> "pd.hugene.2.1.st", >>>>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>>>>>> "pd.mogene.2.1.st", >>>>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>>>>>> "pd.ragene.2.1.st" >>>>>>>>>> ) >>>>>>>>>> >>>>>>>>>> a<- b<- list() >>>>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>>>> try({ >>>>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>>>>>> the >>>>>>>>>> pd.package") >>>>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>>>>>> featureSet GROUP BY type") >>>>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>>>> featureSet >>>>>>>>>> WHERE type = 6")[,1] >>>>>>>>>> }) >>>>>>>>>> } >>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>>> >>>>>>>>>>> a >>>>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 227 >>>>>>>>>> 2 1 253002 >>>>>>>>>> 3 2 57 >>>>>>>>>> 4 4 45 >>>>>>>>>> 5 6 1195 >>>>>>>>>> 6 7 2904 >>>>>>>>>> >>>>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 227 >>>>>>>>>> 2 1 253002 >>>>>>>>>> 3 2 57 >>>>>>>>>> 4 4 45 >>>>>>>>>> 5 6 1195 >>>>>>>>>> 6 7 2904 >>>>>>>>>> >>>>>>>>>> $pd.hugene.2.0.st >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 3728 >>>>>>>>>> 2 1 345497 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 3575 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>> $pd.hugene.2.1.st >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 3728 >>>>>>>>>> 2 1 345497 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 3575 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 86 >>>>>>>>>> 2 1 234878 >>>>>>>>>> 3 2 21 >>>>>>>>>> 4 4 45 >>>>>>>>>> 5 6 1324 >>>>>>>>>> 6 7 5222 >>>>>>>>>> >>>>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 86 >>>>>>>>>> 2 1 234878 >>>>>>>>>> 3 2 21 >>>>>>>>>> 4 4 45 >>>>>>>>>> 5 6 1324 >>>>>>>>>> 6 7 5222 >>>>>>>>>> >>>>>>>>>> $pd.mogene.2.0.st >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 810 >>>>>>>>>> 2 1 263551 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 5331 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>> $pd.mogene.2.1.st >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 810 >>>>>>>>>> 2 1 263551 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 5331 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 254 >>>>>>>>>> 2 1 211195 >>>>>>>>>> 3 2 21 >>>>>>>>>> 4 4 45 >>>>>>>>>> 5 6 399 >>>>>>>>>> 6 7 1153 >>>>>>>>>> >>>>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 254 >>>>>>>>>> 2 1 211195 >>>>>>>>>> 3 2 21 >>>>>>>>>> 4 4 45 >>>>>>>>>> 5 6 399 >>>>>>>>>> 6 7 1153 >>>>>>>>>> >>>>>>>>>> $pd.ragene.2.0.st >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 1071 >>>>>>>>>> 2 1 214018 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 5083 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>> $pd.ragene.2.1.st >>>>>>>>>> type count(*) >>>>>>>>>> 1 NA 1071 >>>>>>>>>> 2 1 214018 >>>>>>>>>> 3 2 18 >>>>>>>>>> 4 4 23 >>>>>>>>>> 5 7 5083 >>>>>>>>>> 6 9 18 >>>>>>>>>> >>>>>>>>>>> sapply(b,length) >>>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>>>> pd.hugene.2.1.st >>>>>>>>>> 1195 1195 >>>>>>>>>> 0 0 >>>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>>>> pd.mogene.2.1.st >>>>>>>>>> 1324 1324 >>>>>>>>>> 0 0 >>>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>>>> pd.ragene.2.1.st >>>>>>>>>> 399 399 >>>>>>>>>> 0 0 >>>>>>>>>> >>>>>>>>>>> sessionInfo() >>>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>> >>>>>>>>>> locale: >>>>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>>>> >>>>>>>>>> attached base packages: >>>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>>> methods >>>>>>>>>> [8] base >>>>>>>>>> >>>>>>>>>> other attached packages: >>>>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>>>> >>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>> > >

ADD REPLY • link 10.8 years ago cstrato ★ 3.9k

0

Entering edit mode

Hi Mark et. al., On the initial example given by Mark, the PGF lists (a selection of) probes as "main", while the transcript annotation file lists them as "normgene->exon". I understand that Affymetrix uses probes that map to "actual stuff" (lack of better word, sorry) as positive controls... In this case, we have probes that are both "main" and "normgene->exon". On the newest designs, we are missing the annotation described in the transcript annotation file for the probes that belong to multiple classes. If I'm not missing any information, Mark's suggestion (add 'type' to the *_mps tables) seems appropriate. But that requires redesigning the db... Will investigate the best approach for this and come back with more on the topic soon. b On 11 July 2013 14:53, cstrato <cstrato at="" aon.at=""> wrote: > Dear Mark, > > This needs probably to be answered by Jim, since for me everything is ok as > is. > > Best regards, > Christian > > > > On 7/11/13 1:55 AM, Mark Cowley wrote: >> >> Hi, >> is it feasible to add a 'type' column to the core_mps table, using info >> from the transcript csv file? since 'core' implies transcript-level >> analysis. >> >> cheers, >> Mark >> >> On 11/07/2013, at 4:29 AM, cstrato <cstrato at="" aon.at=""> >> wrote: >> >>> Dear Mark, >>> >>> Yes, EIF3D and pos_controls mentioned have different transcript cluster >>> id's. However, imho this means only that they have received these additional >>> transcript cluster id's to be able to define them as normgene->exon. >>> Furthermore, in the transcript annotation file the transcript_cluster_id is >>> identical to the probeset_id. Thus e.g. 16934607 is used as probeset_id in >>> the probeset annotation file. >>> >>> I have tested a couple of normgene->exon controls and it seems that all >>> of them belong to real genes in the probeset csv file with the type 'main'. >>> Mostly, these come in groups belonging to the same exon_id, e.g. the >>> mentioned controls 16934607 - 16934610 all belong to exon_id 5192052 of the >>> EIF3D gene. >>> >>> I suppose that you use exons as normgene->exon controls which e.g. in the >>> case of alternative splicing are always expressed. >>> >>> Best regards, >>> Christian >>> >>> >>> On 7/10/13 12:31 PM, Mark Cowley wrote: >>>> >>>> Hi, >>>> I see the point you're making Christian. >>>> I'm offline now, so cant check this, but i assume that EIF3D and the >>>> pos_control meta-probeset in question have different transcript cluster >>>> id's. it doesn't make sense for the pos_control transcript cluster Id to be >>>> tagged as 'main'. My grep -f was still running when i left work to confirm >>>> whether all normgene->exon probesets are all deployed within different real >>>> genes in the probeset csv file. >>>> >>>> Cheers and thanks for looking into this >>>> >>>> Mark >>>> >>>> Sent from my iPhone >>>> >>>> On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at="" aon.at=""> wrote: >>>> >>>>> Dear Jim, >>>>> >>>>> As far as I understand it, at the transcript level 16934607 is on one >>>>> hand part of the EIF3D transcript and on the other hand does serve as a >>>>> positive control. To me this seems to be no contradiction, but probably only >>>>> DevNet can explain. >>>>> >>>>> Best regards, >>>>> Christian >>>>> >>>>> >>>>> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>>>>> >>>>>> Hi Christian, >>>>>> >>>>>> Thanks for pointing that out. There is still a bit of an inconsistency >>>>>> with the pd packages that should probably be corrected, as at the >>>>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>>>>> whereas at the transcript level, this same probeset is intended to be >>>>>> a >>>>>> positive control (and as you note below, these probes are incorporated >>>>>> into a larger probeset intended to measure the EIF3D transcript). >>>>>> >>>>>> It would be nice to be able to filter out the controls like Mark >>>>>> attempted (and I do regularly as well). >>>>>> >>>>>> Mark - I talked with Benilton Carvalho about this, and he will take a >>>>>> look next week. >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> On 7/9/2013 3:38 PM, cstrato wrote: >>>>>>> >>>>>>> Dear Jim, >>>>>>> >>>>>>> In xps I use as basic file for exon arrays the probeset annotation >>>>>>> file and then compare the data to the data from the pgf-file. Any >>>>>>> differences will be reported. >>>>>>> >>>>>>> I have just checked the different files for HuGene-2_0-st. If you >>>>>>> check as an example the following probeset_ids: >>>>>>> 16934607 >>>>>>> 16934608 >>>>>>> 16934609 >>>>>>> 16934610 >>>>>>> >>>>>>> Then you will see that the transcript annotation file lists these ids >>>>>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>>>>> annotation file lists these ids as 'main' belonging to gene EIF3D >>>>>>> with >>>>>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>>>>> annotation file reveals that the number of 'total_probes' is 24. >>>>>>> Indeed, the probeset annotation file lists 24 probesets including the >>>>>>> four above mentioned probeset_ids. >>>>>>> >>>>>>> This means that although these four probesets are listed in the >>>>>>> transcript annotation file as 'normgene->exon' the label 'main' in >>>>>>> the >>>>>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>>>>> >>>>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>>>>> listed as 'normgene->exon'. However, in this case these probesets are >>>>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>>>>> these probesets do not belong to any transcript listed in the >>>>>>> probeset >>>>>>> annotation file. >>>>>>> >>>>>>> Best regards, >>>>>>> Christian >>>>>>> >>>>>>> >>>>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>>>>> >>>>>>>> Hi Christian, >>>>>>>> >>>>>>>> That's not the issue. Instead, the issue is that the pgf file lists >>>>>>>> the >>>>>>>> normgene->exon probeset IDs as being 'main'. I have received a >>>>>>>> response >>>>>>>> from Affy stating that the qcc file lists the normgene->exon >>>>>>>> probesets >>>>>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>>>>> >>>>>>>>> qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>>>>> >>>>>>>> stringsAsFactors=F, header=T) >>>>>>>>> >>>>>>>>> pgf <- readPgf("HuGene-2_0-st.pgf") >>>>>>>>> head(qcc) >>>>>>>> >>>>>>>> probeset_id group_name probeset_name quantification_in_header >>>>>>>> 1 16650001 neg_control 16650001 0 >>>>>>>> 2 16650003 neg_control 16650003 0 >>>>>>>> >>>>>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>>>>> file >>>>>>>>> >>>>>>>>> pos_cont <- qcc[qcc[,2] == "pos_control",1] >>>>>>>> >>>>>>>> >>>>>>>> ## compare to pgf file >>>>>>>>> >>>>>>>>> x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>>>>> table(x) >>>>>>>> >>>>>>>> x >>>>>>>> main >>>>>>>> 1626 >>>>>>>> >>>>>>>> So in the pgf file, these probesets are being called 'main' instead >>>>>>>> of >>>>>>>> some sort of control. How do you handle this in xps? Do you use the >>>>>>>> pgf >>>>>>>> file? >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jim >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>>>>> >>>>>>>>> Dear Jim, >>>>>>>>> >>>>>>>>> I did not really follow the discussion so I may be wrong, but if >>>>>>>>> you >>>>>>>>> mean that there is a difference between the number of 'main' types, >>>>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds >>>>>>>>> to >>>>>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>>>>> transcript annotation file. >>>>>>>>> >>>>>>>>> But as I said, I may have misunderstood the problem. >>>>>>>>> >>>>>>>>> I am mainly replying because at the beginning of this year I had >>>>>>>>> long >>>>>>>>> discussions with DevNet to make sure that the annotation files for >>>>>>>>> the >>>>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>>>>> everything what I have found. >>>>>>>>> >>>>>>>>> Best regards, >>>>>>>>> Christian >>>>>>>>> >>>>>>>>> >>>>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>>>>> >>>>>>>>>> Hi Mark, >>>>>>>>>> >>>>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>>>>> >>>>>>>>>>> x <- readPgf("HuGene-2_0-st.pgf") >>>>>>>>>>> table(x$probesetType) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> control->affx control->affx->bac_spike >>>>>>>>>> 18 18 >>>>>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>>>>> 92 39 >>>>>>>>>> control->bgp->antigenomic main >>>>>>>>>> 23 349012 >>>>>>>>>> normgene->intron reporter >>>>>>>>>> 3575 82 >>>>>>>>>> >>>>>>>>>>> y <- read.csv("HuGene- 2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>>>>> >>>>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>>>>> >>>>>>>>>>> table(y$category) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> control->affx control->affx->bac_spike >>>>>>>>>> 18 18 >>>>>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>>>>> 92 39 >>>>>>>>>> control->bgp->antigenomic main >>>>>>>>>> 23 44629 >>>>>>>>>> normgene->exon normgene->intron >>>>>>>>>> 1626 3575 >>>>>>>>>> reporter rescue >>>>>>>>>> 82 3515 >>>>>>>>>> >>>>>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> >>>>>>>>>> Jim >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>>>>> >>>>>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>>>>> Thanks for providing the platform design packages for >>>>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>>>>> >>>>>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, >>>>>>>>>>> but >>>>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>>>>> >>>>>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx >>>>>>>>>>> csv >>>>>>>>>>> files lists 1626 normgene->exon probes, however the >>>>>>>>>>> pg.hugene.2.0.st >>>>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>>>>> category: >>>>>>>>>>> >>>>>>>>>>> # probe types: >>>>>>>>>>> librarypd.hugene.2.0.st) >>>>>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>>>> type type_id >>>>>>>>>>> 1 1 main >>>>>>>>>>> 2 2 control->affx >>>>>>>>>>> 3 3 control->chip >>>>>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>>>>> 5 5 control->bgp->genomic >>>>>>>>>>> 6 6 normgene->exon >>>>>>>>>>> 7 7 normgene->intron >>>>>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>>>>> 9 9 control->affx->bac_spike >>>>>>>>>>> 10 10 oligo_spike_in >>>>>>>>>>> 11 11 r1_bac_spike_at >>>>>>>>>>> >>>>>>>>>>> # probe counts for each of the probe categories: >>>>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>>>>> type") >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 3728 >>>>>>>>>>> 2 1 345497 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 3575 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>> NB: no type 6 probes. >>>>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and >>>>>>>>>>> see >>>>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>>>>> >>>>>>>>>>> Can these mappings please be updated? >>>>>>>>>>> >>>>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>>>>> haven't investigated these in any detail. >>>>>>>>>>> >>>>>>>>>>> cheers, >>>>>>>>>>> Mark >>>>>>>>>>> ----------------------------------------------------- >>>>>>>>>>> Mark Cowley, PhD >>>>>>>>>>> >>>>>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>>>>> Sydney, Australia >>>>>>>>>>> ----------------------------------------------------- >>>>>>>>>>> >>>>>>>>>>> All 12 packages below: >>>>>>>>>>> pd.packages<- c( >>>>>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", >>>>>>>>>>> "pd.hugene.2.0.st", >>>>>>>>>>> "pd.hugene.2.1.st", >>>>>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", >>>>>>>>>>> "pd.mogene.2.0.st", >>>>>>>>>>> "pd.mogene.2.1.st", >>>>>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", >>>>>>>>>>> "pd.ragene.2.0.st", >>>>>>>>>>> "pd.ragene.2.1.st" >>>>>>>>>>> ) >>>>>>>>>>> >>>>>>>>>>> a<- b<- list() >>>>>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>>>>> try({ >>>>>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't >>>>>>>>>>> load >>>>>>>>>>> the >>>>>>>>>>> pd.package") >>>>>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) >>>>>>>>>>> from >>>>>>>>>>> featureSet GROUP BY type") >>>>>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>>>>> featureSet >>>>>>>>>>> WHERE type = 6")[,1] >>>>>>>>>>> }) >>>>>>>>>>> } >>>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>>>> >>>>>>>>>>>> a >>>>>>>>>>> >>>>>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 227 >>>>>>>>>>> 2 1 253002 >>>>>>>>>>> 3 2 57 >>>>>>>>>>> 4 4 45 >>>>>>>>>>> 5 6 1195 >>>>>>>>>>> 6 7 2904 >>>>>>>>>>> >>>>>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 227 >>>>>>>>>>> 2 1 253002 >>>>>>>>>>> 3 2 57 >>>>>>>>>>> 4 4 45 >>>>>>>>>>> 5 6 1195 >>>>>>>>>>> 6 7 2904 >>>>>>>>>>> >>>>>>>>>>> $pd.hugene.2.0.st >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 3728 >>>>>>>>>>> 2 1 345497 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 3575 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>> $pd.hugene.2.1.st >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 3728 >>>>>>>>>>> 2 1 345497 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 3575 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 86 >>>>>>>>>>> 2 1 234878 >>>>>>>>>>> 3 2 21 >>>>>>>>>>> 4 4 45 >>>>>>>>>>> 5 6 1324 >>>>>>>>>>> 6 7 5222 >>>>>>>>>>> >>>>>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 86 >>>>>>>>>>> 2 1 234878 >>>>>>>>>>> 3 2 21 >>>>>>>>>>> 4 4 45 >>>>>>>>>>> 5 6 1324 >>>>>>>>>>> 6 7 5222 >>>>>>>>>>> >>>>>>>>>>> $pd.mogene.2.0.st >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 810 >>>>>>>>>>> 2 1 263551 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 5331 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>> $pd.mogene.2.1.st >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 810 >>>>>>>>>>> 2 1 263551 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 5331 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 254 >>>>>>>>>>> 2 1 211195 >>>>>>>>>>> 3 2 21 >>>>>>>>>>> 4 4 45 >>>>>>>>>>> 5 6 399 >>>>>>>>>>> 6 7 1153 >>>>>>>>>>> >>>>>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 254 >>>>>>>>>>> 2 1 211195 >>>>>>>>>>> 3 2 21 >>>>>>>>>>> 4 4 45 >>>>>>>>>>> 5 6 399 >>>>>>>>>>> 6 7 1153 >>>>>>>>>>> >>>>>>>>>>> $pd.ragene.2.0.st >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 1071 >>>>>>>>>>> 2 1 214018 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 5083 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>> $pd.ragene.2.1.st >>>>>>>>>>> type count(*) >>>>>>>>>>> 1 NA 1071 >>>>>>>>>>> 2 1 214018 >>>>>>>>>>> 3 2 18 >>>>>>>>>>> 4 4 23 >>>>>>>>>>> 5 7 5083 >>>>>>>>>>> 6 9 18 >>>>>>>>>>> >>>>>>>>>>>> sapply(b,length) >>>>>>>>>>> >>>>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>>>>> pd.hugene.2.1.st >>>>>>>>>>> 1195 1195 >>>>>>>>>>> 0 0 >>>>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>>>>> pd.mogene.2.1.st >>>>>>>>>>> 1324 1324 >>>>>>>>>>> 0 0 >>>>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>>>>> pd.ragene.2.1.st >>>>>>>>>>> 399 399 >>>>>>>>>>> 0 0 >>>>>>>>>>> >>>>>>>>>>>> sessionInfo() >>>>>>>>>>> >>>>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>>> >>>>>>>>>>> locale: >>>>>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>>>>> >>>>>>>>>>> attached base packages: >>>>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>>>> methods >>>>>>>>>>> [8] base >>>>>>>>>>> >>>>>>>>>>> other attached packages: >>>>>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>>>>> >>>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 >>>>>>>>>>> Biostrings_2.28.0 >>>>>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>> Search the archives: >>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>> >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.8 years ago Benilton Carvalho ★ 4.3k

0

Entering edit mode

Hi Benilton, That sounds like a great solution. I hope it's not too much work! Mark Sent from my iPhone On 20/07/2013, at 9:25 AM, "Benilton Carvalho" <beniltoncarvalho at="" gmail.com=""> wrote: > Hi Mark et. al., > > On the initial example given by Mark, the PGF lists (a selection of) > probes as "main", while the transcript annotation file lists them as > "normgene->exon". > > I understand that Affymetrix uses probes that map to "actual stuff" > (lack of better word, sorry) as positive controls... In this case, we > have probes that are both "main" and "normgene->exon". > > On the newest designs, we are missing the annotation described in the > transcript annotation file for the probes that belong to multiple > classes. > > If I'm not missing any information, Mark's suggestion (add 'type' to > the *_mps tables) seems appropriate. But that requires redesigning the > db... Will investigate the best approach for this and come back with > more on the topic soon. > > b > > On 11 July 2013 14:53, cstrato <cstrato at="" aon.at=""> wrote: >> Dear Mark, >> >> This needs probably to be answered by Jim, since for me everything is ok as >> is. >> >> Best regards, >> Christian >> >> >> >> On 7/11/13 1:55 AM, Mark Cowley wrote: >>> >>> Hi, >>> is it feasible to add a 'type' column to the core_mps table, using info >>> from the transcript csv file? since 'core' implies transcript- level >>> analysis. >>> >>> cheers, >>> Mark >>> >>> On 11/07/2013, at 4:29 AM, cstrato <cstrato at="" aon.at=""> >>> wrote: >>> >>>> Dear Mark, >>>> >>>> Yes, EIF3D and pos_controls mentioned have different transcript cluster >>>> id's. However, imho this means only that they have received these additional >>>> transcript cluster id's to be able to define them as normgene->exon. >>>> Furthermore, in the transcript annotation file the transcript_cluster_id is >>>> identical to the probeset_id. Thus e.g. 16934607 is used as probeset_id in >>>> the probeset annotation file. >>>> >>>> I have tested a couple of normgene->exon controls and it seems that all >>>> of them belong to real genes in the probeset csv file with the type 'main'. >>>> Mostly, these come in groups belonging to the same exon_id, e.g. the >>>> mentioned controls 16934607 - 16934610 all belong to exon_id 5192052 of the >>>> EIF3D gene. >>>> >>>> I suppose that you use exons as normgene->exon controls which e.g. in the >>>> case of alternative splicing are always expressed. >>>> >>>> Best regards, >>>> Christian >>>> >>>> >>>> On 7/10/13 12:31 PM, Mark Cowley wrote: >>>>> >>>>> Hi, >>>>> I see the point you're making Christian. >>>>> I'm offline now, so cant check this, but i assume that EIF3D and the >>>>> pos_control meta-probeset in question have different transcript cluster >>>>> id's. it doesn't make sense for the pos_control transcript cluster Id to be >>>>> tagged as 'main'. My grep -f was still running when i left work to confirm >>>>> whether all normgene->exon probesets are all deployed within different real >>>>> genes in the probeset csv file. >>>>> >>>>> Cheers and thanks for looking into this >>>>> >>>>> Mark >>>>> >>>>> Sent from my iPhone >>>>> >>>>> On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at="" aon.at=""> wrote: >>>>> >>>>>> Dear Jim, >>>>>> >>>>>> As far as I understand it, at the transcript level 16934607 is on one >>>>>> hand part of the EIF3D transcript and on the other hand does serve as a >>>>>> positive control. To me this seems to be no contradiction, but probably only >>>>>> DevNet can explain. >>>>>> >>>>>> Best regards, >>>>>> Christian >>>>>> >>>>>> >>>>>> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>>>>>> >>>>>>> Hi Christian, >>>>>>> >>>>>>> Thanks for pointing that out. There is still a bit of an inconsistency >>>>>>> with the pd packages that should probably be corrected, as at the >>>>>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>>>>>> whereas at the transcript level, this same probeset is intended to be >>>>>>> a >>>>>>> positive control (and as you note below, these probes are incorporated >>>>>>> into a larger probeset intended to measure the EIF3D transcript). >>>>>>> >>>>>>> It would be nice to be able to filter out the controls like Mark >>>>>>> attempted (and I do regularly as well). >>>>>>> >>>>>>> Mark - I talked with Benilton Carvalho about this, and he will take a >>>>>>> look next week. >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7/9/2013 3:38 PM, cstrato wrote: >>>>>>>> >>>>>>>> Dear Jim, >>>>>>>> >>>>>>>> In xps I use as basic file for exon arrays the probeset annotation >>>>>>>> file and then compare the data to the data from the pgf-file. Any >>>>>>>> differences will be reported. >>>>>>>> >>>>>>>> I have just checked the different files for HuGene-2_0-st. If you >>>>>>>> check as an example the following probeset_ids: >>>>>>>> 16934607 >>>>>>>> 16934608 >>>>>>>> 16934609 >>>>>>>> 16934610 >>>>>>>> >>>>>>>> Then you will see that the transcript annotation file lists these ids >>>>>>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>>>>>> annotation file lists these ids as 'main' belonging to gene EIF3D >>>>>>>> with >>>>>>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>>>>>> annotation file reveals that the number of 'total_probes' is 24. >>>>>>>> Indeed, the probeset annotation file lists 24 probesets including the >>>>>>>> four above mentioned probeset_ids. >>>>>>>> >>>>>>>> This means that although these four probesets are listed in the >>>>>>>> transcript annotation file as 'normgene->exon' the label 'main' in >>>>>>>> the >>>>>>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>>>>>> >>>>>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>>>>>> listed as 'normgene->exon'. However, in this case these probesets are >>>>>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>>>>>> these probesets do not belong to any transcript listed in the >>>>>>>> probeset >>>>>>>> annotation file. >>>>>>>> >>>>>>>> Best regards, >>>>>>>> Christian >>>>>>>> >>>>>>>> >>>>>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>>>>>> >>>>>>>>> Hi Christian, >>>>>>>>> >>>>>>>>> That's not the issue. Instead, the issue is that the pgf file lists >>>>>>>>> the >>>>>>>>> normgene->exon probeset IDs as being 'main'. I have received a >>>>>>>>> response >>>>>>>>> from Affy stating that the qcc file lists the normgene->exon >>>>>>>>> probesets >>>>>>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>>>>>> >>>>>>>>>> qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>>>>>> >>>>>>>>> stringsAsFactors=F, header=T) >>>>>>>>>> >>>>>>>>>> pgf <- readPgf("HuGene-2_0-st.pgf") >>>>>>>>>> head(qcc) >>>>>>>>> >>>>>>>>> probeset_id group_name probeset_name quantification_in_header >>>>>>>>> 1 16650001 neg_control 16650001 0 >>>>>>>>> 2 16650003 neg_control 16650003 0 >>>>>>>>> >>>>>>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>>>>>> file >>>>>>>>>> >>>>>>>>>> pos_cont <- qcc[qcc[,2] == "pos_control",1] >>>>>>>>> >>>>>>>>> >>>>>>>>> ## compare to pgf file >>>>>>>>>> >>>>>>>>>> x <- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>>>>>> table(x) >>>>>>>>> >>>>>>>>> x >>>>>>>>> main >>>>>>>>> 1626 >>>>>>>>> >>>>>>>>> So in the pgf file, these probesets are being called 'main' instead >>>>>>>>> of >>>>>>>>> some sort of control. How do you handle this in xps? Do you use the >>>>>>>>> pgf >>>>>>>>> file? >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>>>>>> >>>>>>>>>> Dear Jim, >>>>>>>>>> >>>>>>>>>> I did not really follow the discussion so I may be wrong, but if >>>>>>>>>> you >>>>>>>>>> mean that there is a difference between the number of 'main' types, >>>>>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds >>>>>>>>>> to >>>>>>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>>>>>> transcript annotation file. >>>>>>>>>> >>>>>>>>>> But as I said, I may have misunderstood the problem. >>>>>>>>>> >>>>>>>>>> I am mainly replying because at the beginning of this year I had >>>>>>>>>> long >>>>>>>>>> discussions with DevNet to make sure that the annotation files for >>>>>>>>>> the >>>>>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>>>>>> everything what I have found. >>>>>>>>>> >>>>>>>>>> Best regards, >>>>>>>>>> Christian >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Mark, >>>>>>>>>>> >>>>>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>>>>>> >>>>>>>>>>>> x <- readPgf("HuGene-2_0-st.pgf") >>>>>>>>>>>> table(x$probesetType) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> control->affx control->affx->bac_spike >>>>>>>>>>> 18 18 >>>>>>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>>>>>> 92 39 >>>>>>>>>>> control->bgp->antigenomic main >>>>>>>>>>> 23 349012 >>>>>>>>>>> normgene->intron reporter >>>>>>>>>>> 3575 82 >>>>>>>>>>> >>>>>>>>>>>> y <- read.csv("HuGene- 2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>>>>>> >>>>>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>>>>>> >>>>>>>>>>>> table(y$category) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> control->affx control->affx->bac_spike >>>>>>>>>>> 18 18 >>>>>>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>>>>>> 92 39 >>>>>>>>>>> control->bgp->antigenomic main >>>>>>>>>>> 23 44629 >>>>>>>>>>> normgene->exon normgene->intron >>>>>>>>>>> 1626 3575 >>>>>>>>>>> reporter rescue >>>>>>>>>>> 82 3515 >>>>>>>>>>> >>>>>>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> >>>>>>>>>>> Jim >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>>>>>> >>>>>>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>>>>>> Thanks for providing the platform design packages for >>>>>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>>>>>> >>>>>>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, >>>>>>>>>>>> but >>>>>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>>>>>> >>>>>>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx >>>>>>>>>>>> csv >>>>>>>>>>>> files lists 1626 normgene->exon probes, however the >>>>>>>>>>>> pg.hugene.2.0.st >>>>>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>>>>>> category: >>>>>>>>>>>> >>>>>>>>>>>> # probe types: >>>>>>>>>>>> librarypd.hugene.2.0.st) >>>>>>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>>>>> type type_id >>>>>>>>>>>> 1 1 main >>>>>>>>>>>> 2 2 control->affx >>>>>>>>>>>> 3 3 control->chip >>>>>>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>>>>>> 5 5 control->bgp->genomic >>>>>>>>>>>> 6 6 normgene->exon >>>>>>>>>>>> 7 7 normgene->intron >>>>>>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>>>>>> 9 9 control->affx->bac_spike >>>>>>>>>>>> 10 10 oligo_spike_in >>>>>>>>>>>> 11 11 r1_bac_spike_at >>>>>>>>>>>> >>>>>>>>>>>> # probe counts for each of the probe categories: >>>>>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>>>>>> type") >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 3728 >>>>>>>>>>>> 2 1 345497 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 3575 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>> NB: no type 6 probes. >>>>>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and >>>>>>>>>>>> see >>>>>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>>>>>> >>>>>>>>>>>> Can these mappings please be updated? >>>>>>>>>>>> >>>>>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>>>>>> haven't investigated these in any detail. >>>>>>>>>>>> >>>>>>>>>>>> cheers, >>>>>>>>>>>> Mark >>>>>>>>>>>> ----------------------------------------------------- >>>>>>>>>>>> Mark Cowley, PhD >>>>>>>>>>>> >>>>>>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>>>>>> Sydney, Australia >>>>>>>>>>>> ----------------------------------------------------- >>>>>>>>>>>> >>>>>>>>>>>> All 12 packages below: >>>>>>>>>>>> pd.packages<- c( >>>>>>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", >>>>>>>>>>>> "pd.hugene.2.0.st", >>>>>>>>>>>> "pd.hugene.2.1.st", >>>>>>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", >>>>>>>>>>>> "pd.mogene.2.0.st", >>>>>>>>>>>> "pd.mogene.2.1.st", >>>>>>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", >>>>>>>>>>>> "pd.ragene.2.0.st", >>>>>>>>>>>> "pd.ragene.2.1.st" >>>>>>>>>>>> ) >>>>>>>>>>>> >>>>>>>>>>>> a<- b<- list() >>>>>>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>>>>>> try({ >>>>>>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't >>>>>>>>>>>> load >>>>>>>>>>>> the >>>>>>>>>>>> pd.package") >>>>>>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) >>>>>>>>>>>> from >>>>>>>>>>>> featureSet GROUP BY type") >>>>>>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>>>>>> featureSet >>>>>>>>>>>> WHERE type = 6")[,1] >>>>>>>>>>>> }) >>>>>>>>>>>> } >>>>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>>>>> >>>>>>>>>>>>> a >>>>>>>>>>>> >>>>>>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 227 >>>>>>>>>>>> 2 1 253002 >>>>>>>>>>>> 3 2 57 >>>>>>>>>>>> 4 4 45 >>>>>>>>>>>> 5 6 1195 >>>>>>>>>>>> 6 7 2904 >>>>>>>>>>>> >>>>>>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 227 >>>>>>>>>>>> 2 1 253002 >>>>>>>>>>>> 3 2 57 >>>>>>>>>>>> 4 4 45 >>>>>>>>>>>> 5 6 1195 >>>>>>>>>>>> 6 7 2904 >>>>>>>>>>>> >>>>>>>>>>>> $pd.hugene.2.0.st >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 3728 >>>>>>>>>>>> 2 1 345497 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 3575 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>> $pd.hugene.2.1.st >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 3728 >>>>>>>>>>>> 2 1 345497 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 3575 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 86 >>>>>>>>>>>> 2 1 234878 >>>>>>>>>>>> 3 2 21 >>>>>>>>>>>> 4 4 45 >>>>>>>>>>>> 5 6 1324 >>>>>>>>>>>> 6 7 5222 >>>>>>>>>>>> >>>>>>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 86 >>>>>>>>>>>> 2 1 234878 >>>>>>>>>>>> 3 2 21 >>>>>>>>>>>> 4 4 45 >>>>>>>>>>>> 5 6 1324 >>>>>>>>>>>> 6 7 5222 >>>>>>>>>>>> >>>>>>>>>>>> $pd.mogene.2.0.st >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 810 >>>>>>>>>>>> 2 1 263551 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 5331 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>> $pd.mogene.2.1.st >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 810 >>>>>>>>>>>> 2 1 263551 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 5331 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 254 >>>>>>>>>>>> 2 1 211195 >>>>>>>>>>>> 3 2 21 >>>>>>>>>>>> 4 4 45 >>>>>>>>>>>> 5 6 399 >>>>>>>>>>>> 6 7 1153 >>>>>>>>>>>> >>>>>>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 254 >>>>>>>>>>>> 2 1 211195 >>>>>>>>>>>> 3 2 21 >>>>>>>>>>>> 4 4 45 >>>>>>>>>>>> 5 6 399 >>>>>>>>>>>> 6 7 1153 >>>>>>>>>>>> >>>>>>>>>>>> $pd.ragene.2.0.st >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 1071 >>>>>>>>>>>> 2 1 214018 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 5083 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>> $pd.ragene.2.1.st >>>>>>>>>>>> type count(*) >>>>>>>>>>>> 1 NA 1071 >>>>>>>>>>>> 2 1 214018 >>>>>>>>>>>> 3 2 18 >>>>>>>>>>>> 4 4 23 >>>>>>>>>>>> 5 7 5083 >>>>>>>>>>>> 6 9 18 >>>>>>>>>>>> >>>>>>>>>>>>> sapply(b,length) >>>>>>>>>>>> >>>>>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>>>>>> pd.hugene.2.1.st >>>>>>>>>>>> 1195 1195 >>>>>>>>>>>> 0 0 >>>>>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>>>>>> pd.mogene.2.1.st >>>>>>>>>>>> 1324 1324 >>>>>>>>>>>> 0 0 >>>>>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>>>>>> pd.ragene.2.1.st >>>>>>>>>>>> 399 399 >>>>>>>>>>>> 0 0 >>>>>>>>>>>> >>>>>>>>>>>>> sessionInfo() >>>>>>>>>>>> >>>>>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>>>>> >>>>>>>>>>>> locale: >>>>>>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>>>>>> >>>>>>>>>>>> attached base packages: >>>>>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>>>>> methods >>>>>>>>>>>> [8] base >>>>>>>>>>>> >>>>>>>>>>>> other attached packages: >>>>>>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>>>>>> >>>>>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 >>>>>>>>>>>> Biostrings_2.28.0 >>>>>>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>>> Search the archives: >>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>> >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.8 years ago Mark Cowley ▴ 910

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

Hi Mark, I stand corrected. These probesets ARE being summarized at the transcript level. So the issue is how to mark a probeset as both 'main' and 'normgene->exon' at the same time in the pd package. Best, Jim On 7/10/2013 9:56 AM, James W. MacDonald wrote: > Hi Mark, > > The issue here is that the fsetid AND the meta_fsetid for probeset > 16934607 are the same. In other words, when you summarize at the > probeset level, there is a probeset 16934607 that is intended to > interrogate an exon of EIF3D. And when you summarize at the transcript > level, there is a probeset 16934607 that is intended to be a positive > control. > > So at one level, this probeset IS a main probeset, and that is > reflected in the pgf file. Prior to this series of arrays, the > positive controls were never also a part of an existing transcript, so > this issue never came up. As you will note in my earlier response to > Christian, I agree that this isn't ideal, and that it would be better > to not tag a positive control as 'main' in the pd package. > > However, I don't know how this would be fixed. Note that > > > dbGetQuery(con, "select * from core_mps where meta_fsetid='16934607';") > meta_fsetid transcript_cluster_id fsetid > 1 16934607 16934607 16934607 > > and > > > dbGetQuery(con, "select * from featureSet where > transcript_cluster_id='16934583';")[,c(1,5,10)] > fsetid transcript_cluster_id type > 1 16934584 16934583 1 > 2 16934585 16934583 1 > 3 16934586 16934583 1 > 4 16934587 16934583 1 > 5 16934588 16934583 1 > 6 16934589 16934583 1 > 7 16934590 16934583 1 > 8 16934591 16934583 1 > 9 16934592 16934583 1 > 10 16934593 16934583 1 > 11 16934594 16934583 1 > 12 16934595 16934583 1 > 13 16934596 16934583 1 > 14 16934597 16934583 1 > 15 16934598 16934583 1 > 16 16934600 16934583 1 > 17 16934601 16934583 1 > 18 16934602 16934583 1 > 19 16934603 16934583 1 > 20 16934604 16934583 1 > 21 16934607 16934583 1 > 22 16934608 16934583 1 > 23 16934609 16934583 1 > 24 16934610 16934583 1 > > and > > > dbGetQuery(con, "select * from featureSet where > transcript_cluster_id='16934607';") > [1] fsetid strand start > [4] stop transcript_cluster_id exon_id > [7] crosshyb_type level chrom > [10] type > > So this seems like what you would want to happen. There isn't a > probeset 16934607 when you summarize at the transcript level > > > On 7/10/2013 6:31 AM, Mark Cowley wrote: >> Hi, >> I see the point you're making Christian. >> I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file. >> >> Cheers and thanks for looking into this >> >> Mark >> >> Sent from my iPhone >> >> On 10/07/2013, at 7:53 AM, "cstrato"<cstrato at="" aon.at=""> wrote: >> >>> Dear Jim, >>> >>> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain. >>> >>> Best regards, >>> Christian >>> >>> >>> On 7/9/13 11:46 PM, James W. MacDonald wrote: >>>> Hi Christian, >>>> >>>> Thanks for pointing that out. There is still a bit of an inconsistency >>>> with the pd packages that should probably be corrected, as at the >>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D, >>>> whereas at the transcript level, this same probeset is intended to be a >>>> positive control (and as you note below, these probes are incorporated >>>> into a larger probeset intended to measure the EIF3D transcript). >>>> >>>> It would be nice to be able to filter out the controls like Mark >>>> attempted (and I do regularly as well). >>>> >>>> Mark - I talked with Benilton Carvalho about this, and he will take a >>>> look next week. >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> On 7/9/2013 3:38 PM, cstrato wrote: >>>>> Dear Jim, >>>>> >>>>> In xps I use as basic file for exon arrays the probeset annotation >>>>> file and then compare the data to the data from the pgf-file. Any >>>>> differences will be reported. >>>>> >>>>> I have just checked the different files for HuGene-2_0-st. If you >>>>> check as an example the following probeset_ids: >>>>> 16934607 >>>>> 16934608 >>>>> 16934609 >>>>> 16934610 >>>>> >>>>> Then you will see that the transcript annotation file lists these ids >>>>> as 'normgene->exon' and 'pos_control'. However, the probeset >>>>> annotation file lists these ids as 'main' belonging to gene EIF3D with >>>>> transcript_cluster_id 16934583. Looking for this id in the transcript >>>>> annotation file reveals that the number of 'total_probes' is 24. >>>>> Indeed, the probeset annotation file lists 24 probesets including the >>>>> four above mentioned probeset_ids. >>>>> >>>>> This means that although these four probesets are listed in the >>>>> transcript annotation file as 'normgene->exon' the label 'main' in the >>>>> pgf-file is correct since these probesets are part of the gene EIF3D. >>>>> >>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets >>>>> listed as 'normgene->exon'. However, in this case these probesets are >>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e. >>>>> these probesets do not belong to any transcript listed in the probeset >>>>> annotation file. >>>>> >>>>> Best regards, >>>>> Christian >>>>> >>>>> >>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote: >>>>>> Hi Christian, >>>>>> >>>>>> That's not the issue. Instead, the issue is that the pgf file lists the >>>>>> normgene->exon probeset IDs as being 'main'. I have received a response >>>>>> from Affy stating that the qcc file lists the normgene->exon probesets >>>>>> as pos_control, but that seems orthogonal to the issue at hand. >>>>>> >>>>>>> qcc<- read.table("HuGene-2_0-st.qcc", comment.char="#", >>>>>> stringsAsFactors=F, header=T) >>>>>>> pgf<- readPgf("HuGene-2_0-st.pgf") >>>>>>> head(qcc) >>>>>> probeset_id group_name probeset_name quantification_in_header >>>>>> 1 16650001 neg_control 16650001 0 >>>>>> 2 16650003 neg_control 16650003 0 >>>>>> >>>>>> ## get the positive controls (normgene->exon probesets) from the qcc >>>>>> file >>>>>>> pos_cont<- qcc[qcc[,2] == "pos_control",1] >>>>>> ## compare to pgf file >>>>>>> x<- pgf$probesetType[pgf$probesetId %in% pos_cont] >>>>>>> table(x) >>>>>> x >>>>>> main >>>>>> 1626 >>>>>> >>>>>> So in the pgf file, these probesets are being called 'main' instead of >>>>>> some sort of control. How do you handle this in xps? Do you use the pgf >>>>>> file? >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 7/9/2013 2:06 PM, cstrato wrote: >>>>>>> Dear Jim, >>>>>>> >>>>>>> I did not really follow the discussion so I may be wrong, but if you >>>>>>> mean that there is a difference between the number of 'main' types, >>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to >>>>>>> the number of 'main' in the probeset annotation file and not in the >>>>>>> transcript annotation file. >>>>>>> >>>>>>> But as I said, I may have misunderstood the problem. >>>>>>> >>>>>>> I am mainly replying because at the beginning of this year I had long >>>>>>> discussions with DevNet to make sure that the annotation files for the >>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct >>>>>>> everything what I have found. >>>>>>> >>>>>>> Best regards, >>>>>>> Christian >>>>>>> >>>>>>> >>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote: >>>>>>>> Hi Mark, >>>>>>>> >>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the >>>>>>>> transcript and probeset annotation files (and had them fixed), but >>>>>>>> didn't think I needed to check the others. Famous last words, no? >>>>>>>> >>>>>>>>> x<- readPgf("HuGene-2_0-st.pgf") >>>>>>>>> table(x$probesetType) >>>>>>>> control->affx control->affx->bac_spike >>>>>>>> 18 18 >>>>>>>> control->affx->ercc_spike control->affx->polya_spike >>>>>>>> 92 39 >>>>>>>> control->bgp->antigenomic main >>>>>>>> 23 349012 >>>>>>>> normgene->intron reporter >>>>>>>> 3575 82 >>>>>>>> >>>>>>>>> y<- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv", >>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE) >>>>>>>>> table(y$category) >>>>>>>> control->affx control->affx->bac_spike >>>>>>>> 18 18 >>>>>>>> control->affx->ercc-spike control->affx->polya_spike >>>>>>>> 92 39 >>>>>>>> control->bgp->antigenomic main >>>>>>>> 23 44629 >>>>>>>> normgene->exon normgene->intron >>>>>>>> 1626 3575 >>>>>>>> reporter rescue >>>>>>>> 82 3515 >>>>>>>> >>>>>>>> I'll ping Affymetrix and see what they have to say. >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jim >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote: >>>>>>>>> Dear Benilton, James& Bioconductors, >>>>>>>>> Thanks for providing the platform design packages for >>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays. >>>>>>>>> >>>>>>>>> I use SQL to query these packages& ultimately retain only 'main' >>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but >>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the >>>>>>>>> normgene->exon control probes are misclassified as 'main' probes. >>>>>>>>> >>>>>>>>> evidence: the HuGene- 2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv >>>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st >>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main' >>>>>>>>> category: >>>>>>>>> >>>>>>>>> # probe types: >>>>>>>>> librarypd.hugene.2.0.st) >>>>>>>>> conn<- dbpd.hugene.2.0.st) >>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>> type type_id >>>>>>>>> 1 1 main >>>>>>>>> 2 2 control->affx >>>>>>>>> 3 3 control->chip >>>>>>>>> 4 4 control->bgp->antigenomic >>>>>>>>> 5 5 control->bgp->genomic >>>>>>>>> 6 6 normgene->exon >>>>>>>>> 7 7 normgene->intron >>>>>>>>> 8 8 rescue->FLmRNA->unmapped >>>>>>>>> 9 9 control->affx->bac_spike >>>>>>>>> 10 10 oligo_spike_in >>>>>>>>> 11 11 r1_bac_spike_at >>>>>>>>> >>>>>>>>> # probe counts for each of the probe categories: >>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY >>>>>>>>> type") >>>>>>>>> type count(*) >>>>>>>>> 1 NA 3728 >>>>>>>>> 2 1 345497 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 3575 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> NB: no type 6 probes. >>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see >>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below) >>>>>>>>> >>>>>>>>> Can these mappings please be updated? >>>>>>>>> >>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I >>>>>>>>> haven't investigated these in any detail. >>>>>>>>> >>>>>>>>> cheers, >>>>>>>>> Mark >>>>>>>>> ----------------------------------------------------- >>>>>>>>> Mark Cowley, PhD >>>>>>>>> >>>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics >>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research, >>>>>>>>> Sydney, Australia >>>>>>>>> ----------------------------------------------------- >>>>>>>>> >>>>>>>>> All 12 packages below: >>>>>>>>> pd.packages<- c( >>>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st", >>>>>>>>> "pd.hugene.2.1.st", >>>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st", >>>>>>>>> "pd.mogene.2.1.st", >>>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st", >>>>>>>>> "pd.ragene.2.1.st" >>>>>>>>> ) >>>>>>>>> >>>>>>>>> a<- b<- list() >>>>>>>>> forpd.pkg.name in pd.packages) { >>>>>>>>> try({ >>>>>>>>> requirepd.pkg.name, character.only=TRUE) || stop("Can't load >>>>>>>>> the >>>>>>>>> pd.package") >>>>>>>>> conn<- db(getpd.pkg.name)) >>>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from >>>>>>>>> featureSet GROUP BY type") >>>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from >>>>>>>>> featureSet >>>>>>>>> WHERE type = 6")[,1] >>>>>>>>> }) >>>>>>>>> } >>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict") >>>>>>>>> >>>>>>>>>> a >>>>>>>>> $pd.hugene.1.0.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 227 >>>>>>>>> 2 1 253002 >>>>>>>>> 3 2 57 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1195 >>>>>>>>> 6 7 2904 >>>>>>>>> >>>>>>>>> $pd.hugene.1.1.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 227 >>>>>>>>> 2 1 253002 >>>>>>>>> 3 2 57 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1195 >>>>>>>>> 6 7 2904 >>>>>>>>> >>>>>>>>> $pd.hugene.2.0.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 3728 >>>>>>>>> 2 1 345497 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 3575 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.hugene.2.1.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 3728 >>>>>>>>> 2 1 345497 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 3575 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.mogene.1.0.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 86 >>>>>>>>> 2 1 234878 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1324 >>>>>>>>> 6 7 5222 >>>>>>>>> >>>>>>>>> $pd.mogene.1.1.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 86 >>>>>>>>> 2 1 234878 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 1324 >>>>>>>>> 6 7 5222 >>>>>>>>> >>>>>>>>> $pd.mogene.2.0.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 810 >>>>>>>>> 2 1 263551 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5331 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.mogene.2.1.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 810 >>>>>>>>> 2 1 263551 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5331 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.ragene.1.0.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 254 >>>>>>>>> 2 1 211195 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 399 >>>>>>>>> 6 7 1153 >>>>>>>>> >>>>>>>>> $pd.ragene.1.1.st.v1 >>>>>>>>> type count(*) >>>>>>>>> 1 NA 254 >>>>>>>>> 2 1 211195 >>>>>>>>> 3 2 21 >>>>>>>>> 4 4 45 >>>>>>>>> 5 6 399 >>>>>>>>> 6 7 1153 >>>>>>>>> >>>>>>>>> $pd.ragene.2.0.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 1071 >>>>>>>>> 2 1 214018 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5083 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>> $pd.ragene.2.1.st >>>>>>>>> type count(*) >>>>>>>>> 1 NA 1071 >>>>>>>>> 2 1 214018 >>>>>>>>> 3 2 18 >>>>>>>>> 4 4 23 >>>>>>>>> 5 7 5083 >>>>>>>>> 6 9 18 >>>>>>>>> >>>>>>>>>> sapply(b,length) >>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st >>>>>>>>> pd.hugene.2.1.st >>>>>>>>> 1195 1195 >>>>>>>>> 0 0 >>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st >>>>>>>>> pd.mogene.2.1.st >>>>>>>>> 1324 1324 >>>>>>>>> 0 0 >>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st >>>>>>>>> pd.ragene.2.1.st >>>>>>>>> 399 399 >>>>>>>>> 0 0 >>>>>>>>> >>>>>>>>>> sessionInfo() >>>>>>>>> R version 3.0.0 (2013-04-03) >>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>>>>>> >>>>>>>>> locale: >>>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C >>>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 >>>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 >>>>>>>>> [7] LC_PAPER=C LC_NAME=C >>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C >>>>>>>>> >>>>>>>>> attached base packages: >>>>>>>>> [1] parallel stats graphics grDevices utils datasets >>>>>>>>> methods >>>>>>>>> [8] base >>>>>>>>> >>>>>>>>> other attached packages: >>>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0 >>>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0 >>>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0 >>>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0 >>>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0 >>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0 >>>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0 >>>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0 >>>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7 >>>>>>>>> [19] BiocInstaller_1.10.2 >>>>>>>>> >>>>>>>>> loaded via a namespace (and not attached): >>>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0 >>>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11 >>>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1 >>>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0 >>>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> [[alternative HTML version deleted]] >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 10.8 years ago James W. MacDonald 65k

Login before adding your answer.