multiple locations for probeset in hgu133plus2CHRLOC vs. UCSC PSL data
@bazeley-peter-3140
Hello,

R version: 2.8.0

I just installed the hgu133plus2.db package and am looking at the hgu133plus2CHRLOC environment. I've noticed that some probeset entries (e.g. "201268_at") have multiple locations compared to Affy's annotation file. I'd like to figure out whether these multiple locations are current, in which case there is some sort of overlapping/repeating duplication. For example:

> as.list(hgu133plus2CHRLOC)$'201268_at'
      17       17       17       17
46598879 46597889 46598637 46599081

indicates that the probeset maps to 4 locations. Compare this to the alignment info in Affy's annotation file (from 7/8/08, http://www.affymetrix.com/Auth/analysis/downloads/na26/ivt/HG-U133_Plus_2.na26.annot.csv.zip):

chr12:119204403-119205041 (+) // 91.49 // q24.31 /// chr17:46598810-46604103 (+) // 96.87 // q21.33

which suggests one location on chromosome 17 (I'm ignoring chromosome 12 for now). This is a "_at" probeset, so it should map uniquely to a sequence, according to Affy's "Data Analysis Fundamentals" document (and speaking to a rep).

From the information provided by "?hgu133plus2CHRLOC", I downloaded
ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/affyU133Plus2.txt.gz
from UCSC to see how this occurred, but it is not clear. Actually, the file
http://www.affymetrix.com/Auth/analysis/downloads/psl/HG-U133_Plus_2.link.psl.zip
from Affy's support page has the same alignment info. Here's the relevant PSL info:

Target sequence name: chr17
Alignment start position in target: 46598810
Alignment end position in target: 46604103
Number of blocks in the alignment (a block contains no gaps): 5
Comma-separated list of sizes of each block: 47,130,102,113,257,
Comma-separated list of starting positions of each block in target: 46598810,46599186,46600601,46602296,46603846,

The second location provided by CHRLOC (46597889) occurs before the start of the alignment in the PSL info, so perhaps this one CHRLOC location corresponds to the PSL alignment?
The mappings were obtained from UCSC on 2006-Apr14, so perhaps additional alignments existed at the time, which have since been removed.

Thank you for any help. Hopefully I'm just missing something obvious (well, non-obvious for me).

Peter Bazeley

@sean-davis-490

Marc can answer with more authority, but I think the confusion arises because everything is mapped through the Entrez Gene ID and NOT the transcript. If you look in the UCSC genome browser, from which the alignments are created, you will see that Entrez ID 4831 has four RefSeqs associated with it. Hence, there are four alignments. With the actual probe sequences, one could potentially make an argument for one transcript over another, but relying on Affy's call of which transcript is the "representative" one is probably not a reliable way to choose.

Hope that helps,
Sean
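Sean's point can be illustrated with a toy lookup in base R. This is only a sketch under stated assumptions: the tables `probe2gene` and `gene2tx` and the helper `chrloc_for()` are hypothetical names made up for illustration and are not part of hgu133plus2.db, and the transcript labels are placeholders. Only the Entrez ID and the four start positions come from this thread.

```r
# Toy stand-ins for the annotation tables (hypothetical names, not the
# hgu133plus2.db API). Only the Entrez ID and the four positions are
# taken from this thread; the transcript labels are placeholders.
probe2gene <- c("201268_at" = "4831")

gene2tx <- data.frame(
  entrez     = rep("4831", 4),
  transcript = c("tx_a", "tx_b", "tx_c", "tx_d"),
  chrom      = rep("17", 4),
  start      = c(46598879, 46597889, 46598637, 46599081),
  stringsAsFactors = FALSE
)

# Probeset -> gene -> every transcript of that gene -> one start each.
chrloc_for <- function(probeset) {
  gene <- probe2gene[[probeset]]
  tx   <- gene2tx[gene2tx$entrez == gene, ]
  setNames(tx$start, tx$chrom)
}

locs <- chrloc_for("201268_at")
print(locs)

# One probeset and one gene, but one entry per transcript of that gene.
length(locs)
```

Because the lookup goes through the gene rather than through the probe alignment, the number of locations reflects how many transcripts the gene has, not how many places the probes align.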
Hi Peter,

I can add some extra information that may explain some of the data. If I may, I'd like to point you to a tool I have written and made available on my Institute's web site (the GABOS/GAFEP tool at http://bioinf.wehi.edu.au/gabos/), which will allow you to conveniently retrieve data from various gene/probe definition files.

On GABOS, select the hg18 genome, check the annotation file affyU133Plus2 and uncheck the other annotation files. Click the "Zero GAFEP Params" button, then enter 201268_at in the "List of Gene Names" box. Click the "Retrieve Sequence" button. You will get the five blocks of sequence, with their coordinates relative to chr17, as shown below:

=======================
> hg18 chr17 + affyU133Plus2 201268_at Exon '1/5 [ 1 47 ] 46598811 46598857 [ -0 47 +0 ] 46598811 46598857
TCTGCTCTCCCAGCGCAGCGCCGCCGCCCGGCCCCTCCAGCTTCCCG
> hg18 chr17 + affyU133Plus2 201268_at Exon '2/5 [ 1 130 ] 46599187 46599316 [ -0 130 +0 ] 46599187 46599316
GACCATGGCCAACCTGGAGCGCACCTTCATCGCCATCAAGCCGGACGGCGTGCAGCGCGGCCTGGTGGGC
GAGATCATCAAGCGCTTCGAGCAGAAGGGA
TTCCGCCTCGTGGCCATGAAGTTCCTCCGG
> hg18 chr17 + affyU133Plus2 201268_at Exon '3/5 [ 1 102 ] 46600602 46600703 [ -0 102 +0 ] 46600602 46600703
GCCTCTGAAGAACACCTGAAGCAGCACTACATTGACCTGAAAGACCGACCATTCTTCCCTGGGCTGGTGA
AGTACATGAACTCAGGGCCGGTTGTGGCCA
TG
> hg18 chr17 + affyU133Plus2 201268_at Exon '4/5 [ 1 113 ] 46602297 46602409 [ -0 113 +0 ] 46602297 46602409
GTCTGGGAGGGGCTGAACGTGGTGAAGACAGGCCGAGTGATGCTTGGGGAGACCAATCCAGCAGATTCAA
AGCCAGGCACCATTCGTGGGGACTTCTGCA
TTCAGGTTGGCAG
> hg18 chr17 + affyU133Plus2 201268_at Exon '5/5 [ 1 257 ] 46603847 46604103 [ -0 257 +0 ] 46603847 46604103
GAACATCATTCATGGCAGTGATTCAGTAAAAAGTGCTGAAAAAGAAATCAGCCTATGGTTTAAGCCTGAA
GAACTGGTTGACTACAAGTCTTGTGCTCAT
GACTGGGTCTATGAATAAGAGGTGGACACAACAGCAGTCTCCTTCAGCACGGCGTGGTGTGTCCCTGGAC
ACAGCTCTTCATTCCATTGACTTAGAGGCA
ACAGGATTGATCATTCTTTTATAGAGCATATTTGCCAATAAAGCTTTTGGAAGCCGG
=======================

My understanding is that the affy files on the UCSC
site define the gene that the affy probes were designed around.

You can also use the GABOS tool to retrieve genes defined around your area of interest. To do this, select hg18 and chr17, check refFlat (which is the set of RefSeq genes with their browser gene name included) or any of the other gene definition files, click the "Zero GAFEP Params" button, then enter a Sequence Range under the chromosome selection. In your situation, 46.5m-46.7m should cover your area of interest. I would also suggest you check the box "Do NOT display Sequence Data" before clicking the "Retrieve Sequence" button. About 60 lines (each corresponding to an exon) are listed. You can see that the NM_001018137-NME2 gene corresponds to your affy probe. (Note that the GABOS start coordinates are one bigger than your affy coordinates.) Below is the NM_001018137-NME2 gene data retrieved by GABOS.

=======================
> hg18 chr17 + refFlat NM_001018137-NME2 Exon '1/5 [ 1 152 ] 46597890 46598041 [ -0 152 +0 ] 46597890 46598041
> hg18 chr17 + refFlat NM_001018137-NME2 Exon '2/5 [ 1 130 ] 46599187 46599316 [ -0 130 +0 ] 46599187 46599316
> hg18 chr17 + refFlat NM_001018137-NME2 Exon '3/5 [ 1 102 ] 46600602 46600703 [ -0 102 +0 ] 46600602 46600703
> hg18 chr17 + refFlat NM_001018137-NME2 Exon '4/5 [ 1 113 ] 46602297 46602409 [ -0 113 +0 ] 46602297 46602409
> hg18 chr17 + refFlat NM_001018137-NME2 Exon '5/5 [ 1 258 ] 46603847 46604104 [ -0 258 +0 ] 46603847 46604104
=======================

I hope this helps explain the data a little. I'll leave it to others to explain how the hgu133plus2.db package works.

Keith

========================
Keith Satterley
Bioinformatics Division
The Walter and Eliza Hall Institute of Medical Research
Parkville, Melbourne, Victoria, Australia
=======================

@marc-carlson-2264

Hi Peter,

I think that your confusion is coming from the fact that these are the chromosome start locations for the genes and not the probes. According to Affy, that probe is supposed to be measuring that gene, and we took their word for that. We then gave you the start positions for the transcripts of that gene according to UCSC. We don't currently provide data on where the probe aligns to the genome, or on which transcripts in the genome the probe might stick to.

Marc
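Keith's remark that the GABOS start coordinates are one bigger than the PSL ones is what you would expect if the PSL starts are 0-based and GABOS reports 1-based inclusive positions. The sketch below, in base R, redoes that arithmetic with the numbers quoted in this thread. The variable names are made up for illustration. The final check only shows that one of the four CHRLOC values equals the 0-based start of the first NM_001018137 exon listed by GABOS, which is consistent with Marc's explanation but is not confirmed here as the way the package was built.

```r
# PSL block starts are 0-based; block sizes are in bases.
# Values copied from the PSL record quoted in the question.
block_start0 <- c(46598810, 46599186, 46600601, 46602296, 46603846)
block_size   <- c(47, 130, 102, 113, 257)

# 0-based half-open [start, start + size) becomes
# 1-based inclusive [start + 1, start + size].
blocks <- data.frame(
  start1 = block_start0 + 1,
  end1   = block_start0 + block_size
)
print(blocks)

# First-exon start of NM_001018137 as listed by GABOS (1-based),
# shifted to a 0-based start.
refflat_tx_start0 <- 46597890 - 1

# The four CHRLOC values from the question.
chrloc <- c(46598879, 46597889, 46598637, 46599081)

# Is that transcript start among the CHRLOC values?
tx_start_in_chrloc <- refflat_tx_start0 %in% chrloc
print(tx_start_in_chrloc)
```

The converted blocks line up with the five affyU133Plus2 exons listed by GABOS, and the second CHRLOC value (46597889) matches the shifted transcript start.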
@herve-pages-1542
Seattle, WA, United States
Hi Peter,

You can find the genome coordinates of a set of probes by using the matchPDict()/countPDict() tool from Biostrings + the genome for hg18:

  library(hgu133plus2probe)
  library(BSgenome.Hsapiens.UCSC.hg18)

  # Sequences of all the probes in probeset 201268_at
  probes <- DNAStringSet(
    hgu133plus2probe$sequence[hgu133plus2probe$Probe.Set.Name == "201268_at"]
  )

  # Count the exact matches of each probe in each chromosome
  pdict1 <- PDict(probes)
  nhits1 <- t(sapply(seqnames(Hsapiens),
                     function(name) countPDict(pdict1, Hsapiens[[name]])))

The 'nhits1' matrix summarizes the number of hits per chromosome (plus strand only). See which chromosomes have hits with:

  > which(rowSums(nhits1) != 0)
  chr12 chr17
     12    17

For completeness, we should also search the minus strands of all chromosomes. This can easily be done by taking the reverse complement sequences of the probes with

  pdict2 <- PDict(reverseComplement(probes))

and then using the same code as above.

Then use matchPDict() to get the coordinates of the hits in chr17:

  chr17_hits <- unlist(matchPDict(pdict1, Hsapiens$chr17))

If you want to know the genes where those hits occur, you can use biomaRt as explained by Steffen on this list a few days ago:

  https://stat.ethz.ch/pipermail/bioconductor/2008-November/025208.html

Once you've retrieved all the gene coordinates (the 'geneLocs' data frame below), you can do the following:

  # Extract the genes that belong to chr17 (plus strand only):
  chr17_genes <- geneLocs[geneLocs$chromosome_name == "17" &
                          geneLocs$strand == 1, ]

  # Use overlap() from the IRanges package to find the genes
  # where the hits occur:
  > tree <- IntervalTree(IRanges(start=chr17_genes$start_position,
                                 end=chr17_genes$end_position))
  > overlap(tree, chr17_hits, multiple = FALSE)
   [1] 33 33 33 33 33 33 33 33 33 33

So the gene hit by the probes of 201268_at is:

  > chr17_genes[33, ]
      ensembl_gene_id strand chromosome_name start_position end_position
  954 ENSG00000011052      1              17       46585919     46604104

which is what is reported by the hgu133plus2ENSEMBL map:

  > hgu133plus2ENSEMBL[['201268_at']]
  [1] "ENSG00000011052"

Note that the same approach shows that the hits in chr12 are also within a gene's boundaries (gene ENSG00000123009), but I'll let more qualified people comment about this.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
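The overlap step in Hervé's code can be sketched in base R without any Bioconductor packages, which may help when reading the thread with a newer installation: as far as I know, overlap() and IntervalTree belong to the IRanges of that period, and later releases offer findOverlaps() for the same job. The example below is a minimal sketch under stated assumptions. The helper `first_overlap()` is a made-up name, the hit positions and the second gene are placeholders chosen for illustration, and only the ENSG00000011052 coordinates come from the thread.

```r
# Gene table: the first row is taken from the thread, the second row is a
# placeholder gene added only to have something that should not match.
genes <- data.frame(
  id    = c("ENSG00000011052", "placeholder_gene"),
  start = c(46585919, 46700000),
  end   = c(46604104, 46710000),
  stringsAsFactors = FALSE
)

# Illustrative 25-mer hit starts (not the real probe hits): three inside
# the gene from the thread, one in a region with no gene in this table.
hit_start <- c(46599200, 46600610, 46603900, 46650000)
hit_end   <- hit_start + 25 - 1

# Index of the first gene whose range overlaps the hit, or NA if none
# (mirrors the 'multiple = FALSE' behaviour used in the thread).
first_overlap <- function(h_start, h_end, genes) {
  idx <- which(genes$start <= h_end & genes$end >= h_start)
  if (length(idx) == 0) NA_integer_ else idx[1]
}

gene_idx <- mapply(first_overlap, hit_start, hit_end,
                   MoreArgs = list(genes = genes))
hit_gene <- genes$id[gene_idx]
print(hit_gene)
```

A linear scan like this is fine for a handful of hits. For genome-scale work the interval-based tools in IRanges are the better fit.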
On Thu, Nov 20, 2008 at 3:03 PM, Hervé Pagès wrote:
> Note that the same approach shows that the hits in chr12 are also within
> a gene boundaries (gene ENSG00000123009) but I'll let more qualified people
> comment about this.

I'll just comment here that when mapping expression probes, it is probably more meaningful to map the probes to a transcript database rather than to the genome. Transcripts are discontinuous in the genome, and the genome contains repeats that are not likely to be present in transcript space. That said, transcript databases are limited, so there may be tradeoffs.

Sean
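Sean's remark about transcripts being discontinuous in the genome can be made concrete with a toy example in base R. Everything here is an assumption for illustration: the exon and intron sequences are made up, and plain exact string matching stands in for a real aligner. The sketch shows a probe that straddles an exon-exon junction, which matches the spliced transcript but has no exact match in the genomic sequence.

```r
# Toy sequences (made up for illustration): two exons separated by an intron.
exon1  <- "ATGGCCAACCTGGAGCGCACC"
intron <- "GTAAGTCCCTTTCCCTTTCAG"
exon2  <- "TTCATCGCCATCAAGCCGGAC"

genome     <- paste0(exon1, intron, exon2)  # exons are discontinuous here
transcript <- paste0(exon1, exon2)          # spliced: exons are adjacent

# A probe that straddles the exon-exon junction:
# last 6 bases of exon1 followed by the first 6 bases of exon2.
probe <- paste0(substr(exon1, nchar(exon1) - 5, nchar(exon1)),
                substr(exon2, 1, 6))

# Exact matching against each target.
in_transcript <- grepl(probe, transcript, fixed = TRUE)
in_genome     <- grepl(probe, genome, fixed = TRUE)

print(c(transcript = in_transcript, genome = in_genome))
```

A genome-only search would miss such a probe entirely, while a transcript-only search depends on the transcript being present in the database, which is the tradeoff Sean mentions.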