Hi Peter,
You can find the genome coordinates of a set of probes by using the
matchPDict()/countPDict() tool from Biostrings + the genome for hg18:
library(hgu133plus2probe)
library(BSgenome.Hsapiens.UCSC.hg18)
probes <- DNAStringSet(
hgu133plus2probe$sequence[hgu133plus2probe$Probe.Set.Name
== "201268_at"]
)
pdict1 <- PDict(probes)
nhits1 <- t(sapply(seqnames(Hsapiens),
function(name) countPDict(pdict1,
Hsapiens[[name]])))
The 'nhits1' matrix summarizes the number of hits per chromosome (plus
strand only). See which chromosomes have hits with:
> which(rowSums(nhits1) != 0)
chr12 chr17
12 17
For completeness, we should also search the minus strands of all
chromosomes.
This can easily be done by taking the reverse complement sequences of
the
probes with pdict2 <- PDict(reverseComplement(probes)) and then using
the
same code as above.
Then use matchPDict() to get the coordinates of the hits in chr17:
chr17_hits <- unlist(matchPDict(pdict1, Hsapiens$chr17))
If you want to know the genes where those hits occur, you can use
biomaRt as explained by Steffen on this list a few days ago:
https://stat.ethz.ch/pipermail/bioconductor/2008-November/025208.html
Once you've retrieved all the genes coordinates, you can do the
following:
# Extract the genes that belong to chr17 (plus strand only):
chr17_genes <- geneLocs[geneLocs$chromosome_name == "17" &
geneLocs$strand == 1, ]
# Use overlap() from the IRanges package to find the genes
# where the hits occur:
> tree <- IntervalTree(IRanges(start=chr17_genes$start_position,
end=chr17_genes$end_position))
> overlap(tree, chr17_hits, multiple = FALSE)
[1] 33 33 33 33 33 33 33 33 33 33
So the gene hit by probe 201268_at is:
> chr17_genes[33, ]
ensembl_gene_id strand chromosome_name start_position
end_position
954 ENSG00000011052 1 17 46585919
46604104
which is what is reported by the hgu133plus2ENSEMBL map:
> hgu133plus2ENSEMBL[['201268_at']]
[1] "ENSG00000011052"
Note that the same approach shows that the hits in chr12 are also
within
a gene boundaries (gene ENSG00000123009) but I'll let more qualified
people
comment about this.
Cheers,
H.
Bazeley, Peter wrote:
> Hello,
>
> R version: 2.8.0
>
> I just installed the hgu133plus2.db package, and am looking at the
hgu133plus2CHRLOC environment. I've noticed that some of the probeset
entries (e.g. "201268_at") have multiple locations compared to Affy's
annotation file. I'd like to figure out if these multiple locations
are current, in which case it is some sort of overlapping/repeating
duplication. For example:
>
>> as.list(hgu133plus2CHRLOC)$'201268_at'
> 17 17 17 17
> 46598879 46597889 46598637 46599081
>
> indicates that the probeset maps to 4 locations. Compare this to the
alignments info in the Affy's annotation file (from 7/8/08,
http://www.affymetrix.com/Auth/analysis/downloads/na26/ivt/HG-
U133_Plus_2.na26.annot.csv.zip):
>
> chr12:119204403-119205041 (+) // 91.49 // q24.31 ///
chr17:46598810-46604103 (+) // 96.87 // q21.33
>
> which suggests one location on chromosome 17 (I'm ignoring
chromosome 12 for now). This is a "_at" probeset, so it should map
uniquely to a sequence, according to Affy's "Data Analysis
Fundamentals" document (and speaking to a rep).
>
> >From the information provided by "?hgu133plus2CHRLOC", I downloaded
>
ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens
/database/affyU133Plus2.txt.gz
> from UCSC to see how this occured, but it is not clear. Actually,
the file:
>
http://www.affymetrix.com/Auth/analysis/downloads/psl/HG-
U133_Plus_2.link.psl.zip
> from Affy's support page has the same alignment info. Here's the
relevant PSL info:
>
> Target sequence name: chr17
> Alignment start position in target: 46598810
> Alignment end position in target: 46604103
> Number of blocks in the alignment (a block contains no gaps): 5
> Comma-separated list of sizes of each block: 47,130,102,113,257,
> Comma-separated list of starting positions of each block in target:
46598810,46599186,46600601,46602296,46603846,
>
>
> The second location provided by CHRLOC (46597889) occurs before the
start of the alignment in the PSL info, so perhaps this one CHRLOC
location corresponds to the PSL alignment? The mappings were obtained
from UCSC on 2006-Apr14, so perhaps additional alignments existed at
the time, which have since been removed.
>
>
> Thank you for any help. Hopefully I'm just missing something obvious
(well, non-obvious for me).
>
> Peter Bazeley
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319