Question: Mapping microarray probes to the genome using findOverlaps
0
gravatar for Hervé Pagès
8.2 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:
Hi Ravi, Sean's suggestion to map the probes directly to the transcriptome can easily be achieved with the Biostrings/BSgenome/GenomicFeatures infrastructure. The code below uses vwhichPDict (a variant of matchPDict) and the danRer6 genome from UCSC (release name: Sanger Institute Zv8). Note that UCSC just released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome package for it yet. Also I don't think we support the Nimblegen Zebrafish 12 x135K Expression chip so I'm using the zebrafish chip from Affymetrix instead. Depending on your machine and the quality of your internet connection, this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. ## Load the probes: library(zebrafishprobe) library(Biostrings) probes <- DNAStringSet(zebrafishprobe) ## Compute the transcriptome (32992 transcripts): library(GenomicFeatures) txdb <- makeTranscriptDbFromUCSC(genome="danRer6", tablename="ensGene") library(BSgenome.Drerio.UCSC.danRer6) txseqs <- extractTranscriptsFromGenome(Drerio, txdb) ## Note that saving 'txseqs' with write.XStringSet() generates ## a FASTA file that is 51M. ## Map 'probes' to 'txseqs' (allowing 2 mismatches): pdict2 <- PDict(probes, max.mismatch=2) m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) makeMap <- function(m, probeids, txnames) { probeid <- probeids[unlist(m)] txname <- rep.int(txnames, elementLengths(m)) ans <- unique(data.frame(probeid, txname)) rownames(ans) <- NULL ans } map <- makeMap(m2, zebrafishprobe$Probe.Set.Name, names(txseqs)) Note that the longest step is not the call to vwhichPDict() but the extraction of the transcriptome. However, the final result doesn't look that good. A quick look at the mapping shows that only 62.4% of the probe ids are mapped and that some probe ids are mapped to more than 1 transcript: > head(map) probeid txname 1 Dr.19494.1.S1_at ENSDART00000048610 2 Dr.19494.1.S1_at ENSDART00000103479 3 Dr.19494.1.S1_at ENSDART00000103478 4 Dr.10343.1.S1_at ENSDART00000006449 5 Dr.12057.1.A1_at ENSDART00000054814 6 Dr.22271.1.A1_at ENSDART00000054814 > length(unique(zebrafishprobe$Probe.Set.Name)) [1] 15617 > length(unique(map$probeid)) [1] 9746 > any(duplicated(map$probeid)) [1] TRUE But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K Expression chip. One more thing. While testing the above code, I discovered 2 problems that were preventing it to work: one with the extractTranscriptsFromGenome() function and one with the vwhichPDict() function. Those problems are now fixed in the release and devel versions of GenomicFeatures (v 1.2.4 & 1.3.13) and Biostrings (v 2.18.3 & 2.19.12). These new versions will become available via biocLite() in the next 24-36 hours. Hope this helps. H. ----- Original Message ----- From: "Ravi Karra" <ravi.karra@gmail.com> To: "Sean Davis" <sdavis2 at="" mail.nih.gov=""> Cc: bioconductor at r-project.org Sent: Saturday, February 26, 2011 2:56:02 PM Subject: Re: [BioC] Mapping microarray probes to the genome using findOverlaps Hi Sean, I could do that, but am not sure how to. The annotated zebrafish genome is the Tubingen strain and the some of the probes on the array are from the EK and AB strains. This means that I need to allow for SNP's in the alignment. I originally tried to align the probes to the Ensembl Transcripts using matchPDict but when I allowed for 2 mismatches (max.mismatch = 2) across the probe sequences, my computer never stopped running the program! I found TopHat to be much faster (8 min) and TopHat allows for a few nucleotide wobble by default!. Do you have a suggestion(s) on another way to align the array to the Ensembl Transcripts? Thanks, Ravi On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: > > > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra <ravi.karra at="" gmail.com=""> wrote: > Hello, > > I am still trying to map probes on the Nimblegen Zebrafish 12 x135K Expression to the Zv9 version of the zebrafish genome available from Ensembl! I am very reluctantly pursuing an alignment approach to annotation as the original annotation provided with the array is quite outdated. > > I performed a gapped alignment using the individual probe sequences (60-mers) from the array using TopHat and loaded the results into Bioconductor as a GappedAlignments object. I have made a TranscriptDb object using the Zv9 genome from Ensembl. Next, I plan to use findOverlaps for the annotation. What is the best way to get the overlaps (by exon, cds, or by transcript)? I am a little concerned that using transcriptsByOverlaps might be a bit too broad and result in mapping reads to multiple genes (for example transcripts in the genome that have overlapping genomic coordinates). By contrast, mapping with exonsByOverlaps and cdsByOverlaps might be too restrictive and miss information in the UTR's. My gut feeling is that annotating by cds is the "in-between" approach. What is the recommended approach for RNA-seq? As you can tell, I am quite new to this! > > > Hi, Ravi. Sorry to answer a question with more questions, but why not just map the probes against Ensembl Transcripts or refseq? What is the advantage of mapping to the genome and then going back to the transcripts? > > Sean [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENTlink modified 8.2 years ago by Sean Davis21k • written 8.2 years ago by Hervé Pagès ♦♦ 13k
Answer: Mapping microarray probes to the genome using findOverlaps
0
gravatar for Sean Davis
8.2 years ago by
Sean Davis21k
United States
Sean Davis21k wrote:
On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages@fhcrc.org> wrote: > Hi Ravi, > > Sean's suggestion to map the probes directly to the transcriptome can > easily be achieved with the Biostrings/BSgenome/GenomicFeatures > infrastructure. > > The code below uses vwhichPDict (a variant of matchPDict) and the danRer6 > genome from UCSC (release name: Sanger Institute Zv8). Note that UCSC just > released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome > package > for it yet. > > Also I don't think we support the Nimblegen Zebrafish 12 x135K Expression > chip so I'm using the zebrafish chip from Affymetrix instead. > > Depending on your machine and the quality of your internet connection, > this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. > > ## Load the probes: > library(zebrafishprobe) > library(Biostrings) > probes <- DNAStringSet(zebrafishprobe) > > ## Compute the transcriptome (32992 transcripts): > library(GenomicFeatures) > txdb <- makeTranscriptDbFromUCSC(genome="danRer6", tablename="ensGene") > library(BSgenome.Drerio.UCSC.danRer6) > txseqs <- extractTranscriptsFromGenome(Drerio, txdb) > ## Note that saving 'txseqs' with write.XStringSet() generates > ## a FASTA file that is 51M. > > ## Map 'probes' to 'txseqs' (allowing 2 mismatches): > pdict2 <- PDict(probes, max.mismatch=2) > m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) > makeMap <- function(m, probeids, txnames) > { > probeid <- probeids[unlist(m)] > txname <- rep.int(txnames, elementLengths(m)) > ans <- unique(data.frame(probeid, txname)) > rownames(ans) <- NULL > ans > } > map <- makeMap(m2, zebrafishprobe$Probe.Set.Name, names(txseqs)) > > Note that the longest step is not the call to vwhichPDict() but the > extraction of the transcriptome. > > However, the final result doesn't look that good. A quick look at the > mapping shows that only 62.4% of the probe ids are mapped and that some > probe ids are mapped to more than 1 transcript: > > Nice example, Herve! Mapping to multiple transcripts is to be expected, so one needs to come up with ways of dealing with the situation. Missing probes is also typical. These can be dealt with by mapping to the unmatched probes to other transcript databases. For example, one could map to Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish ESTs next. > > head(map) > probeid txname > 1 Dr.19494.1.S1_at ENSDART00000048610 > 2 Dr.19494.1.S1_at ENSDART00000103479 > 3 Dr.19494.1.S1_at ENSDART00000103478 > 4 Dr.10343.1.S1_at ENSDART00000006449 > 5 Dr.12057.1.A1_at ENSDART00000054814 > 6 Dr.22271.1.A1_at ENSDART00000054814 > > length(unique(zebrafishprobe$Probe.Set.Name)) > [1] 15617 > > length(unique(map$probeid)) > [1] 9746 > > any(duplicated(map$probeid)) > [1] TRUE > > But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K > Expression chip. > > Keep in mind that with 60mer probes, there is potentially a need to allow gaps in the alignment and that consecutive matches of even 40 or 50 base pairs are likely to generate a signal on a microarray. Sean > One more thing. While testing the above code, I discovered 2 problems > that were preventing it to work: one with the > extractTranscriptsFromGenome() > function and one with the vwhichPDict() function. Those problems are now > fixed in the release and devel versions of GenomicFeatures (v 1.2.4 & > 1.3.13) > and Biostrings (v 2.18.3 & 2.19.12). These new versions will become > available > via biocLite() in the next 24-36 hours. > > Hope this helps. > H. > > > ----- Original Message ----- > From: "Ravi Karra" <ravi.karra@gmail.com> > To: "Sean Davis" <sdavis2@mail.nih.gov> > Cc: bioconductor@r-project.org > Sent: Saturday, February 26, 2011 2:56:02 PM > Subject: Re: [BioC] Mapping microarray probes to the genome using > findOverlaps > > Hi Sean, > I could do that, but am not sure how to. The annotated zebrafish genome is > the Tubingen strain and the some of the probes on the array are from the EK > and AB strains. This means that I need to allow for SNP's in the > alignment. I originally tried to align the probes to the Ensembl > Transcripts using matchPDict but when I allowed for 2 mismatches > (max.mismatch = 2) across the probe sequences, my computer never stopped > running the program! I found TopHat to be much faster (8 min) and TopHat > allows for a few nucleotide wobble by default!. > Do you have a suggestion(s) on another way to align the array to the > Ensembl Transcripts? > Thanks, > Ravi > > > > > On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: > > > > > > > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra <ravi.karra@gmail.com> > wrote: > > Hello, > > > > I am still trying to map probes on the Nimblegen Zebrafish 12 x135K > Expression to the Zv9 version of the zebrafish genome available from > Ensembl! I am very reluctantly pursuing an alignment approach to annotation > as the original annotation provided with the array is quite outdated. > > > > I performed a gapped alignment using the individual probe sequences > (60-mers) from the array using TopHat and loaded the results into > Bioconductor as a GappedAlignments object. I have made a TranscriptDb > object using the Zv9 genome from Ensembl. Next, I plan to use findOverlaps > for the annotation. What is the best way to get the overlaps (by exon, cds, > or by transcript)? I am a little concerned that using transcriptsByOverlaps > might be a bit too broad and result in mapping reads to multiple genes (for > example transcripts in the genome that have overlapping genomic > coordinates). By contrast, mapping with exonsByOverlaps and cdsByOverlaps > might be too restrictive and miss information in the UTR's. My gut feeling > is that annotating by cds is the "in-between" approach. What is the > recommended approach for RNA-seq? As you can tell, I am quite new to this! > > > > > > Hi, Ravi. Sorry to answer a question with more questions, but why not > just map the probes against Ensembl Transcripts or refseq? What is the > advantage of mapping to the genome and then going back to the transcripts? > > > > Sean > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD COMMENTlink written 8.2 years ago by Sean Davis21k
Hi Sean, Ravi, On 03/01/2011 04:08 AM, Sean Davis wrote: > > > On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Ravi, > > Sean's suggestion to map the probes directly to the transcriptome can > easily be achieved with the Biostrings/BSgenome/GenomicFeatures > infrastructure. > > The code below uses vwhichPDict (a variant of matchPDict) and the > danRer6 > genome from UCSC (release name: Sanger Institute Zv8). Note that > UCSC just > released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome > package > for it yet. > > Also I don't think we support the Nimblegen Zebrafish 12 x135K > Expression > chip so I'm using the zebrafish chip from Affymetrix instead. > > Depending on your machine and the quality of your internet connection, > this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. > > ## Load the probes: > library(zebrafishprobe) > library(Biostrings) > probes <- DNAStringSet(zebrafishprobe) > > ## Compute the transcriptome (32992 transcripts): > library(GenomicFeatures) > txdb <- makeTranscriptDbFromUCSC(genome="danRer6", > tablename="ensGene") > library(BSgenome.Drerio.UCSC.danRer6) > txseqs <- extractTranscriptsFromGenome(Drerio, txdb) > ## Note that saving 'txseqs' with write.XStringSet() generates > ## a FASTA file that is 51M. > > ## Map 'probes' to 'txseqs' (allowing 2 mismatches): > pdict2 <- PDict(probes, max.mismatch=2) > m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) > makeMap <- function(m, probeids, txnames) > { > probeid <- probeids[unlist(m)] > txname <- rep.int <http: rep.int="">(txnames, elementLengths(m)) > ans <- unique(data.frame(probeid, txname)) > rownames(ans) <- NULL > ans > } > map <- makeMap(m2, zebrafishprobe$Probe.Set.Name > <http: probe.set.name="">, names(txseqs)) > > Note that the longest step is not the call to vwhichPDict() but the > extraction of the transcriptome. > > However, the final result doesn't look that good. A quick look at the > mapping shows that only 62.4% of the probe ids are mapped and that some > probe ids are mapped to more than 1 transcript: > > > Nice example, Herve! > > Mapping to multiple transcripts is to be expected, so one needs to come > up with ways of dealing with the situation. Missing probes is also > typical. These can be dealt with by mapping to the unmatched probes to > other transcript databases. For example, one could map to Ensembl > first, RefSeq second, all zebrafish mRNAs next, all zebrafish ESTs next. > > > head(map) > probeid txname > 1 Dr.19494.1.S1_at ENSDART00000048610 > 2 Dr.19494.1.S1_at ENSDART00000103479 > 3 Dr.19494.1.S1_at ENSDART00000103478 > 4 Dr.10343.1.S1_at ENSDART00000006449 > 5 Dr.12057.1.A1_at ENSDART00000054814 > 6 Dr.22271.1.A1_at ENSDART00000054814 > > length(unique(zebrafishprobe$Probe.Set.Name <http: probe.set.name="">)) > [1] 15617 > > length(unique(map$probeid)) > [1] 9746 > > any(duplicated(map$probeid)) > [1] TRUE > > But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K > Expression chip. > > > Keep in mind that with 60mer probes, there is potentially a need to > allow gaps in the alignment Good point. I should mention here that, even though at the moment matchPDict() and family cannot handle indels when the dictionary is preprocessed, now they do support them when the dictionary is not preprocessed (I just added this feature). So one can still use this to try to align the probes that didn't align with 2 mismatches: mapped_probes_idx <- sort(unique(unlist(m2))) unmapped_probes <- probes[-mapped_probes_idx] # 115789 probes m2_indels <- vwhichPDict(unmapped_probes, txseqs, max.mismatch=2, with.indels=TRUE) Unfortunately, since no preprocessing is used, this is *very* slow (can take up to 3 or 4 days to align the 115789 unmapped probes). Anyway, for the zebrafish probes used in this example, allowing 2 indels doesn't seem to help (I tried to align the first 100 unmapped probes only and none of them would align), so, as you said Sean, one might want to try using other transcript databases. > and that consecutive matches of even 40 or > 50 base pairs are likely to generate a signal on a microarray. Unfortunately, I can't think of any easy/efficient way to achieve this kind of partial mapping with the tools I'm trying to promote here. The kind of indels supported by the matchPattern and matchPDict families of functions are simple indels (1 indel = 1 insertion or deletion of 1 nucleotide). For more complex types of alignments, the pairwiseAlignment() function could be used, but, since it is basically an implementation of the Smith-Waterman algorithm, it would be much much slower than any of the external tools Sean mentioned (blat, gmap, etc...) Cheers, H. PS: Support for a small number of indels to countPDict(), whichPDict(), vcountPDict() and vwhichPDict(), when the input dictionary is not preprocessed, was added to Biostrings 2.18.4 (release) and 2.19.13 (devel). > > Sean > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 8.2 years ago by Hervé Pagès ♦♦ 13k
Hi Sean and Herve, Thanks a ton for the suggestions! I modified Herve's approach to use biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many of the VEGA transcripts, many of which are supposed to be on this array. I ended up mapping to the cdna's from zv9 and like Herve only match about 60% of probes to the ensembl transcripts. I too found many probes that map to multiple transcripts. Like the brainarray group, I chose to use the probes as a part of transcript-specific probesets... i.e. a given probe can contribute to the expression of multiple transcripts. Before trying to extend mapping to other databases as suggested by Sean, I wanted to try and create a custom ndf file and annotation package to give the analysis a go. However, I keep getting the error below using pdinfrobuilder and my custom ndf file. Did I construct my new ndf file incorrectly? I included my code and session info below. Sorry for the inefficient code... I am learning on the fly! Thanks everyone! This list has been immensely helpful! Ravi ====================================================================== = Building annotation package for Nimblegen Expression Array NDF: newndf.ndf XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys ====================================================================== = Parsing file: newndf.ndf... OK Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK Merging NDF and XYS files... OK Preparing contents for featureSet table... OK Preparing contents for bgfeature table... OK Preparing contents for pmfeature table... OK Creating package in ./pd.newndf Inserting 27649 rows into table featureSet... OK Inserting 200368 rows into table pmfeature... Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) My code: #1. Load the probes from the ndf file: library (Biostrings) array_file = "/Users/ravikarra/Documents/Research Projects/Regeneration Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" probe_info = read.table (array_file, header = TRUE, sep = '\t') probes = DNAStringSet (as.character(probe_info$PROBE_SEQUENCE)) probeid = probe_info$PROBE_DESIGN_ID probeid = as.character (probeid) names(probes) = probeid real_probes = which (width(probes) == 60) #don't align the control, empty, or the crosshyb probes probes = probes [real_probes] #2. Compute the transcriptome (51569 transcripts with sequences as of March 7, .2011): library (biomaRt) ensembl = useMart ("ensembl") ensemblzv9 = useDataset("drerio_gene_ensembl",mart=ensembl) txIDs = getBM (attributes = c("ensembl_transcript_id"), mart = ensemblzv9)[,1] txSeqs = getSequence (id = txIDs, type = "ensembl_transcript_id", seqType = "cdna", mart = ensemblzv9) #make the DNAStringSet object zv9txDict = DNAStringSet (txSeqs$cdna) names(zv9txDict) = txSeqs$ensembl_transcript_id #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): pdict2 <- PDict(probes, max.mismatch=2) m2 = vwhichPDict(pdict2, zv9txDict, max.mismatch=2) makeMap <- function(m, probeids, txnames) { probeid <- probeids[unlist(m)] txname <- rep.int(txnames, elementLengths(m)) ans <- unique(data.frame(probeid, txname)) rownames(ans) <- NULL ans } map = makeMap(m2, names (probes), names(zv9txDict)) #4. Remove transcripts with less than 3 probes txcounts = data.frame (table (map$txname)) dim (txcounts) #start with 32961 different transcripts usefultxs = txcounts [txcounts$Freq > 2,][,1] #end with 27647 transcripts map = map [map$txname %in% usefultxs, ] #5: Make a new ndf file with the updated transcript information #probe_info is the ndf file I originally loaded rownames (probe_info) = probe_info[,1] ndf = probe_info [as.character (map$probeid),] ndf$SEQ_ID = map$txname #append the information for the control probes to the new ndf control = probe_info [probe_info$PROBE_CLASS != "experimental",] ndf = rbind (ndf, control) # change all the probes with unmapped transcripts to random but keep controls intact noalign = !(as.character (probe_info$PROBE_DESIGN_ID) %in% as.character (ndf$PROBE_DESIGN_ID)) noalignprobes = probe_info [noalign,] noalignprobes $SEQ_ID = "noalign" ndf = rbind (ndf, noalignprobes) write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) #6. Make the new annotation package #make custom chip descriptor file for Nimblegen library(pdInfoBuilder) filename_ndf <- list.files('.',pattern='.ndf',full.names=T) filename_ndf = new("ScalarCharacter", filename_ndf) filename_xys <- list.files('.', pattern=".xys",full.names = T) [1] filename_xys = new("ScalarCharacter", filename_xys) seed <- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, xysFile=filename_xys, author="Ravi Karra",email="ravi.karra@duke.edu", biocViews="AnnotationData", chipName = "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", manufacturer = "Nimblegen", url = "http://www.nimblegen.com/products/exp/eukaryotic.html", organism = "Danio rerio", version = "1.1") makePdInfoPackage(seed, destDir = ".") > sessionInfo() R version 2.12.2 (2011-02-25) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods [7] base other attached packages: [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0 [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2 [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5 [10] Biobase_2.10.0 loaded via a namespace (and not attached): [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3 [4] splines_2.12.2 XML_3.2-0 On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: > > > On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages@fhcrc.org> wrote: > Hi Ravi, > > Sean's suggestion to map the probes directly to the transcriptome can > easily be achieved with the Biostrings/BSgenome/GenomicFeatures > infrastructure. > > The code below uses vwhichPDict (a variant of matchPDict) and the danRer6 > genome from UCSC (release name: Sanger Institute Zv8). Note that UCSC just > released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome package > for it yet. > > Also I don't think we support the Nimblegen Zebrafish 12 x135K Expression > chip so I'm using the zebrafish chip from Affymetrix instead. > > Depending on your machine and the quality of your internet connection, > this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. > > ## Load the probes: > library(zebrafishprobe) > library(Biostrings) > probes <- DNAStringSet(zebrafishprobe) > > ## Compute the transcriptome (32992 transcripts): > library(GenomicFeatures) > txdb <- makeTranscriptDbFromUCSC(genome="danRer6", tablename="ensGene") > library(BSgenome.Drerio.UCSC.danRer6) > txseqs <- extractTranscriptsFromGenome(Drerio, txdb) > ## Note that saving 'txseqs' with write.XStringSet() generates > ## a FASTA file that is 51M. > > ## Map 'probes' to 'txseqs' (allowing 2 mismatches): > pdict2 <- PDict(probes, max.mismatch=2) > m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) > makeMap <- function(m, probeids, txnames) > { > probeid <- probeids[unlist(m)] > txname <- rep.int(txnames, elementLengths(m)) > ans <- unique(data.frame(probeid, txname)) > rownames(ans) <- NULL > ans > } > map <- makeMap(m2, zebrafishprobe$Probe.Set.Name, names(txseqs)) > > Note that the longest step is not the call to vwhichPDict() but the > extraction of the transcriptome. > > However, the final result doesn't look that good. A quick look at the > mapping shows that only 62.4% of the probe ids are mapped and that some > probe ids are mapped to more than 1 transcript: > > > Nice example, Herve! > > Mapping to multiple transcripts is to be expected, so one needs to come up with ways of dealing with the situation. Missing probes is also typical. These can be dealt with by mapping to the unmatched probes to other transcript databases. For example, one could map to Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish ESTs next. > > > head(map) > probeid txname > 1 Dr.19494.1.S1_at ENSDART00000048610 > 2 Dr.19494.1.S1_at ENSDART00000103479 > 3 Dr.19494.1.S1_at ENSDART00000103478 > 4 Dr.10343.1.S1_at ENSDART00000006449 > 5 Dr.12057.1.A1_at ENSDART00000054814 > 6 Dr.22271.1.A1_at ENSDART00000054814 > > length(unique(zebrafishprobe$Probe.Set.Name)) > [1] 15617 > > length(unique(map$probeid)) > [1] 9746 > > any(duplicated(map$probeid)) > [1] TRUE > > But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K > Expression chip. > > > Keep in mind that with 60mer probes, there is potentially a need to allow gaps in the alignment and that consecutive matches of even 40 or 50 base pairs are likely to generate a signal on a microarray. > > Sean > > One more thing. While testing the above code, I discovered 2 problems > that were preventing it to work: one with the extractTranscriptsFromGenome() > function and one with the vwhichPDict() function. Those problems are now > fixed in the release and devel versions of GenomicFeatures (v 1.2.4 & 1.3.13) > and Biostrings (v 2.18.3 & 2.19.12). These new versions will become available > via biocLite() in the next 24-36 hours. > > Hope this helps. > H. > > > ----- Original Message ----- > From: "Ravi Karra" <ravi.karra@gmail.com> > To: "Sean Davis" <sdavis2@mail.nih.gov> > Cc: bioconductor@r-project.org > Sent: Saturday, February 26, 2011 2:56:02 PM > Subject: Re: [BioC] Mapping microarray probes to the genome using findOverlaps > > Hi Sean, > I could do that, but am not sure how to. The annotated zebrafish genome is the Tubingen strain and the some of the probes on the array are from the EK and AB strains. This means that I need to allow for SNP's in the alignment. I originally tried to align the probes to the Ensembl Transcripts using matchPDict but when I allowed for 2 mismatches (max.mismatch = 2) across the probe sequences, my computer never stopped running the program! I found TopHat to be much faster (8 min) and TopHat allows for a few nucleotide wobble by default!. > Do you have a suggestion(s) on another way to align the array to the Ensembl Transcripts? > Thanks, > Ravi > > > > > On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: > > > > > > > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra <ravi.karra@gmail.com> wrote: > > Hello, > > > > I am still trying to map probes on the Nimblegen Zebrafish 12 x135K Expression to the Zv9 version of the zebrafish genome available from Ensembl! I am very reluctantly pursuing an alignment approach to annotation as the original annotation provided with the array is quite outdated. > > > > I performed a gapped alignment using the individual probe sequences (60-mers) from the array using TopHat and loaded the results into Bioconductor as a GappedAlignments object. I have made a TranscriptDb object using the Zv9 genome from Ensembl. Next, I plan to use findOverlaps for the annotation. What is the best way to get the overlaps (by exon, cds, or by transcript)? I am a little concerned that using transcriptsByOverlaps might be a bit too broad and result in mapping reads to multiple genes (for example transcripts in the genome that have overlapping genomic coordinates). By contrast, mapping with exonsByOverlaps and cdsByOverlaps might be too restrictive and miss information in the UTR's. My gut feeling is that annotating by cds is the "in-between" approach. What is the recommended approach for RNA-seq? As you can tell, I am quite new to this! > > > > > > Hi, Ravi. Sorry to answer a question with more questions, but why not just map the probes against Ensembl Transcripts or refseq? What is the advantage of mapping to the genome and then going back to the transcripts? > > > > Sean > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
ADD REPLYlink written 8.1 years ago by Ravi Karra140
Hi Ravi, (my responses may be slow for these next days, sorry) Can you please confirm that the lines of your newly created NDF are unique? b On 8 March 2011 03:11, Ravi Karra <ravi.karra at="" gmail.com=""> wrote: > Hi Sean and Herve, > > Thanks a ton for the suggestions! ?I modified Herve's approach to use biomaRt. ?Zv9 is available via Biomart/Ensembl and has incorporated many of the VEGA transcripts, many of which are supposed to be on this array. I ended up mapping to the cdna's from zv9 and like Herve only match about 60% of probes to the ensembl transcripts. ?I too found many probes that map to multiple transcripts. ?Like the brainarray group, I chose to use the probes as a part of transcript- specific probesets... i.e. a given probe can contribute to the expression of multiple transcripts. ?Before trying to extend mapping to other databases as suggested by Sean, I wanted to try and create a custom ndf file and annotation package to give the analysis a go. ?However, I keep getting the error below using pdinfrobuilder and my custom ndf file. ?Did I construct my new ndf file incorrectly? I included my code and session info below. ?Sorry for the inefficient code... I am learning on the fly! > > Thanks everyone! ?This list has been immensely helpful! > > Ravi > > ==================================================================== === > Building annotation package for Nimblegen Expression Array > NDF: newndf.ndf > XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys > ==================================================================== === > Parsing file: newndf.ndf... OK > Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK > Merging NDF and XYS files... OK > Preparing contents for featureSet table... OK > Preparing contents for bgfeature table... OK > Preparing contents for pmfeature table... OK > Creating package in ./pd.newndf > Inserting 27649 rows into table featureSet... OK > Inserting 200368 rows into table pmfeature... Error in sqliteExecStatement(con, statement, bind.data) : > ?RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) > > My code: > > #1. Load the probes from the ndf file: > library (Biostrings) > array_file = "/Users/ravikarra/Documents/Research Projects/Regeneration Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" > probe_info = read.table (array_file, header = TRUE, sep = '\t') > probes = DNAStringSet (as.character(probe_info$PROBE_SEQUENCE)) > probeid = probe_info$PROBE_DESIGN_ID > probeid = as.character (probeid) > names(probes) = probeid > real_probes = which (width(probes) == 60) #don't align the control, empty, or the crosshyb probes > probes = probes [real_probes] > > #2. Compute the transcriptome (51569 transcripts with sequences as of March 7, .2011): > library (biomaRt) > ensembl = useMart ("ensembl") > ensemblzv9 = useDataset("drerio_gene_ensembl",mart=ensembl) > txIDs = getBM (attributes = c("ensembl_transcript_id"), mart = ensemblzv9)[,1] > txSeqs = getSequence (id = txIDs, type = "ensembl_transcript_id", ?seqType = "cdna", mart = ensemblzv9) > #make the DNAStringSet object > zv9txDict = DNAStringSet (txSeqs$cdna) > names(zv9txDict) = txSeqs$ensembl_transcript_id > > #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): > pdict2 <- PDict(probes, max.mismatch=2) > m2 = vwhichPDict(pdict2, zv9txDict, max.mismatch=2) > makeMap <- function(m, probeids, txnames) { > ? ? ? ?probeid <- probeids[unlist(m)] > ? ? ? ?txname <- rep.int(txnames, elementLengths(m)) > ? ? ? ?ans <- unique(data.frame(probeid, txname)) > ? ? ? ?rownames(ans) <- NULL > ? ? ? ?ans > } > map = makeMap(m2, names (probes), names(zv9txDict)) > > #4. Remove transcripts with less than 3 probes > txcounts = data.frame (table (map$txname)) > dim (txcounts) #start with 32961 different transcripts > usefultxs = txcounts [txcounts$Freq > 2,][,1] #end with 27647 transcripts > map = map [map$txname %in% usefultxs, ] > > #5: Make a new ndf file with the updated transcript information > #probe_info is the ndf file I originally loaded > rownames (probe_info) = probe_info[,1] > ndf = probe_info [as.character (map$probeid),] > ndf$SEQ_ID = map$txname > > #append the information for the control probes to the new ndf > control = probe_info [probe_info$PROBE_CLASS != "experimental",] > ndf = rbind (ndf, control) > > # change all the probes with unmapped transcripts to random but keep controls intact > noalign = !(as.character (probe_info$PROBE_DESIGN_ID) %in% as.character (ndf$PROBE_DESIGN_ID)) > noalignprobes = probe_info [noalign,] > noalignprobes $SEQ_ID = "noalign" > ndf = rbind (ndf, noalignprobes) > write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) > > #6. Make the new annotation package > #make custom chip descriptor file for Nimblegen > library(pdInfoBuilder) > filename_ndf <- list.files('.',pattern='.ndf',full.names=T) > filename_ndf = new("ScalarCharacter", filename_ndf) > filename_xys <- list.files('.', pattern=".xys",full.names = T) [1] > filename_xys = new("ScalarCharacter", filename_xys) > seed <- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu", biocViews="AnnotationData", chipName = "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", manufacturer = "Nimblegen", url = "http://www.nimblegen.com/products/exp/eukaryotic.html", organism = "Danio rerio", version = "1.1") > makePdInfoPackage(seed, destDir = ".") > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods > [7] base > > other attached packages: > ?[1] Biostrings_2.18.4 ? ?IRanges_1.8.9 ? ? ? ?biomaRt_2.6.0 > ?[4] pdInfoBuilder_1.14.1 oligo_1.14.0 ? ? ? ? oligoClasses_1.12.2 > ?[7] affxparser_1.22.1 ? ?RSQLite_0.9-4 ? ? ? ?DBI_0.2-5 > [10] Biobase_2.10.0 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 ? ? ? ? preprocessCore_1.12.0 RCurl_1.4-3 > [4] splines_2.12.2 ? ? ? ?XML_3.2-0 > > > On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: > >> >> >> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at="" fhcrc.org=""> wrote: >> Hi Ravi, >> >> Sean's suggestion to map the probes directly to the transcriptome can >> easily be achieved with the Biostrings/BSgenome/GenomicFeatures >> infrastructure. >> >> The code below uses vwhichPDict (a variant of matchPDict) and the danRer6 >> genome from UCSC (release name: Sanger Institute Zv8). Note that UCSC just >> released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome package >> for it yet. >> >> Also I don't think we support the Nimblegen Zebrafish 12 x135K Expression >> chip so I'm using the zebrafish chip from Affymetrix instead. >> >> Depending on your machine and the quality of your internet connection, >> this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. >> >> ?## Load the probes: >> ?library(zebrafishprobe) >> ?library(Biostrings) >> ?probes <- DNAStringSet(zebrafishprobe) >> >> ?## Compute the transcriptome (32992 transcripts): >> ?library(GenomicFeatures) >> ?txdb <- makeTranscriptDbFromUCSC(genome="danRer6", tablename="ensGene") >> ?library(BSgenome.Drerio.UCSC.danRer6) >> ?txseqs <- extractTranscriptsFromGenome(Drerio, txdb) >> ?## Note that saving 'txseqs' with write.XStringSet() generates >> ?## a FASTA file that is 51M. >> >> ?## Map 'probes' to 'txseqs' (allowing 2 mismatches): >> ?pdict2 <- PDict(probes, max.mismatch=2) >> ?m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) >> ?makeMap <- function(m, probeids, txnames) >> ?{ >> ? ?probeid <- probeids[unlist(m)] >> ? ?txname <- rep.int(txnames, elementLengths(m)) >> ? ?ans <- unique(data.frame(probeid, txname)) >> ? ?rownames(ans) <- NULL >> ? ?ans >> ?} >> ?map <- makeMap(m2, zebrafishprobe$Probe.Set.Name, names(txseqs)) >> >> Note that the longest step is not the call to vwhichPDict() but the >> extraction of the transcriptome. >> >> However, the final result doesn't look that good. A quick look at the >> mapping shows that only 62.4% of the probe ids are mapped and that some >> probe ids are mapped to more than 1 transcript: >> >> >> Nice example, Herve! >> >> Mapping to multiple transcripts is to be expected, so one needs to come up with ways of dealing with the situation. ?Missing probes is also typical. ?These can be dealt with by mapping to the unmatched probes to other transcript databases. ?For example, one could map to Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish ESTs next. >> >> ?> head(map) >> ? ? ? ? ? ? probeid ? ? ? ? ? ? txname >> ?1 Dr.19494.1.S1_at ENSDART00000048610 >> ?2 Dr.19494.1.S1_at ENSDART00000103479 >> ?3 Dr.19494.1.S1_at ENSDART00000103478 >> ?4 Dr.10343.1.S1_at ENSDART00000006449 >> ?5 Dr.12057.1.A1_at ENSDART00000054814 >> ?6 Dr.22271.1.A1_at ENSDART00000054814 >> ?> length(unique(zebrafishprobe$Probe.Set.Name)) >> ?[1] 15617 >> ?> length(unique(map$probeid)) >> ?[1] 9746 >> ?> any(duplicated(map$probeid)) >> ?[1] TRUE >> >> But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K >> Expression chip. >> >> >> Keep in mind that with 60mer probes, there is potentially a need to allow gaps in the alignment and that consecutive matches of even 40 or 50 base pairs are likely to generate a signal on a microarray. >> >> Sean >> >> One more thing. While testing the above code, I discovered 2 problems >> that were preventing it to work: one with the extractTranscriptsFromGenome() >> function and one with the vwhichPDict() function. Those problems are now >> fixed in the release and devel versions of GenomicFeatures (v 1.2.4 & 1.3.13) >> and Biostrings (v 2.18.3 & 2.19.12). These new versions will become available >> via biocLite() in the next 24-36 hours. >> >> Hope this helps. >> H. >> >> >> ----- Original Message ----- >> From: "Ravi Karra" <ravi.karra at="" gmail.com=""> >> To: "Sean Davis" <sdavis2 at="" mail.nih.gov=""> >> Cc: bioconductor at r-project.org >> Sent: Saturday, February 26, 2011 2:56:02 PM >> Subject: Re: [BioC] Mapping microarray probes to the genome using ? ? ? findOverlaps >> >> Hi Sean, >> I could do that, but am not sure how to. ?The annotated zebrafish genome is the Tubingen strain and the some of the probes on the array are from the EK and AB strains. ? This means that I need to allow for SNP's in the alignment. ?I originally tried to align the probes to the Ensembl Transcripts using matchPDict but when I allowed for 2 mismatches (max.mismatch = 2) across the probe sequences, my computer never stopped running the program! ?I found TopHat to be much faster (8 min) and TopHat allows for a few nucleotide wobble by default!. >> Do you have a suggestion(s) on another way to align the array to the Ensembl Transcripts? >> Thanks, >> Ravi >> >> >> >> >> On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: >> >> > >> > >> > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra <ravi.karra at="" gmail.com=""> wrote: >> > Hello, >> > >> > I am still trying to map probes on the Nimblegen Zebrafish 12 x135K Expression to the Zv9 version of the zebrafish genome available from Ensembl! ?I am very reluctantly pursuing an alignment approach to annotation as the original annotation provided with the array is quite outdated. >> > >> > I performed a gapped alignment using the individual probe sequences (60-mers) from the array using TopHat and loaded the results into Bioconductor as a GappedAlignments object. ?I have made a TranscriptDb object using the Zv9 genome from Ensembl. ?Next, I plan to use findOverlaps for the annotation. What is the best way to get the overlaps (by exon, cds, or by transcript)? ?I am a little concerned that using transcriptsByOverlaps might be a bit too broad and result in mapping reads to multiple genes (for example transcripts in the genome that have overlapping genomic coordinates). By contrast, mapping with exonsByOverlaps and cdsByOverlaps might be too restrictive and miss information in the UTR's. ?My gut feeling is that annotating by cds is the "in-between" approach. ?What is the recommended approach for RNA-seq? ?As you can tell, I am quite new to this! >> > >> > >> > Hi, Ravi. ?Sorry to answer a question with more questions, but why not just map the probes against Ensembl Transcripts or refseq? ?What is the advantage of mapping to the genome and then going back to the transcripts? >> > >> > Sean >> >> >> ? ? ? ?[[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLYlink written 8.1 years ago by Benilton Carvalho4.3k
Hi Benilton, I am pretty sure the rows are unique in my new ndf file (see below). However, how does pdinfobuilder determine uniqueness? While each row is unique, there is no column that has strictly unique elements... see for example PROBE_DESIGN_ID. Could that be the issue? As a somewhat related question, if a single probe is used in computing the expression for several different transcripts, doesn't that change the independence of the new transcript-centric probe sets? In your experience with transcript-centric remappings, has that been a problem in finding interesting genes with moderated t-tests? How does the brainarray group get around this issue (from what I can tell, their remapped probesets reuse individual probes)? Thanks again for all your time in helping, Ravi Code: ndf = read.table ("/Users/ravikarra/Documents/Research Projects/Regeneration Gene Expression/Microarray Data/Array Alignment/newndf.ndf", sep = "\t", header = TRUE) dim (ndf) [1] 232687 17 dim (unique (ndf)) [1] 232687 17 length (unique (ndf$PROBE_DESIGN_ID)) [1] 145868 On Mar 8, 2011, at 4:28 AM, Benilton Carvalho wrote: > Hi Ravi, > > (my responses may be slow for these next days, sorry) > > Can you please confirm that the lines of your newly created NDF are unique? > > b > > On 8 March 2011 03:11, Ravi Karra <ravi.karra at="" gmail.com=""> wrote: >> Hi Sean and Herve, >> >> Thanks a ton for the suggestions! I modified Herve's approach to use biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many of the VEGA transcripts, many of which are supposed to be on this array. I ended up mapping to the cdna's from zv9 and like Herve only match about 60% of probes to the ensembl transcripts. I too found many probes that map to multiple transcripts. Like the brainarray group, I chose to use the probes as a part of transcript- specific probesets... i.e. a given probe can contribute to the expression of multiple transcripts. Before trying to extend mapping to other databases as suggested by Sean, I wanted to try and create a custom ndf file and annotation package to give the analysis a go. However, I keep getting the error below using pdinfrobuilder and my custom ndf file. Did I construct my new ndf file incorrectly? I included my code and session info below. Sorry for the inefficient code... I am learning on the fly! >> >> Thanks everyone! This list has been immensely helpful! >> >> Ravi >> >> =================================================================== ==== >> Building annotation package for Nimblegen Expression Array >> NDF: newndf.ndf >> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys >> =================================================================== ==== >> Parsing file: newndf.ndf... OK >> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK >> Merging NDF and XYS files... OK >> Preparing contents for featureSet table... OK >> Preparing contents for bgfeature table... OK >> Preparing contents for pmfeature table... OK >> Creating package in ./pd.newndf >> Inserting 27649 rows into table featureSet... OK >> Inserting 200368 rows into table pmfeature... Error in sqliteExecStatement(con, statement, bind.data) : >> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) >> >> My code: >> >> #1. Load the probes from the ndf file: >> library (Biostrings) >> array_file = "/Users/ravikarra/Documents/Research Projects/Regeneration Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" >> probe_info = read.table (array_file, header = TRUE, sep = '\t') >> probes = DNAStringSet (as.character(probe_info$PROBE_SEQUENCE)) >> probeid = probe_info$PROBE_DESIGN_ID >> probeid = as.character (probeid) >> names(probes) = probeid >> real_probes = which (width(probes) == 60) #don't align the control, empty, or the crosshyb probes >> probes = probes [real_probes] >> >> #2. Compute the transcriptome (51569 transcripts with sequences as of March 7, .2011): >> library (biomaRt) >> ensembl = useMart ("ensembl") >> ensemblzv9 = useDataset("drerio_gene_ensembl",mart=ensembl) >> txIDs = getBM (attributes = c("ensembl_transcript_id"), mart = ensemblzv9)[,1] >> txSeqs = getSequence (id = txIDs, type = "ensembl_transcript_id", seqType = "cdna", mart = ensemblzv9) >> #make the DNAStringSet object >> zv9txDict = DNAStringSet (txSeqs$cdna) >> names(zv9txDict) = txSeqs$ensembl_transcript_id >> >> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): >> pdict2 <- PDict(probes, max.mismatch=2) >> m2 = vwhichPDict(pdict2, zv9txDict, max.mismatch=2) >> makeMap <- function(m, probeids, txnames) { >> probeid <- probeids[unlist(m)] >> txname <- rep.int(txnames, elementLengths(m)) >> ans <- unique(data.frame(probeid, txname)) >> rownames(ans) <- NULL >> ans >> } >> map = makeMap(m2, names (probes), names(zv9txDict)) >> >> #4. Remove transcripts with less than 3 probes >> txcounts = data.frame (table (map$txname)) >> dim (txcounts) #start with 32961 different transcripts >> usefultxs = txcounts [txcounts$Freq > 2,][,1] #end with 27647 transcripts >> map = map [map$txname %in% usefultxs, ] >> >> #5: Make a new ndf file with the updated transcript information >> #probe_info is the ndf file I originally loaded >> rownames (probe_info) = probe_info[,1] >> ndf = probe_info [as.character (map$probeid),] >> ndf$SEQ_ID = map$txname >> >> #append the information for the control probes to the new ndf >> control = probe_info [probe_info$PROBE_CLASS != "experimental",] >> ndf = rbind (ndf, control) >> >> # change all the probes with unmapped transcripts to random but keep controls intact >> noalign = !(as.character (probe_info$PROBE_DESIGN_ID) %in% as.character (ndf$PROBE_DESIGN_ID)) >> noalignprobes = probe_info [noalign,] >> noalignprobes $SEQ_ID = "noalign" >> ndf = rbind (ndf, noalignprobes) >> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) >> >> #6. Make the new annotation package >> #make custom chip descriptor file for Nimblegen >> library(pdInfoBuilder) >> filename_ndf <- list.files('.',pattern='.ndf',full.names=T) >> filename_ndf = new("ScalarCharacter", filename_ndf) >> filename_xys <- list.files('.', pattern=".xys",full.names = T) [1] >> filename_xys = new("ScalarCharacter", filename_xys) >> seed <- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu", biocViews="AnnotationData", chipName = "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", manufacturer = "Nimblegen", url = "http://www.nimblegen.com/products/exp/eukaryotic.html", organism = "Danio rerio", version = "1.1") >> makePdInfoPackage(seed, destDir = ".") >> >>> sessionInfo() >> R version 2.12.2 (2011-02-25) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods >> [7] base >> >> other attached packages: >> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0 >> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2 >> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5 >> [10] Biobase_2.10.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3 >> [4] splines_2.12.2 XML_3.2-0 >> >> >> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: >> >>> >>> >>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at="" fhcrc.org=""> wrote: >>> Hi Ravi, >>> >>> Sean's suggestion to map the probes directly to the transcriptome can >>> easily be achieved with the Biostrings/BSgenome/GenomicFeatures >>> infrastructure. >>> >>> The code below uses vwhichPDict (a variant of matchPDict) and the danRer6 >>> genome from UCSC (release name: Sanger Institute Zv8). Note that UCSC just >>> released danRer7 (Sanger Institute Zv9) but we don't have a BSgenome package >>> for it yet. >>> >>> Also I don't think we support the Nimblegen Zebrafish 12 x135K Expression >>> chip so I'm using the zebrafish chip from Affymetrix instead. >>> >>> Depending on your machine and the quality of your internet connection, >>> this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. >>> >>> ## Load the probes: >>> library(zebrafishprobe) >>> library(Biostrings) >>> probes <- DNAStringSet(zebrafishprobe) >>> >>> ## Compute the transcriptome (32992 transcripts): >>> library(GenomicFeatures) >>> txdb <- makeTranscriptDbFromUCSC(genome="danRer6", tablename="ensGene") >>> library(BSgenome.Drerio.UCSC.danRer6) >>> txseqs <- extractTranscriptsFromGenome(Drerio, txdb) >>> ## Note that saving 'txseqs' with write.XStringSet() generates >>> ## a FASTA file that is 51M. >>> >>> ## Map 'probes' to 'txseqs' (allowing 2 mismatches): >>> pdict2 <- PDict(probes, max.mismatch=2) >>> m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) >>> makeMap <- function(m, probeids, txnames) >>> { >>> probeid <- probeids[unlist(m)] >>> txname <- rep.int(txnames, elementLengths(m)) >>> ans <- unique(data.frame(probeid, txname)) >>> rownames(ans) <- NULL >>> ans >>> } >>> map <- makeMap(m2, zebrafishprobe$Probe.Set.Name, names(txseqs)) >>> >>> Note that the longest step is not the call to vwhichPDict() but the >>> extraction of the transcriptome. >>> >>> However, the final result doesn't look that good. A quick look at the >>> mapping shows that only 62.4% of the probe ids are mapped and that some >>> probe ids are mapped to more than 1 transcript: >>> >>> >>> Nice example, Herve! >>> >>> Mapping to multiple transcripts is to be expected, so one needs to come up with ways of dealing with the situation. Missing probes is also typical. These can be dealt with by mapping to the unmatched probes to other transcript databases. For example, one could map to Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish ESTs next. >>> >>> > head(map) >>> probeid txname >>> 1 Dr.19494.1.S1_at ENSDART00000048610 >>> 2 Dr.19494.1.S1_at ENSDART00000103479 >>> 3 Dr.19494.1.S1_at ENSDART00000103478 >>> 4 Dr.10343.1.S1_at ENSDART00000006449 >>> 5 Dr.12057.1.A1_at ENSDART00000054814 >>> 6 Dr.22271.1.A1_at ENSDART00000054814 >>> > length(unique(zebrafishprobe$Probe.Set.Name)) >>> [1] 15617 >>> > length(unique(map$probeid)) >>> [1] 9746 >>> > any(duplicated(map$probeid)) >>> [1] TRUE >>> >>> But maybe you will have more luck with the Nimblegen Zebrafish 12 x135K >>> Expression chip. >>> >>> >>> Keep in mind that with 60mer probes, there is potentially a need to allow gaps in the alignment and that consecutive matches of even 40 or 50 base pairs are likely to generate a signal on a microarray. >>> >>> Sean >>> >>> One more thing. While testing the above code, I discovered 2 problems >>> that were preventing it to work: one with the extractTranscriptsFromGenome() >>> function and one with the vwhichPDict() function. Those problems are now >>> fixed in the release and devel versions of GenomicFeatures (v 1.2.4 & 1.3.13) >>> and Biostrings (v 2.18.3 & 2.19.12). These new versions will become available >>> via biocLite() in the next 24-36 hours. >>> >>> Hope this helps. >>> H. >>> >>> >>> ----- Original Message ----- >>> From: "Ravi Karra" <ravi.karra at="" gmail.com=""> >>> To: "Sean Davis" <sdavis2 at="" mail.nih.gov=""> >>> Cc: bioconductor at r-project.org >>> Sent: Saturday, February 26, 2011 2:56:02 PM >>> Subject: Re: [BioC] Mapping microarray probes to the genome using findOverlaps >>> >>> Hi Sean, >>> I could do that, but am not sure how to. The annotated zebrafish genome is the Tubingen strain and the some of the probes on the array are from the EK and AB strains. This means that I need to allow for SNP's in the alignment. I originally tried to align the probes to the Ensembl Transcripts using matchPDict but when I allowed for 2 mismatches (max.mismatch = 2) across the probe sequences, my computer never stopped running the program! I found TopHat to be much faster (8 min) and TopHat allows for a few nucleotide wobble by default!. >>> Do you have a suggestion(s) on another way to align the array to the Ensembl Transcripts? >>> Thanks, >>> Ravi >>> >>> >>> >>> >>> On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: >>> >>>> >>>> >>>> On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra <ravi.karra at="" gmail.com=""> wrote: >>>> Hello, >>>> >>>> I am still trying to map probes on the Nimblegen Zebrafish 12 x135K Expression to the Zv9 version of the zebrafish genome available from Ensembl! I am very reluctantly pursuing an alignment approach to annotation as the original annotation provided with the array is quite outdated. >>>> >>>> I performed a gapped alignment using the individual probe sequences (60-mers) from the array using TopHat and loaded the results into Bioconductor as a GappedAlignments object. I have made a TranscriptDb object using the Zv9 genome from Ensembl. Next, I plan to use findOverlaps for the annotation. What is the best way to get the overlaps (by exon, cds, or by transcript)? I am a little concerned that using transcriptsByOverlaps might be a bit too broad and result in mapping reads to multiple genes (for example transcripts in the genome that have overlapping genomic coordinates). By contrast, mapping with exonsByOverlaps and cdsByOverlaps might be too restrictive and miss information in the UTR's. My gut feeling is that annotating by cds is the "in-between" approach. What is the recommended approach for RNA-seq? As you can tell, I am quite new to this! >>>> >>>> >>>> Hi, Ravi. Sorry to answer a question with more questions, but why not just map the probes against Ensembl Transcripts or refseq? What is the advantage of mapping to the genome and then going back to the transcripts? >>>> >>>> Sean >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>
ADD REPLYlink written 8.1 years ago by Ravi Karra140
Hi Ravi, FWIW, I just made a BSgenome package for Zv9. It will become available to BioC devel users in a couple of hours (source package only for now). Cheers, H. On 03/07/2011 07:11 PM, Ravi Karra wrote: > Hi Sean and Herve, > > Thanks a ton for the suggestions! I modified Herve's approach to use > biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many > of the VEGA transcripts, many of which are supposed to be on this array. > I ended up mapping to the cdna's from zv9 and like Herve only match > about 60% of probes to the ensembl transcripts. I too found many probes > that map to multiple transcripts. Like the brainarray group, I chose to > use the probes as a part of transcript-specific probesets... i.e. a > given probe can contribute to the expression of multiple transcripts. > Before trying to extend mapping to other databases as suggested by Sean, > I wanted to try and create a custom ndf file and annotation package to > give the analysis a go. However, I keep getting the error below using > pdinfrobuilder and my custom ndf file. Did I construct my new ndf file > incorrectly? I included my code and session info below. Sorry for the > inefficient code... I am learning on the fly! > > Thanks everyone! This list has been immensely helpful! > > Ravi > > ==================================================================== === > Building annotation package for Nimblegen Expression Array > NDF: newndf.ndf > XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys > ==================================================================== === > Parsing file: newndf.ndf... OK > Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK > Merging NDF and XYS files... OK > Preparing contents for featureSet table... OK > Preparing contents for bgfeature table... OK > Preparing contents for pmfeature table... OK > Creating package in ./pd.newndf > Inserting 27649 rows into table featureSet... OK > Inserting 200368 rows into table pmfeature... Error in > sqliteExecStatement(con, statement, bind.data) : > RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) > > My code: > > #1. Load the probes from the ndf file: > library(Biostrings) > array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration > Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" > probe_info= read.table(array_file, header= TRUE, sep= '\t') > probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE)) > probeid= probe_info$PROBE_DESIGN_ID > probeid= as.character(probeid) > names(probes) = probeid > real_probes= which(width(probes) == 60) #don't align the control, empty, > or the crosshyb probes > probes= probes[real_probes] > > #2. Compute the transcriptome (51569 transcripts with sequences as of > March 7, .2011): > library(biomaRt) > ensembl= useMart("ensembl") > ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl) > txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= ensemblzv9)[,1] > txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType= > "cdna", mart= ensemblzv9) > #make the DNAStringSet object > zv9txDict= DNAStringSet(txSeqs$cdna) > names(zv9txDict) = txSeqs$ensembl_transcript_id > > #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): > pdict2<- PDict(probes, max.mismatch=2) > m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2) > makeMap<- function(m, probeids, txnames) { > probeid<- probeids[unlist(m)] > txname<- rep.int(txnames, elementLengths(m)) > ans<- unique(data.frame(probeid, txname)) > rownames(ans) <- NULL > ans > } > map= makeMap(m2, names(probes), names(zv9txDict)) > > #4. Remove transcripts with less than 3 probes > txcounts= data.frame(table(map$txname)) > dim(txcounts) #start with 32961 different transcripts > usefultxs= txcounts[txcounts$Freq> 2,][,1] #end with 27647 transcripts > map= map[map$txname%in% usefultxs, ] > > #5: Make a new ndf file with the updated transcript information > #probe_info is the ndf file I originally loaded > rownames(probe_info) = probe_info[,1] > ndf= probe_info[as.character(map$probeid),] > ndf$SEQ_ID= map$txname > > #append the information for the control probes to the new ndf > control= probe_info[probe_info$PROBE_CLASS!= "experimental",] > ndf= rbind(ndf, control) > > # change all the probes with unmapped transcripts to random but keep > controls intact > noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in% > as.character(ndf$PROBE_DESIGN_ID)) > noalignprobes= probe_info[noalign,] > noalignprobes$SEQ_ID= "noalign" > ndf= rbind(ndf, noalignprobes) > write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) > > #6. Make the new annotation package > #make custom chip descriptor file for Nimblegen > library(pdInfoBuilder) > filename_ndf<- list.files('.',pattern='.ndf',full.names=T) > filename_ndf= new("ScalarCharacter", filename_ndf) > filename_xys<- list.files('.', pattern=".xys",full.names= T) [1] > filename_xys= new("ScalarCharacter", filename_xys) > seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, > xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu > <mailto:ravi.karra at="" duke.edu="">", biocViews="AnnotationData", chipName= > "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", > manufacturer= "Nimblegen", url= > "http://www.nimblegen.com/products/exp/eukaryotic.html", organism= > "Danio rerio", version= "1.1") > makePdInfoPackage(seed, destDir= ".") > >> sessionInfo() > R version 2.12.2 (2011-02-25) > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods > [7] base > > other attached packages: > [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0 > [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2 > [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5 > [10] Biobase_2.10.0 > > loaded via a namespace (and not attached): > [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3 > [4] splines_2.12.2 XML_3.2-0 > > > On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: > >> >> >> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at="" fhcrc.org="">> <mailto:hpages at="" fhcrc.org="">> wrote: >> >> Hi Ravi, >> >> Sean's suggestion to map the probes directly to the transcriptome can >> easily be achieved with the Biostrings/BSgenome/GenomicFeatures >> infrastructure. >> >> The code below uses vwhichPDict (a variant of matchPDict) and the >> danRer6 >> genome from UCSC (release name: Sanger Institute Zv8). Note that >> UCSC just >> released danRer7 (Sanger Institute Zv9) but we don't have a >> BSgenome package >> for it yet. >> >> Also I don't think we support the Nimblegen Zebrafish 12 x135K >> Expression >> chip so I'm using the zebrafish chip from Affymetrix instead. >> >> Depending on your machine and the quality of your internet connection, >> this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. >> >> ## Load the probes: >> library(zebrafishprobe) >> library(Biostrings) >> probes <- DNAStringSet(zebrafishprobe) >> >> ## Compute the transcriptome (32992 transcripts): >> library(GenomicFeatures) >> txdb <- makeTranscriptDbFromUCSC(genome="danRer6", >> tablename="ensGene") >> library(BSgenome.Drerio.UCSC.danRer6) >> txseqs <- extractTranscriptsFromGenome(Drerio, txdb) >> ## Note that saving 'txseqs' with write.XStringSet() generates >> ## a FASTA file that is 51M. >> >> ## Map 'probes' to 'txseqs' (allowing 2 mismatches): >> pdict2 <- PDict(probes, max.mismatch=2) >> m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) >> makeMap <- function(m, probeids, txnames) >> { >> probeid <- probeids[unlist(m)] >> txname <- rep.int <http: rep.int=""/>(txnames, elementLengths(m)) >> ans <- unique(data.frame(probeid, txname)) >> rownames(ans) <- NULL >> ans >> } >> map <- makeMap(m2, zebrafishprobe$Probe.Set.Name >> <http: probe.set.name=""/>, names(txseqs)) >> >> Note that the longest step is not the call to vwhichPDict() but the >> extraction of the transcriptome. >> >> However, the final result doesn't look that good. A quick look at the >> mapping shows that only 62.4% of the probe ids are mapped and that >> some >> probe ids are mapped to more than 1 transcript: >> >> >> Nice example, Herve! >> >> Mapping to multiple transcripts is to be expected, so one needs to >> come up with ways of dealing with the situation. Missing probes is >> also typical. These can be dealt with by mapping to the unmatched >> probes to other transcript databases. For example, one could map to >> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish >> ESTs next. >> >> > head(map) >> probeid txname >> 1 Dr.19494.1.S1_at ENSDART00000048610 >> 2 Dr.19494.1.S1_at ENSDART00000103479 >> 3 Dr.19494.1.S1_at ENSDART00000103478 >> 4 Dr.10343.1.S1_at ENSDART00000006449 >> 5 Dr.12057.1.A1_at ENSDART00000054814 >> 6 Dr.22271.1.A1_at ENSDART00000054814 >> > length(unique(zebrafishprobe$Probe.Set.Name >> <http: probe.set.name=""/>)) >> [1] 15617 >> > length(unique(map$probeid)) >> [1] 9746 >> > any(duplicated(map$probeid)) >> [1] TRUE >> >> But maybe you will have more luck with the Nimblegen Zebrafish 12 >> x135K >> Expression chip. >> >> >> Keep in mind that with 60mer probes, there is potentially a need to >> allow gaps in the alignment and that consecutive matches of even 40 or >> 50 base pairs are likely to generate a signal on a microarray. >> >> Sean >> >> One more thing. While testing the above code, I discovered 2 problems >> that were preventing it to work: one with the >> extractTranscriptsFromGenome() >> function and one with the vwhichPDict() function. Those problems >> are now >> fixed in the release and devel versions of GenomicFeatures (v >> 1.2.4 & 1.3.13) >> and Biostrings (v 2.18.3 & 2.19.12). These new versions will >> become available >> via biocLite() in the next 24-36 hours. >> >> Hope this helps. >> H. >> >> >> ----- Original Message ----- >> From: "Ravi Karra" <ravi.karra at="" gmail.com="">> <mailto:ravi.karra at="" gmail.com="">> >> To: "Sean Davis" <sdavis2 at="" mail.nih.gov="" <mailto:sdavis2="" at="" mail.nih.gov="">> >> Cc: bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> Sent: Saturday, February 26, 2011 2:56:02 PM >> Subject: Re: [BioC] Mapping microarray probes to the genome using >> findOverlaps >> >> Hi Sean, >> I could do that, but am not sure how to. The annotated zebrafish >> genome is the Tubingen strain and the some of the probes on the >> array are from the EK and AB strains. This means that I need to >> allow for SNP's in the alignment. I originally tried to align the >> probes to the Ensembl Transcripts using matchPDict but when I >> allowed for 2 mismatches (max.mismatch = 2) across the probe >> sequences, my computer never stopped running the program! I found >> TopHat to be much faster (8 min) and TopHat allows for a few >> nucleotide wobble by default!. >> Do you have a suggestion(s) on another way to align the array to >> the Ensembl Transcripts? >> Thanks, >> Ravi >> >> >> >> >> On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: >> >> > >> > >> > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra >> <ravi.karra at="" gmail.com="" <mailto:ravi.karra="" at="" gmail.com="">> wrote: >> > Hello, >> > >> > I am still trying to map probes on the Nimblegen Zebrafish 12 >> x135K Expression to the Zv9 version of the zebrafish genome >> available from Ensembl! I am very reluctantly pursuing an >> alignment approach to annotation as the original annotation >> provided with the array is quite outdated. >> > >> > I performed a gapped alignment using the individual probe >> sequences (60-mers) from the array using TopHat and loaded the >> results into Bioconductor as a GappedAlignments object. I have >> made a TranscriptDb object using the Zv9 genome from Ensembl. >> Next, I plan to use findOverlaps for the annotation. What is the >> best way to get the overlaps (by exon, cds, or by transcript)? I >> am a little concerned that using transcriptsByOverlaps might be a >> bit too broad and result in mapping reads to multiple genes (for >> example transcripts in the genome that have overlapping genomic >> coordinates). By contrast, mapping with exonsByOverlaps and >> cdsByOverlaps might be too restrictive and miss information in the >> UTR's. My gut feeling is that annotating by cds is the >> "in-between" approach. What is the recommended approach for >> RNA-seq? As you can tell, I am quite new to this! >> > >> > >> > Hi, Ravi. Sorry to answer a question with more questions, but >> why not just map the probes against Ensembl Transcripts or refseq? >> What is the advantage of mapping to the genome and then going back >> to the transcripts? >> > >> > Sean >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 8.1 years ago by Hervé Pagès ♦♦ 13k
Thanks for the suppprt Herve! Do you or Benilton (or anyone else!) have any ideas on how to redesign the .ndf file to allow PdInfoBuilder to use the updated probeset information from these alignments? Thanks again for all of the help, Ravi On Mar 8, 2011, at 8:29 PM, Hervé Pagès wrote: > Hi Ravi, > > FWIW, I just made a BSgenome package for Zv9. It will become available > to BioC devel users in a couple of hours (source package only for now). > > Cheers, > H. > > > On 03/07/2011 07:11 PM, Ravi Karra wrote: >> Hi Sean and Herve, >> >> Thanks a ton for the suggestions! I modified Herve's approach to use >> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many >> of the VEGA transcripts, many of which are supposed to be on this array. >> I ended up mapping to the cdna's from zv9 and like Herve only match >> about 60% of probes to the ensembl transcripts. I too found many probes >> that map to multiple transcripts. Like the brainarray group, I chose to >> use the probes as a part of transcript-specific probesets... i.e. a >> given probe can contribute to the expression of multiple transcripts. >> Before trying to extend mapping to other databases as suggested by Sean, >> I wanted to try and create a custom ndf file and annotation package to >> give the analysis a go. However, I keep getting the error below using >> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file >> incorrectly? I included my code and session info below. Sorry for the >> inefficient code... I am learning on the fly! >> >> Thanks everyone! This list has been immensely helpful! >> >> Ravi >> >> =================================================================== ==== >> Building annotation package for Nimblegen Expression Array >> NDF: newndf.ndf >> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys >> =================================================================== ==== >> Parsing file: newndf.ndf... OK >> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK >> Merging NDF and XYS files... OK >> Preparing contents for featureSet table... OK >> Preparing contents for bgfeature table... OK >> Preparing contents for pmfeature table... OK >> Creating package in ./pd.newndf >> Inserting 27649 rows into table featureSet... OK >> Inserting 200368 rows into table pmfeature... Error in >> sqliteExecStatement(con, statement, bind.data) : >> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) >> >> My code: >> >> #1. Load the probes from the ndf file: >> library(Biostrings) >> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration >> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" >> probe_info= read.table(array_file, header= TRUE, sep= '\t') >> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE)) >> probeid= probe_info$PROBE_DESIGN_ID >> probeid= as.character(probeid) >> names(probes) = probeid >> real_probes= which(width(probes) == 60) #don't align the control, empty, >> or the crosshyb probes >> probes= probes[real_probes] >> >> #2. Compute the transcriptome (51569 transcripts with sequences as of >> March 7, .2011): >> library(biomaRt) >> ensembl= useMart("ensembl") >> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl) >> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= ensemblzv9)[,1] >> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType= >> "cdna", mart= ensemblzv9) >> #make the DNAStringSet object >> zv9txDict= DNAStringSet(txSeqs$cdna) >> names(zv9txDict) = txSeqs$ensembl_transcript_id >> >> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): >> pdict2<- PDict(probes, max.mismatch=2) >> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2) >> makeMap<- function(m, probeids, txnames) { >> probeid<- probeids[unlist(m)] >> txname<- rep.int(txnames, elementLengths(m)) >> ans<- unique(data.frame(probeid, txname)) >> rownames(ans) <- NULL >> ans >> } >> map= makeMap(m2, names(probes), names(zv9txDict)) >> >> #4. Remove transcripts with less than 3 probes >> txcounts= data.frame(table(map$txname)) >> dim(txcounts) #start with 32961 different transcripts >> usefultxs= txcounts[txcounts$Freq> 2,][,1] #end with 27647 transcripts >> map= map[map$txname%in% usefultxs, ] >> >> #5: Make a new ndf file with the updated transcript information >> #probe_info is the ndf file I originally loaded >> rownames(probe_info) = probe_info[,1] >> ndf= probe_info[as.character(map$probeid),] >> ndf$SEQ_ID= map$txname >> >> #append the information for the control probes to the new ndf >> control= probe_info[probe_info$PROBE_CLASS!= "experimental",] >> ndf= rbind(ndf, control) >> >> # change all the probes with unmapped transcripts to random but keep >> controls intact >> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in% >> as.character(ndf$PROBE_DESIGN_ID)) >> noalignprobes= probe_info[noalign,] >> noalignprobes$SEQ_ID= "noalign" >> ndf= rbind(ndf, noalignprobes) >> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) >> >> #6. Make the new annotation package >> #make custom chip descriptor file for Nimblegen >> library(pdInfoBuilder) >> filename_ndf<- list.files('.',pattern='.ndf',full.names=T) >> filename_ndf= new("ScalarCharacter", filename_ndf) >> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1] >> filename_xys= new("ScalarCharacter", filename_xys) >> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, >> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu >> <mailto:ravi.karra at="" duke.edu="">", biocViews="AnnotationData", chipName= >> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", >> manufacturer= "Nimblegen", url= >> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism= >> "Danio rerio", version= "1.1") >> makePdInfoPackage(seed, destDir= ".") >> >>> sessionInfo() >> R version 2.12.2 (2011-02-25) >> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods >> [7] base >> >> other attached packages: >> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0 >> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2 >> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5 >> [10] Biobase_2.10.0 >> >> loaded via a namespace (and not attached): >> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3 >> [4] splines_2.12.2 XML_3.2-0 >> >> >> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: >> >>> >>> >>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve <hpages at="" fhcrc.org="">>> <mailto:hpages at="" fhcrc.org="">> wrote: >>> >>> Hi Ravi, >>> >>> Sean's suggestion to map the probes directly to the transcriptome can >>> easily be achieved with the Biostrings/BSgenome/GenomicFeatures >>> infrastructure. >>> >>> The code below uses vwhichPDict (a variant of matchPDict) and the >>> danRer6 >>> genome from UCSC (release name: Sanger Institute Zv8). Note that >>> UCSC just >>> released danRer7 (Sanger Institute Zv9) but we don't have a >>> BSgenome package >>> for it yet. >>> >>> Also I don't think we support the Nimblegen Zebrafish 12 x135K >>> Expression >>> chip so I'm using the zebrafish chip from Affymetrix instead. >>> >>> Depending on your machine and the quality of your internet connection, >>> this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. >>> >>> ## Load the probes: >>> library(zebrafishprobe) >>> library(Biostrings) >>> probes <- DNAStringSet(zebrafishprobe) >>> >>> ## Compute the transcriptome (32992 transcripts): >>> library(GenomicFeatures) >>> txdb <- makeTranscriptDbFromUCSC(genome="danRer6", >>> tablename="ensGene") >>> library(BSgenome.Drerio.UCSC.danRer6) >>> txseqs <- extractTranscriptsFromGenome(Drerio, txdb) >>> ## Note that saving 'txseqs' with write.XStringSet() generates >>> ## a FASTA file that is 51M. >>> >>> ## Map 'probes' to 'txseqs' (allowing 2 mismatches): >>> pdict2 <- PDict(probes, max.mismatch=2) >>> m2 <- vwhichPDict(pdict2, txseqs, max.mismatch=2) >>> makeMap <- function(m, probeids, txnames) >>> { >>> probeid <- probeids[unlist(m)] >>> txname <- rep.int <http: rep.int=""/>(txnames, elementLengths(m)) >>> ans <- unique(data.frame(probeid, txname)) >>> rownames(ans) <- NULL >>> ans >>> } >>> map <- makeMap(m2, zebrafishprobe$Probe.Set.Name >>> <http: probe.set.name=""/>, names(txseqs)) >>> >>> Note that the longest step is not the call to vwhichPDict() but the >>> extraction of the transcriptome. >>> >>> However, the final result doesn't look that good. A quick look at the >>> mapping shows that only 62.4% of the probe ids are mapped and that >>> some >>> probe ids are mapped to more than 1 transcript: >>> >>> >>> Nice example, Herve! >>> >>> Mapping to multiple transcripts is to be expected, so one needs to >>> come up with ways of dealing with the situation. Missing probes is >>> also typical. These can be dealt with by mapping to the unmatched >>> probes to other transcript databases. For example, one could map to >>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish >>> ESTs next. >>> >>> > head(map) >>> probeid txname >>> 1 Dr.19494.1.S1_at ENSDART00000048610 >>> 2 Dr.19494.1.S1_at ENSDART00000103479 >>> 3 Dr.19494.1.S1_at ENSDART00000103478 >>> 4 Dr.10343.1.S1_at ENSDART00000006449 >>> 5 Dr.12057.1.A1_at ENSDART00000054814 >>> 6 Dr.22271.1.A1_at ENSDART00000054814 >>> > length(unique(zebrafishprobe$Probe.Set.Name >>> <http: probe.set.name=""/>)) >>> [1] 15617 >>> > length(unique(map$probeid)) >>> [1] 9746 >>> > any(duplicated(map$probeid)) >>> [1] TRUE >>> >>> But maybe you will have more luck with the Nimblegen Zebrafish 12 >>> x135K >>> Expression chip. >>> >>> >>> Keep in mind that with 60mer probes, there is potentially a need to >>> allow gaps in the alignment and that consecutive matches of even 40 or >>> 50 base pairs are likely to generate a signal on a microarray. >>> >>> Sean >>> >>> One more thing. While testing the above code, I discovered 2 problems >>> that were preventing it to work: one with the >>> extractTranscriptsFromGenome() >>> function and one with the vwhichPDict() function. Those problems >>> are now >>> fixed in the release and devel versions of GenomicFeatures (v >>> 1.2.4 & 1.3.13) >>> and Biostrings (v 2.18.3 & 2.19.12). These new versions will >>> become available >>> via biocLite() in the next 24-36 hours. >>> >>> Hope this helps. >>> H. >>> >>> >>> ----- Original Message ----- >>> From: "Ravi Karra" <ravi.karra at="" gmail.com="">>> <mailto:ravi.karra at="" gmail.com="">> >>> To: "Sean Davis" <sdavis2 at="" mail.nih.gov="" <mailto:sdavis2="" at="" mail.nih.gov="">> >>> Cc: bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>> Sent: Saturday, February 26, 2011 2:56:02 PM >>> Subject: Re: [BioC] Mapping microarray probes to the genome using >>> findOverlaps >>> >>> Hi Sean, >>> I could do that, but am not sure how to. The annotated zebrafish >>> genome is the Tubingen strain and the some of the probes on the >>> array are from the EK and AB strains. This means that I need to >>> allow for SNP's in the alignment. I originally tried to align the >>> probes to the Ensembl Transcripts using matchPDict but when I >>> allowed for 2 mismatches (max.mismatch = 2) across the probe >>> sequences, my computer never stopped running the program! I found >>> TopHat to be much faster (8 min) and TopHat allows for a few >>> nucleotide wobble by default!. >>> Do you have a suggestion(s) on another way to align the array to >>> the Ensembl Transcripts? >>> Thanks, >>> Ravi >>> >>> >>> >>> >>> On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: >>> >>> > >>> > >>> > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra >>> <ravi.karra at="" gmail.com="" <mailto:ravi.karra="" at="" gmail.com="">> wrote: >>> > Hello, >>> > >>> > I am still trying to map probes on the Nimblegen Zebrafish 12 >>> x135K Expression to the Zv9 version of the zebrafish genome >>> available from Ensembl! I am very reluctantly pursuing an >>> alignment approach to annotation as the original annotation >>> provided with the array is quite outdated. >>> > >>> > I performed a gapped alignment using the individual probe >>> sequences (60-mers) from the array using TopHat and loaded the >>> results into Bioconductor as a GappedAlignments object. I have >>> made a TranscriptDb object using the Zv9 genome from Ensembl. >>> Next, I plan to use findOverlaps for the annotation. What is the >>> best way to get the overlaps (by exon, cds, or by transcript)? I >>> am a little concerned that using transcriptsByOverlaps might be a >>> bit too broad and result in mapping reads to multiple genes (for >>> example transcripts in the genome that have overlapping genomic >>> coordinates). By contrast, mapping with exonsByOverlaps and >>> cdsByOverlaps might be too restrictive and miss information in the >>> UTR's. My gut feeling is that annotating by cds is the >>> "in-between" approach. What is the recommended approach for >>> RNA-seq? As you can tell, I am quite new to this! >>> > >>> > >>> > Hi, Ravi. Sorry to answer a question with more questions, but >>> why not just map the probes against Ensembl Transcripts or refseq? >>> What is the advantage of mapping to the genome and then going back >>> to the transcripts? >>> > >>> > Sean >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
ADD REPLYlink written 8.1 years ago by Ravi Karra140
Sorry Ravi, I've no idea how to do this. H. On 03/13/2011 07:03 PM, Ravi Karra wrote: > Thanks for the suppprt Herve! > Do you or Benilton (or anyone else!) have any ideas on how to redesign the .ndf file to allow PdInfoBuilder to use the updated probeset information from these alignments? > Thanks again for all of the help, > Ravi > > On Mar 8, 2011, at 8:29 PM, Hervé Pagès wrote: > >> Hi Ravi, >> >> FWIW, I just made a BSgenome package for Zv9. It will become available >> to BioC devel users in a couple of hours (source package only for now). >> >> Cheers, >> H. >> >> >> On 03/07/2011 07:11 PM, Ravi Karra wrote: >>> Hi Sean and Herve, >>> >>> Thanks a ton for the suggestions! I modified Herve's approach to use >>> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many >>> of the VEGA transcripts, many of which are supposed to be on this array. >>> I ended up mapping to the cdna's from zv9 and like Herve only match >>> about 60% of probes to the ensembl transcripts. I too found many probes >>> that map to multiple transcripts. Like the brainarray group, I chose to >>> use the probes as a part of transcript-specific probesets... i.e. a >>> given probe can contribute to the expression of multiple transcripts. >>> Before trying to extend mapping to other databases as suggested by Sean, >>> I wanted to try and create a custom ndf file and annotation package to >>> give the analysis a go. However, I keep getting the error below using >>> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file >>> incorrectly? I included my code and session info below. Sorry for the >>> inefficient code... I am learning on the fly! >>> >>> Thanks everyone! This list has been immensely helpful! >>> >>> Ravi >>> >>> ================================================================== ===== >>> Building annotation package for Nimblegen Expression Array >>> NDF: newndf.ndf >>> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys >>> ================================================================== ===== >>> Parsing file: newndf.ndf... OK >>> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK >>> Merging NDF and XYS files... OK >>> Preparing contents for featureSet table... OK >>> Preparing contents for bgfeature table... OK >>> Preparing contents for pmfeature table... OK >>> Creating package in ./pd.newndf >>> Inserting 27649 rows into table featureSet... OK >>> Inserting 200368 rows into table pmfeature... Error in >>> sqliteExecStatement(con, statement, bind.data) : >>> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) >>> >>> My code: >>> >>> #1. Load the probes from the ndf file: >>> library(Biostrings) >>> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration >>> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" >>> probe_info= read.table(array_file, header= TRUE, sep= '\t') >>> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE)) >>> probeid= probe_info$PROBE_DESIGN_ID >>> probeid= as.character(probeid) >>> names(probes) = probeid >>> real_probes= which(width(probes) == 60) #don't align the control, empty, >>> or the crosshyb probes >>> probes= probes[real_probes] >>> >>> #2. Compute the transcriptome (51569 transcripts with sequences as of >>> March 7, .2011): >>> library(biomaRt) >>> ensembl= useMart("ensembl") >>> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl) >>> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= ensemblzv9)[,1] >>> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType= >>> "cdna", mart= ensemblzv9) >>> #make the DNAStringSet object >>> zv9txDict= DNAStringSet(txSeqs$cdna) >>> names(zv9txDict) = txSeqs$ensembl_transcript_id >>> >>> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): >>> pdict2<- PDict(probes, max.mismatch=2) >>> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2) >>> makeMap<- function(m, probeids, txnames) { >>> probeid<- probeids[unlist(m)] >>> txname<- rep.int(txnames, elementLengths(m)) >>> ans<- unique(data.frame(probeid, txname)) >>> rownames(ans)<- NULL >>> ans >>> } >>> map= makeMap(m2, names(probes), names(zv9txDict)) >>> >>> #4. Remove transcripts with less than 3 probes >>> txcounts= data.frame(table(map$txname)) >>> dim(txcounts) #start with 32961 different transcripts >>> usefultxs= txcounts[txcounts$Freq> 2,][,1] #end with 27647 transcripts >>> map= map[map$txname%in% usefultxs, ] >>> >>> #5: Make a new ndf file with the updated transcript information >>> #probe_info is the ndf file I originally loaded >>> rownames(probe_info) = probe_info[,1] >>> ndf= probe_info[as.character(map$probeid),] >>> ndf$SEQ_ID= map$txname >>> >>> #append the information for the control probes to the new ndf >>> control= probe_info[probe_info$PROBE_CLASS!= "experimental",] >>> ndf= rbind(ndf, control) >>> >>> # change all the probes with unmapped transcripts to random but keep >>> controls intact >>> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in% >>> as.character(ndf$PROBE_DESIGN_ID)) >>> noalignprobes= probe_info[noalign,] >>> noalignprobes$SEQ_ID= "noalign" >>> ndf= rbind(ndf, noalignprobes) >>> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) >>> >>> #6. Make the new annotation package >>> #make custom chip descriptor file for Nimblegen >>> library(pdInfoBuilder) >>> filename_ndf<- list.files('.',pattern='.ndf',full.names=T) >>> filename_ndf= new("ScalarCharacter", filename_ndf) >>> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1] >>> filename_xys= new("ScalarCharacter", filename_xys) >>> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, >>> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu >>> <mailto:ravi.karra at="" duke.edu="">", biocViews="AnnotationData", chipName= >>> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", >>> manufacturer= "Nimblegen", url= >>> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism= >>> "Danio rerio", version= "1.1") >>> makePdInfoPackage(seed, destDir= ".") >>> >>>> sessionInfo() >>> R version 2.12.2 (2011-02-25) >>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods >>> [7] base >>> >>> other attached packages: >>> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0 >>> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2 >>> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5 >>> [10] Biobase_2.10.0 >>> >>> loaded via a namespace (and not attached): >>> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3 >>> [4] splines_2.12.2 XML_3.2-0 >>> >>> >>> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: >>> >>>> >>>> >>>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve<hpages at="" fhcrc.org="">>>> <mailto:hpages at="" fhcrc.org="">> wrote: >>>> >>>> Hi Ravi, >>>> >>>> Sean's suggestion to map the probes directly to the transcriptome can >>>> easily be achieved with the Biostrings/BSgenome/GenomicFeatures >>>> infrastructure. >>>> >>>> The code below uses vwhichPDict (a variant of matchPDict) and the >>>> danRer6 >>>> genome from UCSC (release name: Sanger Institute Zv8). Note that >>>> UCSC just >>>> released danRer7 (Sanger Institute Zv9) but we don't have a >>>> BSgenome package >>>> for it yet. >>>> >>>> Also I don't think we support the Nimblegen Zebrafish 12 x135K >>>> Expression >>>> chip so I'm using the zebrafish chip from Affymetrix instead. >>>> >>>> Depending on your machine and the quality of your internet connection, >>>> this code should run in 8-15 minutes. Less than 800 Mb of RAM is used. >>>> >>>> ## Load the probes: >>>> library(zebrafishprobe) >>>> library(Biostrings) >>>> probes<- DNAStringSet(zebrafishprobe) >>>> >>>> ## Compute the transcriptome (32992 transcripts): >>>> library(GenomicFeatures) >>>> txdb<- makeTranscriptDbFromUCSC(genome="danRer6", >>>> tablename="ensGene") >>>> library(BSgenome.Drerio.UCSC.danRer6) >>>> txseqs<- extractTranscriptsFromGenome(Drerio, txdb) >>>> ## Note that saving 'txseqs' with write.XStringSet() generates >>>> ## a FASTA file that is 51M. >>>> >>>> ## Map 'probes' to 'txseqs' (allowing 2 mismatches): >>>> pdict2<- PDict(probes, max.mismatch=2) >>>> m2<- vwhichPDict(pdict2, txseqs, max.mismatch=2) >>>> makeMap<- function(m, probeids, txnames) >>>> { >>>> probeid<- probeids[unlist(m)] >>>> txname<- rep.int<http: rep.int=""/>(txnames, elementLengths(m)) >>>> ans<- unique(data.frame(probeid, txname)) >>>> rownames(ans)<- NULL >>>> ans >>>> } >>>> map<- makeMap(m2, zebrafishprobe$Probe.Set.Name >>>> <http: probe.set.name=""/>, names(txseqs)) >>>> >>>> Note that the longest step is not the call to vwhichPDict() but the >>>> extraction of the transcriptome. >>>> >>>> However, the final result doesn't look that good. A quick look at the >>>> mapping shows that only 62.4% of the probe ids are mapped and that >>>> some >>>> probe ids are mapped to more than 1 transcript: >>>> >>>> >>>> Nice example, Herve! >>>> >>>> Mapping to multiple transcripts is to be expected, so one needs to >>>> come up with ways of dealing with the situation. Missing probes is >>>> also typical. These can be dealt with by mapping to the unmatched >>>> probes to other transcript databases. For example, one could map to >>>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish >>>> ESTs next. >>>> >>>> > head(map) >>>> probeid txname >>>> 1 Dr.19494.1.S1_at ENSDART00000048610 >>>> 2 Dr.19494.1.S1_at ENSDART00000103479 >>>> 3 Dr.19494.1.S1_at ENSDART00000103478 >>>> 4 Dr.10343.1.S1_at ENSDART00000006449 >>>> 5 Dr.12057.1.A1_at ENSDART00000054814 >>>> 6 Dr.22271.1.A1_at ENSDART00000054814 >>>> > length(unique(zebrafishprobe$Probe.Set.Name >>>> <http: probe.set.name=""/>)) >>>> [1] 15617 >>>> > length(unique(map$probeid)) >>>> [1] 9746 >>>> > any(duplicated(map$probeid)) >>>> [1] TRUE >>>> >>>> But maybe you will have more luck with the Nimblegen Zebrafish 12 >>>> x135K >>>> Expression chip. >>>> >>>> >>>> Keep in mind that with 60mer probes, there is potentially a need to >>>> allow gaps in the alignment and that consecutive matches of even 40 or >>>> 50 base pairs are likely to generate a signal on a microarray. >>>> >>>> Sean >>>> >>>> One more thing. While testing the above code, I discovered 2 problems >>>> that were preventing it to work: one with the >>>> extractTranscriptsFromGenome() >>>> function and one with the vwhichPDict() function. Those problems >>>> are now >>>> fixed in the release and devel versions of GenomicFeatures (v >>>> 1.2.4& 1.3.13) >>>> and Biostrings (v 2.18.3& 2.19.12). These new versions will >>>> become available >>>> via biocLite() in the next 24-36 hours. >>>> >>>> Hope this helps. >>>> H. >>>> >>>> >>>> ----- Original Message ----- >>>> From: "Ravi Karra"<ravi.karra at="" gmail.com="">>>> <mailto:ravi.karra at="" gmail.com="">> >>>> To: "Sean Davis"<sdavis2 at="" mail.nih.gov<mailto:sdavis2="" at="" mail.nih.gov="">> >>>> Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>> Sent: Saturday, February 26, 2011 2:56:02 PM >>>> Subject: Re: [BioC] Mapping microarray probes to the genome using >>>> findOverlaps >>>> >>>> Hi Sean, >>>> I could do that, but am not sure how to. The annotated zebrafish >>>> genome is the Tubingen strain and the some of the probes on the >>>> array are from the EK and AB strains. This means that I need to >>>> allow for SNP's in the alignment. I originally tried to align the >>>> probes to the Ensembl Transcripts using matchPDict but when I >>>> allowed for 2 mismatches (max.mismatch = 2) across the probe >>>> sequences, my computer never stopped running the program! I found >>>> TopHat to be much faster (8 min) and TopHat allows for a few >>>> nucleotide wobble by default!. >>>> Do you have a suggestion(s) on another way to align the array to >>>> the Ensembl Transcripts? >>>> Thanks, >>>> Ravi >>>> >>>> >>>> >>>> >>>> On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: >>>> >>>> > >>>> > >>>> > On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra >>>> <ravi.karra at="" gmail.com<mailto:ravi.karra="" at="" gmail.com="">> wrote: >>>> > Hello, >>>> > >>>> > I am still trying to map probes on the Nimblegen Zebrafish 12 >>>> x135K Expression to the Zv9 version of the zebrafish genome >>>> available from Ensembl! I am very reluctantly pursuing an >>>> alignment approach to annotation as the original annotation >>>> provided with the array is quite outdated. >>>> > >>>> > I performed a gapped alignment using the individual probe >>>> sequences (60-mers) from the array using TopHat and loaded the >>>> results into Bioconductor as a GappedAlignments object. I have >>>> made a TranscriptDb object using the Zv9 genome from Ensembl. >>>> Next, I plan to use findOverlaps for the annotation. What is the >>>> best way to get the overlaps (by exon, cds, or by transcript)? I >>>> am a little concerned that using transcriptsByOverlaps might be a >>>> bit too broad and result in mapping reads to multiple genes (for >>>> example transcripts in the genome that have overlapping genomic >>>> coordinates). By contrast, mapping with exonsByOverlaps and >>>> cdsByOverlaps might be too restrictive and miss information in the >>>> UTR's. My gut feeling is that annotating by cds is the >>>> "in-between" approach. What is the recommended approach for >>>> RNA-seq? As you can tell, I am quite new to this! >>>> > >>>> > >>>> > Hi, Ravi. Sorry to answer a question with more questions, but >>>> why not just map the probes against Ensembl Transcripts or refseq? >>>> What is the advantage of mapping to the genome and then going back >>>> to the transcripts? >>>> > >>>> > Sean >>>> >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLYlink written 8.1 years ago by Hervé Pagès ♦♦ 13k
We're working on this offline... will get back to the ML once it's solved... but, essentially, the issue is that after his modification, the NDF has multiple rows per X/Y coordinates... b 2011/3/21 Hervé Pagès <hpages at="" fhcrc.org="">: > Sorry Ravi, I've no idea how to do this. ?H. > > On 03/13/2011 07:03 PM, Ravi Karra wrote: >> >> Thanks for the suppprt Herve! >> Do you or Benilton (or anyone else!) have any ideas on how to redesign the >> .ndf file to allow PdInfoBuilder to use the updated probeset information >> from these alignments? >> Thanks again for all of the help, >> Ravi >> >> On Mar 8, 2011, at 8:29 PM, Hervé Pagès wrote: >> >>> Hi Ravi, >>> >>> FWIW, I just made a BSgenome package for Zv9. It will become available >>> to BioC devel users in a couple of hours (source package only for now). >>> >>> Cheers, >>> H. >>> >>> >>> On 03/07/2011 07:11 PM, Ravi Karra wrote: >>>> >>>> Hi Sean and Herve, >>>> >>>> Thanks a ton for the suggestions! I modified Herve's approach to use >>>> biomaRt. Zv9 is available via Biomart/Ensembl and has incorporated many >>>> of the VEGA transcripts, many of which are supposed to be on this array. >>>> I ended up mapping to the cdna's from zv9 and like Herve only match >>>> about 60% of probes to the ensembl transcripts. I too found many probes >>>> that map to multiple transcripts. Like the brainarray group, I chose to >>>> use the probes as a part of transcript-specific probesets... i.e. a >>>> given probe can contribute to the expression of multiple transcripts. >>>> Before trying to extend mapping to other databases as suggested by Sean, >>>> I wanted to try and create a custom ndf file and annotation package to >>>> give the analysis a go. However, I keep getting the error below using >>>> pdinfrobuilder and my custom ndf file. Did I construct my new ndf file >>>> incorrectly? I included my code and session info below. Sorry for the >>>> inefficient code... I am learning on the fly! >>>> >>>> Thanks everyone! This list has been immensely helpful! >>>> >>>> Ravi >>>> >>>> ================================================================= ====== >>>> Building annotation package for Nimblegen Expression Array >>>> NDF: newndf.ndf >>>> XYS: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys >>>> ================================================================= ====== >>>> Parsing file: newndf.ndf... OK >>>> Parsing file: 428234_Slot26_Cycle9_Duke_Fang_2010-08-06_532.xys... OK >>>> Merging NDF and XYS files... OK >>>> Preparing contents for featureSet table... OK >>>> Preparing contents for bgfeature table... OK >>>> Preparing contents for pmfeature table... OK >>>> Creating package in ./pd.newndf >>>> Inserting 27649 rows into table featureSet... OK >>>> Inserting 200368 rows into table pmfeature... Error in >>>> sqliteExecStatement(con, statement, bind.data) : >>>> RS-DBI driver: (RS_SQLite_exec: could not execute: constraint failed) >>>> >>>> My code: >>>> >>>> #1. Load the probes from the ndf file: >>>> library(Biostrings) >>>> array_file= "/Users/ravikarra/Documents/Research Projects/Regeneration >>>> Gene Expression/Bioconductor Code/Array Mapping/090818_Zv7_EXPR.ndf" >>>> probe_info= read.table(array_file, header= TRUE, sep= '\t') >>>> probes= DNAStringSet(as.character(probe_info$PROBE_SEQUENCE)) >>>> probeid= probe_info$PROBE_DESIGN_ID >>>> probeid= as.character(probeid) >>>> names(probes) = probeid >>>> real_probes= which(width(probes) == 60) #don't align the control, empty, >>>> or the crosshyb probes >>>> probes= probes[real_probes] >>>> >>>> #2. Compute the transcriptome (51569 transcripts with sequences as of >>>> March 7, .2011): >>>> library(biomaRt) >>>> ensembl= useMart("ensembl") >>>> ensemblzv9= useDataset("drerio_gene_ensembl",mart=ensembl) >>>> txIDs= getBM(attributes= c("ensembl_transcript_id"), mart= >>>> ensemblzv9)[,1] >>>> txSeqs= getSequence(id= txIDs, type= "ensembl_transcript_id", seqType= >>>> "cdna", mart= ensemblzv9) >>>> #make the DNAStringSet object >>>> zv9txDict= DNAStringSet(txSeqs$cdna) >>>> names(zv9txDict) = txSeqs$ensembl_transcript_id >>>> >>>> #3 Map 'probes' to 'txseqs' (allowing 2 mismatches): >>>> pdict2<- PDict(probes, max.mismatch=2) >>>> m2= vwhichPDict(pdict2, zv9txDict, max.mismatch=2) >>>> makeMap<- function(m, probeids, txnames) { >>>> probeid<- probeids[unlist(m)] >>>> txname<- rep.int(txnames, elementLengths(m)) >>>> ans<- unique(data.frame(probeid, txname)) >>>> rownames(ans)<- NULL >>>> ans >>>> } >>>> map= makeMap(m2, names(probes), names(zv9txDict)) >>>> >>>> #4. Remove transcripts with less than 3 probes >>>> txcounts= data.frame(table(map$txname)) >>>> dim(txcounts) #start with 32961 different transcripts >>>> usefultxs= txcounts[txcounts$Freq> ?2,][,1] #end with 27647 transcripts >>>> map= map[map$txname%in% usefultxs, ] >>>> >>>> #5: Make a new ndf file with the updated transcript information >>>> #probe_info is the ndf file I originally loaded >>>> rownames(probe_info) = probe_info[,1] >>>> ndf= probe_info[as.character(map$probeid),] >>>> ndf$SEQ_ID= map$txname >>>> >>>> #append the information for the control probes to the new ndf >>>> control= probe_info[probe_info$PROBE_CLASS!= "experimental",] >>>> ndf= rbind(ndf, control) >>>> >>>> # change all the probes with unmapped transcripts to random but keep >>>> controls intact >>>> noalign= !(as.character(probe_info$PROBE_DESIGN_ID) %in% >>>> as.character(ndf$PROBE_DESIGN_ID)) >>>> noalignprobes= probe_info[noalign,] >>>> noalignprobes$SEQ_ID= "noalign" >>>> ndf= rbind(ndf, noalignprobes) >>>> write.table (ndf, "newndf.ndf", sep = '\t', row.names = FALSE) >>>> >>>> #6. Make the new annotation package >>>> #make custom chip descriptor file for Nimblegen >>>> library(pdInfoBuilder) >>>> filename_ndf<- list.files('.',pattern='.ndf',full.names=T) >>>> filename_ndf= new("ScalarCharacter", filename_ndf) >>>> filename_xys<- list.files('.', pattern=".xys",full.names= T) [1] >>>> filename_xys= new("ScalarCharacter", filename_xys) >>>> seed<- new("NgsExpressionPDInfoPkgSeed", ndfFile=filename_ndf, >>>> xysFile=filename_xys, author="Ravi Karra",email="ravi.karra at duke.edu >>>> <mailto:ravi.karra at="" duke.edu="">", biocViews="AnnotationData", chipName= >>>> "Danio rerio 12x135K Array", genomebuild="Ensembl and UCSC Zv7", >>>> manufacturer= "Nimblegen", url= >>>> "http://www.nimblegen.com/products/exp/eukaryotic.html", organism= >>>> "Danio rerio", version= "1.1") >>>> makePdInfoPackage(seed, destDir= ".") >>>> >>>>> sessionInfo() >>>> >>>> R version 2.12.2 (2011-02-25) >>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) >>>> >>>> locale: >>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods >>>> [7] base >>>> >>>> other attached packages: >>>> [1] Biostrings_2.18.4 IRanges_1.8.9 biomaRt_2.6.0 >>>> [4] pdInfoBuilder_1.14.1 oligo_1.14.0 oligoClasses_1.12.2 >>>> [7] affxparser_1.22.1 RSQLite_0.9-4 DBI_0.2-5 >>>> [10] Biobase_2.10.0 >>>> >>>> loaded via a namespace (and not attached): >>>> [1] affyio_1.18.0 preprocessCore_1.12.0 RCurl_1.4-3 >>>> [4] splines_2.12.2 XML_3.2-0 >>>> >>>> >>>> On Mar 1, 2011, at 7:08 AM, Sean Davis wrote: >>>> >>>>> >>>>> >>>>> On Tue, Mar 1, 2011 at 4:20 AM, Pages, Herve<hpages at="" fhcrc.org="">>>>> <mailto:hpages at="" fhcrc.org="">> ?wrote: >>>>> >>>>> ? ?Hi Ravi, >>>>> >>>>> ? ?Sean's suggestion to map the probes directly to the transcriptome >>>>> can >>>>> ? ?easily be achieved with the Biostrings/BSgenome/GenomicFeatures >>>>> ? ?infrastructure. >>>>> >>>>> ? ?The code below uses vwhichPDict (a variant of matchPDict) and the >>>>> ? ?danRer6 >>>>> ? ?genome from UCSC (release name: Sanger Institute Zv8). Note that >>>>> ? ?UCSC just >>>>> ? ?released danRer7 (Sanger Institute Zv9) but we don't have a >>>>> ? ?BSgenome package >>>>> ? ?for it yet. >>>>> >>>>> ? ?Also I don't think we support the Nimblegen Zebrafish 12 x135K >>>>> ? ?Expression >>>>> ? ?chip so I'm using the zebrafish chip from Affymetrix instead. >>>>> >>>>> ? ?Depending on your machine and the quality of your internet >>>>> connection, >>>>> ? ?this code should run in 8-15 minutes. Less than 800 Mb of RAM is >>>>> used. >>>>> >>>>> ? ?## Load the probes: >>>>> ? ?library(zebrafishprobe) >>>>> ? ?library(Biostrings) >>>>> ? ?probes<- DNAStringSet(zebrafishprobe) >>>>> >>>>> ? ?## Compute the transcriptome (32992 transcripts): >>>>> ? ?library(GenomicFeatures) >>>>> ? ?txdb<- makeTranscriptDbFromUCSC(genome="danRer6", >>>>> ? ?tablename="ensGene") >>>>> ? ?library(BSgenome.Drerio.UCSC.danRer6) >>>>> ? ?txseqs<- extractTranscriptsFromGenome(Drerio, txdb) >>>>> ? ?## Note that saving 'txseqs' with write.XStringSet() generates >>>>> ? ?## a FASTA file that is 51M. >>>>> >>>>> ? ?## Map 'probes' to 'txseqs' (allowing 2 mismatches): >>>>> ? ?pdict2<- PDict(probes, max.mismatch=2) >>>>> ? ?m2<- vwhichPDict(pdict2, txseqs, max.mismatch=2) >>>>> ? ?makeMap<- function(m, probeids, txnames) >>>>> ? ?{ >>>>> ? ?probeid<- probeids[unlist(m)] >>>>> ? ?txname<- rep.int<http: rep.int=""/>(txnames, elementLengths(m)) >>>>> ? ?ans<- unique(data.frame(probeid, txname)) >>>>> ? ?rownames(ans)<- NULL >>>>> ? ?ans >>>>> ? ?} >>>>> ? ?map<- makeMap(m2, zebrafishprobe$Probe.Set.Name >>>>> ? ?<http: probe.set.name=""/>, names(txseqs)) >>>>> >>>>> ? ?Note that the longest step is not the call to vwhichPDict() but the >>>>> ? ?extraction of the transcriptome. >>>>> >>>>> ? ?However, the final result doesn't look that good. A quick look at >>>>> the >>>>> ? ?mapping shows that only 62.4% of the probe ids are mapped and that >>>>> ? ?some >>>>> ? ?probe ids are mapped to more than 1 transcript: >>>>> >>>>> >>>>> Nice example, Herve! >>>>> >>>>> Mapping to multiple transcripts is to be expected, so one needs to >>>>> come up with ways of dealing with the situation. Missing probes is >>>>> also typical. These can be dealt with by mapping to the unmatched >>>>> probes to other transcript databases. For example, one could map to >>>>> Ensembl first, RefSeq second, all zebrafish mRNAs next, all zebrafish >>>>> ESTs next. >>>>> >>>>> ? ?> ?head(map) >>>>> ? ?probeid txname >>>>> ? ?1 Dr.19494.1.S1_at ENSDART00000048610 >>>>> ? ?2 Dr.19494.1.S1_at ENSDART00000103479 >>>>> ? ?3 Dr.19494.1.S1_at ENSDART00000103478 >>>>> ? ?4 Dr.10343.1.S1_at ENSDART00000006449 >>>>> ? ?5 Dr.12057.1.A1_at ENSDART00000054814 >>>>> ? ?6 Dr.22271.1.A1_at ENSDART00000054814 >>>>> ? ?> ?length(unique(zebrafishprobe$Probe.Set.Name >>>>> ? ?<http: probe.set.name=""/>)) >>>>> ? ?[1] 15617 >>>>> ? ?> ?length(unique(map$probeid)) >>>>> ? ?[1] 9746 >>>>> ? ?> ?any(duplicated(map$probeid)) >>>>> ? ?[1] TRUE >>>>> >>>>> ? ?But maybe you will have more luck with the Nimblegen Zebrafish 12 >>>>> ? ?x135K >>>>> ? ?Expression chip. >>>>> >>>>> >>>>> Keep in mind that with 60mer probes, there is potentially a need to >>>>> allow gaps in the alignment and that consecutive matches of even 40 or >>>>> 50 base pairs are likely to generate a signal on a microarray. >>>>> >>>>> Sean >>>>> >>>>> ? ?One more thing. While testing the above code, I discovered 2 >>>>> problems >>>>> ? ?that were preventing it to work: one with the >>>>> ? ?extractTranscriptsFromGenome() >>>>> ? ?function and one with the vwhichPDict() function. Those problems >>>>> ? ?are now >>>>> ? ?fixed in the release and devel versions of GenomicFeatures (v >>>>> ? ?1.2.4& ?1.3.13) >>>>> ? ?and Biostrings (v 2.18.3& ?2.19.12). These new versions will >>>>> ? ?become available >>>>> ? ?via biocLite() in the next 24-36 hours. >>>>> >>>>> ? ?Hope this helps. >>>>> ? ?H. >>>>> >>>>> >>>>> ? ?----- Original Message ----- >>>>> ? ?From: "Ravi Karra"<ravi.karra at="" gmail.com="">>>>> ? ?<mailto:ravi.karra at="" gmail.com="">> >>>>> ? ?To: "Sean Davis"<sdavis2 at="" mail.nih.gov<mailto:sdavis2="" at="" mail.nih.gov="">> >>>>> ? ?Cc: bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>>> ? ?Sent: Saturday, February 26, 2011 2:56:02 PM >>>>> ? ?Subject: Re: [BioC] Mapping microarray probes to the genome using >>>>> ? ?findOverlaps >>>>> >>>>> ? ?Hi Sean, >>>>> ? ?I could do that, but am not sure how to. The annotated zebrafish >>>>> ? ?genome is the Tubingen strain and the some of the probes on the >>>>> ? ?array are from the EK and AB strains. This means that I need to >>>>> ? ?allow for SNP's in the alignment. I originally tried to align the >>>>> ? ?probes to the Ensembl Transcripts using matchPDict but when I >>>>> ? ?allowed for 2 mismatches (max.mismatch = 2) across the probe >>>>> ? ?sequences, my computer never stopped running the program! I found >>>>> ? ?TopHat to be much faster (8 min) and TopHat allows for a few >>>>> ? ?nucleotide wobble by default!. >>>>> ? ?Do you have a suggestion(s) on another way to align the array to >>>>> ? ?the Ensembl Transcripts? >>>>> ? ?Thanks, >>>>> ? ?Ravi >>>>> >>>>> >>>>> >>>>> >>>>> ? ?On Feb 26, 2011, at 5:45 PM, Sean Davis wrote: >>>>> >>>>> ? ?> >>>>> ? ?> >>>>> ? ?> ?On Sat, Feb 26, 2011 at 5:29 PM, Ravi Karra >>>>> ? ?<ravi.karra at="" gmail.com<mailto:ravi.karra="" at="" gmail.com="">> ?wrote: >>>>> ? ?> ?Hello, >>>>> ? ?> >>>>> ? ?> ?I am still trying to map probes on the Nimblegen Zebrafish 12 >>>>> ? ?x135K Expression to the Zv9 version of the zebrafish genome >>>>> ? ?available from Ensembl! I am very reluctantly pursuing an >>>>> ? ?alignment approach to annotation as the original annotation >>>>> ? ?provided with the array is quite outdated. >>>>> ? ?> >>>>> ? ?> ?I performed a gapped alignment using the individual probe >>>>> ? ?sequences (60-mers) from the array using TopHat and loaded the >>>>> ? ?results into Bioconductor as a GappedAlignments object. I have >>>>> ? ?made a TranscriptDb object using the Zv9 genome from Ensembl. >>>>> ? ?Next, I plan to use findOverlaps for the annotation. What is the >>>>> ? ?best way to get the overlaps (by exon, cds, or by transcript)? I >>>>> ? ?am a little concerned that using transcriptsByOverlaps might be a >>>>> ? ?bit too broad and result in mapping reads to multiple genes (for >>>>> ? ?example transcripts in the genome that have overlapping genomic >>>>> ? ?coordinates). By contrast, mapping with exonsByOverlaps and >>>>> ? ?cdsByOverlaps might be too restrictive and miss information in the >>>>> ? ?UTR's. My gut feeling is that annotating by cds is the >>>>> ? ?"in-between" approach. What is the recommended approach for >>>>> ? ?RNA-seq? As you can tell, I am quite new to this! >>>>> ? ?> >>>>> ? ?> >>>>> ? ?> ?Hi, Ravi. Sorry to answer a question with more questions, but >>>>> ? ?why not just map the probes against Ensembl Transcripts or refseq? >>>>> ? ?What is the advantage of mapping to the genome and then going back >>>>> ? ?to the transcripts? >>>>> ? ?> >>>>> ? ?> ?Sean >>>>> >>>>> >>>>> ? ?[[alternative HTML version deleted]] >>>>> >>>>> ? ?_______________________________________________ >>>>> ? ?Bioconductor mailing list >>>>> ? ?Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>>> ? ?https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> ? ?Search the archives: >>>>> ? ?http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> ? ?_______________________________________________ >>>>> ? ?Bioconductor mailing list >>>>> ? ?Bioconductor at r-project.org<mailto:bioconductor at="" r-project.org=""> >>>>> ? ?https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> ? ?Search the archives: >>>>> ? ?http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> >>> >>> >>> -- >>> Hervé Pagès >>> >>> Program in Computational Biology >>> Division of Public Health Sciences >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N, M2-B876 >>> P.O. Box 19024 >>> Seattle, WA 98109-1024 >>> >>> E-mail: hpages at fhcrc.org >>> Phone: ?(206) 667-5791 >>> Fax: ? ?(206) 667-1319 >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: ?(206) 667-5791 > Fax: ? ?(206) 667-1319 >
ADD REPLYlink written 8.1 years ago by Benilton Carvalho4.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 272 users visited in the last hour