Drosophila tiling array background correction
1
0
Entering edit mode
@benjamin-lansdell-2920
Last seen 9.6 years ago
Hi, I have been trying to analyse drosophila development tiling arrays using Bioconductor. In particular I would like to perform some form of background correction for probe affinities (such as GCRMA) and then quantile normalisation but have been unable to find any packages that can do so without a .cdf file. I have a collection of .cel files and a .bpmap file. I know the package oligo can perform RMA using a package that you build from a .bpmap file but this isn't quite what I want. Is it possible to create the necessary structures myself (AffyBatch, something that contains the probe sequences, etc) for use with the many packages that seem to rely on a .cdf file? Thanks in advance, Ben
cdf probe oligo cdf probe oligo • 1000 views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi Ben, I don't really have any answers for you, but I'm interested in hearing them myself as well :-) I'm making a few comments below ... I'm sorry, they turned out to be rather long and I'm not making any claim that I know exactly what I'm doing, I'm just trying what makes most sense to me at the moment. If anybody has any positive/negative feedback, I'd be happy to hear. On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote: > Hi, > > I have been trying to analyse drosophila development tiling arrays > using Bioconductor. In particular I would like to perform some form > of background correction for probe affinities (such as GCRMA) and > then quantile normalisation but have been unable to find any > packages that can do so without a .cdf file. I have a collection > of .cel files and a .bpmap file. I've been wrestling with 1.0R Drosophila tiling array myself and haven't really been able to use the AffyBatch objects I get (from reading their .CEL files with ReadAffy) with any success. By that I mean I haven't been able to then pass the batch object into any other processing step for further analysis/QA assessment, even though I think I made the appropriate CDFs correctly (more on that below) So, really the only thing I use the AffyBatch object is to just get the expression info from it: > exps <- exprs(myAffyBatch) You can then quantile normalize the data by just passing the exps expression matrix into the affy::normalize.quantiles function since it only expects a matrix. I've also done my own MvA plots for rudimentary QA. > I know the package oligo can perform RMA using a package that you > build from a .bpmap file but this isn't quite what I want. Can it? My impression was that RMA relies on the idea of "probe sets," which I don't think really applies to tiling arrays as much. I see that it does SNPRMA ... I don't really know what the details of that are, though. Certainly you can annotate your probes to construct your own probe sets, though, but you also have a majority of probes on the array that don't really belong to any probe set. > Is it possible to create the necessary structures myself (AffyBatch, > something that contains the probe sequences, etc) for use with the > many packages that seem to rely on a .cdf file? I've built the appropriate CDF-like structure (I think) by using the makePlatformDesign::makePDpackage, but still never really got it act like a CDF for my affybatch objects like Bioconductor expects. If you haven't done so, I've put mine up previously for someone else to download, and you can still grab them. They were built for both chips of the 1.0R version of the tiling array (forward and backward strand (aka MF and MR)) on a 64-bit Linux machine: http://cbio.mskcc.org/~lianos/files/bioconductor/ I've been manually attacking this data, probably doing more work than necessary since I can't figure out how to leverage most of bioconductor tools just yet. Ideally I'd like to take a similar approach to what was done here: http://genomebiology.com/2008/9/7/R112 The methods appeared earlier in the proceedings of this year's PSB conference. They've also provided MATLAB source code implementing their analysis techniques: http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling In reality, as a first approach, I'm probably just going to implement (in R) their slightly modified version of M. Gerstein's Sequence Quantile Normalization: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/988 and explore the signals I'm getting off different loci of the genome. I'm hoping I can pull this off within the time constraints of my current rotation project. If others find it useful, and I'm successful, I'd be happy to share it back w/ the R/BioC community. In my adventures, I've created some SQLite database that you might find useful: (i) A DB with all of the annotation data for the Dmel (v5) genome (protein coding genes, RNA's, exonic positions, etc) from the GTF file taken from ensembl: ftp://ftp.ensembl.org/pub/current_gtf/ (ii) A DB with meta data about the tiling chips, this includes the following tables: (a) probe information as extracted from the bpmap file (sequence, x/ y chip coordinates, pm/mm, etc) (b) alignment information for every probe to the latest release of the genome. For this I used MUMmer and had it return all alignments >= 23 NTs in length (taking Wolfgang Huber's choice from his tilingArray package) (c) a table used for quick lookup of how many times each probe aligns perfectly, and "almost perfectly" to the reference genome. I'm now building RData probe_annotation objects for each chromosome that tie this information together. This data.frame essentially has: (i) probe information (sequence, x/y, pm/mm) (ii) number of perfect/close hits for the probe (iii) where in the genome the probe (perfectly) lands: gene name, type of gene (protein coding, miRNA, snoRNA, rRNA, etc) and whether it's exonic/intronic. If you think any of these would be helpful to you, let me know and I'll try to put them up when I think they're reasonably correct. Like I said before, I'm probably not leveraging the BioC tools that are available as I've gotten lost in the myriad of options (which are good to have) that are there. -steve -- Steve Lianoglou Graduate Student: Physiology, Biophysics and Systems Biology Weill Cornell Medical College http://cbio.mskcc.org/~lianos
ADD COMMENT
0
Entering edit mode
Hi, you can extract the signals manually (watch out that takes some time..). ##################################################### library(affy) library(affxparser) #here the location of your bpmap bpmap_file <- "Dm35b_MR_v02-3_BDGPv4h.bpmap" bpmap <- readBpmap(bpmap_file) summary(bpmap) # you get a list of chromosome mappings only a few of them are Drosophila ones # these you specify in here: allchrs <- c("Dm:BDGPv4h;chr2L", "Dm:BDGPv4h;chr2R","Dm:BDGPv4h;chr2h","Dm:BDGPv4h;chr3L", "Dm:BDGPv4h;chr3R","Dm:BDGPv4h;chr3h","Dm:BDGPv4h;chr4", "Dm:BDGPv4h;chr4h","Dm:BDGPv4h;chrU","Dm:BDGPv4h;chrX", "Dm:BDGPv4h;chrXh","Dm:BDGPv4h;chrYh") cel_files <- c("file1.cel","file2.cel", "file3.cel", "file4.cel", "file5.cel", "file6.cel") cel_header <- readCelHeader(cel_files[1]) cel_indices <- lapply(bpmap, function(x) xy2indices(x[["pmx"]], x[["pmy"]], nr=cel_header[["cols"]])) chr<-allchrs[1] data <- readCelIntensities(cel_files, cel_indices[[chr]]) signals <- data colnames(signals) <- cel_files layout <- data.frame(chr= factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq= bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]], stringsAsFactors = FALSE) layout$probe.id <- paste(layout[,1], layout[,3], sep="_") lm <- layout for (chr in allchrs[2:length(allchrs)]) { data <- readCelIntensities(cel_files, cel_indices[[chr]]) signals <-rbind(signals, data) layout <- data.frame(chr= factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq= bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]], stringsAsFactors = FALSE) layout$probe.id <- paste(layout[,1], layout[,3], sep="_") lm <- rbind(lm, layout) } rownames(signals)<-lm$probe.id layout<-data.frame("probe.id"=lm$probe.id, "chr"=sub("chr","",lm$chr), "pos"=lm$pos, "sequence"=lm$seq) layout$length=c(25) #################################### you end up with a 'signal' matrix and a 'layout' dataframe the signals you can easily normalize without having them in an AffyBatch object. sth. like norm <- exprs(vsn2(signals, subsample=200000)) or whatever else best Tobias On Jul 16, 2008, at 4:01 PM, Steve Lianoglou wrote: > Hi Ben, > > I don't really have any answers for you, but I'm interested in > hearing them myself as well :-) > > I'm making a few comments below ... I'm sorry, they turned out to be > rather long and I'm not making any claim that I know exactly what > I'm doing, I'm just trying what makes most sense to me at the moment. > > If anybody has any positive/negative feedback, I'd be happy to hear. > > On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote: > >> Hi, >> >> I have been trying to analyse drosophila development tiling arrays >> using Bioconductor. In particular I would like to perform some form >> of background correction for probe affinities (such as GCRMA) and >> then quantile normalisation but have been unable to find any >> packages that can do so without a .cdf file. I have a collection >> of .cel files and a .bpmap file. > > I've been wrestling with 1.0R Drosophila tiling array myself and > haven't really been able to use the AffyBatch objects I get (from > reading their .CEL files with ReadAffy) with any success. By that I > mean I haven't been able to then pass the batch object into any > other processing step for further analysis/QA assessment, even > though I think I made the appropriate CDFs correctly (more on that > below) > > So, really the only thing I use the AffyBatch object is to just get > the expression info from it: > > > exps <- exprs(myAffyBatch) > > You can then quantile normalize the data by just passing the exps > expression matrix into the affy::normalize.quantiles function since > it only expects a matrix. > > I've also done my own MvA plots for rudimentary QA. > >> I know the package oligo can perform RMA using a package that you >> build from a .bpmap file but this isn't quite what I want. > > Can it? My impression was that RMA relies on the idea of "probe > sets," which I don't think really applies to tiling arrays as much. > I see that it does SNPRMA ... I don't really know what the details > of that are, though. > > Certainly you can annotate your probes to construct your own probe > sets, though, but you also have a majority of probes on the array > that don't really belong to any probe set. > >> Is it possible to create the necessary structures myself >> (AffyBatch, something that contains the probe sequences, etc) for >> use with the many packages that seem to rely on a .cdf file? > > > I've built the appropriate CDF-like structure (I think) by using the > makePlatformDesign::makePDpackage, but still never really got it act > like a CDF for my affybatch objects like Bioconductor expects. If > you haven't done so, I've put mine up previously for someone else to > download, and you can still grab them. They were built for both > chips of the 1.0R version of the tiling array (forward and backward > strand (aka MF and MR)) on a 64-bit Linux machine: > > http://cbio.mskcc.org/~lianos/files/bioconductor/ > > I've been manually attacking this data, probably doing more work > than necessary since I can't figure out how to leverage most of > bioconductor tools just yet. > > Ideally I'd like to take a similar approach to what was done here: > http://genomebiology.com/2008/9/7/R112 > > The methods appeared earlier in the proceedings of this year's PSB > conference. They've also provided MATLAB source code implementing > their analysis techniques: > http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling > > In reality, as a first approach, I'm probably just going to > implement (in R) their slightly modified version of M. Gerstein's > Sequence Quantile Normalization: > http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/8/988 > > and explore the signals I'm getting off different loci of the > genome. I'm hoping I can pull this off within the time constraints > of my current rotation project. If others find it useful, and I'm > successful, I'd be happy to share it back w/ the R/BioC community. > > In my adventures, I've created some SQLite database that you might > find useful: > > (i) A DB with all of the annotation data for the Dmel (v5) genome > (protein coding genes, RNA's, exonic positions, etc) from the GTF > file taken from ensembl: > ftp://ftp.ensembl.org/pub/current_gtf/ > > (ii) A DB with meta data about the tiling chips, this includes the > following tables: > (a) probe information as extracted from the bpmap file (sequence, x/ > y chip coordinates, pm/mm, etc) > (b) alignment information for every probe to the latest release of > the genome. For this I used MUMmer and had it return all alignments > >= 23 NTs in length (taking Wolfgang Huber's choice from his > tilingArray package) > (c) a table used for quick lookup of how many times each probe > aligns perfectly, and "almost perfectly" to the reference genome. > > I'm now building RData probe_annotation objects for each chromosome > that tie this information together. This data.frame essentially has: > > (i) probe information (sequence, x/y, pm/mm) > (ii) number of perfect/close hits for the probe > (iii) where in the genome the probe (perfectly) lands: gene name, > type of gene (protein coding, miRNA, snoRNA, rRNA, etc) and whether > it's exonic/intronic. > > If you think any of these would be helpful to you, let me know and > I'll try to put them up when I think they're reasonably correct. > > Like I said before, I'm probably not leveraging the BioC tools that > are available as I've gotten lost in the myriad of options (which > are good to have) that are there. > > -steve > > -- > Steve Lianoglou > Graduate Student: Physiology, Biophysics and Systems Biology > Weill Cornell Medical College > > http://cbio.mskcc.org/~lianos > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ====================================================================== Dr. Tobias Straub Adolf-Butenandt-Institute, Molecular Biology tel: +49-89-2180 75 439 Schillerstr. 44, 80336 Munich, Germany
ADD REPLY
0
Entering edit mode
Thanks for your replies. Tobias, I'm hoping it won't come to creating everything manually but if it does thanks for the code! Steve, we seem to be having the same problem. I can use ReadAffy, and can even specify the package that I create using makePDpackage (which I outlined here http://article.gmane.org/ gmane.science.biology.informatics.conductor/19012 ) but not use the AffyBatch for anything other than exprs(). In reference to using RMA in oligo I should have been clearer. I'm really only interested in background correction and normalisation. The summarisation step doesn't make as much sense. oligo seems to be able to do rma() using its TilingFeatureSet object which would presumably not perform summarisation. I haven't checked whether it's given the right answer, and that it hasn't summarised something that it shouldn't have, because I'd like to try something more involved like GCRMA correction but perhaps it's something you can consider. Thanks for the links you provided. I'm looking at the Matlab source code at the moment. I'll let you know how it goes. Regards, Ben. On 17/07/2008, at 2:22 AM, Tobias Straub wrote: > Hi, > you can extract the signals manually (watch out that takes some > time..). > > ##################################################### > > library(affy) > library(affxparser) > > #here the location of your bpmap > bpmap_file <- "Dm35b_MR_v02-3_BDGPv4h.bpmap" > bpmap <- readBpmap(bpmap_file) > summary(bpmap) > > # you get a list of chromosome mappings only a few of them are > Drosophila ones > # these you specify in here: > allchrs <- c("Dm:BDGPv4h;chr2L", > "Dm:BDGPv4h;chr2R","Dm:BDGPv4h;chr2h","Dm:BDGPv4h;chr3L", > "Dm:BDGPv4h;chr3R","Dm:BDGPv4h;chr3h","Dm:BDGPv4h;chr4", > "Dm:BDGPv4h;chr4h","Dm:BDGPv4h;chrU","Dm:BDGPv4h;chrX", > "Dm:BDGPv4h;chrXh","Dm:BDGPv4h;chrYh") > > cel_files <- c("file1.cel","file2.cel", > "file3.cel", "file4.cel", > "file5.cel", "file6.cel") > > cel_header <- readCelHeader(cel_files[1]) > cel_indices <- lapply(bpmap, function(x) xy2indices(x[["pmx"]], x > [["pmy"]], nr=cel_header[["cols"]])) > > chr<-allchrs[1] > data <- readCelIntensities(cel_files, cel_indices[[chr]]) > signals <- data > colnames(signals) <- cel_files > layout <- data.frame(chr= > factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq= > bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]], > stringsAsFactors = FALSE) > layout$probe.id <- paste(layout[,1], layout[,3], sep="_") > lm <- layout > > for (chr in allchrs[2:length(allchrs)]) { > data <- readCelIntensities(cel_files, cel_indices[[chr]]) > signals <-rbind(signals, data) > layout <- data.frame(chr= > factor(bpmap[[chr]][["seqInfo"]][["name"]]), seq= > bpmap[[chr]][["probeseq"]], pos=bpmap[[chr]][["startpos"]], > stringsAsFactors = FALSE) > layout$probe.id <- paste(layout[,1], layout[,3], sep="_") > lm <- rbind(lm, layout) > } > > rownames(signals)<-lm$probe.id > layout<-data.frame("probe.id"=lm$probe.id, "chr"=sub("chr","",lm > $chr), "pos"=lm$pos, "sequence"=lm$seq) > layout$length=c(25) > > #################################### > > you end up with a 'signal' matrix and a 'layout' dataframe > the signals you can easily normalize without having them in an > AffyBatch object. > > sth. like > > norm <- exprs(vsn2(signals, subsample=200000)) > > or whatever else > > best > Tobias > > > On Jul 16, 2008, at 4:01 PM, Steve Lianoglou wrote: > >> Hi Ben, >> >> I don't really have any answers for you, but I'm interested in >> hearing them myself as well :-) >> >> I'm making a few comments below ... I'm sorry, they turned out to >> be rather long and I'm not making any claim that I know exactly >> what I'm doing, I'm just trying what makes most sense to me at the >> moment. >> >> If anybody has any positive/negative feedback, I'd be happy to hear. >> >> On Jul 15, 2008, at 8:53 PM, Benjamin Lansdell wrote: >> >>> Hi, >>> >>> I have been trying to analyse drosophila development tiling >>> arrays using Bioconductor. In particular I would like to perform >>> some form of background correction for probe affinities (such as >>> GCRMA) and then quantile normalisation but have been unable to >>> find any packages that can do so without a .cdf file. I have a >>> collection of .cel files and a .bpmap file. >> >> I've been wrestling with 1.0R Drosophila tiling array myself and >> haven't really been able to use the AffyBatch objects I get (from >> reading their .CEL files with ReadAffy) with any success. By that >> I mean I haven't been able to then pass the batch object into any >> other processing step for further analysis/QA assessment, even >> though I think I made the appropriate CDFs correctly (more on that >> below) >> >> So, really the only thing I use the AffyBatch object is to just >> get the expression info from it: >> >> > exps <- exprs(myAffyBatch) >> >> You can then quantile normalize the data by just passing the exps >> expression matrix into the affy::normalize.quantiles function >> since it only expects a matrix. >> >> I've also done my own MvA plots for rudimentary QA. >> >>> I know the package oligo can perform RMA using a package that you >>> build from a .bpmap file but this isn't quite what I want. >> >> Can it? My impression was that RMA relies on the idea of "probe >> sets," which I don't think really applies to tiling arrays as >> much. I see that it does SNPRMA ... I don't really know what the >> details of that are, though. >> >> Certainly you can annotate your probes to construct your own probe >> sets, though, but you also have a majority of probes on the array >> that don't really belong to any probe set. >> >>> Is it possible to create the necessary structures myself >>> (AffyBatch, something that contains the probe sequences, etc) for >>> use with the many packages that seem to rely on a .cdf file? >> >> >> I've built the appropriate CDF-like structure (I think) by using >> the makePlatformDesign::makePDpackage, but still never really got >> it act like a CDF for my affybatch objects like Bioconductor >> expects. If you haven't done so, I've put mine up previously for >> someone else to download, and you can still grab them. They were >> built for both chips of the 1.0R version of the tiling array >> (forward and backward strand (aka MF and MR)) on a 64-bit Linux >> machine: >> >> http://cbio.mskcc.org/~lianos/files/bioconductor/ >> >> I've been manually attacking this data, probably doing more work >> than necessary since I can't figure out how to leverage most of >> bioconductor tools just yet. >> >> Ideally I'd like to take a similar approach to what was done here: >> http://genomebiology.com/2008/9/7/R112 >> >> The methods appeared earlier in the proceedings of this year's PSB >> conference. They've also provided MATLAB source code implementing >> their analysis techniques: >> http://www.fml.tuebingen.mpg.de/raetsch/projects/PSBTiling >> >> In reality, as a first approach, I'm probably just going to >> implement (in R) their slightly modified version of M. Gerstein's >> Sequence Quantile Normalization: >> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/ >> 23/8/988 >> >> and explore the signals I'm getting off different loci of the >> genome. I'm hoping I can pull this off within the time constraints >> of my current rotation project. If others find it useful, and I'm >> successful, I'd be happy to share it back w/ the R/BioC community. >> >> In my adventures, I've created some SQLite database that you might >> find useful: >> >> (i) A DB with all of the annotation data for the Dmel (v5) genome >> (protein coding genes, RNA's, exonic positions, etc) from the GTF >> file taken from ensembl: >> ftp://ftp.ensembl.org/pub/current_gtf/ >> >> (ii) A DB with meta data about the tiling chips, this includes the >> following tables: >> (a) probe information as extracted from the bpmap file (sequence, >> x/y chip coordinates, pm/mm, etc) >> (b) alignment information for every probe to the latest release >> of the genome. For this I used MUMmer and had it return all >> alignments >= 23 NTs in length (taking Wolfgang Huber's choice >> from his tilingArray package) >> (c) a table used for quick lookup of how many times each probe >> aligns perfectly, and "almost perfectly" to the reference genome. >> >> I'm now building RData probe_annotation objects for each >> chromosome that tie this information together. This data.frame >> essentially has: >> >> (i) probe information (sequence, x/y, pm/mm) >> (ii) number of perfect/close hits for the probe >> (iii) where in the genome the probe (perfectly) lands: gene name, >> type of gene (protein coding, miRNA, snoRNA, rRNA, etc) and >> whether it's exonic/intronic. >> >> If you think any of these would be helpful to you, let me know and >> I'll try to put them up when I think they're reasonably correct. >> >> Like I said before, I'm probably not leveraging the BioC tools >> that are available as I've gotten lost in the myriad of options >> (which are good to have) that are there. >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Physiology, Biophysics and Systems Biology >> Weill Cornell Medical College >> >> http://cbio.mskcc.org/~lianos >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/ >> gmane.science.biology.informatics.conductor > > ====================================================================== > Dr. Tobias Straub Adolf-Butenandt-Institute, Molecular Biology > tel: +49-89-2180 75 439 Schillerstr. 44, 80336 Munich, Germany > ====================================================================== > >
ADD REPLY

Login before adding your answer.

Traffic: 695 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6