gplots Heatmap

0

Entering edit mode

Polikepahad, Sumanth ▴ 10

@polikepahad-sumanth-5111

Last seen 9.6 years ago

Hi, I have analyzed my deep sequencing data with DESeq and successfully generated a heatmap showing differentially expressed miRNAs by using gplots heatmap.2. However, I want to put different colors to the fonts of miRNA names that are shown on y-axis, depending on whether they are up or down regulated. For example, I want to show the names of all down-regulated miRNAs in blue and up-regulated in Red. How do I do this? I guess I must use if else statement, but not sure if its the right way to do it. I am very new to R and I really appreciate any help. Thanks in advance. ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] On Behalf Of bioconductor-request@r-project.org [bioconductor-request@r-project.org] Sent: Friday, February 10, 2012 5:00 AM To: bioconductor at r-project.org Subject: Bioconductor Digest, Vol 108, Issue 10 Send Bioconductor mailing list submissions to bioconductor at r-project.org To subscribe or unsubscribe via the World Wide Web, visit https://stat.ethz.ch/mailman/listinfo/bioconductor or, via email, send a message with subject or body 'help' to bioconductor-request at r-project.org You can reach the person managing the list at bioconductor-owner at r-project.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Bioconductor digest..." Today's Topics: 1. Re: Limma: questions about data pre-processing (Vladimir Krasikov) 2. Re: DESeq and transcript-wise analysis (Nicolas Delhomme) 3. Re: Error message with RNAither (Martin Morgan) 4. GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou) 5. February-March 2012 ***R/S-PLUS Courses***by XLSolutions Corp at 9 USA locations (sue at xlsolutions-corp.com) 6. Re: DESeq and transcript-wise analysis (Thomas Girke) 7. Re: DESeq and transcript-wise analysis (Nicolas Delhomme) 8. Re: Limma: questions about data pre-processing (axel.klenk at actelion.com) 9. suggestions/comments on DESeq transcript wise analysis (Akula, Nirmala (NIH/NIMH) [C]) 10. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Tim Triche, Jr.) 11. Re: suggestions/comments on DESeq transcript wise analysis (Abhishek Pratap) 12. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou) 13. Optimisation (Simon No?l) 14. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Tim Triche, Jr.) 15. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou) 16. how can i convert a gse object(expression sets) to affybatch (Salwa Eid) 17. Import sequences from MacClade 4.* (Brian) 18. Re: DESeq and transcript-wise analysis (Elena Sorokin) 19. RE : maping SNPs (Simon No?l) 20. Re: RE : maping SNPs (Hervé Pagès) 21. RE : RE : maping SNPs (Simon No?l) 22. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Michael Lawrence) 23. Re: GenomicFeatures::makeTranscriptDbFromGTF? (Steve Lianoglou) 24. Removing probes with low variance across samples (Infinium 450k) (khadeeja ismail) 25. Re: RE : RE : maping SNPs (Hervé Pagès) 26. Using write.table with output from topTags [was: report a possible bug of edgeR] (Gordon K Smyth) 27. Re: DESeq and transcript-wise analysis (Ann Loraine) 28. Re: Removing probes with low variance across samples (Infinium 450k) (Tim Triche, Jr.) 29. Re: how can i convert a gse object(expression sets) to affybatch (Sean Davis) 30. ConsensusClusterPlus extracting features for clusters (somnath bandyopadhyay) 31. limma barcodeplot with 2 groups (Dario Strbenac) 32. bioMart (Bogdan Tanasa) 33. Inconsistent coefficient values (Richard Coulson [guest]) 34. a problem in reading in cel files (Manuela Di Russo) 35. Re: NormqPCR and ReadqPCR (James Perkins) ---------------------------------------------------------------------- Message: 1 Date: Thu, 09 Feb 2012 13:47:59 +0100 From: "Vladimir Krasikov" <v.v.krasikov@gmail.com> To: axel.klenk at actelion.com Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org Subject: Re: [BioC] Limma: questions about data pre-processing Message-ID: <op.v9ewh9grb6kgy0 at="" u028550.uva.nl=""> Content-Type: text/plain; charset=iso-8859-15; format=flowed; delsp=yes Dear Axel Once again thanks... Q2: The only thing I know now is that it was rather new Agilent edition of March 2011, however our company stripped away all information in files ( even removed all control spots). Do you think there is still a way to make visualizations? Q5: After reading Rquantile description I now see some rationale about this normalization, when all Red chanels contoined common reference (which is commercial "universal human reference"). However, question remains, what kind of plots, metrics are useful to judge the results of normalizations? On Tue, 07 Feb 2012 15:32:03 +0100, <axel.klenk at="" actelion.com=""> wrote: > Dear Vladimir, > > I'll only answer or comment on some of your questions and leave > the others for the true experts... > > Q2: yes, for example using package arrayQualityMetrics, if you know > the array layout. FES output usually contains columns Col and Row for > spot coordinates but apparently your "service provider" has removed > them. I could send you a coordinates <--> oligo mapping by email if you > can tell me your array type -- is it 1x44K, 4x44K or 4x44Kv2? > Alternatively, > you can try to find that information on Agilent's eArray web site: > earray.chem.agilent.com > > Q5: for a common reference design, dye swaps are not required and > I would not apply a loess normalization -- depending on what you have > hybridized as the common reference, the assumptions may not hold. > As for the between-array normalization, Rquantile may also be an > option for your design and boxplots and density plots may be used > for judging the results. > > Cheers, > > - axel > > > Axel Klenk > Research Informatician > Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / > Switzerland > > > > > From: > "Vladimir Krasikov" <v.v.krasikov at="" gmail.com=""> > To: > bioconductor at r-project.org > Date: > 07.02.2012 14:27 > Subject: > [BioC] Limma: questions about data pre-processing > Sent by: > bioconductor-bounces at r-project.org > > > > Dear limma experts > > During creating the pipe-line for dissecting differential gene expression > in frame of limma, > several questions have arose. > > Experiment: > I have 62 two-color Agilent human arrays. > The samples are from several human more or less related to each other > disorders and vary in age, sex, disease duration and diagnosis. > Company that made hybridizations performed all hybs in one direction (no > dye-swaps), > where all samples were in G channel and common Ref in R channel, > and unfortunately provided us only excepts of Feature Extraction > which contained info on G, Gb, R, Rb, and FNO (non-uniformity outliers) > and separate gene annotation table. > > I performed generic import of the data and assigned zero-weight to the > FNO > spots: > I analyzed density and MA-plots, box-plots of M-values, G and R channels > and box-plots of background intensities, > and removed from experiment 1 array with aberrant raw G-channel density. > (I will discuss experiment description later, when come to the linear > model) > > Q1: Is there a rationale of down-weighting FNO (around 100-200 spots per > array) for background correction and further normalization? > Q2: Is there way to make image representation of Agilent microarray (for > each channel and backgrounds)? > In another words is there known 'layout' for human 44K Agilent? > > Next I corrected the background with: >> RG.b <- backgroundCorrect(RG.raw, method="minimum", offset=50) > (recommended method=normexp produced shifted curves for five arrays after > taking a look on density plots, > and box-plots for separate G and R channels also look less uniform as > compared with 'minimum' method) > > Q3: I guess it is also possible to remove those 5 arrays from the > experiment. Is it fair? > Q4: What kind of reasoning should be used for the choice between > background subtraction methods? > > Then performed standard loess within array normalization: >> MA.loess <- normalizeWithinArrays(RG.b, method="loess",bc.method="none") > > Q5: Do I need to perform between array normalization? > How to judge which of the methods (non, scale, quantile, Aquantile) > is > best for my experiment? > > For now I decide to stuck with background=minimum, within=loess, and > between=is under the question > > Next I would like to ask questions about > linear model of my experiment, but I will make it in a next help request > > Thanks a lot in advance > > and finally >> sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 > [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.10.2 >> > > With kind regards > Vladimir > -- > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > The information of this email and in any file transmitted with it is > strictly confidential and may be legally privileged. > It is intended solely for the addressee. If you are not the intended > recipient, any copying, distribution or any other use of this email is > prohibited and may be unlawful. In such case, you should please notify > the sender immediately and destroy this email. > The content of this email is not legally binding unless confirmed by > letter. > Any views expressed in this message are those of the individual sender, > except where the message states otherwise and the sender is authorised > to state them to be the views of the sender's company. For further > information about Actelion please see our website at > http://www.actelion.com > -- ------------------------------ Message: 2 Date: Thu, 9 Feb 2012 14:38:02 +0100 From: Nicolas Delhomme <delhomme@embl.de> To: Abhishek Pratap <apratap at="" lbl.gov=""> Cc: bioconductor at r-project.org Subject: Re: [BioC] DESeq and transcript-wise analysis Message-ID: <36CC7B64-054F-4E9C-8CEB-190A1D857D5C at embl.de> Content-Type: text/plain; charset=us-ascii Dear Abhi, If you could point me to some published strand specific data or let me get an excerpt of yours, I could easily had strand-specificity in the easyRNASeq package. Thanks, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 9 Feb 2012, at 00:41, Abhishek Pratap wrote: > Hi Elena > > Good timing with me on this. I recently was contemplating the best way > to move forward for a similar analysis. HTSeq a python based toolkit > by Simon can help you do the counting. FYI : It can also take strand > info into account. If you dont have stranded data you could also look > at easyrnaseq package. > > So if you have an annotation file like gff/gtf with the isoform > information you could then do the read counting at isoform or gene > level based on which attribute of the gff file you select to do the > counting. Check out > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html. > > Also you want to keep in mind that at isoform level you would be > double counting the reads in exons which are shared in the isoforms > which can bias your results to some extent. But as Wolfgang pointed > out in a recent post if you use FDR, it should not matter a lost as > the bias will be cancelled between denominator /numerator. > > You also might want to check the DEXSeq which can help infer > differential expression from RNA-Seq exons which could then be related > back to genes/isoforms. > > Hope this helps and let us know about your progress. I would be > interested in learning from your experience too. > > Cheers! > -Abhi > > ---------------------------------- > Abhishek Pratap > Bioinformatics Systems Analyst - 3 > DOE- Joint Genome Institute > Lawrence Berkeley National Lab > > > > > On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at="" wisc.edu=""> wrote: >> Greetings all, >> >> After re-reading related posts in the listserv archive, I still didn't know >> the exact answer to my question, so here goes. I'd like to use DESeq to >> measure differential isoform expression. Has Simon or anybody else written a >> script that will convert aligned reads (.bam/.sam file) into a table of >> isoform counts, suitable for input to DESEq - similar to what Simon has done >> at the gene-wise level, but instead for making a table of counts by isoform? >> >> I would try to do this myself, but I'm a novice at programming. Sorry if >> this has been answered elsewhere... If so, please let me know the link. >> >> Thanks, >> Elena >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 3 Date: Thu, 09 Feb 2012 06:10:37 -0800 From: Martin Morgan <mtmorgan@fhcrc.org> To: Catherine Garry <cagarry at="" tcd.ie=""> Cc: bioconductor at r-project.org Subject: Re: [BioC] Error message with RNAither Message-ID: <4F33D3DD.7040303 at fhcrc.org> Content-Type: text/plain; charset=windows-1252; format=flowed On 02/09/2012 02:54 AM, Catherine Garry wrote: > Hi Dan, > > When i tried all of this, this is what i got: > >> source("http://www.bioconductor.org/biocLite.R") > BiocInstaller version 1.2.1, ?biocLite for help >> biocLite("RNAither") > BioC_mirror: 'http://www.bioconductor.org' > Using R version 2.14, BiocInstaller version 1.2.1. > Installing package(s) 'RNAither' > Installing package(s) into ?C:/Users/Catherine/Documents/R/win- library/2.14? > (as ?lib? is unspecified) > trying URL ' > http://www.bioconductor.org/packages/2.9/bioc/bin/windows/contrib/2. 14/RNAither_2.2.0.zip > ' > Content type 'application/zip' length 1414362 bytes (1.3 Mb) > opened URL > downloaded 1.3 Mb > package ?RNAither? successfully unpacked and MD5 sums checked > The downloaded packages are in > C:\Users\Catherine\AppData\Local\Temp\RtmpETRn43\downloaded_packages > Warning message: > 'boot' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'foreign' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'KernSmooth' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'Matrix' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'mgcv' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'nlme' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'rpart' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'survival' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable >> biocLite(character(), ask=FALSE) > BioC_mirror: 'http://www.bioconductor.org' > Using R version 2.14, BiocInstaller version 1.2.1. > Warning message: > 'boot' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'foreign' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'KernSmooth' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'Matrix' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'mgcv' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'nlme' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'rpart' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable > 'survival' cannot be updated, installed directory 'C:/Program > Files/R/R-2.14.0/library' not writeable >> > Any ideas on how to recify this? These are warnings, and can be ignored; you could update your R installation to R-2.14.1 but that is not strictly necessary. The good news is > package ?RNAither? successfully unpacked and MD5 sums checked do you still have the problems >>> library("RNAither"). but i also get an error here: >>> >>> Loading required package: robustbase >>> Error: package ?robustbase? could not be loaded >>> In addition: Warning messages: >>> 1: package ?AnnotationDbi? was built under R version 2.14.1 >>> 2: package ?SparseM? was built under R version 2.14.1 >>> 3: In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc ? Martin > > On 8 February 2012 19:13, Dan Tenenbaum<dtenenba at="" fhcrc.org=""> wrote: > >> On Wed, Feb 8, 2012 at 10:58 AM, Catherine Garry<cagarry at="" tcd.ie=""> wrote: >>> Hi, >>> >>> I keep getting an error message when i try to generate the dataset file >> in >>> RNAither. This is the error i keep getting. >>> >>>> generateDatasetFile("DGS", "DGScreen in cells", >>> + NA_character_, "RNAither_output_Rep1.txt", plateLayout1, plateLayout2, >>> + 8, 12, 1, emptyWells, poorWells, controlCoordsOutput, >>> backgroundValOutput, cellnumOutput) >>> >>> Error: could not find function "generateDatasetFile" >>> Can anyone help explain why this is happening? >>> At the beginning, as usual, i add in: >>> >>> library("RNAither"). but i also get an error here: >>> >>> Loading required package: robustbase >>> Error: package ?robustbase? could not be loaded >>> In addition: Warning messages: >>> 1: package ?AnnotationDbi? was built under R version 2.14.1 >>> 2: package ?SparseM? was built under R version 2.14.1 >>> 3: In library(pkg, character.only = TRUE, logical.return = TRUE, lib.loc >> = >>> lib.loc) : >>> there is no package called ?robustbase? >>> >>> I also used: >>> >>> source("http://www.bioconductor.org/biocLite.R") >>> biocLite("RNAither") >>> >>> , to install all the files etc. >>> >>> Can anyone help me rectify whatever is happening? >> >> It would be helpful to see the output of sessionInfo() so we can see >> exactly what packages you have, and a little bit about your system >> configuration. >> >> Without seeing it, I can guess that you have some out of date >> packages, and you can fix this as follows: >> >> biocLite(character(), ask=FALSE) >> That will update ALL Bioconductor and CRAN packages on your system >> which are outdated. >> >> It would also be good to see the complete output of >> biocLite("RNAither"). >> >> Thanks, >> Dan >> >> >>> >>> Thank you. >>> >>> [[alternative HTML version deleted]] >>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 ------------------------------ Message: 4 Date: Thu, 9 Feb 2012 10:55:29 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <caha9mcpberwps+eexrtqbo0mbhyaod1wljfgho9kxk6uewxecg at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi, I thought it would be handy to make a GenomicFeatures::TranscriptDb from a gtf file and was somehow surprised to see that I couldn't find such a function in GenomicFeatures. I'm happy to whip up such a function, but wanted to check in to make sure I'm not missing something like (1) you can already do it; or (2) it's not as straightforward as I think; (3) maybe it's there and I'm not looking hard enough. Right now I just want to build it on the gff that the knew versions of tophat build when you pass in a gtf/gff file of known transcripts (the --G/--GTF cmd line arg), but I think it'd be handy overall. I don't think gff/gtf's have any column for exon ordering, though, in which case I'd just assume the exons are ordered linearly (bye bye trans-splicing)). Good idea? Bad idea? Already done? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 5 Date: Thu, 9 Feb 2012 08:56:06 -0700 From: "sue@xlsolutions-corp.com" <sue@xlsolutions-corp.com> To: <bioconductor at="" stat.math.ethz.ch=""> Subject: [BioC] February-March 2012 ***R/S-PLUS Courses***by XLSolutions Corp at 9 USA locations Message-ID: <20120209085606.aa8924c5d28ca71e2a043bb294e795eb.a4e946a69b.wbe at email04.secureserver.net> Content-Type: text/plain; charset="utf-8" Happy New Year ! XLSolutions February-March 2012 R/S-PLUS courses schedule is now available online at 9 USA cities for with 13 new courses: *** Suggest a future course date/city (1) R-PLUS: A Point-and-Click Approach to R (2) S-PLUS / R : Programming Essentials. (3) R/S+ Fundamentals and Programming Techniques (4) R/S-PLUS Functions by Example. (5) S/R-PLUS Programming 3: Advanced Techniques and Efficiencies. (6) R/S+ System: Advanced Programming. (7) R/S-PLUS Graphics: Essentials. (8) R/S-PLUS Graphics for SAS Users (9) R/S-PLUS Graphical Techniques for Marketing Research. (10) Multivariate Statistical Methods in R/S-PLUS: Practical Research Applications (11) Introduction to Applied Econometrics with R/S-PLUS (12) Exploratory Analysis for Large and Complex Problems in R/S-PLUS (13) Determining Power and Sample Size Using R/S-PLUS. (14) R/S-PLUS: Data Preparation for Data Mining (15) Data Cleaning Techniques in R/S-PLUS (16) R/S-PLUS: Applied Clustering Techniques More on website http://www.xlsolutions-corp.com/rplus.asp Ask for group discount and reserve your seat Now - Earlybird Rates. Payment due after the class! Email Sue Turner: sue at xlsolutions-corp.com Phone: 206-686-1578 Please let us know if you and your colleagues are interested in this class to take advantage of group discount. Register now to secure your seat. Cheers, Elvis Miller, PhD Manager Training. XLSolutions Corporation 206 686 1578 www.xlsolutions-corp.com elvis at xlsolutions-corp.com ------------------------------ Message: 6 Date: Thu, 9 Feb 2012 08:21:29 -0800 From: Thomas Girke <thomas.girke@ucr.edu> To: Nicolas Delhomme <delhomme at="" embl.de=""> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] DESeq and transcript-wise analysis Message-ID: <20120209162129.GA4433 at Thomas-Girkes-MacBook-Pro.local> Content-Type: text/plain; charset=us-ascii This study contains some strand specific RNA-Seq data: http://www.hubmed.org/display.cgi?uids=18423832 I would expect that most RNA-Seq experiments in the near future may be performed in a strand-specific manner, since the strand information carries a lot of biologically relevant information in this application domain. Thus, adding analysis support for it is definitely not a waste of time. I have not used easyRNASeq yet, but I will certainly give it a try. In my group we currently use for RNA-Seq analysis the following components: Rsubread (or tophat) -> rtracklayer/Rsamtools/GenomicRanges -> DESeq/edgeR. This allows any type strand and non-strand specific read counts for exons, transcripts, genes, intergenic features, etc. A huge advantage of this environment is its flexibility and broad application spectrum for most applications domains in the NGS field, such as SNP-Seq, ChiP-Seq, smallRNA-Seq, etc. For instance, our ChIP-Seq analysis routines use most of these tools plus some peak callers. Thomas On Thu, Feb 09, 2012 at 01:38:02PM +0000, Nicolas Delhomme wrote: > Dear Abhi, > > If you could point me to some published strand specific data or let me get an excerpt of yours, I could easily had strand-specificity in the easyRNASeq package. > > Thanks, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 9 Feb 2012, at 00:41, Abhishek Pratap wrote: > > > Hi Elena > > > > Good timing with me on this. I recently was contemplating the best way > > to move forward for a similar analysis. HTSeq a python based toolkit > > by Simon can help you do the counting. FYI : It can also take strand > > info into account. If you dont have stranded data you could also look > > at easyrnaseq package. > > > > So if you have an annotation file like gff/gtf with the isoform > > information you could then do the read counting at isoform or gene > > level based on which attribute of the gff file you select to do the > > counting. Check out > > http://www-huber.embl.de/users/anders/HTSeq/doc/count.html. > > > > Also you want to keep in mind that at isoform level you would be > > double counting the reads in exons which are shared in the isoforms > > which can bias your results to some extent. But as Wolfgang pointed > > out in a recent post if you use FDR, it should not matter a lost as > > the bias will be cancelled between denominator /numerator. > > > > You also might want to check the DEXSeq which can help infer > > differential expression from RNA-Seq exons which could then be related > > back to genes/isoforms. > > > > Hope this helps and let us know about your progress. I would be > > interested in learning from your experience too. > > > > Cheers! > > -Abhi > > > > ---------------------------------- > > Abhishek Pratap > > Bioinformatics Systems Analyst - 3 > > DOE- Joint Genome Institute > > Lawrence Berkeley National Lab > > > > > > > > > > On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at="" wisc.edu=""> wrote: > >> Greetings all, > >> > >> After re-reading related posts in the listserv archive, I still didn't know > >> the exact answer to my question, so here goes. I'd like to use DESeq to > >> measure differential isoform expression. Has Simon or anybody else written a > >> script that will convert aligned reads (.bam/.sam file) into a table of > >> isoform counts, suitable for input to DESEq - similar to what Simon has done > >> at the gene-wise level, but instead for making a table of counts by isoform? > >> > >> I would try to do this myself, but I'm a novice at programming. Sorry if > >> this has been answered elsewhere... If so, please let me know the link. > >> > >> Thanks, > >> Elena > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 7 Date: Thu, 9 Feb 2012 17:31:06 +0100 From: Nicolas Delhomme <delhomme@embl.de> To: Thomas Girke <thomas.girke at="" ucr.edu=""> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] DESeq and transcript-wise analysis Message-ID: <1FFF4CFC-2C52-425D-80AC-56FF777BB562 at embl.de> Content-Type: text/plain; charset=us-ascii Hi Thomas, On 9 Feb 2012, at 17:21, Thomas Girke wrote: > This study contains some strand specific RNA-Seq data: > http://www.hubmed.org/display.cgi?uids=18423832 > Thanks for the pointer. > I would expect that most RNA-Seq experiments in the near future may be > performed in a strand-specific manner, since the strand information > carries a lot of biologically relevant information in this application > domain. Thus, adding analysis support for it is definitely not a waste > of time. Clearly. I would have already done had I had the time. > > I have not used easyRNASeq yet, but I will certainly give it a try. Let me know when you do. > > In my group we currently use for RNA-Seq analysis the following > components: Rsubread (or tophat) -> rtracklayer/Rsamtools/GenomicRanges > -> DESeq/edgeR. This allows any type strand and non-strand specific read > counts for exons, transcripts, genes, intergenic features, etc. A huge > advantage of this environment is its flexibility and broad application > spectrum for most applications domains in the NGS field, such as > SNP-Seq, ChiP-Seq, smallRNA-Seq, etc. For instance, our ChIP-Seq > analysis routines use most of these tools plus some peak callers. Exactly why I have been developing and using easyRNASeq as well. Cheers, Nico > > Thomas > > On Thu, Feb 09, 2012 at 01:38:02PM +0000, Nicolas Delhomme wrote: >> Dear Abhi, >> >> If you could point me to some published strand specific data or let me get an excerpt of yours, I could easily had strand-specificity in the easyRNASeq package. >> >> Thanks, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On 9 Feb 2012, at 00:41, Abhishek Pratap wrote: >> >>> Hi Elena >>> >>> Good timing with me on this. I recently was contemplating the best way >>> to move forward for a similar analysis. HTSeq a python based toolkit >>> by Simon can help you do the counting. FYI : It can also take strand >>> info into account. If you dont have stranded data you could also look >>> at easyrnaseq package. >>> >>> So if you have an annotation file like gff/gtf with the isoform >>> information you could then do the read counting at isoform or gene >>> level based on which attribute of the gff file you select to do the >>> counting. Check out >>> http://www-huber.embl.de/users/anders/HTSeq/doc/count.html. >>> >>> Also you want to keep in mind that at isoform level you would be >>> double counting the reads in exons which are shared in the isoforms >>> which can bias your results to some extent. But as Wolfgang pointed >>> out in a recent post if you use FDR, it should not matter a lost as >>> the bias will be cancelled between denominator /numerator. >>> >>> You also might want to check the DEXSeq which can help infer >>> differential expression from RNA-Seq exons which could then be related >>> back to genes/isoforms. >>> >>> Hope this helps and let us know about your progress. I would be >>> interested in learning from your experience too. >>> >>> Cheers! >>> -Abhi >>> >>> ---------------------------------- >>> Abhishek Pratap >>> Bioinformatics Systems Analyst - 3 >>> DOE- Joint Genome Institute >>> Lawrence Berkeley National Lab >>> >>> >>> >>> >>> On Wed, Feb 8, 2012 at 3:26 PM, Elena Sorokin <sorokin at="" wisc.edu=""> wrote: >>>> Greetings all, >>>> >>>> After re-reading related posts in the listserv archive, I still didn't know >>>> the exact answer to my question, so here goes. I'd like to use DESeq to >>>> measure differential isoform expression. Has Simon or anybody else written a >>>> script that will convert aligned reads (.bam/.sam file) into a table of >>>> isoform counts, suitable for input to DESEq - similar to what Simon has done >>>> at the gene-wise level, but instead for making a table of counts by isoform? >>>> >>>> I would try to do this myself, but I'm a novice at programming. Sorry if >>>> this has been answered elsewhere... If so, please let me know the link. >>>> >>>> Thanks, >>>> Elena >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 8 Date: Thu, 9 Feb 2012 18:02:40 +0100 From: axel.klenk@actelion.com To: "Vladimir Krasikov" <v.v.krasikov at="" gmail.com=""> Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org Subject: Re: [BioC] Limma: questions about data pre-processing Message-ID: <11537_1328807013_4F33FC65_11537_9013_1_OF849F3859.3DF5C743-ON C125799F.005B5234-C125799F.005DB562 at actelion.com> Content-Type: text/plain; charset="US-ASCII" Dear Vladimir, sorry for the late reply... I'll give it a try and hope some true expert will correct me if it is nonsense... :-) Q2: rather new in 2011 would mean *probably* 4x44Kv2... the type should be available in the file header if you still have one (?) or otherwise one could guess it from the set of identifiers in column "ProbeName" if you still have one... can you make one file available via web or ftp for a quick look? Visualization should still be feasible, with missing spots missing, of course, and in this case it's a pity the positive controls are missing... IIRC, Agilent FES does produce these plots for their QC -- but I suppose your company did not include them? Q5: ok, so same or similar common reference we are using... and to be useful it should give a reasonable signal for (almost) all probes on the array which is the whole genome -- but only a proportion of these will be expressed in any real biological sample which is why I think that a) MA plots will look pretty unusual for these arrays and b) LOESS normalization will seemingly fix that but actually distort your data. As for the choice of normalization method, since all normalization steps bear the risk of "normalizing" away the biological signal you're interested in, you should do only as much as necessary, using the least stringent method that will produce proper diagnostic plots. For comparison between arrays, density and box plots would be appropriate. I realize this is probably too general to be useful :-) maybe the literature referenced in limma's ?normalizeWithinArrays and ?normalizeBetweenArrays can be of any help? Cheers, - axel Axel Klenk Research Informatician Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / Switzerland From: "Vladimir Krasikov" <v.v.krasikov at="" gmail.com=""> To: axel.klenk at actelion.com Cc: bioconductor at r-project.org, bioconductor-bounces at r-project.org Date: 09.02.2012 13:47 Subject: Re: [BioC] Limma: questions about data pre-processing Dear Axel Once again thanks... Q2: The only thing I know now is that it was rather new Agilent edition of March 2011, however our company stripped away all information in files ( even removed all control spots). Do you think there is still a way to make visualizations? Q5: After reading Rquantile description I now see some rationale about this normalization, when all Red chanels contoined common reference (which is commercial "universal human reference"). However, question remains, what kind of plots, metrics are useful to judge the results of normalizations? On Tue, 07 Feb 2012 15:32:03 +0100, <axel.klenk at="" actelion.com=""> wrote: > Dear Vladimir, > > I'll only answer or comment on some of your questions and leave > the others for the true experts... > > Q2: yes, for example using package arrayQualityMetrics, if you know > the array layout. FES output usually contains columns Col and Row for > spot coordinates but apparently your "service provider" has removed > them. I could send you a coordinates <--> oligo mapping by email if you > can tell me your array type -- is it 1x44K, 4x44K or 4x44Kv2? > Alternatively, > you can try to find that information on Agilent's eArray web site: > earray.chem.agilent.com > > Q5: for a common reference design, dye swaps are not required and > I would not apply a loess normalization -- depending on what you have > hybridized as the common reference, the assumptions may not hold. > As for the between-array normalization, Rquantile may also be an > option for your design and boxplots and density plots may be used > for judging the results. > > Cheers, > > - axel > > > Axel Klenk > Research Informatician > Actelion Pharmaceuticals Ltd / Gewerbestrasse 16 / CH-4123 Allschwil / > Switzerland > > > > > From: > "Vladimir Krasikov" <v.v.krasikov at="" gmail.com=""> > To: > bioconductor at r-project.org > Date: > 07.02.2012 14:27 > Subject: > [BioC] Limma: questions about data pre-processing > Sent by: > bioconductor-bounces at r-project.org > > > > Dear limma experts > > During creating the pipe-line for dissecting differential gene expression > in frame of limma, > several questions have arose. > > Experiment: > I have 62 two-color Agilent human arrays. > The samples are from several human more or less related to each other > disorders and vary in age, sex, disease duration and diagnosis. > Company that made hybridizations performed all hybs in one direction (no > dye-swaps), > where all samples were in G channel and common Ref in R channel, > and unfortunately provided us only excepts of Feature Extraction > which contained info on G, Gb, R, Rb, and FNO (non-uniformity outliers) > and separate gene annotation table. > > I performed generic import of the data and assigned zero-weight to the > FNO > spots: > I analyzed density and MA-plots, box-plots of M-values, G and R channels > and box-plots of background intensities, > and removed from experiment 1 array with aberrant raw G-channel density. > (I will discuss experiment description later, when come to the linear > model) > > Q1: Is there a rationale of down-weighting FNO (around 100-200 spots per > array) for background correction and further normalization? > Q2: Is there way to make image representation of Agilent microarray (for > each channel and backgrounds)? > In another words is there known 'layout' for human 44K Agilent? > > Next I corrected the background with: >> RG.b <- backgroundCorrect(RG.raw, method="minimum", offset=50) > (recommended method=normexp produced shifted curves for five arrays after > taking a look on density plots, > and box-plots for separate G and R channels also look less uniform as > compared with 'minimum' method) > > Q3: I guess it is also possible to remove those 5 arrays from the > experiment. Is it fair? > Q4: What kind of reasoning should be used for the choice between > background subtraction methods? > > Then performed standard loess within array normalization: >> MA.loess <- normalizeWithinArrays(RG.b, method="loess",bc.method="none") > > Q5: Do I need to perform between array normalization? > How to judge which of the methods (non, scale, quantile, Aquantile) > is > best for my experiment? > > For now I decide to stuck with background=minimum, within=loess, and > between=is under the question > > Next I would like to ask questions about > linear model of my experiment, but I will make it in a next help request > > Thanks a lot in advance > > and finally >> sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 > [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] limma_3.10.2 >> > > With kind regards > Vladimir > -- > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > The information of this email and in any file transmitted with it is > strictly confidential and may be legally privileged. > It is intended solely for the addressee. If you are not the intended > recipient, any copying, distribution or any other use of this email is > prohibited and may be unlawful. In such case, you should please notify > the sender immediately and destroy this email. > The content of this email is not legally binding unless confirmed by > letter. > Any views expressed in this message are those of the individual sender, > except where the message states otherwise and the sender is authorised > to state them to be the views of the sender's company. For further > information about Actelion please see our website at > http://www.actelion.com > -- Using Opera's revolutionary email client: http://www.opera.com/mail/ The information of this email and in any file transmitted with it is strictly confidential and may be legally privileged. It is intended solely for the addressee. If you are not the intended recipient, any copying, distribution or any other use of this email is prohibited and may be unlawful. In such case, you should please notify the sender immediately and destroy this email. The content of this email is not legally binding unless confirmed by letter. Any views expressed in this message are those of the individual sender, except where the message states otherwise and the sender is authorised to state them to be the views of the sender's company. For further information about Actelion please see our website at http://www.actelion.com ------------------------------ Message: 9 Date: Thu, 9 Feb 2012 12:06:30 -0500 From: "Akula, Nirmala (NIH/NIMH) [C]" <akulan@mail.nih.gov> To: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: [BioC] suggestions/comments on DESeq transcript wise analysis Message-ID: <d56d6a723c413c4db8e608ed137ec6a20d46536cfc at="" nihmlbx12.nih.gov=""> Content-Type: text/plain Based on the previous threads when using DESeq the reads should not be double counted. I am following pipeline for RNA-seq analysis and would like to know any suggestions/comments regarding the pipeline: 1. Mapping the reads using Tophat 2. Convert Tophat output.bam to Sam 3. Create bed file from Sam file 4. Use CoverageBed along with reference genome for counting the reads 5. Sum count of reads from all exons in a transcript 6. DESeq to analyze the counts/transcript Thank you very much Nirmala [[alternative HTML version deleted]] ------------------------------ Message: 10 Date: Thu, 9 Feb 2012 09:50:37 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> To: Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <cac+n9bxh2d_ywvcr=ofgo=kwa6ixpfeixxacbjesc77az1jzqg at="" mail.gmail.com=""> Content-Type: text/plain Along these lines, I took Kasper Hansen's df2GR() function and tarted it up a bit, wrote a GR2df() function, and added some hooks to store the entire works in a trivial (4-table plus views) database schema. This stores descriptions of what each track/range means, where it came from, and when it was added, not unlike biomaRt. (I had trouble finding "pickling" functions for tracks and ranges so I rolled my own) It's not perfect (because I don't yet understand how to "bolt on" the appropriate BSgenome so that out-of-memory sequence access is performed the way I want it to) but it works well enough that I have started migrating other types of data (segmented copynumber, gene-, splice-, exon-wise RNAseq, and BS-seq) into freeze-dried GRanges. If there is a better container schema in GenomicFeatures::TranscriptDb, I'll switch to it today... And since it's fresh in my mind, how does one automatically attach the appropriate BSgenome to a GenomicRanges? It seems like it should be trivial, and the documentation hints at it, but I haven't succeeded yet in doing this automatically. I store the assembly from which a given GRanges object's features were aligned, so if I can figure out the syntax, it's done. thanks all, --t On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi, > > I thought it would be handy to make a GenomicFeatures::TranscriptDb > from a gtf file and was somehow surprised to see that I couldn't find > such a function in GenomicFeatures. > > I'm happy to whip up such a function, but wanted to check in to make > sure I'm not missing something like (1) you can already do it; or (2) > it's not as straightforward as I think; (3) maybe it's there and I'm > not looking hard enough. > > Right now I just want to build it on the gff that the knew versions of > tophat build when you pass in a gtf/gff file of known transcripts (the > --G/--GTF cmd line arg), but I think it'd be handy overall. > > I don't think gff/gtf's have any column for exon ordering, though, in > which case I'd just assume the exons are ordered linearly (bye bye > trans-splicing)). > > Good idea? Bad idea? Already done? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 11 Date: Thu, 9 Feb 2012 10:01:08 -0800 From: Abhishek Pratap <apratap@lbl.gov> To: "Akula, Nirmala (NIH/NIMH) [C]" <akulan at="" mail.nih.gov=""> Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] suggestions/comments on DESeq transcript wise analysis Message-ID: <ca+7hxbx4lgvrbkktcrgvvx4jr=xmsv7ovkoy1=hcuoby66gh9g at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Unless you are counting reads on the unique disjoin set of exons from the transcripts , double counting will be inevitable as most of the transcripts for a gene will have overlapping exons. You can also see Simon's post today on a different thread about the same issue. He explains in more detail why one would want to stay with gene level or exon level differential expression and relate it back to gene/isoforms. -Abhi On Thu, Feb 9, 2012 at 9:06 AM, Akula, Nirmala (NIH/NIMH) [C] <akulan at="" mail.nih.gov=""> wrote: > Based on the previous threads when using DESeq the reads should not be double counted. I am following pipeline for RNA-seq analysis and would like to know any suggestions/comments regarding the pipeline: > > > ?1. ?Mapping the reads using Tophat > ?2. ?Convert Tophat output.bam to Sam > ?3. ?Create bed file from Sam file > ?4. ?Use CoverageBed along with reference genome for counting the reads > ?5. ?Sum count of reads from all exons in a transcript > ?6. ?DESeq to analyze the counts/transcript > > Thank you very much > Nirmala > > > > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 12 Date: Thu, 9 Feb 2012 13:17:29 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> To: ttriche at usc.edu Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <caha9mcmhsph4cs6neutym2yo4dlqywg551wtk7+-nefhmb2sxg at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi Tim, A quick answer to one of your questions, specifically: > And since it's fresh in my mind, how does one automatically attach the > appropriate BSgenome to a GenomicRanges? ?It seems like it should be > trivial, and the documentation hints at it, but I haven't succeeded yet in > doing this automatically. I have a generic method defined somewhere in one of my utility called `getBsGenome` which loads the appropriate genome of choice. It's not bullet proof, but is handy. The genome is identified by its annotation/accession/build/release, whatever you call it, where I mostly go by UCSC conveintions, ie. 'hg19', 'mm9', etc. I use it like this bsg <- getBsGenome('hg19') And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return the `Hsapiens` object (that is also, now, in your workspace). I define methods for different classes that I have that basically end up calling down to the base function I pasted below. Assuming you are on bioc2.9, a potential GenomicRanges impl of the method would look like so (untested): setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) { g <- unique(genome(x)) if (!is.character(g) || length(g) != 1L || nchar(g) < 1) { stop("Expected a single, valid genome identifier in seqinfo slot") } getBsGenome(g, ...) }) You would have to add more cases in the switch statement of the base function when you want to expand your reportoire of organisms: setMethod("getBsGenome", c(x="character"), function(x, organism=NULL, anno.source='UCSC', ...) { lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:' if (is.null(organism)) { organism <- switch(substring(x, 1, 2), hg='Hsapiens', mm='Mmusculus', sa='Scerevisiae', dm='Dmelanogaster', rn='Rnorvegicus', ce='Celegans', stop("Unknown genome", x, sep=" ")) } lib.name <- gsub(':organism:', organism, lib.name) lib.name <- gsub(':anno.source:', anno.source, lib.name) lib.name <- gsub(':genome:', x, lib.name) suppressPackageStartupMessages({ found <- requirelib.name, character.only=TRUE) }) if (!found) { stoplib.name, " package required.") } get(organism, pos=paste('package', lib.name, sep=":")) }) HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 13 Date: Thu, 9 Feb 2012 13:37:29 -0500 From: Simon No?l <simon.noel.2@ulaval.ca> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: [BioC] Optimisation Message-ID: <1C970CA8E580A84394CE5CF42FEC0046E2CBDCC106 at EXCH- MBX-E.ulaval.ca> Content-Type: text/plain; charset="iso-8859-1" Hi every one, I have a big formula with some parameter that I want to minimise with some condition like sum of all of them must be one, in some condition, one of them must equal 0, etc. I am trying to solve that with constrOptim but I got some broblem... Like how to have a parametter equal to 0 Is there any package who is more user friendly that you can suggest to me? Simon No??l CdeC ------------------------------ Message: 14 Date: Thu, 9 Feb 2012 10:55:11 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> To: Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <cac+n9bv69pjk_9b65nj9+cfpuhr_uk2bj8xuvzwewz37i8uyvq at="" mail.gmail.com=""> Content-Type: text/plain I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that makes any difference... Thanks much, will play with these and see what happens :-) On Thu, Feb 9, 2012 at 10:17 AM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi Tim, > > A quick answer to one of your questions, specifically: > > > And since it's fresh in my mind, how does one automatically attach the > > appropriate BSgenome to a GenomicRanges? It seems like it should be > > trivial, and the documentation hints at it, but I haven't succeeded yet > in > > doing this automatically. > > I have a generic method defined somewhere in one of my utility called > `getBsGenome` which loads the appropriate genome of choice. It's not > bullet proof, but is handy. > > The genome is identified by its annotation/accession/build/release, > whatever you call it, where I mostly go by UCSC conveintions, ie. > 'hg19', 'mm9', etc. > > I use it like this > > bsg <- getBsGenome('hg19') > > And it will load the BSgenome.Hsapiens.UCSC.hg19 library and return > the `Hsapiens` object (that is also, now, in your workspace). > > I define methods for different classes that I have that basically end > up calling down to the base function I pasted below. > > Assuming you are on bioc2.9, a potential GenomicRanges impl of the > method would look like so (untested): > > setMethod("getBsGenome", c(x="GenomicRanges"), function(x, ...) { > g <- unique(genome(x)) > if (!is.character(g) || length(g) != 1L || nchar(g) < 1) { > stop("Expected a single, valid genome identifier in seqinfo slot") > } > getBsGenome(g, ...) > }) > > You would have to add more cases in the switch statement of the base > function when you want to expand your reportoire of organisms: > > setMethod("getBsGenome", c(x="character"), > function(x, organism=NULL, anno.source='UCSC', ...) { > lib.name <- 'BSgenome.:organism:.:anno.source:.:genome:' > if (is.null(organism)) { > organism <- switch(substring(x, 1, 2), > hg='Hsapiens', > mm='Mmusculus', > sa='Scerevisiae', > dm='Dmelanogaster', > rn='Rnorvegicus', > ce='Celegans', > stop("Unknown genome", x, sep=" ")) > } > > lib.name <- gsub(':organism:', organism, lib.name) > lib.name <- gsub(':anno.source:', anno.source, lib.name) > lib.name <- gsub(':genome:', x, lib.name) > > suppressPackageStartupMessages({ > found <- requirelib.name, character.only=TRUE) > }) > > if (!found) { > stoplib.name, " package required.") > } > > get(organism, pos=paste('package', lib.name, sep=":")) > }) > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 15 Date: Thu, 9 Feb 2012 13:59:30 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> To: ttriche at usc.edu Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <caha9mcot9d+n1rwzhmt_pslswagsioka98hhj-lntabyq5nh5w at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 On Thu, Feb 9, 2012 at 1:55 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote: > I'm running bioc-devel (2.10) and R-devel r58079 from SVN, if that makes any > difference... I should have said "running at least bioc2.9" .. I think it's only in 2.9 that GenomicRanges objects got the seqinfo slots and, therefore, the `genome` function ... my release-history-trivia is a bit weak, though :-) FWIW, I'm typically on *-devel too ... (^^ not worth much ...) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 16 Date: Thu, 9 Feb 2012 19:35:57 +0000 From: Salwa Eid <salwaeid@hotmail.com> To: <bioconductor at="" r-project.org=""> Subject: [BioC] how can i convert a gse object(expression sets) to affybatch Message-ID: <snt115-w5658aa61757753c7db59cacf7b0 at="" phx.gbl=""> Content-Type: text/plain Dear all, I have read cel files from ncbi website using the getGEO from GEOquery package. I want then to use the combineaffybatch from the matchprobe package but the objects I have to pass is affybatch. I need to convert the gse object or the expression sets to affybatch to be able to use it. Any ideas? Regards,salwa [[alternative HTML version deleted]] ------------------------------ Message: 17 Date: Thu, 09 Feb 2012 20:37:24 +0100 From: Brian <zenlines@gmail.com> To: bioconductor at r-project.org Subject: [BioC] Import sequences from MacClade 4.* Message-ID: <4F342074.2040304 at gmail.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hello List, sorry if this is a stupid question, but I am returning some old sequences that I have laying around, the program is a mac program called MacClade, but the sequence file looks like: #NEXUS [MacClade 4.03 registered to University] BEGIN DATA; DIMENSIONS NTAX=53 NCHAR=673; FORMAT DATATYPE=DNA MISSING=? GAP=- INTERLEAVE ; MATRIX [ 10 20 30 40 50 60] [ . . . . . .] [Modal TGAACCTGCGGAAGGAAAATATTATTGAATATATTTTTTA] AI1A TGAACCTGCGGAACGAAAATATTATTGAATATATTTTTTA [60] ... Does this look familiar to anyone? Did I overlook some function in the "seqinr" package? Before I write some function to get the sequences out for me. Thanks for the help! Cheers, Brian ------------------------------ Message: 18 Date: Thu, 09 Feb 2012 13:40:49 -0600 From: Elena Sorokin <sorokin@wisc.edu> To: bioconductor at r-project.org Subject: Re: [BioC] DESeq and transcript-wise analysis Message-ID: <4F342141.1040209 at wisc.edu> Content-Type: text/plain; CHARSET=US-ASCII; format=flowed Dear Simon, Abhi, and Martin, Thanks for your replies. I just learned about DEXseq and am working through the vignette. Importantly for me, I see that DEXseq can also do generalized linear models. I will compare that to my current isoform analysis in DESeq with its GLM function, then compare both those to previous results from Cuffdiff, where unfortunately I've only been able to do 2-sample comparisons. Simon, I'm working with C. elegans, not with a vertebrate, and am usually dealing with 1-2 isoforms per gene, though sometimes it can be much more. I don't know if that still presents a huge problem for DESeq, but I figure it's worth trying anyway. Thanks again for everyone's advice. It is much appreciated! Best, Elena ------------------------------ Message: 19 Date: Thu, 9 Feb 2012 15:28:39 -0500 From: Simon No?l <simon.noel.2@ulaval.ca> To: Hervé Pagès <hpages at="" fhcrc.org=""> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: [BioC] RE : maping SNPs Message-ID: <1C970CA8E580A84394CE5CF42FEC0046E2CBDCC109 at EXCH- MBX-E.ulaval.ca> Content-Type: text/plain; charset="iso-8859-1" Hello Mr. Pag?s, At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow. As you remember, with your help we changed it a little bit and we have got with R2.10 : library("IRanges") library("GenomicRanges") library("GenomicFeatures") #? changer si une version plus r?cente de la librairie est t?l?charg?e. library(SNPlocs.Hsapiens.dbSNP.20101109) library("org.Hs.eg.db") #Allocation de la m?moire sous windows memory.limit(size = 4095) #v?rification de la librairie SNPlocs.Hsapiens.dbSNP getSNPcount() ch22snps <- getSNPlocs("ch22") ch22snps[1:5, ] #Create a GRange objetc to use with GenomicRanges library makeGRangesFromRefSNPids <- function(myids, verbose=FALSE) { ans_seqnames <- character(length(myids)) ans_seqnames[] <- "unknown" ans_locs <- integer(length(myids)) for (seqname in names(getSNPcount())) { if (verbose) cat("Processing ", seqname, " SNPs ...\n", sep="") locs <- getSNPlocs(seqname) ids <- paste("rs", locs$RefSNP_id, sep="") myrows <- match(myids, ids) hit_idx <- !is.na(myrows) ans_seqnames[hit_idx] <- seqname ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]] } GRanges(seqnames=ans_seqnames, IRanges(start=ans_locs, width=1), RefSNP_id=myids) } #Test en utilisant les premi?res sondes du premier et second chormosome #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") #ouverture du fichier pour aller chercher nos num?ros rs rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE) myids <- rs_SNPs[,1] mysnps <- makeGRangesFromRefSNPids(myids) mysnps # a GRanges object with 1 SNP per row #create a TranscriptDb txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") txdb #extract the transcript locations together with their genes tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) tx # a GRanges object with 1 transcript per row #rename chromosome to fit USCS standard seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) #v?rifier pour X/Y -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps)) #mapping but not on a readable format map <- as.matrix(findOverlaps(mysnps, tx)) #making the mapping readable mapped_genes <- values(tx)$gene_id[map[, 2]] mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) rownames(snp2gene) <- NULL snp2gene #snp2gene se travaille mal alors on le transf?re en matrice snp2geneTmp = t(t(snp2gene)) #aller chercher les symboles correspondants ? nos gene id symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) save.image(file = "map.RData") And everything was working perfectly. Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping. I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error... Can you help me? The problems seems to start with "map <- as.matrix(findOverlaps(mysnps, tx))" and the other error seems to result from that problem. sessionInfo() R version 2.14.1 (2011-12-22) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] org.Hs.eg.db_2.6.4 [2] RSQLite_0.11.1 [3] DBI_0.2-5 [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 [5] GenomicFeatures_1.6.7 [6] AnnotationDbi_1.16.11 [7] Biobase_2.14.0 [8] GenomicRanges_1.6.7 [9] IRanges_1.12.6 loaded via a namespace (and not attached): [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0 RCurl_1.9-5 [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4 zlibbioc_1.0.0 > library("IRanges") Attaching package: ?IRanges? The following object(s) are masked from ?package:base?: cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int, rbind, rep.int, setdiff, table, union > library("GenomicRanges") > library("GenomicFeatures") Loading required package: AnnotationDbi Loading required package: Biobase Welcome to Bioconductor Vignettes contain introductory material. To view, type 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation("pkgname")'. Attaching package: ?Biobase? The following object(s) are masked from ?package:IRanges?: updateObject > #? changer si une version plus r?cente de la librairie est t?l?charg?e. > library(SNPlocs.Hsapiens.dbSNP.20110815) > library("org.Hs.eg.db") Loading required package: DBI > > > #Allocation de la m?moire sous windows > memory.limit(size = 4095) [1] Inf Warning message: 'memory.limit()' is Windows-specific > > #v?rification de la librairie SNPlocs.Hsapiens.dbSNP > getSNPcount() ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307 ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 1542256 1521919 1104719 1031214 949642 1084538 917737 886293 732039 788556 ch21 ch22 chX chY chMT 468379 454939 920890 75108 942 > ch22snps <- getSNPlocs("ch22") > ch22snps[1:5, ] RefSNP_id alleles_as_ambig loc 1 56342815 K 16050353 2 149201999 Y 16050408 3 146752890 S 16050612 4 139377059 Y 16050678 5 143300205 R 16050822 > > #########################? FAIRE CANOPUS################### > > #Create a GRange objetc to use with GenomicRanges library > makeGRangesFromRefSNPids <- function(myids, verbose=FALSE) + { + ans_seqnames <- character(length(myids)) + ans_seqnames[] <- "unknown" + ans_locs <- integer(length(myids)) + for (seqname in names(getSNPcount())) { + if (verbose) + cat("Processing ", seqname, " SNPs ...\n", sep="") + locs <- getSNPlocs(seqname) + ids <- paste("rs", locs$RefSNP_id, sep="") + myrows <- match(myids, ids) + hit_idx <- !is.na(myrows) + ans_seqnames[hit_idx] <- seqname + ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]] + } + GRanges(seqnames=ans_seqnames, + IRanges(start=ans_locs, width=1), + RefSNP_id=myids) + } > > > #Test en utilisant les premi?res sondes du premier et second chormosome > #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") > > #ouverture du fichier pour aller chercher nos num?ros rs > rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE) > myids <- rs_SNPs[,1] > > mysnps <- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row GRanges with 348411 ranges and 1 elementMetadata value: seqnames ranges strand | RefSNP_id <rle> <iranges> <rle> | <factor> [1] ch1 [2195117, 2195117] * | rs7547453 [2] ch1 [2291680, 2291680] * | rs2840542 [3] ch1 [3256108, 3256108] * | rs1999527 [4] ch1 [3577321, 3577321] * | rs4648545 [5] ch1 [4230463, 4230463] * | rs10915459 [6] ch1 [4404344, 4404344] * | rs16838750 [7] ch1 [4501911, 4501911] * | rs12128230 [8] ch1 [4535148, 4535148] * | rs7541288 [9] ch1 [4581230, 4581230] * | rs12039682 ... ... ... ... ... ... [348403] chX [154514047, 154514047] * | rs499428 [348404] chX [154514919, 154514919] * | rs507127 [348405] chX [154737376, 154737376] * | rs5940372 [348406] chX [154780283, 154780283] * | rs6642287 [348407] chX [154830377, 154830377] * | rs5983658 [348408] chX [154870197, 154870197] * | rs473772 [348409] chX [154892230, 154892230] * | rs553678 [348410] chX [154899846, 154899846] * | rs473491 [348411] chX [154929412, 154929412] * | rs557132 --- seqlengths: ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX unknown NA NA NA NA NA ... NA NA NA NA > > #create a TranscriptDb > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") Download the refGene table ... OK Download the refLink table ... OK Extract the 'transcripts' data frame ... OK Extract the 'splicings' data frame ... OK Download and preprocess the 'chrominfo' data frame ... OK Prepare the 'metadata' data frame ... OK Make the TranscriptDb object ... OK There were 50 or more warnings (use warnings() to see the first 50) > txdb TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg19 | Genus and Species: Homo sapiens | UCSC Table: refGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: yes | transcript_nrow: 41677 | exon_nrow: 235596 | cds_nrow: 206518 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012) | GenomicFeatures version at creation time: 1.6.7 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 | package: GenomicFeatures > > #extract the transcript locations together with their genes > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row GRanges with 41677 ranges and 3 elementMetadata values: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr1 [ 11874, 14408] + | 1127 NR_046018 [2] chr1 [ 69091, 70008] + | 1128 NM_001005484 [3] chr1 [323892, 328581] + | 1130 NR_028327 [4] chr1 [323892, 328581] + | 1132 NR_028325 [5] chr1 [323892, 328581] + | 1133 NR_028322 [6] chr1 [367659, 368597] + | 1131 NM_001005221 [7] chr1 [367659, 368597] + | 1134 NM_001005224 [8] chr1 [367659, 368597] + | 1135 NM_001005277 [9] chr1 [763064, 789740] + | 198 NR_015368 ... ... ... ... ... ... ... [41669] chrY [27177050, 27198251] - | 20790 NM_004678 [41670] chrY [27177050, 27198251] - | 20793 NM_001002761 [41671] chrY [27177050, 27198251] - | 20794 NM_001002760 [41672] chrY [27209230, 27246039] - | 20791 NR_002177 [41673] chrY [27209230, 27246039] - | 20792 NR_002178 [41674] chrY [27209230, 27246039] - | 20795 NR_001525 [41675] chrY [27329790, 27330920] - | 20796 NR_002179 [41676] chrY [27329790, 27330920] - | 20797 NR_002180 [41677] chrY [27329790, 27330920] - | 20798 NR_001526 gene_id <compressedcharacterlist> [1] 100287102 [2] 79501 [3] 100133331 [4] 100132062 [5] 100132287 [6] 729759 [7] 26683 [8] 81399 [9] 643837 ... ... [41669] 9083 [41670] 442868 [41671] 442867 [41672] 474150 [41673] 474149 [41674] 114761 [41675] 474152 [41676] 474151 [41677] 252949 --- seqlengths: chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 > > #rename chromosome to fit USCS standard > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) : levels of supplied 'seqnames' must be identical to 'seqlevels(x)' > #v?rifier pour X/Y -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps)) > > #mapping but not on a readable format > map <- as.matrix(findOverlaps(mysnps, tx)) Warning message: In .Seqinfo.mergexy(x, y) : Each of the 2 combined objects has sequence levels not in the other: - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated] > > > #making the mapping readable > mapped_genes <- values(tx)$gene_id[map[, 2]] > mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene) <- NULL > snp2gene [1] SNPNAME gene_id <0 rows> (or 0-length row.names) > > > #snp2gene se travaille mal alors on le transf?re en matrice > snp2geneTmp = t(t(snp2gene)) > > #aller chercher les symboles correspondants ? nos gene id > symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) : error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) : keys must be supplied in a character vector with no NAs > > > save.image(file = "map.RData") > Simon No?l CdeC ________________________________________ De : Hervé Pagès [hpages at fhcrc.org] Date d'envoi : 5 d?cembre 2010 23:43 ? : Simon No?l Cc : bioconductor at r-project.org Objet : Re: [BioC] maping SNPs Hi Simon, On 12/03/2010 10:17 AM, Simon No?l wrote: > > Hi, > > > > I have a really big list of SNPs names like : > > > > SNPNAME > > rs7547453 > > rs2840542 > > rs1999527 > > rs4648545 > > rs10915459 > > rs16838750 > > rs12128230 > > ... > > > > I woudlike to map them to their official gene symbol. What the best way to > procede? Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings from SNPs to genes and I don't think we have this kind of mappings either in our collection of annotations (*.db packages). But if your SNPs are Human then you can do the mapping yourself by using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures packages. The latest SNPlocs.Hsapies.dbSNP.* package is SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains SNP locations relative to the GRCh37 genome: > library(SNPlocs.Hsapiens.dbSNP.20101109) > getSNPcount() ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 1158707 ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 1147722 1105364 815729 740129 657719 757926 641905 645646 520666 586708 ch21 ch22 chX chY chMT 338254 331060 529608 67438 624 Note that it doesn't contain *all* SNPs from dbSNP Build 132: only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109 for the details). > ch22snps <- getSNPlocs("ch22") > ch22snps[1:5, ] RefSNP_id alleles_as_ambig loc 1 56342815 K 16050353 2 7288968 S 16050994 3 6518357 M 16051107 4 7292503 R 16051209 5 6518368 Y 16051241 Note that the rs prefix has been dropped. So here is how to proceed: First you can use the following function to make a GRanges object from your SNP ids: makeGRangesFromRefSNPids <- function(myids) { ans_seqnames <- character(length(myids)) ans_seqnames[] <- "unknown" ans_locs <- integer(length(myids)) for (seqname in names(getSNPcount())) { locs <- getSNPlocs(seqname) ids <- paste("rs", locs$RefSNP_id, sep="") myrows <- match(myids, ids) ans_seqnames[!is.na(myrows)] <- seqname ans_locs[!is.na(myrows)] <- locs$loc[myrows] } GRanges(seqnames=ans_seqnames, IRanges(start=ans_locs, width=1), RefSNP_id=myids) } This takes between 3 and 5 minutes: > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs10915459", "rs16838750", "rs12128230", "rs999999999") > mysnps <- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row GRanges with 8 ranges and 1 elementMetadata value seqnames ranges strand | myids <rle> <iranges> <rle> | <character> [1] ch1 [2195117, 2195117] * | rs7547453 [2] ch1 [2291680, 2291680] * | rs2840542 [3] ch1 [3256108, 3256108] * | rs1999527 [4] ch1 [3577321, 3577321] * | rs4648545 [5] ch1 [4230463, 4230463] * | rs10915459 [6] ch1 [4404344, 4404344] * | rs16838750 [7] ch1 [4501911, 4501911] * | rs12128230 [8] unknown [ 0, 0] * | rs999999999 seqlengths ch1 unknown NA NA The next step is to create a TranscriptDb object with makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart() from the GenomicFeatures package. This TranscriptDb object will contain the transcript locations and their associated genes extracted from the annotation source you choose. For example, if you want to use RefSeq genes: ## Takes about 3 minutes: > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > txdb TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg19 | UCSC Table: refGene | Type of Gene ID: Entrez Gene ID | Full dataset: yes | transcript_nrow: 37924 | exon_nrow: 230024 | cds_nrow: 204571 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010) | GenomicFeatures version at creation time: 1.2.2 | RSQLite version at creation time: 0.9-4 | DBSCHEMAVERSION: 1.0 Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb object: this means that later you will be able to use the org.Hs.eg.db package to map your gene ids to their symbol (the org.*.eg.db packages are Entrez Gene ID centric). To extract the transcript locations together with their genes: > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row GRanges with 37924 ranges and 1 elementMetadata value seqnames ranges strand | gene_id <rle> <iranges> <rle> | <compressedcharacterlist> [1] chr1 [ 69091, 70008] + | 79501 [2] chr1 [323892, 328581] + | 100133331 [3] chr1 [323892, 328581] + | 100132287 [4] chr1 [323892, 328581] + | 100132062 [5] chr1 [367659, 368597] + | 81399 [6] chr1 [367659, 368597] + | 729759 [7] chr1 [367659, 368597] + | 26683 [8] chr1 [763064, 789740] + | 643837 [9] chr1 [861121, 879961] + | 148398 ... ... ... ... ... ... [37916] chrY [27177050, 27198251] - | 9083 [37917] chrY [27177050, 27198251] - | 442867 [37918] chrY [27177050, 27198251] - | 442868 [37919] chrY [27209230, 27246039] - | 114761 [37920] chrY [27209230, 27246039] - | 474150 [37921] chrY [27209230, 27246039] - | 474149 [37922] chrY [27329790, 27330920] - | 252949 [37923] chrY [27329790, 27330920] - | 474152 [37924] chrY [27329790, 27330920] - | 474151 seqlengths chr1 chr2 ... chr18_gl000207_random 249250621 243199373 ... 4262 Now you can use findOverlaps() on 'mysnps' and 'tx' to find the transcripts hits by your snps. But before you can do this, you need to rename the sequences in 'mysnps' because dbSNPs and UCSC use different naming conventions for the chromosomes: > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Then: > map <- as.matrix(findOverlaps(mysnps, tx)) 'map' contains the mapping between your SNPs and their genes but not in a readable form (this matrix contains indices) so we make the 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for the associated gene ids: > mapped_genes <- values(tx)$gene_id[map[, 2]] > mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]], elementLengths(mapped_genes)) > snp2gene <- unique(data.frame(snp_id=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene) <- NULL > snp2gene[1:4, ] snp_id gene_id 1 rs7547453 6497 2 rs2840542 79906 3 rs1999527 63976 4 rs4648545 7161 Note that there is no guarantee that the number of rows in this data frame is the number of your original SNP ids because the relation between SNP ids and gene ids is of course not one-to-one. Also the method described above considers that a SNP hits a gene if it's located between the start and the end of one of its transcripts but it doesn't take in account the exon structure of the transcripts. If you want to do this you need to use exonsBy() (from GenomicFeatures) to extract the exons grouped by transcripts (this will be stored in a GRangesList object) and use this object instead of 'tx' in the call to findOverlaps(). Hope this helps, H. > > > > Simon No??l > CdeC > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 ------------------------------ Message: 20 Date: Thu, 09 Feb 2012 13:34:27 -0800 From: Hervé Pagès <hpages@fhcrc.org> To: Simon No?l <simon.noel.2 at="" ulaval.ca=""> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] RE : maping SNPs Message-ID: <4F343BE3.1040102 at fhcrc.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Hi Simon, On 02/09/2012 12:28 PM, Simon No?l wrote: > Hello Mr. Pag?s, > > At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow. As you remember, with your help we changed it a little bit and we have got with R2.10 : > > library("IRanges") > library("GenomicRanges") > library("GenomicFeatures") > #? changer si une version plus r?cente de la librairie est t?l?charg?e. > library(SNPlocs.Hsapiens.dbSNP.20101109) > library("org.Hs.eg.db") > > #Allocation de la m?moire sous windows > memory.limit(size = 4095) > #v?rification de la librairie SNPlocs.Hsapiens.dbSNP > getSNPcount() > ch22snps<- getSNPlocs("ch22") > ch22snps[1:5, ] > > #Create a GRange objetc to use with GenomicRanges library > makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > if (verbose) > cat("Processing ", seqname, " SNPs ...\n", sep="") > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > hit_idx<- !is.na(myrows) > ans_seqnames[hit_idx]<- seqname > ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > #Test en utilisant les premi?res sondes du premier et second chormosome > #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") > #ouverture du fichier pour aller chercher nos num?ros rs > rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) > myids<- rs_SNPs[,1] > mysnps<- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row > #create a TranscriptDb > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > txdb > #extract the transcript locations together with their genes > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row > #rename chromosome to fit USCS standard > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) > #mapping but not on a readable format > map<- as.matrix(findOverlaps(mysnps, tx)) > > #making the mapping readable > mapped_genes<- values(tx)$gene_id[map[, 2]] > mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene)<- NULL > snp2gene > > #snp2gene se travaille mal alors on le transf?re en matrice > snp2geneTmp = t(t(snp2gene)) > #aller chercher les symboles correspondants ? nos gene id > symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > > save.image(file = "map.RData") > > > > > > And everything was working perfectly. > > Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping. I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error... You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which means a lot of things could have changed. You should not assume that your problems are caused only because you are using a more recent SNPlocs package. > Can you help me? The problems seems to start with "map<- as.matrix(findOverlaps(mysnps, tx))" The problem starts earlier with: > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) : levels of supplied 'seqnames' must be identical to 'seqlevels(x)' The right way to do this with recent version of GenomicRanges is to use seqlevels() instead of seqnames(). > and the other error seems to result from that problem. Failing to rename the seqlevels will surely cause you some troubles later down so I would try to address this issue first see if that solves the other problems. Also note that with recent versions of the SNPlocs packages (i.e. version >= 0.99.6), you can use rsidsToGRanges() to do what your home made makeGRangesFromRefSNPids() function does. The latter is much faster BUT, unlike the former, it will fail if some of the supplied rs ids are not found in the SNPlocs package (it will issue an error showing the list of rs ids that could not be found). I've already received some request to improve this so I'll try to work on this soon. Cheers, H. > > > > sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: x86_64-pc-linux-gnu (64-bit) > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] org.Hs.eg.db_2.6.4 > [2] RSQLite_0.11.1 > [3] DBI_0.2-5 > [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 > [5] GenomicFeatures_1.6.7 > [6] AnnotationDbi_1.16.11 > [7] Biobase_2.14.0 > [8] GenomicRanges_1.6.7 > [9] IRanges_1.12.6 > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0 RCurl_1.9-5 > [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4 zlibbioc_1.0.0 > >> library("IRanges") > Attaching package: ?IRanges? > The following object(s) are masked from ?package:base?: > cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int, > pmin, pmin.int, rbind, rep.int, setdiff, table, union >> library("GenomicRanges") >> library("GenomicFeatures") > Loading required package: AnnotationDbi > Loading required package: Biobase > Welcome to Bioconductor > Vignettes contain introductory material. To view, type > 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation("pkgname")'. > > Attaching package: ?Biobase? > The following object(s) are masked from ?package:IRanges?: > updateObject >> #? changer si une version plus r?cente de la librairie est t?l?charg?e. >> library(SNPlocs.Hsapiens.dbSNP.20110815) >> library("org.Hs.eg.db") > Loading required package: DBI >> >> >> #Allocation de la m?moire sous windows >> memory.limit(size = 4095) > [1] Inf > Warning message: > 'memory.limit()' is Windows-specific >> >> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP >> getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 > 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 > 1542256 1521919 1104719 1031214 949642 1084538 917737 886293 732039 788556 > ch21 ch22 chX chY chMT > 468379 454939 920890 75108 942 >> ch22snps<- getSNPlocs("ch22") >> ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 149201999 Y 16050408 > 3 146752890 S 16050612 > 4 139377059 Y 16050678 > 5 143300205 R 16050822 >> >> #########################? FAIRE CANOPUS################### >> >> #Create a GRange objetc to use with GenomicRanges library >> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > + { > + ans_seqnames<- character(length(myids)) > + ans_seqnames[]<- "unknown" > + ans_locs<- integer(length(myids)) > + for (seqname in names(getSNPcount())) { > + if (verbose) > + cat("Processing ", seqname, " SNPs ...\n", sep="") > + locs<- getSNPlocs(seqname) > + ids<- paste("rs", locs$RefSNP_id, sep="") > + myrows<- match(myids, ids) > + hit_idx<- !is.na(myrows) > + ans_seqnames[hit_idx]<- seqname > + ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > + } > + GRanges(seqnames=ans_seqnames, > + IRanges(start=ans_locs, width=1), > + RefSNP_id=myids) > + } >> >> >> #Test en utilisant les premi?res sondes du premier et second chormosome >> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") >> >> #ouverture du fichier pour aller chercher nos num?ros rs >> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) >> myids<- rs_SNPs[,1] >> >> mysnps<- makeGRangesFromRefSNPids(myids) >> mysnps # a GRanges object with 1 SNP per row > GRanges with 348411 ranges and 1 elementMetadata value: > seqnames ranges strand | RefSNP_id > <rle> <iranges> <rle> |<factor> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] ch1 [4535148, 4535148] * | rs7541288 > [9] ch1 [4581230, 4581230] * | rs12039682 > ... ... ... ... ... ... > [348403] chX [154514047, 154514047] * | rs499428 > [348404] chX [154514919, 154514919] * | rs507127 > [348405] chX [154737376, 154737376] * | rs5940372 > [348406] chX [154780283, 154780283] * | rs6642287 > [348407] chX [154830377, 154830377] * | rs5983658 > [348408] chX [154870197, 154870197] * | rs473772 > [348409] chX [154892230, 154892230] * | rs553678 > [348410] chX [154899846, 154899846] * | rs473491 > [348411] chX [154929412, 154929412] * | rs557132 > --- > seqlengths: > ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX unknown > NA NA NA NA NA ... NA NA NA NA >> >> #create a TranscriptDb >> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > Download the refGene table ... OK > Download the refLink table ... OK > Extract the 'transcripts' data frame ... OK > Extract the 'splicings' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Prepare the 'metadata' data frame ... OK > Make the TranscriptDb object ... OK > There were 50 or more warnings (use warnings() to see the first 50) >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | Genus and Species: Homo sapiens > | UCSC Table: refGene > | Resource URL: http://genome.ucsc.edu/ > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 41677 > | exon_nrow: 235596 > | cds_nrow: 206518 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012) > | GenomicFeatures version at creation time: 1.6.7 > | RSQLite version at creation time: 0.11.1 > | DBSCHEMAVERSION: 1.0 > | package: GenomicFeatures >> >> #extract the transcript locations together with their genes >> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) >> tx # a GRanges object with 1 transcript per row > GRanges with 41677 ranges and 3 elementMetadata values: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> |<integer> <character> > [1] chr1 [ 11874, 14408] + | 1127 NR_046018 > [2] chr1 [ 69091, 70008] + | 1128 NM_001005484 > [3] chr1 [323892, 328581] + | 1130 NR_028327 > [4] chr1 [323892, 328581] + | 1132 NR_028325 > [5] chr1 [323892, 328581] + | 1133 NR_028322 > [6] chr1 [367659, 368597] + | 1131 NM_001005221 > [7] chr1 [367659, 368597] + | 1134 NM_001005224 > [8] chr1 [367659, 368597] + | 1135 NM_001005277 > [9] chr1 [763064, 789740] + | 198 NR_015368 > ... ... ... ... ... ... ... > [41669] chrY [27177050, 27198251] - | 20790 NM_004678 > [41670] chrY [27177050, 27198251] - | 20793 NM_001002761 > [41671] chrY [27177050, 27198251] - | 20794 NM_001002760 > [41672] chrY [27209230, 27246039] - | 20791 NR_002177 > [41673] chrY [27209230, 27246039] - | 20792 NR_002178 > [41674] chrY [27209230, 27246039] - | 20795 NR_001525 > [41675] chrY [27329790, 27330920] - | 20796 NR_002179 > [41676] chrY [27329790, 27330920] - | 20797 NR_002180 > [41677] chrY [27329790, 27330920] - | 20798 NR_001526 > gene_id > <compressedcharacterlist> > [1] 100287102 > [2] 79501 > [3] 100133331 > [4] 100132062 > [5] 100132287 > [6] 729759 > [7] 26683 > [8] 81399 > [9] 643837 > ... ... > [41669] 9083 > [41670] 442868 > [41671] 442867 > [41672] 474150 > [41673] 474149 > [41674] 114761 > [41675] 474152 > [41676] 474151 > [41677] 252949 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 >> >> #rename chromosome to fit USCS standard >> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > Error in `seqnames<-`(`*tmp*`, value =<s4 object="" of="" class="" "rle"="">) : > levels of supplied 'seqnames' must be identical to 'seqlevels(x)' >> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) >> >> #mapping but not on a readable format >> map<- as.matrix(findOverlaps(mysnps, tx)) > Warning message: > In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown > - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated] >> >> >> #making the mapping readable >> mapped_genes<- values(tx)$gene_id[map[, 2]] >> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) >> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) >> rownames(snp2gene)<- NULL >> snp2gene > [1] SNPNAME gene_id > <0 rows> (or 0-length row.names) >> >> >> #snp2gene se travaille mal alors on le transf?re en matrice >> snp2geneTmp = t(t(snp2gene)) >> >> #aller chercher les symboles correspondants ? nos gene id >> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) : > error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) : > keys must be supplied in a character vector with no NAs >> >> >> save.image(file = "map.RData") >> > > > > > > Simon No?l > CdeC > ________________________________________ > De : Hervé Pagès [hpages at fhcrc.org] > Date d'envoi : 5 d?cembre 2010 23:43 > ? : Simon No?l > Cc : bioconductor at r-project.org > Objet : Re: [BioC] maping SNPs > > Hi Simon, > > On 12/03/2010 10:17 AM, Simon No?l wrote: >> >> Hi, >> >> >> >> I have a really big list of SNPs names like : >> >> >> >> SNPNAME >> >> rs7547453 >> >> rs2840542 >> >> rs1999527 >> >> rs4648545 >> >> rs10915459 >> >> rs16838750 >> >> rs12128230 >> >> ... >> >> >> >> I woudlike to map them to their official gene symbol. What the best way to >> procede? > > Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings > from SNPs to genes and I don't think we have this kind of mappings > either in our collection of annotations (*.db packages). > > But if your SNPs are Human then you can do the mapping yourself by > using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures > packages. > > The latest SNPlocs.Hsapies.dbSNP.* package is > SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains > SNP locations relative to the GRCh37 genome: > > > library(SNPlocs.Hsapiens.dbSNP.20101109) > > getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 > ch10 > 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 > 1158707 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 > ch20 > 1147722 1105364 815729 740129 657719 757926 641905 645646 520666 > 586708 > ch21 ch22 chX chY chMT > 338254 331060 529608 67438 624 > > Note that it doesn't contain *all* SNPs from dbSNP Build 132: > only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109 > for the details). > > > ch22snps<- getSNPlocs("ch22") > > ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 7288968 S 16050994 > 3 6518357 M 16051107 > 4 7292503 R 16051209 > 5 6518368 Y 16051241 > > Note that the rs prefix has been dropped. > > So here is how to proceed: > > First you can use the following function to make a GRanges object from > your SNP ids: > > makeGRangesFromRefSNPids<- function(myids) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > ans_seqnames[!is.na(myrows)]<- seqname > ans_locs[!is.na(myrows)]<- locs$loc[myrows] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > This takes between 3 and 5 minutes: > > > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", > "rs10915459", "rs16838750", "rs12128230", "rs999999999") > > mysnps<- makeGRangesFromRefSNPids(myids) > > mysnps # a GRanges object with 1 SNP per row > GRanges with 8 ranges and 1 elementMetadata value > seqnames ranges strand | myids > <rle> <iranges> <rle> |<character> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] unknown [ 0, 0] * | rs999999999 > > seqlengths > ch1 unknown > NA NA > > The next step is to create a TranscriptDb object with > makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart() > from the GenomicFeatures package. This TranscriptDb object will > contain the transcript locations and their associated > genes extracted from the annotation source you choose. > For example, if you want to use RefSeq genes: > > ## Takes about 3 minutes: > > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > > txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: refGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 37924 > | exon_nrow: 230024 > | cds_nrow: 204571 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010) > | GenomicFeatures version at creation time: 1.2.2 > | RSQLite version at creation time: 0.9-4 > | DBSCHEMAVERSION: 1.0 > > Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb > object: this means that later you will be able to use the org.Hs.eg.db > package to map your gene ids to their symbol (the org.*.eg.db packages > are Entrez Gene ID centric). > > To extract the transcript locations together with their genes: > > > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > > tx # a GRanges object with 1 transcript per row > GRanges with 37924 ranges and 1 elementMetadata value > seqnames ranges strand | gene_id > <rle> <iranges> <rle> |<compressedcharacterlist> > [1] chr1 [ 69091, 70008] + | 79501 > [2] chr1 [323892, 328581] + | 100133331 > [3] chr1 [323892, 328581] + | 100132287 > [4] chr1 [323892, 328581] + | 100132062 > [5] chr1 [367659, 368597] + | 81399 > [6] chr1 [367659, 368597] + | 729759 > [7] chr1 [367659, 368597] + | 26683 > [8] chr1 [763064, 789740] + | 643837 > [9] chr1 [861121, 879961] + | 148398 > ... ... ... ... ... ... > [37916] chrY [27177050, 27198251] - | 9083 > [37917] chrY [27177050, 27198251] - | 442867 > [37918] chrY [27177050, 27198251] - | 442868 > [37919] chrY [27209230, 27246039] - | 114761 > [37920] chrY [27209230, 27246039] - | 474150 > [37921] chrY [27209230, 27246039] - | 474149 > [37922] chrY [27329790, 27330920] - | 252949 > [37923] chrY [27329790, 27330920] - | 474152 > [37924] chrY [27329790, 27330920] - | 474151 > > seqlengths > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > > Now you can use findOverlaps() on 'mysnps' and 'tx' to find > the transcripts hits by your snps. But before you can do this, > you need to rename the sequences in 'mysnps' because dbSNPs and > UCSC use different naming conventions for the chromosomes: > > > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > > Then: > > > map<- as.matrix(findOverlaps(mysnps, tx)) > > 'map' contains the mapping between your SNPs and their genes but not > in a readable form (this matrix contains indices) so we make the > 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for > the associated gene ids: > > > mapped_genes<- values(tx)$gene_id[map[, 2]] > > mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]], > elementLengths(mapped_genes)) > > snp2gene<- unique(data.frame(snp_id=mapped_snps, > gene_id=unlist(mapped_genes))) > > rownames(snp2gene)<- NULL > > snp2gene[1:4, ] > snp_id gene_id > 1 rs7547453 6497 > 2 rs2840542 79906 > 3 rs1999527 63976 > 4 rs4648545 7161 > > Note that there is no guarantee that the number of rows in this > data frame is the number of your original SNP ids because the > relation between SNP ids and gene ids is of course not one-to-one. > > Also the method described above considers that a SNP hits a gene > if it's located between the start and the end of one of its > transcripts but it doesn't take in account the exon structure of > the transcripts. If you want to do this you need to use exonsBy() > (from GenomicFeatures) to extract the exons grouped by transcripts > (this will be stored in a GRangesList object) and use this object > instead of 'tx' in the call to findOverlaps(). > > Hope this helps, > H. > > >> >> >> >> Simon No??l >> CdeC >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 ------------------------------ Message: 21 Date: Thu, 9 Feb 2012 16:43:23 -0500 From: Simon No?l <simon.noel.2@ulaval.ca> To: Hervé Pagès <hpages at="" fhcrc.org=""> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: [BioC] RE : RE : maping SNPs Message-ID: <1C970CA8E580A84394CE5CF42FEC0046E2CBDCC110 at EXCH- MBX-E.ulaval.ca> Content-Type: text/plain; charset="iso-8859-1" Thank's alot. It seem to work now:) But for the home made makeGRangesFromRefSNPids, it's not my work. You gave me that function ;) Simon No?l CdeC ________________________________________ De : Hervé Pagès [hpages at fhcrc.org] Date d'envoi : 9 f?vrier 2012 16:34 ? : Simon No?l Cc : bioconductor at r-project.org Objet : Re: RE : [BioC] maping SNPs Hi Simon, On 02/09/2012 12:28 PM, Simon No?l wrote: > Hello Mr. Pag?s, > > At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow. As you remember, with your help we changed it a little bit and we have got with R2.10 : > > library("IRanges") > library("GenomicRanges") > library("GenomicFeatures") > #? changer si une version plus r?cente de la librairie est t?l?charg?e. > library(SNPlocs.Hsapiens.dbSNP.20101109) > library("org.Hs.eg.db") > > #Allocation de la m?moire sous windows > memory.limit(size = 4095) > #v?rification de la librairie SNPlocs.Hsapiens.dbSNP > getSNPcount() > ch22snps<- getSNPlocs("ch22") > ch22snps[1:5, ] > > #Create a GRange objetc to use with GenomicRanges library > makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > if (verbose) > cat("Processing ", seqname, " SNPs ...\n", sep="") > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > hit_idx<- !is.na(myrows) > ans_seqnames[hit_idx]<- seqname > ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > #Test en utilisant les premi?res sondes du premier et second chormosome > #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") > #ouverture du fichier pour aller chercher nos num?ros rs > rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) > myids<- rs_SNPs[,1] > mysnps<- makeGRangesFromRefSNPids(myids) > mysnps # a GRanges object with 1 SNP per row > #create a TranscriptDb > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > txdb > #extract the transcript locations together with their genes > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > tx # a GRanges object with 1 transcript per row > #rename chromosome to fit USCS standard > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) > #mapping but not on a readable format > map<- as.matrix(findOverlaps(mysnps, tx)) > > #making the mapping readable > mapped_genes<- values(tx)$gene_id[map[, 2]] > mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) > snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) > rownames(snp2gene)<- NULL > snp2gene > > #snp2gene se travaille mal alors on le transf?re en matrice > snp2geneTmp = t(t(snp2gene)) > #aller chercher les symboles correspondants ? nos gene id > symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > > save.image(file = "map.RData") > > > > > > And everything was working perfectly. > > Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping. I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error... You've also upgraded from R-2.10/BioC-2.5 to R-2.14/BioC-2.9, which means a lot of things could have changed. You should not assume that your problems are caused only because you are using a more recent SNPlocs package. > Can you help me? The problems seems to start with "map<- as.matrix(findOverlaps(mysnps, tx))" The problem starts earlier with: > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps)) Error in `seqnames<-`(`*tmp*`, value = <s4 object="" of="" class="" "rle"="">) : levels of supplied 'seqnames' must be identical to 'seqlevels(x)' The right way to do this with recent version of GenomicRanges is to use seqlevels() instead of seqnames(). > and the other error seems to result from that problem. Failing to rename the seqlevels will surely cause you some troubles later down so I would try to address this issue first see if that solves the other problems. Also note that with recent versions of the SNPlocs packages (i.e. version >= 0.99.6), you can use rsidsToGRanges() to do what your home made makeGRangesFromRefSNPids() function does. The latter is much faster BUT, unlike the former, it will fail if some of the supplied rs ids are not found in the SNPlocs package (it will issue an error showing the list of rs ids that could not be found). I've already received some request to improve this so I'll try to work on this soon. Cheers, H. > > > > sessionInfo() > R version 2.14.1 (2011-12-22) > Platform: x86_64-pc-linux-gnu (64-bit) > locale: > [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 > [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] org.Hs.eg.db_2.6.4 > [2] RSQLite_0.11.1 > [3] DBI_0.2-5 > [4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6 > [5] GenomicFeatures_1.6.7 > [6] AnnotationDbi_1.16.11 > [7] Biobase_2.14.0 > [8] GenomicRanges_1.6.7 > [9] IRanges_1.12.6 > loaded via a namespace (and not attached): > [1] biomaRt_2.10.0 Biostrings_2.22.0 BSgenome_1.22.0 RCurl_1.9-5 > [5] rtracklayer_1.14.4 tools_2.14.1 XML_3.9-4 zlibbioc_1.0.0 > >> library("IRanges") > Attaching package: ?IRanges? > The following object(s) are masked from ?package:base?: > cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int, > pmin, pmin.int, rbind, rep.int, setdiff, table, union >> library("GenomicRanges") >> library("GenomicFeatures") > Loading required package: AnnotationDbi > Loading required package: Biobase > Welcome to Bioconductor > Vignettes contain introductory material. To view, type > 'browseVignettes()'. To cite Bioconductor, see > 'citation("Biobase")' and for packages 'citation("pkgname")'. > > Attaching package: ?Biobase? > The following object(s) are masked from ?package:IRanges?: > updateObject >> #? changer si une version plus r?cente de la librairie est t?l?charg?e. >> library(SNPlocs.Hsapiens.dbSNP.20110815) >> library("org.Hs.eg.db") > Loading required package: DBI >> >> >> #Allocation de la m?moire sous windows >> memory.limit(size = 4095) > [1] Inf > Warning message: > 'memory.limit()' is Windows-specific >> >> #v?rification de la librairie SNPlocs.Hsapiens.dbSNP >> getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 > 2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20 > 1542256 1521919 1104719 1031214 949642 1084538 917737 886293 732039 788556 > ch21 ch22 chX chY chMT > 468379 454939 920890 75108 942 >> ch22snps<- getSNPlocs("ch22") >> ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 149201999 Y 16050408 > 3 146752890 S 16050612 > 4 139377059 Y 16050678 > 5 143300205 R 16050822 >> >> #########################? FAIRE CANOPUS################### >> >> #Create a GRange objetc to use with GenomicRanges library >> makeGRangesFromRefSNPids<- function(myids, verbose=FALSE) > + { > + ans_seqnames<- character(length(myids)) > + ans_seqnames[]<- "unknown" > + ans_locs<- integer(length(myids)) > + for (seqname in names(getSNPcount())) { > + if (verbose) > + cat("Processing ", seqname, " SNPs ...\n", sep="") > + locs<- getSNPlocs(seqname) > + ids<- paste("rs", locs$RefSNP_id, sep="") > + myrows<- match(myids, ids) > + hit_idx<- !is.na(myrows) > + ans_seqnames[hit_idx]<- seqname > + ans_locs[hit_idx]<- locs$loc[myrows[hit_idx]] > + } > + GRanges(seqnames=ans_seqnames, > + IRanges(start=ans_locs, width=1), > + RefSNP_id=myids) > + } >> >> >> #Test en utilisant les premi?res sondes du premier et second chormosome >> #myids<- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436") >> >> #ouverture du fichier pour aller chercher nos num?ros rs >> rs_SNPs<- read.csv("info_snps.txt", sep = "\t", header=TRUE) >> myids<- rs_SNPs[,1] >> >> mysnps<- makeGRangesFromRefSNPids(myids) >> mysnps # a GRanges object with 1 SNP per row > GRanges with 348411 ranges and 1 elementMetadata value: > seqnames ranges strand | RefSNP_id > <rle> <iranges> <rle> |<factor> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] ch1 [4535148, 4535148] * | rs7541288 > [9] ch1 [4581230, 4581230] * | rs12039682 > ... ... ... ... ... ... > [348403] chX [154514047, 154514047] * | rs499428 > [348404] chX [154514919, 154514919] * | rs507127 > [348405] chX [154737376, 154737376] * | rs5940372 > [348406] chX [154780283, 154780283] * | rs6642287 > [348407] chX [154830377, 154830377] * | rs5983658 > [348408] chX [154870197, 154870197] * | rs473772 > [348409] chX [154892230, 154892230] * | rs553678 > [348410] chX [154899846, 154899846] * | rs473491 > [348411] chX [154929412, 154929412] * | rs557132 > --- > seqlengths: > ch1 ch10 ch11 ch12 ch13 ... ch8 ch9 chX unknown > NA NA NA NA NA ... NA NA NA NA >> >> #create a TranscriptDb >> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > Download the refGene table ... OK > Download the refLink table ... OK > Extract the 'transcripts' data frame ... OK > Extract the 'splicings' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Prepare the 'metadata' data frame ... OK > Make the TranscriptDb object ... OK > There were 50 or more warnings (use warnings() to see the first 50) >> txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | Genus and Species: Homo sapiens > | UCSC Table: refGene > | Resource URL: http://genome.ucsc.edu/ > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 41677 > | exon_nrow: 235596 > | cds_nrow: 206518 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012) > | GenomicFeatures version at creation time: 1.6.7 > | RSQLite version at creation time: 0.11.1 > | DBSCHEMAVERSION: 1.0 > | package: GenomicFeatures >> >> #extract the transcript locations together with their genes >> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) >> tx # a GRanges object with 1 transcript per row > GRanges with 41677 ranges and 3 elementMetadata values: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> |<integer> <character> > [1] chr1 [ 11874, 14408] + | 1127 NR_046018 > [2] chr1 [ 69091, 70008] + | 1128 NM_001005484 > [3] chr1 [323892, 328581] + | 1130 NR_028327 > [4] chr1 [323892, 328581] + | 1132 NR_028325 > [5] chr1 [323892, 328581] + | 1133 NR_028322 > [6] chr1 [367659, 368597] + | 1131 NM_001005221 > [7] chr1 [367659, 368597] + | 1134 NM_001005224 > [8] chr1 [367659, 368597] + | 1135 NM_001005277 > [9] chr1 [763064, 789740] + | 198 NR_015368 > ... ... ... ... ... ... ... > [41669] chrY [27177050, 27198251] - | 20790 NM_004678 > [41670] chrY [27177050, 27198251] - | 20793 NM_001002761 > [41671] chrY [27177050, 27198251] - | 20794 NM_001002760 > [41672] chrY [27209230, 27246039] - | 20791 NR_002177 > [41673] chrY [27209230, 27246039] - | 20792 NR_002178 > [41674] chrY [27209230, 27246039] - | 20795 NR_001525 > [41675] chrY [27329790, 27330920] - | 20796 NR_002179 > [41676] chrY [27329790, 27330920] - | 20797 NR_002180 > [41677] chrY [27329790, 27330920] - | 20798 NR_001526 > gene_id > <compressedcharacterlist> > [1] 100287102 > [2] 79501 > [3] 100133331 > [4] 100132062 > [5] 100132287 > [6] 729759 > [7] 26683 > [8] 81399 > [9] 643837 > ... ... > [41669] 9083 > [41670] 442868 > [41671] 442867 > [41672] 474150 > [41673] 474149 > [41674] 114761 > [41675] 474152 > [41676] 474151 > [41677] 252949 > --- > seqlengths: > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 >> >> #rename chromosome to fit USCS standard >> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > Error in `seqnames<-`(`*tmp*`, value =<s4 object="" of="" class="" "rle"="">) : > levels of supplied 'seqnames' must be identical to 'seqlevels(x)' >> #v?rifier pour X/Y -> seqnames(mysnps)<- sub("chrX", "chrY", seqnames(mysnps)) >> >> #mapping but not on a readable format >> map<- as.matrix(findOverlaps(mysnps, tx)) > Warning message: > In .Seqinfo.mergexy(x, y) : > Each of the 2 combined objects has sequence levels not in the other: > - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown > - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated] >> >> >> #making the mapping readable >> mapped_genes<- values(tx)$gene_id[map[, 2]] >> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes)) >> snp2gene<- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes))) >> rownames(snp2gene)<- NULL >> snp2gene > [1] SNPNAME gene_id > <0 rows> (or 0-length row.names) >> >> >> #snp2gene se travaille mal alors on le transf?re en matrice >> snp2geneTmp = t(t(snp2gene)) >> >> #aller chercher les symboles correspondants ? nos gene id >> symbol<- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA)) > Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) : > error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) : > keys must be supplied in a character vector with no NAs >> >> >> save.image(file = "map.RData") >> > > > > > > Simon No?l > CdeC > ________________________________________ > De : Hervé Pagès [hpages at fhcrc.org] > Date d'envoi : 5 d?cembre 2010 23:43 > ? : Simon No?l > Cc : bioconductor at r-project.org > Objet : Re: [BioC] maping SNPs > > Hi Simon, > > On 12/03/2010 10:17 AM, Simon No?l wrote: >> >> Hi, >> >> >> >> I have a really big list of SNPs names like : >> >> >> >> SNPNAME >> >> rs7547453 >> >> rs2840542 >> >> rs1999527 >> >> rs4648545 >> >> rs10915459 >> >> rs16838750 >> >> rs12128230 >> >> ... >> >> >> >> I woudlike to map them to their official gene symbol. What the best way to >> procede? > > Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings > from SNPs to genes and I don't think we have this kind of mappings > either in our collection of annotations (*.db packages). > > But if your SNPs are Human then you can do the mapping yourself by > using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures > packages. > > The latest SNPlocs.Hsapies.dbSNP.* package is > SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains > SNP locations relative to the GRCh37 genome: > > > library(SNPlocs.Hsapiens.dbSNP.20101109) > > getSNPcount() > ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 > ch10 > 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 > 1158707 > ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 > ch20 > 1147722 1105364 815729 740129 657719 757926 641905 645646 520666 > 586708 > ch21 ch22 chX chY chMT > 338254 331060 529608 67438 624 > > Note that it doesn't contain *all* SNPs from dbSNP Build 132: > only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109 > for the details). > > > ch22snps<- getSNPlocs("ch22") > > ch22snps[1:5, ] > RefSNP_id alleles_as_ambig loc > 1 56342815 K 16050353 > 2 7288968 S 16050994 > 3 6518357 M 16051107 > 4 7292503 R 16051209 > 5 6518368 Y 16051241 > > Note that the rs prefix has been dropped. > > So here is how to proceed: > > First you can use the following function to make a GRanges object from > your SNP ids: > > makeGRangesFromRefSNPids<- function(myids) > { > ans_seqnames<- character(length(myids)) > ans_seqnames[]<- "unknown" > ans_locs<- integer(length(myids)) > for (seqname in names(getSNPcount())) { > locs<- getSNPlocs(seqname) > ids<- paste("rs", locs$RefSNP_id, sep="") > myrows<- match(myids, ids) > ans_seqnames[!is.na(myrows)]<- seqname > ans_locs[!is.na(myrows)]<- locs$loc[myrows] > } > GRanges(seqnames=ans_seqnames, > IRanges(start=ans_locs, width=1), > RefSNP_id=myids) > } > > This takes between 3 and 5 minutes: > > > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", > "rs10915459", "rs16838750", "rs12128230", "rs999999999") > > mysnps<- makeGRangesFromRefSNPids(myids) > > mysnps # a GRanges object with 1 SNP per row > GRanges with 8 ranges and 1 elementMetadata value > seqnames ranges strand | myids > <rle> <iranges> <rle> |<character> > [1] ch1 [2195117, 2195117] * | rs7547453 > [2] ch1 [2291680, 2291680] * | rs2840542 > [3] ch1 [3256108, 3256108] * | rs1999527 > [4] ch1 [3577321, 3577321] * | rs4648545 > [5] ch1 [4230463, 4230463] * | rs10915459 > [6] ch1 [4404344, 4404344] * | rs16838750 > [7] ch1 [4501911, 4501911] * | rs12128230 > [8] unknown [ 0, 0] * | rs999999999 > > seqlengths > ch1 unknown > NA NA > > The next step is to create a TranscriptDb object with > makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart() > from the GenomicFeatures package. This TranscriptDb object will > contain the transcript locations and their associated > genes extracted from the annotation source you choose. > For example, if you want to use RefSeq genes: > > ## Takes about 3 minutes: > > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene") > > txdb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: refGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 37924 > | exon_nrow: 230024 > | cds_nrow: 204571 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010) > | GenomicFeatures version at creation time: 1.2.2 > | RSQLite version at creation time: 0.9-4 > | DBSCHEMAVERSION: 1.0 > > Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb > object: this means that later you will be able to use the org.Hs.eg.db > package to map your gene ids to their symbol (the org.*.eg.db packages > are Entrez Gene ID centric). > > To extract the transcript locations together with their genes: > > > tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id")) > > tx # a GRanges object with 1 transcript per row > GRanges with 37924 ranges and 1 elementMetadata value > seqnames ranges strand | gene_id > <rle> <iranges> <rle> |<compressedcharacterlist> > [1] chr1 [ 69091, 70008] + | 79501 > [2] chr1 [323892, 328581] + | 100133331 > [3] chr1 [323892, 328581] + | 100132287 > [4] chr1 [323892, 328581] + | 100132062 > [5] chr1 [367659, 368597] + | 81399 > [6] chr1 [367659, 368597] + | 729759 > [7] chr1 [367659, 368597] + | 26683 > [8] chr1 [763064, 789740] + | 643837 > [9] chr1 [861121, 879961] + | 148398 > ... ... ... ... ... ... > [37916] chrY [27177050, 27198251] - | 9083 > [37917] chrY [27177050, 27198251] - | 442867 > [37918] chrY [27177050, 27198251] - | 442868 > [37919] chrY [27209230, 27246039] - | 114761 > [37920] chrY [27209230, 27246039] - | 474150 > [37921] chrY [27209230, 27246039] - | 474149 > [37922] chrY [27329790, 27330920] - | 252949 > [37923] chrY [27329790, 27330920] - | 474152 > [37924] chrY [27329790, 27330920] - | 474151 > > seqlengths > chr1 chr2 ... chr18_gl000207_random > 249250621 243199373 ... 4262 > > Now you can use findOverlaps() on 'mysnps' and 'tx' to find > the transcripts hits by your snps. But before you can do this, > you need to rename the sequences in 'mysnps' because dbSNPs and > UCSC use different naming conventions for the chromosomes: > > > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps)) > > Then: > > > map<- as.matrix(findOverlaps(mysnps, tx)) > > 'map' contains the mapping between your SNPs and their genes but not > in a readable form (this matrix contains indices) so we make the > 'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for > the associated gene ids: > > > mapped_genes<- values(tx)$gene_id[map[, 2]] > > mapped_snps<- rep.int(values(mysnps)$myids[map[, 1]], > elementLengths(mapped_genes)) > > snp2gene<- unique(data.frame(snp_id=mapped_snps, > gene_id=unlist(mapped_genes))) > > rownames(snp2gene)<- NULL > > snp2gene[1:4, ] > snp_id gene_id > 1 rs7547453 6497 > 2 rs2840542 79906 > 3 rs1999527 63976 > 4 rs4648545 7161 > > Note that there is no guarantee that the number of rows in this > data frame is the number of your original SNP ids because the > relation between SNP ids and gene ids is of course not one-to-one. > > Also the method described above considers that a SNP hits a gene > if it's located between the start and the end of one of its > transcripts but it doesn't take in account the exon structure of > the transcripts. If you want to do this you need to use exonsBy() > (from GenomicFeatures) to extract the exons grouped by transcripts > (this will be stored in a GRangesList object) and use this object > instead of 'tx' in the call to findOverlaps(). > > Hope this helps, > H. > > >> >> >> >> Simon No??l >> CdeC >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 ------------------------------ Message: 22 Date: Thu, 9 Feb 2012 13:48:19 -0800 From: Michael Lawrence <lawrence.michael@gene.com> To: Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <caoq5nyfevw0_a-g_j65-ic9e67xcw7m9ri10ro2jnn8dfyv_pw at="" mail.gmail.com=""> Content-Type: text/plain Hi Steve, I had thought someone would want something like this, and it would be nice to able to parse a GTF file into something more useful than a GRanges, as rtracklayer supports now. So I'd definitely encourage you to go ahead, while recommending that it be based on the import.gtf function in rtracklayer. We really need to have a centralized track I/O package. I'm not sure why, but everyone seems intent on writing the same parser over and over again. Duplicated effort frustrates me. It doesn't have to be rtracklayer, but somewhere in the infrastructure we need reliable I/O that handles all of the corner cases. I've embarked on a time consuming effort of implementing a comprehensive test suite for rtracklayer, which should make it at least a base for such a package. Thanks a lot for volunteering this contribution, Michael On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou < mailinglist.honeypot at gmail.com> wrote: > Hi, > > I thought it would be handy to make a GenomicFeatures::TranscriptDb > from a gtf file and was somehow surprised to see that I couldn't find > such a function in GenomicFeatures. > > I'm happy to whip up such a function, but wanted to check in to make > sure I'm not missing something like (1) you can already do it; or (2) > it's not as straightforward as I think; (3) maybe it's there and I'm > not looking hard enough. > > Right now I just want to build it on the gff that the knew versions of > tophat build when you pass in a gtf/gff file of known transcripts (the > --G/--GTF cmd line arg), but I think it'd be handy overall. > > I don't think gff/gtf's have any column for exon ordering, though, in > which case I'd just assume the exons are ordered linearly (bye bye > trans-splicing)). > > Good idea? Bad idea? Already done? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] ------------------------------ Message: 23 Date: Thu, 9 Feb 2012 17:16:56 -0500 From: Steve Lianoglou <mailinglist.honeypot@gmail.com> To: Michael Lawrence <lawrence.michael at="" gene.com=""> Cc: "bioconductor at r-project.org list" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] GenomicFeatures::makeTranscriptDbFromGTF? Message-ID: <caha9mcn4bxggysrgoarbsqedu1zo6s3o+55yy3pjr3cij8du_g at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hey Michael, I'll see if I can take a crack at it this w/e. Also I'm in complete agreement w.r.t i/o stuff and your thoroughness w/ rtracklayer is very much appreciated. If we do have a centralized track i/o package that is outside of rtracklayer, I nominate we name it bIO ... or something ;-) ... yes, I can hear the crickets ... -steve On Thu, Feb 9, 2012 at 4:48 PM, Michael Lawrence <lawrence.michael at="" gene.com=""> wrote: > Hi Steve, > > I had thought someone would want something like this, and it would be nice > to able to parse a GTF file into something more useful than a GRanges, as > rtracklayer supports now. > > So I'd definitely encourage you to go ahead, while recommending that it be > based on the import.gtf function in rtracklayer. > > We really need to have a centralized track I/O package. I'm not sure why, > but everyone seems intent on writing the same parser over and over again. > Duplicated effort frustrates me. It doesn't have to be rtracklayer, but > somewhere in the infrastructure we need reliable I/O that handles all of the > corner cases. I've embarked on a time consuming effort of implementing a > comprehensive test suite for rtracklayer, which should make it at least a > base for such a package. > > Thanks a lot for volunteering this contribution, > Michael > > On Thu, Feb 9, 2012 at 7:55 AM, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: >> >> Hi, >> >> I thought it would be handy to make a GenomicFeatures::TranscriptDb >> from a gtf file and was somehow surprised to see that I couldn't find >> such a function in GenomicFeatures. >> >> I'm happy to whip up such a function, but wanted to check in to make >> sure I'm not missing something like (1) you can already do it; or (2) >> it's not as straightforward as I think; (3) maybe it's there and I'm >> not looking hard enough. >> >> Right now I just want to build it on the gff that the knew versions of >> tophat build when you pass in a gtf/gff file of known transcripts (the >> --G/--GTF cmd line arg), but I think it'd be handy overall. >> >> I don't think gff/gtf's have any column for exon ordering, though, in >> which case I'd just assume the exons are ordered linearly (bye bye >> trans-splicing)). >> >> Good idea? Bad idea? Already done? >> >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> ?| Memorial Sloan-Kettering Cancer Center >> ?| Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ------------------------------ Message: 24 Date: Thu, 9 Feb 2012 14:17:04 -0800 (PST) From: khadeeja ismail <hajjja@yahoo.com> To: "Bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: [BioC] Removing probes with low variance across samples (Infinium 450k) Message-ID: <1328825824.36437.YahooMailNeo at web114703.mail.gq1.yahoo.com> Content-Type: text/plain Hi, I have a question regarding the Illumina Human Methylation 450k array and the genefilter package. I used the 'nsfilter' function in gene filter to remove probes that have low variance across samples. When I checked the documentation for nsfilter, I found out that applying the function removes 50% of the probes by default. I computed the variance for each probe in the remaining probes and for the removed probes separately. When I plot the density for each set of variances, they overlap completely showing that both sets have most of their probes with variance close to zero and few with high variance. This leaves me wondering how nsfilter actually filters probes, as it doesn't appear from the plot that the probes with the lowest variances are removed. What would be the best way to filter out low variance probes in 450k data? If the default value in nsfilter is set to 50% assuming that 40% of genes in a cell are not expressed, what percentage cutoff can be used for methylation data? Would be great if anyone can explain it. Thanks, Khadeeja [[alternative HTML version deleted]] ------------------------------ Message: 25 Date: Thu, 09 Feb 2012 14:18:04 -0800 From: Hervé Pagès <hpages@fhcrc.org> To: Simon No?l <simon.noel.2 at="" ulaval.ca=""> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] RE : RE : maping SNPs Message-ID: <4F34461C.2050502 at fhcrc.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed On 02/09/2012 01:43 PM, Simon No?l wrote: > Thank's alot. It seem to work now:) But for the home made makeGRangesFromRefSNPids, it's not my work. You gave me that function ;) Sure, I remember now. No problem, you can keep it ;-) H. ------------------------------ Message: 26 Date: Fri, 10 Feb 2012 09:44:37 +1100 (AUS Eastern Daylight Time) From: Gordon K Smyth <smyth@wehi.edu.au> To: wang peter <wng.peter at="" gmail.com=""> Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> Subject: [BioC] Using write.table with output from topTags [was: report a possible bug of edgeR] Message-ID: <pine.wnt.4.64.1202100925590.4124 at="" pc765.wehi.edu.au=""> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Dear Peter (or Shan Gao), Short answer: You are using an old version of edgeR. You need to either install the current version of edgeR from Bioconductor or, with the version you have, you can use write.table(result$table, etc) instead of write.table(result, etc). Longer answer: This is not an edgeR bug. The topTags results object has been produced correctly. However, when you try to write it to a file using write.table(), you are trying to treat the results object as a data.frame, which it is not. Hence the error message. In the version of edgeR in the most recent Bioconductor release we added a method for coercing TopTags objects to data.frames, to allow people to do exactly what you're doing. Please have a look at the posting guide: http://www.bioconductor.org/help/mailing-list/posting-guide/ It is a good idea to include output from sessionInfo() in any post. Best wishes Gordon > Date: Wed, 8 Feb 2012 16:33:38 -0500 > From: wang peter <wng.peter at="" gmail.com=""> > To: bioconductor at r-project.org > Subject: [BioC] report a possible bug of edgeR > > i have two group of samples, > each group have 3 biological replicate > library(edgeR) > library(limma) > > raw.data <- read.table("111",row.names=1) > d <- raw.data[, 1:dim(raw.data)[2]-1 > group <- factor(c(rep("sap", 3),rep("vas", 3))) > d <- DGEList(counts = d, lib.size = > c(9893630,11055814,11207084,9663487,11455088,8140053), group = group) > d <- d[rowSums(d$counts) >= length(group)/2,] > > #normalization > d <- calcNormFactors(d) > > #To estimate common dispersion: > d <- estimateCommonDisp(d) > #To estimate tagwise dispersions: > d <- estimateTagwiseDisp(d) > > et <- exactTest(d) > result <- topTags(et, n=dim(d)[1]+1, adjust.method="BH", sort.by="logFC") > write.table(result,file = "DEresult",sep = "\t") > > Error in data.frame(table = list(logConc = c(-30.7452523676841, > -30.9164328563752, : > arguments imply differing number of rows: 92305, 1, 2 > > but write.table(result[1:92304,],file = "DEresult",sep = "\t") can work well > and then result[92305,] can also work well: > > logConc logFC P.Value FDR > UN089145 -16.96614 -4.874874e-05 1 1 > > > another problem is the result file > "table.logConc" "table.logFC" "table.P.Value" "table.adj.P.Val" "adjust.method" "comparison" > "UN038758" -30.7452523676841 -38.5416037522595 1.78446331842105e-37 2.74524811011424e-33 "BH" "sap" > > > "comparison" should include two group name like, "sap" vs "vas" > but why only one name? > > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} ------------------------------ Message: 27 Date: Thu, 9 Feb 2012 17:46:58 -0500 From: Ann Loraine <aloraine@gmail.com> To: bioconductor at r-project.org Subject: Re: [BioC] DESeq and transcript-wise analysis Message-ID: <cao=20estfpfzcokld6x0hj=iedfofw+jkvjc5r9-tkhk2wirdw at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hello, Is there anything that can assess differences at the individual splice site level? Testing at the isoform level is sometimes not useful because many, many factors go into determining the final structure of a fully-processed mRNA transcript, such as the transcription start site, cleavage at the three prime end, splicing of many internal introns, and so on. I think it is much more interesting and useful biologically to focus in on one or two aspects of transcription and test whether a condition or treatment has affected that one aspect. For example, I'd like to know if a condition or treatment affects individual splice site preference. ex): 5' V1 XXXXX------XXXXXXXXX 3' V2 XXXXX--------------XXXXX In this simple example, we have two isoforms arising from the same gene. To assess whether the condition has changed splicing, i.e., changed which splice site is preferred, then I need to know how many transcripts used acceptor site V1 versus acceptor site V2. I can do this by counting reads that align across the intron. A spliced read can support the V1 acceptor or the V2 acceptor, but never both. So I can be sure of which choice the spliced read represents. But let's say I have data from three treatments and three controls. How can I determine whether the treatment changed the ratio of V1 to V2 spliced reads? Can I do something like calculate the percentage of V1 reads overall and then compare the percentages using a t test? [[elided Yahoo spam]] Best wishes, Ann ------------------------------ Message: 28 Date: Thu, 9 Feb 2012 16:07:39 -0800 From: "Tim Triche, Jr." <tim.triche@gmail.com> Cc: "Bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] Removing probes with low variance across samples (Infinium 450k) Message-ID: <cac+n9buvsmxlbh_w4o8botx6p5m9yq3jzht=um7dqw0mt4nnpq at="" mail.gmail.com=""> Content-Type: text/plain Genomic DNA -- what you're assaying on these arrays, or at least what they're designed for -- need not be expressed. It's just... there, chopped up after extraction, bisulfite conversion, and whole-genome amplification, waiting to hybridize. Thus nsfilter's fundamental assumption -- that some large fraction of the probes on the array are in fact pure noise -- is violated. It may be that there is (almost always) local correlation between probes within +/- 1kb of each other, but if the protocols for these arrays are followed carefully, you can expect better than 99% of the probes to hybridize (which is NOT the case with expression arrays, and you would not expect 99% of the genome to align in an RNAseq experiment either). So the decision of how many probes to retain then comes down to your judgment. Biological annotation (e.g. from ChIP-seq peak calls for histone marks, transcription factors, or physical interactions) can become very useful in making sense of these data. If you lack normal samples (or don't know which ones are "normal") it is possible to see low variability in regions which are consistently aberrant, so that may not always be the best approach. I find the GenomicRanges, GenomicFeatures, and rtracklayer packages useful for this type of annotation, FWIW. Hope this helps, --t > Hi, > > I have a question regarding the Illumina Human Methylation 450k array and > the genefilter package. > I used the 'nsfilter' function in gene filter to remove probes that have > low variance across samples. When I checked the documentation for nsfilter, > I found out that applying the function removes 50% of the probes by > default. > I computed the variance for each probe in the remaining probes and for the > removed probes separately. When I plot the density for each set of > variances, they overlap completely showing that both sets have most of > their probes with variance close to zero and few with high variance. > This leaves me wondering how nsfilter actually filters probes, as it > doesn't appear from the plot that the probes with the lowest variances are > removed. > What would be the best way to filter out low variance probes in 450k data? > If the default value in nsfilter is set to 50% assuming that 40% of genes > in a cell are not expressed, what percentage cutoff can be used for > methylation data? > Would be great if anyone can explain it. > > Thanks, > Khadeeja > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ------------------------------ Message: 29 Date: Thu, 9 Feb 2012 19:32:53 -0500 From: Sean Davis <sdavis2@mail.nih.gov> To: Salwa Eid <salwaeid at="" hotmail.com=""> Cc: bioconductor at r-project.org Subject: Re: [BioC] how can i convert a gse object(expression sets) to affybatch Message-ID: <caneavbnmmvkbrvqlc39uexbqxfs5unw3l8rhfccyzop=vkpf+a at="" mail.gmail.com=""> Content-Type: text/plain; charset=UTF-8 On Thu, Feb 9, 2012 at 2:35 PM, Salwa Eid <salwaeid at="" hotmail.com=""> wrote: > > > > > Dear all, ? ?I have read cel files from ncbi website using the getGEO from GEOquery package. ?I want then to use the combineaffybatch from the matchprobe package but the objects I have to pass is affybatch. ?I need to convert the gse object or the expression sets to affybatch to be able to use it. ?Any ideas? Regards,salwa > Hi, Salwa. You cannot convert a GSE object or ExpressionSet into an affybatch since the probe-level data are not contained in either object. The getGEO function downloads the normalized data from GEO, not the .CEL files. You want to use the getGEOSuppFiles() function instead. This will download raw data, typically in .tar format. You can use the untar() function to then expand the archive. Finally, you can use the affy package to read the .CEL.gz files to get an affybatch. Hope that helps. Sean ------------------------------ Message: 30 Date: Thu, 9 Feb 2012 19:46:28 -0500 From: somnath bandyopadhyay <genome1976@hotmail.com> To: <bioconductor at="" r-project.org=""> Subject: [BioC] ConsensusClusterPlus extracting features for clusters Message-ID: <snt108-w1374f3fdc88f4315e7a936cd780 at="" phx.gbl=""> Content-Type: text/plain Hi, Could anybody suggest a way of extracting features/genes associated with related clusters from the results file obtained from ConsensusClusterPlus?Any help would be greatly appreciated. Thanks,Som. [[alternative HTML version deleted]] ------------------------------ Message: 31 Date: Fri, 10 Feb 2012 12:00:12 +1100 (EST) From: Dario Strbenac <d.strbenac@garvan.org.au> To: bioconductor at r-project.org Subject: [BioC] limma barcodeplot with 2 groups Message-ID: <20120210120012.BQA38016 at gimr.garvan.unsw.edu.au> Content-Type: text/plain; charset="us-ascii" Hello, The labels that I get are still on the side, as for a 1-sample plot. Would it make more sense graphically to have them at the top and botton of the graphic for 2-sample plots ? In the current format, I cannot tell which label relates to which colour. I have limma_3.10.2 -------------------------------------- Dario Strbenac Research Assistant Cancer Epigenetics Garvan Institute of Medical Research Darlinghurst NSW 2010 Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: barcode2groups.png Type: image/png Size: 10527 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20120210="" 52195af4="" attachment-0001.png=""> ------------------------------ Message: 32 Date: Fri, 10 Feb 2012 01:13:30 -0800 From: Bogdan Tanasa <tanasa@gmail.com> To: <bioconductor at="" stat.math.ethz.ch="">, <bioc-sig-sequencing at="" r-project.org=""> Subject: [BioC] bioMart Message-ID: <ca+jem00hsjviz9rfe0k-sdmy=8diaj7cp6t9hltdbr3nwfaxbg at="" mail.gmail.com=""> Content-Type: text/plain Dear all, given current database updates, just to make sure, when using BioMart, in the command below >mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") how shall I access the homo sapiens ensembl database that is quivalent to UCSC genome hg18 ? thanks ! Bogdan [[alternative HTML version deleted]] ------------------------------ Message: 33 Date: Fri, 10 Feb 2012 02:22:13 -0800 (PST) From: "Richard Coulson [guest]" <guest@bioconductor.org> To: bioconductor at r-project.org, richard.coulson at cimr.cam.ac.uk Subject: [BioC] Inconsistent coefficient values Message-ID: <20120210102213.D70D31255EF at mamba.fhcrc.org> Hi, I have a problem with using 'limma' when I'm analysing some microarray data. If I run the below code WITHOUT setting a seed, I get slightly different values for the coefficients each time it's run; however this problem does not occur if I do set one (e.g. set.seed(1223762671)) :- raw.data <-ReadAffy( celfile.path="CEL directory" ) normalised.data <-vsnrma(raw.data) transfect.lmFit <-lmFit( normalised.data, design.matrix ) cont.lmFit <-contrasts.fit(transfect.lmFit, cont.matrix) i.e. the values in cont.lmFit$coefficients are altered from one R session to another. Please could anyone help with this? Many thanks, Richard. -- output of sessionInfo(): > sessionInfo() R version 2.12.2 (2011-02-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] hugene11stv1cdf_1.26.0 limma_3.4.4 vsn_3.16.0 [4] affyPLM_1.24.0 preprocessCore_1.10.0 gcrma_2.20.0 [7] affy_1.26.1 Biobase_2.8.0 loaded via a namespace (and not attached): [1] affyio_1.16.0 Biostrings_2.16.9 grid_2.12.2 IRanges_1.6.8 [5] lattice_0.19-17 splines_2.12.2 tools_2.12.2 > cont.matrix Contrasts Levels (C.GFP.24+C.GFP.48+C.GFP.72)-(mock.24+mock.48+mock.72) C.GFP.24 1 C.GFP.48 1 C.GFP.72 1 mock.24 -1 mock.48 -1 mock.72 -1 myc.24 0 myc.48 0 myc.72 0 N.GFP.24 0 N.GFP.48 0 N.GFP.72 0 untransfected.0 0 Contrasts Levels (N.GFP.24+N.GFP.48+N.GFP.72)-(mock.24+mock.48+mock.72) C.GFP.24 0 C.GFP.48 0 C.GFP.72 0 mock.24 -1 mock.48 -1 mock.72 -1 myc.24 0 myc.48 0 myc.72 0 N.GFP.24 1 N.GFP.48 1 N.GFP.72 1 untransfected.0 0 Contrasts Levels (myc.24+myc.48+myc.72)-(mock.24+mock.48+mock.72) C.GFP.24 0 C.GFP.48 0 C.GFP.72 0 mock.24 -1 mock.48 -1 mock.72 -1 myc.24 1 myc.48 1 myc.72 1 N.GFP.24 0 N.GFP.48 0 N.GFP.72 0 untransfected.0 0 -- Sent via the guest posting facility at bioconductor.org. ------------------------------ Message: 34 Date: Fri, 10 Feb 2012 11:40:04 +0100 From: "Manuela Di Russo" <manuela.dirusso@for.unipi.it> To: <bioconductor at="" r-project.org=""> Subject: [BioC] a problem in reading in cel files Message-ID: <fd438ece49ff463591d66ef5526f3119 at="" maanalysis2=""> Content-Type: text/plain Dear all, I am learning to analyse Affymetrix microarray data but I have a problem in reading .cel files in. I downloaded from GEO the raw data provided as supplementary files (GSE12345_RAW.tar), than I have extracted the cel files in a directory which I have set as my working directory. Here is the R code I used: > setwd("C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW") > library(affy) Carico il pacchetto richiesto: Biobase Welcome to Bioconductor Vignettes contain introductory material. To view, type 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation("pkgname")'. > dir() [1] "data analysis.txt" "E-GEOD-12345.sdrf.txt" "E-GEOD-12345.sdrf.xls" [4] "GSM309986.CEL" "GSM309987.CEL" "GSM309988.CEL" [7] "GSM309989.CEL" "GSM309990.CEL" "GSM309991.CEL" [10] "GSM310012.CEL" "GSM310013.CEL" "GSM310014.CEL" [13] "GSM310015.CEL" "GSM310016.CEL" "GSM310068.CEL" [16] "GSM310070.CEL" "target.txt" "target.xls" > pd <- read.AnnotatedDataFrame("target.txt",header=TRUE,row.names=1,a s.is=TRUE) > pData(pd) FileName Target N1 GSM309986.CEL pleural tissue N2 GSM309987.CEL pleural tissue N3 GSM309988.CEL pleural tissue N4 GSM309989.CEL pleural tissue MM1 GSM309990.CEL mesothelioma tissue MM2 GSM309991.CEL mesothelioma tissue MM3 GSM310012.CEL mesothelioma tissue MM4 GSM310013.CEL mesothelioma tissue MM5 GSM310014.CEL mesothelioma tissue MM6 GSM310015.CEL mesothelioma tissue MM7 GSM310016.CEL mesothelioma tissue MM8 GSM310068.CEL mesothelioma tissue MM9 GSM310070.CEL mesothelioma tissue > rawData <- read.affybatch(filenames=pData(pd)$FileName,phenoData=pd) Error in try(.Call("ReadHeaderDetailed", filename, PACKAGE = "affyio")) : Is GSM310016.CEL really a CEL file? tried reading as text, gzipped text, binary, gzipped binary, command console and gzipped command console formats. Errore in read.celfile.header(filenames[i], info = "full") : Failed to get full header information for GSM310016.CEL > rawData1<-ReadAffy() Error in try(.Call("ReadHeaderDetailed", filename, PACKAGE = "affyio")) : Is C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW/GSM310016.CEL really a CEL file? tried reading as text, gzipped text, binary, gzipped binary, command console and gzipped command console formats. Errore in read.celfile.header(filenames[i], info = "full") : Failed to get full header information for C:/BACKUP/Dati/Progetti/Landi/meta-analisi MPM/GSE12345_RAW/GSM310016.CEL > sessionInfo() R version 2.14.1 (2011-12-22) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 [3] LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C [5] LC_TIME=Italian_Italy.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] affy_1.32.1 Biobase_2.14.0 loaded via a namespace (and not attached): [1] affyio_1.22.0 BiocInstaller_1.2.1 preprocessCore_1.16.0 [4] zlibbioc_1.0.0 > traceback() 7: stop("Failed to get full header information for ", filename) 6: read.celfile.header(filenames[i], info = "full") 5: FUN(1:13[[11L]], ...) 4: lapply(X = X, FUN = FUN, ...) 3: sapply(seq_len(length(filenames)), function(i) { sdate <- read.celfile.header(filenames[i], info = "full")[["ScanDate"]] if (is.null(sdate) || length(sdate) == 0) NA_character_ else sdate }) 2: read.affybatch(filenames = l$filenames, phenoData = l$phenoData, description = l$description, notes = notes, compress = compress, rm.mask = rm.mask, rm.outliers = rm.outliers, rm.extra = rm.extra, verbose = verbose, sd = sd, cdfname = cdfname) 1: ReadAffy() May be there is a problem in reading the cel file header, so I opened one of the cel files with a text-editor but it seems correct. Can anyone help me? Thank you very much! Manuela ---------------------------------------------------------------------- ------------------ Manuela Di Russo, Ph.D. Student Department of Experimental Pathology, MBIE University of Pisa Pisa, Italy e-mail: manuela.dirusso at for.unipi.it tel: +39050993538 [[alternative HTML version deleted]] ------------------------------ Message: 35 Date: Fri, 10 Feb 2012 11:55:47 +0100 From: James Perkins <jperkins@biochem.ucl.ac.uk> To: Pepijn Kooij <pkooij at="" bio.ku.dk=""> Cc: bioconductor at r-project.org Subject: Re: [BioC] NormqPCR and ReadqPCR Message-ID: <caljpjm2ojypf10h-don1_xhkrrgutt1e2uoiwzt15e_uwkvkna at="" mail.gmail.com=""> Content-Type: text/plain HI Pepijn Thanks for your interest. >My question now is, is there a way or package to go further from this? As in, I am interested to see which of the genes have a foldchange above a threshhold of 1, is there a way to see that? By a "foldchange threshold of 1", do you mean a doubling or halving? i.e. |logFC | > 1? If so you can find everything in your ddCt output data.frame that meets this criteria and look at only these genes. ddCtRes[abs(log2(as.numeric(as.vector(ddCtRes$`2^-ddCt`)))) > 1,] For the gene IDs alone this will work: ddCtRes[abs(log2(as.numeric(as.vector(ddCtRes$`2^-ddCt`)))) > 1,"ID"] This is assuming you have done something with the undetermined values, if you have "+"s and "-"s for your 2^ddCt values it gets a bit more involved. >And, is it possible to plot the 2^??????Ct values in a bargraph with SD's or SE's? I hope you can help me out with this, It is possible, you could use something like barplot2 from gplots and plot the min 2^-ddCt.min and max values 2^-ddCt.max as upper and lower confidence intevals. The min and max values are calculated along the lines of http://www.ncbi.nlm.nih.gov/pubmed/11846609, and the error bars will be assymetric if you plot the x axis in non-log space, since the Ct values from which SD worked out on arithmetic scale but then gets 2^ddCT+/-SD. arguments to barplot2 something like: plot.ci=TRUE, ci.l=MIN2^DDCTS, ci.u=MIN2^DDCTS, Hope this helps, let me know if you have any issues with the plotting, and if you want to send me a more concrete example I can give some more focused code. Cheers, Jim On 8 February 2012 16:32, Pepijn Kooij <pkooij at="" bio.ku.dk=""> wrote: > Dear all > > My name is Pepijn Kooij, PhD student at the Centre for Social Evolution, > Copenhagen University. > I was grateful to find the ReadqPCR and NormqPCR packages to analyze my > rt-qPCR data. I managed to use both packages, loading in my data and > normalizing it. In the end using the deltaDeltaCt() function I am able to > retrieve the table with all the data I need. > My question now is, is there a way or package to go further from this? As > in, I am interested to see which of the genes have a foldchange above a > threshhold of 1, is there a way to see that? > And, is it possible to plot the 2^??????Ct values in a bargraph with SD's or > SE's? > I hope you can help me out with this, > > Best regards > Pepijn Kooij > _______________________________________ > Pepijn Kooij, PhD student > Centre for Social Evolution, Department of Biology > University of Copenhagen > Universitetsparken 15, bygning 12 > 2100 Copenhagen ?? > Denmark > > Tel: (+45) 35 32 13 41 > Email: pkooij at bio.ku.dk > CSE website: http://www.bi.ku.dk/cse > > > [[alternative HTML version deleted]] > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] ------------------------------ _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor End of Bioconductor Digest, Vol 108, Issue 10

SNP Transcription Sequencing RNASeq miRNA Microarray Annotation Normalization GO Cancer • 4.7k views

ADD COMMENT • link updated 12.2 years ago by Wolfgang Huber ★ 13k • written 12.2 years ago by Polikepahad, Sumanth ▴ 10

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi, First, please do not reply to the entirety of a "digest" email. Your email had ~ 5 sentences worth of a real, and about 10 pages worth of junk attached to it. Now: On Fri, Feb 10, 2012 at 6:19 PM, Polikepahad, Sumanth <polikepa at="" bcm.edu=""> wrote: > Hi, > > I have analyzed my deep sequencing data with DESeq and successfully generated a heatmap showing differentially expressed miRNAs by using gplots heatmap.2. However, I want to put different colors to the fonts of miRNA names that are shown on y-axis, depending on whether they are up or down regulated. ?For example, I want to show the names of all down-regulated miRNAs in blue and up-regulated in Red. How do I do this? I guess I must use if else statement, but not sure if its the right way to do it. I am very new to R and I really appreciate any help. It unfortunately looks like there is no easy way to change the colors of the labels using the heatmap.2 function. This is untested, but one thing you can do is copy the entirety of the function into your own, say "myheatmap.2" function, and add labRowCol and labColCol arguments which would be vectors of colors for each element in your row and columns. Then you'd have to find the line of code that looks like so: ## add labels axis(1, 1:nc, labels= labCol, las= 2, line= -0.5, tick= 0, cex.axis= cexCol) if(!is.null(xlab)) mtext(xlab, side = 1, line = margins[1] - 1.25) axis(4, iy, labels= labRow, las= 2, line= -0.5, tick= 0, cex.axis= cexRow) if(!is.null(ylab)) mtext(ylab, side = 4, line = margins[2] - 1.25) And change them to something like: axis(1, 1:nc, labels= labCol, las= 2, line= -0.5, tick= 0, cex.axis= cexCol, col=labColCol[colInd]) if(!is.null(xlab)) mtext(xlab, side = 1, line = margins[1] - 1.25) axis(4, iy, labels= labRow, las= 2, line= -0.5, tick= 0, cex.axis= cexRow, col=labRowCol[rowInd]) if(!is.null(ylab)) mtext(ylab, side = 4, line = margins[2] - 1.25) Or something close to that. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 12.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 16 days ago

EMBL European Molecular Biology Laborat…

Dear Sumanth, besides Steve's suggestion (*), you could also have a look at the 'ColSideColors', 'RowSideColors' arguments of 'heatmap.2'. That might even be easier to read than colored text. An example for that is in Fig.1 here: http://www- huber.embl.de/arrayQualityMetrics/Report_for_nCCl4_with_RIN/ On a more advanced note..: also have a look at the example (manual page) of the 'dendrogramGrob' function in the 'latticeExtra' package, this provides some elements out of which you can construct your own heatmap-like displays. (*) Besides copying, editing and sourcing a function's source code, you can also edit the function object itself, e.g. the function "trace" offers an interface to inject your own lines into "someone else's" function programmatically. This is probably not all that relevant for your immediate problem, but may be useful for some (e.g. if you want to modify the other function, rather than create a copy with a different name and/or in a different name space.) Hope this helps Wolfgang Feb/11/12 12:19 AM, Polikepahad, Sumanth scripsit:: > Hi, > > I have analyzed my deep sequencing data with DESeq and successfully > generated a heatmap showing differentially expressed miRNAs by using > gplots heatmap.2. However, I want to put different colors to the > fonts of miRNA names that are shown on y-axis, depending on whether > they are up or down regulated. For example, I want to show the names > of all down-regulated miRNAs in blue and up-regulated in Red. How do > I do this? I guess I must use if else statement, but not sure if its > the right way to do it. I am very new to R and I really appreciate > any help. > > Thanks in advance. Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD COMMENT • link 12.2 years ago Wolfgang Huber ★ 13k

Login before adding your answer.