BAM files to Genomic Ranges object

0

Entering edit mode

José Luis Lavín ▴ 280

@jose-luis-lavin-5529

Last seen 11.4 years ago

Hello everybody, I've been told about a very nice R package called Repitools. It has many interesting features but the use of some of them remains a little unclear to me yet. There are some subjects about this tool I'll have to ask about, in this list, but first of all I need to convert some BAM files into GRanges objects. To do so I read about BAM2GenomicRanges function. Reading the package documentation, I tried to convert 3 BAM files into GRanges. I used the following instructions: *(where "/path/to/ANALYSIS/folder/" is the directory where my BAM files are) bgr <- BAM2GRanges('/path/to/ANALYSIS/folder/', what = c("rname", "strand", "pos", "qwidth"), flag = scanBamFlag(isUnmappedQuery = FALSE, isDuplicate = FALSE), verbose = TRUE) And I get this error message: Reading BAM file AW-07. [bam_header_read] bgzf_check_EOF: Invalid argument [bam_header_read] invalid BAM binary header (this is not a BAM file). Error in open.BamFile(BamFile(file, index), "rb") : SAM/BAM header missing or empty file: '/path/to/ANALYSIS/folder/' The BAM files I'm using here where produced by a tophat alignment and used in R previously for other analysis with this R code: library(Rsamtools) library(GenomicFeatures) bamlist=list() src_files=list.files(pattern="*.bam") for(filename in src_files) { tmp=readBamGappedAlignments(filename) bamlist[[length(bamlist)+1]]=GRanges(seqnames=rname(tmp), ranges=IRanges(start=start(tmp),end=end(tmp)), strand=rep("*",length(tmp))) } names(bamlist)=src_files My question is, Why are now considered as invalid BAM files when they previously worked as correct BAM files? Am I mistaken about the meaning of the error message?Did I miss something in the code? Thanks in advance for your kind help. JL

Alignment convert Repitools Alignment convert Repitools • 8.5k views

ADD COMMENT • link updated 2.2 years ago by Marge • 0 • written 13.0 years ago by José Luis Lavín ▴ 280

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 22 hours ago

United States

On 01/09/2013 06:08 AM, jluis.lavin at unavarra.es wrote: > Hello everybody, > > I've been told about a very nice R package called Repitools. It has many > interesting features but the use of some of them remains a little unclear > to me yet. > There are some subjects about this tool I'll have to ask about, in this > list, but first of all I need to convert some BAM files into GRanges > objects. > To do so I read about BAM2GenomicRanges function. Reading the package > documentation, I tried to convert 3 BAM files into GRanges. I used the > following instructions: > > *(where "/path/to/ANALYSIS/folder/" is the directory where my BAM files are) > > bgr <- BAM2GRanges('/path/to/ANALYSIS/folder/', Hi -- I'm not a regular user of BAM2GRanges, but I think the first argument is supposed to be the path to the bam file, not to the directory in which the bam file is located. After > example(BAM2GRanges) compare > BAM2GRanges(tiny.BAM) with > BAM2GRanges(dirname(tiny.BAM)) Martin > what = c("rname", "strand", "pos", "qwidth"), > flag = scanBamFlag(isUnmappedQuery = FALSE, isDuplicate = FALSE), > verbose = TRUE) > > And I get this error message: > > Reading BAM file AW-07. > [bam_header_read] bgzf_check_EOF: Invalid argument > [bam_header_read] invalid BAM binary header (this is not a BAM file). > Error in open.BamFile(BamFile(file, index), "rb") : > SAM/BAM header missing or empty > file: '/path/to/ANALYSIS/folder/' > > The BAM files I'm using here where produced by a tophat alignment and used > in R previously for other analysis with this R code: > > library(Rsamtools) > library(GenomicFeatures) > > bamlist=list() > src_files=list.files(pattern="*.bam") > > for(filename in src_files) > { > tmp=readBamGappedAlignments(filename) > bamlist[[length(bamlist)+1]]=GRanges(seqnames=rname(tmp), > ranges=IRanges(start=start(tmp),end=end(tmp)), > strand=rep("*",length(tmp))) > } > > > names(bamlist)=src_files > > My question is, Why are now considered as invalid BAM files when they > previously worked as correct BAM files? Am I mistaken about the meaning of > the error message?Did I miss something in the code? > > Thanks in advance for your kind help. > > JL > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 13.0 years ago Martin Morgan 25k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 hours ago

Seattle, WA, United States

Hi JL, On 01/09/2013 06:08 AM, jluis.lavin at unavarra.es wrote: > Hello everybody, > > I've been told about a very nice R package called Repitools. It has many > interesting features but the use of some of them remains a little unclear > to me yet. > There are some subjects about this tool I'll have to ask about, in this > list, but first of all I need to convert some BAM files into GRanges > objects. > To do so I read about BAM2GenomicRanges function. Reading the package > documentation, I tried to convert 3 BAM files into GRanges. I used the > following instructions: > > *(where "/path/to/ANALYSIS/folder/" is the directory where my BAM files are) > > bgr <- BAM2GRanges('/path/to/ANALYSIS/folder/', > what = c("rname", "strand", "pos", "qwidth"), oops, using the qwidth for getting the widths of the genomic ranges is in general incorrect, unless one assumes that all the cigars are trivial (Ms only). But that's really a strong assumption and it will almost never be true, because AFAIK most aligners introduce indels by default during the alignment process. Tophat in particular, is one of them. Note that loading a BAM file into a GRanges object can easily be done with readGappedAlignments() (from the GenomicRanges package), followed by coercion of the returned GappedAlignments object to a GRanges object: gal <- readGappedAlignments("/path/to/your/file.bam") as(gal, "GRanges") It will infer the correct width of the genomic ranges based on the cigar. > flag = scanBamFlag(isUnmappedQuery = FALSE, isDuplicate = FALSE), > verbose = TRUE) > > And I get this error message: > > Reading BAM file AW-07. > [bam_header_read] bgzf_check_EOF: Invalid argument > [bam_header_read] invalid BAM binary header (this is not a BAM file). > Error in open.BamFile(BamFile(file, index), "rb") : > SAM/BAM header missing or empty > file: '/path/to/ANALYSIS/folder/' > > The BAM files I'm using here where produced by a tophat alignment and used > in R previously for other analysis with this R code: > > library(Rsamtools) > library(GenomicFeatures) > > bamlist=list() > src_files=list.files(pattern="*.bam") > > for(filename in src_files) > { > tmp=readBamGappedAlignments(filename) > bamlist[[length(bamlist)+1]]=GRanges(seqnames=rname(tmp), > ranges=IRanges(start=start(tmp),end=end(tmp)), > strand=rep("*",length(tmp))) > } > > > names(bamlist)=src_files That code looks correct, but I would rather do: bam_files <- list.files(pattern="*.bam") gr_list <- lapply(bam_files, function(bam_file) as(readGappedAlignments(bam_file), "GRanges")) names(gr_list) <- bam_files Cheers, H. > > My question is, Why are now considered as invalid BAM files when they > previously worked as correct BAM files? Am I mistaken about the meaning of > the error message?Did I miss something in the code? > > Thanks in advance for your kind help. > > JL > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.0 years ago Hervé Pagès 16k

0

Entering edit mode

First of all I want to thank Herve Pages and Martin Morgan for their kind answers. And I want to apologize for this delay answering back, but I was offline for a few days... I used Herve code to read my BAM files into GRanges objects, but I don't seem to be able to use that Granges object into the cpgDensityCalc function. This is how I try to do it: -(My files are in the working directory already) library(Mus.musculus) library(Repitools) library(GenomicRanges) library(Rsamtools) bam_files <- list.files(pattern="*.bam") gr_list <- lapply(bam_files, function(bam_file) as(readGappedAlignments(bam_file), "GRanges")) names(gr_list) <- bam_files require(BSgenome.Mmusculus.UCSC.mm9) cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) -And I get the following error: Error in function (classes, fdef, mtable) : unable to find an inherited method for function "cpgDensityCalc", for signature "character", "BSgenome" What does this error mean? Can't "gr_list" be used as input for cpgDensityCalc? Do my R installation lack any module required to perform this analysis? *I'm awfully sorry for my lack of intermediate-advance R scripting knowledge, I'm trying to fix that... Thanks in advance, once more JL sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Mus.musculus_1.0.0 [2] TxDb.Mmusculus.UCSC.mm10.ensGene_2.8.0 [3] org.Mm.eg.db_2.8.0 [4] GO.db_2.8.0 [5] RSQLite_0.11.2 [6] DBI_0.2-5 [7] OrganismDbi_1.0.2 [8] GenomicFeatures_1.10.1 [9] AnnotationDbi_1.20.3 [10] Biobase_2.18.0 [11] Rsamtools_1.10.2 [12] BSgenome.Hsapiens.UCSC.hg18_1.3.19 [13] BiocInstaller_1.8.3 [14] Repitools_1.4.0 [15] BSgenome.Mmusculus.UCSC.mm9_1.3.19 [16] BSgenome_1.26.1 [17] Biostrings_2.26.2 [18] GenomicRanges_1.10.5 [19] IRanges_1.16.4 [20] BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 bitops_1.0-5 edgeR_3.0.8 graph_1.36.1 [5] limma_3.14.3 parallel_2.15.1 RBGL_1.34.0 RCurl_1.95-3 [9] rtracklayer_1.18.2 stats4_2.15.1 tools_2.15.1 XML_3.95-0.1 [13] zlibbioc_1.4.0 -- Dr. Jos? Luis Lav?n Trueba Dpto. de Producci?n Agraria Grupo de Gen?tica y Microbiolog?a Universidad P?blica de Navarra 31006 Pamplona Navarra SPAIN

ADD REPLY • link 13.0 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hi, On Mon, Jan 14, 2013 at 5:20 AM, <jluis.lavin at="" unavarra.es=""> wrote: > First of all I want to thank Herve Pages and Martin Morgan for their kind > answers. And I want to apologize for this delay answering back, but I was > offline for a few days... > > I used Herve code to read my BAM files into GRanges objects, but I don't > seem to be able to use that Granges object into the cpgDensityCalc > function. > This is how I try to do it: > > -(My files are in the working directory already) > > library(Mus.musculus) > library(Repitools) > library(GenomicRanges) > library(Rsamtools) > > bam_files <- list.files(pattern="*.bam") > gr_list <- lapply(bam_files, > function(bam_file) > as(readGappedAlignments(bam_file), "GRanges")) > names(gr_list) <- bam_files > > require(BSgenome.Mmusculus.UCSC.mm9) > > cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) > > -And I get the following error: > > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function "cpgDensityCalc", for > signature "character", "BSgenome" > > What does this error mean? It means that your `gr_list` isn't what you think it is -- it's a character, perhaps "try-error"? Look at `head(gr_list)` ... what do you see? > *I'm awfully sorry for my lack of intermediate-advance R scripting > knowledge, I'm trying to fix that... Reading through an Introduction to R might be helpful, if you haven't done so already. Bioconductor uses the S4 class system in R, and it would be helpful to understand some parts of that too. If you look. If you look at previous bioc courses online, you will often find that Martin does a quick intro to these things at the beginning of many sessions: http://bioconductor.org/help/course-materials/ Look for courses that were run at The Hutch (in Seattle) Perhaps the intro slides here will bear fruit: http://bioconductor.org/help/course-materials/2010/SeattleIntro/ -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi JL, On 01/14/2013 08:27 AM, Steve Lianoglou wrote: > Hi, > > On Mon, Jan 14, 2013 at 5:20 AM, <jluis.lavin at="" unavarra.es=""> wrote: >> First of all I want to thank Herve Pages and Martin Morgan for their kind >> answers. And I want to apologize for this delay answering back, but I was >> offline for a few days... >> >> I used Herve code to read my BAM files into GRanges objects, but I don't >> seem to be able to use that Granges object into the cpgDensityCalc >> function. >> This is how I try to do it: >> >> -(My files are in the working directory already) >> >> library(Mus.musculus) >> library(Repitools) >> library(GenomicRanges) >> library(Rsamtools) >> >> bam_files <- list.files(pattern="*.bam") >> gr_list <- lapply(bam_files, >> function(bam_file) >> as(readGappedAlignments(bam_file), "GRanges")) >> names(gr_list) <- bam_files >> >> require(BSgenome.Mmusculus.UCSC.mm9) >> >> cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) >> >> -And I get the following error: >> >> Error in function (classes, fdef, mtable) : >> unable to find an inherited method for function "cpgDensityCalc", for >> signature "character", "BSgenome" >> >> What does this error mean? > > It means that your `gr_list` isn't what you think it is -- it's a > character, perhaps "try-error"? When running your code, I get: > cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) Error in function (classes, fdef, mtable) : unable to find an inherited method for function ?cpgDensityCalc? for signature ?"list", "BSgenome"? Not exactly the same error you get: see "list" in the error message I get, versus "character" in the error message you get? Not sure why your 'gr_list' ends up being a character vector instead of a list. Anyway, having it being a list (an ordinary list) doesn't seem to work either. The reason the call to cpgDensityCalc() fails is because this is a generic function and there are no "cpgDensityCalc" methods that will accept the first argument to be a list: > cpgDensityCalc standardGeneric for "cpgDensityCalc" defined from package "Repitools" function (x, organism, ...) standardGeneric("cpgDensityCalc") <environment: 0x717c2a0=""> Methods may be defined for arguments: x, organism Use showMethods("cpgDensityCalc") for currently available ones. > showMethods("cpgDensityCalc") Function: cpgDensityCalc (package Repitools) x="data.frame", organism="BSgenome" x="GRanges", organism="BSgenome" x="GRangesList", organism="BSgenome" Note that the type of input expected by the function is explained in its man page (?cpgDensityCalc): x: A ?data.frame?, with columns ?chr? and ?position?, or columns ?chr?, ?start?, ?end?, and ?strand?. Also may be a ?GRangesList? object, or ?GRanges?. So once you've sorted out the reasons why you get a character instead of a list of GRanges (I've no clue how this could happen), you'll need to turn that ordinary list into a GRangesList object with: gr_list <- GRangesList(gr_list) before calling cpgDensityCalc(). Hope this helps, H. > > Look at `head(gr_list)` ... what do you see? > >> *I'm awfully sorry for my lack of intermediate-advance R scripting >> knowledge, I'm trying to fix that... > > Reading through an Introduction to R might be helpful, if you haven't > done so already. > > Bioconductor uses the S4 class system in R, and it would be helpful to > understand some parts of that too. If you look. If you look at > previous bioc courses online, you will often find that Martin does a > quick intro to these things at the beginning of many sessions: > > http://bioconductor.org/help/course-materials/ > > Look for courses that were run at The Hutch (in Seattle) > > Perhaps the intro slides here will bear fruit: > > http://bioconductor.org/help/course-materials/2010/SeattleIntro/ > > -steve > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.0 years ago Hervé Pagès 16k

0

Entering edit mode

First of all I want to thank Herve and Steve for their attention to my questions this time. @Steve, Thanks for the links to Bioconductor manuals, I'll try to read and understand that info as fast & accurately as I'm capable of, trying not to ask very elementary questions in this list ;) @Herve, Thank you once again for your answer. Once my files were converted into a GRangesList object, as you said, everything seemed to work fine. I also closed my previous R session and started a new one, which may have helped if the other session was corrupted somehow, returning the previous nosense error. Here's the final code that works fine in my computer, if its of any help for future users ;) library(Mus.musculus) library(Repitools) library(GenomicRanges) library(Rsamtools) bam_files <- list.files(pattern="*.bam") gr_list <- lapply(bam_files, function(bam_file) as(readGappedAlignments(bam_file), "GRanges")) names(gr_list) <- bam_files gr_list <- GRangesList(gr_list) require(BSgenome.Mmusculus.UCSC.mm9) cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) cpgplot <- cpgDensityPlot(gr_list, seq.len=300, organism=Mmusculus, lwd=4, verbose=TRUE) sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 [2] BSgenome_1.26.1 [3] Rsamtools_1.10.2 [4] Biostrings_2.26.2 [5] Repitools_1.4.0 [6] Mus.musculus_1.0.0 [7] TxDb.Mmusculus.UCSC.mm10.ensGene_2.8.0 [8] org.Mm.eg.db_2.8.0 [9] GO.db_2.8.0 [10] RSQLite_0.11.2 [11] DBI_0.2-5 [12] OrganismDbi_1.0.2 [13] GenomicFeatures_1.10.1 [14] GenomicRanges_1.10.5 [15] IRanges_1.16.4 [16] AnnotationDbi_1.20.3 [17] Biobase_2.18.0 [18] BiocGenerics_0.4.0 [19] BiocInstaller_1.8.3 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 bitops_1.0-5 edgeR_3.0.8 graph_1.36.1 [5] limma_3.14.3 parallel_2.15.1 RBGL_1.34.0 RCurl_1.95-3 [9] rtracklayer_1.18.2 stats4_2.15.1 tools_2.15.1 XML_3.95-0.1 [13] zlibbioc_1.4.0 With best wishes JL El Lun, 14 de Enero de 2013, 20:47, Hervé Pagès escribi?: > Hi JL, > > On 01/14/2013 08:27 AM, Steve Lianoglou wrote: >> Hi, >> >> On Mon, Jan 14, 2013 at 5:20 AM, <jluis.lavin at="" unavarra.es=""> wrote: >>> First of all I want to thank Herve Pages and Martin Morgan for their >>> kind >>> answers. And I want to apologize for this delay answering back, but I >>> was >>> offline for a few days... >>> >>> I used Herve code to read my BAM files into GRanges objects, but I >>> don't >>> seem to be able to use that Granges object into the cpgDensityCalc >>> function. >>> This is how I try to do it: >>> >>> -(My files are in the working directory already) >>> >>> library(Mus.musculus) >>> library(Repitools) >>> library(GenomicRanges) >>> library(Rsamtools) >>> >>> bam_files <- list.files(pattern="*.bam") >>> gr_list <- lapply(bam_files, >>> function(bam_file) >>> as(readGappedAlignments(bam_file), "GRanges")) >>> names(gr_list) <- bam_files >>> >>> require(BSgenome.Mmusculus.UCSC.mm9) >>> >>> cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) >>> >>> -And I get the following error: >>> >>> Error in function (classes, fdef, mtable) : >>> unable to find an inherited method for function "cpgDensityCalc", for >>> signature "character", "BSgenome" >>> >>> What does this error mean? >> >> It means that your `gr_list` isn't what you think it is -- it's a >> character, perhaps "try-error"? > > When running your code, I get: > > > cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function ?cpgDensityCalc? > for signature ?"list", "BSgenome"? > > Not exactly the same error you get: see "list" in the error message > I get, versus "character" in the error message you get? Not sure why > your 'gr_list' ends up being a character vector instead of a list. > > Anyway, having it being a list (an ordinary list) doesn't seem to work > either. The reason the call to cpgDensityCalc() fails is because this > is a generic function and there are no "cpgDensityCalc" methods that > will accept the first argument to be a list: > > > cpgDensityCalc > standardGeneric for "cpgDensityCalc" defined from package "Repitools" > > function (x, organism, ...) > standardGeneric("cpgDensityCalc") > <environment: 0x717c2a0=""> > Methods may be defined for arguments: x, organism > Use showMethods("cpgDensityCalc") for currently available ones. > > > showMethods("cpgDensityCalc") > Function: cpgDensityCalc (package Repitools) > x="data.frame", organism="BSgenome" > x="GRanges", organism="BSgenome" > x="GRangesList", organism="BSgenome" > > Note that the type of input expected by the function is explained in > its man page (?cpgDensityCalc): > > x: A ?data.frame?, with columns ?chr? and ?position?, or columns > ?chr?, ?start?, ?end?, and ?strand?. Also may be a > ?GRangesList? object, or ?GRanges?. > > So once you've sorted out the reasons why you get a character instead of > a list of GRanges (I've no clue how this could happen), you'll need to > turn that ordinary list into a GRangesList object with: > > gr_list <- GRangesList(gr_list) > > before calling cpgDensityCalc(). > > Hope this helps, > H. > >> >> Look at `head(gr_list)` ... what do you see? >> >>> *I'm awfully sorry for my lack of intermediate-advance R scripting >>> knowledge, I'm trying to fix that... >> >> Reading through an Introduction to R might be helpful, if you haven't >> done so already. >> >> Bioconductor uses the S4 class system in R, and it would be helpful to >> understand some parts of that too. If you look. If you look at >> previous bioc courses online, you will often find that Martin does a >> quick intro to these things at the beginning of many sessions: >> >> http://bioconductor.org/help/course-materials/ >> >> Look for courses that were run at The Hutch (in Seattle) >> >> Perhaps the intro slides here will bear fruit: >> >> http://bioconductor.org/help/course-materials/2010/SeattleIntro/ >> >> -steve >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Dr. Jos? Luis Lav?n Trueba Dpto. de Producci?n Agraria Grupo de Gen?tica y Microbiolog?a Universidad P?blica de Navarra 31006 Pamplona Navarra SPAIN

ADD REPLY • link 13.0 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Dear all, I'm afraid I have another question derived from the topic of this post. I've used Repitools to calculate Cpg density from my samples, it worked correctly, but now when I try to write the density calculations to a file, I'm unable... Is it possible to do it? Here's how I tried to do it: cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, seq.len = 300) write.table(cpdens, file = "Cpg_dens.txt", append = FALSE, quote = TRUE, sep = "\t", eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE) And this is the error message I get: Error in data.frame(c(0L, 1L, 2L, 3L, 3L, 3L, 1L, 2L, 14L, 14L, 3L, 1L, : arguments imply differing number of rows: 530842, 3401342 And I also tried: > write.csv(cpdens, file = "Cpg_dens.csv", quote = TRUE, eol = "\n", na = "NA", + row.names = TRUE) And got again this error: Error in data.frame(c(0L, 1L, 2L, 3L, 3L, 3L, 1L, 2L, 14L, 14L, 3L, 1L, : arguments imply differing number of rows: 530842, 3401342 Can somebody explain to me how to write the Cpg density calculations data to a file (tab separated file or csv if possible). Thanks in advance > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-redhat-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Mmusculus.UCSC.mm9_1.3.19 [2] BSgenome_1.26.1 [3] Rsamtools_1.10.2 [4] Biostrings_2.26.2 [5] Repitools_1.4.0 [6] Mus.musculus_1.0.0 [7] TxDb.Mmusculus.UCSC.mm10.ensGene_2.8.0 [8] org.Mm.eg.db_2.8.0 [9] GO.db_2.8.0 [10] RSQLite_0.11.2 [11] DBI_0.2-5 [12] OrganismDbi_1.0.2 [13] GenomicFeatures_1.10.1 [14] GenomicRanges_1.10.5 [15] IRanges_1.16.4 [16] AnnotationDbi_1.20.3 [17] Biobase_2.18.0 [18] BiocGenerics_0.4.0 [19] BiocInstaller_1.8.3 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 bitops_1.0-5 edgeR_3.0.8 graph_1.36.1 [5] limma_3.14.3 parallel_2.15.1 RBGL_1.34.0 RCurl_1.95-3 [9] rtracklayer_1.18.2 stats4_2.15.1 tools_2.15.1 XML_3.95-0.1 [13] zlibbioc_1.4.0 -- Dr. Jos? Luis Lav?n Trueba Dpto. de Producci?n Agraria Grupo de Gen?tica y Microbiolog?a Universidad P?blica de Navarra 31006 Pamplona Navarra SPAIN

ADD REPLY • link 13.0 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hi, On Tue, Jan 22, 2013 at 10:14 AM, <jluis.lavin at="" unavarra.es=""> wrote: > Dear all, > > I'm afraid I have another question derived from the topic of this post. > I've used Repitools to calculate Cpg density from my samples, it worked > correctly, but now when I try to write the density calculations to a file, > I'm unable... > Is it possible to do it? > > Here's how I tried to do it: > > cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, seq.len = 300) > > write.table(cpdens, file = "Cpg_dens.txt", append = FALSE, quote = TRUE, > sep = "\t", eol = "\n", na = "NA", dec = ".", > row.names = TRUE, col.names = TRUE) I've never used Repitools, but: (1) What is the output of `is(cpdens)` ? If it's not a data.frame, this won't work (2) If it is a data.frame, can we see the output of `head(cpdens)` ? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hello Steve, (1) What is the output of `is(cpdens)` ? If it's not a data.frame, this won't work is(cpdens) [1] "list" "vector" "AssayData" "vectorORfactor" head(cpdens) . . . [99841] 1 3 3 3 4 4 2 2 2 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 [99865] 2 2 0 0 0 0 0 0 0 3 1 1 0 0 0 0 0 0 0 2 1 1 1 0 [99889] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 [99913] 0 3 3 3 3 3 3 3 0 0 0 3 1 13 0 3 2 2 1 0 0 0 1 1 [99937] 0 0 0 2 2 2 2 3 3 1 1 1 1 1 0 0 0 0 0 0 0 2 0 2 [99961] 0 1 0 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 [99985] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 [ reached getOption("max.print") -- omitted 3301343 entries ] It doesn't seem to be data.frame as you pointed out... Is it possible write this to an output file of some kind? Thanks JL El Mar, 22 de Enero de 2013, 16:36, Steve Lianoglou escribi?: > Hi, > > On Tue, Jan 22, 2013 at 10:14 AM, <jluis.lavin at="" unavarra.es=""> wrote: >> Dear all, >> >> I'm afraid I have another question derived from the topic of this post. >> I've used Repitools to calculate Cpg density from my samples, it worked >> correctly, but now when I try to write the density calculations to a >> file, >> I'm unable... >> Is it possible to do it? >> >> Here's how I tried to do it: >> >> cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, seq.len = 300) >> >> write.table(cpdens, file = "Cpg_dens.txt", append = FALSE, quote = TRUE, >> sep = "\t", eol = "\n", na = "NA", dec = ".", >> row.names = TRUE, col.names = TRUE) > > I've never used Repitools, but: > > (1) What is the output of `is(cpdens)` ? If it's not a data.frame, > this won't work > (2) If it is a data.frame, can we see the output of `head(cpdens)` ? > > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Dr. Jos? Luis Lav?n Trueba Dpto. de Producci?n Agraria Grupo de Gen?tica y Microbiolog?a Universidad P?blica de Navarra 31006 Pamplona Navarra SPAIN

ADD REPLY • link 13.0 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Hi, On Tue, Jan 22, 2013 at 10:54 AM, <jluis.lavin at="" unavarra.es=""> wrote: > Hello Steve, > > (1) What is the output of `is(cpdens)` ? If it's not a data.frame, this > won't work > > is(cpdens) > [1] "list" "vector" "AssayData" "vectorORfactor" > > head(cpdens) > . > . > . > > [99841] 1 3 3 3 4 4 2 2 2 1 0 0 0 0 0 1 1 1 1 1 1 1 > 0 0 > [99865] 2 2 0 0 0 0 0 0 0 3 1 1 0 0 0 0 0 0 0 2 1 1 > 1 0 > [99889] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 > 0 0 > [99913] 0 3 3 3 3 3 3 3 0 0 0 3 1 13 0 3 2 2 1 0 0 0 > 1 1 > [99937] 0 0 0 2 2 2 2 3 3 1 1 1 1 1 0 0 0 0 0 0 0 2 > 0 2 > [99961] 0 1 0 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 > 1 1 > [99985] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 > [ reached getOption("max.print") -- omitted 3301343 entries ] > > It doesn't seem to be data.frame as you pointed out... > Is it possible write this to an output file of some kind? Yes, of course. You just have to ask yourself how you want to serialize it. ?writeLines can be your friend, here, if you want to save it in some ASCII format. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi, On 01/22/2013 08:29 AM, Steve Lianoglou wrote: > Hi, > > On Tue, Jan 22, 2013 at 10:54 AM, <jluis.lavin at="" unavarra.es=""> wrote: >> Hello Steve, >> >> (1) What is the output of `is(cpdens)` ? If it's not a data.frame, this >> won't work >> >> is(cpdens) >> [1] "list" "vector" "AssayData" "vectorORfactor" >> >> head(cpdens) >> . >> . >> . >> >> [99841] 1 3 3 3 4 4 2 2 2 1 0 0 0 0 0 1 1 1 1 1 1 1 >> 0 0 >> [99865] 2 2 0 0 0 0 0 0 0 3 1 1 0 0 0 0 0 0 0 2 1 1 >> 1 0 >> [99889] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 >> 0 0 >> [99913] 0 3 3 3 3 3 3 3 0 0 0 3 1 13 0 3 2 2 1 0 0 0 >> 1 1 >> [99937] 0 0 0 2 2 2 2 3 3 1 1 1 1 1 0 0 0 0 0 0 0 2 >> 0 2 >> [99961] 0 1 0 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 >> 1 1 >> [99985] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 >> [ reached getOption("max.print") -- omitted 3301343 entries ] >> >> It doesn't seem to be data.frame as you pointed out... >> Is it possible write this to an output file of some kind? > > Yes, of course. > > You just have to ask yourself how you want to serialize it. > > ?writeLines can be your friend, here, if you want to save it in some > ASCII format. Or you can use some higher-level facilities like export() from the rtracklayer package to export to a BED file. That means that you first need to stick the numeric vectors of CpG densities back into the GRanges object they correspond to ('cpdens' should have the same shape as the original GRangesList you passed to cpgDensityCalc() i.e. it should be list of numeric vectors where the i-th list element has the length of the i-th GRangesList element). Then you export each of the GRanges object separately. Could be done with something like: library(rtracklayer) for (i in seq_along(gr_list)) { gr <- gr_list[[i]] ## Need to set the "score" metadata col (see ??export.bed). mcols(gr)$score <- cpdens[[i]] filepath <- paste("CpG", i, ".bed", sep="") export.bed(gr, filepath) } Granted that 'gr_list' was obtained by loading BAM files, you should end up with 1 BED file per original BAM file. HTH, H. > > -steve > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.0 years ago Hervé Pagès 16k

0

Entering edit mode

Thank you very much Herv?, Your code really got the job done. And thanks also to Steve for his nice comments and Tim for his kind suggestions. With best wishes JL El Mie, 23 de Enero de 2013, 7:40, Hervé Pagès escribi?: > Hi, > > On 01/22/2013 08:29 AM, Steve Lianoglou wrote: >> Hi, >> >> On Tue, Jan 22, 2013 at 10:54 AM, <jluis.lavin at="" unavarra.es=""> wrote: >>> Hello Steve, >>> >>> (1) What is the output of `is(cpdens)` ? If it's not a data.frame, this >>> won't work >>> >>> is(cpdens) >>> [1] "list" "vector" "AssayData" "vectorORfactor" >>> >>> head(cpdens) >>> . >>> . >>> . >>> >>> [99841] 1 3 3 3 4 4 2 2 2 1 0 0 0 0 0 1 1 1 1 1 1 >>> 1 >>> 0 0 >>> [99865] 2 2 0 0 0 0 0 0 0 3 1 1 0 0 0 0 0 0 0 2 1 >>> 1 >>> 1 0 >>> [99889] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 >>> 0 >>> 0 0 >>> [99913] 0 3 3 3 3 3 3 3 0 0 0 3 1 13 0 3 2 2 1 0 0 >>> 0 >>> 1 1 >>> [99937] 0 0 0 2 2 2 2 3 3 1 1 1 1 1 0 0 0 0 0 0 0 >>> 2 >>> 0 2 >>> [99961] 0 1 0 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 >>> 1 >>> 1 1 >>> [99985] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 >>> [ reached getOption("max.print") -- omitted 3301343 entries ] >>> >>> It doesn't seem to be data.frame as you pointed out... >>> Is it possible write this to an output file of some kind? >> >> Yes, of course. >> >> You just have to ask yourself how you want to serialize it. >> >> ?writeLines can be your friend, here, if you want to save it in some >> ASCII format. > > Or you can use some higher-level facilities like export() from the > rtracklayer package to export to a BED file. That means that you > first need to stick the numeric vectors of CpG densities back into > the GRanges object they correspond to ('cpdens' should have the same > shape as the original GRangesList you passed to cpgDensityCalc() > i.e. it should be list of numeric vectors where the i-th list > element has the length of the i-th GRangesList element). Then > you export each of the GRanges object separately. Could be done > with something like: > > library(rtracklayer) > for (i in seq_along(gr_list)) { > gr <- gr_list[[i]] > ## Need to set the "score" metadata col (see ??export.bed). > mcols(gr)$score <- cpdens[[i]] > filepath <- paste("CpG", i, ".bed", sep="") > export.bed(gr, filepath) > } > > Granted that 'gr_list' was obtained by loading BAM files, you should > end up with 1 BED file per original BAM file. > > HTH, > H. > >> >> -steve >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Dr. Jos? Luis Lav?n Trueba Dpto. de Producci?n Agraria Grupo de Gen?tica y Microbiolog?a Universidad P?blica de Navarra 31006 Pamplona Navarra SPAIN

ADD REPLY • link 13.0 years ago José Luis Lavín ▴ 280

0

Entering edit mode

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 5 weeks ago

Australia

> the first argument is supposed to be the path to the bam file, not to the directory in which the bam file is located. That's right. Also, note the help page. path : A character vector of length 1. The path of the BAM file. Notice there is another method BAM2GRangesList described on the help page for processing multiple BAM file paths into a GRangesList. Hervé Pagès makes a good point. Although this package isn't for the analysis of RNA-seq data, and the output of Tophat is not relevant, this code was written when everyone was using the original Bowtie to do ChIP-seq alignments, which doesn't perform gapped alignments. This will be updated.

ADD COMMENT • link 13.0 years ago Dario Strbenac ★ 1.6k

0

Entering edit mode

Dario, On 01/09/2013 03:00 PM, Dario Strbenac wrote: >> the first argument is supposed to be the path to the bam file, not to the directory in which the bam file is located. > > That's right. Also, note the help page. > > path : A character vector of length 1. The path of the BAM file. > > Notice there is another method BAM2GRangesList described on the help page for processing multiple BAM file paths into a GRangesList. > > Hervé Pagès makes a good point. Although this package isn't for the analysis of RNA-seq data, and the output of Tophat is not relevant, this code was written when everyone was using the original Bowtie to do ChIP-seq alignments, which doesn't perform gapped alignments. Note that it's not just the gaps (i.e. N letters) that have an impact on the widths of the genomic ranges, but also the indels (I and D letters). And I think that Bowtie, like probably most aligners (even the early ones), introduces indels in the alignments. However, the impact of the indels is typically less than the impact of the gaps: in the range of 1-10 nucleotides for the former, hundreds of nucleotides for the latter. But still, depending on the kind of downstream analysis you're doing, being off by just 1 nucleotide can have big consequences (e.g. when translating). Not really worth taking that risk when it's easy to get this exactly right ;-) Cheers, H. > This will be updated. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.0 years ago Hervé Pagès 16k

0

Entering edit mode

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 5 weeks ago

Australia

>That code looks correct, but I would rather do: > > bam_files <- list.files(pattern="*.bam") > gr_list <- lapply(bam_files, > function(bam_file) > as(readGappedAlignments(bam_file), "GRanges")) > names(gr_list) <- bam_files Is it possible to add an option to the coercion to keep the metadata columns ? For example, if the user specifies they also want to read the quality scores, the coercion to GRanges currently discards that.

ADD COMMENT • link 13.0 years ago Dario Strbenac ★ 1.6k

0

Entering edit mode

Hi Dario, On 01/10/2013 05:00 PM, Dario Strbenac wrote: >> That code looks correct, but I would rather do: >> >> bam_files <- list.files(pattern="*.bam") >> gr_list <- lapply(bam_files, >> function(bam_file) >> as(readGappedAlignments(bam_file), "GRanges")) >> names(gr_list) <- bam_files > > Is it possible to add an option to the coercion to keep the metadata columns ? For example, if the user specifies they also want to read the quality scores, the coercion to GRanges currently discards that. Because of its syntax, coercion doesn't allow additional options. So I could modify the coercion method to propagate the metadata cols. Given that the names are propagated, it probably make sense to also propagate the metadata cols. Note that the coercion method from GappedAlignments to GRanges is just calling the granges() getter with no args. I think I should add the 'use.names' and 'use.mcols' args to this one to let the user control whether or not propagate the names and/or metadata cols. Cheers, H. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.0 years ago Hervé Pagès 16k

0

Entering edit mode

Hi, On Mon, Jan 14, 2013 at 12:10 PM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: [snip] > Because of its syntax, coercion doesn't allow additional options. I feel like I often find myself wishing that `as` couldn't have been defined as something like: function (object, Class, ..., .strict = TRUE, .ext = possibleExtends(thisClass, Class)) or something similar, as opposed to the current: function (object, Class, strict = TRUE, ext = possibleExtends(thisClass, Class)) I guess this is another one for R-aleph ... -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 13.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Dario Strbenac ★ 1.6k

@dario-strbenac-5916

Last seen 5 weeks ago

Australia

>cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, window =600) This example doesn't make biological sense to me. This creates a 600 base window around the start of each sequencing read. The window argument is relevant if you have a data.frame of genes, and you'd like to know the CpG density in a window around the TSS, for example. For your purpose, I would recommend cpdens <- cpgDensityCalc(gr_list, organism=Mmusculus, seq.len = aveLength) The average DNA fragment length that was sequenced is assigned to aveLength. Ask the biologist who prepared the sequencing library for this information. It is related to the position on the gel that they cut the DNA out. See http://nar.oxfordjournals.org/content/40/10/e72.full for more background. -------------------------------------- Dario Strbenac PhD Student University of Sydney Camperdown NSW 2050 Australia

ADD COMMENT • link 13.0 years ago Dario Strbenac ★ 1.6k

0

Entering edit mode

First of all I want to thank Herve Pages and Martin Morgan for their kind

answers. And I want to apologize for this delay answering back, but I was offline for a few days...

I used Herve code to read my BAM files into GRanges objects, but I don't seem to be able to use that Granges object into the cpgDensityCalc (1) What is the output of is(cpdens) ? If it's not a data.frame, this won't work

is(cpdens) 1 "list" "vector" "AssayData" I think it is so good happy wheels

ADD REPLY • link 2.6 years ago guanliyun081 • 0

0

Entering edit mode

Thanks! That was very helpful!

The average DNA fragment length that was sequenced is assigned to aveLength. Ask the biologist who prepared the sequencing library for this information. It is related to the position on the gel that they cut the DNA out.

See http://nar.oxfordjournals.org/content/40/10/e72.full redactle unlimited for more background.

ADD REPLY • link 2.5 years ago Kidd • 0

0

Entering edit mode

A 600 fnaf security breach base window is formed at the beginning of each sequencing read in the first example. When examining the CpG density at the transcription start site, this timeframe is important. The use of a data.frame containing genes is mentioned.

ADD REPLY • link 2.2 years ago Marge • 0

Login before adding your answer.