creating GRanges and reading BAM files

0

Entering edit mode

David A. ▴ 70

@david-a-4430

Last seen 9.6 years ago

Hi list, sorry for a simple question, but I am a newbie a bit lost reading all the information on how to handle NGS data using R tools. I have a set of BAM files from Junior sequencer, one BAM per amplicon per sample (Roche's software does not output one single BAM file per sample including all amplicons). I also have a bed file with the features sequenced. At the moment I am only testing with two amplicons for one sample. My final aim is to get the coverage per amplicon per sample I am using R version 2.14.2 (2012-02-29) I read in the bed file and tried to create a GRanges object: > bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > bed.frame chromosome start end 1 APOBex26_M13 1 478 2 APOBex29_M13 1 448 > gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end= bed.frame[,3])) > gr GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] APOBex26_M13 [1, 478] * [2] APOBex29_M13 [1, 448] * --- seqlengths: APOBex26_M13 APOBex29_M13 NA NA I also read in the list of BAM files vand convert to BamFileList: > fls = list.files(pattern="*bam",full=TRUE) > fls [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" bfs <- BamFileList(fls) I then try to summarizeOverlaps but gives me the following error. >olap <-summarizeOverlaps(gr, bfs) Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported yet, sorry! What is the meaning of this error and how can I overcome it? Is the gr object not created properly? Is Metadata necessary? Why is not seqlengths filled automatically using the IRanges? Thanks for your help, Dave [[alternative HTML version deleted]]

Coverage convert Coverage convert • 4.0k views

ADD COMMENT • link updated 12.1 years ago by Sunny Yu Liu ▴ 80 • written 12.1 years ago by David A. ▴ 70

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

On Wed, Mar 7, 2012 at 4:51 AM, David A. <dasolexa@hotmail.com> wrote: > > Hi list, > > sorry for a simple question, but I am a newbie a bit lost reading all the > information on how to handle NGS data using R tools. I have a set of BAM > files from Junior sequencer, one BAM per amplicon per sample (Roche's > software does not output one single BAM file per sample including all > amplicons). I also have a bed file with the features sequenced. At the > moment I am only testing with two amplicons for one sample. My final aim is > to get the coverage per amplicon per sample > > I am using R version 2.14.2 (2012-02-29) > > I read in the bed file and tried to create a GRanges object: > > > bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > > bed.frame > chromosome start end > 1 APOBex26_M13 1 478 > 2 APOBex29_M13 1 448 > This does not appear to be a valid BED file. There should not be any column names, and the intervals should be 0-based. But anyway, let's just say it is BED-like. > > > gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end= bed.frame[,3])) > > gr > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] APOBex26_M13 [1, 478] * > [2] APOBex29_M13 [1, 448] * > --- > seqlengths: > APOBex26_M13 APOBex29_M13 > NA NA > > > I also read in the list of BAM files vand convert to BamFileList: > > > fls = list.files(pattern="*bam",full=TRUE) > > fls > [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" > [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" > bfs <- BamFileList(fls) > > I then try to summarizeOverlaps but gives me the following error. > > >olap <-summarizeOverlaps(gr, bfs) > Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : > in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported > yet, sorry! > > What is the meaning of this error and how can I overcome it? > Well I would just take it at face value: the P cigar operation is not supported yet. Maybe it is time to start supporting it.. :) > Is the gr object not created properly? Is Metadata necessary? Why is not > seqlengths filled automatically using the IRanges? > > There is no way to know the seqlengths automatically. You've constructed a GRanges with features that lay on some typically longer sequence, but the sequence length was never specified. You can pass your "end" values as the "seqlengths" to the GRanges constructor. Michael > Thanks for your help, > > Dave > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 12.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Thanks a lot Michael. I will modify my pseudo-bed file accordingly. Regarding padding not being supported, and while waiting to see if it ever gets support so that I can continue with analysis, is there any tool that will automatically unpad the alignment? or do people create their own scripts to remove the P flags? That is all it needs right? Dave Date: Wed, 7 Mar 2012 07:21:31 -0800 Subject: Re: [BioC] creating GRanges and reading BAM files From: lawrence.michael@gene.com To: dasolexa@hotmail.com CC: bioconductor@r-project.org On Wed, Mar 7, 2012 at 4:51 AM, David A. <dasolexa@hotmail.com> wrote: Hi list, sorry for a simple question, but I am a newbie a bit lost reading all the information on how to handle NGS data using R tools. I have a set of BAM files from Junior sequencer, one BAM per amplicon per sample (Roche's software does not output one single BAM file per sample including all amplicons). I also have a bed file with the features sequenced. At the moment I am only testing with two amplicons for one sample. My final aim is to get the coverage per amplicon per sample I am using R version 2.14.2 (2012-02-29) I read in the bed file and tried to create a GRanges object: > bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > bed.frame chromosome start end 1 APOBex26_M13 1 478 2 APOBex29_M13 1 448 This does not appear to be a valid BED file. There should not be any column names, and the intervals should be 0-based. But anyway, let's just say it is BED-like. > gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end= bed.frame[,3])) > gr GRanges with 2 ranges and 0 elementMetadata values: seqnames ranges strand <rle> <iranges> <rle> [1] APOBex26_M13 [1, 478] * [2] APOBex29_M13 [1, 448] * --- seqlengths: APOBex26_M13 APOBex29_M13 NA NA I also read in the list of BAM files vand convert to BamFileList: > fls = list.files(pattern="*bam",full=TRUE) > fls [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" bfs <- BamFileList(fls) I then try to summarizeOverlaps but gives me the following error. >olap <-summarizeOverlaps(gr, bfs) Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported yet, sorry! What is the meaning of this error and how can I overcome it? Well I would just take it at face value: the P cigar operation is not supported yet. Maybe it is time to start supporting it.. :) Is the gr object not created properly? Is Metadata necessary? Why is not seqlengths filled automatically using the IRanges? There is no way to know the seqlengths automatically. You've constructed a GRanges with features that lay on some typically longer sequence, but the sequence length was never specified. You can pass your "end" values as the "seqlengths" to the GRanges constructor. Michael Thanks for your help, Dave [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago David A. ▴ 70

0

Entering edit mode

Dear all I am wondering how to estimate the genetic coding impact on variant calling (VCF). That is, how amino acid coding affects variant calling and if there is any package addressing this issue. Thanks. John [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago John linux-user ▴ 210

0

Entering edit mode

On Wed, Mar 7, 2012 at 12:23 PM, John linux-user <johnlinuxuser at="" yahoo.com=""> wrote: > Dear all > > I am wondering how to estimate the genetic coding? impact on variant calling (VCF). That is, how amino acid coding affects variant calling and if there is any package addressing this issue. Thanks. > library(VariantAnnotation) Sean

ADD REPLY • link 12.1 years ago Sean Davis 21k

0

Entering edit mode

Herve, do you think we could support the P in the CIGAR strings? Or does it complicate things too much? On Wed, Mar 7, 2012 at 9:03 AM, David A. <dasolexa@hotmail.com> wrote: > Thanks a lot Michael. > > I will modify my pseudo-bed file accordingly. > > If you do actually get your BED file into spec, then you could use rtracklayer::import to load it directly as a GRanges. You'll still need to set the seqlengths on it. > Regarding padding not being supported, and while waiting to see if it ever > gets support so that I can continue with analysis, is there any tool that > will automatically unpad the alignment? or do people create their own > scripts to remove the P flags? That is all it needs right? > > > > Dave > > ------------------------------ > Date: Wed, 7 Mar 2012 07:21:31 -0800 > Subject: Re: [BioC] creating GRanges and reading BAM files > From: lawrence.michael@gene.com > To: dasolexa@hotmail.com > CC: bioconductor@r-project.org > > > > > On Wed, Mar 7, 2012 at 4:51 AM, David A. <dasolexa@hotmail.com> wrote: > > > Hi list, > > sorry for a simple question, but I am a newbie a bit lost reading all the > information on how to handle NGS data using R tools. I have a set of BAM > files from Junior sequencer, one BAM per amplicon per sample (Roche's > software does not output one single BAM file per sample including all > amplicons). I also have a bed file with the features sequenced. At the > moment I am only testing with two amplicons for one sample. My final aim is > to get the coverage per amplicon per sample > > I am using R version 2.14.2 (2012-02-29) > > I read in the bed file and tried to create a GRanges object: > > > bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > > bed.frame > chromosome start end > 1 APOBex26_M13 1 478 > 2 APOBex29_M13 1 448 > > > This does not appear to be a valid BED file. There should not be any > column names, and the intervals should be 0-based. But anyway, let's just > say it is BED-like. > > > > > gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end= bed.frame[,3])) > > gr > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] APOBex26_M13 [1, 478] * > [2] APOBex29_M13 [1, 448] * > --- > seqlengths: > APOBex26_M13 APOBex29_M13 > NA NA > > > I also read in the list of BAM files vand convert to BamFileList: > > > fls = list.files(pattern="*bam",full=TRUE) > > fls > [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" > [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" > bfs <- BamFileList(fls) > > I then try to summarizeOverlaps but gives me the following error. > > >olap <-summarizeOverlaps(gr, bfs) > Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : > in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported > yet, sorry! > > What is the meaning of this error and how can I overcome it? > > > Well I would just take it at face value: the P cigar operation is not > supported yet. Maybe it is time to start supporting it.. :) > > > Is the gr object not created properly? Is Metadata necessary? Why is not > seqlengths filled automatically using the IRanges? > > > There is no way to know the seqlengths automatically. You've constructed a > GRanges with features that lay on some typically longer sequence, but the > sequence length was never specified. You can pass your "end" values as the > "seqlengths" to the GRanges constructor. > > Michael > > > Thanks for your help, > > Dave > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, Dave, On 03/07/2012 09:23 AM, Michael Lawrence wrote: > Herve, do you think we could support the P in the CIGAR strings? Or does it > complicate things too much? According to the SAM spec: P: padding (silent deletion from padded reference) Not clear to me what they mean exactly. In particular, what's the "padded reference"? So I looked at the example provided in section 1.1 of the spec (please use a fixed-width font to display this message): Coor 12345678901234 5678901234567890123456789012345 ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGCCAT The corresponding SAM format is: @HD VN:1.3 SO:coordinate @SQ SN:ref LN:45 r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1 r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0 r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT * As you can see, P is used in the cigar of read r002. So yes the reference sequence is padded with 2 consecutive * but that's just a displaying trick here that makes it easier to display the 2-letter insertion in r001/1. So I'm confused about how relevant the 1P operation in the r002 cigar is. The same cigar but without the 1P operation would be 3S6M1I4M, and it tells me all I need to know about how the read is aligned to the reference (not to the *padded* reference, why should I care?). Am I missing something or are there situations where the P actually carries more interesting/important meaning? Otherwise, adapting the cigar utils in GenomicRanges will be easy: they'll just ignore the P's :-) But I'd really like to have a closer look at those BAM files produced by the Roche's software first. Dave do you think you can put this file (the one with P's), or a subset of it, somewhere online where I can access it? Thanks, H. > > On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at="" hotmail.com=""> wrote: > >> Thanks a lot Michael. >> >> I will modify my pseudo-bed file accordingly. >> >> > If you do actually get your BED file into spec, then you could use > rtracklayer::import to load it directly as a GRanges. You'll still need to > set the seqlengths on it. > > >> Regarding padding not being supported, and while waiting to see if it ever >> gets support so that I can continue with analysis, is there any tool that >> will automatically unpad the alignment? or do people create their own >> scripts to remove the P flags? That is all it needs right? >> >> >> >> Dave >> >> ------------------------------ >> Date: Wed, 7 Mar 2012 07:21:31 -0800 >> Subject: Re: [BioC] creating GRanges and reading BAM files >> From: lawrence.michael at gene.com >> To: dasolexa at hotmail.com >> CC: bioconductor at r-project.org >> >> >> >> >> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at="" hotmail.com=""> wrote: >> >> >> Hi list, >> >> sorry for a simple question, but I am a newbie a bit lost reading all the >> information on how to handle NGS data using R tools. I have a set of BAM >> files from Junior sequencer, one BAM per amplicon per sample (Roche's >> software does not output one single BAM file per sample including all >> amplicons). I also have a bed file with the features sequenced. At the >> moment I am only testing with two amplicons for one sample. My final aim is >> to get the coverage per amplicon per sample >> >> I am using R version 2.14.2 (2012-02-29) >> >> I read in the bed file and tried to create a GRanges object: >> >>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) >>> bed.frame >> chromosome start end >> 1 APOBex26_M13 1 478 >> 2 APOBex29_M13 1 448 >> >> >> This does not appear to be a valid BED file. There should not be any >> column names, and the intervals should be 0-based. But anyway, let's just >> say it is BED-like. >> >> >>> >> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end =bed.frame[,3])) >>> gr >> GRanges with 2 ranges and 0 elementMetadata values: >> seqnames ranges strand >> <rle> <iranges> <rle> >> [1] APOBex26_M13 [1, 478] * >> [2] APOBex29_M13 [1, 448] * >> --- >> seqlengths: >> APOBex26_M13 APOBex29_M13 >> NA NA >> >> >> I also read in the list of BAM files vand convert to BamFileList: >> >>> fls = list.files(pattern="*bam",full=TRUE) >>> fls >> [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" >> [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" >> bfs<- BamFileList(fls) >> >> I then try to summarizeOverlaps but gives me the following error. >> >>> olap<-summarizeOverlaps(gr, bfs) >> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : >> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported >> yet, sorry! >> >> What is the meaning of this error and how can I overcome it? >> >> >> Well I would just take it at face value: the P cigar operation is not >> supported yet. Maybe it is time to start supporting it.. :) >> >> >> Is the gr object not created properly? Is Metadata necessary? Why is not >> seqlengths filled automatically using the IRanges? >> >> >> There is no way to know the seqlengths automatically. You've constructed a >> GRanges with features that lay on some typically longer sequence, but the >> sequence length was never specified. You can pass your "end" values as the >> "seqlengths" to the GRanges constructor. >> >> Michael >> >> >> Thanks for your help, >> >> Dave >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.1 years ago Hervé Pagès 16k

0

Entering edit mode

Hi again, The latest version of the SAM spec (v1.4-r994), not-yet-released but accessible here: https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/sam- spec has a whole new and very enlightening section about the "padded SAM format" (3. Guide for Describing Assembly Sequences in SAM) where the P operation plays a central role. After reading it and also having a quick look at this thread: http://sourceforge.net/mailarchive/forum.php?thread_name=CAKVJ-_6Fpuin BN0-c9VtjaCBdiBWg%2Byzdn5mHTGGty0K__k0DQ%40mail.gmail.com&forum_name =samtools-devel my understanding is that the only purpose of the padding CIGAR operation 'P' is to allow conversions between the "padded SAM" and "unpadded SAM" formats with no loss of information. More precisely, and IIUC, P's can only show up in the "unpadded SAM" format (which is the most commonly used format) and allow the conversion from "unpadded SAM" to "padded SAM". Inversely, a "padded SAM" (typically used when dealing with assembly sequences) will never contain I's or P's and can easily be converted into "unpadded SAM" (there is a new pad2unpad tool in samtools for doing this). So to summarize, yes it looks like the cigar utils in GenomicRanges can safely ignore the P operations. I'll do that change. Cheers, H. On 03/08/2012 12:11 PM, Hervé Pagès wrote: > Hi Michael, Dave, > > On 03/07/2012 09:23 AM, Michael Lawrence wrote: >> Herve, do you think we could support the P in the CIGAR strings? Or >> does it >> complicate things too much? > > According to the SAM spec: > > P: padding (silent deletion from padded reference) > > Not clear to me what they mean exactly. In particular, what's > the "padded reference"? > > So I looked at the example provided in section 1.1 of the spec > (please use a fixed-width font to display this message): > > Coor 12345678901234 5678901234567890123456789012345 > ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT > +r001/1 TTAGATAAAGGATA*CTG > +r002 aaaAGATAA*GGATA > +r003 gcctaAGCTAA > +r004 ATAGCT..............TCAGC > -r003 ttagctTAGGC > -r001/2 CAGCGCCAT > > The corresponding SAM format is: > > @HD VN:1.3 SO:coordinate > @SQ SN:ref LN:45 > r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * > r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * > r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1 > r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * > r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0 > r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT * > > As you can see, P is used in the cigar of read r002. > > So yes the reference sequence is padded with 2 consecutive * but > that's just a displaying trick here that makes it easier to display > the 2-letter insertion in r001/1. > > So I'm confused about how relevant the 1P operation in the r002 > cigar is. The same cigar but without the 1P operation would be > 3S6M1I4M, and it tells me all I need to know about how the read > is aligned to the reference (not to the *padded* reference, why > should I care?). > > Am I missing something or are there situations where the P > actually carries more interesting/important meaning? Otherwise, > adapting the cigar utils in GenomicRanges will be easy: they'll > just ignore the P's :-) > > But I'd really like to have a closer look at those BAM files > produced by the Roche's software first. Dave do you think you > can put this file (the one with P's), or a subset of it, > somewhere online where I can access it? > > Thanks, > H. > >> >> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at="" hotmail.com=""> wrote: >> >>> Thanks a lot Michael. >>> >>> I will modify my pseudo-bed file accordingly. >>> >>> >> If you do actually get your BED file into spec, then you could use >> rtracklayer::import to load it directly as a GRanges. You'll still >> need to >> set the seqlengths on it. >> >> >>> Regarding padding not being supported, and while waiting to see if it >>> ever >>> gets support so that I can continue with analysis, is there any tool >>> that >>> will automatically unpad the alignment? or do people create their own >>> scripts to remove the P flags? That is all it needs right? >>> >>> >>> >>> Dave >>> >>> ------------------------------ >>> Date: Wed, 7 Mar 2012 07:21:31 -0800 >>> Subject: Re: [BioC] creating GRanges and reading BAM files >>> From: lawrence.michael at gene.com >>> To: dasolexa at hotmail.com >>> CC: bioconductor at r-project.org >>> >>> >>> >>> >>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at="" hotmail.com=""> wrote: >>> >>> >>> Hi list, >>> >>> sorry for a simple question, but I am a newbie a bit lost reading all >>> the >>> information on how to handle NGS data using R tools. I have a set of BAM >>> files from Junior sequencer, one BAM per amplicon per sample (Roche's >>> software does not output one single BAM file per sample including all >>> amplicons). I also have a bed file with the features sequenced. At the >>> moment I am only testing with two amplicons for one sample. My final >>> aim is >>> to get the coverage per amplicon per sample >>> >>> I am using R version 2.14.2 (2012-02-29) >>> >>> I read in the bed file and tried to create a GRanges object: >>> >>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) >>>> bed.frame >>> chromosome start end >>> 1 APOBex26_M13 1 478 >>> 2 APOBex29_M13 1 448 >>> >>> >>> This does not appear to be a valid BED file. There should not be any >>> column names, and the intervals should be 0-based. But anyway, let's >>> just >>> say it is BED-like. >>> >>> >>>> >>> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],en d=bed.frame[,3])) >>> >>>> gr >>> GRanges with 2 ranges and 0 elementMetadata values: >>> seqnames ranges strand >>> <rle> <iranges> <rle> >>> [1] APOBex26_M13 [1, 478] * >>> [2] APOBex29_M13 [1, 448] * >>> --- >>> seqlengths: >>> APOBex26_M13 APOBex29_M13 >>> NA NA >>> >>> >>> I also read in the list of BAM files vand convert to BamFileList: >>> >>>> fls = list.files(pattern="*bam",full=TRUE) >>>> fls >>> [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" >>> [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" >>> bfs<- BamFileList(fls) >>> >>> I then try to summarizeOverlaps but gives me the following error. >>> >>>> olap<-summarizeOverlaps(gr, bfs) >>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : >>> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported >>> yet, sorry! >>> >>> What is the meaning of this error and how can I overcome it? >>> >>> >>> Well I would just take it at face value: the P cigar operation is not >>> supported yet. Maybe it is time to start supporting it.. :) >>> >>> >>> Is the gr object not created properly? Is Metadata necessary? Why is not >>> seqlengths filled automatically using the IRanges? >>> >>> >>> There is no way to know the seqlengths automatically. You've >>> constructed a >>> GRanges with features that lay on some typically longer sequence, but >>> the >>> sequence length was never specified. You can pass your "end" values >>> as the >>> "seqlengths" to the GRanges constructor. >>> >>> Michael >>> >>> >>> Thanks for your help, >>> >>> Dave >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.1 years ago Hervé Pagès 16k

0

Entering edit mode

On 03/08/2012 01:41 PM, Hervé Pagès wrote: > Hi again, > > The latest version of the SAM spec (v1.4-r994), not-yet-released but > accessible here: > > https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/sam-spec > > has a whole new and very enlightening section about the "padded SAM > format" (3. Guide for Describing Assembly Sequences in SAM) where > the P operation plays a central role. > > After reading it and also having a quick look at this thread: > > > http://sourceforge.net/mailarchive/forum.php?thread_name=CAKVJ-_6Fpu inBN0-c9VtjaCBdiBWg%2Byzdn5mHTGGty0K__k0DQ%40mail.gmail.com&forum_name =samtools-devel > > > my understanding is that the only purpose of the padding CIGAR > operation 'P' is to allow conversions between the "padded SAM" > and "unpadded SAM" formats with no loss of information. > > More precisely, and IIUC, P's can only show up in the "unpadded SAM" > format (which is the most commonly used format) and allow the > conversion from "unpadded SAM" to "padded SAM". > > Inversely, a "padded SAM" (typically used when dealing with assembly > sequences) will never contain I's or P's and can easily be converted > into "unpadded SAM" (there is a new pad2unpad tool in samtools for > doing this). > > So to summarize, yes it looks like the cigar utils in GenomicRanges > can safely ignore the P operations. I'll do that change. Done in GenomicRanges 1.7.33 (latest devel version). With this version: > library(GenomicRanges) > cig1 <- "6M1P4M2D5M20N3I2M" > cig2 <- "10M2D5M20N3I2M" > cigarToIRanges(cig1) IRanges of length 2 start end width [1] 1 17 17 [2] 38 39 2 > cigarToIRanges(cig2) IRanges of length 2 start end width [1] 1 17 17 [2] 38 39 2 > cigarToQWidth(cig1) [1] 20 > cigarToQWidth(cig2) [1] 20 The "silent deletion from padded reference" is ignored by the 2 central low-level cigar utils cigarToIRanges() and cigarToQWidth(). All the other cigar utils and higher level operations like coercion from GappedAlignment to GRanges or GRangesList behave consistently with this. GenomicRanges 1.7.33 will become available via BiocInstaller::biocLite() and to R 2.15 users only in the next 24 hours. Cheers, H. > > Cheers, > H. > > > > On 03/08/2012 12:11 PM, Hervé Pagès wrote: >> Hi Michael, Dave, >> >> On 03/07/2012 09:23 AM, Michael Lawrence wrote: >>> Herve, do you think we could support the P in the CIGAR strings? Or >>> does it >>> complicate things too much? >> >> According to the SAM spec: >> >> P: padding (silent deletion from padded reference) >> >> Not clear to me what they mean exactly. In particular, what's >> the "padded reference"? >> >> So I looked at the example provided in section 1.1 of the spec >> (please use a fixed-width font to display this message): >> >> Coor 12345678901234 5678901234567890123456789012345 >> ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT >> +r001/1 TTAGATAAAGGATA*CTG >> +r002 aaaAGATAA*GGATA >> +r003 gcctaAGCTAA >> +r004 ATAGCT..............TCAGC >> -r003 ttagctTAGGC >> -r001/2 CAGCGCCAT >> >> The corresponding SAM format is: >> >> @HD VN:1.3 SO:coordinate >> @SQ SN:ref LN:45 >> r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * >> r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * >> r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1 >> r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * >> r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0 >> r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT * >> >> As you can see, P is used in the cigar of read r002. >> >> So yes the reference sequence is padded with 2 consecutive * but >> that's just a displaying trick here that makes it easier to display >> the 2-letter insertion in r001/1. >> >> So I'm confused about how relevant the 1P operation in the r002 >> cigar is. The same cigar but without the 1P operation would be >> 3S6M1I4M, and it tells me all I need to know about how the read >> is aligned to the reference (not to the *padded* reference, why >> should I care?). >> >> Am I missing something or are there situations where the P >> actually carries more interesting/important meaning? Otherwise, >> adapting the cigar utils in GenomicRanges will be easy: they'll >> just ignore the P's :-) >> >> But I'd really like to have a closer look at those BAM files >> produced by the Roche's software first. Dave do you think you >> can put this file (the one with P's), or a subset of it, >> somewhere online where I can access it? >> >> Thanks, >> H. >> >>> >>> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa at="" hotmail.com=""> wrote: >>> >>>> Thanks a lot Michael. >>>> >>>> I will modify my pseudo-bed file accordingly. >>>> >>>> >>> If you do actually get your BED file into spec, then you could use >>> rtracklayer::import to load it directly as a GRanges. You'll still >>> need to >>> set the seqlengths on it. >>> >>> >>>> Regarding padding not being supported, and while waiting to see if it >>>> ever >>>> gets support so that I can continue with analysis, is there any tool >>>> that >>>> will automatically unpad the alignment? or do people create their own >>>> scripts to remove the P flags? That is all it needs right? >>>> >>>> >>>> >>>> Dave >>>> >>>> ------------------------------ >>>> Date: Wed, 7 Mar 2012 07:21:31 -0800 >>>> Subject: Re: [BioC] creating GRanges and reading BAM files >>>> From: lawrence.michael at gene.com >>>> To: dasolexa at hotmail.com >>>> CC: bioconductor at r-project.org >>>> >>>> >>>> >>>> >>>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa at="" hotmail.com=""> wrote: >>>> >>>> >>>> Hi list, >>>> >>>> sorry for a simple question, but I am a newbie a bit lost reading all >>>> the >>>> information on how to handle NGS data using R tools. I have a set of >>>> BAM >>>> files from Junior sequencer, one BAM per amplicon per sample (Roche's >>>> software does not output one single BAM file per sample including all >>>> amplicons). I also have a bed file with the features sequenced. At the >>>> moment I am only testing with two amplicons for one sample. My final >>>> aim is >>>> to get the coverage per amplicon per sample >>>> >>>> I am using R version 2.14.2 (2012-02-29) >>>> >>>> I read in the bed file and tried to create a GRanges object: >>>> >>>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) >>>>> bed.frame >>>> chromosome start end >>>> 1 APOBex26_M13 1 478 >>>> 2 APOBex29_M13 1 448 >>>> >>>> >>>> This does not appear to be a valid BED file. There should not be any >>>> column names, and the intervals should be 0-based. But anyway, let's >>>> just >>>> say it is BED-like. >>>> >>>> >>>>> >>>> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],e nd=bed.frame[,3])) >>>> >>>> >>>>> gr >>>> GRanges with 2 ranges and 0 elementMetadata values: >>>> seqnames ranges strand >>>> <rle> <iranges> <rle> >>>> [1] APOBex26_M13 [1, 478] * >>>> [2] APOBex29_M13 [1, 448] * >>>> --- >>>> seqlengths: >>>> APOBex26_M13 APOBex29_M13 >>>> NA NA >>>> >>>> >>>> I also read in the list of BAM files vand convert to BamFileList: >>>> >>>>> fls = list.files(pattern="*bam",full=TRUE) >>>>> fls >>>> [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" >>>> [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" >>>> bfs<- BamFileList(fls) >>>> >>>> I then try to summarizeOverlaps but gives me the following error. >>>> >>>>> olap<-summarizeOverlaps(gr, bfs) >>>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : >>>> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported >>>> yet, sorry! >>>> >>>> What is the meaning of this error and how can I overcome it? >>>> >>>> >>>> Well I would just take it at face value: the P cigar operation is not >>>> supported yet. Maybe it is time to start supporting it.. :) >>>> >>>> >>>> Is the gr object not created properly? Is Metadata necessary? Why is >>>> not >>>> seqlengths filled automatically using the IRanges? >>>> >>>> >>>> There is no way to know the seqlengths automatically. You've >>>> constructed a >>>> GRanges with features that lay on some typically longer sequence, but >>>> the >>>> sequence length was never specified. You can pass your "end" values >>>> as the >>>> "seqlengths" to the GRanges constructor. >>>> >>>> Michael >>>> >>>> >>>> Thanks for your help, >>>> >>>> Dave >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.1 years ago Hervé Pagès 16k

0

Entering edit mode

Dear Michael and Hervé, thanks ever so much for your help. I also found this link http://davetang.org/wiki/tiki-index.php?page=SAM that states that just removing the P operation directly from the alignment will suffice to convert to unpadded format, as you also say. So I transformed form BAM to SAM, removed padding with a simple Perl oneliner, converted back to BAM and loaded into R session and this time it seems to work fine. Nevertheless, your workaround is better :). Thanks for your effort. Dave > Date: Thu, 8 Mar 2012 16:18:26 -0800 > From: hpages@fhcrc.org > To: lawrence.michael@gene.com > CC: bioconductor@r-project.org; dasolexa@hotmail.com > Subject: Re: [BioC] creating GRanges and reading BAM files > > On 03/08/2012 01:41 PM, Hervé Pagès wrote: > > Hi again, > > > > The latest version of the SAM spec (v1.4-r994), not-yet-released but > > accessible here: > > > > https://samtools.svn.sourceforge.net/svnroot/samtools/trunk/sam- spec > > > > has a whole new and very enlightening section about the "padded SAM > > format" (3. Guide for Describing Assembly Sequences in SAM) where > > the P operation plays a central role. > > > > After reading it and also having a quick look at this thread: > > > > > > http://sourceforge.net/mailarchive/forum.php?thread_name=CAKVJ-_6F puinBN0-c9VtjaCBdiBWg%2Byzdn5mHTGGty0K__k0DQ%40mail.gmail.com&forum_na me=samtools-devel > > > > > > my understanding is that the only purpose of the padding CIGAR > > operation 'P' is to allow conversions between the "padded SAM" > > and "unpadded SAM" formats with no loss of information. > > > > More precisely, and IIUC, P's can only show up in the "unpadded SAM" > > format (which is the most commonly used format) and allow the > > conversion from "unpadded SAM" to "padded SAM". > > > > Inversely, a "padded SAM" (typically used when dealing with assembly > > sequences) will never contain I's or P's and can easily be converted > > into "unpadded SAM" (there is a new pad2unpad tool in samtools for > > doing this). > > > > So to summarize, yes it looks like the cigar utils in GenomicRanges > > can safely ignore the P operations. I'll do that change. > > Done in GenomicRanges 1.7.33 (latest devel version). With this version: > > > library(GenomicRanges) > > cig1 <- "6M1P4M2D5M20N3I2M" > > cig2 <- "10M2D5M20N3I2M" > > cigarToIRanges(cig1) > IRanges of length 2 > start end width > [1] 1 17 17 > [2] 38 39 2 > > cigarToIRanges(cig2) > IRanges of length 2 > start end width > [1] 1 17 17 > [2] 38 39 2 > > cigarToQWidth(cig1) > [1] 20 > > cigarToQWidth(cig2) > [1] 20 > > The "silent deletion from padded reference" is ignored by the 2 central > low-level cigar utils cigarToIRanges() and cigarToQWidth(). All the > other cigar utils and higher level operations like coercion from > GappedAlignment to GRanges or GRangesList behave consistently with > this. > > GenomicRanges 1.7.33 will become available via BiocInstaller::biocLite() > and to R 2.15 users only in the next 24 hours. > > Cheers, > H. > > > > > Cheers, > > H. > > > > > > > > On 03/08/2012 12:11 PM, Hervé Pagès wrote: > >> Hi Michael, Dave, > >> > >> On 03/07/2012 09:23 AM, Michael Lawrence wrote: > >>> Herve, do you think we could support the P in the CIGAR strings? Or > >>> does it > >>> complicate things too much? > >> > >> According to the SAM spec: > >> > >> P: padding (silent deletion from padded reference) > >> > >> Not clear to me what they mean exactly. In particular, what's > >> the "padded reference"? > >> > >> So I looked at the example provided in section 1.1 of the spec > >> (please use a fixed-width font to display this message): > >> > >> Coor 12345678901234 5678901234567890123456789012345 > >> ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT > >> +r001/1 TTAGATAAAGGATA*CTG > >> +r002 aaaAGATAA*GGATA > >> +r003 gcctaAGCTAA > >> +r004 ATAGCT..............TCAGC > >> -r003 ttagctTAGGC > >> -r001/2 CAGCGCCAT > >> > >> The corresponding SAM format is: > >> > >> @HD VN:1.3 SO:coordinate > >> @SQ SN:ref LN:45 > >> r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * > >> r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * > >> r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1 > >> r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * > >> r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0 > >> r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT * > >> > >> As you can see, P is used in the cigar of read r002. > >> > >> So yes the reference sequence is padded with 2 consecutive * but > >> that's just a displaying trick here that makes it easier to display > >> the 2-letter insertion in r001/1. > >> > >> So I'm confused about how relevant the 1P operation in the r002 > >> cigar is. The same cigar but without the 1P operation would be > >> 3S6M1I4M, and it tells me all I need to know about how the read > >> is aligned to the reference (not to the *padded* reference, why > >> should I care?). > >> > >> Am I missing something or are there situations where the P > >> actually carries more interesting/important meaning? Otherwise, > >> adapting the cigar utils in GenomicRanges will be easy: they'll > >> just ignore the P's :-) > >> > >> But I'd really like to have a closer look at those BAM files > >> produced by the Roche's software first. Dave do you think you > >> can put this file (the one with P's), or a subset of it, > >> somewhere online where I can access it? > >> > >> Thanks, > >> H. > >> > >>> > >>> On Wed, Mar 7, 2012 at 9:03 AM, David A.<dasolexa@hotmail.com> wrote: > >>> > >>>> Thanks a lot Michael. > >>>> > >>>> I will modify my pseudo-bed file accordingly. > >>>> > >>>> > >>> If you do actually get your BED file into spec, then you could use > >>> rtracklayer::import to load it directly as a GRanges. You'll still > >>> need to > >>> set the seqlengths on it. > >>> > >>> > >>>> Regarding padding not being supported, and while waiting to see if it > >>>> ever > >>>> gets support so that I can continue with analysis, is there any tool > >>>> that > >>>> will automatically unpad the alignment? or do people create their own > >>>> scripts to remove the P flags? That is all it needs right? > >>>> > >>>> > >>>> > >>>> Dave > >>>> > >>>> ------------------------------ > >>>> Date: Wed, 7 Mar 2012 07:21:31 -0800 > >>>> Subject: Re: [BioC] creating GRanges and reading BAM files > >>>> From: lawrence.michael@gene.com > >>>> To: dasolexa@hotmail.com > >>>> CC: bioconductor@r-project.org > >>>> > >>>> > >>>> > >>>> > >>>> On Wed, Mar 7, 2012 at 4:51 AM, David A.<dasolexa@hotmail.com> wrote: > >>>> > >>>> > >>>> Hi list, > >>>> > >>>> sorry for a simple question, but I am a newbie a bit lost reading all > >>>> the > >>>> information on how to handle NGS data using R tools. I have a set of > >>>> BAM > >>>> files from Junior sequencer, one BAM per amplicon per sample (Roche's > >>>> software does not output one single BAM file per sample including all > >>>> amplicons). I also have a bed file with the features sequenced. At the > >>>> moment I am only testing with two amplicons for one sample. My final > >>>> aim is > >>>> to get the coverage per amplicon per sample > >>>> > >>>> I am using R version 2.14.2 (2012-02-29) > >>>> > >>>> I read in the bed file and tried to create a GRanges object: > >>>> > >>>>> bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > >>>>> bed.frame > >>>> chromosome start end > >>>> 1 APOBex26_M13 1 478 > >>>> 2 APOBex29_M13 1 448 > >>>> > >>>> > >>>> This does not appear to be a valid BED file. There should not be any > >>>> column names, and the intervals should be 0-based. But anyway, let's > >>>> just > >>>> say it is BED-like. > >>>> > >>>> > >>>>> > >>>> gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2] ,end=bed.frame[,3])) > >>>> > >>>> > >>>>> gr > >>>> GRanges with 2 ranges and 0 elementMetadata values: > >>>> seqnames ranges strand > >>>> <rle> <iranges> <rle> > >>>> [1] APOBex26_M13 [1, 478] * > >>>> [2] APOBex29_M13 [1, 448] * > >>>> --- > >>>> seqlengths: > >>>> APOBex26_M13 APOBex29_M13 > >>>> NA NA > >>>> > >>>> > >>>> I also read in the list of BAM files vand convert to BamFileList: > >>>> > >>>>> fls = list.files(pattern="*bam",full=TRUE) > >>>>> fls > >>>> [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" > >>>> [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" > >>>> bfs<- BamFileList(fls) > >>>> > >>>> I then try to summarizeOverlaps but gives me the following error. > >>>> > >>>>> olap<-summarizeOverlaps(gr, bfs) > >>>> Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : > >>>> in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported > >>>> yet, sorry! > >>>> > >>>> What is the meaning of this error and how can I overcome it? > >>>> > >>>> > >>>> Well I would just take it at face value: the P cigar operation is not > >>>> supported yet. Maybe it is time to start supporting it.. :) > >>>> > >>>> > >>>> Is the gr object not created properly? Is Metadata necessary? Why is > >>>> not > >>>> seqlengths filled automatically using the IRanges? > >>>> > >>>> > >>>> There is no way to know the seqlengths automatically. You've > >>>> constructed a > >>>> GRanges with features that lay on some typically longer sequence, but > >>>> the > >>>> sequence length was never specified. You can pass your "end" values > >>>> as the > >>>> "seqlengths" to the GRanges constructor. > >>>> > >>>> Michael > >>>> > >>>> > >>>> Thanks for your help, > >>>> > >>>> Dave > >>>> > >>>> [[alternative HTML version deleted]] > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>>> > >>>> > >>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > > > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago David A. ▴ 70

0

Entering edit mode

Sunny Yu Liu ▴ 80

@sunny-yu-liu-4982

Last seen 9.6 years ago

Cool, and how to transfer GRanges or GRanges list to Bed format. Thanks! -----Original Message----- From: bioconductor-bounces@r-project.org on behalf of Michael Lawrence Sent: Wed 3/7/2012 10:21 AM To: David A. Cc: Bioconductor Subject: Re: [BioC] creating GRanges and reading BAM files On Wed, Mar 7, 2012 at 4:51 AM, David A. <dasolexa@hotmail.com> wrote: > > Hi list, > > sorry for a simple question, but I am a newbie a bit lost reading all the > information on how to handle NGS data using R tools. I have a set of BAM > files from Junior sequencer, one BAM per amplicon per sample (Roche's > software does not output one single BAM file per sample including all > amplicons). I also have a bed file with the features sequenced. At the > moment I am only testing with two amplicons for one sample. My final aim is > to get the coverage per amplicon per sample > > I am using R version 2.14.2 (2012-02-29) > > I read in the bed file and tried to create a GRanges object: > > > bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > > bed.frame > chromosome start end > 1 APOBex26_M13 1 478 > 2 APOBex29_M13 1 448 > This does not appear to be a valid BED file. There should not be any column names, and the intervals should be 0-based. But anyway, let's just say it is BED-like. > > > gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end= bed.frame[,3])) > > gr > GRanges with 2 ranges and 0 elementMetadata values: > seqnames ranges strand > <rle> <iranges> <rle> > [1] APOBex26_M13 [1, 478] * > [2] APOBex29_M13 [1, 448] * > --- > seqlengths: > APOBex26_M13 APOBex29_M13 > NA NA > > > I also read in the list of BAM files vand convert to BamFileList: > > > fls = list.files(pattern="*bam",full=TRUE) > > fls > [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" > [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" > bfs <- BamFileList(fls) > > I then try to summarizeOverlaps but gives me the following error. > > >olap <-summarizeOverlaps(gr, bfs) > Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : > in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported > yet, sorry! > > What is the meaning of this error and how can I overcome it? > Well I would just take it at face value: the P cigar operation is not supported yet. Maybe it is time to start supporting it.. :) > Is the gr object not created properly? Is Metadata necessary? Why is not > seqlengths filled automatically using the IRanges? > > There is no way to know the seqlengths automatically. You've constructed a GRanges with features that lay on some typically longer sequence, but the sequence length was never specified. You can pass your "end" values as the "seqlengths" to the GRanges constructor. Michael > Thanks for your help, > > Dave > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD COMMENT • link 12.1 years ago Sunny Yu Liu ▴ 80

0

Entering edit mode

On Fri, Mar 9, 2012 at 9:24 AM, Sunny Yu Liu <sunny@lunenfeld.ca> wrote: > ** > > Cool, and how to transfer GRanges or GRanges list to Bed format. Thanks! > > Might be better as a separate thread, but see rtracklayer::export, e.g., export(gr, "my.bed"). > > > -----Original Message----- > From: bioconductor-bounces@r-project.org on behalf of Michael Lawrence > Sent: Wed 3/7/2012 10:21 AM > To: David A. > Cc: Bioconductor > Subject: Re: [BioC] creating GRanges and reading BAM files > > On Wed, Mar 7, 2012 at 4:51 AM, David A. <dasolexa@hotmail.com> wrote: > > > > > Hi list, > > > > sorry for a simple question, but I am a newbie a bit lost reading all the > > information on how to handle NGS data using R tools. I have a set of BAM > > files from Junior sequencer, one BAM per amplicon per sample (Roche's > > software does not output one single BAM file per sample including all > > amplicons). I also have a bed file with the features sequenced. At the > > moment I am only testing with two amplicons for one sample. My final aim > is > > to get the coverage per amplicon per sample > > > > I am using R version 2.14.2 (2012-02-29) > > > > I read in the bed file and tried to create a GRanges object: > > > > > bed.frame=read.table("bed.frame",sep="\t",as.is=TRUE, header=TRUE) > > > bed.frame > > chromosome start end > > 1 APOBex26_M13 1 478 > > 2 APOBex29_M13 1 448 > > > > This does not appear to be a valid BED file. There should not be any column > names, and the intervals should be 0-based. But anyway, let's just say it > is BED-like. > > > > > > > > gr<-GRanges(seqnames=bed.frame[,1],ranges=IRanges(bed.frame[,2],end= bed.frame[,3])) > > > gr > > GRanges with 2 ranges and 0 elementMetadata values: > > seqnames ranges strand > > <rle> <iranges> <rle> > > [1] APOBex26_M13 [1, 478] * > > [2] APOBex29_M13 [1, 448] * > > --- > > seqlengths: > > APOBex26_M13 APOBex29_M13 > > NA NA > > > > > > I also read in the list of BAM files vand convert to BamFileList: > > > > > fls = list.files(pattern="*bam",full=TRUE) > > > fls > > [1] "./Sample_Mult_1_MID-151_vs_APOBex26_M13.bam" > > [2] "./Sample_Mult_1_MID-151_vs_APOBex29_M13.bam" > > bfs <- BamFileList(fls) > > > > I then try to summarizeOverlaps but gives me the following error. > > > > >olap <-summarizeOverlaps(gr, bfs) > > Error in .Call2("cigar_to_width", cigar, PACKAGE = "GenomicRanges") : > > in 'cigar' element 1: CIGAR operation 'P' (at char 7) is not supported > > yet, sorry! > > > > What is the meaning of this error and how can I overcome it? > > > > Well I would just take it at face value: the P cigar operation is not > supported yet. Maybe it is time to start supporting it.. :) > > > > Is the gr object not created properly? Is Metadata necessary? Why is not > > seqlengths filled automatically using the IRanges? > > > > > There is no way to know the seqlengths automatically. You've constructed a > GRanges with features that lay on some typically longer sequence, but the > sequence length was never specified. You can pass your "end" values as the > "seqlengths" to the GRanges constructor. > > Michael > > > > Thanks for your help, > > > > Dave > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > [[alternative HTML version deleted]]

ADD REPLY • link 12.1 years ago Michael Lawrence ★ 11k

Login before adding your answer.