Reading paired-end data into GRangesList

0

Entering edit mode

Hubert Rehrauer ▴ 100

@hubert-rehrauer-3823

Last seen 11.3 years ago

Hi I want to load paired-end data from Bam-Files into R in order to do expression counting. The complicating thing is, that the first read was aligned using a gapped alignment (i.e. the cigar string contains Ns). How can this be done? I thought this would be a quite common task but I did not find any function that would do this. Neither scanBam nor readBamGappedAlignments are directly helpful with this. For me the most obvious thing would be to hold the alignment of such a read as a GRangesList. In order to get there I would use scanBam to load the first read. Parse the cigar string to identify the gaps, build a GRangesList and then add the alignment of the second read to the List. Do you have any better ideas? best regards, hubert

Alignment Alignment • 2.3k views

ADD COMMENT • link updated 14.2 years ago by Paul Leo ▴ 970 • written 14.2 years ago by Hubert Rehrauer ▴ 100

0

Entering edit mode

Cory Barr ▴ 60

@cory-barr-4429

Last seen 11.3 years ago

Hi Hubert, Try the grglist function on a GappedAlignments object. gapped_alignments <- readBamGappedAlignments(bam_file) granges_list <- grglist(gapped_alignments) Then to combine each end of the read from the different ends into the same element of the GRangesList, try the "split" function on the names of the reads. -Cory On Wed, Oct 26, 2011 at 6:51 AM, Hubert Rehrauer <hubert.rehrauer at="" fgcz.ethz.ch=""> wrote: > Hi > > I want to load paired-end data from Bam-Files into R in order to do > expression counting. The complicating thing is, that the first read was > aligned using a gapped alignment (i.e. the cigar string contains Ns). > > How can this be done? I thought this would be a quite common task but I did > not find any function that would do this. Neither scanBam nor > readBamGappedAlignments are directly helpful with this. > > For me the most obvious thing would be to hold the alignment of such a read > as a GRangesList. In order to get there I would use scanBam to load the > first read. Parse the ?cigar string to identify the gaps, build a > GRangesList and then add the alignment of the second read to the List. Do > you have any better ideas? > > > best regards, > hubert > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 14.2 years ago Cory Barr ▴ 60

0

Entering edit mode

Paul Leo ▴ 970

@paul-leo-2092

Last seen 11.3 years ago

Hi Huber, Think I would just use readBamGappedAlignments from this you will lose the paired end info. Yes you will add the Ns as coverage but the difference here is very small in most cases. Note also you probably want to use tag counts rather that bg coverage anyway... There is another issue ( I think) ; let say for example some NNs were due to an "insertion" in this "samples genome" compared to the reference . When you go to normalise the signal counts per exon or counts per bp whatever ... will you use the exon length/ genome length for that individual or the reference exon length? You will use the reference obviously so its a bit grey what the true answer is... However I believe you can get the exact coverage from the CIGAR if you wish see... http://stuff.mit.edu/afs/athena/software/r/current/lib/R/library/Genom icRanges/html/cigar-utils.htm irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos) irl <- irl[elementLengths(irl) != 0] reads <- as(irl,"GRanges") reads1 <- as(irl,"RangedData") gl <- coverage(reads1) probably a bit slower... however probably a bit old now......... The GenomicRanges package has some new documentation on "countGenomicOverlaps" which may sort this out for you as it's designed to make input for edgeR Deseq etc... Cheers Paul -----Original Message----- From: Hubert Rehrauer <hubert.rehrauer@fgcz.ethz.ch> To: bioconductor at r-project.org <bioconductor at="" r-project.org=""> Subject: [BioC] Reading paired-end data into GRangesList Date: Wed, 26 Oct 2011 23:51:47 +1000 Hi I want to load paired-end data from Bam-Files into R in order to do expression counting. The complicating thing is, that the first read was aligned using a gapped alignment (i.e. the cigar string contains Ns). How can this be done? I thought this would be a quite common task but I did not find any function that would do this. Neither scanBam nor readBamGappedAlignments are directly helpful with this. For me the most obvious thing would be to hold the alignment of such a read as a GRangesList. In order to get there I would use scanBam to load the first read. Parse the cigar string to identify the gaps, build a GRangesList and then add the alignment of the second read to the List. Do you have any better ideas? best regards, hubert _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 14.2 years ago Paul Leo ▴ 970

0

Entering edit mode

Hi Paul thanks. Yes, the readBamGappedAlignmenbts will not allow me to pair the reads after loading because the read-names will be discarded. I will try, the cigarToIRangesListByRName. I had overlooked this option, but I guess it's gonna be slow. The issue about length normalization is not really problem, since the reason why I have gaps on the genome is that the reads have been mapped to the transcriptome, without gaps, and then projected to the genome. So all gaps are due to introns. regards, hubert Paul Leo wrote: > Hi Huber, > Think I would just use readBamGappedAlignments from this you will lose > the paired end info. Yes you will add the Ns as coverage but the > difference here is very small in most cases. Note also you probably want > to use tag counts rather that bg coverage anyway... > > There is another issue ( I think) ; let say for example some NNs were > due to an "insertion" in this "samples genome" compared to the > reference . When you go to normalise the signal counts per exon or > counts per bp whatever ... will you use the exon length/ genome length > for that individual or the reference exon length? You will use the > reference obviously so its a bit grey what the true answer is... > > However I believe you can get the exact coverage from the CIGAR if you > wish see... > > http://stuff.mit.edu/afs/athena/software/r/current/lib/R/library/Gen omicRanges/html/cigar-utils.htm > > > irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos) > irl <- irl[elementLengths(irl) != 0] > reads <- as(irl,"GRanges") > reads1 <- as(irl,"RangedData") > gl <- coverage(reads1) > > probably a bit slower... however probably a bit old now......... > > The GenomicRanges package has some new documentation on > "countGenomicOverlaps" which may sort this out for you as it's designed > to make input for edgeR Deseq etc... > > Cheers > Paul > > > > -----Original Message----- > From: Hubert Rehrauer <hubert.rehrauer at="" fgcz.ethz.ch=""> > To: bioconductor at r-project.org <bioconductor at="" r-project.org=""> > Subject: [BioC] Reading paired-end data into GRangesList > Date: Wed, 26 Oct 2011 23:51:47 +1000 > > Hi > > I want to load paired-end data from Bam-Files into R in order to do > expression counting. The complicating thing is, that the first read was > aligned using a gapped alignment (i.e. the cigar string contains Ns). > > How can this be done? I thought this would be a quite common task but I > did not find any function that would do this. Neither scanBam nor > readBamGappedAlignments are directly helpful with this. > > For me the most obvious thing would be to hold the alignment of such a > read as a GRangesList. In order to get there I would use scanBam to load > the first read. Parse the cigar string to identify the gaps, build a > GRangesList and then add the alignment of the second read to the List. > Do you have any better ideas? > > > best regards, > hubert > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 14.2 years ago Hubert Rehrauer ▴ 100

0

Entering edit mode

readBamGappedAlignments will retain read names if the use.names arg is set to TRUE. In general, I find grglist more high-level and convenient to use than cigarToIRangesListByRName. -Cory On Fri, Oct 28, 2011 at 8:56 AM, Hubert Rehrauer <hubert.rehrauer at="" fgcz.ethz.ch=""> wrote: > Hi Paul > > thanks. Yes, the readBamGappedAlignmenbts will not allow me to pair the > reads after loading because the read-names will be discarded. > > I will try, the cigarToIRangesListByRName. I had overlooked this option, but > I guess it's gonna be slow. > > The issue about length normalization is not really problem, since the reason > why I have gaps on the genome is that the reads have been mapped to the > transcriptome, without gaps, and then projected to the genome. So all gaps > are due to introns. > > regards, > hubert > > > Paul Leo wrote: >> >> Hi Huber, >> Think I would just use readBamGappedAlignments from this you will lose >> the paired end info. Yes you will add the Ns as coverage but ?the >> difference here is very small in most cases. Note also you probably want >> to use tag counts rather that bg coverage anyway... >> >> There is another issue ( I think) ; let say for example some NNs were >> due to an "insertion" in this "samples genome" compared to the >> reference . When you go to normalise the signal counts per exon or >> counts per bp whatever ... will you use the exon length/ genome length >> for that individual or the reference exon length? ?You will use the >> reference obviously so its a bit grey what the true answer is... >> >> However I believe you can get the exact coverage from the CIGAR if you >> wish see... >> >> http://stuff.mit.edu/afs/athena/software/r/current/lib/R/library/Ge nomicRanges/html/cigar-utils.htm >> >> irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos) >> irl <- irl[elementLengths(irl) != 0] >> reads <- as(irl,"GRanges") >> reads1 <- as(irl,"RangedData") >> gl <- coverage(reads1) >> >> probably a bit slower... however probably ?a bit old now......... >> >> The GenomicRanges package has some new documentation on >> "countGenomicOverlaps" which may sort this out for you as it's designed >> to make input for edgeR Deseq etc... >> >> Cheers >> Paul >> >> >> >> -----Original Message----- >> From: Hubert Rehrauer <hubert.rehrauer at="" fgcz.ethz.ch=""> >> To: bioconductor at r-project.org <bioconductor at="" r-project.org=""> >> Subject: [BioC] Reading paired-end data into GRangesList >> Date: Wed, 26 Oct 2011 23:51:47 +1000 >> >> Hi >> >> I want to load paired-end data from Bam-Files into R in order to do >> expression counting. The complicating thing is, that the first read was >> aligned using a gapped alignment (i.e. the cigar string contains Ns). >> >> How can this be done? I thought this would be a quite common task but I >> did not find any function that would do this. Neither scanBam nor >> readBamGappedAlignments are directly helpful with this. >> >> For me the most obvious thing would be to hold the alignment of such a >> read as a GRangesList. In order to get there I would use scanBam to load the >> first read. Parse the ?cigar string to identify the gaps, build a >> GRangesList and then add the alignment of the second read to the List. Do >> you have any better ideas? >> >> >> best regards, >> hubert >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 14.2 years ago Cory Barr ▴ 60

0

Entering edit mode

many thanks, that will do the job for now, hubert Cory Barr wrote: > readBamGappedAlignments will retain read names if the use.names arg is > set to TRUE. In general, I find grglist more high-level and > convenient to use than cigarToIRangesListByRName. > > -Cory > > On Fri, Oct 28, 2011 at 8:56 AM, Hubert Rehrauer > <hubert.rehrauer at="" fgcz.ethz.ch=""> wrote: > >> Hi Paul >> >> thanks. Yes, the readBamGappedAlignmenbts will not allow me to pair the >> reads after loading because the read-names will be discarded. >> >> I will try, the cigarToIRangesListByRName. I had overlooked this option, but >> I guess it's gonna be slow. >> >> The issue about length normalization is not really problem, since the reason >> why I have gaps on the genome is that the reads have been mapped to the >> transcriptome, without gaps, and then projected to the genome. So all gaps >> are due to introns. >> >> regards, >> hubert >> >> >> Paul Leo wrote: >> >>> Hi Huber, >>> Think I would just use readBamGappedAlignments from this you will lose >>> the paired end info. Yes you will add the Ns as coverage but the >>> difference here is very small in most cases. Note also you probably want >>> to use tag counts rather that bg coverage anyway... >>> >>> There is another issue ( I think) ; let say for example some NNs were >>> due to an "insertion" in this "samples genome" compared to the >>> reference . When you go to normalise the signal counts per exon or >>> counts per bp whatever ... will you use the exon length/ genome length >>> for that individual or the reference exon length? You will use the >>> reference obviously so its a bit grey what the true answer is... >>> >>> However I believe you can get the exact coverage from the CIGAR if you >>> wish see... >>> >>> http://stuff.mit.edu/afs/athena/software/r/current/lib/R/library/G enomicRanges/html/cigar-utils.htm >>> >>> irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos) >>> irl <- irl[elementLengths(irl) != 0] >>> reads <- as(irl,"GRanges") >>> reads1 <- as(irl,"RangedData") >>> gl <- coverage(reads1) >>> >>> probably a bit slower... however probably a bit old now......... >>> >>> The GenomicRanges package has some new documentation on >>> "countGenomicOverlaps" which may sort this out for you as it's designed >>> to make input for edgeR Deseq etc... >>> >>> Cheers >>> Paul >>> >>> >>> >>> -----Original Message----- >>> From: Hubert Rehrauer <hubert.rehrauer at="" fgcz.ethz.ch=""> >>> To: bioconductor at r-project.org <bioconductor at="" r-project.org=""> >>> Subject: [BioC] Reading paired-end data into GRangesList >>> Date: Wed, 26 Oct 2011 23:51:47 +1000 >>> >>> Hi >>> >>> I want to load paired-end data from Bam-Files into R in order to do >>> expression counting. The complicating thing is, that the first read was >>> aligned using a gapped alignment (i.e. the cigar string contains Ns). >>> >>> How can this be done? I thought this would be a quite common task but I >>> did not find any function that would do this. Neither scanBam nor >>> readBamGappedAlignments are directly helpful with this. >>> >>> For me the most obvious thing would be to hold the alignment of such a >>> read as a GRangesList. In order to get there I would use scanBam to load the >>> first read. Parse the cigar string to identify the gaps, build a >>> GRangesList and then add the alignment of the second read to the List. Do >>> you have any better ideas? >>> >>> >>> best regards, >>> hubert >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>

ADD REPLY • link 14.2 years ago Hubert Rehrauer ▴ 100

0

Entering edit mode

On 10/28/2011 08:56 AM, Hubert Rehrauer wrote: > Hi Paul > > thanks. Yes, the readBamGappedAlignmenbts will not allow me to pair the > reads after loading because the read-names will be discarded. Rsamtools in the next (as in next week!) release supports adding arbitrary fields to the GappedAlignment, via ScanBamParam(what=...). In the current release it's still straight-forward to scanBam() with approriate ScanBamParam, construct a GappedAlignments(), and add additional info values(galn)[["rname"]] = bam[[1]][["rname"]] Martin > > I will try, the cigarToIRangesListByRName. I had overlooked this option, > but I guess it's gonna be slow. > > The issue about length normalization is not really problem, since the > reason why I have gaps on the genome is that the reads have been mapped > to the transcriptome, without gaps, and then projected to the genome. So > all gaps are due to introns. > > regards, > hubert > > > Paul Leo wrote: >> Hi Huber, >> Think I would just use readBamGappedAlignments from this you will lose >> the paired end info. Yes you will add the Ns as coverage but the >> difference here is very small in most cases. Note also you probably want >> to use tag counts rather that bg coverage anyway... >> >> There is another issue ( I think) ; let say for example some NNs were >> due to an "insertion" in this "samples genome" compared to the >> reference . When you go to normalise the signal counts per exon or >> counts per bp whatever ... will you use the exon length/ genome length >> for that individual or the reference exon length? You will use the >> reference obviously so its a bit grey what the true answer is... >> >> However I believe you can get the exact coverage from the CIGAR if you >> wish see... >> http://stuff.mit.edu/afs/athena/software/r/current/lib/R/library/Ge nomicRanges/html/cigar-utils.htm >> >> >> irl <- cigarToIRangesListByRName(bam$cigar, bam$rname, bam$pos) >> irl <- irl[elementLengths(irl) != 0] >> reads <- as(irl,"GRanges") >> reads1 <- as(irl,"RangedData") >> gl <- coverage(reads1) >> >> probably a bit slower... however probably a bit old now......... >> >> The GenomicRanges package has some new documentation on >> "countGenomicOverlaps" which may sort this out for you as it's designed >> to make input for edgeR Deseq etc... >> >> Cheers >> Paul >> >> >> >> -----Original Message----- >> From: Hubert Rehrauer <hubert.rehrauer at="" fgcz.ethz.ch=""> >> To: bioconductor at r-project.org <bioconductor at="" r-project.org=""> >> Subject: [BioC] Reading paired-end data into GRangesList >> Date: Wed, 26 Oct 2011 23:51:47 +1000 >> >> Hi >> >> I want to load paired-end data from Bam-Files into R in order to do >> expression counting. The complicating thing is, that the first read >> was aligned using a gapped alignment (i.e. the cigar string contains Ns). >> >> How can this be done? I thought this would be a quite common task but >> I did not find any function that would do this. Neither scanBam nor >> readBamGappedAlignments are directly helpful with this. >> >> For me the most obvious thing would be to hold the alignment of such a >> read as a GRangesList. In order to get there I would use scanBam to >> load the first read. Parse the cigar string to identify the gaps, >> build a GRangesList and then add the alignment of the second read to >> the List. Do you have any better ideas? >> >> >> best regards, >> hubert >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 14.2 years ago Martin Morgan 25k

Login before adding your answer.