Getting Introns Expression at a Per Gene Level
1
0
Entering edit mode
@fong-chun-chan-5706
Last seen 9.6 years ago
Hi Valarie, Thanks for the reply. I was testing this and it doesn't seem to work with the summarizeOverlaps() function. When I try it, I get this error: countsForIntrons <- summarizeOverlaps(intronsByGene, reads); Error in function (classes, fdef, mtable) : unable to find an inherited method for function ‘summarizeOverlaps’ for signature ‘"GRangesList", "GRanges"’ Is it supposed to be countOverlaps() instead of summarizeOverlap()? Thanks, Fong On Wed, Jan 30, 2013 at 12:23 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hi Fong, > > We don't have a single function call to get introns by gene but you can > put one together in a few steps. > > library(TxDb.Hsapiens.UCSC.**hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.**knownGene > > Specifying use.names = TRUE puts the transcript names on the GRangesList. > > introns <- intronsByTranscript(txdb, use.names=TRUE) > > Unlist and remove duplicate introns. > > ulst <- unlist(introns) > intronsNoDups <- ulst[!duplicated(ulst)] > > The select() function can map txid to geneid (and others). See ?select. > > txnames <- names(intronsNoDups) > map <- select(txdb, keys=txnames, keytype='TXNAME', cols='GENEID') > > Not all transcripts are associated with a gene id and some are associated > with more than one. > > > head(map, 10) > TXNAME GENEID > 1 uc001aaa.3 <na> > 2 uc001aaa.3 <na> > 3 uc010nxq.1 <na> > 4 uc010nxq.1 <na> > 5 uc010nxr.1 <na> > 6 uc010nxr.1 <na> > 7 uc009vjk.2 100133331 > 8 uc009vjk.2 100133331 > 9 uc001aau.3 100132287 > 10 uc021oeh.1 100133331 > > > To get a GRangesList of introns by gene, split the introns on the > associated gene id. > > idx <- map$GENEID[!is.na(map$GENEID)] > intronsByGene <- split(intronsNoDups[!is.na(**map$GENEID)], idx) > names(intronsByGene) <- unique(idx) > > > The list names are geneid and the element rownames are txid. > > > intronsByGene > GRangesList of length 19784: > $100133331 > GRanges with 13 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > uc002qsd.4 chr19 [58858396, 58858718] - > uc002qsd.4 chr19 [58859007, 58861735] - > uc002qsd.4 chr19 [58862018, 58862756] - > ... > > $100132287 > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > uc003wyw.1 chr8 [18248856, 18257507] + > > > To count reads against this annotation you can use summarizeOverlaps(). > The 'reads' can be Bam files, GappedAlignments or GappedAlignmentPairs. The > 'features' argument will be the GRangesList of introns by gene. The result > will be counts per gene. > > To count just introns with no grouping you can still use > summarizeOverlaps() but use the 'intronsNoDups' GRanges object as > 'features'. > > > Valerie > > > > On 01/29/2013 11:38 AM, Fong Chun Chan wrote: > >> Hi, >> >> I am using the R package, Genomic Features, to get Intron Expression and >> the function that I am using is the intronsByTranscript(). While this >> function is useful, it returns the number of raw reads that align to each >> intron grouped at the transcript level. Is there an easy way to get it to >> group it by a gene such that I am grabbing all the introns that fall in >> all >> the transcripts belong to a gene (and without double counting the intronic >> space). >> >> Similar, is there a way to easily get the raw read count at an individual >> intron level rather than having it grouped? The introns() function seems >> to >> be defunct now as it recommends that we use the intronsByTranscript() >> function. >> >> Thanks, >> >> Fong >> >> [[alternative HTML version deleted]] >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > [[alternative HTML version deleted]]
Annotation Annotation • 2.4k views
ADD COMMENT
0
Entering edit mode

Hi Valerie,

I am trying to do the same analysis as Fong i.e. count introns. I made a GRangesList of introns by gene as you described and also made a GRangesList of exons by gene using "exonsByGene <- exonsBy(txdb, by = c("gene"))." I wanted to test out counting using summarizeOverlaps and so I tested it out by counting exonsByGene and comparing the results I got with the count results obtained from HTSeq-count. My results are different despite using the same .gtf annotation file. The command I used for summarizeOverlaps is:

exon_counts <- summarizeOverlaps(exonsByGene, reads = pathtoFiles, singleEnd = FALSE, ignore.strand = FALSE, fragments=FALSE)

I also tried using the option ignore.strand=TRUE.

and my HTSeq-counts command is:

htseq-count -m union -s reverse -i gene_id  -f bam -r pos pathtobamfile annotation.gtf > outputfile

How can I setup summarizeOverlaps so that the read counts result is the same as the HTSeq-count result?

I apologize for latching onto this thread, I didn't know if proper etiquette requires to ask a new question or just follow up on a pre-existing one.

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
The error is saying the 'reads' argument cannot be a GRanges object. The reason for this is because GRanges does not keep track of read gaps. When counting with summarizeOverlaps we want to consider the read gaps so we use the GappedAlignments and GappedAlignmentPairs objects which are 'gap-aware' and store the CIGARs. The summarizeOverlaps man page indicates what are valid objects for the 'reads' argument, ?summarizeOverlaps The reads can also be in a Bam file. Read more about that method at library(Rsamtools) library(GenomicRanges) ?'summarizeOverlaps,GRanges,BamFileList-method' The countOverlaps (and findOverlaps) functions will double count reads, summarizeOverlaps does not. See the example on the ?summarizeOverlaps man page for a comparison between findOverlaps and summarizeOverlaps. The man page also describes the different 'modes' in summarizeOverlaps that resolve reads that do hit multiple annotation features. Valerie On 02/04/2013 10:35 AM, Fong Chun Chan wrote: > Hi Valarie, > > Thanks for the reply. I was testing this and it doesn't seem to work > with the summarizeOverlaps() function. When I try it, I get this error: > > countsForIntrons <- summarizeOverlaps(intronsByGene, reads); > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function ?summarizeOverlaps? > for signature ?"GRangesList", "GRanges"? > > Is it supposed to be countOverlaps() instead of summarizeOverlap()? > > Thanks, > > Fong > > > On Wed, Jan 30, 2013 at 12:23 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > Hi Fong, > > We don't have a single function call to get introns by gene but you > can put one together in a few steps. > > library(TxDb.Hsapiens.UCSC.__hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.__knownGene > > Specifying use.names = TRUE puts the transcript names on the > GRangesList. > > introns <- intronsByTranscript(txdb, use.names=TRUE) > > Unlist and remove duplicate introns. > > ulst <- unlist(introns) > intronsNoDups <- ulst[!duplicated(ulst)] > > The select() function can map txid to geneid (and others). See ?select. > > txnames <- names(intronsNoDups) > map <- select(txdb, keys=txnames, keytype='TXNAME', cols='GENEID') > > Not all transcripts are associated with a gene id and some are > associated with more than one. > > > head(map, 10) > TXNAME GENEID > 1 uc001aaa.3 <na> > 2 uc001aaa.3 <na> > 3 uc010nxq.1 <na> > 4 uc010nxq.1 <na> > 5 uc010nxr.1 <na> > 6 uc010nxr.1 <na> > 7 uc009vjk.2 100133331 > 8 uc009vjk.2 100133331 > 9 uc001aau.3 100132287 > 10 uc021oeh.1 100133331 > > > To get a GRangesList of introns by gene, split the introns on the > associated gene id. > > idx <- map$GENEID[!is.na <http: is.na="">(map$GENEID)] > intronsByGene <- split(intronsNoDups[!is.na > <http: is.na="">(__map$GENEID)], idx) > names(intronsByGene) <- unique(idx) > > > The list names are geneid and the element rownames are txid. > > > intronsByGene > GRangesList of length 19784: > $100133331 > GRanges with 13 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > uc002qsd.4 chr19 [58858396, 58858718] - > uc002qsd.4 chr19 [58859007, 58861735] - > uc002qsd.4 chr19 [58862018, 58862756] - > ... > > $100132287 > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > uc003wyw.1 chr8 [18248856, 18257507] + > > > To count reads against this annotation you can use > summarizeOverlaps(). The 'reads' can be Bam files, GappedAlignments > or GappedAlignmentPairs. The 'features' argument will be the > GRangesList of introns by gene. The result will be counts per gene. > > To count just introns with no grouping you can still use > summarizeOverlaps() but use the 'intronsNoDups' GRanges object as > 'features'. > > > Valerie > > > > On 01/29/2013 11:38 AM, Fong Chun Chan wrote: > > Hi, > > I am using the R package, Genomic Features, to get Intron > Expression and > the function that I am using is the intronsByTranscript(). While > this > function is useful, it returns the number of raw reads that > align to each > intron grouped at the transcript level. Is there an easy way to > get it to > group it by a gene such that I am grabbing all the introns that > fall in all > the transcripts belong to a gene (and without double counting > the intronic > space). > > Similar, is there a way to easily get the raw read count at an > individual > intron level rather than having it grouped? The introns() > function seems to > be defunct now as it recommends that we use the > intronsByTranscript() > function. > > Thanks, > > Fong > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > >
ADD COMMENT
0
Entering edit mode
HI Valerie, Thanks for the information. I've corrected this now. If I have paired- end read libraries and I am strictly interested in just getting the raw reads that summarize over a exon, will there be a significant difference between using readBamGappedAlignments() and readBamGappedAlignmentPairs() to grab my reads? Based on my understanding of readBamGappedAlignmentPairs() it will simply treat the paired-end reads as a single entity and so if both reads align then we would be counting them as one unit. However if I use readBamGappedAlignments() then all the reads are treated a single-end reads. And so even if one read does not align, the other read is counted correct? I would suspect the results to be quite similar at the end or am I mistaken and missing something crucial. Thanks On Mon, Feb 4, 2013 at 12:46 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > The error is saying the 'reads' argument cannot be a GRanges object. The > reason for this is because GRanges does not keep track of read gaps. When > counting with summarizeOverlaps we want to consider the read gaps so we use > the GappedAlignments and GappedAlignmentPairs objects which are 'gap-aware' > and store the CIGARs. The summarizeOverlaps man page indicates what are > valid objects for the 'reads' argument, > > ?summarizeOverlaps > > The reads can also be in a Bam file. Read more about that method at > > library(Rsamtools) > library(GenomicRanges) > ?'summarizeOverlaps,GRanges,**BamFileList-method' > > > The countOverlaps (and findOverlaps) functions will double count reads, > summarizeOverlaps does not. See the example on the > > ?summarizeOverlaps > > man page for a comparison between findOverlaps and summarizeOverlaps. The > man page also describes the different 'modes' in summarizeOverlaps that > resolve reads that do hit multiple annotation features. > > Valerie > > > > On 02/04/2013 10:35 AM, Fong Chun Chan wrote: > >> Hi Valarie, >> >> Thanks for the reply. I was testing this and it doesn't seem to work >> with the summarizeOverlaps() function. When I try it, I get this error: >> >> countsForIntrons <- summarizeOverlaps(**intronsByGene, reads); >> Error in function (classes, fdef, mtable) : >> unable to find an inherited method for function ‘summarizeOverlaps’ >> for signature ‘"GRangesList", "GRanges"’ >> >> Is it supposed to be countOverlaps() instead of summarizeOverlap()? >> >> Thanks, >> >> Fong >> >> >> On Wed, Jan 30, 2013 at 12:23 PM, Valerie Obenchain <vobencha@fhcrc.org>> <mailto:vobencha@fhcrc.org>> wrote: >> >> Hi Fong, >> >> We don't have a single function call to get introns by gene but you >> can put one together in a few steps. >> >> library(TxDb.Hsapiens.UCSC.__**hg19.knownGene) >> txdb <- TxDb.Hsapiens.UCSC.hg19.__**knownGene >> >> >> Specifying use.names = TRUE puts the transcript names on the >> GRangesList. >> >> introns <- intronsByTranscript(txdb, use.names=TRUE) >> >> Unlist and remove duplicate introns. >> >> ulst <- unlist(introns) >> intronsNoDups <- ulst[!duplicated(ulst)] >> >> The select() function can map txid to geneid (and others). See >> ?select. >> >> txnames <- names(intronsNoDups) >> map <- select(txdb, keys=txnames, keytype='TXNAME', >> cols='GENEID') >> >> Not all transcripts are associated with a gene id and some are >> associated with more than one. >> >> > head(map, 10) >> TXNAME GENEID >> 1 uc001aaa.3 <na> >> 2 uc001aaa.3 <na> >> 3 uc010nxq.1 <na> >> 4 uc010nxq.1 <na> >> 5 uc010nxr.1 <na> >> 6 uc010nxr.1 <na> >> 7 uc009vjk.2 100133331 >> 8 uc009vjk.2 100133331 >> 9 uc001aau.3 100132287 >> 10 uc021oeh.1 100133331 >> >> >> To get a GRangesList of introns by gene, split the introns on the >> associated gene id. >> >> idx <- map$GENEID[!is.na <http: is.na="">(map$GENEID)] >> intronsByGene <- split(intronsNoDups[!is.na >> <http: is.na="">(__map$GENEID)], idx) >> >> names(intronsByGene) <- unique(idx) >> >> >> The list names are geneid and the element rownames are txid. >> >> > intronsByGene >> GRangesList of length 19784: >> $100133331 >> GRanges with 13 ranges and 0 metadata columns: >> seqnames ranges strand >> <rle> <iranges> <rle> >> uc002qsd.4 chr19 [58858396, 58858718] - >> uc002qsd.4 chr19 [58859007, 58861735] - >> uc002qsd.4 chr19 [58862018, 58862756] - >> ... >> >> $100132287 >> GRanges with 1 range and 0 metadata columns: >> seqnames ranges strand >> uc003wyw.1 chr8 [18248856, 18257507] + >> >> >> To count reads against this annotation you can use >> summarizeOverlaps(). The 'reads' can be Bam files, GappedAlignments >> or GappedAlignmentPairs. The 'features' argument will be the >> GRangesList of introns by gene. The result will be counts per gene. >> >> To count just introns with no grouping you can still use >> summarizeOverlaps() but use the 'intronsNoDups' GRanges object as >> 'features'. >> >> >> Valerie >> >> >> >> On 01/29/2013 11:38 AM, Fong Chun Chan wrote: >> >> Hi, >> >> I am using the R package, Genomic Features, to get Intron >> Expression and >> the function that I am using is the intronsByTranscript(). While >> this >> function is useful, it returns the number of raw reads that >> align to each >> intron grouped at the transcript level. Is there an easy way to >> get it to >> group it by a gene such that I am grabbing all the introns that >> fall in all >> the transcripts belong to a gene (and without double counting >> the intronic >> space). >> >> Similar, is there a way to easily get the raw read count at an >> individual >> intron level rather than having it grouped? The introns() >> function seems to >> be defunct now as it recommends that we use the >> intronsByTranscript() >> function. >> >> Thanks, >> >> Fong >> >> [[alternative HTML version deleted]] >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org> >> > >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<http s:="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https="" :="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
On 02/04/13 13:57, Fong Chun Chan wrote: > HI Valerie, > > Thanks for the information. I've corrected this now. If I have > paired-end read libraries and I am strictly interested in just getting > the raw reads that summarize over a exon, will there be a significant > difference between using readBamGappedAlignments() > and readBamGappedAlignmentPairs() to grab my reads? This depends on the reads and annotation. > > Based on my understanding of readBamGappedAlignmentPairs() it will > simply treat the paired-end reads as a single entity and so if both > reads align then we would be counting them as one unit. However if I > use readBamGappedAlignments() then all the reads are treated a > single-end reads. And so even if one read does not align, the other > read is counted correct? I would suspect the results to be quite > similar at the end or am I mistaken and missing something crucial. You are correct in that counting/overlap routines treat the reads (ranges) in GappedAlignmentPairs as one entity. If your data are paired-end and you read into GappedAlignments the number of reads are effectively doubled (perhaps more with possibility of unmated reads). Now when you count, each read fragment is considered an independent read. Whether or not the results would be similar with GappedAlignments and GappedAlignmentPairs depends on the data. If there are many reads that have a portion of both read pairs aligned to the annotation then the counts will be quite different. Valerie > > Thanks > > > On Mon, Feb 4, 2013 at 12:46 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> <mailto:vobencha at="" fhcrc.org="">> wrote: > > The error is saying the 'reads' argument cannot be a GRanges > object. The reason for this is because GRanges does not keep track > of read gaps. When counting with summarizeOverlaps we want to > consider the read gaps so we use the GappedAlignments and > GappedAlignmentPairs objects which are 'gap-aware' and store the > CIGARs. The summarizeOverlaps man page indicates what are valid > objects for the 'reads' argument, > > ?summarizeOverlaps > > The reads can also be in a Bam file. Read more about that method at > > library(Rsamtools) > library(GenomicRanges) > ?'summarizeOverlaps,GRanges,BamFileList-method' > > > The countOverlaps (and findOverlaps) functions will double count > reads, summarizeOverlaps does not. See the example on the > > ?summarizeOverlaps > > man page for a comparison between findOverlaps and > summarizeOverlaps. The man page also describes the different > 'modes' in summarizeOverlaps that resolve reads that do hit > multiple annotation features. > > Valerie > > > > On 02/04/2013 10:35 AM, Fong Chun Chan wrote: > > Hi Valarie, > > Thanks for the reply. I was testing this and it doesn't seem > to work > with the summarizeOverlaps() function. When I try it, I get > this error: > > countsForIntrons <- summarizeOverlaps(intronsByGene, reads); > Error in function (classes, fdef, mtable) : > unable to find an inherited method for function > ?summarizeOverlaps? > for signature ?"GRangesList", "GRanges"? > > Is it supposed to be countOverlaps() instead of > summarizeOverlap()? > > Thanks, > > Fong > > > On Wed, Jan 30, 2013 at 12:23 PM, Valerie Obenchain > <vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org=""> > <mailto:vobencha at="" fhcrc.org="" <mailto:vobencha="" at="" fhcrc.org="">>> wrote: > > Hi Fong, > > We don't have a single function call to get introns by > gene but you > can put one together in a few steps. > > library(TxDb.Hsapiens.UCSC.__hg19.knownGene) > txdb <- TxDb.Hsapiens.UCSC.hg19.__knownGene > > > Specifying use.names = TRUE puts the transcript names on the > GRangesList. > > introns <- intronsByTranscript(txdb, use.names=TRUE) > > Unlist and remove duplicate introns. > > ulst <- unlist(introns) > intronsNoDups <- ulst[!duplicated(ulst)] > > The select() function can map txid to geneid (and others). > See ?select. > > txnames <- names(intronsNoDups) > map <- select(txdb, keys=txnames, keytype='TXNAME', > cols='GENEID') > > Not all transcripts are associated with a gene id and some are > associated with more than one. > > > head(map, 10) > TXNAME GENEID > 1 uc001aaa.3 <na> > 2 uc001aaa.3 <na> > 3 uc010nxq.1 <na> > 4 uc010nxq.1 <na> > 5 uc010nxr.1 <na> > 6 uc010nxr.1 <na> > 7 uc009vjk.2 100133331 > 8 uc009vjk.2 100133331 > 9 uc001aau.3 100132287 > 10 uc021oeh.1 100133331 > > > To get a GRangesList of introns by gene, split the introns > on the > associated gene id. > > idx <- map$GENEID[!is.na <http: is.na=""> > <http: is.na="">(map$GENEID)] > intronsByGene <- split(intronsNoDups[!is.na > <http: is.na=""> > <http: is.na="">(__map$GENEID)], idx) > > names(intronsByGene) <- unique(idx) > > > The list names are geneid and the element rownames are txid. > > > intronsByGene > GRangesList of length 19784: > $100133331 > GRanges with 13 ranges and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > uc002qsd.4 chr19 [58858396, 58858718] - > uc002qsd.4 chr19 [58859007, 58861735] - > uc002qsd.4 chr19 [58862018, 58862756] - > ... > > $100132287 > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > uc003wyw.1 chr8 [18248856, 18257507] + > > > To count reads against this annotation you can use > summarizeOverlaps(). The 'reads' can be Bam files, > GappedAlignments > or GappedAlignmentPairs. The 'features' argument will be the > GRangesList of introns by gene. The result will be counts > per gene. > > To count just introns with no grouping you can still use > summarizeOverlaps() but use the 'intronsNoDups' GRanges > object as > 'features'. > > > Valerie > > > > On 01/29/2013 11:38 AM, Fong Chun Chan wrote: > > Hi, > > I am using the R package, Genomic Features, to get Intron > Expression and > the function that I am using is the > intronsByTranscript(). While > this > function is useful, it returns the number of raw reads > that > align to each > intron grouped at the transcript level. Is there an > easy way to > get it to > group it by a gene such that I am grabbing all the > introns that > fall in all > the transcripts belong to a gene (and without double > counting > the intronic > space). > > Similar, is there a way to easily get the raw read > count at an > individual > intron level rather than having it grouped? The introns() > function seems to > be defunct now as it recommends that we use the > intronsByTranscript() > function. > > Thanks, > > Fong > > [[alternative HTML version deleted]] > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > <mailto:bioconductor at="" r-project.org=""> <mailto:bioconductor at="" r-project.org="">> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > >
ADD REPLY

Login before adding your answer.

Traffic: 496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6