1
0
Entering edit mode
Sam McInturf ▴ 300
@sam-mcinturf-5291
Last seen 7.0 years ago
United States
Bioconductors, I am working on an Arabidopsis RNA sequencing experiment and have run into some trouble. I found that I was able to map ~94% of my ~35 million reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp single end reads). After I counted my mapped reads I found that only about 47% of my mapped reads were counted. I found out there is a strong possibility that there may be genomic DNA contamination, and I would like to verify this hypothesis. I used GenomicRanges and rtracklayer to do my read counting: library(GenomicRanges) library(rtracklayer) library(Rsamtools) setwd("/myDir") myBamFiles <- list.files(pattern = ".bam$") myBamIndex <- list.files(pattern = ".bai$") bfl <- BamFileList(myBamFiles, index = myBamIndex) gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf", asRangedData = FALSE) cds <- subset(gff0, type == "CDS") cds_gene <- reduce(split(cds, cds$gene_id)) olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty") My intention with subsetting for the CDS and then splitting on gene_id is to create a GRangesList where each element is a unique AGI number and has all of the features. To extract the number of reads counted I use: colSums(assays(olap)$counts) The output of colSums(...) is how I have found that less than half my reads mapped to transcripts. The first question becomes "Is this how to count the number of reads mapped?" The second question becomes "how do I figure out where the reads went?" My initial inclination would be to make a GRangesList that contained intervals between genes and then do the same thing as before and see if I can find my other reads. Something in the vein of: maxLimits <- t(as.dataframe(lapply(cds_gene, function(x) c(min(start(ranges(x))), max(end( ranges(x))) ) ) )) which is just an elaborate lapply call to get the smallest start coordinate and the largest end coordinate for each unique gene_id and then transform it to a pretty data frame. From there I could find a way to get the 'in-betweens' of these features, but I am not sure that is the most prudent thing to do. I looked at some of the ShortRead and GenomicRanges functions for a guiding hand and found the disjointBins() function may be useful in establishing features to map over, but am at a bit of a loss as I don't have the expertise to figure this out straight away. Any suggestions or resources to find mapped but uncounted reads would be helpful! Thanks -- Sam McInturf [[alternative HTML version deleted]]
0
Entering edit mode
@valerie-obenchain-4275
Last seen 8 months ago
United States
Hi Sam, ## counting reads If you haven't done this already I'd confirm counts on a single file. A bam can be read in with readGAlignments(). If you have paired-end data you can use ?readGAlignmentPairs or ?readGAlignmentsList. bf <- BamFile("pathToOneBam") reads <- readGAlignments(bf) ## devel reads <- readGappedAlignments(bf) ## release Take a look at the makeTranscriptDbFromGFF() function in GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The advantage of working with a TxDb is that you get the accessors such as cdsBy(..., "gene"), etc. for free (you may know this, just an fyi). Get an idea of how many reads hit each list element of the GRangesList. This method does double count so it's possible for the sum of 'olap' to be greater than the length of 'reads'. olap <- countOverlaps(cds_gene, reads) If the numbers in 'olap' are very small (and this is unexpected) you may have a problem with how you've created the 'cds_gene' annotation. If the numbers make sense then try summarizeOverlaps. se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty") If the number of hits in assays(se)$counts are greatly reduced you may have many reads overlapping each other or a single feature in 'cds_gene'. This can be checked by looking at the numbers in 'olap'. If you are using the devel version (GenomicRanges >= 1.13.9) the 'inter.feature' argument allows reads to be counted for each feature they overlap. ## mapped vs unmapped reads readGAlignments() only imports mapped reads; unmapped are dropped. If you are interested in the number of mapped vs unmapped and some details about them you can use scanBam(). If your file is large you can use the 'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for param details. All reads in the file, mapped and unmapped. scn <- scanBam(bf) names(scn[[1]]) Unmapped only. flag1 <- scanBamFlag(isUnmappedQuery=TRUE) param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1) unmapped <- (bf, param1) Mapped only. flag2 <- scanBamFlag(isUnmappedQuery=FALSE) param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2) mapped <- (bf, param2) Valerie On 08/28/2013 01:50 PM, Sam McInturf wrote: > Bioconductors, > I am working on an Arabidopsis RNA sequencing experiment and have run > into some trouble. I found that I was able to map ~94% of my ~35 million > reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp > single end reads). After I counted my mapped reads I found that only about > 47% of my mapped reads were counted. I found out there is a strong > possibility that there may be genomic DNA contamination, and I would like > to verify this hypothesis. > I used GenomicRanges and rtracklayer to do my read counting: > > library(GenomicRanges) > library(rtracklayer) > library(Rsamtools) > setwd("/myDir") > myBamFiles <- list.files(pattern = ".bam$") > myBamIndex <- list.files(pattern = ".bai$") > > bfl <- BamFileList(myBamFiles, index = myBamIndex) > gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf", > asRangedData = FALSE) > cds <- subset(gff0, type == "CDS") > cds_gene <- reduce(split(cds, cds$gene_id)) > olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty") > > My intention with subsetting for the CDS and then splitting on gene_id is > to create a GRangesList where each element is a unique AGI number and has > all of the features. > To extract the number of reads counted I use: > > colSums(assays(olap)$counts) > > The output of colSums(...) is how I have found that less than half my reads > mapped to transcripts. The first question becomes "Is this how to count > the number of reads mapped?" The second question becomes "how do I figure > out where the reads went?" > > My initial inclination would be to make a GRangesList that contained > intervals between genes and then do the same thing as before and see if I > can find my other reads. Something in the vein of: > > maxLimits <- t(as.dataframe(lapply(cds_gene, function(x) > c(min(start(ranges(x))), > max(end( ranges(x))) > ) > ) > )) > which is just an elaborate lapply call to get the smallest start coordinate > and the largest end coordinate for each unique gene_id and then transform > it to a pretty data frame. From there I could find a way to get the > 'in-betweens' of these features, but I am not sure that is the most prudent > thing to do. I looked at some of the ShortRead and GenomicRanges functions > for a guiding hand and found the disjointBins() function may be useful in > establishing features to map over, but am at a bit of a loss as I don't > have the expertise to figure this out straight away. > Any suggestions or resources to find mapped but uncounted reads would be > helpful! > > > Thanks > ADD COMMENT 0 Entering edit mode Hi, On Wed, Aug 28, 2013 at 3:15 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: > Hi Sam, > > ## counting reads > > If you haven't done this already I'd confirm counts on a single file. A bam > can be read in with readGAlignments(). If you have paired-end data you can > use ?readGAlignmentPairs or ?readGAlignmentsList. > > bf <- BamFile("pathToOneBam") > reads <- readGAlignments(bf) ## devel > reads <- readGappedAlignments(bf) ## release > > Take a look at the makeTranscriptDbFromGFF() function in GenomicFeatures. It > creates a TranscriptDb object from gff or gtf. The advantage of working with > a TxDb is that you get the accessors such as cdsBy(..., "gene"), etc. for > free (you may know this, just an fyi). UTRs are exons, too! > Get an idea of how many reads hit each list element of the GRangesList. This > method does double count so it's possible for the sum of 'olap' to be > greater than the length of 'reads'. > > olap <- countOverlaps(cds_gene, reads) > > If the numbers in 'olap' are very small (and this is unexpected) you may > have a problem with how you've created the 'cds_gene' annotation. If the > numbers make sense then try summarizeOverlaps. > > se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty") > > If the number of hits in assays(se)$counts are greatly reduced you may have > many reads overlapping each other or a single feature in 'cds_gene'. This > can be checked by looking at the numbers in 'olap'. If you are using the > devel version (GenomicRanges >= 1.13.9) the 'inter.feature' argument allows > reads to be counted for each feature they overlap. > > > ## mapped vs unmapped reads I don't think the issue was mapped vs. unmapped, but rather "genic" vs "intergenic". If the OP took the TxDb route, a first approximation to the number of intergenic reads could be counted by doing something like: tx <- transcripts(txdb) strand(tx) <- '*'; txs <- reduce(tx) o <- countOverlaps(txs, reads, ignore.strand=TRUE) ... or something similar. This treats intronic reads as valid as reads that map to exons, but again -- a first approximation to answer the "DNA contamination". If you want to include intronic reads as "bad" reads, then do exons(txdb) instead of transcripts(txdb) and proceed as above. HTH, -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech
0
Entering edit mode
After a second read I think I may have missed what you're really after. Specifically the question about 'where did the reads go'? A simple check of whether or not the mapped reads hit any part of the annotation would be to overlap them with the 'gff0' GRanges. olap <- findOverlaps(gff0, reads) The result is a 'Hits' object that tells you what query overlapped what subject. See ?Hits for details. ## number of hits for each row of gff0 countQueryHits(olap) ## number of hits for each read countSubjectHits(olap) Valerie On 08/28/2013 03:15 PM, Valerie Obenchain wrote: > Hi Sam, > > ## counting reads > > If you haven't done this already I'd confirm counts on a single file. A > bam can be read in with readGAlignments(). If you have paired-end data > you can use ?readGAlignmentPairs or ?readGAlignmentsList. > > bf <- BamFile("pathToOneBam") > reads <- readGAlignments(bf) ## devel > reads <- readGappedAlignments(bf) ## release > > Take a look at the makeTranscriptDbFromGFF() function in > GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The > advantage of working with a TxDb is that you get the accessors such as > cdsBy(..., "gene"), etc. for free (you may know this, just an fyi). > > Get an idea of how many reads hit each list element of the GRangesList. > This method does double count so it's possible for the sum of 'olap' to > be greater than the length of 'reads'. > > olap <- countOverlaps(cds_gene, reads) > > If the numbers in 'olap' are very small (and this is unexpected) you may > have a problem with how you've created the 'cds_gene' annotation. If the > numbers make sense then try summarizeOverlaps. > > se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty") > > If the number of hits in assays(se)$counts are greatly reduced you may > have many reads overlapping each other or a single feature in > 'cds_gene'. This can be checked by looking at the numbers in 'olap'. If > you are using the devel version (GenomicRanges >= 1.13.9) the > 'inter.feature' argument allows reads to be counted for each feature > they overlap. > > > ## mapped vs unmapped reads > > readGAlignments() only imports mapped reads; unmapped are dropped. If > you are interested in the number of mapped vs unmapped and some details > about them you can use scanBam(). If your file is large you can use the > 'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for > param details. > > All reads in the file, mapped and unmapped. > scn <- scanBam(bf) > names(scn[[1]]) > > Unmapped only. > flag1 <- scanBamFlag(isUnmappedQuery=TRUE) > param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1) > unmapped <- (bf, param1) > > Mapped only. > flag2 <- scanBamFlag(isUnmappedQuery=FALSE) > param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2) > mapped <- (bf, param2) > > > Valerie > > On 08/28/2013 01:50 PM, Sam McInturf wrote: >> Bioconductors, >> I am working on an Arabidopsis RNA sequencing experiment and have >> run >> into some trouble. I found that I was able to map ~94% of my ~35 million >> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp >> single end reads). After I counted my mapped reads I found that only >> about >> 47% of my mapped reads were counted. I found out there is a strong >> possibility that there may be genomic DNA contamination, and I would like >> to verify this hypothesis. >> I used GenomicRanges and rtracklayer to do my read counting: >> >> library(GenomicRanges) >> library(rtracklayer) >> library(Rsamtools) >> setwd("/myDir") >> myBamFiles <- list.files(pattern = ".bam$") >> myBamIndex <- list.files(pattern = ".bai$") >> >> bfl <- BamFileList(myBamFiles, index = myBamIndex) >> gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf", >> asRangedData = FALSE) >> cds <- subset(gff0, type == "CDS") >> cds_gene <- reduce(split(cds, cds$gene_id)) >> olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty") >> >> My intention with subsetting for the CDS and then splitting on gene_id is >> to create a GRangesList where each element is a unique AGI number and has >> all of the features. >> To extract the number of reads counted I use: >> >> colSums(assays(olap)$counts) >> >> The output of colSums(...) is how I have found that less than half my >> reads >> mapped to transcripts. The first question becomes "Is this how to count >> the number of reads mapped?" The second question becomes "how do I >> figure >> out where the reads went?" >> >> My initial inclination would be to make a GRangesList that contained >> intervals between genes and then do the same thing as before and see if I >> can find my other reads. Something in the vein of: >> >> maxLimits <- t(as.dataframe(lapply(cds_gene, function(x) >> c(min(start(ranges(x))), >> max(end( ranges(x))) >> ) >> ) >> )) >> which is just an elaborate lapply call to get the smallest start >> coordinate >> and the largest end coordinate for each unique gene_id and then transform >> it to a pretty data frame. From there I could find a way to get the >> 'in-betweens' of these features, but I am not sure that is the most >> prudent >> thing to do. I looked at some of the ShortRead and GenomicRanges >> functions >> for a guiding hand and found the disjointBins() function may be useful in >> establishing features to map over, but am at a bit of a loss as I don't >> have the expertise to figure this out straight away. >> Any suggestions or resources to find mapped but uncounted reads >> would be >> helpful! >> >> >> Thanks >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor ADD REPLY 0 Entering edit mode Valarie, By calling: olap <- findOverlaps(gff0, reads) foo <- countSubjectHits(olap) countSubjectHits(x): Counts the number of hits for each subject, returning an integer vector. foo is a vector with a length equal to the number of reads mapped? so: length(foo[foo == 0]) would tell me how many reads did not map to my annotated features. can you recommend a method that would allow me to map to the regions outside of my current features? I have been trying to devise a way to make a new set of features that are the spaces between my current features, but I have been unable to implement it successfully. Maybe a more genomics oriented approach and look at the coverage over each chromosome to try and identify regions that have a higher 'background' that could be then looked at in a genome browser like IGV to see the reads that did not map to a transcript? Thanks again, Sam On Wed, Aug 28, 2013 at 5:38 PM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > After a second read I think I may have missed what you're really after. > Specifically the question about 'where did the reads go'? > > A simple check of whether or not the mapped reads hit any part of the > annotation would be to overlap them with the 'gff0' GRanges. > > olap <- findOverlaps(gff0, reads) > > The result is a 'Hits' object that tells you what query overlapped what > subject. See ?Hits for details. > > ## number of hits for each row of gff0 > countQueryHits(olap) > ## number of hits for each read > countSubjectHits(olap) > > > Valerie > > > On 08/28/2013 03:15 PM, Valerie Obenchain wrote: > >> Hi Sam, >> >> ## counting reads >> >> If you haven't done this already I'd confirm counts on a single file. A >> bam can be read in with readGAlignments(). If you have paired-end data >> you can use ?readGAlignmentPairs or ?readGAlignmentsList. >> >> bf <- BamFile("pathToOneBam") >> reads <- readGAlignments(bf) ## devel >> reads <- readGappedAlignments(bf) ## release >> >> Take a look at the makeTranscriptDbFromGFF() function in >> GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The >> advantage of working with a TxDb is that you get the accessors such as >> cdsBy(..., "gene"), etc. for free (you may know this, just an fyi). >> >> Get an idea of how many reads hit each list element of the GRangesList. >> This method does double count so it's possible for the sum of 'olap' to >> be greater than the length of 'reads'. >> >> olap <- countOverlaps(cds_gene, reads) >> >> If the numbers in 'olap' are very small (and this is unexpected) you may >> have a problem with how you've created the 'cds_gene' annotation. If the >> numbers make sense then try summarizeOverlaps. >> >> se <- summarizeOverlaps(cds_gene, bf, mode="IntersectionNotEmpty") >> >> If the number of hits in assays(se)$counts are greatly reduced you may >> have many reads overlapping each other or a single feature in >> 'cds_gene'. This can be checked by looking at the numbers in 'olap'. If >> you are using the devel version (GenomicRanges >= 1.13.9) the >> 'inter.feature' argument allows reads to be counted for each feature >> they overlap. >> >> >> ## mapped vs unmapped reads >> >> readGAlignments() only imports mapped reads; unmapped are dropped. If >> you are interested in the number of mapped vs unmapped and some details >> about them you can use scanBam(). If your file is large you can use the >> 'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for >> param details. >> >> All reads in the file, mapped and unmapped. >> scn <- scanBam(bf) >> names(scn[[1]]) >> >> Unmapped only. >> flag1 <- scanBamFlag(isUnmappedQuery=**TRUE) >> param1 <- ScanBamParam(what=scanBamWhat(**), flag=flag1) >> unmapped <- (bf, param1) >> >> Mapped only. >> flag2 <- scanBamFlag(isUnmappedQuery=**FALSE) >> param1 <- ScanBamParam(what=scanBamWhat(**), flag=flag2) >> mapped <- (bf, param2) >> >> >> Valerie >> >> On 08/28/2013 01:50 PM, Sam McInturf wrote: >> >>> Bioconductors, >>> I am working on an Arabidopsis RNA sequencing experiment and have >>> run >>> into some trouble. I found that I was able to map ~94% of my ~35 million >>> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp >>> single end reads). After I counted my mapped reads I found that only >>> about >>> 47% of my mapped reads were counted. I found out there is a strong >>> possibility that there may be genomic DNA contamination, and I would like >>> to verify this hypothesis. >>> I used GenomicRanges and rtracklayer to do my read counting: >>> >>> library(GenomicRanges) >>> library(rtracklayer) >>> library(Rsamtools) >>> setwd("/myDir") >>> myBamFiles <- list.files(pattern = ".bam$") >>> myBamIndex <- list.files(pattern = ".bai$") >>> >>> bfl <- BamFileList(myBamFiles, index = myBamIndex) >>> gff0 <- import("/myOtherDir/**Arabidopsis_thaliana.TAIR10.**17.gtf", >>> asRangedData = FALSE) >>> cds <- subset(gff0, type == "CDS") >>> cds_gene <- reduce(split(cds, cds$gene_id)) >>> olap <- summarizeOverlaps(cds_gene, bfl, mode = "IntersectionNotEmpty") >>> >>> My intention with subsetting for the CDS and then splitting on gene_id is >>> to create a GRangesList where each element is a unique AGI number and has >>> all of the features. >>> To extract the number of reads counted I use: >>> >>> colSums(assays(olap)$counts) >>> >>> The output of colSums(...) is how I have found that less than half my >>> reads >>> mapped to transcripts. The first question becomes "Is this how to count >>> the number of reads mapped?" The second question becomes "how do I >>> figure >>> out where the reads went?" >>> >>> My initial inclination would be to make a GRangesList that contained >>> intervals between genes and then do the same thing as before and see if I >>> can find my other reads. Something in the vein of: >>> >>> maxLimits <- t(as.dataframe(lapply(cds_**gene, function(x) >>> c(min(start(ranges(x))), >>> max(end( ranges(x))) >>> ) >>> ) >>> )) >>> which is just an elaborate lapply call to get the smallest start >>> coordinate >>> and the largest end coordinate for each unique gene_id and then transform >>> it to a pretty data frame. From there I could find a way to get the >>> 'in-betweens' of these features, but I am not sure that is the most >>> prudent >>> thing to do. I looked at some of the ShortRead and GenomicRanges >>> functions >>> for a guiding hand and found the disjointBins() function may be useful in >>> establishing features to map over, but am at a bit of a loss as I don't >>> have the expertise to figure this out straight away. >>> Any suggestions or resources to find mapped but uncounted reads >>> would be >>> helpful! >>> >>> >>> Thanks >>> >>> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.**science.biology.informatics.**conduct or<http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> >> > > -- Sam McInturf [[alternative HTML version deleted]]
0
Entering edit mode
Hi, On Fri, Aug 30, 2013 at 9:38 AM, Sam McInturf <smcinturf at="" gmail.com=""> wrote: > Valarie, > > By calling: > olap <- findOverlaps(gff0, reads) > foo <- countSubjectHits(olap) > > countSubjectHits(x): Counts the number of hits for each subject, returning > an integer vector. > foo is a vector with a length equal to the number of reads mapped? so: > > length(foo[foo == 0]) > > would tell me how many reads did not map to my annotated features. > > can you recommend a method that would allow me to map to the regions > outside of my current features? I have been trying to devise a way to make > a new set of features that are the spaces between my current features, but > I have been unable to implement it successfully. > > Maybe a more genomics oriented approach and look at the coverage over each > chromosome to try and identify regions that have a higher 'background' that > could be then looked at in a genome browser like IGV to see the reads that > did not map to a transcript? I had previously offered a suggestion, like so: tx <- transcripts(txdb) strand(tx) <- '*'; txs <- reduce(tx) o <- countOverlaps(txs, reads, ignore.strand=TRUE) But I was missing a crucial step, which was to use the gaps function, eg: tx <- transcripts(txdb) strand(tx) <- '*'; txs <- reduce(tx) nontx <- gaps(txs) o <- countOverlaps(nontx, reads, ignore.strand=TRUE) nontx should be a GenomicRanges object with ranges that are created from the spaces between the transcripts (the "gaps") -- which should be a good approximation to the intergenic regions you are looking for. -steve -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech