Bioconductors,
I am working on an Arabidopsis RNA sequencing experiment and have
run
into some trouble. I found that I was able to map ~94% of my ~35
million
reads / library to the Arabidopsis TAIR10 genome using tophat2 (100 bp
single end reads). After I counted my mapped reads I found that only
about
47% of my mapped reads were counted. I found out there is a strong
possibility that there may be genomic DNA contamination, and I would
like
to verify this hypothesis.
I used GenomicRanges and rtracklayer to do my read counting:
library(GenomicRanges)
library(rtracklayer)
library(Rsamtools)
setwd("/myDir")
myBamFiles <- list.files(pattern = ".bam$")
myBamIndex <- list.files(pattern = ".bai$")
bfl <- BamFileList(myBamFiles, index = myBamIndex)
gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf",
asRangedData = FALSE)
cds <- subset(gff0, type == "CDS")
cds_gene <- reduce(split(cds, cds$gene_id))
olap <- summarizeOverlaps(cds_gene, bfl, mode =
"IntersectionNotEmpty")
My intention with subsetting for the CDS and then splitting on gene_id
is
to create a GRangesList where each element is a unique AGI number and
has
all of the features.
To extract the number of reads counted I use:
colSums(assays(olap)$counts)
The output of colSums(...) is how I have found that less than half my
reads
mapped to transcripts. The first question becomes "Is this how to
count
the number of reads mapped?" The second question becomes "how do I
figure
out where the reads went?"
My initial inclination would be to make a GRangesList that contained
intervals between genes and then do the same thing as before and see
if I
can find my other reads. Something in the vein of:
maxLimits <- t(as.dataframe(lapply(cds_gene, function(x)
c(min(start(ranges(x))),
max(end( ranges(x)))
)
)
))
which is just an elaborate lapply call to get the smallest start
coordinate
and the largest end coordinate for each unique gene_id and then
transform
it to a pretty data frame. From there I could find a way to get the
'in-betweens' of these features, but I am not sure that is the most
prudent
thing to do. I looked at some of the ShortRead and GenomicRanges
functions
for a guiding hand and found the disjointBins() function may be useful
in
establishing features to map over, but am at a bit of a loss as I
don't
have the expertise to figure this out straight away.
Any suggestions or resources to find mapped but uncounted reads
would be
helpful!
Thanks
--
Sam McInturf
[[alternative HTML version deleted]]
Hi Sam,
## counting reads
If you haven't done this already I'd confirm counts on a single file.
A
bam can be read in with readGAlignments(). If you have paired-end data
you can use ?readGAlignmentPairs or ?readGAlignmentsList.
bf <- BamFile("pathToOneBam")
reads <- readGAlignments(bf) ## devel
reads <- readGappedAlignments(bf) ## release
Take a look at the makeTranscriptDbFromGFF() function in
GenomicFeatures. It creates a TranscriptDb object from gff or gtf. The
advantage of working with a TxDb is that you get the accessors such as
cdsBy(..., "gene"), etc. for free (you may know this, just an fyi).
Get an idea of how many reads hit each list element of the
GRangesList.
This method does double count so it's possible for the sum of 'olap'
to
be greater than the length of 'reads'.
olap <- countOverlaps(cds_gene, reads)
If the numbers in 'olap' are very small (and this is unexpected) you
may
have a problem with how you've created the 'cds_gene' annotation. If
the
numbers make sense then try summarizeOverlaps.
se <- summarizeOverlaps(cds_gene, bf,
mode="IntersectionNotEmpty")
If the number of hits in assays(se)$counts are greatly reduced you may
have many reads overlapping each other or a single feature in
'cds_gene'. This can be checked by looking at the numbers in 'olap'.
If
you are using the devel version (GenomicRanges >= 1.13.9) the
'inter.feature' argument allows reads to be counted for each feature
they overlap.
## mapped vs unmapped reads
readGAlignments() only imports mapped reads; unmapped are dropped. If
you are interested in the number of mapped vs unmapped and some
details
about them you can use scanBam(). If your file is large you can use
the
'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for
param details.
All reads in the file, mapped and unmapped.
scn <- scanBam(bf)
names(scn[[1]])
Unmapped only.
flag1 <- scanBamFlag(isUnmappedQuery=TRUE)
param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1)
unmapped <- (bf, param1)
Mapped only.
flag2 <- scanBamFlag(isUnmappedQuery=FALSE)
param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2)
mapped <- (bf, param2)
Valerie
On 08/28/2013 01:50 PM, Sam McInturf wrote:
> Bioconductors,
> I am working on an Arabidopsis RNA sequencing experiment and
have run
> into some trouble. I found that I was able to map ~94% of my ~35
million
> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100
bp
> single end reads). After I counted my mapped reads I found that
only about
> 47% of my mapped reads were counted. I found out there is a strong
> possibility that there may be genomic DNA contamination, and I would
like
> to verify this hypothesis.
> I used GenomicRanges and rtracklayer to do my read counting:
>
> library(GenomicRanges)
> library(rtracklayer)
> library(Rsamtools)
> setwd("/myDir")
> myBamFiles <- list.files(pattern = ".bam$")
> myBamIndex <- list.files(pattern = ".bai$")
>
> bfl <- BamFileList(myBamFiles, index = myBamIndex)
> gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf",
> asRangedData = FALSE)
> cds <- subset(gff0, type == "CDS")
> cds_gene <- reduce(split(cds, cds$gene_id))
> olap <- summarizeOverlaps(cds_gene, bfl, mode =
"IntersectionNotEmpty")
>
> My intention with subsetting for the CDS and then splitting on
gene_id is
> to create a GRangesList where each element is a unique AGI number
and has
> all of the features.
> To extract the number of reads counted I use:
>
> colSums(assays(olap)$counts)
>
> The output of colSums(...) is how I have found that less than half
my reads
> mapped to transcripts. The first question becomes "Is this how to
count
> the number of reads mapped?" The second question becomes "how do I
figure
> out where the reads went?"
>
> My initial inclination would be to make a GRangesList that contained
> intervals between genes and then do the same thing as before and see
if I
> can find my other reads. Something in the vein of:
>
> maxLimits <- t(as.dataframe(lapply(cds_gene, function(x)
> c(min(start(ranges(x))),
> max(end( ranges(x)))
> )
> )
> ))
> which is just an elaborate lapply call to get the smallest start
coordinate
> and the largest end coordinate for each unique gene_id and then
transform
> it to a pretty data frame. From there I could find a way to get the
> 'in-betweens' of these features, but I am not sure that is the most
prudent
> thing to do. I looked at some of the ShortRead and GenomicRanges
functions
> for a guiding hand and found the disjointBins() function may be
useful in
> establishing features to map over, but am at a bit of a loss as I
don't
> have the expertise to figure this out straight away.
> Any suggestions or resources to find mapped but uncounted reads
would be
> helpful!
>
>
> Thanks
>
Hi,
On Wed, Aug 28, 2013 at 3:15 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote:
> Hi Sam,
>
> ## counting reads
>
> If you haven't done this already I'd confirm counts on a single
file. A bam
> can be read in with readGAlignments(). If you have paired-end data
you can
> use ?readGAlignmentPairs or ?readGAlignmentsList.
>
> bf <- BamFile("pathToOneBam")
> reads <- readGAlignments(bf) ## devel
> reads <- readGappedAlignments(bf) ## release
>
> Take a look at the makeTranscriptDbFromGFF() function in
GenomicFeatures. It
> creates a TranscriptDb object from gff or gtf. The advantage of
working with
> a TxDb is that you get the accessors such as cdsBy(..., "gene"),
etc. for
> free (you may know this, just an fyi).
UTRs are exons, too!
> Get an idea of how many reads hit each list element of the
GRangesList. This
> method does double count so it's possible for the sum of 'olap' to
be
> greater than the length of 'reads'.
>
> olap <- countOverlaps(cds_gene, reads)
>
> If the numbers in 'olap' are very small (and this is unexpected) you
may
> have a problem with how you've created the 'cds_gene' annotation. If
the
> numbers make sense then try summarizeOverlaps.
>
> se <- summarizeOverlaps(cds_gene, bf,
mode="IntersectionNotEmpty")
>
> If the number of hits in assays(se)$counts are greatly reduced you
may have
> many reads overlapping each other or a single feature in 'cds_gene'.
This
> can be checked by looking at the numbers in 'olap'. If you are using
the
> devel version (GenomicRanges >= 1.13.9) the 'inter.feature' argument
allows
> reads to be counted for each feature they overlap.
>
>
> ## mapped vs unmapped reads
I don't think the issue was mapped vs. unmapped, but rather "genic" vs
"intergenic".
If the OP took the TxDb route, a first approximation to the number of
intergenic reads could be counted by doing something like:
tx <- transcripts(txdb)
strand(tx) <- '*';
txs <- reduce(tx)
o <- countOverlaps(txs, reads, ignore.strand=TRUE)
... or something similar. This treats intronic reads as valid as reads
that map to exons, but again -- a first approximation to answer the
"DNA contamination". If you want to include intronic reads as "bad"
reads, then do `exons(txdb)` instead of `transcripts(txdb)` and
proceed as above.
HTH,
-steve
--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
After a second read I think I may have missed what you're really
after.
Specifically the question about 'where did the reads go'?
A simple check of whether or not the mapped reads hit any part of the
annotation would be to overlap them with the 'gff0' GRanges.
olap <- findOverlaps(gff0, reads)
The result is a 'Hits' object that tells you what query overlapped
what
subject. See ?Hits for details.
## number of hits for each row of gff0
countQueryHits(olap)
## number of hits for each read
countSubjectHits(olap)
Valerie
On 08/28/2013 03:15 PM, Valerie Obenchain wrote:
> Hi Sam,
>
> ## counting reads
>
> If you haven't done this already I'd confirm counts on a single
file. A
> bam can be read in with readGAlignments(). If you have paired-end
data
> you can use ?readGAlignmentPairs or ?readGAlignmentsList.
>
> bf <- BamFile("pathToOneBam")
> reads <- readGAlignments(bf) ## devel
> reads <- readGappedAlignments(bf) ## release
>
> Take a look at the makeTranscriptDbFromGFF() function in
> GenomicFeatures. It creates a TranscriptDb object from gff or gtf.
The
> advantage of working with a TxDb is that you get the accessors such
as
> cdsBy(..., "gene"), etc. for free (you may know this, just an fyi).
>
> Get an idea of how many reads hit each list element of the
GRangesList.
> This method does double count so it's possible for the sum of 'olap'
to
> be greater than the length of 'reads'.
>
> olap <- countOverlaps(cds_gene, reads)
>
> If the numbers in 'olap' are very small (and this is unexpected) you
may
> have a problem with how you've created the 'cds_gene' annotation. If
the
> numbers make sense then try summarizeOverlaps.
>
> se <- summarizeOverlaps(cds_gene, bf,
mode="IntersectionNotEmpty")
>
> If the number of hits in assays(se)$counts are greatly reduced you
may
> have many reads overlapping each other or a single feature in
> 'cds_gene'. This can be checked by looking at the numbers in 'olap'.
If
> you are using the devel version (GenomicRanges >= 1.13.9) the
> 'inter.feature' argument allows reads to be counted for each feature
> they overlap.
>
>
> ## mapped vs unmapped reads
>
> readGAlignments() only imports mapped reads; unmapped are dropped.
If
> you are interested in the number of mapped vs unmapped and some
details
> about them you can use scanBam(). If your file is large you can use
the
> 'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for
> param details.
>
> All reads in the file, mapped and unmapped.
> scn <- scanBam(bf)
> names(scn[[1]])
>
> Unmapped only.
> flag1 <- scanBamFlag(isUnmappedQuery=TRUE)
> param1 <- ScanBamParam(what=scanBamWhat(), flag=flag1)
> unmapped <- (bf, param1)
>
> Mapped only.
> flag2 <- scanBamFlag(isUnmappedQuery=FALSE)
> param1 <- ScanBamParam(what=scanBamWhat(), flag=flag2)
> mapped <- (bf, param2)
>
>
> Valerie
>
> On 08/28/2013 01:50 PM, Sam McInturf wrote:
>> Bioconductors,
>> I am working on an Arabidopsis RNA sequencing experiment and
have
>> run
>> into some trouble. I found that I was able to map ~94% of my ~35
million
>> reads / library to the Arabidopsis TAIR10 genome using tophat2 (100
bp
>> single end reads). After I counted my mapped reads I found that
only
>> about
>> 47% of my mapped reads were counted. I found out there is a strong
>> possibility that there may be genomic DNA contamination, and I
would like
>> to verify this hypothesis.
>> I used GenomicRanges and rtracklayer to do my read counting:
>>
>> library(GenomicRanges)
>> library(rtracklayer)
>> library(Rsamtools)
>> setwd("/myDir")
>> myBamFiles <- list.files(pattern = ".bam$")
>> myBamIndex <- list.files(pattern = ".bai$")
>>
>> bfl <- BamFileList(myBamFiles, index = myBamIndex)
>> gff0 <- import("/myOtherDir/Arabidopsis_thaliana.TAIR10.17.gtf",
>> asRangedData = FALSE)
>> cds <- subset(gff0, type == "CDS")
>> cds_gene <- reduce(split(cds, cds$gene_id))
>> olap <- summarizeOverlaps(cds_gene, bfl, mode =
"IntersectionNotEmpty")
>>
>> My intention with subsetting for the CDS and then splitting on
gene_id is
>> to create a GRangesList where each element is a unique AGI number
and has
>> all of the features.
>> To extract the number of reads counted I use:
>>
>> colSums(assays(olap)$counts)
>>
>> The output of colSums(...) is how I have found that less than half
my
>> reads
>> mapped to transcripts. The first question becomes "Is this how to
count
>> the number of reads mapped?" The second question becomes "how do I
>> figure
>> out where the reads went?"
>>
>> My initial inclination would be to make a GRangesList that
contained
>> intervals between genes and then do the same thing as before and
see if I
>> can find my other reads. Something in the vein of:
>>
>> maxLimits <- t(as.dataframe(lapply(cds_gene, function(x)
>> c(min(start(ranges(x))),
>> max(end( ranges(x)))
>> )
>> )
>> ))
>> which is just an elaborate lapply call to get the smallest start
>> coordinate
>> and the largest end coordinate for each unique gene_id and then
transform
>> it to a pretty data frame. From there I could find a way to get
the
>> 'in-betweens' of these features, but I am not sure that is the most
>> prudent
>> thing to do. I looked at some of the ShortRead and GenomicRanges
>> functions
>> for a guiding hand and found the disjointBins() function may be
useful in
>> establishing features to map over, but am at a bit of a loss as I
don't
>> have the expertise to figure this out straight away.
>> Any suggestions or resources to find mapped but uncounted reads
>> would be
>> helpful!
>>
>>
>> Thanks
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
Valarie,
By calling:
olap <- findOverlaps(gff0, reads)
foo <- countSubjectHits(olap)
`countSubjectHits(x): Counts the number of hits for each subject,
returning
an integer vector.`
foo is a vector with a length equal to the number of reads mapped? so:
length(foo[foo == 0])
would tell me how many reads did not map to my annotated features.
can you recommend a method that would allow me to map to the regions
outside of my current features? I have been trying to devise a way to
make
a new set of features that are the spaces between my current features,
but
I have been unable to implement it successfully.
Maybe a more genomics oriented approach and look at the coverage over
each
chromosome to try and identify regions that have a higher 'background'
that
could be then looked at in a genome browser like IGV to see the reads
that
did not map to a transcript?
Thanks again,
Sam
On Wed, Aug 28, 2013 at 5:38 PM, Valerie Obenchain
<vobencha@fhcrc.org>wrote:
> After a second read I think I may have missed what you're really
after.
> Specifically the question about 'where did the reads go'?
>
> A simple check of whether or not the mapped reads hit any part of
the
> annotation would be to overlap them with the 'gff0' GRanges.
>
> olap <- findOverlaps(gff0, reads)
>
> The result is a 'Hits' object that tells you what query overlapped
what
> subject. See ?Hits for details.
>
> ## number of hits for each row of gff0
> countQueryHits(olap)
> ## number of hits for each read
> countSubjectHits(olap)
>
>
> Valerie
>
>
> On 08/28/2013 03:15 PM, Valerie Obenchain wrote:
>
>> Hi Sam,
>>
>> ## counting reads
>>
>> If you haven't done this already I'd confirm counts on a single
file. A
>> bam can be read in with readGAlignments(). If you have paired-end
data
>> you can use ?readGAlignmentPairs or ?readGAlignmentsList.
>>
>> bf <- BamFile("pathToOneBam")
>> reads <- readGAlignments(bf) ## devel
>> reads <- readGappedAlignments(bf) ## release
>>
>> Take a look at the makeTranscriptDbFromGFF() function in
>> GenomicFeatures. It creates a TranscriptDb object from gff or gtf.
The
>> advantage of working with a TxDb is that you get the accessors such
as
>> cdsBy(..., "gene"), etc. for free (you may know this, just an fyi).
>>
>> Get an idea of how many reads hit each list element of the
GRangesList.
>> This method does double count so it's possible for the sum of
'olap' to
>> be greater than the length of 'reads'.
>>
>> olap <- countOverlaps(cds_gene, reads)
>>
>> If the numbers in 'olap' are very small (and this is unexpected)
you may
>> have a problem with how you've created the 'cds_gene' annotation.
If the
>> numbers make sense then try summarizeOverlaps.
>>
>> se <- summarizeOverlaps(cds_gene, bf,
mode="IntersectionNotEmpty")
>>
>> If the number of hits in assays(se)$counts are greatly reduced you
may
>> have many reads overlapping each other or a single feature in
>> 'cds_gene'. This can be checked by looking at the numbers in
'olap'. If
>> you are using the devel version (GenomicRanges >= 1.13.9) the
>> 'inter.feature' argument allows reads to be counted for each
feature
>> they overlap.
>>
>>
>> ## mapped vs unmapped reads
>>
>> readGAlignments() only imports mapped reads; unmapped are dropped.
If
>> you are interested in the number of mapped vs unmapped and some
details
>> about them you can use scanBam(). If your file is large you can use
the
>> 'yieldSize' argument to BamFile. See ?BamFile and ?ScanBamParam for
>> param details.
>>
>> All reads in the file, mapped and unmapped.
>> scn <- scanBam(bf)
>> names(scn[[1]])
>>
>> Unmapped only.
>> flag1 <- scanBamFlag(isUnmappedQuery=**TRUE)
>> param1 <- ScanBamParam(what=scanBamWhat(**), flag=flag1)
>> unmapped <- (bf, param1)
>>
>> Mapped only.
>> flag2 <- scanBamFlag(isUnmappedQuery=**FALSE)
>> param1 <- ScanBamParam(what=scanBamWhat(**), flag=flag2)
>> mapped <- (bf, param2)
>>
>>
>> Valerie
>>
>> On 08/28/2013 01:50 PM, Sam McInturf wrote:
>>
>>> Bioconductors,
>>> I am working on an Arabidopsis RNA sequencing experiment and
have
>>> run
>>> into some trouble. I found that I was able to map ~94% of my ~35
million
>>> reads / library to the Arabidopsis TAIR10 genome using tophat2
(100 bp
>>> single end reads). After I counted my mapped reads I found that
only
>>> about
>>> 47% of my mapped reads were counted. I found out there is a strong
>>> possibility that there may be genomic DNA contamination, and I
would like
>>> to verify this hypothesis.
>>> I used GenomicRanges and rtracklayer to do my read counting:
>>>
>>> library(GenomicRanges)
>>> library(rtracklayer)
>>> library(Rsamtools)
>>> setwd("/myDir")
>>> myBamFiles <- list.files(pattern = ".bam$")
>>> myBamIndex <- list.files(pattern = ".bai$")
>>>
>>> bfl <- BamFileList(myBamFiles, index = myBamIndex)
>>> gff0 <-
import("/myOtherDir/**Arabidopsis_thaliana.TAIR10.**17.gtf",
>>> asRangedData = FALSE)
>>> cds <- subset(gff0, type == "CDS")
>>> cds_gene <- reduce(split(cds, cds$gene_id))
>>> olap <- summarizeOverlaps(cds_gene, bfl, mode =
"IntersectionNotEmpty")
>>>
>>> My intention with subsetting for the CDS and then splitting on
gene_id is
>>> to create a GRangesList where each element is a unique AGI number
and has
>>> all of the features.
>>> To extract the number of reads counted I use:
>>>
>>> colSums(assays(olap)$counts)
>>>
>>> The output of colSums(...) is how I have found that less than half
my
>>> reads
>>> mapped to transcripts. The first question becomes "Is this how to
count
>>> the number of reads mapped?" The second question becomes "how do
I
>>> figure
>>> out where the reads went?"
>>>
>>> My initial inclination would be to make a GRangesList that
contained
>>> intervals between genes and then do the same thing as before and
see if I
>>> can find my other reads. Something in the vein of:
>>>
>>> maxLimits <- t(as.dataframe(lapply(cds_**gene, function(x)
>>> c(min(start(ranges(x))),
>>> max(end( ranges(x)))
>>> )
>>> )
>>> ))
>>> which is just an elaborate lapply call to get the smallest start
>>> coordinate
>>> and the largest end coordinate for each unique gene_id and then
transform
>>> it to a pretty data frame. From there I could find a way to get
the
>>> 'in-betweens' of these features, but I am not sure that is the
most
>>> prudent
>>> thing to do. I looked at some of the ShortRead and GenomicRanges
>>> functions
>>> for a guiding hand and found the disjointBins() function may be
useful in
>>> establishing features to map over, but am at a bit of a loss as I
don't
>>> have the expertise to figure this out straight away.
>>> Any suggestions or resources to find mapped but uncounted reads
>>> would be
>>> helpful!
>>>
>>>
>>> Thanks
>>>
>>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the archives:
>> http://news.gmane.org/gmane.**science.biology.informatics.**conduct
or<http: news.gmane.org="" gmane.science.biology.informatics.conductor="">
>>
>
>
--
Sam McInturf
[[alternative HTML version deleted]]
Hi,
On Fri, Aug 30, 2013 at 9:38 AM, Sam McInturf <smcinturf at="" gmail.com="">
wrote:
> Valarie,
>
> By calling:
> olap <- findOverlaps(gff0, reads)
> foo <- countSubjectHits(olap)
>
> `countSubjectHits(x): Counts the number of hits for each subject,
returning
> an integer vector.`
> foo is a vector with a length equal to the number of reads mapped?
so:
>
> length(foo[foo == 0])
>
> would tell me how many reads did not map to my annotated features.
>
> can you recommend a method that would allow me to map to the
regions
> outside of my current features? I have been trying to devise a way
to make
> a new set of features that are the spaces between my current
features, but
> I have been unable to implement it successfully.
>
> Maybe a more genomics oriented approach and look at the coverage
over each
> chromosome to try and identify regions that have a higher
'background' that
> could be then looked at in a genome browser like IGV to see the
reads that
> did not map to a transcript?
I had previously offered a suggestion, like so:
tx <- transcripts(txdb)
strand(tx) <- '*';
txs <- reduce(tx)
o <- countOverlaps(txs, reads, ignore.strand=TRUE)
But I was missing a crucial step, which was to use the `gaps`
function, eg:
tx <- transcripts(txdb)
strand(tx) <- '*';
txs <- reduce(tx)
nontx <- gaps(txs)
o <- countOverlaps(nontx, reads, ignore.strand=TRUE)
`nontx` should be a GenomicRanges object with ranges that are created
from the spaces between the transcripts (the "gaps") -- which should
be a good approximation to the intergenic regions you are looking for.
-steve
--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech