exonsBy does not order exons by ascending rank, is it a bug or outdated documentation?
1
0
Entering edit mode
Jiping Wang ▴ 90
@jiping-wang-6687
Last seen 3 months ago
United States

I tried to use coverageByTranscript function from GenomicAlignments package to calculate the RNA-seq coverage along transcripts. This function requires the exons be sorted in ascending rank (i.e. the first Exxon in the transcript should be listed in the GRanges object first, and so forth, see reference below

My understanding is that if a gene is in the reference strand, exons should be sorted in ascending order genomic coordinate. If on the minus strand, the first exon in the transcript, but with highest genomic coordinate should be listed first. The documentation above said that "If transcripts was obtained with exonsBy, then the exons are guaranteed to be ordered by ascending rank. See ?exonsBy for more information". This statement appears to be incorrect. I have a gif file named genes.gtf. I used the following command to build the Granges list for all genes as follows:

  txdb <- makeTxDbFromGFF(gtf_file)
all_genes <- exonsBy(txdb, by="gene")


However the returned GRanges list does not order exons in the ascending rank for minus-strand genes. For example:

>all_genes[[3]]
GRanges object with 16 ranges and 2 metadata columns:
seqnames            ranges strand |   exon_id   exon_name
<Rle>         <IRanges>  <Rle> | <integer> <character>
[1]    chr10 52559169-52566640      - |    126742        <NA>
[2]    chr10 52569654-52569802      - |    126743        <NA>
[3]    chr10 52570800-52570936      - |    126744        <NA>
[4]    chr10 52573617-52573798      - |    126745        <NA>


This does cause problems for coverageByTranscript in calculating the coverage scores. It's messed up. Is my understanding about exonsBy function incorrect or the documentation of coverageByTranscript is outdate or incorrect? Thanks.

0
Entering edit mode
@herve-pages-1542
Last seen 11 hours ago
Seattle, WA, United States

Hi Jiping,

You're providing a link to a very outdated version of the GenomicFeatures package (1.24.4, current version is 1.38.0). Note that only the current version is supported and hopefully you're using that. Also the recommended way to access a man page is by simply doing ?functionname. This guarantees that the man page you see is that of the version of the package currently in use.

So doing ?exonsBy in my session I see:

Details:

When using ‘exonsBy’ or ‘cdsBy’ with ‘by = "tx"’, the returned
exons or CDS are ordered by ascending rank for each transcript,
that is, by their position in the transcript.  In all other
cases, the ranges will be ordered by chromosome, strand,
start, and end values.


In other words, don't expect exonsBy(txdb, by="gene") to return the exons ordered by ascending rank on their corresponding transcript. As we all know there is no notion of exon rank at the gene level in general (unless for genes that have only 1 transcript but exonsBy(txdb, by="gene") won't give these genes special treatment).

Hope this helps,

H.

> sessionInfo()
R version 3.6.0 Patched (2019-05-02 r76454)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-3.6.r76454/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.6.r76454/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicFeatures_1.38.0 AnnotationDbi_1.48.0   Biobase_2.46.0
[4] GenomicRanges_1.39.1   GenomeInfoDb_1.22.0    IRanges_2.21.1
[7] S4Vectors_0.24.0       BiocGenerics_0.32.0

loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.16.0 progress_1.2.2
[3] tidyselect_0.2.5            purrr_0.3.3
[5] lattice_0.20-38             vctrs_0.2.0
[7] BiocFileCache_1.10.0        rtracklayer_1.46.0
[9] blob_1.2.0                  XML_3.98-1.20
[11] rlang_0.4.1                 pillar_1.4.2
[13] glue_1.3.1                  DBI_1.0.0
[15] BiocParallel_1.20.0         rappdirs_0.3.1
[17] bit64_0.9-7                 dbplyr_1.4.2
[19] matrixStats_0.55.0          GenomeInfoDbData_1.2.2
[21] stringr_1.4.0               zlibbioc_1.32.0
[23] Biostrings_2.54.0           memoise_1.1.0
[25] biomaRt_2.42.0              curl_4.2
[27] Rcpp_1.0.2                  openssl_1.4.1
[29] backports_1.1.5             DelayedArray_0.12.0
[31] XVector_0.26.0              bit_1.1-14
[33] Rsamtools_2.2.0             hms_0.5.2
[37] stringi_1.4.3               dplyr_0.8.3
[39] grid_3.6.0                  tools_3.6.0
[41] bitops_1.0-6                magrittr_1.5
[43] RCurl_1.95-4.12             tibble_2.1.3
[45] RSQLite_2.1.2               crayon_1.3.4
[47] pkgconfig_2.0.3             zeallot_0.1.0
[49] Matrix_1.2-17               prettyunits_1.0.2
[51] assertthat_0.2.1            httr_1.4.1
[53] R6_2.4.0                    GenomicAlignments_1.22.0
[55] compiler_3.6.0

0
Entering edit mode

Thanks so much! I indeed ignored the fine details under exonsBy. Is there a quick way to sort exon in ascending rank if they are on minus strand? It takes extremely long time if I do the following:

for(i in 1:length(all_genes)){
if(runValue(strand(all_genes[[i]]))=="-"){
all_genes[[i]]=rev(all_genes[[i]])
}
}


However if I try endoapply, its faster than above loop, but not very fast. Is there any easy more efficient way to do this? Thanks again!

all_genes=endoapply(all_genes,
function(x) if(runValue(strand(x))=="-") {return(rev(x))}else{return(x)})


0
Entering edit mode

I'm confused as to why you need to do this. You cannot use the GRangesList object returned by exonsBy(txdb, by="gene") as input for coverageByTranscript(). That's because the exons in this object are grouped by gene and not by transcript, which means that exons for all the transcripts in a gene are mixed up together. There is no re-ordering that will fix that. Instead you must use a GRangesList object where the exons are grouped by transcript and ordered by their rank in the transcript, which is exactly what exonsBy(txdb, by="tx") will give you.

H.

0
Entering edit mode

I am not to distinguish between different transcripts as it's just not possible to deconvolve them based on reads count. The reads count from different splicing forms will be just piled up, and repeatedly counted on the shared axons if we do the coverage for transcripts. Instead I am listing all exons from the same gene, and then use pcoverageByTranscript to calculate the coverage on every exons of the same gene. This does work out as I need so far.

0
Entering edit mode

If you only want the coverage for each exon, no need to order anything. Also it no longer matters how the exons are grouped (by gene or by transcript or by any other criteria):

library(pasillaBamSubset)
library(GenomicAlignments)

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

ex_by_gene <- exonsBy(txdb, by="gene")  # grouping could be
# anything
all_exons <- unlist(ex_by_gene, use.names=FALSE)
all_exons <- as(all_exons, "GRangesList")  # a GRangesList object
# with 1 exon per group

names(cvg_by_ex) <- mcols(all_exons)$exon_id  cvg_by_ex is an RleList object parallel to all_exons i.e. with one Rle vector per exon in all_exons. The names on cvg_by_ex are the exon ids. Subsetting the object to show only exons with a non-zero coverage: > cvg_by_ex[sum(cvg_by_ex) != 0] RleList of length 822$63641
integer-Rle of length 563 with 62 runs
Lengths:  2  6  1  7  1  3  1  1  3  7  1 ... 27  6  1  1  3 31  8 33 15 17
Values : 25 28 29 27 24 23 26 28 29 27 28 ...  5  4  3  2  3  2  1  2  1  2

$63642 integer-Rle of length 120 with 45 runs Lengths: 2 4 3 1 7 1 1 2 1 3 2 ... 3 1 1 6 2 2 4 1 3 4 Values : 29 30 29 27 26 25 23 21 22 23 22 ... 29 28 27 28 27 28 31 30 29 26$63643
integer-Rle of length 292 with 81 runs
Lengths:  1  6 14  7  3  4  4 10  9  1  1 ...  2  3  2  5  1  1  2  5 11  6
Values :  2  6  5  6 10 13 14 13 15 16 17 ... 15 14 21 20 21 20 21 22 24 23

$63644 integer-Rle of length 770 with 171 runs Lengths: 9 2 11 12 2 3 6 1 1 5 8 ... 10 4 2 5 1 10 3 1 10 14 Values : 2 3 4 5 6 9 10 11 12 14 15 ... 13 14 13 12 11 10 9 8 5 3$63645
integer-Rle of length 142 with 11 runs
Lengths:  1 11 19 17 11  8  1  8 51  1 14
Values :  5  6  4  3  2  1  2  3  2  3  4

...
<817 more elements>


You can map some summarization of this back to the corresponding gene with something like:

ex_mean_cvg <- sum(cvg_by_ex) / lengths(cvg_by_ex)  # still parallel
# to 'all_exons'
ex_mean_cvg_by_gene <- relist(ex_mean_cvg, ex_by_gene)


IMPORTANT NOTE: The relist operation above is very fast and works only because ex_mean_cvg is parallel to unlist(ex_by_gene)!

ex_mean_cvg_by_gene is a NumericList object with one list element per gene. Each list element is a named numeric vector. The names on each numeric vector are the exon ids for that gene. The values are the mean coverage for the exons in that gene:

> ex_mean_cvg_by_gene
NumericList of length 15682
[["FBgn0000003"]] 45123=0
[["FBgn0000008"]] 20314=0 20315=0 20316=0 20317=0 ... 20328=0 20329=0 20330=0
[["FBgn0000014"]] 57683=0 57684=0 57685=0 57686=0 ... 57692=0 57693=0 57694=0
[["FBgn0000015"]] 57710=0 57711=0 57712=0 57713=0 ... 57719=0 57720=0 57721=0
[["FBgn0000017"]] 42204=0 42205=0 42206=0 42207=0 ... 42213=0 42214=0 42215=0
[["FBgn0000018"]] 11021=0 11022=0
[["FBgn0000022"]] 63754=0
[["FBgn0000024"]] 56491=0 56492=0 56493=0 56494=0 ... 56500=0 56501=0 56502=0
[["FBgn0000028"]] 73898=0 73899=0 73900=0 73901=0 ... 73910=0 73911=0 73912=0
[["FBgn0000032"]] 61983=0 61984=0 61985=0 61986=0 61987=0 61988=0 61989=0 61990=0
...
<15672 more elements>


Hope this helps,

H.

0
Entering edit mode

That makes sense. I will give it a try. Thanks again for great help!

But there is one potential issue: if a read partially overlaps with an exons, that will be counted in the coverage score? I need to judge whether a read or a pair of reads are compatible with the features (exons for the same gene) using something like IntersectionStrict.

0
Entering edit mode

Of course reads that partially overlap with an exon will be counted in the coverage score. The "1. A SIMPLE ARTIFICIAL EXAMPLE WITH ONLY ONE TRANSCRIPT" section in the examples of the ?coverageByTranscript man page emphasizes this behavior of coverageByTranscript() by walking you thru an example that shows exactly that. In particular it says:

When computing the coverage on a transcript, only the part
of the read that overlaps with the transcript participates
to the coverage.


If that's not what you want and you'd prefer something that behaves like IntersectionStrict instead, then you can do something like this:

1. Keep reads that fall within the exons of a single gene. We'll call them "good reads":

nhits <- countOverlaps(reads, ex_by_gene, type="within")

2. Group the "good reads" by exon. Note that a "good read" is allowed to fall within 2 or more exons as long as these exons belong to the same gene:

all_exons <- unlist(ex_by_gene, use.names=FALSE)
exon2reads <- as(t(hits), "List")  # IntegerList parallel
# to 'all_exons'
names(exon2reads) <- mcols(all_exons)$exon_id reads_by_exon <- extractList(good_reads, exon2reads)  reads_by_exon is a GAlignmentsList object that contains reads grouped by exon. More precisely it is parallel to all_exons i.e. it has one list element per exon in all_exons. Each list element is a GAlignments object containing the reads that fall within that exon. 3. Proceed as previously except that now you need to use pcoverageByTranscript() instead of coverageByTranscript(): all_exons <- as(all_exons, "GRangesList") cvg_by_ex <- pcoverageByTranscript(reads_by_exon, all_exons) names(cvg_by_ex) <- mcols(all_exons)$exon_id
cvg_by_ex[sum(cvg_by_ex) != 0]
# RleList of length 703
# $63641 # integer-Rle of length 563 with 51 runs # Lengths: 2 6 8 4 1 1 10 4 3 1 1 ... 31 6 27 6 1 1 3 31 41 32 # Values : 0 3 4 5 9 11 12 13 14 15 16 ... 6 7 5 4 3 2 3 2 1 0 # #$63642
# integer-Rle of length 120 with 17 runs
#   Lengths:  2 19  1  7  4  4  1  3 36 19  1  7  4  4  1  3  4
#  Values :  0  1  2  3  4  5  7  8 11 10  9  8  7  6  4  3  0
#
# $63643 # integer-Rle of length 292 with 70 runs # Lengths: 1 20 7 3 4 14 9 1 1 2 6 ... 18 8 5 2 2 5 6 14 5 6 # Values : 0 4 5 9 12 13 15 16 17 18 19 ... 9 8 7 6 5 4 3 2 1 0 # #$63644
# integer-Rle of length 770 with 164 runs
#   Lengths:  9  2 11 12  2  3  6  1  1  5 10 ... 14  2  5  1  2  8  3  1 10 14
#   Values :  0  1  2  3  4  7  8  9 10 12 13 ... 12 11 10  9  8  7  6  5  2  0
#
# \$63645
# integer-Rle of length 142 with 4 runs
#   Lengths:  1 66  9 66
#   Values :  0  1  2  1
#
# ...
# <698 more elements>


etc...

At this point we are pretty far from the topic of your original post ("exonsBy does not order exons by ascending rank, is it a bug or outdated documentation?") so I'd strongly suggest that you start a new post if you decide to follow up on this.

H.

0
Entering edit mode

Thanks. I had another post earlier, in which you basically have helped me out in the same way(very grateful). It worked. I want to follow the IntersectionStrict mode as HTSeq requires to calculate the coverage curve. if one read align to both genes (i.e. genes may overlap), that read should not be counted to avoid confusion. I assume IntersectionStrict does this job. I just found when I treat the ordered all exons from the same gene as one transcript to use pcoverageByTranscript to calculate the coverage per gene (all exons within the same gene), I have to have exons sorted in appropriate order. I overlooked the fine print of exonsBy in the first place (this post). If I reverse the order or exons for minus strand genes, everything works well now. Also I compared with the HTSeq output in terms of read pair count, they are very consistent. Thanks again. ~~~~~~~ https://support.bioconductor.org/p/123463/