Hi,
I imported a gtf file and got a GRangesList with all of the cds by gene using the cdsBy() command. I'd like to remove the first 45 nucleotides of cds region from each gene. I initially was going to just use lapply() to resize the first cds exon to width()-45.
lapply(grl, function(x) resize(x[1],width(x[1])-45))
However, a number of first cds ranges are less than 45nt. I could set up a number of conditions by segregating genes by the length of their first few cds ranges, but I was wondering if there is a better/easier way to do this?
Thanks

On the devel package website GenomicFeatures is only up to 1.21.18 (https://www.bioconductor.org/packages/devel/bioc/html/GenomicFeatures.html). Do you have any estimate for when 1.21.20 will be available? Thanks
Hi Michael,
Problem with this is that it doesn't trim the CDS located on the minus strand on the correct side. It also re-orders the CDS, which is problematic:
x <- GRangesList(GRanges("chr1", IRanges(c(201, 101), width=20), "-"), GRanges("chr1", IRanges(501, width=35), "+")) x # GRangesList object of length 2: # [[1]] # GRanges object with 2 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr1 [201, 220] - # [2] chr1 [101, 120] - # # [[2]] # GRanges object with 1 range and 0 metadata columns: # seqnames ranges strand # [1] chr1 [501, 535] + psetdiff(x, pmapFromTranscripts(IRanges(1, 5), x)) # GRangesList object of length 2: # [[1]] # GRanges object with 2 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr1 [101, 120] - # [2] chr1 [206, 220] - # # [[2]] # GRanges object with 1 range and 0 metadata columns: # seqnames ranges strand # [1] chr1 [506, 535] +trimCdsHead()below corrects this:## The CDS in 'x' must be ordered by ascending rank for each transcript, ## that is, by their position in the transcript. This means that, for a ## transcript located on the minus strand, the CDS should typically be ## ordered by descending position on the reference genome. If 'x' was ## obtained with GenomicFeatures::cdsBy() then the CDS are guaranteed to ## be ordered by ascending rank. See '?cdsBy' for more information. trimCdsHead <- function(x, n=0) { stopifnot( is(x, "GRangesList"), is.numeric(n), length(n) == 1L || length(n) == length(x), all(!is.na(n)), all(n >= 0) ) if (!is.integer(n)) n <- as.integer(n) keep_range <- cumsum(width(x)) > n ans <- x[keep_range] n <- n - sum(width(x)[!keep_range]) cds1_strand <- as.character(phead(strand(ans), n=1L)) if (any(n != 0L & cds1_strand %in% "*")) stop("some CDS to trim are on the \"*\" strand") unlisted_ans <- unlist(ans) plus_idx <- which(cds1_strand %in% "+") if (length(plus_idx) != 0L) { pidx <- start(PartitioningByEnd(ans))[plus_idx] start(unlisted_ans)[pidx] <- start(unlisted_ans)[pidx] + n[plus_idx] } minus_idx <- which(cds1_strand %in% "-") if (length(minus_idx) != 0L) { midx <- start(PartitioningByEnd(ans))[minus_idx] end(unlisted_ans)[midx] <- end(unlisted_ans)[midx] - n[minus_idx] } relist(unlisted_ans, ans) }Then:
H.
The
pmapFromTranscripts()call should map the first 45bp according to the direction of transcription, right? I agree thepsetdiffmight re-order the exons (left to right), but the regions should be correct. If not, then we should fix the mapping functions.Thanks. However, for my case if the amount that I want to trim is longer than the first cds region, trimCdsHead() throws an error and doesn't work.
trimCdsHead(x, 25) GRangesList object of length 2: [[1]] GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [101, 115] - [[2]] GRanges object with 1 range and 0 metadata columns: seqnames ranges strand [1] chr1 [501, 535] + ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths Warning message: In x@start[] <- S4Vectors:::numeric2integer(value) : number of items to replace is not a multiple of replacement lengthHi Jake,
Oops...
PartitioningByEnd()needed to be called onans, not onx, sorry. I edited the definition oftrimCdsHead()in my previous answer.Also I forgot to mention that
ncan be an integer vector parallel tox(i.e. of the same length asx) so you can specify the trimming you want for each CDS.Cheers,
H.
Hi Herve,
My
cdsBy()function is not returning cds in the correct order. They are sorted by position along the reference genome rather than position within the transcript.GRangesList object of length 22458: $ENSMUSG00000000001.4 GRanges object with 8 ranges and 2 metadata columns: seqnames ranges strand | cds_id cds_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr3 [108109422, 108109612] - | 39069 <NA> [2] chr3 [108111935, 108112088] - | 39070 <NA> [3] chr3 [108112473, 108112602] - | 39071 <NA> [4] chr3 [108115763, 108115891] - | 39072 <NA> [5] chr3 [108118301, 108118458] - | 39073 <NA> [6] chr3 [108123542, 108123683] - | 39074 <NA> [7] chr3 [108123795, 108123837] - | 39075 <NA> [8] chr3 [108145888, 108146005] - | 39076 <NA> $ENSMUSG00000000003.13 GRanges object with 6 ranges and 2 metadata columns: seqnames ranges strand | cds_id cds_name [1] chrX [77841883, 77841911] - | 205797 <NA> [2] chrX [77842515, 77842616] - | 205798 <NA> [3] chrX [77842897, 77843007] - | 205799 <NA> [4] chrX [77845019, 77845086] - | 205800 <NA> [5] chrX [77847975, 77848114] - | 205801 <NA> [6] chrX [77853409, 77853483] - | 205802 <NA> ... <22456 more elements>I made a TxDb using the following command:
gencodeBasicvM6 <- makeTxDbFromGFF('~/Downloads/gencode.vM6.basic.annotation.gtf', format='gtf') cdsBy(gencode, by='gene')Below is my sessionInfo()
There is no way the CDS can be ordered by position within each transcript if you group them by gene. Of course they need to be grouped by transcript:
See
?cdsByfor more information about this.H.
Makes sense. If I want to eventually do RNA-Seq counts over genes rather than transcripts, is the best way to edit the start of each cds by transcript, then combine all these cds by gene, and then use the
reduce()command?That sounds reasonable to me except that I've some hesitations about the
reduce()step. Why do you need it? Note that depending on the counting mode you choose when callingsummarizeOverlaps(), thereduce()step can impact the counting. For example, after reducing the CDS you'll have more reads that fall completely "within" a given feature:library(GenomicAlignments) features <- GRangesList( GRanges("chr1", IRanges(c(8, 12), c(14, 16))), GRanges("chr1", IRanges(100, 300))) reads <- GRanges("chr1", IRanges(11, 15))Then:
My understanding is that the default counting mode (Union) should not be affected by the
reduce()step. See?summarizeOverlapsfor more information about the counting modes.Anyway, unless you know what you are doing and have a good reason for performing the
reduce()step, I would recommend you avoid it.Hope this makes sense,
H.
Thanks. I count with featureCounts which counts via union as well. I'm pretty sure featureCounts accounts for double counting, but I just wanted to
reduce()to be sure that a read that fell into two overlapping features was not double counted.I should have added this: extracting the CDS grouped by transcripts, then editing the start of the CDS within each transcript, and then re-grouping the edited CDS by gene will certainly introduce duplicated CDS within some genes. OTOH when you do
cdsBy(, by="gene")you don't get any duplicated CDS within a given gene. So maybe this is something that you are worried about and that you intended to fix with thereduce()step. However, I would argue that the right way to get rid of these duplicates is withunique(), notreduce():features <- GRangesList( GRanges("chr1", IRanges(c(8, 12, 8), c(14, 16, 14))), GRanges("chr1", IRanges(100, 300))) features # GRangesList object of length 2: # [[1]] # GRanges object with 3 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr1 [ 8, 14] * # [2] chr1 [12, 16] * # [3] chr1 [ 8, 14] * # # [[2]] # GRanges object with 1 range and 0 metadata columns: # seqnames ranges strand # [1] chr1 [100, 300] * unique(features) # GRangesList object of length 2: # [[1]] # GRanges object with 2 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr1 [ 8, 14] * # [2] chr1 [12, 16] * # # [[2]] # GRanges object with 1 range and 0 metadata columns: # seqnames ranges strand # [1] chr1 [100, 300] *Anyway my understanding is that these duplicates should not affect the result of
GenomicAlignments::summarizeOverlaps()and I would be surprised if they affected the output ofRsubread::featureCounts(). Furthermore, it seems to me that no read counting tool should count more than 1 hit for a given read, whatever the counting mode is (the underlying assumption being that a given read comes from at most 1 place on the reference genome). But don't take my word for it...H.