Hi
I am wondering how I could retrieve the first exons from a genome, along with the exon lengths.
The purpose is to retrieve the first exons for gRNA design.
Thanks,
Xin
Hi Hervé, Thank you very much. Sorry for the unclear description. Actually, I want to retrieve the first exon of each gene, where RNA polymerase begins transcribing. The reason is that I will use the retrieved exons for CRISPR gRNA design, as targeting these regions should give highly efficient knockout.
A gene has one TSS per transcript. You can easily get the 1st exon of each known transcript by extracting the exons grouped by transcript:
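A minimal sketch of this step, assuming the TxDb.Dmelanogaster.UCSC.dm6.ensGene annotation package (an assumption inferred from the FBgn gene ids used in this thread; any other TxDb object should work the same way):

```r
library(GenomicFeatures)
# Assumption: a Drosophila TxDb package; substitute the TxDb for your organism.
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)

txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene

# Exons grouped by transcript: one GRanges per transcript,
# named by the transcript internal id.
ex_by_tx <- exonsBy(txdb, by = "tx")
ex_by_tx
```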
ex_by_tx is a named GRangesList object with 1 GRanges object per known transcript. The names on ex_by_tx are the transcript internal ids (these ids are internal to the TxDb object and have no meaning outside it). Each GRanges object in ex_by_tx contains the exons for the corresponding transcript. The exons are ordered by rank with respect to their transcript. See ?exonsBy for more information.
To keep the 1st exon for each transcript we can use heads(), a version of head() that works on a list-like object and keeps the first n elements within each list element (note that heads() is a Bioconductor extension and is not part of base R). We can then turn the result into a GRanges object with unlist() and add some useful metadata columns to it:
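The three steps (heads(), unlist(), adding metadata) can be sketched as follows. This again assumes the TxDb.Dmelanogaster.UCSC.dm6.ensGene package, and it uses transcriptLengths() as one possible source for the transcript name, gene id, and transcript length columns:

```r
library(GenomicFeatures)
# Assumption: a Drosophila TxDb package; substitute the TxDb for your organism.
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)

txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene
ex_by_tx <- exonsBy(txdb, by = "tx")

# Keep only the first exon (rank 1) within each transcript.
first_exons <- heads(ex_by_tx, n = 1)

# Flatten to a GRanges; its names are the transcript internal ids.
first_exons <- unlist(first_exons)

# Look up transcript name, gene id and transcript length by internal id.
tx_lens <- transcriptLengths(txdb)
m <- match(names(first_exons), as.character(tx_lens$tx_id))
mcols(first_exons)$tx_name <- tx_lens$tx_name[m]
mcols(first_exons)$gene_id <- tx_lens$gene_id[m]
mcols(first_exons)$tx_len  <- tx_lens$tx_len[m]

first_exons
```

The length of each first exon itself is available with width(first_exons).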
This final version of first_exons contains the genomic location of the 1st exon of each known transcript and each exon is mapped to the corresponding transcript name, gene name, and transcript length.
Hope this helps,
H.
I should add that the fact that some genomic ranges are repeated multiple times in first_exons (e.g. the first 3 exons are the same) simply reflects the fact that several transcripts in a gene can share the same first exon. So if you're interested in knowing the location of the first exon for each gene, you can start by getting rid of the repeated exons with unique().
Even after doing this, first_exons can still contain more than one "first exon" per gene (e.g. gene FBgn0067779 contains 4 "first exons"). To keep only one "first exon" per gene, I guess the natural thing to do is to go after "the most upstream first exon" for each gene, that is, the exon with the smallest start position for genes on the + strand and the exon with the greatest end position for genes on the minus strand. Here is one way to do this:
H.
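The "most upstream first exon per gene" selection described above can be sketched on a small hand-made GRanges. The coordinates and the gene ids geneA and geneB are toy values for illustration only, and the ordering trick is one possible way to do it:

```r
library(GenomicRanges)

# Toy stand-in for first_exons: geneA on the + strand, geneB on the - strand.
first_exons <- GRanges(
  seqnames = "chr2L",
  ranges = IRanges(start = c(100, 100, 300, 900, 700),
                   end   = c(200, 200, 400, 1000, 800)),
  strand = c("+", "+", "+", "-", "-"),
  gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB")
)

# Drop repeated ranges (unique() compares seqnames, ranges and strand only).
first_exons <- unique(first_exons)

# Upstream key: smallest start on +, greatest end on - (negated so that
# "smaller key" always means "more upstream").
is_minus <- as.character(strand(first_exons)) == "-"
key <- ifelse(is_minus, -end(first_exons), start(first_exons))

# Sort by gene then key, and keep the first exon seen for each gene.
ord <- order(mcols(first_exons)$gene_id, key)
sorted <- first_exons[ord]
very_first_exon_in_gene <- sorted[!duplicated(mcols(sorted)$gene_id)]
very_first_exon_in_gene
```

With real data, transcripts that have no gene id (NA) would need to be dropped before this step.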
One more thing (and I'll stop replying to myself).
The very first exon in each gene can actually be obtained more simply with:
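One possible way to do this (a sketch, not necessarily the code originally posted) is to start from the exons grouped by gene and pick the most upstream exon within each gene, without ever calling unique(). The TxDb package is again an assumption:

```r
library(GenomicFeatures)
# Assumption: a Drosophila TxDb package; substitute the TxDb for your organism.
library(TxDb.Dmelanogaster.UCSC.dm6.ensGene)

txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene

# Exons grouped by gene; the list names are the gene ids.
ex_by_gene <- exonsBy(txdb, by = "gene")

# Flatten; each exon is now named after the gene it belongs to.
ex <- unlist(ex_by_gene)

# Upstream key: smallest start on +, greatest end on - (negated).
key <- ifelse(as.character(strand(ex)) == "-", -end(ex), start(ex))

# Sort by gene then key, and keep the first exon of each gene.
# No unique() here, so genes that share an exon are all retained.
ex <- ex[order(names(ex), key)]
very_first_exon_in_each_gene <- ex[!duplicated(names(ex))]
very_first_exon_in_each_gene
```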
This method is actually doing the right thing if the goal is to extract the most upstream exon for each gene in the TxDb object. Unlike in very_first_exon_in_gene, all the known genes (i.e. all the genes in the TxDb object) are represented in very_first_exon_in_each_gene.
The previous method was losing some genes along the way due to the fact that some genes can share the same most upstream exon and that we were calling unique() at some point. For example, genes FBgn0000055 and FBgn0000056 share the same most upstream exon, which is why FBgn0000056 didn't make it to very_first_exon_in_gene.
H.
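The effect of unique() on genes that share an exon can be reproduced with a toy example. The coordinates and the gene ids geneX and geneY below are made up for illustration and are not the real FBgn0000055/FBgn0000056 locus:

```r
library(GenomicRanges)

# Two genes whose most upstream exon is the very same range.
shared <- GRanges(
  seqnames = "chr2L",
  ranges = IRanges(start = c(500, 500), end = c(650, 650)),
  strand = "+",
  gene_id = c("geneX", "geneY")
)

# unique() ignores metadata columns, so only the first copy survives
# and geneY silently disappears.
deduped <- unique(shared)
mcols(deduped)$gene_id
```

Deduplicating on the range and the gene id together, or skipping unique() altogether, avoids losing the second gene.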
Thank you very much, Hervé. Really helpful!
Hi Hervé, I am trying to design genome-wide gRNA libraries for several different species. One paper described a pipeline for the design. Basically, they first select the exonic guide sites fitting the pattern G(N16–19)NGG, and then annotate these guide sites as targeting Ensembl GRCh37 gene models to generate candidate guides for each gene. Which R package would you recommend for this purpose? Thanks
I don't know. This sounds like a slightly different topic though, so I would recommend asking it as a new question.