Question

ISO a solution to rank-order introns by transcript

0

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 9 weeks ago

United States

I need help with a method/operation to rank each intron by transcript -- based on the granges start/end. Specifically, I'm looking for a way to generate a unique identifier corresponding to: 1st intron, 2nd intron, ... last intron for each transcript, and place them in a granges mcols() Thanks

tx<-txdbmaker::makeTxDbFromGFF("genomic.gff",format="auto",chrominfo=s,dataSource="GCA_007735645.1",organism="Venturia.effusa",taxonomyId=50376,dbxrefTag="GENEID")

int<-GenomicFeatures::intronsByTranscript(tx,use.names=TRUE)
tail(int)

GRangesList object of length 6:
$`rna-gnl|PRJNA551043|FKW77_001900-T1_mrna`
GRanges object with 1 range and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
  [1] CP042204.1 460547-460594      -
  -------
  seqinfo: 21 sequences from ASM773564v1 genome

$`rna-gnl|PRJNA551043|FKW77_001938-T1_mrna`
GRanges object with 6 ranges and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
  [1] CP042204.1 466517-467022      -
  [2] CP042204.1 467072-467209      -
  [3] CP042204.1 467341-468214      -
  [4] CP042204.1 468703-468750      -
  [5] CP042204.1 469072-469125      -
  [6] CP042204.1 469246-469709      -
  -------
  seqinfo: 21 sequences from ASM773564v1 genome

$`rna-gnl|PRJNA551043|FKW77_001959-T1_mrna`
GRanges object with 2 ranges and 0 metadata columns:
        seqnames        ranges strand
           <Rle>     <IRanges>  <Rle>
  [1] CP042204.1 471047-471094      -
  [2] CP042204.1 471781-471826      -
  -------
  seqinfo: 21 sequences from ASM773564v1 genome

...
<3 more elements>



sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Vinaequalis.NCBI.ASM976944v1_1.0.0
 [2] BSgenome.Vnashicola.NCBI.ASM452265v2_1.0.0 
 [3] BSgenome.Veffusa.NCBI.ASM773564v1_1.0.0    
 [4] shape_1.4.6.1                              
 [5] RMariaDB_1.3.2                             
 [6] modeest_2.4.0                              
 [7] seqLogo_1.70.0                             
 [8] GenomicDistributions_1.12.0                
 [9] MotifDb_1.46.0                             
[10] rlist_0.4.6.2                              
[11] msaR_0.6.0                                 
[12] deepredeff_0.1.1                           
[13] seqinr_4.2-36                              
[14] ape_5.8                                    
[15] msa_1.36.1                                 
[16] OrganismDbi_1.46.0                         
[17] BSgenomeForge_1.4.0                        
[18] BSgenome_1.72.0                            
[19] BiocIO_1.14.0                              
[20] karyoploteR_1.30.0                         
[21] regioneR_1.36.0                            
[22] biovizBase_1.52.0                          
[23] Gviz_1.48.0                                
[24] AnnotationForge_1.46.0                     
[25] AnnotationHub_3.12.0                       
[26] BiocFileCache_2.12.0                       
[27] dbplyr_2.5.0                               
[28] UniProt.ws_2.44.0                          
[29] RSQLite_2.3.7                              
[30] gridExtra_2.3                              
[31] ggbio_1.52.0                               
[32] Rsubread_2.18.0                            
[33] ShortRead_1.62.0                           
[34] BiocParallel_1.38.0                        
[35] Rhisat2_1.20.0                             
[36] trackViewer_1.40.0                         
[37] rtracklayer_1.64.0                         
[38] R.utils_2.12.3                             
[39] R.oo_1.26.0                                
[40] R.methodsS3_1.8.2                          
[41] RColorBrewer_1.1-3                         
[42] scatterplot3d_0.3-44                       
[43] ggplot2_3.5.1                              
[44] gplots_3.1.3.1                             
[45] Cairo_1.6-2                                
[46] motifStack_1.48.0                          
[47] openxlsx_4.2.6.1                           
[48] GenomicFeatures_1.56.0                     
[49] AnnotationDbi_1.66.0                       
[50] GenomicAlignments_1.40.0                   
[51] Rsamtools_2.20.0                           
[52] Biostrings_2.72.1                          
[53] XVector_0.44.0                             
[54] SummarizedExperiment_1.34.0                
[55] Biobase_2.64.0                             
[56] MatrixGenerics_1.16.0                      
[57] matrixStats_1.3.0                          
[58] GenomicRanges_1.56.1                       
[59] GenomeInfoDb_1.40.1                        
[60] IRanges_2.38.1                             
[61] S4Vectors_0.42.1                           
[62] BiocGenerics_0.50.0                        
[63] devtools_2.4.5                             
[64] usethis_3.0.0                              

loaded via a namespace (and not attached):
  [1] fs_1.6.4                    ProtGenerics_1.36.0        
  [3] bitops_1.0-8                DirichletMultinomial_1.46.0
  [5] TFBSTools_1.42.0            httr_1.4.7                 
  [7] InteractionSet_1.32.0       profvis_0.3.8              
  [9] tools_4.4.1                 backports_1.5.0            
 [11] utf8_1.2.4                  R6_2.5.1                   
 [13] lazyeval_0.2.2              rhdf5filters_1.16.0        
 [15] urlchecker_1.0.1            withr_3.0.1                
 [17] prettyunits_1.2.0           GGally_2.2.1               
 [19] cli_3.6.3                   SGSeq_1.38.0               
 [21] grImport_0.9-7              readr_2.1.5                
 [23] txdbmaker_1.0.1             foreign_0.8-87             
 [25] dichromat_2.0-0.1           sessioninfo_1.2.2          
 [27] plotrix_3.8-4               rstudioapi_0.16.0          
 [29] generics_0.1.3              hwriter_1.3.2.1            
 [31] gtools_3.9.5                dplyr_1.1.4                
 [33] zip_2.3.1                   GO.db_3.19.1               
 [35] Matrix_1.7-0                interp_1.1-6               
 [37] fansi_1.0.6                 abind_1.4-5                
 [39] lifecycle_1.0.4             yaml_2.3.10                
 [41] rhdf5_2.48.0                SparseArray_1.4.8          
 [43] blob_1.2.4                  promises_1.3.0             
 [45] crayon_1.5.3                pwalign_1.0.0              
 [47] miniUI_0.1.1.1              lattice_0.22-6             
 [49] annotate_1.82.0             KEGGREST_1.44.1            
 [51] pillar_1.9.0                knitr_1.48                 
 [53] statip_0.2.3                rjson_0.2.21               
 [55] codetools_0.2-20            strawr_0.0.92              
 [57] glue_1.7.0                  data.table_1.15.4          
 [59] remotes_2.5.0               vctrs_0.6.5                
 [61] png_0.1-8                   gtable_0.3.5               
 [63] poweRlaw_0.80.0             cachem_1.1.0               
 [65] xfun_0.46                   S4Arrays_1.4.1             
 [67] mime_0.12                   pracma_2.4.4               
 [69] timeDate_4032.109           ellipsis_0.3.2             
 [71] nlme_3.1-165                bit64_4.0.5                
 [73] progress_1.2.3              filelock_1.0.3             
 [75] fBasics_4032.96             KernSmooth_2.23-24         
 [77] rpart_4.1.23                splitstackshape_1.4.8      
 [79] colorspace_2.1-1            DBI_1.2.3                  
 [81] Hmisc_5.1-3                 nnet_7.3-19                
 [83] ade4_1.7-22                 tidyselect_1.2.1           
 [85] timeSeries_4032.109         bit_4.0.5                  
 [87] compiler_4.4.1              curl_5.2.1                 
 [89] httr2_1.0.2                 graph_1.82.0               
 [91] rjsoncons_1.3.1             htmlTable_2.4.3            
 [93] bezier_1.1.2                xml2_1.3.6                 
 [95] DelayedArray_0.30.1         checkmate_2.3.2            
 [97] scales_1.3.0                caTools_1.18.2             
 [99] spatial_7.3-17              RBGL_1.80.0                
[101] rappdirs_0.3.3              stringr_1.5.1              
[103] digest_0.6.36               rmarkdown_2.28             
[105] htmltools_0.5.8.1           pkgconfig_2.0.3            
[107] jpeg_0.1-10                 base64enc_0.1-3            
[109] stabledist_0.7-2            fastmap_1.2.0              
[111] ensembldb_2.28.0            rlang_1.1.4                
[113] htmlwidgets_1.6.4           UCSC.utils_1.0.0           
[115] shiny_1.9.1                 jsonlite_1.8.8             
[117] VariantAnnotation_1.50.0    RCurl_1.98-1.16            
[119] magrittr_2.0.3              Formula_1.2-5              
[121] GenomeInfoDbData_1.2.12     Rhdf5lib_1.26.0            
[123] munsell_0.5.1               Rcpp_1.0.13                
[125] reticulate_1.38.0           bamsignals_1.36.0          
[127] stringi_1.8.4               stable_1.1.6               
[129] zlibbioc_1.50.0             MASS_7.3-61                
[131] plyr_1.8.9                  pkgbuild_1.4.4             
[133] ggstats_0.6.0               parallel_4.4.1             
[135] deldir_2.0-4                CNEr_1.40.0                
[137] hms_1.1.3                   igraph_2.0.3               
[139] RUnit_0.4.33                rmutil_1.1.10              
[141] reshape2_1.4.4              biomaRt_2.60.1             
[143] pkgload_1.4.0               TFMPvalue_0.0.9            
[145] BiocVersion_3.19.1          XML_3.99-0.17              
[147] evaluate_0.24.0             latticeExtra_0.6-30        
[149] BiocManager_1.30.23         tzdb_0.4.0                 
[151] httpuv_1.6.15               tidyr_1.3.1                
[153] purrr_1.0.2                 clue_0.3-65                
[155] BiocBaseUtils_1.6.0         xtable_1.8-4               
[157] restfulr_0.0.15             AnnotationFilter_1.28.0    
[159] later_1.3.2                 tibble_3.2.1               
[161] memoise_2.0.1               cluster_2.1.6

GenomicRanges GenomicFeatures • 656 views

ADD COMMENT • link 9 months ago • updated 6 months ago mat149 ▴ 80

score 2 · Accepted Answer · 2024-09-26

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 59 minutes ago

United States

Something like

> z <- intronsByTranscript(tx, use.names = TRUE)
> z <- z[lengths(z) > 0L]
> znam <- sapply(strsplit(names(z), "\\|"), "[", 3)
> zz <- unlist(z)
> zz$intron <- paste(rep(znam, lengths(z)), sapply(lengths(z), seq_len), sep = "_")
> head(zz$intron, 30)
 [1] "FKW77_000178-T1_mrna_1"  
 [2] "FKW77_000228-T1_mrna_1"  
 [3] "FKW77_000243-T1_mrna_1"  
 [4] "FKW77_000249-T1_mrna_1:3"
 [5] "FKW77_000249-T1_mrna_1:2"
 [6] "FKW77_000249-T1_mrna_1:3"
 [7] "FKW77_000264-T1_mrna_1:2"
 [8] "FKW77_000264-T1_mrna_1:4"
 [9] "FKW77_000300-T1_mrna_1"  
[10] "FKW77_000300-T1_mrna_1:2"
[11] "FKW77_000300-T1_mrna_1:2"
[12] "FKW77_000331-T1_mrna_1:2"
[13] "FKW77_000331-T1_mrna_1"  
[14] "FKW77_000372-T1_mrna_1:4"
[15] "FKW77_000372-T1_mrna_1:3"
[16] "FKW77_000372-T1_mrna_1"  
[17] "FKW77_000372-T1_mrna_1:2"
[18] "FKW77_000383-T1_mrna_1"  
[19] "FKW77_000400-T1_mrna_1:5"
[20] "FKW77_000400-T1_mrna_1:5"
[21] "FKW77_000451-T1_mrna_1:2"
[22] "FKW77_000451-T1_mrna_1:2"
[23] "FKW77_000460-T1_mrna_1:7"
[24] "FKW77_000460-T1_mrna_1:8"
[25] "FKW77_000469-T1_mrna_1"  
[26] "FKW77_000584-T1_mrna_1:2"
[27] "FKW77_000584-T1_mrna_1:2"
[28] "FKW77_000584-T1_mrna_1:2"
[29] "FKW77_000584-T1_mrna_1:2"
[30] "FKW77_000602-T1_mrna_1"

ADD COMMENT • link 9 months ago James W. MacDonald 68k

1

Entering edit mode

There's an error in the last line. Note that sapply(lengths(z), seq_len) will return a list because the lengths vary. We need a vector, so have to convert that list to a vector. There are (at least) two ways to do that, using either unlist or (my preference) do.call, which is (probably) more likely to do what you expect.

> zz$intron <- paste(rep(znam, lengths(z)), do.call(c, sapply(lengths(z), seq_len)), sep = "_")
> head(zz$intron, 30)
 [1] "FKW77_000178-T1_mrna_1"
 [2] "FKW77_000228-T1_mrna_1"
 [3] "FKW77_000243-T1_mrna_1"
 [4] "FKW77_000249-T1_mrna_1"
 [5] "FKW77_000249-T1_mrna_2"
 [6] "FKW77_000249-T1_mrna_3"
 [7] "FKW77_000264-T1_mrna_1"
 [8] "FKW77_000264-T1_mrna_2"
 [9] "FKW77_000300-T1_mrna_1"
[10] "FKW77_000300-T1_mrna_2"
[11] "FKW77_000300-T1_mrna_3"
[12] "FKW77_000331-T1_mrna_1"
[13] "FKW77_000331-T1_mrna_2"
[14] "FKW77_000372-T1_mrna_1"
[15] "FKW77_000372-T1_mrna_2"
[16] "FKW77_000372-T1_mrna_3"
[17] "FKW77_000372-T1_mrna_4"
[18] "FKW77_000383-T1_mrna_1"
[19] "FKW77_000400-T1_mrna_1"
[20] "FKW77_000400-T1_mrna_2"
[21] "FKW77_000451-T1_mrna_1"
[22] "FKW77_000451-T1_mrna_2"
[23] "FKW77_000460-T1_mrna_1"
[24] "FKW77_000460-T1_mrna_2"
[25] "FKW77_000469-T1_mrna_1"
[26] "FKW77_000584-T1_mrna_1"
[27] "FKW77_000584-T1_mrna_2"
[28] "FKW77_000584-T1_mrna_3"
[29] "FKW77_000584-T1_mrna_4"
[30] "FKW77_000602-T1_mrna_1"

ADD REPLY • link 9 months ago James W. MacDonald 68k

0

Entering edit mode

Hey, In revisiting this question, the following code works for the '+' strand. I am having issues with demarcating intron rank by transcript on '-' strand introns. Negative strand intron ranks by transcript are in the reverse orientation....(e.g. the output is 1-2-3-4-5 when it should be 5-4-3-2-1) as granges 'end' and 'start' should be inverted for the negative strand intron ranking convention. Is there a subsetting operation that can address this?

z <- intronsByTranscript(tx, use.names = TRUE)
z <- z[lengths(z) > 0L]
znam <- sapply(strsplit(names(z), "\\|"), "[", 3)
zz <- unlist(z)
ind[,7]<-unlist(sapply(lengths(z), seq_len))
colnames(ind)[7]<-"IntronRankByTranscript"

head(ind)
            seqnames  start    end width strand          GID IntronRankByTranscript
VEintron1 CP042185.1 102664 102719    56      + FKW77_000178                      1
VEintron2 CP042185.1 114417 114472    56      + FKW77_000228                      1
VEintron3 CP042185.1 117912 117961    50      + FKW77_000243                      1
VEintron4 CP042185.1 118970 119025    56      +         CYT1                      1
VEintron5 CP042185.1 119184 119241    58      +         CYT1                      2
VEintron6 CP042185.1 119905 119957    53      +         CYT1                      3

tail(ind)
                seqnames  start    end width strand          GID IntronRankByTranscript
VEintron20729 CP042204.1 501893 502450   558      - FKW77_002098                      1
VEintron20730 CP042204.1 502552 502598    47      - FKW77_002098                      2
VEintron20731 CP042204.1 502620 502666    47      - FKW77_002098                      3
VEintron20732 CP042204.1 502765 502810    46      - FKW77_002098                      4
VEintron20733 CP042204.1 502858 503688   831      - FKW77_002098                      5
VEintron20734 CP042204.1 512952 513671   720      - FKW77_002145                      1

ADD REPLY • link 6 months ago mat149 ▴ 80