Grange object subsetting using seqnames
1
0
Entering edit mode
rtrivedi1 ▴ 10
@4899789e
Last seen 11 months ago
United States

Hi,

I have created this GRanges object from data frame. I want to select for seqnames Chr1- chr22 and chrX.

> x2
GRanges object with 232184 ranges and 1 metadata column:
                         seqnames        ranges strand |        gene
                            <Rle>     <IRanges>  <Rle> | <character>
       [1]                   chr1   11868-14409      + |     DDX11L1
       [2]                   chr1   12009-13670      + |     DDX11L1
       [3]                   chr1   14403-29570      - |      WASH7P
       [4]                   chr1   17368-17436      - |   MIR6859-1
       [5]                   chr1   29553-31097      + | MIR1302-2HG
       ...                    ...           ...    ... .         ...
  [232180] chr22_KI270734v1_ran..   59710-60316      + |  AC007325.3
  [232181] chr22_KI270734v1_ran..   72410-74814      + |  AC007325.1
  [232182] chr22_KI270734v1_ran.. 131493-137392      + |  AC007325.4
  [232183] chr22_KI270734v1_ran.. 138081-161750      - |  AC007325.2
  [232184] chr22_KI270734v1_ran.. 138081-161852      - |  AC007325.2

> seqnames(x2)

factor-Rle of length 232184 with 47 runs
  Lengths:                   20376                   16951 ...                       5
  Values : chr1                    chr2                    ... chr22_KI270734v1_random
Levels(47): chr1 chr2 chr3 chr4 ... chrUn_KI270442v1 chrUn_KI270744v1 chrUn_KI270750v1

I created a vector for all chromosomes(1-22 and X).

>Total_chr  <-  c(paste("chr", 1:22, sep=''), 'chrX')
>Total_chr
[1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11"
[12] "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22"
[23] "chrX"

When I am subsetting x2 using Total_chr, it gives only for chr1-chr22 not for chrX. However when I check in my GRanges object chrX is present.

> x3 <- x2[seqnames(x2) == Total_chr]

GRanges object with 10048 ranges and 1 metadata column:
          seqnames            ranges strand |        gene
             <Rle>         <IRanges>  <Rle> | <character>
      [1]     chr1       11868-14409      + |     DDX11L1
      [2]     chr1     139789-140339      - |  AL627309.2
      [3]     chr1     365388-366151      - |  AL732372.2
      [4]     chr1     498280-499175      - |  AL732372.2
      [5]     chr1     632324-632413      - |  AC114498.2
      ...      ...               ...    ... .         ...
  [10044]    chr22 50546414-50547652      + |     KLHDC7B
  [10045]    chr22 50578962-50582824      - |        CHKB
  [10046]    chr22 50625017-50628170      - |        ARSA
  [10047]    chr22 50744143-50744910      + |         ACR
  [10048]    chr22 50783849-50799441      + |   RPL23AP82

> seqnames(x3)

factor-Rle of length 10048 with 23 runs
  Lengths:   886   737   628   414   480   461   478 ...   596   195   600   237   129   218
  Values : chr1  chr2  chr3  chr4  chr5  chr6  chr7  ... chr17 chr18 chr19 chr20 chr21 chr22
Levels(47): chr1 chr2 chr3 chr4 ... chrUn_KI270442v1 chrUn_KI270744v1 chrUn_KI270750v1

However when I check in my GRanges object 'chrX' is present.

x2[seqnames(x2) == 'chrX']
GRanges object with 7169 ranges and 1 metadata column:
         seqnames              ranges strand |        gene
            <Rle>           <IRanges>  <Rle> | <character>
     [1]     chrX       253742-255091      + |  AL954722.1
     [2]     chrX       276321-303353      + |      PLCXD1
     [3]     chrX       276323-291537      + |      PLCXD1
     [4]     chrX       276352-291629      + |      PLCXD1
     [5]     chrX       281054-288869      + |      PLCXD1
     ...      ...                 ...    ... .         ...
  [7165]     chrX 156022785-156023531      + |      WASH6P
  [7166]     chrX 156023366-156025666      + |      WASH6P
  [7167]     chrX 156023823-156025554      + |      WASH6P
  [7168]     chrX 156024070-156025554      + |      WASH6P
  [7169]     chrX 156025663-156027877      - |    DDX11L16
  -------
  seqinfo: 47 sequences from an unspecified genome; no seqlengths

>sessionInfo()

R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggplot2_3.3.5                              BiocParallel_1.24.1                       
 [3] MEDIPSData_1.26.0                          BSgenome.Hsapiens.UCSC.hg38.masked_1.3.993
 [5] BSgenome.Hsapiens.UCSC.hg19_1.4.3          org.Hs.eg.db_3.12.0                       
 [7] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0   GenomicFeatures_1.42.3                    
 [9] AnnotationDbi_1.52.0                       Biobase_2.50.0                            
[11] annotatr_1.16.0                            BSgenome.Hsapiens.UCSC.hg38_1.4.3         
[13] BSgenome_1.58.0                            rtracklayer_1.50.0                        
[15] Biostrings_2.58.0                          XVector_0.30.0                            
[17] GenomicRanges_1.42.0                       GenomeInfoDb_1.26.7                       
[19] IRanges_2.24.1                             S4Vectors_0.28.1                          
[21] BiocGenerics_0.36.1                        qsea_1.16.0                               

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                  matrixStats_0.59.0            bit64_4.0.5                  
 [4] progress_1.2.2                httr_1.4.2                    tools_4.0.5                  
 [7] utf8_1.2.1                    R6_2.5.0                      DBI_1.1.1                    
[10] colorspace_2.0-2              withr_2.4.2                   tidyselect_1.1.1             
[13] prettyunits_1.1.1             bit_4.0.4                     curl_4.3.2                   
[16] compiler_4.0.5                xml2_1.3.2                    DelayedArray_0.16.3          
[19] scales_1.1.1                  readr_1.4.0                   askpass_1.1                  
[22] rappdirs_0.3.3                stringr_1.4.0                 digest_0.6.27                
[25] Rsamtools_2.6.0               pkgconfig_2.0.3               htmltools_0.5.1.1            
[28] MatrixGenerics_1.2.1          highr_0.9                     regioneR_1.22.0              
[31] dbplyr_2.1.1                  fastmap_1.1.0                 limma_3.46.0                 
[34] rlang_0.4.11                  rstudioapi_0.13               RSQLite_2.2.7                
[37] shiny_1.6.0                   generics_0.1.0                zoo_1.8-9                    
[40] gtools_3.9.2                  dplyr_1.0.7                   RCurl_1.98-1.3               
[43] magrittr_2.0.1                GenomeInfoDbData_1.2.4        Matrix_1.3-4                 
[46] Rcpp_1.0.6                    munsell_0.5.0                 fansi_0.5.0                  
[49] lifecycle_1.0.0               stringi_1.6.2                 yaml_2.2.1                   
[52] SummarizedExperiment_1.20.0   zlibbioc_1.36.0               plyr_1.8.6                   
[55] HMMcopy_1.32.0                BiocFileCache_1.14.0          AnnotationHub_2.22.1         
[58] grid_4.0.5                    blob_1.2.1                    promises_1.2.0.1             
[61] crayon_1.4.1                  lattice_0.20-44               hms_1.1.0                    
[64] knitr_1.33                    pillar_1.6.1                  reshape2_1.4.4               
[67] biomaRt_2.46.3                XML_3.99-0.6                  glue_1.4.2                   
[70] BiocVersion_3.12.0            data.table_1.14.0             BiocManager_1.30.16          
[73] vctrs_0.3.8                   httpuv_1.6.1                  gtable_0.3.0                 
[76] openssl_1.4.4                 purrr_0.3.4                   assertthat_0.2.1             
[79] cachem_1.0.5                  xfun_0.24                     mime_0.11                    
[82] xtable_1.8-4                  later_1.2.0                   tibble_3.1.2                 
[85] GenomicAlignments_1.26.0      memoise_2.0.0                 ellipsis_0.3.2               
[88] interactiveDisplayBase_1.28.0

Why is it so ? What I am missing while subsetting ?

Thanks

Rakesh

Seqnames makeGRangesFromDataFrame GRanges Subsetting GenomicRanges • 432 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

You should use the existing functionality rather than rolling your own. As an example,

> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene)
> length(seqlevels(tx))
[1] 595
> tx2 <- keepSeqlevels(tx, standardChromosomes(tx)[1:23], pruning.mode = "coarse")
> seqlevels(tx2)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX" 
> tx2[seqnames(tx2) == standardChromosomes(tx)[1:23]]
GRanges object with 10048 ranges and 2 metadata columns:
          seqnames              ranges strand |     tx_id           tx_name
             <Rle>           <IRanges>  <Rle> | <integer>       <character>
      [1]     chr1         11869-14409      + |         1 ENST00000456328.2
      [2]     chr1       631074-632616      + |        24 ENST00000414273.1
      [3]     chr1       778937-784429      + |        47 ENST00000609830.1
      [4]     chr1       827604-859085      + |        70 ENST00000670780.1
      [5]     chr1       860226-866720      + |        93 ENST00000671208.1
      ...      ...                 ...    ... .       ...               ...
  [10044]     chrX 154508572-154516209      - |    231012 ENST00000442929.1
  [10045]     chrX 154658741-154659643      - |    231035 ENST00000437536.2
  [10046]     chrX 154789989-154805498      - |    231058 ENST00000369531.1
  [10047]     chrX 155065321-155147775      - |    231081 ENST00000362018.2
  [10048]     chrX 156025664-156027877      - |    231104 ENST00000507418.6
  -------
  seqinfo: 23 sequences from hg38 genome
ADD COMMENT
0
Entering edit mode

Also, note that this

tx2[seqnames(tx2) == standardChromosomes(tx)[1:23]]

doesn't work the way you might think. Using equality on a vector isn't really a thing. Instead you should use %in%

> tx2[seqnames(tx2) %in% standardChromosomes(tx)[1:23]]
GRanges object with 231104 ranges and 2 metadata columns:
           seqnames              ranges strand |     tx_id           tx_name
              <Rle>           <IRanges>  <Rle> | <integer>       <character>
       [1]     chr1         11869-14409      + |         1 ENST00000456328.2
       [2]     chr1         12010-13670      + |         2 ENST00000450305.2
       [3]     chr1         29554-31097      + |         3 ENST00000473358.1
       [4]     chr1         30267-31109      + |         4 ENST00000469289.1
       [5]     chr1         30366-30503      + |         5 ENST00000607096.1
       ...      ...                 ...    ... .       ...               ...
  [231100]     chrX 155828585-155829576      - |    231100 ENST00000412936.6
  [231101]     chrX 155978992-155979325      - |    231101 ENST00000456370.6
  [231102]     chrX 155985370-155986249      - |    231102 ENST00000421233.6
  [231103]     chrX 156014623-156016837      - |    231103 ENST00000399966.9
  [231104]     chrX 156025664-156027877      - |    231104 ENST00000507418.6
  -------
  seqinfo: 23 sequences from hg38 genome

## which is the same as if I didn't subset

> tx2
GRanges object with 231104 ranges and 2 metadata columns:
           seqnames              ranges strand |     tx_id           tx_name
              <Rle>           <IRanges>  <Rle> | <integer>       <character>
       [1]     chr1         11869-14409      + |         1 ENST00000456328.2
       [2]     chr1         12010-13670      + |         2 ENST00000450305.2
       [3]     chr1         29554-31097      + |         3 ENST00000473358.1
       [4]     chr1         30267-31109      + |         4 ENST00000469289.1
       [5]     chr1         30366-30503      + |         5 ENST00000607096.1
       ...      ...                 ...    ... .       ...               ...
  [231100]     chrX 155828585-155829576      - |    231100 ENST00000412936.6
  [231101]     chrX 155978992-155979325      - |    231101 ENST00000456370.6
  [231102]     chrX 155985370-155986249      - |    231102 ENST00000421233.6
  [231103]     chrX 156014623-156016837      - |    231103 ENST00000399966.9
  [231104]     chrX 156025664-156027877      - |    231104 ENST00000507418.6
  -------
  seqinfo: 23 sequences from hg38 genome
ADD REPLY
0
Entering edit mode

Hi James,

Thanks !! I will stick to use the existing functionality rather than introducing my own.

Just for my understanding, I tried using standardChromosomes() on my GRanges object but still it shows only chr1 - chr22 but no chrX

> standardChromosomes(x1)[1:23]

 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10" "chr11"
[12] "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22"
[23] "chrX" 

> x2 <- x1[x1@seqnames %in% standardChromosomes(x1)[1:23]]
> x2
GRanges object with 231104 ranges and 1 metadata column:
           seqnames            ranges strand |        gene
              <Rle>         <IRanges>  <Rle> | <character>
       [1]     chr1       11868-14409      + |     DDX11L1
       [2]     chr1       12009-13670      + |     DDX11L1
       [3]     chr1       14403-29570      - |      WASH7P
       [4]     chr1       17368-17436      - |   MIR6859-1
       [5]     chr1       29553-31097      + | MIR1302-2HG
       ...      ...               ...    ... .         ...
  [231100]    chr22 50783796-50789353      + |   RPL23AP82
  [231101]    chr22 50783811-50789172      + |   RPL23AP82
  [231102]    chr22 50783849-50799441      + |   RPL23AP82
  [231103]    chr22 50783861-50801309      + |   RPL23AP82
  [231104]    chr22 50798654-50799123      + |   RPL23AP82
  -------
  seqinfo: 47 sequences from an unspecified genome; no seqlengths

> seqnames(x2)
factor-Rle of length 231104 with 23 runs
  Lengths: 20376 16951 14438  9533 11020 10604 11016 ... 13708  4497 13790  5456  2983  4996
  Values : chr1  chr2  chr3  chr4  chr5  chr6  chr7  ... chr17 chr18 chr19 chr20 chr21 chr22
Levels(47): chr1 chr2 chr3 chr4 ... chrUn_KI270442v1 chrUn_KI270744v1 chrUn_KI270750v1

I am wondering why it didn't work even using the standard functionality.

Thanks !!

Rakesh

ADD REPLY
0
Entering edit mode

I would bet that the order of your GRanges object has chrX somewhere in the middle. For example, you have this:

> seqnames(x2)
factor-Rle of length 231104 with 23 runs
  Lengths: 20376 16951 14438  9533 11020 10604 11016 ... 13708  4497 13790  5456  2983  4996
  Values : chr1  chr2  chr3  chr4  chr5  chr6  chr7  ... chr17 chr18 chr19 chr20 chr21 chr22
Levels(47): chr1 chr2 chr3 chr4 ... chrUn_KI270442v1 chrUn_KI270744v1 chrUn_KI270750v1

Which indicates you have 23 chromosomes, some of which are not shown (they are in the ellipsis). And if you have chr1-22, but a total of 23, what's the 23rd one? What do you get from

levels(seqnames(x2))
ADD REPLY
0
Entering edit mode

23rd one is 'chrX'.

> levels(seqnames(x2))

 [1] "chr1"                    "chr2"                    "chr3"                   
 [4] "chr4"                    "chr5"                    "chr6"                   
 [7] "chr7"                    "chr8"                    "chr9"                   
[10] "chr10"                   "chr11"                   "chr12"                  
[13] "chr13"                   "chr14"                   "chr15"                  
[16] "chr16"                   "chr17"                   "chr18"                  
[19] "chr19"                   "chr20"                   "chr21"                  
[22] "chr22"                   "chrX"                    "chrY"                   
[25] "chrM"                    "chr1_KI270711v1_random"  "chr1_KI270713v1_random" 
[28] "chr11_KI270721v1_random" "chr14_GL000009v2_random" "chr14_GL000194v1_random"
[31] "chr14_GL000225v1_random" "chr14_KI270726v1_random" "chr15_KI270727v1_random"
[34] "chr16_KI270728v1_random" "chr17_GL000205v2_random" "chr22_KI270731v1_random"
[37] "chr22_KI270733v1_random" "chr22_KI270734v1_random" "chrUn_GL000195v1"       
[40] "chrUn_GL000213v1"        "chrUn_GL000216v2"        "chrUn_GL000218v1"       
[43] "chrUn_GL000219v1"        "chrUn_GL000220v1"        "chrUn_KI270442v1"       
[46] "chrUn_KI270744v1"        "chrUn_KI270750v1" 


## If subset only [23], then output have 'chrX' only

> x1[x1@seqnames %in% standardChromosomes(x1)[23]]
GRanges object with 7169 ranges and 1 metadata column:
         seqnames              ranges strand |        gene
            <Rle>           <IRanges>  <Rle> | <character>
     [1]     chrX       253742-255091      + |  AL954722.1
     [2]     chrX       276321-303353      + |      PLCXD1
     [3]     chrX       276323-291537      + |      PLCXD1
     [4]     chrX       276352-291629      + |      PLCXD1
     [5]     chrX       281054-288869      + |      PLCXD1
     ...      ...                 ...    ... .         ...
  [7165]     chrX 156022785-156023531      + |      WASH6P
  [7166]     chrX 156023366-156025666      + |      WASH6P
  [7167]     chrX 156023823-156025554      + |      WASH6P
  [7168]     chrX 156024070-156025554      + |      WASH6P
  [7169]     chrX 156025663-156027877      - |    DDX11L16
  -------
  seqinfo: 47 sequences from an unspecified genome; no seqlengths

# If x2 is subset for 'chrX, then also output is chrX 

> x2[x2@seqnames=='chrX']
GRanges object with 7169 ranges and 1 metadata column:
         seqnames              ranges strand |        gene
            <Rle>           <IRanges>  <Rle> | <character>
     [1]     chrX       253742-255091      + |  AL954722.1
     [2]     chrX       276321-303353      + |      PLCXD1
     [3]     chrX       276323-291537      + |      PLCXD1
     [4]     chrX       276352-291629      + |      PLCXD1
     [5]     chrX       281054-288869      + |      PLCXD1
     ...      ...                 ...    ... .         ...
  [7165]     chrX 156022785-156023531      + |      WASH6P
  [7166]     chrX 156023366-156025666      + |      WASH6P
  [7167]     chrX 156023823-156025554      + |      WASH6P
  [7168]     chrX 156024070-156025554      + |      WASH6P
  [7169]     chrX 156025663-156027877      - |    DDX11L16

"It's really a insightful discussion". Thanks James !!

Best Rakesh

ADD REPLY

Login before adding your answer.

Traffic: 223 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6