Question: Genes found on particular chromosome using two methods in TxDb.Hsapiens.UCSC.hg19.knownGene were not identical
gravatar for li lilingdu
22 months ago by
li lilingdu450
li lilingdu450 wrote:

Hello, I want to find gene on chromosome 6 using TxDb.Hsapiens.UCSC.hg19.knownGene package, and I use two different methods, but the genes found were not identical.

Following was the code:

load the library


and find gens at chromosome 6

length(onchr6)   ####1207 genes on chromosome 6

And then I use seqlevels command to limit the txdb to the chromosome 6.

runValue(unlist(seqnames(ts2)))  ###on all chromosome 6

and I find genes found by the above two methods were not fully identical.

length(missed)  #####207 genes were missed

###for one example

Do anyone point out any wrong or missing in my code? or this is a bug of the package?

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936  LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.24.5                  AnnotationDbi_1.34.4                    Biobase_2.32.0                         
[5] GenomicRanges_1.24.2                    GenomeInfoDb_1.8.3                      IRanges_2.6.1                           S4Vectors_0.10.2                       
[9] BiocGenerics_0.18.0                    

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2          GenomicAlignments_1.8.4    bitops_1.0-6               DBI_0.4-1                 
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0             BiocParallel_1.6.3         tools_3.3.1                biomaRt_2.28.0            
[13] RCurl_1.95-4.8             rtracklayer_1.32.2         SummarizedExperiment_1.2.3

ADD COMMENTlink modified 22 months ago by Martin Morgan ♦♦ 21k • written 22 months ago by li lilingdu450

And indeed I found the 1000 gene found by both methods were located at chromosome 6

while the 207 genes missed by the 2nd methods were located at 

chr6_apd_hap1  ,chr6_cox_hap2  ,chr6_dbb_hap3 ,

chr6_mann_hap4  ,chr6_mcf_hap5  ,chr6_qbl_hap6 ,chr6_ssto_hap7 

besides chr6 (all 207 genes on chrormosome 6 and subsets of the 207 genes located at the chr6_XX_hapX )

ADD REPLYlink written 22 months ago by li lilingdu450

And further problems might be occur:

we want to find some overlaps with particular genomic fragement

load the library


the fragment


and we want found genes overlapping with the above fragment:

length(gene1)   ###164 genes were found

and using seqlevel to limit to chromsomome 6 and then find overlaps:

Although we know that genes on chromosome 6 filtered by this methods were less than the real number of genes (1000 vs 1207 genes pointed as above explanations ). We still use these 1000 genes to get the overlaps, and further mistakes occured!:

transcriptsBy(txdb,by='gene')->ts2  #1000 genes on chr6

length(gene2)   ### only 2 genes were found!!!

only 2 genes (compared with the 164 genes detected above) were found to overlap the fragment. And this might be an obvious impossibility.

Any suggestions?

ADD REPLYlink modified 22 months ago by Martin Morgan ♦♦ 21k • written 22 months ago by li lilingdu450
gravatar for Martin Morgan
22 months ago by
Martin Morgan ♦♦ 21k
United States
Martin Morgan ♦♦ 21k wrote:

Your first method finds the 1207 genes that have any transcripts on chr6

> table(sum(seqnames(ts) == "chr6") != 0)

22252  1207 

while the second finds the 1000 genes have all transcripts on chr6.

> table(sum(seqnames(ts) == "chr6") == lengths(ts))

22459  1000 

There are 207 genes with some transcripts on chr6, and some transcripts on another 'chromosome' (typically the partial assemblies chr6_* you note).

Likewise, you find overlaps of genes with any or all transcripts overlapping  the region of interest. Here are the overlaps of individual transcripts, regrouped by gene

> strand(ir) = "*"
> hits = relist(unlist(ts) %over% ir, ts)

There are 164 genes that have any transcripts that overlap the region of interest

> table(sum(hits) != 0)

23295   164 

but only two that have all transcripts overlapping the region of interest

> table(sum(hits) == lengths(ts))

23457     2 

Here is one of the 162 genes excluded by the second approach

> head(ts[sum(hits) & (sum(hits) != lengths(ts))], 1)
GRangesList object of length 1:
GRanges object with 7 ranges and 2 metadata columns:
            seqnames               ranges strand |     tx_id     tx_name
               <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]           chr6 [30552109, 30552194]      + |     24173  uc021yud.1
  [2]  chr6_cox_hap2 [ 2064162,  2064247]      + |     79273  uc021zkn.1
  [3]  chr6_dbb_hap3 [ 1845750,  1845835]      + |     79923  uc021zna.1
  [4] chr6_mann_hap4 [ 1900181,  1900266]      + |     80502  uc021zou.1
  [5]  chr6_mcf_hap5 [ 1933963,  1934048]      + |     81022  uc021zre.1
  [6]  chr6_qbl_hap6 [ 1845017,  1845102]      + |     81628  uc021zto.1
  [7] chr6_ssto_hap7 [ 1884391,  1884476]      + |     82258  uc021zvz.1

seqinfo: 93 sequences (1 circular) from hg19 genome


ADD COMMENTlink modified 22 months ago • written 22 months ago by Martin Morgan ♦♦ 21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 112 users visited in the last hour