Question

Genes found on particular chromosome using two methods in TxDb.Hsapiens.UCSC.hg19.knownGene were not identical

2

Entering edit mode

li lilingdu ▴ 450

@li-lilingdu-1884

Last seen 6.8 years ago

Hello, I want to find gene on chromosome 6 using TxDb.Hsapiens.UCSC.hg19.knownGene package, and I use two different methods, but the genes found were not identical.

Following was the code:

load the library

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene->txdb

and find gens at chromosome 6

transcriptsBy(txdb,by='gene')->ts
ts[seqnames(ts)=='chr6']->onchr6
onchr6[elementNROWS(onchr6)>0]->onchr6
length(onchr6)   ####1207 genes on chromosome 6
sum(duplicated(names(onchr6)))

And then I use seqlevels command to limit the txdb to the chromosome 6.

seqlevels(txdb)<-'chr6'
transcriptsBy(txdb,by='gene')->ts2
runValue(unlist(seqnames(ts2)))  ###on all chromosome 6

and I find genes found by the above two methods were not fully identical.

setdiff(names(onchr6),names(ts2))->missed
length(missed)  #####207 genes were missed

###for one example
onchr6[["81696"]]
ts2[["81696"]]

Do anyone point out any wrong or missing in my code? or this is a bug of the package?

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936  LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.24.5                  AnnotationDbi_1.34.4                    Biobase_2.32.0                         
[5] GenomicRanges_1.24.2                    GenomeInfoDb_1.8.3                      IRanges_2.6.1                           S4Vectors_0.10.2                       
[9] BiocGenerics_0.18.0                    

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2          GenomicAlignments_1.8.4    bitops_1.0-6               DBI_0.4-1                 
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0             BiocParallel_1.6.3         tools_3.3.1                biomaRt_2.28.0            
[13] RCurl_1.95-4.8             rtracklayer_1.32.2         SummarizedExperiment_1.2.3

txdb.hsapiens.ucsc.hg19.knowngene txdb • 1.5k views

ADD COMMENT • link updated 8.6 years ago by Martin Morgan 25k • written 8.6 years ago by li lilingdu ▴ 450

0

Entering edit mode

And indeed I found the 1000 gene found by both methods were located at chromosome 6

while the 207 genes missed by the 2nd methods were located at

chr6_apd_hap1 ,chr6_cox_hap2 ,chr6_dbb_hap3 ,

chr6_mann_hap4 ,chr6_mcf_hap5 ,chr6_qbl_hap6 ,chr6_ssto_hap7

besides chr6 (all 207 genes on chrormosome 6 and subsets of the 207 genes located at the chr6_XX_hapX )

ADD REPLY • link 8.6 years ago li lilingdu ▴ 450

0

Entering edit mode

And further problems might be occur:

we want to find some overlaps with particular genomic fragement

load the library

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene->txdb

the fragment

GRanges('chr6',IRanges(29624758,33160245),strand="+")->ir
genome(ir)<-'hg19'

and we want found genes overlapping with the above fragment:

transcriptsBy(txdb,by='gene')->ts
findOverlaps(ir,ts,ignore.strand=T)->map
ts[to(map)]->gene1
length(gene1)   ###164 genes were found

and using seqlevel to limit to chromsomome 6 and then find overlaps:

Although we know that genes on chromosome 6 filtered by this methods were less than the real number of genes (1000 vs 1207 genes pointed as above explanations ). We still use these 1000 genes to get the overlaps, and further mistakes occured!:

seqlevels(txdb)<-'chr6'
transcriptsBy(txdb,by='gene')->ts2  #1000 genes on chr6

findOverlaps(ir,ts2,ignore.strand=T)->map
ts2[to(map)]->gene2
length(gene2)   ### only 2 genes were found!!!

only 2 genes (compared with the 164 genes detected above) were found to overlap the fragment. And this might be an obvious impossibility.

Any suggestions?

ADD REPLY • link updated 8.6 years ago by Martin Morgan 25k • written 8.6 years ago by li lilingdu ▴ 450

score 1 · Answer 1 · 2016-08-10

Your first method finds the 1207 genes that have any transcripts on chr6

> table(sum(seqnames(ts) == "chr6") != 0)

FALSE  TRUE 
22252  1207

while the second finds the 1000 genes have all transcripts on chr6.

> table(sum(seqnames(ts) == "chr6") == lengths(ts))

FALSE  TRUE 
22459  1000

There are 207 genes with some transcripts on chr6, and some transcripts on another 'chromosome' (typically the partial assemblies chr6_* you note).

Likewise, you find overlaps of genes with any or all transcripts overlapping the region of interest. Here are the overlaps of individual transcripts, regrouped by gene

> strand(ir) = "*"
> hits = relist(unlist(ts) %over% ir, ts)

There are 164 genes that have any transcripts that overlap the region of interest

> table(sum(hits) != 0)

FALSE  TRUE 
23295   164

but only two that have all transcripts overlapping the region of interest

> table(sum(hits) == lengths(ts))

FALSE  TRUE 
23457     2

Here is one of the 162 genes excluded by the second approach

> head(ts[sum(hits) & (sum(hits) != lengths(ts))], 1)
GRangesList object of length 1:
$100126314 
GRanges object with 7 ranges and 2 metadata columns:
            seqnames               ranges strand |     tx_id     tx_name
               <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]           chr6 [30552109, 30552194]      + |     24173  uc021yud.1
  [2]  chr6_cox_hap2 [ 2064162,  2064247]      + |     79273  uc021zkn.1
  [3]  chr6_dbb_hap3 [ 1845750,  1845835]      + |     79923  uc021zna.1
  [4] chr6_mann_hap4 [ 1900181,  1900266]      + |     80502  uc021zou.1
  [5]  chr6_mcf_hap5 [ 1933963,  1934048]      + |     81022  uc021zre.1
  [6]  chr6_qbl_hap6 [ 1845017,  1845102]      + |     81628  uc021zto.1
  [7] chr6_ssto_hap7 [ 1884391,  1884476]      + |     82258  uc021zvz.1

-------
seqinfo: 93 sequences (1 circular) from hg19 genome