Search
Question: Genes found on particular chromosome using two methods in TxDb.Hsapiens.UCSC.hg19.knownGene were not identical
2
gravatar for li lilingdu
15 months ago by
li lilingdu450
li lilingdu450 wrote:

Hello, I want to find gene on chromosome 6 using TxDb.Hsapiens.UCSC.hg19.knownGene package, and I use two different methods, but the genes found were not identical.

Following was the code:

load the library

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene->txdb

and find gens at chromosome 6

transcriptsBy(txdb,by='gene')->ts
ts[seqnames(ts)=='chr6']->onchr6
onchr6[elementNROWS(onchr6)>0]->onchr6
length(onchr6)   ####1207 genes on chromosome 6
sum(duplicated(names(onchr6)))

And then I use seqlevels command to limit the txdb to the chromosome 6.

seqlevels(txdb)<-'chr6'
transcriptsBy(txdb,by='gene')->ts2
runValue(unlist(seqnames(ts2)))  ###on all chromosome 6

and I find genes found by the above two methods were not fully identical.

setdiff(names(onchr6),names(ts2))->missed
length(missed)  #####207 genes were missed

###for one example
onchr6[["81696"]]
ts2[["81696"]]

Do anyone point out any wrong or missing in my code? or this is a bug of the package?

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936  LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.24.5                  AnnotationDbi_1.34.4                    Biobase_2.32.0                         
[5] GenomicRanges_1.24.2                    GenomeInfoDb_1.8.3                      IRanges_2.6.1                           S4Vectors_0.10.2                       
[9] BiocGenerics_0.18.0                    

loaded via a namespace (and not attached):
 [1] XML_3.98-1.4               Rsamtools_1.24.0           Biostrings_2.40.2          GenomicAlignments_1.8.4    bitops_1.0-6               DBI_0.4-1                 
 [7] RSQLite_1.0.0              zlibbioc_1.18.0            XVector_0.12.0             BiocParallel_1.6.3         tools_3.3.1                biomaRt_2.28.0            
[13] RCurl_1.95-4.8             rtracklayer_1.32.2         SummarizedExperiment_1.2.3

ADD COMMENTlink modified 15 months ago by Martin Morgan ♦♦ 20k • written 15 months ago by li lilingdu450

And indeed I found the 1000 gene found by both methods were located at chromosome 6

while the 207 genes missed by the 2nd methods were located at 

chr6_apd_hap1  ,chr6_cox_hap2  ,chr6_dbb_hap3 ,

chr6_mann_hap4  ,chr6_mcf_hap5  ,chr6_qbl_hap6 ,chr6_ssto_hap7 

besides chr6 (all 207 genes on chrormosome 6 and subsets of the 207 genes located at the chr6_XX_hapX )

ADD REPLYlink written 15 months ago by li lilingdu450

And further problems might be occur:

we want to find some overlaps with particular genomic fragement

load the library

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene->txdb

the fragment

GRanges('chr6',IRanges(29624758,33160245),strand="+")->ir
genome(ir)<-'hg19'

and we want found genes overlapping with the above fragment:

transcriptsBy(txdb,by='gene')->ts
findOverlaps(ir,ts,ignore.strand=T)->map
ts[to(map)]->gene1
length(gene1)   ###164 genes were found

and using seqlevel to limit to chromsomome 6 and then find overlaps:

Although we know that genes on chromosome 6 filtered by this methods were less than the real number of genes (1000 vs 1207 genes pointed as above explanations ). We still use these 1000 genes to get the overlaps, and further mistakes occured!:

seqlevels(txdb)<-'chr6'
transcriptsBy(txdb,by='gene')->ts2  #1000 genes on chr6

findOverlaps(ir,ts2,ignore.strand=T)->map
ts2[to(map)]->gene2
length(gene2)   ### only 2 genes were found!!!

only 2 genes (compared with the 164 genes detected above) were found to overlap the fragment. And this might be an obvious impossibility.

Any suggestions?

ADD REPLYlink modified 15 months ago by Martin Morgan ♦♦ 20k • written 15 months ago by li lilingdu450
1
gravatar for Martin Morgan
15 months ago by
Martin Morgan ♦♦ 20k
United States
Martin Morgan ♦♦ 20k wrote:

Your first method finds the 1207 genes that have any transcripts on chr6

> table(sum(seqnames(ts) == "chr6") != 0)

FALSE  TRUE 
22252  1207 

while the second finds the 1000 genes have all transcripts on chr6.

> table(sum(seqnames(ts) == "chr6") == lengths(ts))

FALSE  TRUE 
22459  1000 

There are 207 genes with some transcripts on chr6, and some transcripts on another 'chromosome' (typically the partial assemblies chr6_* you note).

Likewise, you find overlaps of genes with any or all transcripts overlapping  the region of interest. Here are the overlaps of individual transcripts, regrouped by gene

> strand(ir) = "*"
> hits = relist(unlist(ts) %over% ir, ts)

There are 164 genes that have any transcripts that overlap the region of interest

> table(sum(hits) != 0)

FALSE  TRUE 
23295   164 

but only two that have all transcripts overlapping the region of interest

> table(sum(hits) == lengths(ts))

FALSE  TRUE 
23457     2 

Here is one of the 162 genes excluded by the second approach

> head(ts[sum(hits) & (sum(hits) != lengths(ts))], 1)
GRangesList object of length 1:
$100126314 
GRanges object with 7 ranges and 2 metadata columns:
            seqnames               ranges strand |     tx_id     tx_name
               <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]           chr6 [30552109, 30552194]      + |     24173  uc021yud.1
  [2]  chr6_cox_hap2 [ 2064162,  2064247]      + |     79273  uc021zkn.1
  [3]  chr6_dbb_hap3 [ 1845750,  1845835]      + |     79923  uc021zna.1
  [4] chr6_mann_hap4 [ 1900181,  1900266]      + |     80502  uc021zou.1
  [5]  chr6_mcf_hap5 [ 1933963,  1934048]      + |     81022  uc021zre.1
  [6]  chr6_qbl_hap6 [ 1845017,  1845102]      + |     81628  uc021zto.1
  [7] chr6_ssto_hap7 [ 1884391,  1884476]      + |     82258  uc021zvz.1

-------
seqinfo: 93 sequences (1 circular) from hg19 genome

 

ADD COMMENTlink modified 15 months ago • written 15 months ago by Martin Morgan ♦♦ 20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 338 users visited in the last hour